Allocated Fused Linear

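This file hosts a high-performance kernel that computes RMSNorm(hidden) @ wQKV, that is, RMS normalization of the hidden states followed by a matrix multiplication with the QKV projection weights. The implementation uses the direct allocation API to achieve better performance.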
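
For orientation, the math behind RMSNorm(hidden) @ wQKV can be written as an unfused pure-Python reference. This is only a sketch of the computation, not the kernel itself, and it does not use the direct allocation API. It assumes RMSNorm without a learned gain, a hypothetical `eps` stabilizer, and shapes `[tokens, d_model]` for `hidden` and `[d_model, d_qkv]` for `w_qkv`. The names `rms_norm` and `rmsnorm_qkv_reference` are illustrative and do not come from the source.

```python
import math


def rms_norm(row, eps=1e-6):
    """Scale one hidden-state row by the reciprocal of its root mean square.

    Assumption: no learned gain vector is applied; eps is a hypothetical
    stabilizer added under the square root.
    """
    mean_sq = sum(x * x for x in row) / len(row)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [x * inv_rms for x in row]


def rmsnorm_qkv_reference(hidden, w_qkv, eps=1e-6):
    """Unfused reference for RMSNorm(hidden) @ wQKV.

    hidden: list of rows, shape [tokens, d_model]
    w_qkv:  list of rows, shape [d_model, d_qkv]
    Returns a list of rows, shape [tokens, d_qkv].
    """
    d_model = len(w_qkv)
    d_qkv = len(w_qkv[0])
    out = []
    for row in hidden:
        if len(row) != d_model:
            raise ValueError("hidden width must match the rows of w_qkv")
        normed = rms_norm(row, eps)
        # Plain matrix-vector product of the normalized row with w_qkv.
        out.append(
            [sum(normed[k] * w_qkv[k][j] for k in range(d_model))
             for j in range(d_qkv)]
        )
    return out


# Tiny example: two tokens, d_model = 2, d_qkv = 3.
hidden = [[3.0, 4.0], [1.0, 1.0]]
w_qkv = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
result = rmsnorm_qkv_reference(hidden, w_qkv, eps=0.0)
```

A reference like this is mainly useful for checking a fused kernel's output numerically within a tolerance. A fused kernel would typically avoid writing the normalized activations back to memory between the two steps.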