AWS Trainium2/3 MoE Kernel Challenge
Write custom kernels using the Neuron Kernel Interface (NKI) to maximize inference performance for the Qwen3-30B-A3B Mixture of Experts model on AWS Trainium2/3 hardware.
The Challenge
Optimize MoE model inference by implementing high-performance custom kernels targeting routing logic, expert computation, and attention mechanisms on dedicated AWS Trainium silicon.
The Goal
Maximize a combined score based on latency reduction, throughput improvement, and NKI implementation coverage — pushing the limits of what's possible on Trainium hardware.
The Hardware
Compete on AWS Trainium2 in Round One, then top teams gain access to dedicated Trainium3 instances for Round Two — all running AWS Neuron SDK 2.28.
Competition Timeline
Trainium2 — March 15–25
Submit optimized NKI kernels evaluated on a single Trn2 chip with Neuron SDK 2.28. The top 15 teams advance to Round Two.
Trainium3 — April 14–24
Qualified teams receive dedicated single-chip Trn3 instances for further optimization. Final rankings determine competition winners.
Key Optimization Areas
- Routing: MoE routing and expert selection logic
- Expert layers: gate, up, and down projection computations
- Attention: attention mechanisms adapted for MoE architectures
- Memory: memory-efficient sparse operations
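To make the routing area concrete, top-k expert selection can be sketched as a plain-NumPy reference (not NKI code; the toy shapes and the `k` value below are illustrative choices, not the model's actual configuration):

```python
import numpy as np

def topk_route(logits: np.ndarray, k: int = 8):
    """Select the top-k experts per token and renormalize their gate weights.

    logits: [tokens, num_experts] router outputs.
    Returns (indices [tokens, k], weights [tokens, k]).
    """
    # Indices of the k largest logits per token (order within the k is unspecified).
    idx = np.argpartition(logits, -k, axis=-1)[:, -k:]
    top = np.take_along_axis(logits, idx, axis=-1)
    # Softmax over only the selected logits; equivalent to a full softmax
    # followed by renormalizing the top-k probabilities.
    top = top - top.max(axis=-1, keepdims=True)
    w = np.exp(top)
    w /= w.sum(axis=-1, keepdims=True)
    return idx, w

# Toy routing over 4 tokens and 16 experts.
rng = np.random.default_rng(0)
idx, w = topk_route(rng.standard_normal((4, 16)), k=2)
```

A reference like this is handy for checking that a fused NKI routing kernel selects the same experts and produces the same gate weights within tolerance.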
How to Participate
Get set up and start optimizing. Follow the steps below to prepare your environment and submit your kernels.
Provision a Trainium2 Instance
Launch an AWS Trainium2 instance with AWS Neuron SDK v2.27 or later (submissions are evaluated on 2.28) and activate the PyTorch 2.9 environment.
Download the Model
Pull the Qwen3-30B-A3B model weights from Hugging Face using the Hugging Face CLI.
huggingface-cli download Qwen/Qwen3-30B-A3B
Clone the Repository
Fork or clone the competition repo to access sample kernels, implementation templates, and the benchmarking API.
git clone https://github.com/aws-neuron/nki-moe
Write Your NKI Kernels
Modify the provided model files with your custom NKI kernels. Sample implementations for tensor addition and RMSNorm are provided as starting points.
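Before porting a layer to NKI, it helps to have a host-side reference to validate against. A minimal NumPy reference for RMSNorm, one of the provided samples, might look like this (a sketch; the `eps` default is a common choice, not taken from the competition code):

```python
import numpy as np

def rmsnorm_ref(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Reference RMSNorm: scale each row by the inverse of its root-mean-square.

    x: [tokens, hidden], weight: [hidden]. eps guards against divide-by-zero.
    """
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

# Compare your NKI kernel's output against this with np.allclose at your tolerance.
x = np.random.default_rng(1).standard_normal((2, 8)).astype(np.float32)
out = rmsnorm_ref(x, np.ones(8, dtype=np.float32))
```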
Benchmark & Submit
Use the included benchmarking API to measure your TTFT and throughput at batch size 1. Verify accuracy against the reference implementation, then submit.
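The two quantities being measured reduce to simple timestamp arithmetic. A sketch of how TTFT and tokens/sec fall out of a streaming generation loop (`generate_stream` here is a hypothetical per-token generator standing in for the competition's benchmarking API, which you should use for actual submissions):

```python
import time

def measure(generate_stream, prompt):
    """Return (ttft_seconds, tokens_per_second) for one streamed generation."""
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _token in generate_stream(prompt):
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # time to first token
        n_tokens += 1
    elapsed = time.perf_counter() - start
    return ttft, n_tokens / elapsed

# Toy stand-in generator so the sketch runs end to end.
def fake_stream(prompt):
    for t in ["a", "b", "c"]:
        yield t

ttft, tps = measure(fake_stream, "hello")
```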
Scoring
Submissions are ranked by a combined score that rewards speed and depth of NKI implementation, gated by an accuracy check.
Scoring Formula
Accuracy
Output must match the reference implementation within the defined tolerance threshold. Failing accuracy results in a score of 0.
Reduced Latency
Measured as Reference TTFT ÷ Submission TTFT. A 10× reduction in Time to First Token yields a 10× multiplier.
Increased Throughput
Measured as Submission tokens/sec ÷ Reference tokens/sec. A 2× throughput improvement yields a 2× multiplier.
NKI FLOPS Coverage
Measured as NKI FLOPS ÷ Total Model FLOPS. Full NKI coverage (1.0) doubles the bonus factor to 2.0.
Scoring Example
A submission achieving 10× latency reduction, 2× throughput improvement, and 85% NKI coverage:
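Assuming the combined score multiplies the latency and throughput ratios by a coverage bonus of 1 + coverage (consistent with full coverage doubling the bonus factor to 2.0; the exact formula is an inference, not stated explicitly above), the example works out as follows:

```python
def score(latency_ratio, throughput_ratio, nki_coverage, accurate=True):
    """Hypothetical combined score: gated by accuracy, scaled by a coverage bonus."""
    if not accurate:
        return 0.0  # failing the accuracy check zeroes the score
    return latency_ratio * throughput_ratio * (1.0 + nki_coverage)

# 10x latency reduction, 2x throughput improvement, 85% NKI coverage:
print(score(10, 2, 0.85))  # -> 37.0
```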
Evaluation Details
- All submissions are evaluated at batch size 1
- Round One hardware: Single Trn2 chip, Neuron SDK 2.28
- Round Two hardware: Single Trn3 chip, Neuron SDK 2.28
- Top 15 teams from Round One advance to Round Two
- Team registration is currently closed
Leaderboard
Live rankings updated as submissions are evaluated. Scores reflect the best submission per team.