AWS Trainium2/3 MoE Kernel Challenge

Write custom kernels using the Neuron Kernel Interface (NKI) to maximize inference performance for the Qwen3-30B-A3B Mixture of Experts model on AWS Trainium2/3 hardware.

The Challenge

Optimize MoE model inference by implementing high-performance custom kernels targeting routing logic, expert computation, and attention mechanisms on dedicated AWS Trainium silicon.

The Goal

Maximize a combined score based on latency reduction, throughput improvement, and NKI implementation coverage — pushing the limits of what's possible on Trainium hardware.

The Hardware

Compete on AWS Trainium2 in Round One, then top teams gain access to dedicated Trainium3 instances for Round Two — all running AWS Neuron SDK 2.28.

Competition Timeline

Round 1

Trainium2 — March 15–25

Submit optimized NKI kernels evaluated on a single Trn2 chip with Neuron SDK 2.28. The top 15 teams advance to Round Two.

Round 2

Trainium3 — April 14–24

Qualified teams receive dedicated single-chip Trn3 instances for further optimization. Final rankings determine competition winners.

Key Optimization Areas

  • Routing: MoE routing and expert selection logic
  • Expert Layers: gate, up, and down projection computations
  • Attention: attention mechanisms adapted for MoE architectures
  • Memory: memory-efficient sparse operations
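As a CPU-side reference for the routing logic listed above, here is a minimal top-k expert-selection sketch in NumPy. The expert counts (128 experts, 8 active per token) follow Qwen3-30B-A3B's published configuration, but treat them, and the renormalized-softmax choice, as assumptions to verify against the competition's reference implementation:

```python
import numpy as np

def topk_routing(router_logits, k=8):
    """Reference top-k MoE routing: select k experts per token and
    softmax-normalize their weights. router_logits: (tokens, experts)."""
    # Indices of the k largest logits per token (order within the k is arbitrary).
    topk_idx = np.argpartition(router_logits, -k, axis=-1)[:, -k:]
    topk_logits = np.take_along_axis(router_logits, topk_idx, axis=-1)
    # Softmax over only the selected experts (renormalized routing weights).
    z = topk_logits - topk_logits.max(axis=-1, keepdims=True)
    w = np.exp(z)
    weights = w / w.sum(axis=-1, keepdims=True)
    return topk_idx, weights

# 4 tokens routed over 128 experts, 8 active per token.
logits = np.random.default_rng(0).normal(size=(4, 128))
idx, w = topk_routing(logits, k=8)
print(idx.shape, w.shape)  # (4, 8) (4, 8); weights sum to 1 per token
```

A kernel implementing this on Trainium must produce the same expert indices and weights; the NumPy version doubles as an accuracy oracle for small test inputs.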

How to Participate

Get set up and start optimizing. Follow the steps below to prepare your environment and submit your kernels.

1

Provision a Trainium2 Instance

Launch an AWS Trainium2 instance with AWS Neuron SDK v2.27 or later (submissions are evaluated on SDK 2.28) and activate the PyTorch 2.9 environment.

2

Download the Model

Pull the Qwen3-30B-A3B model weights from Hugging Face using the Hugging Face CLI.

huggingface-cli download Qwen/Qwen3-30B-A3B
3

Clone the Repository

Fork or clone the competition repo to access sample kernels, implementation templates, and the benchmarking API.

git clone https://github.com/aws-neuron/nki-moe
4

Write Your NKI Kernels

Modify the provided model files with your custom NKI kernels. Sample implementations for tensor addition and RMSNorm are provided as starting points.
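RMSNorm is one of the provided samples; as a CPU reference for what such a kernel must compute, here is a NumPy sketch (the epsilon value is a common default, not taken from the competition repo):

```python
import numpy as np

def rmsnorm_reference(x, weight, eps=1e-6):
    """RMSNorm: scale each row by the reciprocal of its root-mean-square,
    then apply a learned per-channel gain.
    x: (tokens, hidden), weight: (hidden,)."""
    # Accumulate the mean of squares in float64 for a stable reference.
    rms = np.sqrt(np.mean(x.astype(np.float64) ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms * weight).astype(x.dtype)

x = np.random.default_rng(1).normal(size=(2, 8)).astype(np.float32)
out = rmsnorm_reference(x, np.ones(8, dtype=np.float32))
print(out.shape)  # (2, 8); each row now has unit root-mean-square
```

Checking your NKI kernel's output against a high-precision reference like this, within the competition's tolerance, is the cheapest way to catch numerics bugs before a full benchmark run.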

5

Benchmark & Submit

Use the included benchmarking API to measure your TTFT and throughput at batch size 1. Verify accuracy against the reference implementation, then submit.
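The benchmarking API's exact interface lives in the repo; as an illustration of the two metrics it reports, here is a sketch that derives TTFT and decode throughput from per-token timestamps (the function name and timestamp layout are hypothetical, chosen only to show the arithmetic):

```python
def summarize_run(request_start, token_times):
    """Derive the two scored metrics from one batch-size-1 generation.
    TTFT = time of first token minus request start.
    Throughput = decode tokens per second over the remaining tokens."""
    ttft = token_times[0] - request_start
    decode_tokens = len(token_times) - 1
    decode_span = token_times[-1] - token_times[0]
    throughput = decode_tokens / decode_span if decode_span > 0 else 0.0
    return ttft, throughput

# Synthetic timestamps: first token at 0.5 s, then one token every 20 ms.
times = [0.5 + 0.02 * i for i in range(101)]
ttft, tput = summarize_run(0.0, times)
print(round(ttft, 3), round(tput, 1))  # 0.5 50.0
```

Separating prefill (TTFT) from decode (tokens/sec) matters here because the scoring formula multiplies the two improvements independently.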

Scoring

Submissions are ranked by a combined score that rewards accuracy, speed, and depth of NKI implementation.

Scoring Formula

Score = Accuracy × Reduced Latency × Increased Throughput × (1 + Normalized NKI FLOPS)

Accuracy

Pass / Fail

Output must match the reference implementation within the defined tolerance threshold. Failing accuracy results in a score of 0.

Reduced Latency

Multiplier

Measured as Reference TTFT ÷ Submission TTFT. A 10× reduction in Time to First Token yields a 10× multiplier.

Increased Throughput

Multiplier

Measured as Submission tokens/sec ÷ Reference tokens/sec. A 2× throughput improvement yields a 2× multiplier.

NKI FLOPS Coverage

Bonus

Measured as NKI FLOPS ÷ Total Model FLOPS. Full NKI coverage (1.0) doubles the bonus factor to 2.0.

Scoring Example

A submission achieving 10× latency reduction, 2× throughput improvement, and 85% NKI coverage:

1 (pass) × 10 (latency) × 2 (throughput) × (1 + 0.85) = 37 points
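The worked example above, expressed directly as the scoring formula in Python (the input measurements are the ones stated in the example, not real benchmark numbers):

```python
def score(accuracy_pass, ref_ttft, sub_ttft, sub_tps, ref_tps, nki_coverage):
    """Combined score: Accuracy x (Reference TTFT / Submission TTFT)
    x (Submission tokens/sec / Reference tokens/sec) x (1 + NKI coverage)."""
    if not accuracy_pass:
        return 0.0  # Failing accuracy zeroes the score outright.
    latency_mult = ref_ttft / sub_ttft
    throughput_mult = sub_tps / ref_tps
    return latency_mult * throughput_mult * (1.0 + nki_coverage)

# 10x latency reduction, 2x throughput, 85% NKI coverage -> 37 points.
print(score(True, ref_ttft=10.0, sub_ttft=1.0,
            sub_tps=40.0, ref_tps=20.0, nki_coverage=0.85))
```

Because the terms multiply, a balanced submission beats one that maximizes a single factor: doubling NKI coverage can at most double the score, while latency and throughput multipliers are unbounded.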

Evaluation Details

  • All submissions are evaluated at batch size 1
  • Round One hardware: Single Trn2 chip, Neuron SDK 2.28
  • Round Two hardware: Single Trn3 chip, Neuron SDK 2.28
  • Top 15 teams from Round One advance to Round Two
  • Team registration is currently closed

Leaderboard

Live rankings updated as submissions are evaluated. Scores reflect the best submission per team.