AWS Trainium2/3 MoE Kernel Challenge

Write custom kernels using the Neuron Kernel Interface (NKI) to maximize inference performance for the Qwen3-30B-A3B Mixture of Experts model on AWS Trainium2/3 hardware.

The Challenge

Optimize MoE model inference by implementing high-performance custom kernels targeting routing logic, expert computation, and attention mechanisms on dedicated AWS Trainium silicon.

The Goal

Maximize a combined score based on latency reduction, throughput improvement, and NKI implementation coverage — pushing the limits of what's possible on Trainium hardware.

The Hardware

Compete on AWS Trainium2 in Round One, then top teams gain access to dedicated Trainium3 instances for Round Two — all running AWS Neuron SDK 2.28.

Competition Timeline

Round 1

Trainium2 — March 15–25

Submit optimized NKI kernels evaluated on a single Trn2 chip with Neuron SDK 2.28. The top 15 teams advance to Round Two.

Round 2

Trainium3 — April 14–24

Qualified teams receive dedicated single-chip Trn3 instances for further optimization. Final rankings determine competition winners.

Key Optimization Areas

  • Routing: MoE routing and expert selection logic
  • Expert Layers: gate, up, and down projection computations
  • Attention: attention mechanisms adapted for MoE architectures
  • Memory: memory-efficient sparse operations
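As a CPU-side reference for the routing logic listed above, here is a minimal top-k expert-selection sketch in NumPy. The expert counts (128 experts, 8 active per token) follow Qwen3-30B-A3B's published configuration, but treat them, and the renormalized-softmax choice, as assumptions to verify against the competition's reference implementation:

```python
import numpy as np

def topk_routing(router_logits, k=8):
    """Reference top-k MoE routing: select k experts per token and
    softmax-normalize their weights. router_logits: (tokens, experts)."""
    # Indices of the k largest logits per token (order within the k is arbitrary).
    topk_idx = np.argpartition(router_logits, -k, axis=-1)[:, -k:]
    topk_logits = np.take_along_axis(router_logits, topk_idx, axis=-1)
    # Softmax over only the selected experts (renormalized routing weights).
    z = topk_logits - topk_logits.max(axis=-1, keepdims=True)
    w = np.exp(z)
    weights = w / w.sum(axis=-1, keepdims=True)
    return topk_idx, weights

# 4 tokens routed over 128 experts, 8 active per token.
logits = np.random.default_rng(0).normal(size=(4, 128))
idx, w = topk_routing(logits, k=8)
print(idx.shape, w.shape)  # (4, 8) (4, 8); weights sum to 1 per token
```

A kernel implementing this on Trainium must produce the same expert indices and weights; the NumPy version doubles as an accuracy oracle for small test inputs.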

How to Participate

Get set up and start optimizing. Follow the steps below to prepare your environment and submit your kernels.

1

Provision a Trainium2 Instance

Launch an AWS Trainium2 instance with AWS Neuron SDK v2.27 or later (submissions are evaluated on SDK 2.28) and activate the PyTorch 2.9 environment.

2

Download the Model

Pull the Qwen3-30B-A3B model weights from Hugging Face using the Hugging Face CLI.

huggingface-cli download Qwen/Qwen3-30B-A3B
3

Clone the Repository

Fork or clone the competition repo to access sample kernels, implementation templates, and the benchmarking API.

git clone https://github.com/aws-neuron/nki-moe
4

Write Your NKI Kernels

Modify the provided model files with your custom NKI kernels. Sample implementations for tensor addition and RMSNorm are provided as starting points.
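RMSNorm is one of the provided samples; as a CPU reference for what such a kernel must compute, here is a NumPy sketch (the epsilon value is a common default, not taken from the competition repo):

```python
import numpy as np

def rmsnorm_reference(x, weight, eps=1e-6):
    """RMSNorm: scale each row by the reciprocal of its root-mean-square,
    then apply a learned per-channel gain.
    x: (tokens, hidden), weight: (hidden,)."""
    # Accumulate the mean of squares in float64 for a stable reference.
    rms = np.sqrt(np.mean(x.astype(np.float64) ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms * weight).astype(x.dtype)

x = np.random.default_rng(1).normal(size=(2, 8)).astype(np.float32)
out = rmsnorm_reference(x, np.ones(8, dtype=np.float32))
print(out.shape)  # (2, 8); each row now has unit root-mean-square
```

Checking your NKI kernel's output against a high-precision reference like this, within the competition's tolerance, is the cheapest way to catch numerics bugs before a full benchmark run.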

5

Benchmark & Submit

Use the included benchmarking API to measure your TTFT and throughput at batch size 1. Verify accuracy against the reference implementation, then submit.
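The benchmarking API's exact interface lives in the repo; as an illustration of the two metrics it reports, here is a sketch that derives TTFT and decode throughput from per-token timestamps (the function name and timestamp layout are hypothetical, chosen only to show the arithmetic):

```python
def summarize_run(request_start, token_times):
    """Derive the two scored metrics from one batch-size-1 generation.
    TTFT = time of first token minus request start.
    Throughput = decode tokens per second over the remaining tokens."""
    ttft = token_times[0] - request_start
    decode_tokens = len(token_times) - 1
    decode_span = token_times[-1] - token_times[0]
    throughput = decode_tokens / decode_span if decode_span > 0 else 0.0
    return ttft, throughput

# Synthetic timestamps: first token at 0.5 s, then one token every 20 ms.
times = [0.5 + 0.02 * i for i in range(101)]
ttft, tput = summarize_run(0.0, times)
print(round(ttft, 3), round(tput, 1))  # 0.5 50.0
```

Separating prefill (TTFT) from decode (tokens/sec) matters here because the scoring formula multiplies the two improvements independently.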

Scoring

Submissions are ranked by a combined score that rewards accuracy, speed, and depth of NKI implementation.

Scoring Formula

Score = Accuracy × Reduced Latency × Increased Throughput × (1 + Normalized NKI FLOPS)

Accuracy

Pass / Fail

Output must match the reference implementation within the defined tolerance threshold. Failing accuracy results in a score of 0.

Reduced Latency

Multiplier

Measured as Reference TTFT ÷ Submission TTFT. A 10× reduction in Time to First Token yields a 10× multiplier.

Increased Throughput

Multiplier

Measured as Submission tokens/sec ÷ Reference tokens/sec. A 2× throughput improvement yields a 2× multiplier.

NKI FLOPS Coverage

Bonus

Measured as NKI FLOPS ÷ Total Model FLOPS. Full NKI coverage (1.0) doubles the bonus factor to 2.0.

Scoring Example

A submission achieving 10× latency reduction, 2× throughput improvement, and 85% NKI coverage:

1 (pass) × 10 (latency) × 2 (throughput) × (1 + 0.85) = 37 points
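The worked example above, expressed directly as the scoring formula in Python (the input measurements are the ones stated in the example, not real benchmark numbers):

```python
def score(accuracy_pass, ref_ttft, sub_ttft, sub_tps, ref_tps, nki_coverage):
    """Combined score: Accuracy x (Reference TTFT / Submission TTFT)
    x (Submission tokens/sec / Reference tokens/sec) x (1 + NKI coverage)."""
    if not accuracy_pass:
        return 0.0  # Failing accuracy zeroes the score outright.
    latency_mult = ref_ttft / sub_ttft
    throughput_mult = sub_tps / ref_tps
    return latency_mult * throughput_mult * (1.0 + nki_coverage)

# 10x latency reduction, 2x throughput, 85% NKI coverage -> 37 points.
print(score(True, ref_ttft=10.0, sub_ttft=1.0,
            sub_tps=40.0, ref_tps=20.0, nki_coverage=0.85))
```

Because the terms multiply, a balanced submission beats one that maximizes a single factor: doubling NKI coverage can at most double the score, while latency and throughput multipliers are unbounded.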

Evaluation Details

  • All submissions are evaluated at batch size 1
  • Round One hardware: Single Trn2 chip, Neuron SDK 2.28
  • Round Two hardware: Single Trn3 chip, Neuron SDK 2.28
  • Top 15 teams from Round One advance to Round Two
  • Team registration is currently closed

Leaderboard

Live rankings updated as submissions are evaluated. Scores reflect the best submission per team.