
NKI validation status

A snapshot of where each sub-project stands on Phase 1 — the "NKI kernels replace stubs and match the PyTorch reference on real trn1/trn2 hardware" milestone.

Three gates, in order of how cheaply they can be run:

  1. Simulator gate: nki.simulate(kernel)(numpy_args) runs the NKI program on CPU, with no device and no NEFF compile. It is exercised in every PR via the nki-simulator CI job on ubuntu-latest across all six libraries, and catches Python-layer correctness bugs, shape mismatches, and API drift.
  2. Hardware gate: @pytest.mark.neuron tests are dispatched to a per-repo trn1.2xlarge via SSM; maintainers run them manually via scripts/run_neuron_tests.sh. It catches MLIR-verifier issues, numerical-behavior differences, and failures in real NEFF compilation.
  3. Canonical reference: published spec vectors, PyTorch-parity tolerance, or scipy / PySCF / LAPACK agreement, depending on the library.
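
As an illustration of what a parity tolerance in the third gate might mean in practice, here is a minimal pure-Python sketch. The helper names (`max_abs_error`, `within_tolerance`) and the default `rtol` / `atol` values are assumptions made for this example, not the suite's actual test utilities. The comparison rule `|got - want| <= atol + rtol * |want|` follows the convention used by `numpy.allclose` and `torch.allclose`.

```python
def max_abs_error(got, want):
    """Largest element-wise absolute difference between two flat sequences."""
    if len(got) != len(want):
        raise ValueError("length mismatch between kernel output and reference")
    return max((abs(g - w) for g, w in zip(got, want)), default=0.0)


def within_tolerance(got, want, rtol=1e-5, atol=1e-8):
    """True if every element satisfies |got - want| <= atol + rtol * |want|.

    rtol and atol are illustrative defaults; each library would pick its own.
    """
    if len(got) != len(want):
        raise ValueError("length mismatch between kernel output and reference")
    return all(abs(g - w) <= atol + rtol * abs(w) for g, w in zip(got, want))


if __name__ == "__main__":
    reference = [1.0, 2.0, 3.0]          # stand-in for a PyTorch reference result
    kernel_out = [1.0, 2.0 + 1e-7, 3.0]  # stand-in for a kernel result
    print(within_tolerance(kernel_out, reference))
    print(max_abs_error(kernel_out, reference))
```

The same check can be applied to simulator output and hardware output alike, which keeps the pass criterion identical across gates.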

Per-kernel details live in each sub-project's docs/architecture.md. Per-phase definitions live in the suite roadmap.

Status by sub-project

| Sub-project | Phase 1 tracker | Status | NKI kernels | Blog retrospective |
|---|---|---|---|---|
| trnfft | #51 | ✅ Hardware-validated | butterfly FFT, batched FFT, complex GEMM, DFT-as-GEMM fast path, Kahan butterfly | trnfft: FFT on hardware that doesn't want to be an FFT engine |
| trnblas | #21 | ✅ Hardware-validated | GEMM, SYRK, fused DF-MP2 energy reduction | trnblas: fusing DF-MP2 energy reduction into one NKI kernel |
| trnrand | #18 | 🚧 Simulator-validated, upstream-blocked | Philox 4×32-10, Box-Muller | trnrand: RNG is a four-engine workload |
| trnsolver | #26 | 🕑 Simulator-validated | batched-sweep Jacobi eigh, Newton–Schulz inverse-sqrt, CG / GMRES with Jacobi preconditioner | trnsolver: Jacobi for Trainium |
| trnsparse | #14 | ✅ Hardware-validated | BSR-128 SpMM, fused screened SpMM, CSR-materialized SpMM | trnsparse: the tile is the unit, not the nonzero |
| trntensor | #27 | ✅ Hardware-validated | 2-index + batched nc_matmul, fused MP2 energy, 4-index AO→MO transform | trntensor: when the kernel boundary is the API |

Legend:

  • ✅ Hardware-validated: kernels pass @pytest.mark.neuron on trn1; NKI is the default dispatch target when neuronxcc is available.
  • 🕑 Simulator-validated: kernels pass nki-simulator CI; hardware runs are queued. PyTorch is the default until hardware passes.
  • 🚧 Simulator-validated, upstream-blocked: kernels compile and run, but the library has a named, trackable NKI primitive gap that prevents numerically correct hardware output. See the blog retrospective for specifics.
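
The dispatch rule in the legend (NKI by default only for hardware-validated kernels, and only when neuronxcc is available) can be sketched roughly as follows. This is a hypothetical illustration, not any library's actual dispatch code: the status constants and the `default_backend` function are invented for the example, and treating upstream-blocked libraries as PyTorch-default is an assumption the legend does not state explicitly. The availability check uses the standard-library `importlib.util.find_spec`.

```python
import importlib.util

# Hypothetical status labels mirroring the legend entries.
HARDWARE_VALIDATED = "hardware-validated"
SIMULATOR_VALIDATED = "simulator-validated"
UPSTREAM_BLOCKED = "simulator-validated, upstream-blocked"


def compiler_available(module_name="neuronxcc"):
    """Return True if the named top-level module is importable here."""
    return importlib.util.find_spec(module_name) is not None


def default_backend(status, have_compiler=None):
    """Pick the default backend for a library given its validation status.

    NKI is chosen only when the kernels are hardware-validated and the
    compiler is present; everything else falls back to PyTorch
    (the upstream-blocked fallback is an assumption of this sketch).
    """
    if have_compiler is None:
        have_compiler = compiler_available()
    if status == HARDWARE_VALIDATED and have_compiler:
        return "nki"
    return "pytorch"


if __name__ == "__main__":
    for status in (HARDWARE_VALIDATED, SIMULATOR_VALIDATED, UPSTREAM_BLOCKED):
        print(status, "->", default_backend(status))
```

Passing `have_compiler` explicitly keeps the decision testable on machines without the Neuron toolchain installed.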

Cross-suite infrastructure

  • NKI 0.3.0 migration complete across all six libraries. Coordination tracked in trnsci/trnsci#5. Narrative: The dev loop just got a lot shorter.
  • nki-simulator CI gate on ubuntu-latest for every library. Fast iteration for Python-layer correctness; does not replace hardware for MLIR-verifier or numerical-behavior checks.
  • Hardware CI runs manually via per-repo scripts/run_neuron_tests.sh against a <repo>-ci-trn1 instance.

Looking ahead

  • trnrand's Phase 1 closes when either an NKI integer-multiply primitive or a bitwise-exact nl.copy path lands. Tracked in aws-neuron-sdk#1308.
  • trnsolver's hardware validation is the remaining simulator-to-hardware gap; Phase 3 introduces the Tensor Engine reformulation of the Jacobi rotation and dispatch-count reduction.
  • Phases 2–5 (precision, single-chip perf, multi-chip, generation-specific optimization) build on top of Phase 1 per the roadmap and are tracked per sub-project via the matching phase-N labels.
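
For context on the trnrand item, the published Philox 4×32 reference algorithm builds each round from two 32×32→64-bit unsigned integer multiplies, which may be why an integer-multiply primitive matters there. The pure-Python sketch below follows the Random123 reference formulation (its round constants and key schedule). It is not trnrand's kernel, and it makes no claim about how NKI would express these operations.

```python
MASK32 = 0xFFFFFFFF
M0, M1 = 0xD2511F53, 0xCD9E8D57  # Philox 4x32 round multipliers
W0, W1 = 0x9E3779B9, 0xBB67AE85  # Weyl constants used to bump the key


def mulhilo32(a, b):
    """Full 32x32 -> 64-bit unsigned product, split into (high, low) words."""
    product = (a & MASK32) * (b & MASK32)
    return product >> 32, product & MASK32


def philox_round(ctr, key):
    """One Philox 4x32 round: two integer multiplies, XORs, and a permutation."""
    hi0, lo0 = mulhilo32(M0, ctr[0])
    hi1, lo1 = mulhilo32(M1, ctr[2])
    return (hi1 ^ ctr[1] ^ key[0], lo1, hi0 ^ ctr[3] ^ key[1], lo0)


def philox4x32(ctr, key, rounds=10):
    """Apply `rounds` rounds, bumping the key before every round after the first."""
    ctr = tuple(c & MASK32 for c in ctr)
    key = tuple(k & MASK32 for k in key)
    for i in range(rounds):
        if i > 0:
            key = ((key[0] + W0) & MASK32, (key[1] + W1) & MASK32)
        ctr = philox_round(ctr, key)
    return ctr


if __name__ == "__main__":
    words = philox4x32((0, 0, 0, 0), (0, 0))
    print([f"{w:08x}" for w in words])
```

Every output bit depends on the exact high and low halves of those products, so an approximate multiply (for example, one routed through floating point) would not reproduce the reference stream.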

Design RFCs

Sub-projects with published design docs for upcoming phases:

Maintenance

This page is updated when a Phase 1 tracker closes (or opens, for any new sub-project added to the suite). Historic state lives in the git history — no versioning beyond that.