NKI validation status
A snapshot of where each sub-project stands on Phase 1 — the "NKI kernels replace stubs and match the PyTorch reference on real trn1/trn2 hardware" milestone.
Three gates, in order of how cheaply they can be run:
- Simulator gate: `nki.simulate(kernel)(numpy_args)` runs the NKI program on CPU with no device and no NEFF compile. Exercised in every PR via the `nki-simulator` CI job on `ubuntu-latest` across all six libraries. Catches Python-layer correctness, shape mismatches, and API drift.
- Hardware gate: `@pytest.mark.neuron` tests dispatched to a per-repo `trn1.2xlarge` via SSM, run manually by maintainers via `scripts/run_neuron_tests.sh`. Catches MLIR-verifier issues, numerical behavior, and real NEFF compile.
- Canonical reference: published spec vectors, PyTorch-parity tolerance, or scipy / PySCF / LAPACK agreement, depending on the library.
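The "PyTorch-parity tolerance" form of the canonical-reference gate amounts to an element-wise closeness check. A minimal sketch, assuming the common `|actual - reference| <= atol + rtol * |reference|` rule (the one `numpy.allclose` and `torch.allclose` use); the helper name and the tolerance values are illustrative and not taken from any of the six libraries:

```python
import math


def within_tolerance(actual, reference, rtol=1e-5, atol=1e-8):
    """Return True if every element satisfies |a - r| <= atol + rtol * |r|.

    Hypothetical helper: the rtol/atol defaults are placeholders, not the
    tolerances any sub-project actually uses.
    """
    if len(actual) != len(reference):
        return False  # a shape mismatch counts as a failure
    for a, r in zip(actual, reference):
        if math.isnan(a) or math.isnan(r):
            return False  # NaNs never count as matching
        if abs(a - r) > atol + rtol * abs(r):
            return False
    return True


# Stand-ins for a kernel output and its reference result.
reference = [1.0, 2.0, 3.0]
kernel_out = [1.0, 2.00001, 3.0]

loose = within_tolerance(kernel_out, reference, rtol=1e-4)  # passes
tight = within_tolerance(kernel_out, reference, rtol=1e-7)  # fails
print(loose, tight)
```

The same output can pass or fail depending on the tolerance chosen, which is why the tolerance itself is part of the reference definition for each library.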
Per-kernel details live in each sub-project's docs/architecture.md. Per-phase definitions live in the suite roadmap.
Status by sub-project
| Sub-project | Phase 1 tracker | Status | NKI kernels | Blog retrospective |
|---|---|---|---|---|
| trnfft | #51 | ✅ Hardware-validated | butterfly FFT, batched FFT, complex GEMM, DFT-as-GEMM fast path, Kahan butterfly | trnfft: FFT on hardware that doesn't want to be an FFT engine |
| trnblas | #21 | ✅ Hardware-validated | GEMM, SYRK, fused DF-MP2 energy reduction | trnblas: fusing DF-MP2 energy reduction into one NKI kernel |
| trnrand | #18 | 🚧 Simulator-validated, upstream-blocked | Philox 4×32-10, Box-Muller | trnrand: RNG is a four-engine workload |
| trnsolver | #26 | 🕑 Simulator-validated | batched-sweep Jacobi eigh, Newton–Schulz inverse-sqrt, CG / GMRES with Jacobi preconditioner | trnsolver: Jacobi for Trainium |
| trnsparse | #14 | ✅ Hardware-validated | BSR-128 SpMM, fused screened SpMM, CSR-materialized SpMM | trnsparse: the tile is the unit, not the nonzero |
| trntensor | #27 | ✅ Hardware-validated | 2-index + batched nc_matmul, fused MP2 energy, 4-index AO→MO transform | trntensor: when the kernel boundary is the API |
Legend:
- ✅ Hardware-validated: kernels pass `@pytest.mark.neuron` on trn1; NKI is the default dispatch target when `neuronxcc` is available.
- 🕑 Simulator-validated: kernels pass `nki-simulator` CI; hardware runs queued. PyTorch is the default until hardware passes.
- 🚧 Simulator-validated, upstream-blocked: kernels compile and run; the library has a named, trackable NKI primitive gap preventing numerically-correct hardware output. See the blog retrospective for specifics.
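The dispatch rule in the legend can be read as a small decision function. This is a sketch under stated assumptions, not any sub-project's actual dispatcher: the status strings, `neuronxcc_available`, and `default_backend` are hypothetical names, and falling back to PyTorch for the upstream-blocked status is an assumption, since the legend only states the default for the first two statuses.

```python
import importlib.util

# Status strings mirror the legend above; the names are hypothetical.
HARDWARE_VALIDATED = "hardware-validated"
SIMULATOR_VALIDATED = "simulator-validated"
UPSTREAM_BLOCKED = "simulator-validated, upstream-blocked"


def neuronxcc_available():
    """True if the `neuronxcc` package can be found in this environment."""
    return importlib.util.find_spec("neuronxcc") is not None


def default_backend(status, has_neuronxcc=None):
    """Pick the default dispatch target for a library with the given status."""
    if has_neuronxcc is None:
        has_neuronxcc = neuronxcc_available()
    # NKI is the default only once hardware validation has passed
    # and the compiler is present.
    if status == HARDWARE_VALIDATED and has_neuronxcc:
        return "nki"
    # Assumption: every other case falls back to the PyTorch path.
    return "pytorch"


print(default_backend(HARDWARE_VALIDATED, has_neuronxcc=True))
print(default_backend(SIMULATOR_VALIDATED, has_neuronxcc=True))
```

Passing `has_neuronxcc` explicitly keeps the rule testable on machines without the Neuron compiler installed.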
Cross-suite infrastructure
- NKI 0.3.0 migration complete across all six libraries. Coordination tracked in trnsci/trnsci#5. Narrative: The dev loop just got a lot shorter.
- `nki-simulator` CI gate on `ubuntu-latest` for every library. Fast iteration for Python-layer correctness; does not replace hardware for MLIR-verifier or numerical-behavior checks.
- Hardware CI runs manually via per-repo `scripts/run_neuron_tests.sh` against a `<repo>-ci-trn1` instance.
Looking ahead
- trnrand's Phase 1 closes when either an NKI integer-multiply primitive or a bitwise-exact `nl.copy` path lands. Tracked in aws-neuron-sdk#1308.
- trnsolver's hardware validation is the remaining simulator-to-hardware gap; Phase 3 introduces the Tensor Engine reformulation of the Jacobi rotation and dispatch-count reduction.
- Phases 2–5 (precision, single-chip perf, multi-chip, generation-specific optimization) build on top of Phase 1 per the roadmap and are tracked per sub-project via the matching `phase-N` labels.
Design RFCs
Sub-projects with published design docs for upcoming phases:
- trnblas: fused DF-MP2 pair-energy kernel (Phase 3). Collapses `(T*(2T−T.T)/denom).sum()` into one SBUF-resident Vector+Scalar pass; targets 3–6× speedup on the DF-MP2 hot path.
- trnrand: SBUF-resident streaming Generator (Phase 3). Pre-compiled multi-distribution kernel with pipelined GpSimd / Vector / Scalar engines.
- trnrand: counter-partitioned multi-chip RNG (Phase 4). Bit-exact cross-cluster-size RNG via Philox counter-space partitioning.
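Counter-space partitioning works because a counter-based generator such as Philox computes each output block from its counter alone, so the stream does not depend on which chip evaluated which counter. The pure-Python sketch below illustrates the idea with a host-side Philox 4×32-10 written from the published algorithm. It is not trnrand's kernel or the RFC's design: the contiguous-slice partition, the counter word order, and the names `chip_stream` and `cluster_stream` are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF
M0, M1 = 0xD2511F53, 0xCD9E8D57  # Philox 4x32 round multipliers
W0, W1 = 0x9E3779B9, 0xBB67AE85  # Weyl increments for the key schedule


def philox4x32_10(counter, key):
    """Map a 128-bit counter (int) and a (k0, k1) key to four 32-bit words."""
    # Assumption: word 0 holds the least significant 32 bits of the counter.
    c = [(counter >> (32 * i)) & MASK32 for i in range(4)]
    k0, k1 = key
    for _ in range(10):
        p0 = M0 * c[0]  # full 64-bit products (exact in Python ints)
        p1 = M1 * c[2]
        c = [(p1 >> 32) ^ c[1] ^ k0, p1 & MASK32,
             (p0 >> 32) ^ c[3] ^ k1, p0 & MASK32]
        k0 = (k0 + W0) & MASK32  # bump the key between rounds
        k1 = (k1 + W1) & MASK32
    return c


def chip_stream(key, n_blocks, rank, world_size):
    """Blocks owned by one chip: a contiguous slice of the global counters."""
    start = rank * n_blocks // world_size
    stop = (rank + 1) * n_blocks // world_size
    return [philox4x32_10(i, key) for i in range(start, stop)]


def cluster_stream(key, n_blocks, world_size):
    """Concatenate every chip's slice in rank order."""
    out = []
    for rank in range(world_size):
        out.extend(chip_stream(key, n_blocks, rank, world_size))
    return out


key = (0x12345678, 0x9ABCDEF0)
one_chip = cluster_stream(key, 1000, world_size=1)
# The same global stream comes out regardless of cluster size.
assert cluster_stream(key, 1000, world_size=4) == one_chip
assert cluster_stream(key, 1000, world_size=7) == one_chip
```

Because no state is carried from one block to the next, changing `world_size` only changes who computes each counter, not the values produced.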
Maintenance
This page is updated when a Phase 1 tracker closes (or opens, for any new sub-project added to the suite). Historic state lives in the git history — no versioning beyond that.