Developing NKI kernels¶
trnsolver currently ships one NKI kernel (rotate_pairs_kernel) in
trnsolver/nki/dispatch.py. It's the per-round primitive for the
parallel Jacobi sweep in _jacobi_eigh_nki. This page is the short
trnsolver-specific version of the suite-wide guide at
trnsci/docs/developing_kernels.md.
Three dispatch modes¶
| Mode | Trigger | When to use |
|---|---|---|
| PyTorch fallback | HAS_NKI = False (non-Neuron host), or an exception in the NKI path gets caught |
Laptops, GPUs, CI's ubuntu-latest test matrix — the default for anyone without Neuron installed |
| NKI hardware | HAS_NKI = True, default env. Kernel runs through torch_xla → NEFF compile → NeuronCore |
Real perf numbers, final validation |
| NKI simulator | TRNSOLVER_USE_SIMULATOR=1 + HAS_NKI = True. Kernel runs through nki.simulate(kernel)(numpy_args) on CPU |
Fast correctness iteration during kernel design |
All three modes share the same kernel source — the @nki.jit definition
inside if HAS_NKI: in trnsolver/nki/dispatch.py. The host driver
(_jacobi_eigh_nki in trnsolver/eigen.py) picks the path based on
_use_nki() and _use_simulator().
Simulator workflow¶
NKI 0.3.0 Stable (Neuron SDK 2.29, April 2026) ships a CPU simulator that runs kernels without Trainium hardware. It collapses the iteration loop from ~8–12 min per attempt (instance start + SSM + NEFF compile) to seconds per iteration.
Three equivalent ways to run:
# (1) GH Actions — automatic on every push, zero AWS cost
# See .github/workflows/ci.yml::nki-simulator
# (2) Local Linux x86_64 with nki installed
TRNSOLVER_USE_SIMULATOR=1 pytest tests/ -m nki_simulator -v
# (3) SSM against the provisioned trn1 CI instance (uses the bundled Neuron venv)
AWS_PROFILE=aws ./scripts/run_simulator_tests.sh
CI coverage¶
| Gate | Runner | Catches | Misses |
|---|---|---|---|
test matrix (py 3.10/3.11/3.12) |
ubuntu-latest |
Pure-Python correctness against torch.linalg.eigh. ~1 s. |
Anything NKI-kernel-specific. |
nki-simulator |
ubuntu-latest |
Python trace-level kernel errors: wrong nc_matmul kwargs, dropped ops (nl.divide), shape mismatches, tile-size violations. Seconds per kernel. |
MLIR verifier errors — simulator explicitly skips compile. Perf. |
neuron (SSM, manual) |
trn1.2xlarge |
Full NEFF compile + on-hardware execution. MLIR verification. Real perf. | Nothing (this is the ground truth). |
The nki-simulator gate catches the majority of the iteration pain
(Python-trace breakage) without AWS round-trips. Hardware runs are
reserved for MLIR verification and perf numbers.
Architecture-first (reminder)¶
Every NKI kernel in trnsolver should pull at least one documented architectural lever from #36 — stationary-operand reuse, FP32-PSUM-free mixed precision, 4-engine concurrency, SBUF residency. Kernel changes that are "port scipy faster" get re-scoped or rejected.
The rotation-kernel post-mortem on #9 is the concrete lesson.