Releases: eugenehp/gpu-fft
Releases · eugenehp/gpu-fft
1.1.0
Added
Batched FFT / IFFT
fft_batch/ifft_batchpublic API — accepts&[Vec<f32>]; pads all
signals to the same power-of-two length, packs them into a flat GPU buffer,
and runs the full butterfly pipeline in a single set of kernel launches
regardless of batch sizebutterfly_inner_batchCubeCL kernel — extends the shared-memory inner pass
to coverB × tiles_per_signalworkgroups simultaneouslybutterfly_stage_batchCubeCL kernel — extends each outer-stage dispatch to
cover allB × N/2butterfly pairs at once;batch_sizeis a#[comptime]
parameter so the thread-guard is a compile-time constant- Test suite —
tests/fft_batch.rs(9 cases: impulse, DC, alternating,
random, mixed lengths) andtests/ifft_batch.rs(7 cases)
Radix-4 outer butterfly
-
butterfly_stage_radix4(scalar) andbutterfly_stage_radix4_batch
(batched) CubeCL kernels — each of the N/4 threads loads four elements
{p, p+q, p+2q, p+3q}and executes two radix-2 butterfly stages entirely in
registers, eliminating one global-memory round-trip per pair of outer stages -
Analytical W₂b — the second twiddle is derived from the first without an
extra trig call:cos₂b = +sin₂a,sin₂b = −cos₂a(forward); sign flips
for the inverse transform -
Updated outer loop in
fft,ifft,fft_batch,ifft_batch:
while s + 1 < mdispatches radix-4; trailingif s < mdispatches a
single radix-2 when the remaining outer-stage count is odd -
Launch count reduction:
N v1.0 launches v1.1 launches Saved ≤ 1 024 1 1 — 2 048 2 2 — 4 096 3 2 1 8 192 4 3 1 16 384 5 4 1 65 536 7 4 3
Benchmarks (benches/fft_bench.rs)
- 9 new Criterion groups covering the batched path:
fft_batch/batch_size,fft_batch/signal_len,fft_batch_vs_sequential,
ifft_batch/batch_size,ifft_batch/signal_len,ifft_batch_vs_sequential,
roundtrip_batch,roundtrip_batch/signal_len - 6 radix-4 groups (
fft_radix4_outer,ifft_radix4_outer,
roundtrip_radix4_outer,fft_batch_radix4_outer,ifft_batch_radix4_outer,
roundtrip_batch_radix4_outer) sweepingRADIX4_OUTER_SIZES=
{2 048, 4 096, 8 192, 16 384, 65 536} — one size per outer-stage dispatch
pattern (trailing r2 · pure r4 · mixed r4+r2) bench.shforwards extra flags (e.g.--save-baseline,--baseline) to
the Criterion binary for before/after comparisons
Charts & report generation (scripts/export_bench.py)
- Full chart pipeline built from scratch: colour palette, marker shapes,
human-readable labels, signal-length / batch-size / vs-sequential /
radix-4 chart families - 7 SVG charts auto-generated on every bench run:
latency.svg,throughput.svg,batch_signal.svg,batch_size.svg,
vs_sequential.svg,radix4_outer.svg,radix4_batch_outer.svg _TABLE_SECTIONSpartitions the Markdown summary table into labelled
groups; a catch-all appends any unrecognised groups automatically- Image paths in generated reports now use
../charts/<name>so archive
files underbench-results/archive/render correctly
Tests
test_fft_impulse_large_even(N = 4 096) — single radix-4 outer dispatchtest_fft_impulse_large_odd(N = 8 192) — mixed r4 + trailing r2test_fft_dc_large(N = 4 096) — DC bin correctness at large Ntest_roundtrip_large_even/test_roundtrip_large_odd— round-trip
max error bounded by5 · log₂(N) · ε_machine
Documentation
- Algorithm section — radix-4 butterfly formula and W₂b derivation written
out in full; launch-count tables updated for both scalar and batched paths - Benchmark tables — scalar baselines, all batch signal-length/batch-size
sweeps, batch-vs-sequential with speedup column, two radix-4 outer tables
(scalar + batch=16) - Citation section — BibTeX
@softwareentry and APA plain-text - Roadmap — radix-4 item marked complete; batch item marked complete
Changed
- Outer butterfly loop in
fft/ifft/fft_batch/ifft_batchreplaced
from sequential single-stage radix-2 dispatches to paired radix-4 dispatches
with an optional trailing radix-2
Performance
Measured 2026-02-22, commit b231b99, wgpu backend.
| Benchmark | v1.0.0 | v1.1.0 | Change |
|---|---|---|---|
fft N = 4 096 |
557 µs · 7.35 Melem/s | 447 µs · 9.16 Melem/s | −20% latency |
fft N = 65 536 |
1.12 ms · 58.7 Melem/s | 940 µs · 69.7 Melem/s | −16% latency |
ifft N = 4 096 |
894 µs · 4.58 Melem/s | 476 µs · 8.61 Melem/s | −47% latency |
ifft N = 65 536 |
1.18 ms · 55.6 Melem/s | 1.12 ms · 58.8 Melem/s | −5% latency |
fft_batch N = 65 536, B = 16 |
— | 7.90 ms · 133 Melem/s | new |
fft_batch vs sequential, B = 64 |
— | 13.5× | new |
roundtrip_batch vs sequential, B = 64 |
— | 14.6× | new |
1.0.0
Features
- Cooley-Tukey radix-2 DIT FFT/IFFT — O(N log₂ N), runs
log₂ Nbutterfly-stage
kernel dispatches of N/2 threads each instead of a single O(N²) DFT pass - Automatic zero-padding — non-power-of-two inputs are padded to the next power of two
- Power Spectral Density — one-sided PSD with correct 1/N normalisation
- Dominant frequency detection — local-peak search above a threshold
- wgpu and CUDA backends via CubeCL feature flags
Implemented
- GPU kernel (CubeCL / wgpu)
- Precomputed twiddle factors (WIP branch)
- Cooley-Tukey radix-2 FFT — O(N log N)
- Shared-memory tiling — inner stages fused into one launch; 10 → 1 dispatch for N ≤ 1 024