Hi,
Sharing a training-free KV cache compression approach we've been developing that hooks into DynamicCache. Might be useful for folks running into memory limits with long contexts.
NexusQuant compresses the KV cache by 10-33x by combining attention-based token eviction with E8 lattice vector quantization. It monkey-patches DynamicLayer.update to intercept KV writes — same pattern as kvpress.
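The hook pattern itself is easy to sketch. The toy class below stands in for transformers' DynamicLayer (the real patch targets transformers' cache classes), and `compress` is a placeholder for the eviction + quantization step — names here are illustrative, not NexusQuant's actual code:

```python
# Minimal sketch of the monkey-patch pattern (illustrative, not the
# actual NexusQuant code): wrap the cache layer's update() so every KV
# write can be compressed before it is stored.

class ToyCacheLayer:
    """Stand-in for transformers' DynamicLayer; real caches hold
    [batch, heads, seq_len, head_dim] tensors, plain lists suffice here."""
    def update(self, key_states, value_states):
        self.keys, self.values = key_states, value_states
        return self.keys, self.values

def install_hook(layer_cls, compress):
    original = layer_cls.update
    def patched(self, key_states, value_states):
        # intercept the write: compress K and V, then store as usual
        return original(self, compress(key_states), compress(value_states))
    layer_cls.update = patched
    return original  # returned so the caller can undo the patch

# toy "compressor": round to 1 decimal, mimicking low-bit quantization
restore = install_hook(ToyCacheLayer, lambda t: [round(x, 1) for x in t])
layer = ToyCacheLayer()
k, v = layer.update([0.123, 0.456], [0.789, 0.012])
```

Because the patch wraps rather than replaces `update`, the original write path (and its return contract) is preserved.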
Some recent GPU results across 3 models:
- Mistral-7B: 9x compression, essentially zero PPL loss with our real attention scorer
- Phi-3-mini (head_dim=96): handled via zero-padding, 9x compression at +0.59% PPL
- Qwen2.5-7B: needs the first and last 2 layers kept at FP16 (boundary protection); quantizing them causes catastrophic degradation
One thing that surprised us: 3-bit keys + 2-bit values dramatically outperforms symmetric 2-bit on all models. The softmax amplifies key quantization noise across all positions, so keys deserve more precision. This is consistent with what the TurboQuant+ project found on Apple Silicon.
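A toy numeric check makes the asymmetry plausible (this is an illustration, not NexusQuant code): inject the same Gaussian noise into keys vs. values and compare the attention output error. Key noise perturbs the softmax logits and so redistributes weight across every position; value noise is merely averaged under the attention weights.

```python
# Toy demo: the same noise budget hurts more on keys than on values,
# because key noise passes through the softmax. Pure-Python, single-query
# attention; all sizes and noise levels are arbitrary illustration values.
import math, random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(q, K, V):
    # weights over positions, then a weighted sum of value vectors
    w = softmax([sum(a * b for a, b in zip(q, k)) for k in K])
    return [sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))]

def noisy(M, sigma, rng):
    return [[x + rng.gauss(0, sigma) for x in row] for row in M]

rng = random.Random(0)
d, n, sigma, trials = 8, 16, 0.2, 200
key_err = val_err = 0.0
for _ in range(trials):
    q = [rng.gauss(0, 1) for _ in range(d)]
    K = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    V = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    base = attend(q, K, V)
    dist = lambda out: math.sqrt(sum((a - b) ** 2 for a, b in zip(out, base)))
    key_err += dist(attend(q, noisy(K, sigma, rng), V))  # noise on keys
    val_err += dist(attend(q, K, noisy(V, sigma, rng)))  # noise on values
# Averaged over trials, key noise typically yields the larger output error.
```

The averaged value-side error is bounded by the noise scale times the attention-weight norm, while the key-side error picks up an extra factor from the logit dimension — which is exactly why an extra bit on keys pays off.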
The API is a context manager:
from nexusquant.integrations.huggingface import nexusquant_evict
with nexusquant_evict(model, quality="high"):
    output = model.generate(input_ids, max_new_tokens=200)
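Under the hood, a context manager like this typically just scopes the monkey-patch: install on enter, restore on exit. A hypothetical reduction (the real nexusquant_evict also takes the model and a quality preset, which are omitted here):

```python
# Hypothetical sketch of the context-manager side of the hook (not the
# real nexusquant_evict): patch update() on enter, always restore on exit.
from contextlib import contextmanager

class ToyCacheLayer:
    """Stand-in for the cache layer class whose update() gets patched."""
    def update(self, key_states, value_states):
        self.keys, self.values = key_states, value_states
        return self.keys, self.values

@contextmanager
def evict_scope(layer_cls, compress):
    original = layer_cls.update
    def patched(self, k, v):
        return original(self, compress(k), compress(v))
    layer_cls.update = patched
    try:
        yield
    finally:
        layer_cls.update = original  # undo even if generate() raises

# inside the scope, writes are compressed (rounding mimics quantization)
with evict_scope(ToyCacheLayer, lambda t: [round(x, 1) for x in t]):
    k, _ = ToyCacheLayer().update([0.26, 0.44], [1.0, 2.0])
# outside the scope, the class behaves exactly as before
```

The try/finally is the important part: the patch must be undone even when generation fails, or every later forward pass stays silently compressed.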
We also added physical KV truncation (actually remove evicted tokens from tensors, not just mask them) and asymmetric K/V as options.
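Physical truncation is easy to illustrate. In this sketch (an illustration, not the library's implementation), only the top-scoring token positions survive, so the cache itself shrinks; real code would index-select along the sequence axis of the [batch, heads, seq, head_dim] tensors:

```python
def truncate_kv(keys, values, scores, keep):
    """Drop evicted tokens from the cache entirely rather than masking them.
    keys/values: one entry per cached position; scores: attention-based
    importance per position (toy stand-ins for the real tensors)."""
    # indices of the `keep` highest-scoring positions, original order kept
    top = sorted(sorted(range(len(scores)), key=scores.__getitem__)[-keep:])
    return [keys[i] for i in top], [values[i] for i in top]

# tokens 1 and 3 score highest, so only they survive; memory shrinks 2x
k, v = truncate_kv(["k0", "k1", "k2", "k3"], ["v0", "v1", "v2", "v3"],
                   [0.1, 0.9, 0.3, 0.7], keep=2)
```

Keeping the surviving positions in their original order matters: position-dependent attention downstream assumes the sequence axis stays sorted.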
Code: https://github.com/jagmarques/nexusquant
Would welcome any feedback, especially on the DynamicCache hook pattern.