Hi,
Sharing a training-free KV cache compression approach we've been developing that hooks into DynamicCache. Might be useful for folks running into memory limits with long contexts.
NexusQuant compresses the KV cache by 10-33x by combining attention-based token eviction with E8 lattice vector quantization. It monkey-patches DynamicLayer.update to intercept KV writes — same pattern as kvpress.
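The hook pattern itself is easy to sketch. The toy class below stands in for transformers' DynamicLayer (the real patch targets transformers' cache classes), and `compress` is a placeholder for the eviction + quantization step — names here are illustrative, not NexusQuant's actual code:

```python
# Minimal sketch of the monkey-patch pattern (illustrative, not the
# actual NexusQuant code): wrap the cache layer's update() so every KV
# write can be compressed before it is stored.

class ToyCacheLayer:
    """Stand-in for transformers' DynamicLayer; real caches hold
    [batch, heads, seq_len, head_dim] tensors, plain lists suffice here."""
    def update(self, key_states, value_states):
        self.keys, self.values = key_states, value_states
        return self.keys, self.values

def install_hook(layer_cls, compress):
    original = layer_cls.update
    def patched(self, key_states, value_states):
        # intercept the write: compress K and V, then store as usual
        return original(self, compress(key_states), compress(value_states))
    layer_cls.update = patched
    return original  # returned so the caller can undo the patch

# toy "compressor": round to 1 decimal, mimicking low-bit quantization
restore = install_hook(ToyCacheLayer, lambda t: [round(x, 1) for x in t])
layer = ToyCacheLayer()
k, v = layer.update([0.123, 0.456], [0.789, 0.012])
```

Because the patch wraps rather than replaces `update`, the original write path (and its return contract) is preserved.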
Some recent GPU results across 3 models:
- Mistral-7B: 9x compression, essentially zero PPL loss with our real attention scorer
- Phi-3-mini (head_dim=96): handled via zero-padding, 9x compression at +0.59% PPL
- Qwen2.5-7B: needs the first and last 2 layers kept at FP16 (boundary protection); quantizing them causes catastrophic degradation
One thing that surprised us: 3-bit keys + 2-bit values dramatically outperforms symmetric 2-bit on all models. The softmax amplifies key quantization noise across all positions, so keys deserve more precision. This is consistent with what the TurboQuant+ project found on Apple Silicon.
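A toy numeric check makes the asymmetry plausible (this is an illustration, not NexusQuant code): inject the same Gaussian noise into keys vs. values and compare the attention output error. Key noise perturbs the softmax logits and so redistributes weight across every position; value noise is merely averaged under the attention weights.

```python
# Toy demo: the same noise budget hurts more on keys than on values,
# because key noise passes through the softmax. Pure-Python, single-query
# attention; all sizes and noise levels are arbitrary illustration values.
import math, random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(q, K, V):
    # weights over positions, then a weighted sum of value vectors
    w = softmax([sum(a * b for a, b in zip(q, k)) for k in K])
    return [sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))]

def noisy(M, sigma, rng):
    return [[x + rng.gauss(0, sigma) for x in row] for row in M]

rng = random.Random(0)
d, n, sigma, trials = 8, 16, 0.2, 200
key_err = val_err = 0.0
for _ in range(trials):
    q = [rng.gauss(0, 1) for _ in range(d)]
    K = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    V = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    base = attend(q, K, V)
    dist = lambda out: math.sqrt(sum((a - b) ** 2 for a, b in zip(out, base)))
    key_err += dist(attend(q, noisy(K, sigma, rng), V))  # noise on keys
    val_err += dist(attend(q, K, noisy(V, sigma, rng)))  # noise on values
# Averaged over trials, key noise typically yields the larger output error.
```

The averaged value-side error is bounded by the noise scale times the attention-weight norm, while the key-side error picks up an extra factor from the logit dimension — which is exactly why an extra bit on keys pays off.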
The API is a context manager:
from nexusquant.integrations.huggingface import nexusquant_evict
with nexusquant_evict(model, quality="high"):
    output = model.generate(input_ids, max_new_tokens=200)
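Under the hood, a context manager like this typically just scopes the monkey-patch: install on enter, restore on exit. A hypothetical reduction (the real nexusquant_evict also takes the model and a quality preset, which are omitted here):

```python
# Hypothetical sketch of the context-manager side of the hook (not the
# real nexusquant_evict): patch update() on enter, always restore on exit.
from contextlib import contextmanager

class ToyCacheLayer:
    """Stand-in for the cache layer class whose update() gets patched."""
    def update(self, key_states, value_states):
        self.keys, self.values = key_states, value_states
        return self.keys, self.values

@contextmanager
def evict_scope(layer_cls, compress):
    original = layer_cls.update
    def patched(self, k, v):
        return original(self, compress(k), compress(v))
    layer_cls.update = patched
    try:
        yield
    finally:
        layer_cls.update = original  # undo even if generate() raises

# inside the scope, writes are compressed (rounding mimics quantization)
with evict_scope(ToyCacheLayer, lambda t: [round(x, 1) for x in t]):
    k, _ = ToyCacheLayer().update([0.26, 0.44], [1.0, 2.0])
# outside the scope, the class behaves exactly as before
```

The try/finally is the important part: the patch must be undone even when generation fails, or every later forward pass stays silently compressed.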
We also added physical KV truncation (actually remove evicted tokens from tensors, not just mask them) and asymmetric K/V as options.
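Physical truncation is easy to illustrate. In this sketch (an illustration, not the library's implementation), only the top-scoring token positions survive, so the cache itself shrinks; real code would index-select along the sequence axis of the [batch, heads, seq, head_dim] tensors:

```python
def truncate_kv(keys, values, scores, keep):
    """Drop evicted tokens from the cache entirely rather than masking them.
    keys/values: one entry per cached position; scores: attention-based
    importance per position (toy stand-ins for the real tensors)."""
    # indices of the `keep` highest-scoring positions, original order kept
    top = sorted(sorted(range(len(scores)), key=scores.__getitem__)[-keep:])
    return [keys[i] for i in top], [values[i] for i in top]

# tokens 1 and 3 score highest, so only they survive; memory shrinks 2x
k, v = truncate_kv(["k0", "k1", "k2", "k3"], ["v0", "v1", "v2", "v3"],
                   [0.1, 0.9, 0.3, 0.7], keep=2)
```

Keeping the surviving positions in their original order matters: position-dependent attention downstream assumes the sequence axis stays sorted.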
Code: https://github.com/jagmarques/nexusquant
Would welcome any feedback, especially on the DynamicCache hook pattern.