
NexusQuant: training-free KV cache compression (10-33x) via DynamicCache hooks #45304

@jagmarques

Description

Hi,

Sharing a training-free KV cache compression approach we've been developing that hooks into DynamicCache. Might be useful for folks running into memory limits with long contexts.

NexusQuant compresses the KV cache by 10-33x by combining attention-based token eviction with E8 lattice vector quantization. It monkey-patches DynamicLayer.update to intercept KV writes — same pattern as kvpress.
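The hook pattern can be sketched roughly like this. Note this is a minimal, dependency-free illustration: `CacheLayer`, `compress`, and `nexusquant_evict_sketch` are stand-in names, and the real code patches `DynamicLayer.update` inside transformers rather than a toy class.

```python
from contextlib import contextmanager

class CacheLayer:
    # Stand-in for transformers' DynamicLayer: holds one layer's KV tensors
    # and returns them from update(), as the real cache layer does.
    def update(self, key_states, value_states):
        self.keys, self.values = key_states, value_states
        return self.keys, self.values

def compress(states):
    # Placeholder for eviction + E8 lattice quantization; here we just
    # drop every other token position to show the interception point.
    return states[::2]

@contextmanager
def nexusquant_evict_sketch():
    # Monkey-patch the cache write path for the duration of the block.
    original = CacheLayer.update
    def patched(self, key_states, value_states):
        # Intercept the KV write and compress before it lands in the cache.
        return original(self, compress(key_states), compress(value_states))
    CacheLayer.update = patched
    try:
        yield
    finally:
        CacheLayer.update = original  # always restore the unpatched method
```

Restoring the original method in a `finally` block is what makes the context manager safe to nest inside normal generation code: an exception mid-generate still leaves the cache class unpatched.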

Some recent GPU results across 3 models:

  • Mistral-7B: 9x compression, essentially zero PPL loss with our real attention scorer
  • Phi-3-mini (head_dim=96): works via zero-padding, 9x at +0.59%
  • Qwen2.5-7B: needs the first and last 2 layers kept at FP16 (boundary protection); without it, quality collapses catastrophically

One thing that surprised us: 3-bit keys + 2-bit values dramatically outperforms symmetric 2-bit on all models. The softmax amplifies key quantization noise across all positions, so keys deserve more precision. This is consistent with what the TurboQuant+ project found on Apple Silicon.
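The intuition is easy to verify numerically even with plain uniform scalar quantization (the sketch below; NexusQuant itself uses E8 lattice vector quantization, which is more involved): one extra bit roughly halves the worst-case rounding error, and for keys that error is multiplied through the softmax at every position.

```python
def quantize(x, bits):
    # Uniform per-tensor quantization to `bits` bits: map to the nearest of
    # 2**bits evenly spaced levels spanning [min(x), max(x)], then dequantize.
    lo, hi = min(x), max(x)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / scale) * scale for v in x]

x = [0.1 * i - 0.8 for i in range(16)]  # toy key/value vector
err2 = max(abs(a - b) for a, b in zip(x, quantize(x, 2)))
err3 = max(abs(a - b) for a, b in zip(x, quantize(x, 3)))
# err3 comes out well under err2: the 3-bit grid is more than twice as fine
```

So at a fixed bit budget, spending the extra bit on keys (3-bit K / 2-bit V) trades a small value-side error increase for a large key-side error reduction that the softmax would otherwise amplify.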

The API is a context manager:

from nexusquant.integrations.huggingface import nexusquant_evict

with nexusquant_evict(model, quality="high"):
    output = model.generate(input_ids, max_new_tokens=200)

We also added physical KV truncation (actually removing evicted tokens from the tensors rather than just masking them) and asymmetric K/V bit-widths as options.
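Physical truncation in that sense could look like the minimal sketch below. This uses pure-Python nested lists as a stand-in for the cache tensors; a real implementation would rebuild the per-layer tensors with something like `torch.index_select` so the shrunk allocation can actually be freed.

```python
def physically_truncate(cache, keep_indices):
    # Rebuild the per-layer KV sequence keeping only retained positions.
    # Unlike masking, the evicted entries are actually gone afterwards,
    # so sequence length (and memory) shrinks instead of staying constant.
    keep = sorted(keep_indices)
    return [cache[i] for i in keep]

kv = [[float(t)] * 4 for t in range(10)]       # 10 tokens, head_dim 4 (toy)
kept = physically_truncate(kv, {0, 1, 5, 9})   # e.g. sink + recent tokens
# kept holds 4 rows instead of 10; positions are preserved in order
```

One caveat with physical truncation is that position indices into the cache change after compaction, so any bookkeeping (e.g. attention-score accumulators per token) has to be remapped with the same index list.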

Code: https://github.com/jagmarques/nexusquant

Would welcome any feedback, especially on the DynamicCache hook pattern.
