Abstract — Attention head pruning removes individual attention heads from transformer layers to reduce model size and compute. Existing work evaluates this by accuracy retained. But when you remove a component from a circuit, the remaining components don't always draw less power — sometimes they compensate, redistributing load and maintaining or even increasing total energy consumption. This paper uses Energy Per Intelligence (EPI) and the epi-meter power trace to answer a question that accuracy-only evaluation cannot: when you remove attention heads, does the energy actually drop — or do the remaining heads compensate? We correlate per-layer PyTorch timing hooks with epi-meter power traces to detect compensation patterns invisible to accuracy metrics, and identify the pruning threshold where energy savings become real.
- Introduction
- The Compensation Hypothesis
- Research Questions
- Experimental Design
- Surgery Matrix
- Methodology
- Power Trace Correlation
- Results
- Analysis
- Discussion
- Comparison to Prior Work
- Reproducibility
- Future Work
- Citation
- References
- License
An attention head is a parallel computation path. In a transformer with 32 heads per layer, each head independently computes queries, keys, and values, attends to the input, and produces a weighted representation. The outputs of all heads are concatenated and projected. Remove a head, and one computation path disappears.
In electrical terms, attention heads are parallel loads. In a parallel circuit, removing a load reduces total current — if nothing else changes. But in a transformer, something else might change. The remaining heads see different inputs (the concatenated output has fewer channels, the projection matrix adapts during any fine-tuning, and the softmax distributions shift). The remaining heads may compensate — processing more information per head to maintain output quality.
If compensation occurs, the energy savings from head removal are less than expected. In the worst case, remaining heads draw more power per head, and total energy barely changes despite having fewer heads. Accuracy-only evaluation cannot detect this — the model's accuracy may hold steady while the energy stays the same or even increases.
An electrician knows that removing a load from a parallel circuit doesn't always reduce total power. This paper applies that insight to transformers.
BEFORE: 32 heads (parallel loads)
┌──┬──┬──┬──┬──┬──┬──┬──┐
│H0│H1│H2│H3│ ... │H31│
└──┴──┴──┴──┴──┴──┴──┴──┘
Total power: P_total = P₀ + P₁ + ... + P₃₁
AFTER: 28 heads (4 removed)
┌──┬──┬──┬──┬──┬──┬──┬──┐
│H0│H1│ │H3│ ... │ │H31│
└──┴──┴──┴──┴──┴──┴──┴──┘
Expected: P_total' < P_total (fewer loads)
Possible: P_total' ≈ P_total (compensation!)
| Outcome | What Happens | EPI Effect |
|---|---|---|
| A. Energy drops proportionally | Removing N% of heads saves ~N% of energy | EPI improves (best case) |
| B. Partial compensation | Remaining heads work harder; energy drops less than expected | EPI improves, but less than predicted |
| C. Full compensation | Remaining heads fully compensate; total energy unchanged | EPI worsens (accuracy drops but energy doesn't) |
Outcome C is the danger zone. Accuracy-only pruning recommends removing heads because perplexity holds. But if energy doesn't drop, EPI gets worse — you lost intelligence without saving joules.
The epi-meter captures total system power. PyTorch forward hooks capture per-layer execution time. By correlating:
- Per-layer timing (which layers take longer after pruning)
- Total power trace (does wattage actually drop)
- Tokens per second (does throughput change)
We can identify whether remaining heads compensate or whether the energy savings are real.
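As a concrete sketch of how the power-trace signal turns into the quantities above, the following computes total joules, joules per token, and throughput from a raw trace. The sample format and function names are illustrative assumptions, not the epi-meter's actual API.

```python
# Sketch: deriving J/token and tokens/s from a power trace (illustrative;
# sample format and names are assumptions, not the epi-meter API).

def joules_from_trace(samples, tokens_generated):
    """samples: list of (timestamp_s, watts) pairs from the power meter.
    Energy is the time integral of power, approximated trapezoidally."""
    energy_j = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        energy_j += 0.5 * (p0 + p1) * (t1 - t0)
    duration_s = samples[-1][0] - samples[0][0]
    return {
        "energy_j": energy_j,
        "j_per_token": energy_j / tokens_generated,
        "tokens_per_s": tokens_generated / duration_s,
    }

# Example: 48 W steady for 10 s while generating 120 tokens
# -> 480 J total, 4 J/token, 12 tokens/s
trace = [(t * 0.1, 48.0) for t in range(101)]  # 100 ms sample cadence
print(joules_from_trace(trace, tokens_generated=120))
```

Note the last row of the pattern table below this section: energy can fall via shorter duration at constant wattage, which is why joules (the integral), not watts, is the quantity that matters.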
| # | Question |
|---|---|
| RQ1 | Does removing attention heads produce proportional, partial, or negligible energy savings? |
| RQ2 | Can compensation be detected in the epi-meter power trace (power holds steady while heads are removed)? |
| RQ3 | Do specific layers show more compensation than others? |
| RQ4 | Is there a pruning threshold below which compensation dominates (energy savings disappear)? |
| RQ5 | Does the EPI-optimal head pruning depth differ from the accuracy-optimal depth? |
- Baseline: Full model (all heads) — measure EPI, power trace, per-layer timing
- Progressive pruning: Remove 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50% of heads
- Layer targeting: All layers, early only, late only
- For each configuration: Capture power trace AND per-layer timing hooks
- Correlate: Does per-layer timing redistribution correspond to power trace patterns?
| Component | Specification |
|---|---|
| Surgery platform | DGX Spark (GB10, 128GB) |
| Deployment target | Pi 5 cluster (4x 16GB, distributed-llama) |
| Power measurement | epi-meter board (4x ATM90E26, CT clamps, AC side) |
| Per-layer timing | PyTorch forward hooks (CPU timestamps per layer) |
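A minimal sketch of the per-layer timing instrumentation, using standard PyTorch forward hooks with CPU timestamps as listed in the table. The paper's actual hook code may differ; the toy model here is purely illustrative.

```python
# Sketch: per-layer wall-clock timing via PyTorch forward hooks
# (illustrative; CPU timestamps as in the hardware table above.
# GPU timing would need synchronization before each timestamp).
import time
import torch
import torch.nn as nn

def attach_timing_hooks(model, records):
    """Append a duration (seconds) to records[name] per forward pass."""
    handles = []
    for name, module in model.named_modules():
        if len(list(module.children())) > 0:
            continue  # time leaf modules only

        def pre_hook(mod, inputs):
            mod._t0 = time.perf_counter()

        def post_hook(mod, inputs, output, _name=name):
            records.setdefault(_name, []).append(time.perf_counter() - mod._t0)

        handles.append(module.register_forward_pre_hook(pre_hook))
        handles.append(module.register_forward_hook(post_hook))
    return handles  # call h.remove() on each to detach

# Toy stand-in for a transformer stack:
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))
records = {}
handles = attach_timing_hooks(model, records)
model(torch.randn(4, 64))
for h in handles:
    h.remove()
print({k: f"{sum(v) * 1e3:.3f} ms" for k, v in records.items()})
```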
| Model | Parameters | Heads/Layer | Total Heads | KV Heads |
|---|---|---|---|---|
| Llama-3.1-8B | 8B | 32 | 1024 | 8 (GQA) |
| Mistral-7B-v0.3 | 7.2B | 32 | 1024 | 8 (GQA) |
Both use Grouped Query Attention (GQA) — pruning interacts with KV head sharing, adding another dimension to the analysis.
| Prune % | Heads Removed (per 32-head layer) | Remaining |
|---|---|---|
| 0% (baseline) | 0 | 32 |
| 5% | 1–2 | 30–31 |
| 10% | 3 | 29 |
| 15% | 5 | 27 |
| 20% | 6 | 26 |
| 25% | 8 | 24 |
| 30% | 10 | 22 |
| 40% | 13 | 19 |
| 50% | 16 | 16 |
| Target | Layers Affected |
|---|---|
| All | Uniform pruning across all 32 layers |
| Early (0–25%) | Layers 0–7 only |
| Late (75–100%) | Layers 24–31 only |
Heads ranked by importance score (gradient-weighted activation magnitude). Least-important heads removed first. This follows established practice to enable direct comparison with prior work.
8 pruning depths × 3 layer targets + 1 baseline = 25 configurations per model.
1. Rank all attention heads by importance score
2. Remove heads according to config (% and layer target)
3. Validate modified model (coherence check)
4. Quantize to Q4_K_M GGUF
5. Deploy to Pi cluster
6. Wait 60s thermal stabilization
7. Run benchmark suite with PyTorch timing hooks enabled:
a. Record per-layer forward pass duration
b. epi-meter records power simultaneously
8. Calculate EPI (epi-bench)
9. Correlate timing and power data
10. Repeat 3x, median reported
# Gradient-weighted activation magnitude
# For each head h in layer l:
importance[l][h] = mean(|activation[l][h]| * |gradient[l][h]|)
# Over a calibration dataset (1000 tokens)
# Lowest importance = first to be pruned

This is the unique contribution of this paper — correlating internal model timing with external power measurement to detect compensation.
PyTorch Hooks (internal) epi-meter (external)
───────────────────── ────────────────────
Layer 0: 12.3 ms t=0.000: 48.2 W
Layer 1: 11.8 ms t=0.100: 48.5 W
Layer 2: 13.1 ms ← slower? t=0.200: 49.1 W ← higher?
Layer 3: 11.5 ms t=0.300: 48.3 W
... ...
Correlation: Does Layer 2 taking longer correspond to
higher power at that timestamp?
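A sketch of the correlation step shown above: map each layer's execution window onto the meter's samples and take the mean wattage observed inside it. Names and data formats are illustrative; a real pipeline also needs clock synchronization between the host recording the hooks and the meter's timestamps.

```python
# Sketch: aligning per-layer timing windows with epi-meter samples
# (illustrative; assumes host and meter clocks are synchronized).

def mean_power_per_layer(layer_windows, power_samples):
    """layer_windows: {name: (start_s, end_s)}; power_samples: [(t_s, watts)].
    Returns the mean wattage observed while each layer was executing."""
    out = {}
    for name, (start, end) in layer_windows.items():
        in_window = [w for t, w in power_samples if start <= t <= end]
        out[name] = sum(in_window) / len(in_window) if in_window else None
    return out

windows = {"layer_1": (0.05, 0.15), "layer_2": (0.15, 0.30)}
samples = [(0.00, 48.2), (0.10, 48.5), (0.20, 49.1), (0.30, 48.3)]
print(mean_power_per_layer(windows, samples))
# A layer in the pruned model that runs both longer AND at higher mean
# wattage than its baseline counterpart is the compensation signature.
```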
| Pattern | Timing Signal | Power Signal | Interpretation |
|---|---|---|---|
| No compensation | Pruned layers faster, others unchanged | Total wattage drops proportionally | Energy savings are real |
| Partial compensation | Pruned layers faster, neighboring layers slower | Total wattage drops less than expected | Some load redistribution |
| Full compensation | Unpruned layers significantly slower | Total wattage unchanged or higher | Remaining heads absorbing removed load |
| Inference speedup | All layers faster (fewer ops total) | Watts same but duration shorter → fewer joules | Energy saved via speed, not power draw |
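The first three patterns in the table could be classified mechanically once the deltas are measured. The thresholds below are illustrative placeholders, not calibrated values from this study.

```python
# Sketch: classifying compensation patterns from measured deltas
# (thresholds are illustrative placeholders, not calibrated).

def classify_compensation(expected_energy_drop, measured_energy_drop,
                          unpruned_layer_slowdown, tol=0.05):
    """All arguments are fractions in [0, 1]; slowdown is the relative
    time increase of layers that kept all of their heads."""
    if measured_energy_drop >= expected_energy_drop - tol:
        return "no compensation"       # savings are real
    if measured_energy_drop <= tol and unpruned_layer_slowdown > tol:
        return "full compensation"     # remaining heads absorbed the load
    return "partial compensation"      # some load redistribution

# 25% of heads removed, only 8% energy saved, unpruned layers 12% slower:
print(classify_compensation(0.25, 0.08, 0.12))  # -> partial compensation
```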
Status: Data collection pending. May 2026.
| Prune % | J/Token | Accuracy | EPI | vs Baseline | Compensation? |
|---|---|---|---|---|---|
| 0% | — | — | — | — | — |
| 5% | — | — | — | —% | — |
| 10% | — | — | — | —% | — |
| 15% | — | — | — | —% | — |
| 20% | — | — | — | —% | — |
| 25% | — | — | — | —% | — |
| 30% | — | — | — | —% | — |
| 40% | — | — | — | —% | — |
| 50% | — | — | — | —% | — |
| Prune % | Heads Removed | Expected Energy Δ | Measured Energy Δ | Gap (Compensation) |
|---|---|---|---|---|
| 5% | ~5% | ~-5% | — | — |
| 10% | ~10% | ~-10% | — | — |
| 25% | ~25% | ~-25% | — | — |
| 50% | ~50% | ~-50% | — | — |
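Once measurements land, the gap column follows from simple arithmetic. This sketch assumes the naive linear expectation stated in the table (removing a fraction f of heads saves ~f of the energy), which is itself an upper bound since MLP and shared KV compute are untouched by query-head pruning.

```python
# Sketch: computing the "Gap (Compensation)" column, assuming the naive
# expectation that pruning fraction f of heads saves ~f of the energy.

def compensation_gap(prune_frac, baseline_j_per_tok, measured_j_per_tok):
    expected = baseline_j_per_tok * (1.0 - prune_frac)
    measured_drop = 1.0 - measured_j_per_tok / baseline_j_per_tok
    return {
        "expected_j_per_tok": expected,
        "measured_drop": measured_drop,
        "gap": prune_frac - measured_drop,  # > 0 suggests compensation
    }

# Hypothetical numbers: 25% pruned, 4.0 J/tok baseline, 3.6 J/tok measured
# -> only a 10% drop, a 15-point gap from compensation or fixed overhead
print(compensation_gap(0.25, 4.0, 3.6))
```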
Pending measurement data.
Planned:
- Compensation detection — Compare expected vs. measured energy savings at each pruning depth
- Per-layer timing heatmap — Before vs. after pruning, which layers slow down?
- Power trace overlay — Baseline power trace vs. pruned power trace, same time axis
- EPI curve — Is there a "compensation threshold" where EPI stops improving?
- GQA interaction — Do KV head sharing effects amplify or dampen compensation?
Pending measurement data.
Expected topics:
- The parallel circuit insight: Removing loads from a parallel circuit is textbook EE. Why hasn't anyone applied this to transformers before?
- Softmax redistribution: When heads are removed, remaining heads' softmax distributions change. This is the mechanism of compensation.
- Practical guidance: "Prune up to X% for real energy savings. Beyond X%, you're just losing accuracy."
- GQA complication: Grouped Query Attention means KV heads are shared — pruning a query head doesn't remove the shared KV computation. This limits energy savings.
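The GQA complication can be made concrete with a little arithmetic. The head counts match the models table above (32 query heads, 8 KV heads, so 4 query heads per KV group); the pruning patterns are hypothetical.

```python
# Sketch: why GQA dampens energy savings. A KV head's K/V projections
# still run unless ALL query heads in its group are pruned.
# Illustrative arithmetic only; head counts match the models table.

def surviving_kv_heads(n_q_heads, n_kv_heads, pruned_q_heads):
    group = n_q_heads // n_kv_heads  # query heads per KV head (4 here)
    alive = set()
    for q in range(n_q_heads):
        if q not in pruned_q_heads:
            alive.add(q // group)    # this query head's KV group
    return len(alive)

# Prune 8 of 32 query heads (25%), scattered one per group:
print(surviving_kv_heads(32, 8, pruned_q_heads=set(range(0, 32, 4))))
# -> all 8 KV heads survive: no KV-side savings despite 25% pruning.
# Pruning a whole group instead (query heads 0-3) retires one KV head:
print(surviving_kv_heads(32, 8, pruned_q_heads=set(range(4))))  # -> 7
```

This is why importance-ranked pruning, which scatters removals across groups, may save less energy per head than group-aligned pruning at the same depth.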
| Paper | Measures | Method | Hardware | Energy? | Compensation? |
|---|---|---|---|---|---|
| Voita et al. (2019) | Accuracy, head importance | Differentiable masking | GPU | No | No |
| Michel et al. (2019) | Accuracy per head removed | Ablation study | GPU | No | No |
| Shim et al. (2022) | Accuracy, efficiency | Structured pruning | GPU | No | No |
| This paper | EPI, power trace, timing | Pruning + dual measurement | Pi 5 cluster | Yes | Yes |
Unique contribution: Dual-stream correlation (internal timing × external power) to detect compensation — something no prior work has attempted.
| Component | Repository |
|---|---|
| EPI Framework | energy-per-intelligence |
| Measurement Board | epi-meter |
| Calculation Tooling | epi-bench |
| Raw Data | data/ in this repo |
| Surgery Code | code/surgery/ |
| Analysis Code | code/analysis/ |
| Direction | Description |
|---|---|
| Head regrowth | Fine-tune after pruning — does compensation reverse when the model adapts? |
| KV head pruning | Prune KV heads (not just query heads) for deeper energy savings in GQA |
| Combined surgery | Head pruning + expert pruning + mixed quantization on the same model |
| Compensation prediction | Can the DGX predict which layers will compensate before deployment? |
| Dynamic head selection | Activate different head subsets per token — runtime energy optimization |
@article{abner2026attentionheadepi,
title = {Removing Attention Heads: Does the Energy Actually Drop
or Do Remaining Heads Compensate?},
author = {Abner, Francisco},
year = {2026},
url = {https://github.com/Franzabner/attention-head-surgery-epi},
note = {YOSO-YAi LLC. Data collection in progress.}
}

| # | Reference |
|---|---|
| [1] | Abner, F. "Energy Per Intelligence." YOSO-YAi LLC, 2026. GitHub |
| [2] | Voita et al. "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting" (2019) |
| [3] | Michel et al. "Are Sixteen Heads Really Better than One?" (2019) |
| [4] | Shim et al. "Towards the Efficient Transformer: A Survey" (2022) |
| Content | License |
|---|---|
| Paper | CC BY 4.0 |
| Code | MIT |
| Data | CC BY 4.0 |