YOSO-YAi

Removing Attention Heads: Does the Energy Actually Drop or Do Remaining Heads Compensate?

An Electrician's Approach to Transformer Surgery

Francisco Abner — Electrical Engineer, CEO & Founder, YOSO-YAi LLC, New Albany, Ohio



Abstract — Attention head pruning removes individual attention heads from transformer layers to reduce model size and compute. Existing work evaluates this by accuracy retained. But when you remove a component from a circuit, the remaining components don't always draw less power — sometimes they compensate, redistributing load and maintaining or even increasing total energy consumption. This paper uses Energy Per Intelligence (EPI) and the epi-meter power trace to answer a question that accuracy-only evaluation cannot: when you remove attention heads, does the energy actually drop — or do the remaining heads compensate? We correlate per-layer PyTorch timing hooks with epi-meter power traces to detect compensation patterns invisible to accuracy metrics, and identify the pruning threshold where energy savings become real.


Table of Contents

  1. Introduction
  2. The Compensation Hypothesis
  3. Research Questions
  4. Experimental Design
  5. Surgery Matrix
  6. Methodology
  7. Power Trace Correlation
  8. Results
  9. Analysis
  10. Discussion
  11. Comparison to Prior Work
  12. Reproducibility
  13. Future Work
  14. Citation
  15. References
  16. License

1. Introduction

An attention head is a parallel computation path. In a transformer with 32 heads per layer, each head independently computes queries, keys, and values, attends to the input, and produces a weighted representation. The outputs of all heads are concatenated and projected. Remove a head, and one computation path disappears.

In electrical terms, attention heads are parallel loads. In a parallel circuit, removing a load reduces total current — if nothing else changes. But in a transformer, something else might change. The remaining heads see different inputs (the concatenated output has fewer channels, the projection matrix adapts during any fine-tuning, and the softmax distributions shift). The remaining heads may compensate — processing more information per head to maintain output quality.

If compensation occurs, the energy savings from head removal are less than expected. In the worst case, remaining heads draw more power per head, and total energy barely changes despite having fewer heads. Accuracy-only evaluation cannot detect this — the model's accuracy may hold steady while the energy stays the same or even increases.

An electrician knows that removing a load from a parallel circuit doesn't always reduce total power. This paper applies that insight to transformers.


2. The Compensation Hypothesis

The Circuit Analogy

  BEFORE: 32 heads (parallel loads)
  ┌──┬──┬──┬──┬──┬──┬──┬──┐
  │H0│H1│H2│H3│ ... │H31│
  └──┴──┴──┴──┴──┴──┴──┴──┘
  Total power: P_total = P₀ + P₁ + ... + P₃₁

  AFTER: 28 heads (4 removed)
  ┌──┬──┬──┬──┬──┬──┬──┬──┐
  │H0│H1│  │H3│ ... │  │H31│
  └──┴──┴──┴──┴──┴──┴──┴──┘
  Expected:  P_total' < P_total     (fewer loads)
  Possible:  P_total' ≈ P_total     (compensation!)

Three Possible Outcomes

| Outcome | What Happens | EPI Effect |
|---|---|---|
| A. Energy drops proportionally | Removing N% of heads saves ~N% of energy | EPI improves (best case) |
| B. Partial compensation | Remaining heads work harder; energy drops less than expected | EPI improves, but less than predicted |
| C. Full compensation | Remaining heads fully compensate; total energy unchanged | EPI worsens (accuracy drops but energy doesn't) |

Outcome C is the danger zone. Accuracy-only pruning recommends removing heads because perplexity holds. But if energy doesn't drop, EPI gets worse — you lost intelligence without saving joules.

How to Detect Compensation

The epi-meter captures total system power. PyTorch forward hooks capture per-layer execution time. By correlating:

  • Per-layer timing (which layers take longer after pruning)
  • Total power trace (does wattage actually drop)
  • Tokens per second (does throughput change)

We can identify whether remaining heads compensate or whether the energy savings are real.
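The correlation step reduces to simple arithmetic once both streams are recorded: integrate the power trace into joules, then compare measured savings against the proportional expectation. A minimal sketch, assuming a fixed power-sampling interval and aligned clocks; the function names and illustrative numbers below are assumptions, not taken from the epi-bench tooling.

```python
def trace_energy(power_samples_w, interval_s):
    """Integrate a power trace (watts at a fixed sampling interval) into joules."""
    return sum(power_samples_w) * interval_s

def compensation_gap(baseline_j, pruned_j, frac_heads_removed):
    """Expected vs. measured fractional energy savings; a positive gap
    means the remaining heads are compensating."""
    expected = frac_heads_removed            # proportional expectation
    measured = 1.0 - pruned_j / baseline_j
    return expected - measured

# Illustrative numbers: 25% of heads removed, but only 10% energy saved.
base = trace_energy([50.0] * 600, 0.1)      # 60 s at 50 W -> 3000 J
pruned = trace_energy([45.0] * 600, 0.1)    # 60 s at 45 W -> 2700 J
gap = compensation_gap(base, pruned, 0.25)  # 0.25 - 0.10 = 0.15 gap
```

A gap near zero corresponds to outcome A above; a gap approaching the pruning fraction corresponds to outcome C.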


3. Research Questions

| # | Question |
|---|---|
| RQ1 | Does removing attention heads produce proportional, partial, or negligible energy savings? |
| RQ2 | Can compensation be detected in the epi-meter power trace (power holds steady while heads are removed)? |
| RQ3 | Do specific layers show more compensation than others? |
| RQ4 | Is there a pruning threshold below which compensation dominates (energy savings disappear)? |
| RQ5 | Does the EPI-optimal head pruning depth differ from the accuracy-optimal depth? |

4. Experimental Design

Overview

  1. Baseline: Full model (all heads) — measure EPI, power trace, per-layer timing
  2. Progressive pruning: Remove 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50% of heads
  3. Layer targeting: All layers, early only, late only
  4. For each configuration: Capture power trace AND per-layer timing hooks
  5. Correlate: Does per-layer timing redistribution correspond to power trace patterns?

Hardware

| Component | Specification |
|---|---|
| Surgery platform | DGX Spark (GB10, 128GB) |
| Deployment target | Pi 5 cluster (4x 16GB, distributed-llama) |
| Power measurement | epi-meter board (4x ATM90E26, CT clamps, AC side) |
| Per-layer timing | PyTorch forward hooks (CPU timestamps per layer) |
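The per-layer timing hooks can be sketched as below, using a small `nn.Sequential` as a stand-in for a transformer stack; in the real pipeline each decoder layer would get a pair of hooks. All function and variable names here are illustrative.

```python
import time
import torch
import torch.nn as nn

def attach_timing_hooks(named_layers):
    """Attach paired pre/post hooks that record each layer's
    forward-pass wall-clock duration in seconds."""
    timings = {name: [] for name, _ in named_layers}
    for name, module in named_layers:
        def pre(mod, inp, _n=name):
            timings[_n].append(-time.perf_counter())   # start marker
        def post(mod, inp, out, _n=name):
            timings[_n][-1] += time.perf_counter()     # now a duration
        module.register_forward_pre_hook(pre)
        module.register_forward_hook(post)
    return timings

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
timings = attach_timing_hooks([(f"layer{i}", m) for i, m in enumerate(model)])
with torch.no_grad():
    model(torch.randn(8, 64))
# timings now holds one nonnegative duration per layer
```

Because the hooks record CPU timestamps, each recorded duration can be matched against the epi-meter's timestamped power samples in post-processing.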

Target Models

| Model | Parameters | Heads/Layer | Total Heads | KV Heads |
|---|---|---|---|---|
| Llama-3.1-8B | 8B | 32 | 1024 | 8 (GQA) |
| Mistral-7B-v0.3 | 7.2B | 32 | 1024 | 8 (GQA) |

Both use Grouped Query Attention (GQA) — pruning interacts with KV head sharing, adding another dimension to the analysis.


5. Surgery Matrix

Pruning Depths

| Prune % | Heads Removed (per 32-head layer) | Remaining |
|---|---|---|
| 0% (baseline) | 0 | 32 |
| 5% | 1–2 | 30–31 |
| 10% | 3 | 29 |
| 15% | 5 | 27 |
| 20% | 6 | 26 |
| 25% | 8 | 24 |
| 30% | 10 | 22 |
| 40% | 13 | 19 |
| 50% | 16 | 16 |

Layer Targeting

| Target | Layers Affected |
|---|---|
| All | Uniform pruning across all 32 layers |
| Early (0–25%) | Layers 0–7 only |
| Late (75–100%) | Layers 24–31 only |

Head Selection

Heads ranked by importance score (gradient-weighted activation magnitude). Least-important heads removed first. This follows established practice to enable direct comparison with prior work.

Total Matrix

8 pruning depths × 3 layer targets + 1 baseline = 25 configurations per model.
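The matrix is small enough to enumerate directly. A sketch, where the rounding rule for the heads-removed column is an assumption inferred from the table above, not taken from the surgery code:

```python
DEPTHS = [5, 10, 15, 20, 25, 30, 40, 50]   # percent of heads pruned
TARGETS = ["all", "early", "late"]          # layer-targeting strategies

def heads_removed(pct, heads_per_layer=32):
    """Heads removed per layer at a given pruning percentage."""
    return round(pct / 100 * heads_per_layer)

# 8 depths x 3 targets + 1 baseline = 25 configurations per model
configs = [("baseline", 0)] + [(t, d) for d in DEPTHS for t in TARGETS]
```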


6. Methodology

Per-Configuration Pipeline

1. Rank all attention heads by importance score
2. Remove heads according to config (% and layer target)
3. Validate modified model (coherence check)
4. Quantize to Q4_K_M GGUF
5. Deploy to Pi cluster
6. Wait 60s thermal stabilization
7. Run benchmark suite with PyTorch timing hooks enabled:
   a. Record per-layer forward pass duration
   b. epi-meter records power simultaneously
8. Calculate EPI (epi-bench)
9. Correlate timing and power data
10. Repeat 3x, median reported
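Step 10's median-of-three reporting can be sketched as follows; `run_benchmark` is a hypothetical stand-in for the full epi-bench invocation.

```python
from statistics import median

def measure_config(run_benchmark, runs=3):
    """Run the full benchmark several times and report the median EPI,
    damping one-off thermal or scheduling outliers."""
    return median(run_benchmark() for _ in range(runs))

# Illustrative: three noisy EPI readings collapse to the middle value.
readings = iter([1.12, 1.05, 1.09])
epi = measure_config(lambda: next(readings))  # -> 1.09
```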

Importance Scoring

```python
# Gradient-weighted activation magnitude.
# For each head h in layer l, over a calibration dataset (~1000 tokens):
#   importance[l][h] = mean(|activation[l][h]| * |gradient[l][h]|)
# Lowest importance = first to be pruned.
```
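A runnable sketch of the scoring rule, using plain lists in place of the per-head activation and gradient tensors that would come from forward/backward hooks over the calibration set; all names are illustrative.

```python
def head_importance(activations, gradients):
    """Mean |activation| * |gradient| over calibration tokens for one head."""
    assert len(activations) == len(gradients)
    return sum(abs(a) * abs(g) for a, g in zip(activations, gradients)) / len(activations)

def prune_order(scores):
    """Head indices sorted by importance, least important first."""
    return sorted(range(len(scores)), key=scores.__getitem__)

scores = [head_importance([0.5, -0.2], [0.1, 0.3]),   # head 0: (0.05 + 0.06) / 2
          head_importance([2.0, 1.5], [0.4, 0.2])]    # head 1: (0.80 + 0.30) / 2
order = prune_order(scores)  # -> [0, 1]: head 0 is pruned first
```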

7. Power Trace Correlation

This is the unique contribution of this paper — correlating internal model timing with external power measurement to detect compensation.

Dual Data Streams

  PyTorch Hooks (internal)          epi-meter (external)
  ─────────────────────             ────────────────────
  Layer 0:  12.3 ms                 t=0.000: 48.2 W
  Layer 1:  11.8 ms                 t=0.100: 48.5 W
  Layer 2:  13.1 ms  ← slower?     t=0.200: 49.1 W  ← higher?
  Layer 3:  11.5 ms                 t=0.300: 48.3 W
  ...                               ...

  Correlation: Does Layer 2 taking longer correspond to
               higher power at that timestamp?

Compensation Signatures

| Pattern | Timing Signal | Power Signal | Interpretation |
|---|---|---|---|
| No compensation | Pruned layers faster, others unchanged | Total watts drops proportionally | Energy savings are real |
| Partial compensation | Pruned layers faster, neighboring layers slower | Total watts drops less than expected | Some load redistribution |
| Full compensation | Unpruned layers significantly slower | Total watts unchanged or higher | Remaining heads absorbing removed load |
| Inference speedup | All layers faster (fewer ops total) | Watts same but duration shorter → fewer joules | Energy saved via speed, not power draw |
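The signature table maps onto a simple decision rule over expected and measured fractional savings. A sketch; the 5-point tolerance is an illustrative threshold, not a calibrated value from this study.

```python
def classify(expected_savings, measured_savings, tol=0.05):
    """Map expected vs. measured fractional energy savings to an outcome."""
    if measured_savings >= expected_savings - tol:
        return "A: proportional (savings are real)"
    if measured_savings > tol:
        return "B: partial compensation"
    return "C: full compensation"

# 25% of heads removed:
assert classify(0.25, 0.24).startswith("A")   # savings track pruning
assert classify(0.25, 0.12).startswith("B")   # some load redistribution
assert classify(0.25, 0.01).startswith("C")   # energy barely moved
```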

8. Results

Status: Data collection pending; results expected May 2026.

EPI vs Head Pruning Depth (All Layers)

| Prune % | J/Token | Accuracy | EPI vs Baseline | Compensation? |
|---|---|---|---|---|
| 0% (baseline) | — | — | — | — |
| 5% | — | — | — | — |
| 10% | — | — | — | — |
| 15% | — | — | — | — |
| 20% | — | — | — | — |
| 25% | — | — | — | — |
| 30% | — | — | — | — |
| 40% | — | — | — | — |
| 50% | — | — | — | — |

Energy Savings vs Expected

| Prune % | Heads Removed | Expected Energy Δ | Measured Energy Δ | Gap (Compensation) |
|---|---|---|---|---|
| 5% | ~5% | ~-5% | — | — |
| 10% | ~10% | ~-10% | — | — |
| 25% | ~25% | ~-25% | — | — |
| 50% | ~50% | ~-50% | — | — |

9. Analysis

Pending measurement data.

Planned:

  1. Compensation detection — Compare expected vs. measured energy savings at each pruning depth
  2. Per-layer timing heatmap — Before vs. after pruning, which layers slow down?
  3. Power trace overlay — Baseline power trace vs. pruned power trace, same time axis
  4. EPI curve — Is there a "compensation threshold" where EPI stops improving?
  5. GQA interaction — Do KV head sharing effects amplify or dampen compensation?

10. Discussion

Pending measurement data.

Expected topics:

  • The parallel circuit insight: Removing loads from a parallel circuit is textbook EE. Why hasn't anyone applied this to transformers before?
  • Softmax redistribution: When heads are removed, remaining heads' softmax distributions change. This is the mechanism of compensation.
  • Practical guidance: "Prune up to X% for real energy savings. Beyond X%, you're just losing accuracy."
  • GQA complication: Grouped Query Attention means KV heads are shared — pruning a query head doesn't remove the shared KV computation. This limits energy savings.
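The GQA complication is arithmetic: with 32 query heads sharing 8 KV heads, each KV head serves a group of 4 query heads, so a KV head's compute only disappears when its entire group is pruned. A sketch under that assumption (function name hypothetical):

```python
def kv_heads_still_active(pruned_q_heads, q_heads=32, kv_heads=8):
    """Count KV heads whose compute survives pruning: a KV head stays
    active while at least one query head in its group remains."""
    group = q_heads // kv_heads                       # 4 query heads per KV head
    surviving = set(range(q_heads)) - set(pruned_q_heads)
    return len({q // group for q in surviving})

# Pruning 4 scattered query heads keeps all 8 KV heads active...
assert kv_heads_still_active([0, 5, 10, 15]) == 8
# ...but pruning one whole group (heads 0-3) retires a KV head.
assert kv_heads_still_active([0, 1, 2, 3]) == 7
```

This is why importance-ranked pruning, which tends to scatter removals across groups, may save less KV-side energy than the head count suggests.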

11. Comparison to Prior Work

| Paper | Measures | Method | Hardware | Energy? | Compensation? |
|---|---|---|---|---|---|
| Voita et al. (2019) | Accuracy, head importance | Differentiable masking | GPU | No | No |
| Michel et al. (2019) | Accuracy per head removed | Ablation study | GPU | No | No |
| Sixteen Heads (2022) | Accuracy, efficiency | Structured pruning | GPU | No | No |
| This paper | EPI, power trace, timing | Pruning + dual measurement | Pi 5 cluster | Yes | Yes |

Unique contribution: Dual-stream correlation (internal timing × external power) to detect compensation — something no prior work has attempted.


12. Reproducibility

| Component | Repository |
|---|---|
| EPI Framework | energy-per-intelligence |
| Measurement Board | epi-meter |
| Calculation Tooling | epi-bench |
| Raw Data | data/ in this repo |
| Surgery Code | code/surgery/ |
| Analysis Code | code/analysis/ |

13. Future Work

| Direction | Description |
|---|---|
| Head regrowth | Fine-tune after pruning — does compensation reverse when the model adapts? |
| KV head pruning | Prune KV heads (not just query heads) for deeper energy savings in GQA |
| Combined surgery | Head pruning + expert pruning + mixed quantization on the same model |
| Compensation prediction | Can the DGX predict which layers will compensate before deployment? |
| Dynamic head selection | Activate different head subsets per token — runtime energy optimization |

14. Citation

```bibtex
@article{abner2026attentionheadepi,
  title   = {Removing Attention Heads: Does the Energy Actually Drop
             or Do Remaining Heads Compensate?},
  author  = {Abner, Francisco},
  year    = {2026},
  url     = {https://github.com/Franzabner/attention-head-surgery-epi},
  note    = {YOSO-YAi LLC. Data collection in progress.}
}
```

15. References

[1] Abner, F. "Energy Per Intelligence." YOSO-YAi LLC, 2026.
[2] Voita, E., Talbot, D., Moiseev, F., Sennrich, R., Titov, I. "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned." ACL 2019.
[3] Michel, P., Levy, O., Neubig, G. "Are Sixteen Heads Really Better than One?" NeurIPS 2019.
[4] Shim et al. "Towards the Efficient Transformer: A Survey." 2022.

16. License

| Content | License |
|---|---|
| Paper | CC BY 4.0 |
| Code | MIT |
| Data | CC BY 4.0 |

Removing a load from a parallel circuit doesn't always reduce total power.

An electrician knows this. Now the transformer community will too.


Francisco Abner — Electrical Engineer, CEO & Founder, YOSO-YAi LLC
