YOSO-YAi

Removing Attention Heads: Does the Energy Actually Drop or Do Remaining Heads Compensate?

An Electrician's Approach to Transformer Surgery

Francisco Abner — Electrical Engineer, CEO & Founder, YOSO-YAi LLC, New Albany, Ohio



Abstract — Attention head pruning removes individual attention heads from transformer layers to reduce model size and compute. Existing work evaluates this by accuracy retained. But when you remove a component from a circuit, the remaining components don't always draw less power — sometimes they compensate, redistributing load and maintaining or even increasing total energy consumption. This paper uses Energy Per Intelligence (EPI) and the epi-meter power trace to answer a question that accuracy-only evaluation cannot: when you remove attention heads, does the energy actually drop — or do the remaining heads compensate? We correlate per-layer PyTorch timing hooks with epi-meter power traces to detect compensation patterns invisible to accuracy metrics, and identify the pruning threshold where energy savings become real.


Table of Contents

  1. Introduction
  2. The Compensation Hypothesis
  3. Research Questions
  4. Experimental Design
  5. Surgery Matrix
  6. Methodology
  7. Power Trace Correlation
  8. Results
  9. Analysis
  10. Discussion
  11. Comparison to Prior Work
  12. Reproducibility
  13. Future Work
  14. Citation
  15. References
  16. License

1. Introduction

An attention head is a parallel computation path. In a transformer with 32 heads per layer, each head independently computes queries, keys, and values, attends to the input, and produces a weighted representation. The outputs of all heads are concatenated and projected. Remove a head, and one computation path disappears.

In electrical terms, attention heads are parallel loads. In a parallel circuit, removing a load reduces total current — if nothing else changes. But in a transformer, something else might change. The remaining heads see different inputs (the concatenated output has fewer channels, the projection matrix adapts during any fine-tuning, and the softmax distributions shift). The remaining heads may compensate — processing more information per head to maintain output quality.

If compensation occurs, the energy savings from head removal are less than expected. In the worst case, remaining heads draw more power per head, and total energy barely changes despite having fewer heads. Accuracy-only evaluation cannot detect this — the model's accuracy may hold steady while the energy stays the same or even increases.

An electrician knows that removing a load from a parallel circuit doesn't always reduce total power. This paper applies that insight to transformers.


2. The Compensation Hypothesis

The Circuit Analogy

  BEFORE: 32 heads (parallel loads)
  ┌──┬──┬──┬──┬──┬──┬──┬──┐
  │H0│H1│H2│H3│ ... │H31│
  └──┴──┴──┴──┴──┴──┴──┴──┘
  Total power: P_total = P₀ + P₁ + ... + P₃₁

  AFTER: 28 heads (4 removed)
  ┌──┬──┬──┬──┬──┬──┬──┬──┐
  │H0│H1│  │H3│ ... │  │H31│
  └──┴──┴──┴──┴──┴──┴──┴──┘
  Expected:  P_total' < P_total     (fewer loads)
  Possible:  P_total' ≈ P_total     (compensation!)

Three Possible Outcomes

| Outcome | What Happens | EPI Effect |
|---|---|---|
| A. Energy drops proportionally | Removing N% of heads saves ~N% of energy | EPI improves (best case) |
| B. Partial compensation | Remaining heads work harder; energy drops less than expected | EPI improves, but less than predicted |
| C. Full compensation | Remaining heads fully compensate; total energy unchanged | EPI worsens (accuracy drops but energy doesn't) |

Outcome C is the danger zone. Accuracy-only pruning recommends removing heads because perplexity holds. But if energy doesn't drop, EPI gets worse — you lost intelligence without saving joules.

How to Detect Compensation

The epi-meter captures total system power. PyTorch forward hooks capture per-layer execution time. By correlating:

  • Per-layer timing (which layers take longer after pruning)
  • Total power trace (does wattage actually drop)
  • Tokens per second (does throughput change)

We can identify whether remaining heads compensate or whether the energy savings are real.
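The correlation step reduces to simple arithmetic once both streams are recorded: integrate the power trace into joules, then compare measured savings against the proportional expectation. A minimal sketch, assuming a fixed power-sampling interval and aligned clocks; the function names and illustrative numbers below are assumptions, not taken from the epi-bench tooling.

```python
def trace_energy(power_samples_w, interval_s):
    """Integrate a power trace (watts at a fixed sampling interval) into joules."""
    return sum(power_samples_w) * interval_s

def compensation_gap(baseline_j, pruned_j, frac_heads_removed):
    """Expected vs. measured fractional energy savings; a positive gap
    means the remaining heads are compensating."""
    expected = frac_heads_removed            # proportional expectation
    measured = 1.0 - pruned_j / baseline_j
    return expected - measured

# Illustrative numbers: 25% of heads removed, but only 10% energy saved.
base = trace_energy([50.0] * 600, 0.1)      # 60 s at 50 W -> 3000 J
pruned = trace_energy([45.0] * 600, 0.1)    # 60 s at 45 W -> 2700 J
gap = compensation_gap(base, pruned, 0.25)  # 0.25 - 0.10 = 0.15 gap
```

A gap near zero corresponds to outcome A above; a gap approaching the pruning fraction corresponds to outcome C.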


3. Research Questions

| # | Question |
|---|---|
| RQ1 | Does removing attention heads produce proportional, partial, or negligible energy savings? |
| RQ2 | Can compensation be detected in the epi-meter power trace (power holds steady while heads are removed)? |
| RQ3 | Do specific layers show more compensation than others? |
| RQ4 | Is there a pruning threshold below which compensation dominates (energy savings disappear)? |
| RQ5 | Does the EPI-optimal head pruning depth differ from the accuracy-optimal depth? |

4. Experimental Design

Overview

  1. Baseline: Full model (all heads) — measure EPI, power trace, per-layer timing
  2. Progressive pruning: Remove 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50% of heads
  3. Layer targeting: All layers, early only, late only
  4. For each configuration: Capture power trace AND per-layer timing hooks
  5. Correlate: Does per-layer timing redistribution correspond to power trace patterns?

Hardware

| Component | Specification |
|---|---|
| Surgery platform | DGX Spark (GB10, 128GB) |
| Deployment target | Pi 5 cluster (4x 16GB, distributed-llama) |
| Power measurement | epi-meter board (4x ATM90E26, CT clamps, AC side) |
| Per-layer timing | PyTorch forward hooks (CPU timestamps per layer) |
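The per-layer timing hooks can be sketched as below, using a small `nn.Sequential` as a stand-in for a transformer stack; in the real pipeline each decoder layer would get a pair of hooks. All function and variable names here are illustrative.

```python
import time
import torch
import torch.nn as nn

def attach_timing_hooks(named_layers):
    """Attach paired pre/post hooks that record each layer's
    forward-pass wall-clock duration in seconds."""
    timings = {name: [] for name, _ in named_layers}
    for name, module in named_layers:
        def pre(mod, inp, _n=name):
            timings[_n].append(-time.perf_counter())   # start marker
        def post(mod, inp, out, _n=name):
            timings[_n][-1] += time.perf_counter()     # now a duration
        module.register_forward_pre_hook(pre)
        module.register_forward_hook(post)
    return timings

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
timings = attach_timing_hooks([(f"layer{i}", m) for i, m in enumerate(model)])
with torch.no_grad():
    model(torch.randn(8, 64))
# timings now holds one nonnegative duration per layer
```

Because the hooks record CPU timestamps, each recorded duration can be matched against the epi-meter's timestamped power samples in post-processing.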

Target Models

| Model | Parameters | Heads/Layer | Total Heads | KV Heads |
|---|---|---|---|---|
| Llama-3.1-8B | 8B | 32 | 1024 | 8 (GQA) |
| Mistral-7B-v0.3 | 7.2B | 32 | 1024 | 8 (GQA) |

Both use Grouped Query Attention (GQA) — pruning interacts with KV head sharing, adding another dimension to the analysis.


5. Surgery Matrix

Pruning Depths

| Prune % | Heads Removed (per 32-head layer) | Remaining |
|---|---|---|
| 0% (baseline) | 0 | 32 |
| 5% | 1–2 | 30–31 |
| 10% | 3 | 29 |
| 15% | 5 | 27 |
| 20% | 6 | 26 |
| 25% | 8 | 24 |
| 30% | 10 | 22 |
| 40% | 13 | 19 |
| 50% | 16 | 16 |

Layer Targeting

| Target | Layers Affected |
|---|---|
| All | Uniform pruning across all 32 layers |
| Early (0–25%) | Layers 0–7 only |
| Late (75–100%) | Layers 24–31 only |

Head Selection

Heads ranked by importance score (gradient-weighted activation magnitude). Least-important heads removed first. This follows established practice to enable direct comparison with prior work.

Total Matrix

8 pruning depths × 3 layer targets + 1 baseline = 25 configurations per model.
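The matrix is small enough to enumerate directly. A sketch, where the rounding rule for the heads-removed column is an assumption inferred from the table above, not taken from the surgery code:

```python
DEPTHS = [5, 10, 15, 20, 25, 30, 40, 50]   # percent of heads pruned
TARGETS = ["all", "early", "late"]          # layer-targeting strategies

def heads_removed(pct, heads_per_layer=32):
    """Heads removed per layer at a given pruning percentage."""
    return round(pct / 100 * heads_per_layer)

# 8 depths x 3 targets + 1 baseline = 25 configurations per model
configs = [("baseline", 0)] + [(t, d) for d in DEPTHS for t in TARGETS]
```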


6. Methodology

Per-Configuration Pipeline

1. Rank all attention heads by importance score
2. Remove heads according to config (% and layer target)
3. Validate modified model (coherence check)
4. Quantize to Q4_K_M GGUF
5. Deploy to Pi cluster
6. Wait 60s thermal stabilization
7. Run benchmark suite with PyTorch timing hooks enabled:
   a. Record per-layer forward pass duration
   b. epi-meter records power simultaneously
8. Calculate EPI (epi-bench)
9. Correlate timing and power data
10. Repeat 3x, median reported
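Step 10's median-of-three reporting can be sketched as follows; `run_benchmark` is a hypothetical stand-in for the full epi-bench invocation.

```python
from statistics import median

def measure_config(run_benchmark, runs=3):
    """Run the full benchmark several times and report the median EPI,
    damping one-off thermal or scheduling outliers."""
    return median(run_benchmark() for _ in range(runs))

# Illustrative: three noisy EPI readings collapse to the middle value.
readings = iter([1.12, 1.05, 1.09])
epi = measure_config(lambda: next(readings))  # -> 1.09
```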

Importance Scoring

```python
# Gradient-weighted activation magnitude.
# For each head h in layer l, over a calibration dataset (~1000 tokens):
#   importance[l][h] = mean(|activation[l][h]| * |gradient[l][h]|)
# Lowest importance = first to be pruned.
```
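A runnable sketch of the scoring rule, using plain lists in place of the per-head activation and gradient tensors that would come from forward/backward hooks over the calibration set; all names are illustrative.

```python
def head_importance(activations, gradients):
    """Mean |activation| * |gradient| over calibration tokens for one head."""
    assert len(activations) == len(gradients)
    return sum(abs(a) * abs(g) for a, g in zip(activations, gradients)) / len(activations)

def prune_order(scores):
    """Head indices sorted by importance, least important first."""
    return sorted(range(len(scores)), key=scores.__getitem__)

scores = [head_importance([0.5, -0.2], [0.1, 0.3]),   # head 0: (0.05 + 0.06) / 2
          head_importance([2.0, 1.5], [0.4, 0.2])]    # head 1: (0.80 + 0.30) / 2
order = prune_order(scores)  # -> [0, 1]: head 0 is pruned first
```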

7. Power Trace Correlation

This is the unique contribution of this paper — correlating internal model timing with external power measurement to detect compensation.

Dual Data Streams

  PyTorch Hooks (internal)          epi-meter (external)
  ─────────────────────             ────────────────────
  Layer 0:  12.3 ms                 t=0.000: 48.2 W
  Layer 1:  11.8 ms                 t=0.100: 48.5 W
  Layer 2:  13.1 ms  ← slower?     t=0.200: 49.1 W  ← higher?
  Layer 3:  11.5 ms                 t=0.300: 48.3 W
  ...                               ...

  Correlation: Does Layer 2 taking longer correspond to
               higher power at that timestamp?

Compensation Signatures

| Pattern | Timing Signal | Power Signal | Interpretation |
|---|---|---|---|
| No compensation | Pruned layers faster, others unchanged | Total watts drops proportionally | Energy savings are real |
| Partial compensation | Pruned layers faster, neighboring layers slower | Total watts drops less than expected | Some load redistribution |
| Full compensation | Unpruned layers significantly slower | Total watts unchanged or higher | Remaining heads absorbing removed load |
| Inference speedup | All layers faster (fewer ops total) | Watts same but duration shorter → fewer joules | Energy saved via speed, not power draw |
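The signature table maps onto a simple decision rule over expected and measured fractional savings. A sketch; the 5-point tolerance is an illustrative threshold, not a calibrated value from this study.

```python
def classify(expected_savings, measured_savings, tol=0.05):
    """Map expected vs. measured fractional energy savings to an outcome."""
    if measured_savings >= expected_savings - tol:
        return "A: proportional (savings are real)"
    if measured_savings > tol:
        return "B: partial compensation"
    return "C: full compensation"

# 25% of heads removed:
assert classify(0.25, 0.24).startswith("A")   # savings track pruning
assert classify(0.25, 0.12).startswith("B")   # some load redistribution
assert classify(0.25, 0.01).startswith("C")   # energy barely moved
```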

8. Results

Status: Data collection pending; results expected May 2026.

EPI vs Head Pruning Depth (All Layers)

| Prune % | J/Token | Accuracy | EPI vs Baseline | Compensation? |
|---|---|---|---|---|
| 0% (baseline) | — | — | — | — |
| 5% | — | — | — | — |
| 10% | — | — | — | — |
| 15% | — | — | — | — |
| 20% | — | — | — | — |
| 25% | — | — | — | — |
| 30% | — | — | — | — |
| 40% | — | — | — | — |
| 50% | — | — | — | — |

Energy Savings vs Expected

| Prune % | Heads Removed | Expected Energy Δ | Measured Energy Δ | Gap (Compensation) |
|---|---|---|---|---|
| 5% | ~5% | ~-5% | — | — |
| 10% | ~10% | ~-10% | — | — |
| 25% | ~25% | ~-25% | — | — |
| 50% | ~50% | ~-50% | — | — |

9. Analysis

Pending measurement data.

Planned:

  1. Compensation detection — Compare expected vs. measured energy savings at each pruning depth
  2. Per-layer timing heatmap — Before vs. after pruning, which layers slow down?
  3. Power trace overlay — Baseline power trace vs. pruned power trace, same time axis
  4. EPI curve — Is there a "compensation threshold" where EPI stops improving?
  5. GQA interaction — Do KV head sharing effects amplify or dampen compensation?

10. Discussion

Pending measurement data.

Expected topics:

  • The parallel circuit insight: Removing loads from a parallel circuit is textbook EE. Why hasn't anyone applied this to transformers before?
  • Softmax redistribution: When heads are removed, remaining heads' softmax distributions change. This is the mechanism of compensation.
  • Practical guidance: "Prune up to X% for real energy savings. Beyond X%, you're just losing accuracy."
  • GQA complication: Grouped Query Attention means KV heads are shared — pruning a query head doesn't remove the shared KV computation. This limits energy savings.
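The GQA complication is arithmetic: with 32 query heads sharing 8 KV heads, each KV head serves a group of 4 query heads, so a KV head's compute only disappears when its entire group is pruned. A sketch under that assumption (function name hypothetical):

```python
def kv_heads_still_active(pruned_q_heads, q_heads=32, kv_heads=8):
    """Count KV heads whose compute survives pruning: a KV head stays
    active while at least one query head in its group remains."""
    group = q_heads // kv_heads                       # 4 query heads per KV head
    surviving = set(range(q_heads)) - set(pruned_q_heads)
    return len({q // group for q in surviving})

# Pruning 4 scattered query heads keeps all 8 KV heads active...
assert kv_heads_still_active([0, 5, 10, 15]) == 8
# ...but pruning one whole group (heads 0-3) retires a KV head.
assert kv_heads_still_active([0, 1, 2, 3]) == 7
```

This is why importance-ranked pruning, which tends to scatter removals across groups, may save less KV-side energy than the head count suggests.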

11. Comparison to Prior Work

| Paper | Measures | Method | Hardware | Energy? | Compensation? |
|---|---|---|---|---|---|
| Voita et al. (2019) | Accuracy, head importance | Differentiable masking | GPU | No | No |
| Michel et al. (2019) | Accuracy per head removed | Ablation study | GPU | No | No |
| Sixteen Heads (2022) | Accuracy, efficiency | Structured pruning | GPU | No | No |
| This paper | EPI, power trace, timing | Pruning + dual measurement | Pi 5 cluster | Yes | Yes |

Unique contribution: Dual-stream correlation (internal timing × external power) to detect compensation — something no prior work has attempted.


12. Reproducibility

| Component | Repository |
|---|---|
| EPI Framework | energy-per-intelligence |
| Measurement Board | epi-meter |
| Calculation Tooling | epi-bench |
| Raw Data | data/ in this repo |
| Surgery Code | code/surgery/ |
| Analysis Code | code/analysis/ |

13. Future Work

| Direction | Description |
|---|---|
| Head regrowth | Fine-tune after pruning — does compensation reverse when the model adapts? |
| KV head pruning | Prune KV heads (not just query heads) for deeper energy savings in GQA |
| Combined surgery | Head pruning + expert pruning + mixed quantization on the same model |
| Compensation prediction | Can the DGX predict which layers will compensate before deployment? |
| Dynamic head selection | Activate different head subsets per token — runtime energy optimization |

14. Citation

```bibtex
@article{abner2026attentionheadepi,
  title   = {Removing Attention Heads: Does the Energy Actually Drop
             or Do Remaining Heads Compensate?},
  author  = {Abner, Francisco},
  year    = {2026},
  url     = {https://github.com/Franzabner/attention-head-surgery-epi},
  note    = {YOSO-YAi LLC. Data collection in progress.}
}
```

15. References

[1] Abner, F. "Energy Per Intelligence." YOSO-YAi LLC, 2026.
[2] Voita, E., Talbot, D., Moiseev, F., Sennrich, R., Titov, I. "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned." ACL 2019.
[3] Michel, P., Levy, O., Neubig, G. "Are Sixteen Heads Really Better than One?" NeurIPS 2019.
[4] Shim et al. "Towards the Efficient Transformer: A Survey." 2022.

16. License

| Content | License |
|---|---|
| Paper | CC BY 4.0 |
| Code | MIT |
| Data | CC BY 4.0 |

Removing a load from a parallel circuit doesn't always reduce total power.

An electrician knows this. Now the transformer community will too.


Francisco Abner — Electrical Engineer, CEO & Founder, YOSO-YAi LLC
