Computer Science > Computer Vision and Pattern Recognition

arXiv:2512.03643 (cs)

[Submitted on 3 Dec 2025 (v1), last revised 4 Apr 2026 (this version, v2)]

Title: Optical Context Compression Is Just (Bad) Autoencoding

Authors: Ivan Yee Lee, Cheng Yang, Taylor Berg-Kirkpatrick

Abstract: DeepSeek-OCR shows that rendered text can be reconstructed from a small number of vision tokens, sparking excitement about using vision as a compression medium for long textual contexts. But this pipeline requires rendering token embeddings to pixels and compressing from there -- discarding learned representations in favor of an image the vision encoder must then recover from. We ask whether this detour helps. Comparing DeepSeek-OCR's vision encoder against near-zero-parameter mean pooling and a learned hierarchical encoder, we find it does not. For reconstruction, simple direct methods match or surpass vision at every compression ratio. For language modeling, vision performs comparably to truncation -- a baseline that simply discards context -- and loses to the hierarchical encoder at every compression ratio. As expected, all compression methods outperform truncation for factual recall, but vision never surpasses the best direct baseline. The excitement around optical context compression outpaces the evidence. Code and checkpoints are available at this https URL.
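
To make the two simplest baselines named in the abstract concrete, the sketch below shows one plausible reading of mean pooling and truncation over a sequence of token embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names `mean_pool` and `truncate`, the use of non-overlapping windows, and the choice to keep the most recent tokens when truncating are all hypothetical and do not come from the paper.

```python
from typing import List

Vector = List[float]


def mean_pool(embeddings: List[Vector], ratio: int) -> List[Vector]:
    """Average each run of `ratio` consecutive embeddings into one vector.

    Assumption: non-overlapping windows; the last window may be shorter.
    """
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    pooled: List[Vector] = []
    for start in range(0, len(embeddings), ratio):
        window = embeddings[start:start + ratio]
        dim = len(window[0])
        # Element-wise mean over the vectors in this window.
        pooled.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return pooled


def truncate(embeddings: List[Vector], ratio: int) -> List[Vector]:
    """Keep ceil(n / ratio) embeddings and discard the rest.

    Assumption: the most recent (last) embeddings are the ones kept.
    """
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    keep = -(-len(embeddings) // ratio)  # ceiling division
    return embeddings[len(embeddings) - keep:]


if __name__ == "__main__":
    toy = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0], [6.0, 8.0]]
    print(mean_pool(toy, 2))  # two vectors, each the mean of a pair
    print(truncate(toy, 2))   # the last two vectors, unchanged
```

Both functions return the same number of vectors for a given ratio, so the compressed budgets are matched. Mean pooling keeps a blurred summary of every position, while truncation keeps exact information about a fraction of the context and nothing about the rest.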

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2512.03643 [cs.CV]
	(or arXiv:2512.03643v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2512.03643

Submission history

From: Ivan Yee Lee
[v1] Wed, 3 Dec 2025 10:27:27 UTC (97 KB)
[v2] Sat, 4 Apr 2026 05:28:18 UTC (123 KB)
