Computer Science > Computation and Language

arXiv:2510.05152 (cs)

[Submitted on 2 Oct 2025]

Title: A Single Character can Make or Break Your LLM Evals

Authors: Jingtong Su, Jianyu Zhang, Karen Ullrich, Léon Bottou, Mark Ibrahim

Abstract: Common large language model (LLM) evaluations rely on demonstration examples to steer models' responses toward the desired style. While the number of examples used has been studied and standardized, the choice of how to format examples is less investigated. In evaluation protocols and real-world usage, users face the choice of how to separate in-context examples: use a comma, a new line, a semicolon, a hashtag, or something else? Surprisingly, we find this seemingly minor choice can dramatically alter model response quality. Across leading model families (Llama, Qwen, Gemma), performance on MMLU, for example, can vary by $\pm 23\%$ depending on the choice of delimiter. In fact, one can manipulate model rankings to put any model in the lead by modifying only the single character separating examples. We find that LLMs' brittleness pervades topics and model families and doesn't improve with scale. By probing attention head scores, we find that good-performing delimiters steer attention towards key tokens in the input. Finally, we explore methods to improve LLMs' robustness to the choice of delimiter. We find that specifying the selected delimiter in the prompt boosts robustness, and we offer practical recommendations for the best-performing delimiters to select.
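To make the setup concrete, here is a minimal sketch of how the same in-context examples can be joined with different single-character delimiters, optionally naming the delimiter in the prompt. The function name `build_few_shot_prompt`, the "Question/Answer" template, and the wording of the delimiter announcement are all assumptions made for illustration; they are not taken from the paper's evaluation protocol.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],
    query: str,
    delimiter: str = "\n",
    announce_delimiter: bool = False,
) -> str:
    """Join (question, answer) demonstrations and a final query with one delimiter.

    The template below is hypothetical; only the idea of varying the
    separator between examples comes from the abstract.
    """
    shots = [f"Question: {q} Answer: {a}" for q, a in examples]
    shots.append(f"Question: {query} Answer:")
    body = delimiter.join(shots)
    if announce_delimiter:
        # Assumed wording: the abstract says only that specifying the
        # selected delimiter in the prompt boosts robustness.
        header = f"Examples are separated by {delimiter!r}."
        return header + "\n" + body
    return body


examples = [("2 + 2 = ?", "4"), ("Capital of France?", "Paris")]

# One prompt per candidate delimiter; everything else is held fixed.
prompts = {
    d: build_few_shot_prompt(examples, "3 + 5 = ?", delimiter=d)
    for d in ["\n", ",", ";", "#"]
}
```

Scoring a model on each entry of `prompts` with an otherwise identical pipeline would isolate the effect of the delimiter, which is the kind of comparison the abstract describes.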

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2510.05152 [cs.CL]
	(or arXiv:2510.05152v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2510.05152

Submission history

From: Mark Ibrahim
[v1] Thu, 2 Oct 2025 13:27:28 UTC (875 KB)
