Computer Science > Machine Learning

arXiv:2512.11315 (cs)

[Submitted on 12 Dec 2025]

Title: Benchmarking the Generality of Vision-Language-Action Models

Authors: Pranav Guruprasad, Sudipta Chowdhury, Harsh Sikka, Mridul Sharma, Helen Lu, Sean Rivera, Aryan Khurana, Hangliang Ren, Yangyue Wang


Abstract: Generalist multimodal agents are expected to unify perception, language, and control, operating robustly across diverse real-world domains. However, current evaluation practices remain fragmented across isolated benchmarks, making it difficult to assess whether today's foundation models truly generalize beyond their training distributions. We introduce MultiNet v1.0, a unified benchmark for measuring the cross-domain generality of vision-language models (VLMs) and vision-language-action models (VLAs) across six foundational capability regimes: visual grounding, spatial reasoning, tool use, physical commonsense, multi-agent coordination, and continuous robot control. Evaluating GPT-5, Pi0, and Magma, we find that no model demonstrates consistent generality. All exhibit substantial degradation on unseen domains, unfamiliar modalities, or cross-domain task shifts despite strong performance within their training distributions. These failures manifest as modality misalignment, output format instability, and catastrophic knowledge degradation under domain [...]. Our findings reveal a persistent gap between the aspiration of generalist intelligence and the actual capabilities of current foundation models. MultiNet v1.0 provides a standardized evaluation substrate for diagnosing these gaps and guiding the development of future generalist [...]. Code, data, and leaderboards are publicly available.

Comments:	23 pages, 7 figures, and 1 table
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2512.11315 [cs.LG]
	(or arXiv:2512.11315v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2512.11315

Submission history

From: Pranav Guruprasad
[v1] Fri, 12 Dec 2025 06:31:52 UTC (1,557 KB)
