-
Evaluating Cooperation in LLM Social Groups through Elected Leadership
Authors:
Ryan Faulkner,
Anushka Deshpande,
David Guzman Piedrahita,
Joel Z. Leibo,
Zhijing Jin
Abstract:
Governing common-pool resources requires agents to develop enduring strategies through cooperation and self-governance to avoid collective failure. While foundation models have shown potential for cooperation in these settings, existing multi-agent research provides little insight into whether structured leadership and election mechanisms can improve collective decision making. The lack of such a critical organizational feature ubiquitous in human society presents a significant shortcoming of the current methods. In this work we aim to directly address whether leadership and elections can support improved social welfare and cooperation through multi-agent simulation with LLMs. We present our open-source framework that simulates leadership through elected personas and candidate-driven agendas and carry out an empirical study of LLMs under controlled governance conditions. Our experiments demonstrate that having elected leadership improves social welfare scores by 55.4% and survival time by 128.6% across a range of high performing LLMs. Through the construction of an agent social graph we compute centrality metrics to assess the social influence of leader personas and also analyze rhetorical and cooperative tendencies revealed through a sentiment analysis on leader utterances. This work lays the foundation for further study of election mechanisms in multi-agent systems toward navigating complex social dilemmas.
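The social-graph analysis mentioned above (centrality metrics over agent interactions to assess leader influence) can be illustrated with a minimal networkx sketch; the edge construction below (edges weighted by how often one agent addresses another) and the specific metrics are assumptions, since the paper's exact graph definition is not given in the abstract.

```python
# Minimal sketch (assumptions: edges weighted by how often one agent addresses
# another; the paper's actual graph construction is not given in the abstract).
import networkx as nx

# hypothetical interaction counts: (speaker, addressee, count)
interactions = [
    ("agent_1", "leader", 5), ("agent_2", "leader", 7),
    ("agent_3", "leader", 4), ("leader", "agent_1", 3),
    ("leader", "agent_2", 2), ("agent_1", "agent_2", 1),
]

G = nx.DiGraph()
for speaker, addressee, count in interactions:
    G.add_edge(speaker, addressee, weight=count)

# Common centrality metrics used to quantify social influence in such graphs.
in_degree = nx.in_degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
pagerank = nx.pagerank(G, weight="weight")

for node in G.nodes:
    print(f"{node}: in-degree={in_degree[node]:.2f}, "
          f"betweenness={betweenness[node]:.2f}, pagerank={pagerank[node]:.2f}")
```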
Submitted 13 April, 2026;
originally announced April 2026.
-
Tracing the Thought of a Grandmaster-level Chess-Playing Transformer
Authors:
Rui Lin,
Zhenyu Jin,
Guancheng Zhou,
Xuyang Ge,
Wentao Shu,
Jiaxing Wu,
Junxuan Wang,
Zhengfu He,
Junping Zhang,
Xipeng Qiu
Abstract:
While modern transformer neural networks achieve grandmaster-level performance in chess and other reasoning tasks, their internal computation process remains largely opaque. Focusing on Leela Chess Zero (LC0), we introduce a sparse decomposition framework to interpret its internal computation by decomposing its MLP and attention modules with sparse replacement layers, which capture the primary computation process of LC0. We conduct a detailed case study showing that these pathways expose rich, interpretable tactical considerations that are empirically verifiable. We further introduce three quantitative metrics and show that LC0 exhibits parallel reasoning behavior consistent with the inductive bias of its policy head architecture. To the best of our knowledge, this is the first work to decompose the internal computation of a transformer on both MLP and attention modules for interpretability. Combining sparse replacement layers and causal interventions in LC0 provides a comprehensive understanding of advanced tactical reasoning, offering critical insights into the underlying mechanisms of superhuman systems. Our code is available at https://github.com/JacklE0niden/Leela-SAEs.
Submitted 11 April, 2026;
originally announced April 2026.
-
Persona-E$^2$: A Human-Grounded Dataset for Personality-Shaped Emotional Responses to Textual Events
Authors:
Yuqin Yang,
Haowu Zhou,
Haoran Tu,
Zhiwen Hui,
Shiqi Yan,
HaoYang Li,
Dong She,
Xianrong Yao,
Yang Gao,
Zhanpeng Jin
Abstract:
Most affective computing research treats emotion as a static property of text, focusing on the writer's sentiment while overlooking the reader's perspective. This approach ignores how individual personalities lead to diverse emotional appraisals of the same event. Although role-playing Large Language Models (LLMs) attempt to simulate such nuanced reactions, they often suffer from "personality illusion" -- relying on surface-level stereotypes rather than authentic cognitive logic. A critical bottleneck is the absence of ground-truth human data to link personality traits to emotional shifts. To bridge the gap, we introduce Persona-E$^2$ (Persona-Event2Emotion), a large-scale dataset grounded in annotated MBTI and Big Five traits to capture reader-based emotional variations across news, social media, and life narratives. Extensive experiments reveal that state-of-the-art LLMs struggle to capture precise appraisal shifts, particularly in social media domains. Crucially, we find that personality information significantly improves comprehension, with the Big Five traits alleviating "personality illusion."
Submitted 10 April, 2026;
originally announced April 2026.
-
ORACLE-SWE: Quantifying the Contribution of Oracle Information Signals on SWE Agents
Authors:
Kenan Li,
Qirui Jin,
Liao Zhu,
Xiaosong Huang,
Yijia Wu,
Yikai Zhang,
Xin Zhang,
Zijian Jin,
Yufan Huang,
Elsie Nallipogu,
Chaoyun Zhang,
Yu Kang,
Saravan Rajmohan,
Qingwei Lin,
Wenke Lee,
Dongmei Zhang
Abstract:
Recent advances in language model (LM) agents have significantly improved automated software engineering (SWE). Prior work has proposed various agentic workflows and training strategies as well as analyzed failure modes of agentic systems on SWE tasks, focusing on several contextual information signals: Reproduction Test, Regression Test, Edit Location, Execution Context, and API Usage. However, the individual contribution of each signal to overall success remains underexplored, particularly their ideal contribution when intermediate information is perfectly obtained. To address this gap, we introduce Oracle-SWE, a unified method to isolate and extract oracle information signals from SWE benchmarks and quantify the impact of each signal on agent performance. To further validate the pattern, we evaluate the performance gain of signals extracted by strong LMs when provided to a base agent, approximating real-world task-resolution settings. These evaluations aim to guide research prioritization for autonomous coding systems.
Submitted 9 April, 2026;
originally announced April 2026.
-
Administrative Decentralization in Edge-Cloud Multi-Agent for Mobile Automation
Authors:
Senyao Li,
Zhigang Zuo,
Haozhao Wang,
Junyu Chen,
Zhanbo Jin,
Ruixuan Li
Abstract:
Collaborative edge-cloud frameworks have emerged as the mainstream paradigm for mobile automation, mitigating the latency and privacy risks inherent to monolithic cloud agents. However, existing approaches centralize administration in the cloud while relegating the device to passive execution, inducing a cognitive lag regarding real-time UI dynamics. To tackle this, we introduce AdecPilot by applying the principle of administrative decentralization to the edge-cloud multi-agent framework, which redefines edge agency by decoupling high-level strategic design from tactical grounding. AdecPilot integrates a UI-agnostic cloud designer that generates abstract milestones with a bimodal edge team capable of autonomous tactical planning and self-correction without cloud intervention. Furthermore, AdecPilot employs a Hierarchical Implicit Termination protocol to enforce deterministic stops and prevent post-completion hallucinations. Extensive experiments demonstrate that the proposed approach improves task success rate by 21.7% while reducing cloud token consumption by 37.5% against EcoAgent and decreasing end-to-end latency by 88.9% against CORE. The source code is available at https://anonymous.4open.science/r/Anonymous_code-B8AB.
Submitted 8 April, 2026;
originally announced April 2026.
-
Radio-Frequency Inverse Rendering for Wireless Environment Modeling
Authors:
Fuhai Wang,
Zihan Jin,
Lehang Wang,
Xuehui Dong,
Tiebin Mi,
Robert Caiming Qiu,
Zenan Ling
Abstract:
Neural rendering paradigms have recently emerged as powerful tools for radio frequency (RF). However, by entangling RF sources with scene geometry and material properties, existing approaches limit downstream manipulation of scene geometry, wireless system configuration, and RF reasoning. To address this, we propose a physically grounded RF inverse rendering (RFIR) framework that explicitly decouples RF emission, geometry, and material electromagnetic properties. Our key insight is an RF-aware bidirectional scattering distribution function, embedded into the Gaussian splatting paradigm as an RF rendering equation. Each Gaussian primitive is endowed with intrinsic physical attributes, including surface normals, material electromagnetic parameters, and roughness, and leveraged by a customized ray-tracing scheme to represent RF signal synthesis. The proposed RFIR generalizes three typical RF tasks: radar cross-section synthesis, received signal strength indicator prediction, and wireless scene editability. Experiments demonstrate significant performance advantages, underscoring the potential for wireless world modeling.
Submitted 8 April, 2026;
originally announced April 2026.
-
MARS: Enabling Autoregressive Models Multi-Token Generation
Authors:
Ziqi Jin,
Lei Wang,
Ziwei Luo,
Aixin Sun
Abstract:
Autoregressive (AR) language models generate text one token at a time, even when consecutive tokens are highly predictable given earlier context. We introduce MARS (Mask AutoRegreSsion), a lightweight fine-tuning method that teaches an instruction-tuned AR model to predict multiple tokens per forward pass. MARS adds no architectural modifications, no extra parameters, and produces a single model that can still be called exactly like the original AR model with no performance degradation. Unlike speculative decoding, which maintains a separate draft model alongside the target, or multi-head approaches such as Medusa, which attach additional prediction heads, MARS requires only continued training on existing instruction data. When generating one token per forward pass, MARS matches or exceeds the AR baseline on six standard benchmarks. When allowed to accept multiple tokens per step, it maintains baseline-level accuracy while achieving 1.5-1.7x throughput. We further develop a block-level KV caching strategy for batch inference, achieving up to 1.71x wall-clock speedup over AR with KV cache on Qwen2.5-7B. Finally, MARS supports real-time speed adjustment via confidence thresholding: under high request load, the serving system can increase throughput on the fly without swapping models or restarting, providing a practical latency-quality knob for deployment.
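The confidence-thresholding knob described above can be illustrated with a minimal sketch of threshold-based multi-token acceptance; the exact acceptance rule, function names, and tensor shapes below are assumptions, not the MARS implementation.

```python
# Illustrative sketch only: accept up to k tokens from one forward pass when their
# predicted probabilities clear a confidence threshold. The actual MARS acceptance
# rule and mask-based training are not given in the abstract; names and shapes here
# are assumptions.
import torch

def accept_tokens(step_probs: torch.Tensor, threshold: float) -> list[int]:
    """step_probs: (k, vocab_size) next-k-position probabilities from one forward pass."""
    accepted = []
    for pos in range(step_probs.size(0)):
        prob, token = step_probs[pos].max(dim=-1)
        if pos > 0 and prob.item() < threshold:
            break  # stop at the first low-confidence position; always keep >= 1 token
        accepted.append(int(token))
    return accepted

# Raising the threshold recovers one-token-per-step AR decoding; lowering it under
# high request load accepts more tokens per step for higher throughput.
probs = torch.softmax(torch.randn(4, 32000), dim=-1)
print(accept_tokens(probs, threshold=0.5))
```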
Submitted 8 April, 2026;
originally announced April 2026.
-
EventFace: Event-Based Face Recognition via Structure-Driven Spatiotemporal Modeling
Authors:
Qingguo Meng,
Xingbo Dong,
Zhe Jin,
Massimo Tistarelli
Abstract:
Event cameras offer a promising sensing modality for face recognition due to their inherent advantages in illumination robustness and privacy-friendliness. However, because event streams lack the stable photometric appearance relied upon by conventional RGB-based face recognition systems, we argue that event-based face recognition should model structure-driven spatiotemporal identity representations shaped by rigid facial motion and individual facial geometry. Since dedicated datasets for event-based face recognition remain lacking, we construct EFace, a small-scale event-based face dataset captured under rigid facial motion. To learn effectively from this limited event data, we further propose EventFace, a framework for event-based face recognition that integrates spatial structure and temporal dynamics for identity modeling. Specifically, we employ Low-Rank Adaptation (LoRA) to transfer structural facial priors from pretrained RGB face models to the event domain, thereby establishing a reliable spatial basis for identity modeling. Building on this foundation, we further introduce a Motion Prompt Encoder (MPE) to explicitly encode temporal features and a Spatiotemporal Modulator (STM) to fuse them with spatial features, thereby enhancing the representation of identity-relevant event patterns. Extensive experiments demonstrate that EventFace achieves the best performance among the evaluated baselines, with a Rank-1 identification rate of 94.19% and an equal error rate (EER) of 5.35%. Results further indicate that EventFace exhibits stronger robustness under degraded illumination than the competing methods. In addition, the learned representations exhibit reduced template reconstructability.
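The LoRA transfer mentioned above follows the standard low-rank adaptation form; the formulation below is the generic LoRA parameterization (where EventFace places the adapters and which weights stay frozen are not specified in the abstract).

```latex
% Generic LoRA parameterization (standard form; where EventFace inserts the
% adapters is not specified in the abstract).
% W_0: frozen pretrained weight from the RGB face model; B, A: trainable
% low-rank factors with rank r << min(d, k), scaled by alpha / r.
\[
h = W_0 x + \frac{\alpha}{r} B A x,
\qquad
W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k}.
\]
```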
Submitted 8 April, 2026;
originally announced April 2026.
-
Neutron Star Merger Rates from Multi-messenger Observations: Clues to the Physical Origin of the Short and Long-short Gamma-ray Bursts
Authors:
Zhi-Ping Jin,
Yuan-Zhu Wang,
Yin-Jie Li,
Yun Wang,
Hao Wang,
Shao-Peng Tang,
Da-Ming Wei
Abstract:
Short and long-short gamma-ray bursts (GRBs) are widely believed to be powered by neutron star mergers. In this work, we calculate the local rate of such GRBs and find a relatively high value of $\sim 786-2468~{\rm Gpc^{-3}~yr^{-1}}$ when including the very narrowly collimated event GRB 061201. Since its redshift is not very reliable, excluding this event lowers the rate to $\sim 195-666~{\rm Gpc^{-3}~yr^{-1}}$. We also calculate the electromagnetically (EM) bright neutron star merger rate inferred from the LIGO/Virgo/KAGRA observations up to the end of the first epoch of the O4 run, and derive a rate of $\sim 66-347~{\rm Gpc^{-3}~yr^{-1}}$. This rate is somewhat lower than the value obtained from the GRBs, even after excluding GRB 061201. The non-detection of any viable EM bright merger in the O4b and O4c observing runs favors an even lower rate, which starts to challenge the neutron star merger origin of the short and long-short GRBs and may suggest an additional contribution from the mergers of other compact object binaries (such as neutron star-white dwarf systems), as speculated initially by King et al. (2007) in interpreting the long-short event GRB 060614.
Submitted 8 April, 2026;
originally announced April 2026.
-
AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis
Authors:
Dong She,
Xianrong Yao,
Liqun Chen,
Jinghe Yu,
Yang Gao,
Zhanpeng Jin
Abstract:
Vision-Language Models (VLMs) have demonstrated strong capabilities in perception, yet holistic Affective Image Content Analysis (AICA), which integrates perception, reasoning, and generation into a unified framework, remains underexplored. To address this gap, we introduce AICA-Bench, a comprehensive benchmark with three core tasks: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). We evaluate 23 VLMs and identify two major limitations: weak intensity calibration and shallow open-ended descriptions. To address these issues, we propose Grounded Affective Tree (GAT) Prompting, a training-free framework that combines visual scaffolding with hierarchical reasoning. Experiments show that GAT reduces intensity errors and improves descriptive depth, providing a strong baseline for future research on affective multimodal understanding and generation.
Submitted 7 April, 2026;
originally announced April 2026.
-
From Hallucination to Scheming: A Unified Taxonomy and Benchmark Analysis for LLM Deception
Authors:
Jerick Shi,
Terry Jingcheng Zhang,
Zhijing Jin,
Vincent Conitzer
Abstract:
Large language models (LLMs) produce systematically misleading outputs, from hallucinated citations to strategic deception of evaluators, yet these phenomena are studied by separate communities with incompatible terminology. We propose a unified taxonomy organized along three complementary dimensions: degree of goal-directedness (behavioral to strategic deception), object of deception, and mechanism (fabrication, omission, or pragmatic distortion). Applying this taxonomy to 50 existing benchmarks reveals that every benchmark tests fabrication while pragmatic distortion, attribution, and capability self-knowledge remain critically under-covered, and strategic deception benchmarks are nascent. We offer concrete recommendations for developers and regulators, including a minimal reporting template for positioning future work within our framework.
Submitted 6 April, 2026;
originally announced April 2026.
-
Cheap Talk, Empty Promise: Frontier LLMs easily break public promises for self-interest
Authors:
Jerick Shi,
Terry Jingcheng Zhang,
Zhijing Jin,
Vincent Conitzer
Abstract:
Large language models are increasingly deployed as autonomous agents in multi-agent settings where they communicate intentions and take consequential actions with limited human oversight. A critical safety question is whether agents that publicly commit to actions break those promises when they can privately deviate, and what the consequences are for both themselves and the collective. We study deception as a deviation from a publicly announced action in one-shot normal-form games, classifying each deviation by its effect on individual payoff and collective welfare into four categories: win-win, selfish, altruistic, and sabotaging. By exhaustively enumerating announcement profiles across six canonical games, nine frontier models, and varying group sizes, we identify all opportunities for each deviation type and measure how often agents exploit them. Across all settings, agents deviate from promises in approximately 56.6% of scenarios, but the character of deception varies substantially across models even at similar overall rates. Most critically, for the majority of the models, promise-breaking occurs without verbalized awareness of the fact that they are breaking promises.
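The four deviation categories are defined by the deviation's effect on the deviator's own payoff and on collective welfare; a minimal sketch of that classification rule follows, assuming collective welfare is the sum of payoffs and that ties count as gains (both assumptions, not the paper's stated definitions).

```python
# Sketch of the four-way classification described in the abstract: a deviation from
# a publicly announced action is labeled by its effect on the deviator's payoff and
# on collective welfare. Welfare as the sum of payoffs and ties counting as gains
# are assumptions for illustration.
def classify_deviation(promised: dict, deviated: dict, deviator: str) -> str:
    own_gain = deviated[deviator] - promised[deviator]
    welfare_gain = sum(deviated.values()) - sum(promised.values())
    if own_gain >= 0 and welfare_gain >= 0:
        return "win-win"
    if own_gain >= 0:
        return "selfish"
    if welfare_gain >= 0:
        return "altruistic"
    return "sabotaging"

# Example: in a Prisoner's Dilemma-style payoff structure, defecting after promising
# to cooperate raises the deviator's payoff but lowers total welfare -> "selfish".
promised_payoffs = {"A": 3, "B": 3}   # both cooperate as announced
deviated_payoffs = {"A": 5, "B": 0}   # A privately defects
print(classify_deviation(promised_payoffs, deviated_payoffs, deviator="A"))
```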
Submitted 6 April, 2026;
originally announced April 2026.
-
MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
Authors:
Bin Wang,
Tianyao He,
Linke Ouyang,
Fan Wu,
Zhiyuan Zhao,
Tao Chu,
Yuan Qu,
Zhenjiang Jin,
Weijun Zeng,
Ziyang Miao,
Bangrui Xu,
Junbo Niu,
Mengzhang Cai,
Jiantao Qiu,
Qintong Zhang,
Dongsheng Ma,
Yuefeng Sun,
Hejun Dong,
Wenzheng Zhang,
Jutao Xiao,
Jiayong Shi,
Pengyu Liao,
Xiaomeng Zhao,
Huaping Zhong,
Liqun Wei
, et al. (18 additional authors not shown)
Abstract:
Current document parsing methods advance primarily through model architecture innovation, while systematic engineering of training data remains underexplored. Yet state-of-the-art models spanning diverse architectures and parameter scales exhibit highly consistent failure patterns on the same set of hard samples, suggesting that the performance bottleneck stems from shared deficiencies in training data rather than from architectural differences. Building on this finding, we present MinerU2.5-Pro, which advances the state of the art purely through data engineering and training strategy design while retaining the 1.2B-parameter architecture of MinerU2.5 unchanged. At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands training data from under 10M to 65.5M samples while mitigating distribution shift; Cross-Model Consistency Verification leverages output consensus among heterogeneous models to assess sample difficulty and generate reliable annotations; the Judge-and-Refine pipeline improves annotation quality for hard samples through render-then-verify iterative correction. A three-stage progressive training strategy--large-scale pre-training, hard sample fine-tuning, and GRPO alignment--sequentially exploits these data at different quality tiers. On the evaluation front, we rectify element-matching biases in OmniDocBench v1.5 and introduce a Hard subset, establishing the more discriminative OmniDocBench v1.6 protocol. Without any architectural modification, MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods, including those based on models with over 200x more parameters.
Submitted 9 April, 2026; v1 submitted 6 April, 2026;
originally announced April 2026.
-
Nonlocal advantage of quantum imaginarity in Schwarzschild spacetime
Authors:
Bing Yu,
Xiao-Yong Yang,
Xiaoli Hu,
Zhi-Xiang Jin,
Xiaofen Huang
Abstract:
Black hole spacetimes provide a natural setting for quantum systems in curved spacetime, where effects such as Hawking radiation arise from event horizons. In this work, we investigate the impact of the Hawking effect on quantum imaginarity in Schwarzschild spacetime, focusing on the nonlocal advantage of quantum imaginarity (NAQI) and assisted imaginarity distillation. NAQI is significantly affected by Hawking radiation, exhibiting a pronounced difference between the physically accessible and inaccessible regions. It is suppressed in the physically accessible region with increasing Hawking temperature and may vanish, while remaining absent in the physically inaccessible region across the parameter regime. For assisted imaginarity distillation, the Hawking effect modifies the assisted fidelity in a state-dependent manner. In the physically accessible region, the fidelity generally decreases with increasing temperature, indicating reduced distillation capability, whereas the physically inaccessible region exhibits the opposite monotonic trend, indicating enhanced distillation capability. These results highlight distinct operational behaviors of physically accessible and inaccessible regions under relativistic effects, providing insight into quantum imaginarity in curved spacetime.
Submitted 14 April, 2026; v1 submitted 4 April, 2026;
originally announced April 2026.
-
When Modalities Remember: Continual Learning for Multimodal Knowledge Graphs
Authors:
Linyu Li,
Zhi Jin,
Yichi Zhang,
Dongming Jin,
Yuanpeng He,
Haoran Duan,
Gadeng Luosang,
Nyima Tashi
Abstract:
Real-world multimodal knowledge graphs (MMKGs) are dynamic, with new entities, relations, and multimodal knowledge emerging over time. Existing continual knowledge graph reasoning (CKGR) methods focus on structural triples and cannot fully exploit multimodal signals from new entities. Existing multimodal knowledge graph reasoning (MMKGR) methods, however, usually assume static graphs and suffer catastrophic forgetting as graphs evolve. To address this gap, we present a systematic study of continual multimodal knowledge graph reasoning (CMMKGR). We construct several continual multimodal knowledge graph benchmarks from existing MMKG datasets and propose MRCKG, a new CMMKGR model. Specifically, MRCKG employs a multimodal-structural collaborative curriculum to schedule progressive learning based on the structural connectivity of new triples to the historical graph and their multimodal compatibility. It also introduces a cross-modal knowledge preservation mechanism to mitigate forgetting through entity representation stability, relational semantic consistency, and modality anchoring. In addition, a multimodal contrastive replay scheme with a two-stage optimization strategy reinforces learned knowledge via multimodal importance sampling and representation alignment. Experiments on multiple datasets show that MRCKG preserves previously learned multimodal knowledge while substantially improving the learning of new knowledge.
Submitted 3 April, 2026;
originally announced April 2026.
-
InverseDraping: Recovering Sewing Patterns from 3D Garment Surfaces via BoxMesh Bridging
Authors:
Leyang Jin,
Zirong Jin,
Zisheng Ye,
Haokai Pang,
Xiaoguang Han,
Yujian Zheng,
Hao Li
Abstract:
Recovering sewing patterns from draped 3D garments is a challenging problem in human digitization research. In contrast to the well-studied forward process of draping designed sewing patterns using mature physical simulation engines, the inverse process of recovering parametric 2D patterns from deformed garment geometry remains fundamentally ill-posed for existing methods. We propose a two-stage framework that centers on a structured intermediate representation, BoxMesh, which serves as the key to bridging the gap between 3D garment geometry and parametric sewing patterns. BoxMesh encodes both garment-level geometry and panel-level structure in 3D, while explicitly disentangling intrinsic panel geometry and stitching topology from draping-induced deformations. This representation imposes a physically grounded structure on the problem, significantly reducing ambiguity. In Stage I, a geometry-driven autoregressive model infers BoxMesh from the input 3D garment. In Stage II, a semantics-aware autoregressive model parses BoxMesh into parametric sewing patterns. We adopt autoregressive modeling to naturally handle the variable-length and structured nature of panel configurations and stitching relationships. This decomposition separates geometric inversion from structured pattern inference, leading to more accurate and robust recovery. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the GarmentCodeData benchmark and generalizes effectively to real-world scans and single-view images.
Submitted 3 April, 2026;
originally announced April 2026.
-
Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy
Authors:
Yihong Dong,
Jianha Xiao,
Xue Jiang,
Xuyuan Guo,
Zhiyuan Fan,
Jiaru Qian,
Kechi Zhang,
Jia Li,
Zhi Jin,
Ge Li
Abstract:
The formal reasoning capabilities of LLMs are crucial for advancing automated software engineering. However, existing benchmarks for LLMs lack systematic evaluation based on computation and complexity, leaving a critical gap in understanding their formal reasoning capabilities. Therefore, it is still unknown whether SOTA LLMs can grasp the structured, hierarchical complexity of formal languages as defined by Computation Theory. To address this, we introduce ChomskyBench, a benchmark for systematically evaluating LLMs through the lens of Chomsky Hierarchy. Unlike prior work that uses vectorized classification for neural networks, ChomskyBench is the first to combine full Chomsky Hierarchy coverage, process-trace evaluation via natural language, and deterministic symbolic verifiability. ChomskyBench is composed of a comprehensive suite of language recognition and generation tasks designed to test capabilities at each level. Extensive experiments indicate a clear performance stratification that correlates with the hierarchy's levels of complexity. Our analysis reveals a direct relationship where increasing task difficulty substantially impacts both inference length and performance. Furthermore, we find that while larger models and advanced inference methods offer notable relative gains, they face severe efficiency barriers: achieving practical reliability would require prohibitive computational costs, revealing that current limitations stem from inefficiency rather than absolute capability bounds. A time complexity analysis further indicates that LLMs are significantly less efficient than traditional algorithmic programs for these formal tasks. These results delineate the practical limits of current LLMs, highlight the indispensability of traditional software tools, and provide insights to guide the development of future LLMs with more powerful formal reasoning capabilities.
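As a toy illustration of what deterministic symbolic verification looks like at different hierarchy levels, the sketch below checks membership in a regular, a context-free, and a context-sensitive language; these specific languages are illustrative examples, not necessarily tasks from ChomskyBench itself.

```python
# Toy deterministic verifiers for languages at three Chomsky-hierarchy levels
# (illustrative choices; not necessarily tasks from ChomskyBench itself).
import re

def is_regular_ab_star(s: str) -> bool:
    """(ab)* -- regular (Type 3)."""
    return re.fullmatch(r"(ab)*", s) is not None

def is_context_free_anbn(s: str) -> bool:
    """a^n b^n -- context-free (Type 2), not regular."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

def is_context_sensitive_anbncn(s: str) -> bool:
    """a^n b^n c^n -- context-sensitive (Type 1), not context-free."""
    n = len(s) // 3
    return len(s) % 3 == 0 and s == "a" * n + "b" * n + "c" * n

assert is_regular_ab_star("ababab")
assert is_context_free_anbn("aaabbb") and not is_context_free_anbn("aabbb")
assert is_context_sensitive_anbncn("aabbcc") and not is_context_sensitive_anbncn("aabcc")
```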
Submitted 15 April, 2026; v1 submitted 3 April, 2026;
originally announced April 2026.
-
From Liouville equation to universal quantum control: A study of generating ultra highly squeezed states
Authors:
Zhu-yao Jin,
J. Q. You,
Jun Jing
Abstract:
Within a unified framework, we reveal that the seemingly disparate control approaches for classical and quantum continuous-variable systems are interconnected via differential manifolds of the ancillary representations. For classical systems, the ancillary representation is defined by the time-dependent ancillary canonical variables resulting from a symplectic transformation over the original canonical variables. Under the conditions of the Hamilton-Jacobi equation, the ancillary canonical variables act as dynamical invariants to guide the system nonadiabatically through the entire phase space. The second quantization of the Liouville equation for the canonical variables leads to the Heisenberg equation for the relevant ancillary operators, which is found to be a sufficient condition to yield nonadiabatic passages towards arbitrary target states in both Hermitian and non-Hermitian systems and constrained exact solutions of the time-dependent Schroedinger equation. Using the non-Hermitian Hamiltonian rigorously derived from the Lindblad master equation, our theory is exemplified by the generation of single-mode squeezed states with a squeezing level of 29.3 dB and double-mode squeezed states with 20.5 dB, respectively.
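For reference, the squeezing level quoted in dB is conventionally defined relative to the vacuum quadrature variance; the relation below is the standard definition, not taken from the paper, and conventions differ in the vacuum normalization.

```latex
% Standard definition of the squeezing level in dB relative to the vacuum quadrature
% variance (conventions differ in the vacuum normalization; not taken from the paper).
\[
S\,[\mathrm{dB}] = -10 \log_{10}
  \frac{\langle \Delta \hat{X}^2 \rangle_{\mathrm{sq}}}
       {\langle \Delta \hat{X}^2 \rangle_{\mathrm{vac}}},
\qquad
S = 29.3~\mathrm{dB} \;\Rightarrow\;
\frac{\langle \Delta \hat{X}^2 \rangle_{\mathrm{sq}}}
     {\langle \Delta \hat{X}^2 \rangle_{\mathrm{vac}}}
  = 10^{-2.93} \approx 1.2\times 10^{-3}.
\]
```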
Submitted 2 April, 2026;
originally announced April 2026.
-
Proton Temperature Anisotropy Across Interplanetary Shocks: A Statistical Analysis with WIND observations
Authors:
Zeping Jin,
Lingling Zhao,
Xingyu Zhu,
Vladimir Flosinski,
Gary P. Zank,
Jakobus Le Roux,
Yiming Jiao,
Ashok Silwal,
Nibuna S. M. Subashchandar
Abstract:
Interplanetary (IP) shocks efficiently modify the proton temperature anisotropy of the solar wind. Analyzing ~800 IP shocks observed by the Wind spacecraft from 1997-2024, we present a statistical study of upstream and downstream proton temperature anisotropy and its dependence on shock geometry, compression, and distance from the shock. We find that (1) quasi-perpendicular shocks produce a pronounced enhancement of perpendicular temperature downstream ($T_\perp > T_\parallel$), whereas parallel shocks remain near isotropic downstream due to typically stronger upstream $T_\parallel$; (2) comparisons with the Chew-Goldberger-Low (CGL) double-adiabatic model reveal geometry-dependent deviations: CGL overestimates downstream perpendicular heating and underestimates parallel heating at quasi-perpendicular shocks, with the opposite trend at quasi-parallel shocks, highlighting the importance of non-adiabatic processes beyond simple compression; (3) shock-driven anisotropy is strongly localized near the shock and gradually relaxes toward typical solar wind conditions farther downstream as the shock's influence diminishes; and (4) downstream anisotropy is regulated by kinetic instabilities, with quasi-perpendicular shocks constrained by proton cyclotron and mirror instabilities and quasi-parallel shocks limited by the parallel firehose instability. Together, these results show that the evolution of temperature anisotropy at interplanetary shocks is controlled by shock geometry, localized processes, and instability-driven regulation.
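The CGL comparison in point (2) rests on the standard double-adiabatic invariants; the textbook form and the downstream-to-upstream temperature ratios they predict are shown below (standard relations, not spelled out in the abstract).

```latex
% Textbook CGL double-adiabatic invariants (not spelled out in the abstract);
% n is the number density and B the magnetic field strength:
\[
\frac{d}{dt}\!\left(\frac{T_\perp}{B}\right) = 0,
\qquad
\frac{d}{dt}\!\left(\frac{T_\parallel B^2}{n^2}\right) = 0,
\]
% which, across a shock (upstream 1, downstream 2), predict
\[
\frac{T_{\perp,2}}{T_{\perp,1}} = \frac{B_2}{B_1},
\qquad
\frac{T_{\parallel,2}}{T_{\parallel,1}}
  = \left(\frac{n_2}{n_1}\right)^{2} \left(\frac{B_1}{B_2}\right)^{2}.
\]
```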
Submitted 2 April, 2026;
originally announced April 2026.
-
PixelPrune: Pixel-Level Adaptive Visual Token Reduction via Predictive Coding
Authors:
Nan Wang,
Zhiwei Jin,
Chen Chen,
Haonan Lu
Abstract:
Document understanding and GUI interaction are among the highest-value applications of Vision-Language Models (VLMs), yet they impose an exceptionally heavy computational burden: fine-grained text and small UI elements demand high-resolution inputs that produce tens of thousands of visual tokens. We observe that this cost is largely wasteful -- across document and GUI benchmarks, only 22-71% of image patches are pixel-unique, the rest being exact duplicates of another patch in the same image. We propose PixelPrune, which exploits this pixel-level redundancy through predictive-coding-based compression, pruning redundant patches before the Vision Transformer (ViT) encoder. Because it operates in pixel space prior to any neural computation, PixelPrune accelerates both the ViT encoder and the downstream LLM, covering the full inference pipeline. The method is training-free, requires no learnable parameters, and supports pixel-lossless compression ($\tau=0$) as well as controlled lossy compression ($\tau>0$). Experiments across three model scales and document and GUI benchmarks show that PixelPrune maintains competitive task accuracy while delivering up to 4.2x inference speedup and 1.9x training acceleration. Code is available at https://github.com/OPPO-Mente-Lab/PixelPrune.
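The pixel-level redundancy statistic quoted above (22-71% pixel-unique patches) suggests an exact-duplicate check over raw patches; the sketch below illustrates such a check via byte hashing, covering only the lossless $\tau=0$ case and omitting how pruned positions are handled downstream. The function name and patch size are assumptions, not PixelPrune's API.

```python
# Sketch of the pixel-level redundancy check implied by the abstract: patches that
# are exact byte-level duplicates of an earlier patch can be pruned before the ViT.
# How PixelPrune restores pruned positions downstream is not shown here.
import numpy as np

def unique_patch_mask(image: np.ndarray, patch: int = 14) -> np.ndarray:
    """Return a boolean mask over patches; True = first occurrence (keep)."""
    h, w, c = image.shape
    h, w = h - h % patch, w - w % patch          # drop ragged border for simplicity
    patches = (image[:h, :w]
               .reshape(h // patch, patch, w // patch, patch, c)
               .transpose(0, 2, 1, 3, 4)
               .reshape(-1, patch * patch * c))
    seen, keep = set(), []
    for p in patches:
        key = p.tobytes()
        keep.append(key not in seen)
        seen.add(key)
    return np.array(keep)

# Example: a synthetic "document-like" image with a flat white background has many
# duplicate (all-white) patches that can be pruned losslessly.
img = np.full((224, 224, 3), 255, dtype=np.uint8)
img[:28, :28] = 0                                 # a small dark region
mask = unique_patch_mask(img)
print(f"kept {mask.sum()} of {mask.size} patches")
```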
Submitted 1 April, 2026;
originally announced April 2026.
-
Stochastic Attention: Connectome-Inspired Randomized Routing for Expressive Linear-Time Attention
Authors:
Zehao Jin,
Yanan Sui
Abstract:
The whole-brain connectome of a fruit fly comprises over 130K neurons connected with a probability of merely 0.02%, yet achieves an average shortest path of only 4.4 hops. Despite being highly structured at the circuit level, the network's long-range connections are broadly distributed across brain regions, functioning as stochastic shortcuts that enable efficient global communication. Inspired by this observation, we propose Stochastic Attention (SA), a drop-in enhancement for sliding-window attention (SWA) that applies a random permutation to the token sequence before windowed attention and restores the original order afterward. This transforms the fixed local window into a stochastic global one within the same $O(nw)$ per-layer budget. Through depth, independently sampled permutations yield exponentially growing receptive fields, achieving full sequence coverage in $O(\log_w n)$ layers versus $O(n/w)$ for SWA. We validate SA in two settings: pre-training language models from scratch, where a gated SA + SWA combination achieves the best average zero-shot accuracy, and training-free inference on Qwen3-8B and Qwen3-30B-A3B, where SA consistently outperforms SWA and matches or exceeds Mixture of Block Attention at comparable compute budgets. These results suggest that connectome-inspired stochastic routing is a practical primitive for improving the expressivity of efficient attention, complementary to existing linear and sparse approaches.
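The core wrapper is simple enough to sketch: permute the sequence, apply any sliding-window attention, then invert the permutation. The PyTorch sketch below is illustrative only; the gated SA + SWA combination and the per-layer permutation sampling schedule are not shown, and the windowed-attention callable is a placeholder.

```python
# Illustrative sketch of the Stochastic Attention wrapper described in the abstract:
# random permutation -> any local windowed attention -> inverse permutation.
# `windowed_attention` is a placeholder for an existing sliding-window attention op.
import torch

def stochastic_attention(x: torch.Tensor, windowed_attention, generator=None) -> torch.Tensor:
    """x: (batch, seq_len, dim). A fresh permutation is sampled per layer/call."""
    seq_len = x.size(1)
    perm = torch.randperm(seq_len, generator=generator, device=x.device)
    inv = torch.empty_like(perm)
    inv[perm] = torch.arange(seq_len, device=x.device)   # inverse permutation
    y = windowed_attention(x[:, perm])                    # local window over shuffled order
    return y[:, inv]                                      # restore original token order

# Round-trip check with a dummy "window attention" (identity) just to show shapes:
x = torch.randn(2, 128, 64)
out = stochastic_attention(x, windowed_attention=lambda t: t)
assert torch.allclose(out, x)
```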
Submitted 1 April, 2026;
originally announced April 2026.
-
Towards Automatic Soccer Commentary Generation with Knowledge-Enhanced Visual Reasoning
Authors:
Zeyu Jin,
Xiaoyu Qin,
Songtao Zhou,
Kaifeng Yun,
Jia Jia
Abstract:
Soccer commentary plays a crucial role in enhancing the soccer game viewing experience for audiences. Previous studies in automatic soccer commentary generation typically adopt an end-to-end method to generate anonymous live text commentary. Such generated commentary is insufficient in the context of real-world live televised commentary, as it contains anonymous entities and context-dependent errors and lacks statistical insight into game events. To bridge this gap, we propose GameSight, a two-stage model that addresses soccer commentary generation as a knowledge-enhanced visual reasoning task, enabling live-televised-like knowledgeable commentary with accurate reference to entities (players and teams). GameSight starts by performing visual reasoning to align anonymous entities with fine-grained visual and contextual analysis. Subsequently, the entity-aligned commentary is refined with knowledge by incorporating external historical statistics and iteratively updated internal game state information. Consequently, GameSight improves player alignment accuracy by 18.5% on the SN-Caption-test-align dataset compared to Gemini 2.5-pro. Combined with further knowledge enhancement, GameSight achieves superior segment-level accuracy and commentary quality, as well as game-level contextual relevance and structural composition. We believe that our work paves the way for a more informative and engaging human-centric experience with AI sports applications. Demo Page: https://gamesight2025.github.io/gamesight2025
Submitted 30 March, 2026;
originally announced April 2026.
-
Think Anywhere in Code Generation
Authors:
Xue Jiang,
Tianyu Zhang,
Ge Li,
Mengyang Liu,
Taozhi Chen,
Zhenhua Xu,
Binhua Li,
Wenpin Jiao,
Zhi Jin,
Yongbin Li,
Yihong Dong
Abstract:
Recent advances in reasoning Large Language Models (LLMs) have primarily relied on upfront thinking, where reasoning occurs before the final answer. However, this approach suffers from critical limitations in code generation, where upfront thinking is often insufficient because a problem's full complexity only reveals itself during code implementation. Moreover, it cannot adaptively allocate reasoning effort throughout the code generation process, where difficulty varies significantly. In this paper, we propose Think-Anywhere, a novel reasoning mechanism that enables LLMs to invoke thinking on-demand at any token position during code generation. We achieve Think-Anywhere by first teaching LLMs to imitate the reasoning patterns through cold-start training, then leveraging outcome-based RL rewards to drive the model's autonomous exploration of when and where to invoke reasoning. Extensive experiments on four mainstream code generation benchmarks (i.e., LeetCode, LiveCodeBench, HumanEval, and MBPP) show that Think-Anywhere achieves state-of-the-art performance over both existing reasoning methods and recent post-training approaches, while demonstrating consistent generalization across diverse LLMs. Our analysis further reveals that Think-Anywhere enables the model to adaptively invoke reasoning at high-entropy positions, providing enhanced interpretability.
Submitted 2 April, 2026; v1 submitted 31 March, 2026;
originally announced March 2026.
-
From Natural Alignment to Conditional Controllability in Multimodal Dialogue
Authors:
Zeyu Jin,
Songtao Zhou,
Haoyu Wang,
Minghao Tian,
Kaifeng Yun,
Zhuo Chen,
Xiaoyu Qin,
Jia Jia
Abstract:
The recent advancement of Artificial Intelligence Generated Content (AIGC) has led to significant strides in modeling human interaction, particularly in the context of multimodal dialogue. While current methods impressively generate realistic dialogue in isolated modalities like speech or vision, challenges remain in controllable Multimodal Dialogue Generation (MDG). This paper focuses on the natural alignment between speech, vision, and text in human interaction, aiming for expressive dialogue generation through multimodal conditional control. To address the insufficient richness and diversity of dialogue expressiveness in existing datasets, we introduce a novel multimodal dialogue annotation pipeline to curate dialogues from movies and TV series with fine-grained annotations of interactional characteristics. The resulting MM-Dia dataset (360+ hours, 54,700 dialogues) facilitates explicitly controlled MDG, specifically through style-controllable dialogue speech synthesis. In parallel, MM-Dia-Bench (309 highly expressive dialogues with visible single-/dual-speaker scenes) serves as a rigorous testbed for implicit cross-modal MDG control, evaluating audio-visual style consistency across modalities. Extensive experiments demonstrate that training on MM-Dia significantly enhances fine-grained controllability, while evaluations on MM-Dia-Bench reveal limitations of current frameworks in replicating the nuanced expressiveness of human interaction. These findings provide new insights and challenges for multimodal conditional dialogue generation.
Submitted 30 March, 2026;
originally announced March 2026.
-
An Intertwined Short and Long GRB with 4-minute Separation
Authors:
Liang Li,
Yu Wang,
Bing Zhang,
Ye Li,
Shu-Rui Zhang,
Jochen Greiner,
Zhi-Ping Jin,
Jin-Jun Geng,
Hou-Jun Lv,
Asaf Peer,
Maria Dainotti,
Tong Liu,
Yi-Zhong Fan,
Yong-Feng Huang,
Zi-Gao Dai,
Melin Kole,
Wei-Hua Lei,
Ye-Fei Yuan,
Shuang-Nan Zhang,
Felix Ryde,
She-Sheng Xue,
Rong-Gen Cai
Abstract:
Gamma-ray bursts (GRBs), the most energetic transients in the Universe, are traditionally classified into long-duration ($T_{90}>2$ s) and short-duration ($T_{90}<2$ s) events, associated with the core collapse of massive stars (Type II) and the merger of compact binary systems (Type I), respectively. The two classes exhibit distinct observational properties that serve as key diagnostic criteria for classification. Here we report GRB 160425A, a peculiar event comprising two sub-bursts separated by four minutes: a short-duration burst ($G_1$) and a long-duration burst ($G_2$). Nearly all standard prompt-emission diagnostics, including pulse morphology, duration, hardness ratio, minimum variability timescale, spectral properties, and established empirical correlations, consistently categorize $G_1$ as a short-like (Type I, merger-origin) and $G_2$ as a long-like (Type II, collapsar-origin) GRB. The coexistence of merger and collapsar signatures in a single event challenges existing progenitor frameworks and calls for a re-evaluation of GRB classification schemes and progenitor scenarios.
Submitted 3 April, 2026; v1 submitted 30 March, 2026;
originally announced March 2026.
-
MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
Authors:
Fangda Ye,
Yuxin Hu,
Pengxiang Zhu,
Yibo Li,
Ziqi Jin,
Yao Xiao,
Yibo Wang,
Lei Wang,
Zhen Zhang,
Lu Wang,
Yue Deng,
Bin Wang,
Yifan Zhang,
Liangcai Su,
Xinyu Wang,
He Zhao,
Chen Wei,
Qiang Ren,
Bryan Hooi,
An Bo,
Shuicheng Yan,
Lidong Bing
Abstract:
Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, and cannot be refreshed as knowledge evolves. To address these gaps, we introduce MiroEval, a benchmark and evaluation framework for deep research systems. The benchmark comprises 100 tasks (70 text-only, 30 multimodal), all grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates, enabling a live and evolving setting. The proposed evaluation suite assesses deep research systems along three complementary dimensions: adaptive synthesis quality evaluation with task-specific rubrics, agentic factuality verification via active retrieval and reasoning over both web sources and multimodal attachments, and process-centric evaluation that audits how the system searches, reasons, and refines throughout its investigation. Evaluation across 13 systems yields three principal findings: the three evaluation dimensions capture complementary aspects of system capability, with each revealing distinct strengths and weaknesses across systems; process quality serves as a reliable predictor of overall outcome while revealing weaknesses invisible to output-level metrics; and multimodal tasks pose substantially greater challenges, with most systems declining by 3 to 10 points. The MiroThinker series achieves the most balanced performance, with MiroThinker-H1 ranking the highest overall in both settings. Human verification and robustness results confirm the reliability of the benchmark and evaluation framework. MiroEval provides a holistic diagnostic tool for the next generation of deep research agents.
Submitted 30 March, 2026;
originally announced March 2026.
-
DeepFAN, a transformer-based deep learning model for human-artificial intelligence collaborative assessment of incidental pulmonary nodules in CT scans: a multi-reader, multi-case trial
Authors:
Zhenchen Zhu,
Ge Hu,
Weixiong Tan,
Kai Gao,
Chao Sun,
Zhen Zhou,
Kepei Xu,
Wei Han,
Meixia Shang,
Xiaoming Qiu,
Yiqing Tan,
Jinhua Wang,
Zhoumeng Ying,
Li Peng,
Wei Song,
Lan Song,
Zhengyu Jin,
Nan Hong,
Yizhou Yu
Abstract:
The widespread adoption of CT has notably increased the number of detected lung nodules. However, current deep learning methods for classifying benign and malignant nodules often fail to comprehensively integrate global and local features, and most of them have not been validated through clinical trials. To address this, we developed DeepFAN, a transformer-based model trained on over 10K pathology-confirmed nodules and further conducted a multi-reader, multi-case clinical trial to evaluate its efficacy in assisting junior radiologists. DeepFAN achieved diagnostic area under the curve (AUC) of 0.939 (95% CI 0.930-0.948) on an internal test set and 0.954 (95% CI 0.934-0.973) on the clinical trial dataset involving 400 cases across three independent medical institutions. Explainability analysis indicated higher contributions from global than local features. Twelve readers' average performance significantly improved by 10.9% (95% CI 8.3%-13.5%) in AUC, 10.0% (95% CI 8.9%-11.1%) in accuracy, 7.6% (95% CI 6.1%-9.2%) in sensitivity, and 12.6% (95% CI 10.9%-14.3%) in specificity (P<0.001 for all). Nodule-level inter-reader diagnostic consistency improved from fair to moderate (overall k: 0.313 vs. 0.421; P=0.019). In conclusion, DeepFAN effectively assisted junior radiologists and may help homogenize diagnostic quality and reduce unnecessary follow-up of indeterminate pulmonary nodules. Chinese Clinical Trial Registry: ChiCTR2400084624.
Submitted 26 March, 2026;
originally announced March 2026.
-
Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale
Authors:
Yicheng Zou,
Dongsheng Zhu,
Lin Zhu,
Tong Zhu,
Yunhua Zhou,
Peiheng Zhou,
Xinyu Zhou,
Dongzhan Zhou,
Zhiwang Zhou,
Yuhao Zhou,
Bowen Zhou,
Zhanping Zhong,
Zhijie Zhong,
Haiteng Zhao,
Penghao Zhao,
Xiaomeng Zhao,
Zhiyuan Zhao,
Yechen Zhang,
Jin Zhang,
Wenwei Zhang,
Hongjie Zhang,
Zhuo Zhang,
Wenlong Zhang,
Bo Zhang,
Chao Zhang
, et al. (152 additional authors not shown)
Abstract:
We introduce Intern-S1-Pro, the first one-trillion-parameter scientific multimodal foundation model. Scaling to this unprecedented size, the model delivers a comprehensive enhancement across both general and scientific domains. Beyond stronger reasoning and image-text understanding capabilities, its intelligence is augmented with advanced agent capabilities. Simultaneously, its scientific expertise has been vastly expanded to master over 100 specialized tasks across critical science fields, including chemistry, materials, life sciences, and earth sciences. Achieving this massive scale is made possible by the robust infrastructure support of XTuner and LMDeploy, which facilitates highly efficient Reinforcement Learning (RL) training at the 1-trillion parameter level while ensuring strict precision consistency between training and inference. By seamlessly integrating these advancements, Intern-S1-Pro further fortifies the fusion of general and specialized intelligence, working as a Specializable Generalist, demonstrating its position in the top tier of open-source models for general capabilities, while outperforming proprietary models in the depth of specialized scientific tasks.
Submitted 2 April, 2026; v1 submitted 26 March, 2026;
originally announced March 2026.
-
GDPO-Listener: Expressive Interactive Head Generation via Auto-Regressive Flow Matching and Group reward-Decoupled Policy Optimization
Authors:
Zhangyu Jin,
Maksim Siniukov,
Deuksin Kwon,
Ashutosh Chaubey,
Mohammad Soleymani
Abstract:
Generating realistic 3D head motion for dyadic interactions is a significant challenge in virtual human synthesis. While recent methods achieve impressive results with speaking heads, they frequently suffer from the "Regression-to-the-Mean" problem in listener motions, collapsing into static faces, and lack the parameter space for complex nonverbal motions. In this paper, we propose GDPO-Listener, a novel framework that achieves highly expressive speaking and listening motion generation. First, we introduce an Auto-Regressive Flow Matching architecture enabling stable supervised learning. Second, to overcome kinematic stillness, we apply Group reward-Decoupled Policy Optimization (GDPO). By isolating reward normalization across distinct FLAME parameter groups, GDPO explicitly incentivizes high-variance, expressive generations. Finally, we enable explicit semantic text control for customizable responses. Extensive evaluations across the Seamless Interaction and DualTalk datasets demonstrate superior performance compared to existing baselines on long-term kinematic variance, visual expressivity, and semantic controllability.
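As a rough illustration of the group-decoupled normalization described in this abstract, the sketch below normalizes rewards separately within each FLAME parameter group before combining them into advantages; the group names, reward values, and summation step are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def gdpo_decoupled_advantages(rewards_by_group):
    """Normalize rewards separately within each FLAME parameter group.

    rewards_by_group: dict mapping a group name (e.g. 'expression',
    'head_pose') to an array of per-sample rewards for that group.
    Returns per-sample advantages summed over groups, so that no single
    group's reward scale dominates the policy update.
    """
    advantages = None
    for group, r in rewards_by_group.items():
        r = np.asarray(r, dtype=np.float64)
        # Decoupled normalization: each group gets its own mean and std.
        a = (r - r.mean()) / (r.std() + 1e-8)
        advantages = a if advantages is None else advantages + a
    return advantages

# Illustrative usage with made-up reward values for two parameter groups.
adv = gdpo_decoupled_advantages({
    "expression": [0.2, 0.9, 0.4, 0.7],
    "head_pose":  [5.0, 4.0, 9.0, 1.0],  # different scale, normalized separately
})
print(adv)
```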
Submitted 26 March, 2026;
originally announced March 2026.
-
CLT-Forge: A Scalable Library for Cross-Layer Transcoders and Attribution Graphs
Authors:
Florent Draye,
Abir Harrasse,
Vedant Palit,
Tung-Yu Wu,
Jiarui Liu,
Punya Syon Pandey,
Roderick Wu,
Terry Jingchen Zhang,
Zhijing Jin,
Bernhard Schölkopf
Abstract:
Mechanistic interpretability seeks to understand how Large Language Models (LLMs) represent and process information. Recent approaches based on dictionary learning and transcoders enable representing model computation in terms of sparse, interpretable features and their interactions, giving rise to feature attribution graphs. However, these graphs are often large and redundant, limiting their interpretability in practice. Cross-Layer Transcoders (CLTs) address this issue by sharing features across layers while preserving layer-specific decoding, yielding more compact representations, but remain difficult to train and analyze at scale. We introduce an open-source library for end-to-end training and interpretability of CLTs. Our framework integrates scalable distributed training with model sharding and compressed activation caching, a unified automated interpretability pipeline for feature analysis and explanation, attribution graph computation using Circuit-Tracer, and a flexible visualization interface. This provides a practical and unified solution for scaling CLT-based mechanistic interpretability. Our code is available at: https://github.com/LLM-Interp/CLT-Forge.
Submitted 21 March, 2026;
originally announced March 2026.
-
ARIADNE: A Perception-Reasoning Synergy Framework for Trustworthy Coronary Angiography Analysis
Authors:
Zhan Jin,
Yu Luo,
Yizhou Zhang,
Ziyang Cui,
Yuqing Wei,
Xianchao Liu,
Xueying Zeng,
Qing Zhang
Abstract:
Conventional pixel-wise loss functions fail to enforce topological constraints in coronary vessel segmentation, producing fragmented vascular trees despite high pixel-level accuracy. We present ARIADNE, a two-stage framework coupling preference-aligned perception with RL-based diagnostic reasoning for topologically coherent stenosis detection. The perception module employs DPO to fine-tune the Sa2VA vision-language foundation model using Betti number constraints as preference signals, aligning the policy toward geometrically complete vessel structures rather than pixel-wise overlap metrics. The reasoning module formulates stenosis localization as a Markov Decision Process with an explicit rejection mechanism that autonomously defers ambiguous anatomical candidates such as bifurcations and vessel crossings, shifting from coverage maximization to reliability optimization. On 1,400 clinical angiograms, ARIADNE achieves a state-of-the-art centerline Dice of 0.838 and reduces false positives by 41% compared to geometric baselines. External validation on the multi-center benchmarks ARCADE and XCAD confirms generalization across acquisition protocols. This represents the first application of DPO for topological alignment in medical imaging, demonstrating that preference-based learning over structural constraints mitigates topological violations while maintaining diagnostic sensitivity in interventional cardiology workflows.
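One way to picture Betti-number-based preference signals is to rank candidate vessel masks by how close their connected-component count (Betti-0) is to a single coronary tree. The sketch below does this with scipy; the toy masks, the expected component count, and the pairing logic are assumptions for illustration, not the paper's exact construction.

```python
import numpy as np
from scipy import ndimage

def betti_0(mask):
    """Number of connected components (Betti-0) of a binary segmentation mask."""
    _, n_components = ndimage.label(mask)
    return n_components

def preference_pair(candidate_a, candidate_b, expected_components=1):
    """Rank two candidate vessel masks by topological completeness.

    The candidate whose component count is closer to the expected number
    (a single connected coronary tree) is treated as 'chosen' and the other
    as 'rejected', which is the kind of preference signal DPO consumes.
    """
    err_a = abs(betti_0(candidate_a) - expected_components)
    err_b = abs(betti_0(candidate_b) - expected_components)
    return ("a", "b") if err_a <= err_b else ("b", "a")

# Toy 2D masks: one fragmented (two blobs), one connected.
fragmented = np.zeros((8, 8), dtype=int)
fragmented[1:3, 1:3] = 1
fragmented[5:7, 5:7] = 1
connected = np.zeros((8, 8), dtype=int)
connected[1:7, 3:5] = 1
print(preference_pair(connected, fragmented))  # ('a', 'b')
```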
Submitted 19 March, 2026;
originally announced March 2026.
-
ProRL Agent: Rollout-as-a-Service for RL Training of Multi-Turn LLM Agents
Authors:
Hao Zhang,
Mingjie Liu,
Shaokun Zhang,
Songyang Han,
Jian Hu,
Zhenghui Jin,
Yuchi Zhang,
Shizhe Diao,
Ximing Lu,
Binfeng Xu,
Zhiding Yu,
Jan Kautz,
Yi Dong
Abstract:
Multi-turn LLM agents are increasingly important for solving complex, interactive tasks, and reinforcement learning (RL) is a key ingredient for improving their long-horizon behavior. However, RL training requires generating large numbers of sandboxed rollout trajectories, and existing infrastructures often couple rollout orchestration with the training loop, making systems hard to migrate and maintain. Under the rollout-as-a-service philosophy, we present ProRL Agent, a scalable infrastructure that serves the full agentic rollout lifecycle through an API service. ProRL Agent also provides standardized and extensible sandbox environments that support diverse agentic tasks in rootless HPC settings. We validate ProRL Agent through RL training on software engineering, math, STEM, and coding tasks. ProRL Agent is open-sourced and integrated as part of NVIDIA NeMo Gym.
Submitted 19 March, 2026;
originally announced March 2026.
-
An Extended T-A Formulation Based on Potential-Chain Recursion for Electromagnetic Modeling of Parallel-Wound No-Insulation HTS Coils
Authors:
Zhe Pan,
Qi Xu,
Ruixiang Wang,
Zhenghao Jin,
Jianzhao Geng
Abstract:
Parallel-wound no-insulation (PW-NI) high-temperature superconducting (HTS) coils significantly reduce charging delay while maintaining excellent self-protection capability, demonstrating great potential for high-field applications. Existing models that couple the T-A formulation with equivalent circuits have demonstrated high accuracy in electromagnetic analysis of PW-NI coils. However, eliminating the computational overhead caused by frequent variable mapping and data exchange between electromagnetic and circuit modules is important for improving computational efficiency, particularly in long-duration transient simulations of large-scale magnets. To address this issue, an extended T-A formulation based on potential-chain recursion, termed PCR-TA, is proposed. By directly embedding inter-tape current sharing and radial current bypass behaviors into the finite-element framework, this method computes the transient electromagnetic response of PW-NI coils without requiring an explicit equivalent circuit model. Building upon it, a multi-scale approach is further developed for large-scale PW-NI coils. The validity of the proposed method and its multi-scale extension is verified through comparisons with experimental measurements and field-circuit coupled modeling results. Comparative analyses demonstrate that the PCR-TA method achieves a speedup of approximately 2.4 over the field-circuit coupled method, whereas its multi-scale extension further increases this speedup to roughly 5.8. Furthermore, the PCR-TA method is extended to model the continuous transition of PW-NI coils from power-supply charging to closed-loop operation. This work provides an efficient method and tool for the electromagnetic modeling of PW-NI coils under both driven and closed-loop operating conditions.
Submitted 18 March, 2026;
originally announced March 2026.
-
TDMM-LM: Bridging Facial Understanding and Animation via Language Models
Authors:
Luchuan Song,
Pinxin Liu,
Haiyang Liu,
Zhenchao Jin,
Yolo Yunlong Tang,
Zichong Xu,
Susan Liang,
Jing Bi,
Jason J Corso,
Chenliang Xu
Abstract:
Text-guided human body animation has advanced rapidly, yet facial animation lags due to the scarcity of well-annotated, text-paired facial corpora. To close this gap, we leverage foundation generative models to synthesize a large, balanced corpus of facial behavior. We design a prompt suite covering emotions and head motions, generate about 80 hours of facial videos with multiple generators, and fit per-frame 3D facial parameters, yielding large-scale (prompt and parameter) pairs for training. Building on this dataset, we probe language models for bidirectional competence over facial motion via two complementary tasks: (1) Motion2Language: given a sequence of 3D facial parameters, the model produces natural-language descriptions capturing content, style, and dynamics; and (2) Language2Motion: given a prompt, the model synthesizes the corresponding sequence of 3D facial parameters via quantized motion tokens for downstream animation. Extensive experiments show that in this setting language models can both interpret and synthesize facial motion with strong generalization. To the best of our knowledge, this is the first work to cast facial-parameter modeling as a language problem, establishing a unified path for text-conditioned facial animation and motion understanding.
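The quantized motion tokens mentioned above can be thought of as nearest-codebook assignments of per-frame facial parameters; the minimal sketch below shows that lookup, with the codebook size and parameter dimensionality chosen arbitrarily rather than taken from the paper.

```python
import numpy as np

def quantize_motion(frames, codebook):
    """Map each frame of 3D facial parameters to its nearest codebook entry.

    frames:   (T, D) array of per-frame facial parameters.
    codebook: (K, D) array of learned code vectors.
    Returns (T,) integer token ids that a language model can consume.
    """
    # Squared Euclidean distance from every frame to every code vector.
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 50))   # hypothetical 512-entry codebook
frames = rng.normal(size=(120, 50))     # 120 frames of 50-dim parameters
tokens = quantize_motion(frames, codebook)
print(tokens[:10])
```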
Submitted 14 March, 2026;
originally announced March 2026.
-
Design of Transit Networks: Global Optimization of Continuous Approximation Models via Geometric Programming
Authors:
Haoyang Mao,
Weihua Gu,
Wenbo Fan,
Zhicheng Jin,
Xiaokuan Zhao
Abstract:
Continuous approximation (CA) models have been widely adopted in transit network design studies due to their strong analytical tractability and high computational efficiency. However, such models are typically formulated as nonconvex optimization problems, and existing solution approaches mainly rely on iterative algorithms that exploit first-order optimality information or nonlinear programming solvers, whose solution quality lacks stability guarantees under complex demand conditions. This paper proposes a geometric programming (GP)-based CA method for transit network design, which can be efficiently solved to global optimality. Numerical experiments are conducted on both homogeneous and heterogeneous network settings to evaluate the effectiveness of the proposed approach. Comprehensive tests are performed under the combinations of six heterogeneous demand distributions, four levels of total passenger demand, and three value-of-time parameters. The results indicate that the GP approach consistently outperforms the coordinate descent method across all test cases, achieving cost reductions of approximately 1%-4%, even when the latter converges to identical solutions under different initializations. In comparison, nonlinear programming solvers, with fmincon as a representative example, are able to obtain globally optimal solutions comparable to those of the GP approach in low-demand heterogeneous networks; however, their performance becomes unstable under high-demand conditions. These findings demonstrate that GP provides an efficient and robust optimization framework for solving CA-based transit network design problems, especially in high-demand and highly heterogeneous network environments.
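For readers unfamiliar with geometric programming, the toy sketch below minimizes a posynomial cost of stop spacing and headway in cvxpy's log-log convex (gp=True) mode; the cost terms, coefficients, and bounds are made-up stand-ins for a continuous-approximation model, not the formulation used in the paper.

```python
import cvxpy as cp

# Positive decision variables: stop spacing s (km) and headway h (hours).
s = cp.Variable(pos=True)
h = cp.Variable(pos=True)

# Illustrative continuous-approximation cost (all coefficients made up):
# walking access grows with spacing, waiting grows with headway, and
# agency operating cost scales with 1/(s*h); the sum is a posynomial.
access = 3.0 * s
waiting = 6.0 * h
agency = 2.0 * cp.power(s, -1) * cp.power(h, -1)
cost = access + waiting + agency

constraints = [s <= 2.0, h >= 1.0 / 12]          # simple monomial bounds
prob = cp.Problem(cp.Minimize(cost), constraints)
prob.solve(gp=True)                              # geometric-programming mode
print(round(s.value, 3), round(h.value, 3), round(prob.value, 3))
```

Because every term is a monomial with positive coefficients, the problem is globally solvable after the log-log transform, which is the property the abstract relies on.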
Submitted 17 March, 2026;
originally announced March 2026.
-
Taming the expressiveness of neural-network wave functions for robust convergence to quantum many-body states
Authors:
Dezhe Z. Jin
Abstract:
Neural networks are emerging as a powerful tool for determining the quantum states of interacting many-body fermionic systems. The standard approach trains a neural-network ansatz by minimizing the mean local energy estimated from Monte Carlo samples. However, this typically results in large sample-to-sample fluctuations in the estimated mean energy and thus slow convergence of the energy minimization. We propose that minimizing a logarithmically compressed variance of the local energies can dramatically improve convergence. Moreover, this loss function can be adapted to systematically obtain the energy spectrum across multiple runs. We demonstrate these ideas for spin-1/2 particles in a 2D harmonic trap with attractive Pöschl-Teller interactions between opposite spins.
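A minimal sketch of the kind of loss described here, assuming a simple log(1 + Var) compression of the Monte Carlo local energies; the exact compression and sampling details used in the paper may differ.

```python
import numpy as np

def log_compressed_variance(local_energies):
    """Loss that compresses the spread of Monte Carlo local energies.

    Instead of minimizing the mean local energy directly, minimize a
    logarithmically compressed variance, which damps the influence of
    outlier samples and can stabilize convergence. log(1 + Var) is one
    simple choice of compression, used here for illustration.
    """
    e = np.asarray(local_energies, dtype=np.float64)
    return np.log1p(np.var(e))

samples = np.random.default_rng(1).normal(loc=-2.0, scale=0.5, size=4096)
print(log_compressed_variance(samples))
```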
Submitted 31 March, 2026; v1 submitted 16 March, 2026;
originally announced March 2026.
-
Systematically Improvable Numerical Atomic Orbital Basis Using Contracted Truncated Spherical Waves
Authors:
Yike Huang,
Zuxin Jin,
Linfeng Zhang,
Mohan Chen,
Rui Chen,
Ling Li
Abstract:
To solve the Kohn-Sham equation within the framework of density functional theory, we develop a scheme to construct numerical atomic orbital (NAO) basis sets by contracting truncated spherical waves (TSWs). The contraction minimizes the trace of the kinetic operator in the residual space, generalizing the spillage minimizing scheme [M. Chen et al., J. Phys. Condens. Matter 22, 445501 (2010); P. Lin et al., Phys. Rev. B 103, 235131 (2021)]. In addition to the systematic improvability inherited from previous schemes, the use of TSW instead of plane waves as the expansion basis bridges reference states and NAOs more effectively, and eliminates spurious interactions between periodic images, thereby enabling better transferability through the inclusion of extensive reference states. Benchmarks demonstrate that the constructed NAO achieves satisfactory precision for various properties of both molecules and bulk systems, including total energy, bond length, atomization energy, lattice constant, cohesive energy, band gap, and energy-level alignment. By incorporating unoccupied states, the improved transferability in describing the conduction band is demonstrated to be effective and substantial.
Submitted 9 April, 2026; v1 submitted 14 March, 2026;
originally announced March 2026.
-
Developing the PsyCogMetrics AI Lab to Evaluate Large Language Models and Advance Cognitive Science -- A Three-Cycle Action Design Science Study
Authors:
Zhiye Jin,
Yibai Li,
K. D. Joshi,
Xuefei Deng,
Xiaobing Li
Abstract:
This study presents the development of the PsyCogMetrics AI Lab (psycogmetrics.ai), an integrated, cloud-based platform that operationalizes psychometric and cognitive-science methodologies for Large Language Model (LLM) evaluation. Framed as a three-cycle Action Design Science study, the Relevance Cycle identifies key limitations in current evaluation methods and unfulfilled stakeholder needs. The Rigor Cycle draws on kernel theories such as Popperian falsifiability, Classical Test Theory, and Cognitive Load Theory to derive deductive design objectives. The Design Cycle operationalizes these objectives through nested Build-Intervene-Evaluate loops. The study contributes a novel IT artifact, a validated design for LLM evaluation, benefiting research at the intersection of AI, psychology, cognitive science, and the social and behavioral sciences.
Submitted 13 March, 2026;
originally announced March 2026.
-
Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models
Authors:
Lu Wang,
Zhuoran Jin,
Yupu Hao,
Yubo Chen,
Kang Liu,
Yulong Ao,
Jun Zhao
Abstract:
Multimodal large language models (MLLMs) have shown strong performance on offline video understanding, but most are limited to offline inference or have weak online reasoning, making multi-turn interaction over continuously arriving video streams difficult. Existing streaming methods typically use an interleaved perception-generation paradigm, which prevents concurrent perception and generation and leads to early memory decay as streams grow, hurting long-range dependency modeling. We propose Think While Watching, a memory-anchored streaming video reasoning framework that preserves continuous segment-level memory during multi-turn interaction. We build a three-stage, multi-round chain-of-thought dataset and adopt a stage-matched training strategy, while enforcing strict causality through a segment-level streaming causal mask and streaming positional encoding. During inference, we introduce an efficient pipeline that overlaps watching and thinking and adaptively selects the best attention backend. Under both single-round and multi-round streaming input protocols, our method achieves strong results. Built on Qwen3-VL, it improves single-round accuracy by 2.6% on StreamingBench and by 3.79% on OVO-Bench. In the multi-round setting, it maintains performance while reducing output tokens by 56%. Code is available at: https://github.com/wl666hhh/Think_While_Watching/
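The segment-level streaming causal mask can be pictured as a block mask in which a token may attend only to tokens from segments that have already arrived. The sketch below builds such a mask; the within-segment visibility rule and the segment sizes are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def segment_causal_mask(segment_ids):
    """Build a segment-level streaming causal attention mask.

    segment_ids: 1-D array assigning each token to a stream segment, in
    arrival order. Token i may attend to token j only if j's segment
    arrived no later than i's, so generation never peeks at future
    stream segments.
    """
    seg = np.asarray(segment_ids)
    return seg[None, :] <= seg[:, None]  # mask[i, j] True means j is visible to i

# Three stream segments of lengths 4, 3, and 5 tokens.
ids = np.repeat([0, 1, 2], [4, 3, 5])
mask = segment_causal_mask(ids)
print(mask.astype(int))
```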
Submitted 12 March, 2026;
originally announced March 2026.
-
Counterweights and Complementarities: The Convergence of AI and Blockchain Powering a Decentralized Future
Authors:
Yibai Li,
Zhiye Jin,
Xiaobing Li,
K. D. Joshi,
Xuefei Deng
Abstract:
This editorial addresses the critical intersection of artificial intelligence (AI) and blockchain technologies, highlighting their contrasting tendencies toward centralization and decentralization, respectively. While AI, particularly with the rise of large language models (LLMs), exhibits a strong centralizing force due to data and resource monopolization by large corporations, blockchain offers a counterbalancing mechanism through its inherent decentralization, transparency, and security. The editorial argues that these technologies are not mutually exclusive but possess complementary strengths. Blockchain can mitigate AI's centralizing risks by enabling decentralized data management, computation, and governance, promoting greater inclusivity, transparency, and user privacy. Conversely, AI can enhance blockchain's efficiency and security through automated smart contract management, content curation, and threat detection. The core argument calls for the development of "decentralized intelligence" (DI), an interdisciplinary research area focused on creating intelligent systems that function without centralized control.
Submitted 11 March, 2026;
originally announced March 2026.
-
AI Psychometrics: Evaluating the Psychological Reasoning of Large Language Models with Psychometric Validities
Authors:
Yibai Li,
Xiaolin Lin,
Zhenghui Sha,
Zhiye Jin,
Xiaobing Li
Abstract:
The immense number of parameters and deep neural networks make large language models (LLMs) rival the complexity of human brains, which also makes them opaque "black box" systems that are challenging to evaluate and interpret. AI Psychometrics is an emerging field that aims to tackle these challenges by applying psychometric methodologies to evaluate and interpret the psychological traits and processes of artificial intelligence (AI) systems. This paper investigates the application of AI Psychometrics to evaluate the psychological reasoning and overall psychometric validity of four prominent LLMs: GPT-3.5, GPT-4, LLaMA-2, and LLaMA-3. Using the Technology Acceptance Model (TAM), we examined convergent, discriminant, predictive, and external validity across these models. Our findings reveal that the responses from all these models generally met all validity criteria. Moreover, higher-performing models like GPT-4 and LLaMA-3 consistently demonstrated superior psychometric validity compared to their predecessors, GPT-3.5 and LLaMA-2. These results help to establish the validity of applying AI Psychometrics to evaluate and interpret large language models.
Submitted 11 March, 2026;
originally announced March 2026.
-
DynamicVGGT: Learning Dynamic Point Maps for 4D Scene Reconstruction in Autonomous Driving
Authors:
Zhuolin He,
Jing Li,
Guanghao Li,
Xiaolei Chen,
Jiacheng Tang,
Siyang Zhang,
Zhounan Jin,
Feipeng Cai,
Bin Li,
Jian Pu,
Jia Cai,
Xiangyang Xue
Abstract:
Dynamic scene reconstruction in autonomous driving remains a fundamental challenge due to significant temporal variations, moving objects, and complex scene dynamics. Existing feed-forward 3D models have demonstrated strong performance in static reconstruction but still struggle to capture dynamic motion. To address these limitations, we propose DynamicVGGT, a unified feed-forward framework that extends VGGT from static 3D perception to dynamic 4D reconstruction. Our goal is to model point motion within feed-forward 3D models in a dynamic and temporally coherent manner. To this end, we jointly predict the current and future point maps within a shared reference coordinate system, allowing the model to implicitly learn dynamic point representations through temporal correspondence. To efficiently capture temporal dependencies, we introduce a Motion-aware Temporal Attention (MTA) module that learns motion continuity. Furthermore, we design a Dynamic 3D Gaussian Splatting Head that explicitly models point motion by predicting Gaussian velocities using learnable motion tokens under scene flow supervision. It refines dynamic geometry through continuous 3D Gaussian optimization. Extensive experiments on autonomous driving datasets demonstrate that DynamicVGGT significantly outperforms existing methods in reconstruction accuracy, achieving robust feed-forward 4D dynamic scene reconstruction under complex driving scenarios.
Submitted 9 March, 2026;
originally announced March 2026.
-
Structure and Progress Aware Diffusion for Medical Image Segmentation
Authors:
Siyuan Song,
Guyue Hu,
Chenglong Li,
Dengdi Sun,
Zhe Jin,
Jin Tang
Abstract:
Medical image segmentation is crucial for computer-aided diagnosis, which necessitates understanding both coarse morphological and semantic structures, as well as carving fine boundaries. The morphological and semantic structures in medical images are beneficial and stable clues for target understanding, whereas the fine boundaries of medical targets (like tumors and lesions) are usually ambiguous and noisy, owing to lesion overlap, annotation uncertainty, and other factors, making them unreliable as early supervision. However, existing methods simultaneously learn coarse structures and fine boundaries throughout the training process. In this paper, we propose a structure and progress-aware diffusion (SPAD) for medical image segmentation, which consists of a semantic-concentrated diffusion (ScD) and a boundary-centralized diffusion (BcD) modulated by a progress-aware scheduler (PaS). Specifically, the semantic-concentrated diffusion introduces anchor-preserved target perturbation, which perturbs pixels within a medical target but preserves unaltered areas as semantic anchors, encouraging the model to infer noisy target areas from the surrounding semantic context. The boundary-centralized diffusion introduces progress-aware boundary noise, which blurs unreliable and ambiguous boundaries, thus compelling the model to focus on coarse but stable anatomical morphology and global semantics. Furthermore, the progress-aware scheduler gradually modulates the noise intensity of the ScD and BcD, forming a coarse-to-fine diffusion paradigm that encourages focusing on coarse morphological and semantic structures during early target-understanding stages and gradually shifting to fine target boundaries during later contour-adjusting stages.
Submitted 8 March, 2026;
originally announced March 2026.
-
Multi-View Based Audio Visual Target Speaker Extraction
Authors:
Peijun Yang,
Zhan Jin,
Juan Liu,
Ming Li
Abstract:
Audio-Visual Target Speaker Extraction (AVTSE) aims to separate a target speaker's voice from a mixed audio signal using the corresponding visual cues. Most existing AVTSE methods rely exclusively on frontal-view videos, which restricts their robustness in real-world scenarios where non-frontal views are prevalent. Such visual perspectives often contain complementary articulatory information that could enhance speech extraction. In this work, we propose Multi-View Tensor Fusion (MVTF), a novel framework that transforms multi-view learning into single-view performance gains. During the training stage, we leverage synchronized multi-perspective lip videos to learn cross-view correlations through MVTF, where pairwise outer products explicitly model multiplicative interactions between different views of input lip embeddings. At the inference stage, the system supports both single-view and multi-view inputs. Experimental results show that with single-view inputs, our framework leverages multi-view knowledge to achieve significant performance gains, while in the multi-view mode it further improves overall performance and enhances robustness. Our demo, code and data are available at https://anonymous.4open.science/w/MVTF-Gridnet-209C/
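A minimal sketch of pairwise outer-product fusion between two view embeddings, assuming per-frame lip embeddings of equal dimension; the tensor shapes and the flattening step are illustrative choices, not the paper's exact architecture.

```python
import torch

def outer_product_fusion(view_a, view_b):
    """Fuse two per-frame lip embeddings via their outer product.

    view_a, view_b: (T, D) embeddings from two camera views.
    Returns a (T, D*D) tensor whose entries are all pairwise products
    a_i * b_j, explicitly modeling multiplicative cross-view interactions.
    """
    outer = torch.einsum("td,te->tde", view_a, view_b)
    return outer.flatten(start_dim=1)

frontal = torch.randn(75, 32)   # 75 frames, 32-dim lip embedding (assumed sizes)
profile = torch.randn(75, 32)
fused = outer_product_fusion(frontal, profile)
print(fused.shape)              # torch.Size([75, 1024])
```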
Submitted 10 March, 2026; v1 submitted 8 March, 2026;
originally announced March 2026.
-
SuperSuit: An Isomorphic Bimodal Interface for Scalable Mobile Manipulation
Authors:
Tongqing Chen,
Hang Wu,
Jiasen Wang,
Xiaotao Li,
Zhu Jin,
Lu Fang
Abstract:
High-quality, long-horizon demonstrations are essential for embodied AI, yet acquiring such data for tightly coupled wheeled mobile manipulators remains a fundamental bottleneck. Unlike fixed-base systems, mobile manipulators require continuous coordination between SE(2) locomotion and precise manipulation, exposing limitations in existing teleoperation and wearable interfaces. We present SuperSuit, a bimodal data acquisition framework that supports both robot-in-the-loop teleoperation and active demonstration under a shared kinematic interface. Both modalities produce structurally identical joint-space trajectories, enabling direct data mixing without modifying downstream policies. For locomotion, SuperSuit maps natural human stepping to continuous planar base velocities, eliminating discrete command switches. For manipulation, it employs a strictly isomorphic wearable arm in both modes, while policy training is formulated in a shift-invariant delta-joint representation to mitigate calibration offsets and structural compliance without inverse kinematics. Real-world experiments on long-horizon mobile manipulation tasks show 2.6x higher demonstration throughput in active mode compared to a teleoperation baseline, comparable policy performance when substituting teleoperation data with active demonstrations at fixed dataset size, and monotonic performance improvement as active data volume increases. These results indicate that consistent kinematic representations across collection modalities enable scalable data acquisition for long-horizon mobile manipulation.
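The shift-invariant delta-joint idea can be sketched as simple first differencing of joint trajectories, which cancels any constant calibration offset between the wearable and the robot; the joint count and the replay helper below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def to_delta_joints(joint_trajectory):
    """Convert absolute joint angles to per-step deltas.

    joint_trajectory: (T, J) joint angles over time. Differencing removes
    any constant calibration offset between the wearable interface and the
    robot, so a policy trained on deltas is shift-invariant by construction.
    """
    return np.diff(joint_trajectory, axis=0)

def apply_deltas(initial_joints, deltas):
    """Replay a delta trajectory from the robot's current joint state."""
    return initial_joints + np.cumsum(deltas, axis=0)

traj = np.cumsum(np.random.default_rng(0).normal(size=(50, 7)) * 0.01, axis=0)
offset_traj = traj + 0.2   # same motion, shifted by a constant calibration offset
assert np.allclose(to_delta_joints(traj), to_delta_joints(offset_traj))
```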
Submitted 6 March, 2026;
originally announced March 2026.
-
RepoLaunch: Automating Build&Test Pipeline of Code Repositories on ANY Language and ANY Platform
Authors:
Kenan Li,
Rongzhi Li,
Linghao Zhang,
Qirui Jin,
Liao Zhu,
Xiaosong Huang,
Geng Zhang,
Yikai Zhang,
Shilin He,
Chengxing Xie,
Xin Zhang,
Zijian Jin,
Bowen Li,
Chaoyun Zhang,
Yu Kang,
Yufan Huang,
Elsie Nallipogu,
Saravan Rajmohan,
Qingwei Lin,
Dongmei Zhang
Abstract:
Building software repositories typically requires significant manual effort. Recent advances in large language model (LLM) agents have accelerated automation in software engineering (SWE). We introduce RepoLaunch, the first agent capable of automatically resolving dependencies, compiling source code, and extracting test results for repositories across arbitrary programming languages and operating systems. To demonstrate its utility, we further propose a fully automated pipeline for SWE dataset creation, where task design is the only human intervention. RepoLaunch automates the remaining steps, enabling scalable benchmarking and training of coding agents and LLMs. Notably, several works on agentic benchmarking and training have recently adopted RepoLaunch for automated task generation.
Submitted 5 March, 2026;
originally announced March 2026.
-
When Do Language Models Endorse Limitations on Human Rights Principles?
Authors:
Keenan Samway,
Nicole Miu Takagi,
Rada Mihalcea,
Bernhard Schölkopf,
Ilias Chalkidis,
Daniel Hershcovich,
Zhijing Jin
Abstract:
As Large Language Models (LLMs) increasingly mediate global information access with the potential to shape public discourse, their alignment with universal human rights principles becomes important to ensure that these rights are abided by in high stakes AI-mediated interactions. In this paper, we evaluate how LLMs navigate trade-offs involving the Universal Declaration of Human Rights (UDHR), leveraging 1,152 synthetically generated scenarios across 24 rights articles and eight languages. Our analysis of eleven major LLMs reveals systematic biases where models: (1) accept limiting Economic, Social, and Cultural rights more often than Political and Civil rights, (2) demonstrate significant cross-linguistic variation with elevated endorsement rates of rights-limiting actions in Chinese and Hindi compared to English or Romanian, (3) show substantial susceptibility to prompt-based steering, and (4) exhibit noticeable differences between Likert and open-ended responses, highlighting critical challenges in LLM preference assessment.
Submitted 4 March, 2026;
originally announced March 2026.
-
Type-Aware Retrieval-Augmented Generation with Dependency Closure for Solver-Executable Industrial Optimization Modeling
Authors:
Y. Zhong,
R. Huang,
M. Wang,
Z. Guo,
YC. Li,
M. Yu,
Z. Jin
Abstract:
Automated industrial optimization modeling requires reliable translation of natural-language requirements into solver-executable code. However, large language models often generate non-compilable models due to missing declarations, type inconsistencies, and incomplete dependency contexts. We propose a type-aware retrieval-augmented generation (RAG) method that enforces modeling entity types and minimal dependency closure to ensure executability. Unlike existing RAG approaches that index unstructured text, our method constructs a domain-specific typed knowledge base by parsing heterogeneous sources, such as academic papers and solver code, into typed units and encoding their mathematical dependencies in a knowledge graph. Given a natural-language instruction, it performs hybrid retrieval and computes a minimal dependency-closed context, the smallest set of typed symbols required for solver-executable code, via dependency propagation over the graph. We validate the method on two constraint-intensive industrial cases: demand response optimization in battery production and flexible job shop scheduling. In the first case, our method generates an executable model incorporating demand-response incentives and load-reduction constraints, achieving peak shaving while preserving profitability; conventional RAG baselines fail. In the second case, it consistently produces compilable models that reach known optimal solutions, demonstrating robust cross-domain generalization; baselines fail entirely. Ablation studies confirm that enforcing type-aware dependency closure is essential for avoiding structural hallucinations and ensuring executability, addressing a critical barrier to deploying large language models in complex engineering optimization tasks.
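The minimal dependency closure can be pictured as a transitive traversal over the typed dependency graph, collecting every declaration a requested symbol needs. The sketch below uses a plain breadth-first search with hypothetical symbol names, not the paper's actual knowledge-graph schema.

```python
from collections import deque

def minimal_dependency_closure(requested, depends_on):
    """Smallest set of typed symbols needed to make the requested ones usable.

    requested:  iterable of symbol names retrieved for the instruction.
    depends_on: dict mapping each symbol to the symbols it references
                (e.g. a constraint depending on parameters and variables).
    A breadth-first pass pulls in every declaration transitively required,
    so generated solver code never refers to an undeclared entity.
    """
    closure, queue = set(), deque(requested)
    while queue:
        sym = queue.popleft()
        if sym in closure:
            continue
        closure.add(sym)
        queue.extend(depends_on.get(sym, ()))
    return closure

graph = {
    "load_reduction_constraint": ["reduction_var", "baseline_load_param"],
    "reduction_var": ["time_index_set"],
    "baseline_load_param": ["time_index_set"],
}
print(sorted(minimal_dependency_closure(["load_reduction_constraint"], graph)))
```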
Submitted 3 March, 2026;
originally announced March 2026.
-
A Comparative Study of UMAP and Other Dimensionality Reduction Methods
Authors:
Guanzhe Zhang,
Shanshan Ding,
Zhezhen Jin
Abstract:
Uniform Manifold Approximation and Projection (UMAP) is a widely used manifold learning technique for dimensionality reduction. This paper studies UMAP, supervised UMAP, and several competing dimensionality reduction methods, including Principal Component Analysis (PCA), Kernel PCA, Sliced Inverse Regression (SIR), Kernel SIR, and t-distributed Stochastic Neighbor Embedding, through a comprehensive comparative analysis. Although UMAP has attracted substantial attention for preserving local and global structures, its supervised extensions, particularly for regression settings, remain rather underexplored. We provide a systematic evaluation of supervised UMAP for both regression and classification using simulated and real datasets, with performance assessed via predictive accuracy on low-dimensional embeddings. Our results show that supervised UMAP performs well for classification but exhibits limitations in effectively incorporating response information for regression, highlighting an important direction for future development.
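For context, supervised UMAP in the umap-learn package is invoked by passing the response to fit_transform, as in the sketch below; the dataset and hyperparameters are illustrative defaults rather than the settings used in the study.

```python
import umap                      # umap-learn package
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# Unsupervised embedding: structure driven only by the features.
emb_unsup = umap.UMAP(n_components=2, random_state=0).fit_transform(X)

# Supervised embedding: passing the labels lets UMAP use the response;
# for a continuous response, the package's target_metric option switches
# to regression-style supervision.
emb_sup = umap.UMAP(n_components=2, random_state=0).fit_transform(X, y=y)

print(emb_unsup.shape, emb_sup.shape)
```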
Submitted 1 March, 2026;
originally announced March 2026.
-
MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning
Authors:
Jiachun Li,
Shaoping Huang,
Zhuoran Jin,
Chenlong Zhang,
Pengfei Cao,
Yubo Chen,
Kang Liu,
Jun Zhao
Abstract:
Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has empowered them to address more complex tasks such as scientific analysis and mathematical reasoning. Despite their promise, MLLMs' reasoning abilities across different scenarios in real life remain largely unexplored and lack standardized benchmarks for evaluation. To address this gap, we introduce MMR-Life, a comprehensive benchmark designed to evaluate the diverse multimodal multi-image reasoning capabilities of MLLMs across real-life scenarios. MMR-Life consists of 2,646 multiple-choice questions based on 19,108 images primarily sourced from real-world contexts, comprehensively covering seven reasoning types: abductive, analogical, causal, deductive, inductive, spatial, and temporal. Unlike existing reasoning benchmarks, MMR-Life does not rely on domain-specific expertise but instead requires models to integrate information across multiple images and apply diverse reasoning abilities. The evaluation of 37 advanced models highlights the substantial challenge posed by MMR-Life. Even top models like GPT-5 achieve only 58% accuracy and display considerable variance in performance across reasoning types. Moreover, we analyze the reasoning paradigms of existing MLLMs, exploring how factors such as thinking length, reasoning method, and reasoning type affect their performance. In summary, MMR-Life establishes a comprehensive foundation for evaluating, analyzing, and improving the next generation of multimodal reasoning systems.
Submitted 2 March, 2026;
originally announced March 2026.