-
Beyond the Final Actor: Modeling the Dual Roles of Creator and Editor for Fine-Grained LLM-Generated Text Detection
Authors:
Yang Li,
Qiang Sheng,
Zhengjia Wang,
Yehan Yang,
Danding Wang,
Juan Cao
Abstract:
The misuse of large language models (LLMs) requires precise detection of synthetic text. Existing works mainly follow binary or ternary classification settings, which can at best distinguish pure human/LLM text from collaborative text. This remains insufficient for nuanced regulation, as LLM-polished human text and humanized LLM text often trigger different policy consequences. In this paper, we explore fine-grained LLM-generated text detection under a rigorous four-class setting. To handle such complexities, we propose RACE (Rhetorical Analysis for Creator-Editor Modeling), a fine-grained detection method that characterizes the distinct signatures of the creator and the editor. Specifically, RACE utilizes Rhetorical Structure Theory to construct a logic graph for the creator's foundation while extracting Elementary Discourse Unit-level features for the editor's style. Experiments show that RACE outperforms 12 baselines in identifying fine-grained types with low false alarms, offering a policy-aligned solution for LLM regulation.
Submitted 6 April, 2026;
originally announced April 2026.
-
PanLUNA: An Efficient and Robust Query-Unified Multimodal Model for Edge Biosignal Intelligence
Authors:
Marija Zelic,
Anna Tegon,
Yawei Li,
Thorir Mar Ingolfsson,
Luca Benini
Abstract:
Physiological foundation models (FMs) have shown promise for biosignal representation learning, yet most remain confined to a single modality such as EEG, ECG, or PPG, largely because paired multimodal datasets are scarce. In this paper, we present PanLUNA, a compact 5.4M-parameter pan-modal FM that jointly processes EEG, ECG, and PPG within a single shared encoder. Extending LUNA's channel-unification module, PanLUNA treats multimodal channels as entries in a unified query set augmented with sensor-type embeddings, enabling efficient cross-modal early fusion while remaining inherently robust to missing modalities at inference time. Despite its small footprint, PanLUNA matches or exceeds models up to 57$\times$ larger: 81.21% balanced accuracy on TUAB abnormal EEG detection and state-of-the-art 0.7416 balanced accuracy on HMC multimodal sleep staging. Quantization-aware training with INT8 weights recovers $\geq$96% of full-precision performance, and deployment on the GAP9 ultra-low-power RISC-V microcontroller for wearables achieves 325.6 ms latency and 18.8 mJ per 10-second, 12-lead ECG inference, and 1.206 s latency at 68.65 mJ for multimodal 5-channel sleep staging over 30-second epochs.
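The INT8 quantization-aware training result above rests on symmetric integer quantization of weights; a minimal sketch of that primitive (illustrative only, not PanLUNA's actual code):

```python
def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: map floats to [-127, 127]."""
    scale = max(abs(x) for x in w) / 127.0
    q = [max(-127, min(127, round(x / scale))) for x in w]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the INT8 codes."""
    return [v * scale for v in q]

w = [0.5, -1.2, 3.3, 0.0]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Round-trip error is bounded by half a quantization step (scale / 2).
err = max(abs(a - b) for a, b in zip(w, w_hat))
```

Quantization-aware training goes further by simulating this round-trip in the forward pass so the network learns weights that survive it, which is how a recovery of 96% or more of full-precision performance becomes achievable.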
Submitted 5 April, 2026;
originally announced April 2026.
-
NTIRE 2026 3D Restoration and Reconstruction in Real-world Adverse Conditions: RealX3D Challenge Results
Authors:
Shuhong Liu,
Chenyu Bao,
Ziteng Cui,
Xuangeng Chu,
Bin Ren,
Lin Gu,
Xiang Chen,
Mingrui Li,
Long Ma,
Marcos V. Conde,
Radu Timofte,
Yun Liu,
Ryo Umagami,
Tomohiro Hashimoto,
Zijian Hu,
Yuan Gan,
Tianhan Xu,
Yusuke Kurose,
Tatsuya Harada,
Junwei Yuan,
Gengjia Chang,
Xining Ge,
Mache You,
Qida Cao,
Zeliang Li
, et al. (81 additional authors not shown)
Abstract:
This paper presents a comprehensive review of the NTIRE 2026 3D Restoration and Reconstruction (3DRR) Challenge, detailing the proposed methods and results. The challenge seeks to identify reconstruction pipelines that remain robust under real-world adverse conditions, specifically extreme low-light and smoke-degraded environments, as captured by our RealX3D benchmark. A total of 279 participants registered for the competition, of whom 33 teams submitted valid results. We thoroughly evaluate the submitted approaches against state-of-the-art baselines, revealing significant progress in 3D reconstruction under adverse conditions. Our analysis highlights shared design principles among top-performing methods and provides insights into effective strategies for handling 3D scene degradation.
Submitted 5 April, 2026;
originally announced April 2026.
-
When Does Multimodal AI Help? Diagnostic Complementarity of Vision-Language Models and CNNs for Spectrum Management in Satellite-Terrestrial Networks
Authors:
Yuanhang Li
Abstract:
The adoption of vision-language models (VLMs) for wireless network management is accelerating, yet no systematic understanding exists of where these large foundation models outperform lightweight convolutional neural networks (CNNs) for spectrum-related tasks. This paper presents the first diagnostic comparison of VLMs and CNNs for spectrum heatmap understanding in non-terrestrial network and terrestrial network (NTN-TN) cooperative systems. We introduce SpectrumQA, a benchmark comprising 108K visual question-answer pairs across four granularity levels: scene classification (L1), regional reasoning (L2), spatial localization (L3), and semantic reasoning (L4). Our experiments on three NTN-TN scenarios with a frozen Qwen2-VL-7B and a trained ResNet-18 reveal a clear task-dependent complementarity: the CNN achieves 72.9% accuracy at severity classification (L1) and 0.552 IoU at spatial localization (L3), while the VLM uniquely enables semantic reasoning (L4) with F1=0.576 using only three in-context examples, a capability fundamentally absent in CNN architectures. Chain-of-thought (CoT) prompting further improves VLM reasoning by 12.6% (F1: 0.209 -> 0.233) while having zero effect on spatial tasks, confirming that the complementarity is rooted in architectural differences rather than prompting limitations. A deterministic task-type router that delegates supervised tasks to the CNN and reasoning tasks to the VLM achieves a composite score of 0.616, a 39.1% improvement over the CNN alone. We further show that VLM representations exhibit stronger cross-scenario robustness, with smaller performance degradation in 5 out of 6 transfer directions. These findings provide actionable guidelines: deploy CNNs for spatial localization and VLMs for semantic spectrum reasoning, rather than treating them as substitutes.
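The deterministic task-type router described in the abstract amounts to a fixed lookup from benchmark granularity level to backend. A minimal sketch, where the L2 assignment and the fallback rule are our illustrative assumptions rather than the paper's exact table:

```python
# Delegate supervised perception tasks to the CNN and open-ended
# reasoning tasks to the VLM, keyed by the SpectrumQA granularity level.
ROUTES = {
    "L1": "cnn",  # scene/severity classification
    "L2": "cnn",  # regional reasoning (assumed supervised here)
    "L3": "cnn",  # spatial localization
    "L4": "vlm",  # semantic reasoning via in-context examples
}

def route(task_level):
    # Unknown task types fall back to the more general VLM backend.
    return ROUTES.get(task_level, "vlm")
```

Because the routing is deterministic, the composite system needs no learned gating network, which keeps the router itself free of training cost.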
Submitted 4 April, 2026;
originally announced April 2026.
-
A Multimodal Foundation Model of Spatial Transcriptomics and Histology for Biological Discovery and Clinical Prediction
Authors:
Jinxi Xiang,
Siyu Hou,
Yuchen Li,
Ryan Quinton,
Xiaoming Zhang,
Feyisope Eweje,
Xiangde Luo,
Yijiang Chen,
Zhe Li,
Colin Bergstrom,
Ted Kim,
Sierra Willens,
Francesca Maria Olguin,
Matthew Abikenari,
Andrew Heider,
Sanjeeth Rajaram,
Joel Neal,
Maximilian Diehn,
Xiang Zhou,
Ruijiang Li
Abstract:
Spatial transcriptomics (ST) enables gene expression mapping within anatomical context but remains costly and low-throughput. Hematoxylin and eosin (H&E) staining offers rich morphology yet lacks molecular resolution. We present STORM (Spatial Transcriptomics and histOlogy Representation Model), a foundation model trained on 1.2 million spatially resolved transcriptomic profiles with matched histology across 18 organs. Using a hierarchical architecture integrating morphological features, gene expression, and spatial context, STORM bridges imaging and omics through robust molecular-morphological representations. STORM enhances spatial domain discovery, producing biologically coherent tissue maps, and outperforms existing methods in predicting spatial gene expression from H&E images across 11 tumor types. The model is platform-agnostic, performing consistently across Visium, Xenium, Visium HD, and CosMx. Applied to 23 independent cohorts comprising 7,245 patients, STORM significantly improves immunotherapy response prediction and prognostication over established biomarkers, providing a scalable framework for spatially informed discovery and clinical precision medicine.
Submitted 4 April, 2026;
originally announced April 2026.
-
Efficient Solving for Dynamic Data Structure Constraint Satisfaction Problem
Authors:
Nanbing Li,
Weijie Peng,
Jin Luo,
Shuai Wang,
Yihui Li,
Jun Fang,
Yun Liang
Abstract:
Functional verification plays a central role in ensuring the correctness of modern integrated circuit designs, where constrained-random verification is widely adopted to generate diverse stimuli under high-level constraints. In industrial verification environments, constraint solving increasingly involves dynamic data structures whose shape and content are determined at runtime, causing the sets of variables and constraint instances to evolve across solver invocations, which in turn leads to substantial overhead when nested and high-dimensional structures repeatedly expand across solves. We formalize this class of problems as the Dynamic Data Structure Constraint Satisfaction Problem (D2SCSP), which captures the interaction between dynamic data structure expansion and constraint evaluation. We propose a dependency-guided problem partitioning framework combined with an incremental encoding and constraint activation mechanism, enabling reuse of solver state and encodings across multiple solves. The framework is integrated into an industrial SystemVerilog verification flow and implemented in the commercial simulator VeriSim. Experimental results on industrial benchmarks demonstrate significant performance improvements, achieving an average speedup of 24.80x over a baseline and 1.72x over a state-of-the-art commercial simulator, highlighting the practicality of the approach for real-world verification workflows.
Submitted 4 April, 2026;
originally announced April 2026.
-
When Adaptive Rewards Hurt: Causal Probing and the Switching-Stability Dilemma in LLM-Guided LEO Satellite Scheduling
Authors:
Yuanhang Li
Abstract:
Adaptive reward design for deep reinforcement learning (DRL) in multi-beam LEO satellite scheduling is motivated by the intuition that regime-aware reward weights should outperform static ones. We systematically test this intuition and uncover a switching-stability dilemma: near-constant reward weights (342.1 Mbps) outperform carefully-tuned dynamic weights (103.3+/-96.8 Mbps) because PPO requires a quasi-stationary reward signal for value function convergence. Weight adaptation, regardless of quality, degrades performance by repeatedly restarting convergence. To understand why specific weights matter, we introduce a single-variable causal probing method that independently perturbs each reward term by +/-20% and measures PPO response after 50k steps. Probing reveals counterintuitive leverage: a +20% increase in the switching penalty yields +157 Mbps for polar handover and +130 Mbps for hot-cold regimes, findings inaccessible to human experts or trained MLPs without systematic probing. We evaluate four MDP architect variants (fixed, rule-based, learned MLP, fine-tuned LLM) across known and novel traffic regimes. The MLP achieves 357.9 Mbps on known regimes and 325.2 Mbps on novel regimes, while the fine-tuned LLM collapses to 45.3+/-43.0 Mbps due to weight oscillation rather than lack of domain knowledge; output consistency, not knowledge, is the binding constraint. Our findings provide an empirically-grounded roadmap for LLM-DRL integration in communication systems, identifying where LLMs add irreplaceable value (natural language intent understanding) versus where simpler methods suffice.
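The single-variable causal probing procedure, perturbing one reward term at a time by +/-20% and measuring the response, can be sketched as follows; the surrogate evaluation function here is a toy stand-in for the 50k-step PPO runs the abstract describes:

```python
def probe_reward_terms(base_weights, evaluate, delta=0.2):
    """Perturb each reward weight by +/- delta in isolation and record
    the change in the evaluation metric relative to the baseline."""
    baseline = evaluate(base_weights)
    effects = {}
    for name in base_weights:
        for sign in (+1, -1):
            w = dict(base_weights)
            w[name] = base_weights[name] * (1 + sign * delta)
            effects[(name, sign * delta)] = evaluate(w) - baseline
    return effects

def surrogate(w):
    # Toy throughput model with a known sensitivity to each weight.
    return 300 + 100 * w["switch_penalty"] - 10 * (w["throughput"] - 1) ** 2

effects = probe_reward_terms(
    {"throughput": 1.0, "switch_penalty": 0.5}, surrogate
)
# effects[("switch_penalty", 0.2)] isolates the causal leverage of the
# switching penalty, the quantity the paper reports as +157 Mbps.
```

Because each probe changes exactly one term, the measured deltas are attributable to that term alone, which is what makes the leverage findings causal rather than correlational.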
Submitted 3 April, 2026;
originally announced April 2026.
-
Olmo Hybrid: From Theory to Practice and Back
Authors:
William Merrill,
Yanhong Li,
Tyler Romero,
Anej Svete,
Caia Costello,
Pradeep Dasigi,
Dirk Groeneveld,
David Heineman,
Bailey Kuehl,
Nathan Lambert,
Jacob Morrison,
Luca Soldaini,
Finbarr Timbers,
Pete Walsh,
Noah A. Smith,
Hannaneh Hajishirzi,
Ashish Sabharwal
Abstract:
Recent work has demonstrated the potential of non-transformer language models, especially linear recurrent neural networks (RNNs) and hybrid models that mix recurrence and attention. Yet there is no consensus on whether the potential benefits of these new architectures justify the risk and effort of scaling them up. To address this, we provide evidence for the advantages of hybrid models over pure transformers on several fronts. First, theoretically, we show that hybrid models do not merely inherit the expressivity of transformers and linear RNNs, but can express tasks beyond both, such as code execution. Putting this theory to practice, we train Olmo Hybrid, a 7B-parameter model largely comparable to Olmo 3 7B but with the sliding window layers replaced by Gated DeltaNet layers. We show that Olmo Hybrid outperforms Olmo 3 across standard pretraining and mid-training evaluations, demonstrating the benefit of hybrid models in a controlled, large-scale setting. We find that the hybrid model scales significantly more efficiently than the transformer, explaining its higher performance. However, it is unclear why greater expressivity on specific formal problems should result in better scaling or superior performance on downstream tasks unrelated to those problems. To explain this apparent gap, we return to theory and argue why increased expressivity should translate to better scaling efficiency, completing the loop. Overall, our results suggest that hybrid models mixing attention and recurrent layers are a powerful extension to the language modeling paradigm: not merely to reduce memory during inference, but as a fundamental way to obtain more expressive models that scale better during pretraining.
Submitted 3 April, 2026;
originally announced April 2026.
-
KiToke: Kernel-based Interval-aware Token Compression for Video Large Language Models
Authors:
Haifeng Huang,
Yang Li
Abstract:
Video Large Language Models (Video LLMs) achieve strong performance on video understanding tasks but suffer from high inference costs due to the large number of visual tokens. We propose KiToke, a training-free, query-agnostic token compression approach that reduces spatiotemporal redundancy while preserving critical visual information. Our method estimates token diversity globally using a kernel-based redundancy measure, enabling content-adaptive selection that remains effective under extreme token budgets, and further introduces a lightweight temporal interval construction with interval-aware token merging to maintain temporal coherence. Unlike prior methods that rely on local or segment-level heuristics, KiToke explicitly captures global redundancy across an entire video, leading to more efficient token utilization. Extensive experiments on multiple video understanding benchmarks and Video LLM backbones demonstrate that KiToke consistently outperforms existing training-free compression methods, with particularly large gains at aggressive retention ratios down to 1%.
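A generic greedy diversity selection under a kernel similarity illustrates the kind of global, content-adaptive token selection the abstract describes; this is a sketch in the spirit of KiToke's kernel-based redundancy measure, not the paper's exact algorithm:

```python
import math

def rbf(u, v, gamma=1.0):
    """RBF kernel similarity between two token embeddings."""
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * d2)

def select_diverse(tokens, budget):
    """Greedily keep the token least similar to everything selected so
    far, so the retained set covers the video's global content rather
    than any single local segment."""
    selected = [0]  # seed with the first token
    while len(selected) < budget:
        best, best_score = None, float("inf")
        for i in range(len(tokens)):
            if i in selected:
                continue
            score = max(rbf(tokens[i], tokens[j]) for j in selected)
            if score < best_score:
                best, best_score = i, score
        selected.append(best)
    return sorted(selected)

# Two near-duplicate pairs: under a budget of 2, one token per pair survives.
tokens = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]]
kept = select_diverse(tokens, 2)
```

Because the score is computed against all tokens rather than within a local window, redundancy spread across distant frames is still eliminated, which matters most at the aggressive 1% retention ratios the abstract reports.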
Submitted 3 April, 2026;
originally announced April 2026.
-
Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models
Authors:
Nanxi Li,
Xiang Wang,
Yuanjie Chen,
Haode Zhang,
Hong Li,
Yong-Lu Li
Abstract:
While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in image and video understanding, their ability to comprehend the physical world has become an increasingly important research focus. Despite their improvements, current MLLMs struggle significantly with high-level physics reasoning. In this work, we investigate the first step of physical reasoning, i.e., intuitive physics understanding, revealing substantial limitations in understanding the dynamics of continuum objects. To isolate and evaluate this specific capability, we introduce two fundamental benchmark tasks: Next Frame Selection (NFS) and Temporal Coherence Verification (TCV). Our experiments demonstrate that even state-of-the-art MLLMs perform poorly on these foundational tasks. To address this limitation, we propose Scene Dynamic Field (SDF), a concise approach that leverages physics simulators within a multi-task fine-tuning framework. SDF substantially improves performance, achieving up to 20.7% gains on fluid tasks while showing strong generalization to unseen physical domains. This work not only highlights a critical gap in current MLLMs but also presents a promising cost-efficient approach for developing more physically grounded MLLMs. Our code and data are available at https://github.com/andylinx/Scene-Dynamic-Field.
Submitted 30 March, 2026;
originally announced April 2026.
-
ProtoFlow: Mitigating Forgetting in Class-Incremental Remote Sensing Segmentation via Low-Curvature Prototype Flow
Authors:
Jiekai Wu,
Rong Fu,
Chuangqi Li,
Zijian Zhang,
Guangxin Wu,
Hao Zhang,
Shiyin Lin,
Jianyuan Ni,
Yang Li,
Dongxu Zhang,
Amir H. Gandomi,
Simon Fong,
Pengbin Feng
Abstract:
Remote sensing segmentation in real deployment is inherently continual: new semantic categories emerge, and acquisition conditions shift across seasons, cities, and sensors. Despite recent progress, many incremental approaches still treat training steps as isolated updates, which leaves representation drift and forgetting insufficiently controlled. We present ProtoFlow, a time-aware prototype dynamics framework that models class prototypes as trajectories and learns their evolution with an explicit temporal vector field. By jointly enforcing low-curvature motion and inter-class separation, ProtoFlow stabilizes prototype geometry throughout incremental learning. Experiments on standard class- and domain-incremental remote sensing benchmarks show consistent gains over strong baselines, including up to 1.5-2.0 points improvement in mIoU_all, together with reduced forgetting. These results suggest that explicitly modeling temporal prototype evolution is a practical and interpretable strategy for robust continual remote sensing segmentation.
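The low-curvature constraint on prototype trajectories can be made concrete with a second-difference penalty, one standard discretization of curvature; the paper's exact regularizer may differ:

```python
def curvature_penalty(trajectory):
    """Sum of squared second differences p[t+1] - 2*p[t] + p[t-1] over a
    prototype trajectory (a list of equal-length feature vectors).
    Zero means the prototype moves along a straight line across
    incremental steps; large values indicate abrupt representation drift."""
    penalty = 0.0
    for t in range(1, len(trajectory) - 1):
        penalty += sum(
            (trajectory[t + 1][d] - 2 * trajectory[t][d] + trajectory[t - 1][d]) ** 2
            for d in range(len(trajectory[t]))
        )
    return penalty

straight = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # constant-velocity drift
bent = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]      # sharp turn between steps
```

Minimizing such a term pushes prototypes to evolve smoothly between incremental steps, which is one way to keep old decision boundaries from being dragged along with new classes.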
Submitted 3 April, 2026;
originally announced April 2026.
-
The Eleventh NTIRE 2026 Efficient Super-Resolution Challenge Report
Authors:
Bin Ren,
Hang Guo,
Yan Shu,
Jiaqi Ma,
Ziteng Cui,
Shuhong Liu,
Guofeng Mei,
Lei Sun,
Zongwei Wu,
Fahad Shahbaz Khan,
Salman Khan,
Radu Timofte,
Yawei Li,
Hongyuan Yu,
Pufan Xu,
Chen Wu,
Long Peng,
Jiaojiao Yi,
Siyang Yi,
Yuning Cui,
Jingyuan Xia,
Xing Mou,
Keji He,
Jinlin Wu,
Zongang Gao
, et al. (38 additional authors not shown)
Abstract:
This paper reviews the NTIRE 2026 challenge on efficient single-image super-resolution with a focus on the proposed solutions and results. The aim of this challenge is to devise a network that reduces one or several aspects, such as runtime, parameters, and FLOPs, while maintaining PSNR of around 26.90 dB on the DIV2K_LSDIR_valid dataset, and 26.99 dB on the DIV2K_LSDIR_test dataset. The challenge had 95 registered participants, and 15 teams made valid submissions. Together, these submissions gauge the state of the art in efficient single-image super-resolution.
Submitted 3 April, 2026;
originally announced April 2026.
-
InCoder-32B-Thinking: Industrial Code World Model for Thinking
Authors:
Jian Yang,
Wei Zhang,
Jiajun Wu,
Junhang Cheng,
Tuney Zheng,
Fanglin Xu,
Weicheng Gu,
Lin Jing,
Yaxin Du,
Joseph Li,
Yizhi Li,
Yan Xing,
Chuan Hao,
Ran Tao,
Ruihao Gong,
Aishan Liu,
Zhoujun Li,
Mingjie Tang,
Chenghua Lin,
Siheng Chen,
Wayne Xin Zhao,
Xianglong Liu,
Ming Zhou,
Bryan Dai,
Weifeng Lv
Abstract:
Industrial software development across chip design, GPU optimization, and embedded systems lacks expert reasoning traces showing how engineers reason about hardware constraints and timing semantics. In this work, we propose InCoder-32B-Thinking, trained on data from the Error-driven Chain-of-Thought (ECoT) synthesis framework with an industrial code world model (ICWM) to generate reasoning traces. Specifically, ECoT generates reasoning chains by synthesizing the thinking content from multi-turn dialogue with environmental error feedback, explicitly modeling the error-correction process. ICWM is trained on domain-specific execution traces from Verilog simulation, GPU profiling, and related tools; it learns the causal dynamics of how code affects hardware behavior and enables self-verification by predicting execution outcomes before actual compilation. All synthesized reasoning traces are validated through domain toolchains, creating training data matching the natural reasoning depth distribution of industrial tasks. Evaluation on 14 general (81.3% on LiveCodeBench v5) and 9 industrial benchmarks (84.0% on CAD-Coder and 38.0% on KernelBench) shows InCoder-32B-Thinking achieves top-tier open-source results across all domains.
Submitted 3 April, 2026;
originally announced April 2026.
-
FSUNav: A Cerebrum-Cerebellum Architecture for Fast, Safe, and Universal Zero-Shot Goal-Oriented Navigation
Authors:
Mingao Tan,
Yiyang Li,
Shanze Wang,
Xinming Zhang,
Wei Zhang
Abstract:
Current vision-language navigation methods face substantial bottlenecks regarding heterogeneous robot compatibility, real-time performance, and navigation safety. Furthermore, they struggle to support open-vocabulary semantic generalization and multimodal task inputs. To address these challenges, this paper proposes FSUNav: a Cerebrum-Cerebellum architecture for fast, safe, and universal zero-shot goal-oriented navigation, which innovatively integrates vision-language models (VLMs) with the proposed architecture. The cerebellum module, a high-frequency end-to-end module, develops a universal local planner based on deep reinforcement learning, enabling unified navigation across heterogeneous platforms (e.g., humanoid, quadruped, wheeled robots) to improve navigation efficiency while significantly reducing collision risk. The cerebrum module constructs a three-layer reasoning model and leverages VLMs to build an end-to-end detection and verification mechanism, enabling zero-shot open-vocabulary goal navigation without predefined IDs and improving task success rates in both simulation and real-world environments. Additionally, the framework supports multimodal inputs (e.g., text, target descriptions, and images), further enhancing generalization, real-time performance, safety, and robustness. Experimental results on MP3D, HM3D, and OVON benchmarks demonstrate that FSUNav achieves state-of-the-art performance on object, instance image, and task navigation, significantly outperforming existing methods. Real-world deployments on diverse robotic platforms further validate its robustness and practical applicability.
Submitted 3 April, 2026;
originally announced April 2026.
-
SD-FSMIS: Adapting Stable Diffusion for Few-Shot Medical Image Segmentation
Authors:
Meihua Li,
Yang Zhang,
Weizhao He,
Hu Qu,
Yisong Li
Abstract:
Few-Shot Medical Image Segmentation (FSMIS) aims to segment novel object classes in medical images using only minimal annotated examples, addressing the critical challenges of data scarcity and domain shifts prevalent in medical imaging. While Diffusion Models (DM) excel in visual tasks, their potential for FSMIS remains largely unexplored. We propose that the rich visual priors learned by large-scale DMs offer a powerful foundation for a more robust and data-efficient segmentation approach. In this paper, we introduce SD-FSMIS, a novel framework designed to effectively adapt the powerful pre-trained Stable Diffusion (SD) model for the FSMIS task. Our approach repurposes its conditional generative architecture by introducing two key components: a Support-Query Interaction (SQI) and a Visual-to-Textual Condition Translator (VTCT). Specifically, SQI provides a straightforward yet powerful means of adapting SD to the FSMIS paradigm. The VTCT module translates visual cues from the support set into an implicit textual embedding that guides the diffusion model, enabling precise conditioning of the generation process. Extensive experiments demonstrate that SD-FSMIS achieves competitive results compared to state-of-the-art methods in standard settings. Surprisingly, it also demonstrates excellent generalization ability in more challenging cross-domain scenarios. These findings highlight the immense potential of adapting large-scale generative models to advance data-efficient and robust medical image segmentation.
Submitted 3 April, 2026;
originally announced April 2026.
-
Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems
Authors:
Yubin Qu,
Yi Liu,
Tongcheng Geng,
Gelei Deng,
Yuekang Li,
Leo Yu Zhang,
Ying Zhang,
Lei Ma
Abstract:
LLM-based coding agents extend their capabilities via third-party agent skills distributed through open marketplaces without mandatory security review. Unlike traditional packages, these skills are executed as operational directives with system-level privileges, so a single malicious skill can compromise the host. Prior work has not examined whether supply-chain attacks can directly hijack an agent's action space, such as file writes, shell commands, and network requests, despite existing safeguards. We introduce Document-Driven Implicit Payload Execution (DDIPE), which embeds malicious logic in code examples and configuration templates within skill documentation. Because agents reuse these examples during normal tasks, the payload executes without explicit prompts. Using an LLM-driven pipeline, we generate 1,070 adversarial skills from 81 seeds across 15 MITRE ATT&CK categories. Across four frameworks and five models, DDIPE achieves 11.6% to 33.5% bypass rates, while explicit instruction attacks achieve 0% under strong defenses. Static analysis detects most cases, but 2.5% evade both detection and alignment. Responsible disclosure led to four confirmed vulnerabilities and two fixes.
Submitted 3 April, 2026;
originally announced April 2026.
-
Credential Leakage in LLM Agent Skills: A Large-Scale Empirical Study
Authors:
Zhihao Chen,
Ying Zhang,
Yi Liu,
Gelei Deng,
Yuekang Li,
Yanjun Zhang,
Jianting Ning,
Leo Yu Zhang,
Lei Ma,
Zhiqiang Li
Abstract:
Third-party skills extend LLM agents with powerful capabilities but often handle sensitive credentials in privileged environments, making leakage risks poorly understood. We present the first large-scale empirical study of this problem, analyzing 17,022 skills (sampled from 170,226 on SkillsMP) using static analysis, sandbox testing, and manual inspection. We identify 520 vulnerable skills with 1,708 issues and derive a taxonomy of 10 leakage patterns (4 accidental and 6 adversarial). We find that (1) leakage is fundamentally cross-modal: 76.3% require joint analysis of code and natural language, while 3.1% arise purely from prompt injection; (2) debug logging is the primary vector, with print and console.log causing 73.5% of leaks due to stdout exposure to LLMs; and (3) leaked credentials are both exploitable (89.6% without privileges) and persistent, as forks retain secrets even after upstream fixes. After disclosure, all malicious skills were removed and 91.6% of hardcoded credentials were fixed. We release our dataset, taxonomy, and detection pipeline to support future research.
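The debug-logging vector described above (print and console.log statements exposing secrets via stdout) can be illustrated with a toy static check. This is a minimal sketch under stated assumptions: the regexes, credential-name list, and `find_debug_leaks` helper are hypothetical illustrations, not the paper's actual detection pipeline, which combines static analysis, sandbox testing, and manual inspection.

```python
import re

# Hypothetical patterns: credential-like variable names and debug calls.
CRED_NAMES = re.compile(r"(api[_-]?key|token|secret|password)", re.IGNORECASE)
DEBUG_CALLS = re.compile(r"\b(print|console\.log)\s*\(([^)]*)\)")

def find_debug_leaks(source: str) -> list[str]:
    """Return debug-call lines whose arguments mention a credential name."""
    hits = []
    for line in source.splitlines():
        m = DEBUG_CALLS.search(line)
        # Flag only when the call's argument list references a credential.
        if m and CRED_NAMES.search(m.group(2)):
            hits.append(line.strip())
    return hits

sample = 'api_key = load()\nprint("debug:", api_key)\nprint("done")'
print(find_debug_leaks(sample))  # flags only the line that prints api_key
```

A real detector would need cross-modal analysis as the study emphasizes: 76.3% of the leaks it found require reasoning jointly over code and natural-language skill documentation, which a line-level regex cannot capture.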
Submitted 3 April, 2026;
originally announced April 2026.
-
ARM: Advantage Reward Modeling for Long-Horizon Manipulation
Authors:
Yiming Mao,
Zixi Yu,
Weixin Mao,
Yinhao Li,
Qirui Hu,
Zihan Lan,
Minzhao Zhu,
Hua Chen
Abstract:
Long-horizon robotic manipulation remains challenging for reinforcement learning (RL) because sparse rewards provide limited guidance for credit assignment. Practical policy improvement thus relies on richer intermediate supervision, such as dense progress rewards, which are costly to obtain and ill-suited to non-monotonic behaviors such as backtracking and recovery. To address this, we propose Advantage Reward Modeling (ARM), a framework that shifts from hard-to-quantify absolute progress to estimating relative advantage. We introduce a cost-effective tri-state labeling strategy -- Progressive, Regressive, and Stagnant -- that reduces human cognitive overhead while ensuring high cross-annotator consistency. By training on these intuitive signals, ARM enables automated progress annotation for both complete demonstrations and fragmented DAgger-style data. Integrating ARM into an offline RL pipeline allows for adaptive action-reward reweighting, effectively filtering suboptimal samples. Our approach achieves a 99.4% success rate on a challenging long-horizon towel-folding task, demonstrating improved stability and data efficiency over current VLA baselines with near-zero human intervention during policy training.
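The tri-state labeling idea above can be sketched as a simple reward-shaping step. This is an illustrative sketch only: the label-to-reward mapping and the softmax-style reweighting rule below are assumptions for exposition, not ARM's exact formulation.

```python
import math

# Hypothetical mapping from tri-state annotations to relative-advantage rewards.
LABEL_REWARD = {"progressive": 1.0, "stagnant": 0.0, "regressive": -1.0}

def advantage_weights(labels, temperature=1.0):
    """Turn tri-state labels into normalized sample weights for offline RL.

    Progressive transitions are upweighted and regressive ones downweighted,
    so suboptimal samples are effectively filtered during policy training.
    """
    exps = [math.exp(LABEL_REWARD[l] / temperature) for l in labels]
    total = sum(exps)
    return [e / total for e in exps]

w = advantage_weights(["progressive", "stagnant", "regressive", "progressive"])
# Progressive steps receive the largest weight, regressive the smallest.
assert w[0] == w[3] > w[1] > w[2]
```

The appeal of the tri-state scheme is that annotators only judge the direction of progress per transition, which is cheaper and more consistent than assigning absolute progress values, and it naturally accommodates non-monotonic behaviors such as backtracking.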
Submitted 3 April, 2026;
originally announced April 2026.
-
AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
Authors:
Yunhao Feng,
Yifan Ding,
Yingshui Tan,
Xingjun Ma,
Yige Li,
Yutao Wu,
Yifeng Gao,
Kun Zhai,
Yanming Guo
Abstract:
Computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments. Unlike chat systems, they maintain state across interactions and translate intermediate outputs into concrete actions. This creates a distinct safety challenge in that harmful behavior may emerge through sequences of individually plausible steps, including intermediate actions that appear locally acceptable but collectively lead to unauthorized actions. We present AgentHazard, a benchmark for evaluating harmful behavior in computer-use agents. AgentHazard contains 2,653 instances spanning diverse risk categories and attack strategies. Each instance pairs a harmful objective with a sequence of operational steps that are locally legitimate but jointly induce unsafe behavior. The benchmark evaluates whether agents can recognize and interrupt harm arising from accumulated context, repeated tool use, intermediate actions, and dependencies across steps. We evaluate AgentHazard on Claude Code, OpenClaw, and IFlow using mostly open or openly deployable models from the Qwen3, Kimi, GLM, and DeepSeek families. Our experimental results indicate that current systems remain highly vulnerable. In particular, when powered by Qwen3-Coder, Claude Code exhibits an attack success rate of 73.63%, suggesting that model alignment alone does not reliably guarantee the safety of autonomous agents.
Submitted 3 April, 2026;
originally announced April 2026.
-
PolyReal: A Benchmark for Real-World Polymer Science Workflows
Authors:
Wanhao Liu,
Weida Wang,
Jiaqing Xie,
Suorong Yang,
Jue Wang,
Benteng Chen,
Guangtao Mei,
Zonglin Yang,
Shufei Zhang,
Yuchun Mo,
Lang Cheng,
Jin Zeng,
Houqiang Li,
Wanli Ouyang,
Yuqiang Li
Abstract:
Multimodal Large Language Models (MLLMs) excel in general domains but struggle with complex, real-world science. We posit that polymer science, an interdisciplinary field spanning chemistry, physics, biology, and engineering, is an ideal high-stakes testbed due to its diverse multimodal data. Yet, existing benchmarks related to polymer science largely overlook real-world workflows, limiting their practical utility and failing to systematically evaluate MLLMs across the full, practice-grounded lifecycle of experimentation. We introduce PolyReal, a novel multimodal benchmark grounded in real-world scientific practices to evaluate MLLMs on the full lifecycle of polymer experimentation. It covers five critical capabilities: (1) foundational knowledge application; (2) lab safety analysis; (3) experiment mechanism reasoning; (4) raw data extraction; and (5) performance & application exploration. Our evaluation of leading MLLMs on PolyReal reveals a capability imbalance. While models perform well on knowledge-intensive reasoning (e.g., Experiment Mechanism Reasoning), they drop sharply on practice-based tasks (e.g., Lab Safety Analysis and Raw Data Extraction). This exposes a severe gap between abstract scientific knowledge and its practical, context-dependent application, showing that these real-world tasks remain challenging for MLLMs. Thus, PolyReal helps address this evaluation gap and provides a practical benchmark for assessing AI systems in real-world scientific workflows.
Submitted 3 April, 2026;
originally announced April 2026.
-
Council Mode: Mitigating Hallucination and Bias in LLMs via Multi-Agent Consensus
Authors:
Shuai Wu,
Xue Li,
Yanna Feng,
Yufang Li,
Zhijun Wang
Abstract:
Large Language Models (LLMs), particularly those employing Mixture-of-Experts (MoE) architectures, have achieved remarkable capabilities across diverse natural language processing tasks. However, these models frequently suffer from hallucinations -- generating plausible but factually incorrect content -- and exhibit systematic biases that are amplified by uneven expert activation during inference. In this paper, we propose the Council Mode, a novel multi-agent consensus framework that addresses these limitations by dispatching queries to multiple heterogeneous frontier LLMs in parallel and synthesizing their outputs through a dedicated consensus model. The Council pipeline operates in three phases: (1) an intelligent triage classifier that routes queries based on complexity, (2) parallel expert generation across architecturally diverse models, and (3) a structured consensus synthesis that explicitly identifies agreement, disagreement, and unique findings before producing the final response. We implement and evaluate this architecture within an open-source AI workspace. Our comprehensive evaluation across multiple benchmarks demonstrates that the Council Mode achieves a 35.9% relative reduction in hallucination rates on the HaluEval benchmark and a 7.8-point improvement on TruthfulQA compared to the best-performing individual model, while maintaining significantly lower bias variance across domains. We provide the mathematical formulation of the consensus mechanism, detail the system architecture, and present extensive empirical results with ablation studies.
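The three-phase Council pipeline above can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: `classify_complexity`, the `MODELS` table, and `synthesize` are hypothetical stand-ins for the triage classifier, the heterogeneous frontier LLMs, and the consensus model, with trivial logic in place of real model calls.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for architecturally diverse expert models (hypothetical).
MODELS = {"model_a": lambda q: f"A:{q}", "model_b": lambda q: f"B:{q}"}

def classify_complexity(query: str) -> str:
    # Phase 1: triage — route simple queries to one model, complex to council.
    return "council" if len(query.split()) > 5 else "single"

def synthesize(answers: dict) -> str:
    # Phase 3: mark agreement vs. disagreement before the final response.
    tag = "agreement" if len(set(answers.values())) == 1 else "disagreement"
    return f"[{tag}] " + " | ".join(answers.values())

def council_mode(query: str) -> str:
    if classify_complexity(query) == "single":
        return next(iter(MODELS.values()))(query)
    # Phase 2: parallel expert generation across the council.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, query) for name, fn in MODELS.items()}
        answers = {name: f.result() for name, f in futures.items()}
    return synthesize(answers)
```

The triage phase matters for cost: only queries judged complex pay for parallel generation across several models plus a synthesis pass, while simple queries take the single-model fast path.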
Submitted 3 April, 2026;
originally announced April 2026.
-
DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection
Authors:
Siheng Wang,
Yanshu Li,
Bohan Hu,
Zhengdao Li,
Haibo Zhan,
Linshan Li,
Weiming Liu,
Ruizhi Qian,
Guangxin Wu,
Hao Zhang,
Jifeng Shen,
Piotr Koniusz,
Zhengtao Yao,
Junhao Dong,
Qiang Sun
Abstract:
Open-vocabulary Object Detection (OVOD) enables models to recognize objects beyond predefined categories, but existing approaches remain limited in practical deployment. On the one hand, multimodal designs often incur substantial computational overhead due to their reliance on text encoders at inference time. On the other hand, tightly coupled training objectives introduce a trade-off between closed-set detection accuracy and open-world generalization. Thus, we propose Decoupled Cognition DETR (DeCo-DETR), a vision-centric framework that addresses these challenges through a unified decoupling paradigm. Instead of depending on online text encoding, DeCo-DETR constructs a hierarchical semantic prototype space from region-level descriptions generated by pre-trained LVLMs and aligned via CLIP, enabling efficient and reusable semantic representation. Building upon this representation, the framework further disentangles semantic reasoning from localization through a decoupled training strategy, which separates alignment and detection into parallel optimization streams. Extensive experiments on standard OVOD benchmarks demonstrate that DeCo-DETR achieves competitive zero-shot detection performance while significantly improving inference efficiency. These results highlight the effectiveness of decoupling semantic cognition from detection, offering a practical direction for scalable OVOD systems.
Submitted 3 April, 2026;
originally announced April 2026.
-
Wavelength-multiplexed massively parallel diffractive optical information storage and image projection
Authors:
Che-Yung Shen,
Yuhang Li,
Cagatay Isil,
Jingxi Li,
Leon Lenk,
Tianyi Gan,
Guangdong Ma,
Fazil Onuralp Ardic,
Mona Jarrahi,
Aydogan Ozcan
Abstract:
We introduce a wavelength-multiplexed massively parallel diffractive information storage platform composed of dielectric surfaces that are structurally optimized at the wavelength scale using deep learning to store and project thousands of distinct image patterns, each assigned to a unique wavelength. Through numerical simulations in the visible spectrum, we demonstrated that our wavelength-multiplexed diffractive system can store and project over 4,000 independent desired images/patterns within its output field-of-view, with high image quality and minimal crosstalk between spectral channels. Furthermore, in a proof-of-concept experiment, we demonstrated a two-layer diffractive design that stored six distinct patterns and projected them onto the same output field of view at six different wavelengths (500, 548, 596, 644, 692, and 740 nm). This diffractive architecture is scalable and can operate at various parts of the electromagnetic spectrum without the need for material dispersion engineering or redesigning its optimized diffractive layers. The demonstrated storage capacity, image reconstruction fidelity, and wavelength-encoded massively parallel read-out of our diffractive platform offer a compact and fast-access solution for large-scale optical information storage and image projection applications.
Submitted 2 April, 2026;
originally announced April 2026.
-
OntoKG: Ontology-Oriented Knowledge Graph Construction with Intrinsic-Relational Routing
Authors:
Yitao Li,
Zhanlin Liu,
Anuranjan Pandey,
Muni Srikanth
Abstract:
Organizing a large-scale knowledge graph into a typed property graph requires structural decisions -- which entities become nodes, which properties become edges, and what schema governs these choices. Existing approaches embed these decisions in pipeline code or extract relations ad hoc, producing schemas that are tightly coupled to their construction process and difficult to reuse for downstream ontology-level tasks. We present an ontology-oriented approach in which the schema is designed from the outset for ontology analysis, entity disambiguation, domain customization, and LLM-guided extraction -- not merely as a byproduct of graph building. The core mechanism is intrinsic-relational routing, which classifies every property as either intrinsic or relational and routes it to the corresponding schema module. This routing produces a declarative schema that is portable across storage backends and independently reusable.
We instantiate the approach on the January 2026 Wikidata dump. A rule-based cleaning stage identifies a 34.6M-entity core set from the full dump, followed by iterative intrinsic-relational routing that assigns each property to one of 94 modules organized into 8 categories. With tool-augmented LLM support and human review, the schema reaches 93.3% category coverage and 98.0% module assignment among classified entities. Exporting this schema yields a property graph with 34.0M nodes and 61.2M edges across 38 relationship types. We validate the ontology-oriented claim through five applications that consume the schema independently of the construction pipeline: ontology structure analysis, benchmark annotation auditing, entity disambiguation, domain customization, and LLM-guided extraction.
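The core routing mechanism can be sketched as a simple classifier over properties. This is an illustrative sketch only: the property sets and module names below are invented examples, whereas the paper's actual routing assigns each Wikidata property to one of 94 modules across 8 categories with tool-augmented LLM support and human review.

```python
# Hypothetical property sets: relational properties reference other entities
# (and become edges); intrinsic properties hold literal values (node fields).
RELATIONAL = {"spouse", "employer", "capital_of"}
INTRINSIC = {"date_of_birth", "population", "height"}

def route(prop: str) -> tuple[str, str]:
    """Classify a property and route it to a schema module (sketch)."""
    if prop in RELATIONAL:
        return ("relational", f"edge_module:{prop}")
    if prop in INTRINSIC:
        return ("intrinsic", f"node_module:{prop}")
    # Unmatched properties escalate to LLM-assisted and human review.
    return ("unclassified", "needs_review")

assert route("spouse") == ("relational", "edge_module:spouse")
assert route("population") == ("intrinsic", "node_module:population")
```

The payoff of making this routing explicit is that the resulting schema is declarative: the node/edge decisions live in the routed modules rather than in pipeline code, so they can be exported to any storage backend and reused by downstream ontology tasks.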
Submitted 2 April, 2026;
originally announced April 2026.
-
SEDGE: Structural Extrapolated Data Generation
Authors:
Kun Zhang,
Jiaqi Sun,
Yiqing Li,
Ignavier Ng,
Namrata Deka,
Shaoan Xie
Abstract:
This paper proposes a framework for Structural Extrapolated Data GEneration (SEDGE) based on suitable assumptions on the underlying data generating process. We provide conditions under which data satisfying new specifications can be generated reliably, together with the approximate identifiability of the distribution of such data under certain "conservative" assumptions. On the algorithmic side, we develop practical methods to achieve extrapolated data generation, based on either a structure-informed optimization strategy or diffusion posterior sampling. We verify the extrapolation performance on synthetic data and also consider extrapolated image generation as a real-world scenario to illustrate the validity of the proposed framework.
Submitted 2 April, 2026;
originally announced April 2026.
-
Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation
Authors:
Xue Liu,
Xin Ma,
Yuxin Ma,
Yongchang Peng,
Duo Wang,
Zhoufutu Wen,
Ge Zhang,
Kaiyuan Zhang,
Xinyu Chen,
Tianci He,
Jiani Hou,
Liang Hu,
Ziyun Huang,
Yongzhe Hui,
Jianpeng Jiao,
Chennan Ju,
Yingru Kong,
Yiran Li,
Mengyun Liu,
Luyao Ma,
Fei Ni,
Yiqing Ni,
Yueyan Qiu,
Yanle Ren,
Zilin Shi
, et al. (9 additional authors not shown)
Abstract:
As Large Language Models (LLMs) exhibit plateauing performance on conventional benchmarks, a pivotal challenge persists: evaluating their proficiency in complex, open-ended tasks characterizing genuine expert-level cognition. Existing frameworks suffer from narrow domain coverage, reliance on generalist tasks, or self-evaluation biases. To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains. XpertBench consists of 1,346 meticulously curated tasks across 80 categories, spanning finance, healthcare, legal services, education, and dual-track research (STEM and Humanities). These tasks are derived from over 1,000 submissions by domain experts--including researchers from elite institutions and practitioners with extensive clinical or industrial experience--ensuring superior ecological validity. Each task uses detailed rubrics, typically with 15-40 weighted checkpoints, to assess professional rigor. To facilitate scalable yet human-aligned assessment, we introduce ShotJudge, a novel evaluation paradigm that employs LLM judges calibrated with expert few-shot exemplars to mitigate self-rewarding biases. Our empirical evaluation of state-of-the-art LLMs reveals a pronounced performance ceiling: even leading models achieve a peak success rate of only ~66%, with a mean score around 55%. Models also exhibit domain-specific divergence, showing non-overlapping strengths in quantitative reasoning versus linguistic synthesis. These findings underscore a significant "expert-gap" in current AI systems and establish XpertBench as a critical instrument for navigating the transition from general-purpose assistants to specialized professional collaborators.
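Weighted-checkpoint rubric scoring of the kind described above reduces to a weight-normalized sum over passed checkpoints. The sketch below is illustrative: the checkpoint weights and pass/fail values are invented, and `rubric_score` is a hypothetical helper, not XpertBench's evaluation code.

```python
def rubric_score(checkpoints: list[tuple[float, bool]]) -> float:
    """Score a response against (weight, passed) checkpoints, in [0, 1].

    The score is the fraction of total checkpoint weight that was earned,
    so heavily weighted checkpoints dominate the final grade.
    """
    total = sum(w for w, _ in checkpoints)
    earned = sum(w for w, passed in checkpoints if passed)
    return earned / total if total else 0.0

# Invented example: 8 of 10 units of weight earned -> score 0.8.
rubric = [(3.0, True), (2.0, False), (5.0, True)]
assert rubric_score(rubric) == 0.8
```

In the benchmark's setting, the passed/failed judgments themselves come from LLM judges calibrated with expert few-shot exemplars (ShotJudge), rather than from a deterministic check.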
Submitted 6 April, 2026; v1 submitted 27 March, 2026;
originally announced April 2026.
-
UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving
Authors:
Yongkang Li,
Lijun Zhou,
Sixu Yan,
Bencheng Liao,
Tianyi Yan,
Kaixin Xiong,
Long Chen,
Hongwei Xie,
Bing Wang,
Guang Chen,
Hangjun Ye,
Wenyu Liu,
Haiyang Sun,
Xinggang Wang
Abstract:
Vision-Language-Action (VLA) models have recently emerged in autonomous driving, with the promise of leveraging rich world knowledge to improve the cognitive capabilities of driving systems. However, adapting such models for driving tasks currently faces a critical dilemma between spatial perception and semantic reasoning. Consequently, existing VLA systems are forced into suboptimal compromises: directly adopting 2D Vision-Language Models yields limited spatial perception, whereas enhancing them with 3D spatial representations often impairs the native reasoning capacity of VLMs. We argue that this dilemma largely stems from the coupled optimization of spatial perception and semantic reasoning within shared model parameters. To overcome this, we propose UniDriveVLA, a Unified Driving Vision-Language-Action model based on Mixture-of-Transformers that addresses the perception-reasoning conflict via expert decoupling. Specifically, it comprises three experts for driving understanding, scene perception, and action planning, which are coordinated through masked joint attention. In addition, we combine a sparse perception paradigm with a three-stage progressive training strategy to improve spatial perception while maintaining semantic reasoning capability. Extensive experiments show that UniDriveVLA achieves state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive. Moreover, it demonstrates strong performance across a broad range of perception, prediction, and understanding tasks, including 3D detection, online mapping, motion forecasting, and driving-oriented VQA, highlighting its broad applicability as a unified model for autonomous driving. Code and model have been released at https://github.com/xiaomi-research/unidrivevla
Submitted 2 April, 2026;
originally announced April 2026.
-
The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook
Authors:
Xinlei Yu,
Zhangquan Chen,
Yongbo He,
Tianyu Fu,
Cheng Yang,
Chengming Xu,
Yue Ma,
Xiaobin Hu,
Zhe Cao,
Jie Xu,
Guibin Zhang,
Jiale Tao,
Jiayi Zhang,
Siyuan Ma,
Kaituo Feng,
Haojie Huang,
Youxing Li,
Ronghao Chen,
Huacan Wang,
Chenglin Wu,
Zikun Su,
Xiaogang Xu,
Kelu Yao,
Kun Wang,
Chen Gao
, et al. (12 additional authors not shown)
Abstract:
Latent space is rapidly emerging as a native substrate for language-based models. While modern systems are still commonly understood through explicit token-level generation, an increasing body of work shows that many critical internal processes are more naturally carried out in continuous latent space than in human-readable verbal traces. This shift is driven by the structural limitations of explicit-space computation, including linguistic redundancy, discretization bottlenecks, sequential inefficiency, and semantic loss. This survey aims to provide a unified and up-to-date landscape of latent space in language-based models. We organize the survey into five sequential perspectives: Foundation, Evolution, Mechanism, Ability, and Outlook. We begin by delineating the scope of latent space, distinguishing it from explicit or verbal space and from the latent spaces commonly studied in generative visual models. We then trace the field's evolution from early exploratory efforts to the current large-scale expansion. To organize the technical landscape, we examine existing work through the complementary lenses of mechanism and ability. From the perspective of Mechanism, we identify four major lines of development: Architecture, Representation, Computation, and Optimization. From the perspective of Ability, we show how latent space supports a broad capability spectrum spanning Reasoning, Planning, Modeling, Perception, Memory, Collaboration, and Embodiment. Beyond consolidation, we discuss the key open challenges, and outline promising directions for future research. We hope this survey serves not only as a reference for existing work, but also as a foundation for understanding latent space as a general computational and systems paradigm for next-generation intelligence.
Submitted 2 April, 2026;
originally announced April 2026.
-
ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety
Authors:
Yu Li,
Haoyu Luo,
Yuejin Xie,
Yuqian Fu,
Zhonghao Yang,
Shuai Shao,
Qihan Ren,
Wanying Qu,
Yanwei Fu,
Yujiu Yang,
Jing Shao,
Xia Hu,
Dongrui Liu
Abstract:
Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than isolated prompts or final responses. Existing trajectory-level benchmarks remain limited by insufficient interaction diversity, coarse observability of safety failures, and weak long-horizon realism. We introduce ATBench, a trajectory-level benchmark for structured, diverse, and realistic evaluation of agent safety. ATBench organizes agentic risk along three dimensions: risk source, failure mode, and real-world harm. Based on this taxonomy, we construct trajectories with heterogeneous tool pools and a long-context delayed-trigger protocol that captures realistic risk emergence across multiple stages. The benchmark contains 1,000 trajectories (503 safe and 497 unsafe), averaging 9.01 turns and 3.95k tokens, with 1,954 invoked tools drawn from pools spanning 2,084 available tools. Data quality is supported by rule-based and LLM-based filtering plus full human audit. Experiments on frontier LLMs, open-source models, and specialized guard systems show that ATBench is challenging even for strong evaluators, while enabling taxonomy-stratified analysis, cross-benchmark comparison, and diagnosis of long-horizon failure patterns.
Submitted 2 April, 2026;
originally announced April 2026.
-
GS^2: Graph-based Spatial Distribution Optimization for Compact 3D Gaussian Splatting
Authors:
Xianben Yang,
Tao Wang,
Yuxuan Li,
Yi Jin,
Haibin Ling
Abstract:
3D Gaussian Splatting (3DGS) has demonstrated breakthrough performance in novel view synthesis and real-time rendering. Nevertheless, its practicality is constrained by the high memory cost due to a huge number of Gaussian points. Many pruning-based 3DGS variants have been proposed for memory saving, but often compromise spatial consistency and may lead to rendering artifacts. To address this issue, we propose graph-based spatial distribution optimization for compact 3D Gaussian Splatting (GS^2), which enhances reconstruction quality by optimizing the spatial distribution of Gaussian points. Specifically, we introduce an evidence lower bound (ELBO)-based adaptive densification strategy that automatically controls the densification process. In addition, an opacity-aware progressive pruning strategy is proposed to further reduce memory consumption by dynamically removing low-opacity Gaussian points. Furthermore, we propose a graph-based feature encoding module to adjust the spatial distribution via feature-guided point shifting. Extensive experiments validate that GS^2 achieves a compact Gaussian representation while delivering superior rendering quality. Compared with 3DGS, it achieves higher PSNR with only about 12.5% Gaussian points. Furthermore, it outperforms all compared baselines in both rendering quality and memory efficiency.
Submitted 2 April, 2026;
originally announced April 2026.
-
Transversal non-Clifford gates on almost-good quantum LDPC and quantum locally testable codes
Authors:
Yiming Li,
Zimu Li,
Zi-Wen Liu
Abstract:
We exhibit nontrivial transversal logical multi-controlled-$Z$ gates on $[\![N, Θ(N), \tilde{Θ}(N)]\!]$ quantum low-density parity-check codes and $[\![N, Θ(N), \tilde{Θ}(N)]\!]$ quantum locally testable codes with soundness $\tilde{Θ}(1)$, combining nearly optimal code parameters with fault-tolerant non-Clifford gates for the first time. Remarkably, our proofs are almost entirely algebraic-topological, showing that such presumably intricate logical gates naturally arise as a fundamental topological phenomenon. We develop a general framework for constructing a rich new family of homological invariant forms which we call "cupcap gates" that induce transversal logical multi-controlled-$Z$ and, building on insights from [Li et al., arXiv:2603.25831], covering space methods to certify their nontriviality. The claimed almost-good code results follow immediately as examples.
Submitted 2 April, 2026;
originally announced April 2026.
-
Causal Scene Narration with Runtime Safety Supervision for Vision-Language-Action Driving
Authors:
Yun Li,
Yidu Zhang,
Simon Thompson,
Ehsan Javanmardi,
Manabu Tsukada
Abstract:
Vision-Language-Action (VLA) models for autonomous driving must integrate diverse textual inputs, including navigation commands, hazard warnings, and traffic state descriptions, yet current systems often present these as disconnected fragments, forcing the model to discover on its own which environmental constraints are relevant to the current maneuver. We introduce Causal Scene Narration (CSN), which restructures VLA text inputs through intent-constraint alignment, quantitative grounding, and structured separation, applied at inference time with zero GPU cost. We complement CSN with Simplex-based runtime safety supervision and training-time alignment via Plackett-Luce DPO with negative log-likelihood (NLL) regularization. A multi-town closed-loop CARLA evaluation shows that CSN improves Driving Score by +31.1% on the original LMDrive and +24.5% on the preference-aligned variant. A controlled ablation reveals that causal structure accounts for 39.1% of this gain, with the remainder attributable to information content alone. A perception-noise ablation confirms that CSN's benefit is robust to realistic sensing errors. Semantic safety supervision improves Infraction Score, while reactive Time-To-Collision monitoring degrades performance, demonstrating that intent-aware monitoring is needed for VLA systems.
Submitted 2 April, 2026;
originally announced April 2026.
-
Hierarchical Memory Orchestration for Personalized Persistent Agents
Authors:
Junming Liu,
Yifei Sun,
Weihua Cheng,
Haodong Lei,
Yuqi Li,
Yirong Chen,
Ding Wang
Abstract:
While long-term memory is essential for intelligent agents to maintain consistent historical awareness, the accumulation of extensive interaction data often leads to performance bottlenecks. Naive storage expansion increases retrieval noise and computational latency, overwhelming the reasoning capacity of models deployed on constrained personal devices. To address this, we propose Hierarchical Memory Orchestration (HMO), a framework that organizes interaction history into a three-tiered directory driven by user-centric contextual relevance. Our system maintains a compact primary cache, coupling recent and pivotal memories with an evolving user profile to ensure agent reasoning remains aligned with individual behavioral traits. This primary cache is complemented by a high-priority secondary layer, both of which are managed within a global archive of the full interaction history. Crucially, the user persona dictates memory redistribution across this hierarchy, promoting records mapped to long-term patterns toward more active tiers while relegating less relevant information. This targeted orchestration surfaces historical knowledge precisely when needed while maintaining a lean and efficient active search space. Evaluations on multiple benchmarks show state-of-the-art performance. Real-world deployments in ecosystems like OpenClaw demonstrate that HMO significantly enhances agent fluidity and personalization.
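The tier assignment described above can be sketched in a few lines (a simplified illustration under assumed tier capacities; `persona_score` is a stand-in for the paper's user-centric relevance model, not its actual API):

```python
def orchestrate(records, persona_score, cap_primary=5, cap_secondary=20):
    """Three-tier memory orchestration sketch: rank records by a
    persona-weighted relevance score, keep the top band in a compact
    primary cache, the next band in a secondary layer, and retain
    everything in the global archive.
    """
    ranked = sorted(records, key=persona_score, reverse=True)
    primary = ranked[:cap_primary]                       # compact active cache
    secondary = ranked[cap_primary:cap_primary + cap_secondary]
    archive = list(records)                              # full history always kept
    return primary, secondary, archive
```

Re-running this assignment as the persona evolves captures the promotion/relegation behavior the abstract describes.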
Submitted 2 April, 2026;
originally announced April 2026.
-
M3D-BFS: a Multi-stage Dynamic Fusion Strategy for Sample-Adaptive Multi-Modal Brain Network Analysis
Authors:
Rui Dong,
Xiaotong Zhang,
Jiaxing Li,
Yueying Li,
Jiayin Wei,
Youyong Kong
Abstract:
Multi-modal fusion is of great significance in neuroscience: it integrates information from different modalities and can outperform uni-modal methods in downstream tasks. Current multi-modal fusion methods for brain networks, which mainly focus on the structural connectivity (SC) and functional connectivity (FC) modalities, are static in nature. They feed different samples into the same model with identical computation, ignoring inherent differences between input samples. This lack of sample adaptation limits further performance gains. To this end, we propose a multi-stage dynamic fusion strategy (M3D-BFS) for sample-adaptive multi-modal brain network analysis. Unlike static fusion methods, we design different mixture-of-experts (MoE) modules for uni- and multi-modal representations, which adapt as the input sample changes during inference. To alleviate the MoE issue that expert training may collapse, we divide training into three stages: we first train the uni-modal encoders separately, then pretrain the individual experts of the MoEs, and finally fine-tune the whole model. A multi-modal disentanglement loss is designed to enhance the final representations. To the best of our knowledge, this is the first work on dynamic fusion for multi-modal brain network analysis. Extensive experiments on different real-world datasets demonstrate the superiority of M3D-BFS.
Submitted 2 April, 2026;
originally announced April 2026.
-
Diffusion-Guided Adversarial Perturbation Injection for Generalizable Defense Against Facial Manipulations
Authors:
Yue Li,
Linying Xue,
Kaiqing Lin,
Hanyu Quan,
Dongdong Lin,
Hui Tian,
Hongxia Wang,
Bin Wang
Abstract:
Recent advances in GAN and diffusion models have significantly improved the realism and controllability of facial deepfake manipulation, raising serious concerns regarding privacy, security, and identity misuse. Proactive defenses attempt to counter this threat by injecting adversarial perturbations into images before manipulation takes place. However, existing approaches remain limited in effectiveness due to suboptimal perturbation injection strategies and are typically designed under white-box assumptions, targeting only simple GAN-based attribute editing. These constraints hinder their applicability in practical real-world scenarios. In this paper, we propose AEGIS, the first diffusion-guided paradigm in which the AdvErsarial facial images are Generated for Identity Shielding. We observe that the limited defense capability of existing approaches stems from the peak-clipping constraint, where perturbations are forcibly truncated due to a fixed $L_\infty$ bound. To overcome this limitation, instead of directly modifying pixels, AEGIS injects adversarial perturbations into the latent space along the DDIM denoising trajectory, thereby decoupling the perturbation magnitude from pixel-level constraints and allowing perturbations to adaptively amplify where most effective. The extensible design of AEGIS allows the defense to be expanded from purely white-box use to also support black-box scenarios through a gradient-estimation strategy. Extensive experiments across GAN and diffusion-based deepfake generators show that AEGIS consistently delivers strong defense effectiveness while maintaining high perceptual quality. In white-box settings, it achieves robust manipulation disruption, whereas in black-box settings, it demonstrates strong cross-model transferability.
Submitted 2 April, 2026;
originally announced April 2026.
-
LLM Agents as Social Scientists: A Human-AI Collaborative Platform for Social Science Automation
Authors:
Lei Wang,
Yuanzi Li,
Jinchao Wu,
Heyang Gao,
Xiaohe Bo,
Xu Chen,
Ji-Rong Wen
Abstract:
Traditional social science research often requires designing complex experiments across vast methodological spaces and depends on real human participants, making it labor-intensive, costly, and difficult to scale. Here we present S-Researcher, an LLM-agent-based platform that assists researchers in conducting social science research more efficiently and at greater scale by "siliconizing" both the research process and the participant pool. To build S-Researcher, we first develop YuLan-OneSim, a large-scale social simulation system designed around three core requirements: generality via auto-programming from natural language to executable scenarios, scalability via a distributed architecture supporting up to 100,000 concurrent agents, and reliability via feedback-driven LLM fine-tuning. Leveraging this system, S-Researcher supports researchers in designing social experiments, simulating human behavior with LLM agents, analyzing results, and generating reports, forming a complete human-AI collaborative research loop in which researchers retain oversight and intervention at every stage. We operationalize LLM simulation research paradigms into three canonical reasoning modes (induction, deduction, and abduction) and validate S-Researcher through systematic case studies: inductive reproduction of cultural dynamics consistent with Axelrod's theory, deductive testing of competing hypotheses on teacher attention validated against survey data, and abductive identification of a cooperation mechanism in public goods games confirmed by human experiments. S-Researcher establishes a new human-AI collaborative paradigm for social science, in which computational simulation augments human researchers to accelerate discovery across the full spectrum of social inquiry.
Submitted 1 April, 2026;
originally announced April 2026.
-
Non-monotonicity in Conformal Risk Control
Authors:
Tareq Aldirawi,
Yun Li,
Wenge Guo
Abstract:
Conformal risk control (CRC) provides distribution-free guarantees for controlling the expected loss at a user-specified level. Existing theory typically assumes that the loss decreases monotonically with a tuning parameter that governs the size of the prediction set. This assumption is often violated in practice, where losses may behave non-monotonically due to competing objectives such as coverage and efficiency.
We study CRC under non-monotone loss functions when the tuning parameter is selected from a finite grid, a common scenario in thresholding or discretized decision rules. Revisiting a known counterexample, we show that the validity of CRC without monotonicity depends on the relationship between the calibration sample size and the grid resolution. In particular, risk control can still be achieved when the calibration sample is sufficiently large relative to the grid.
We provide a finite-sample guarantee for bounded losses over a grid of size $m$, showing that the excess risk above the target level $α$ is of order $\sqrt{\log(m)/n}$, where $n$ is the calibration sample size. A matching lower bound shows that this rate is minimax optimal. We also derive refined guarantees under additional structural conditions, including Lipschitz continuity and monotonicity, and extend the analysis to settings with distribution shift via importance weighting.
Numerical experiments on synthetic multilabel classification and real object detection data illustrate the practical impact of non-monotonicity. Methods that account for finite-sample deviations achieve more stable risk control than approaches based on monotonicity transformations, while maintaining competitive prediction-set sizes.
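The grid-based selection rule can be sketched as follows (a minimal illustration, not the paper's implementation: the Hoeffding-plus-union-bound deviation term is our assumed mechanism for the $\sqrt{\log(m)/n}$ rate for bounded losses without monotonicity, and the tie-breaking rule is arbitrary):

```python
import numpy as np

def crc_select(losses, alpha, delta=0.05):
    """Select a grid index whose calibration risk, plus a finite-sample
    deviation term, stays below the target level alpha.

    losses: (n, m) array; losses[i, j] is a bounded loss in [0, 1] of
            calibration point i at the j-th value of the tuning parameter.
    Returns the feasible grid index with the smallest empirical risk,
    or None if no grid value is certifiably safe.
    """
    n, m = losses.shape
    risk_hat = losses.mean(axis=0)                 # empirical risk per grid value
    # Hoeffding + union bound over the m grid values gives an
    # O(sqrt(log(m)/n)) deviation, with no monotonicity assumption.
    dev = np.sqrt(np.log(2 * m / delta) / (2 * n))
    feasible = np.flatnonzero(risk_hat + dev <= alpha)
    if feasible.size == 0:
        return None
    return int(feasible[np.argmin(risk_hat[feasible])])
```

With $n$ large relative to $\log m$ the deviation term shrinks, matching the observation that risk control survives non-monotonicity when the calibration sample is large relative to the grid.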
Submitted 1 April, 2026;
originally announced April 2026.
-
EXaCTz: Guaranteed Extremum Graph and Contour Tree Preservation for Distributed- and GPU-Parallel Lossy Compression
Authors:
Yuxiao Li,
Mingze Xia,
Xin Liang,
Bei Wang,
Hanqi Guo
Abstract:
This paper introduces EXaCTz, a parallel algorithm that concurrently preserves extremum graphs and contour trees in lossy-compressed scalar field data. While error-bounded lossy compression is essential for large-scale scientific simulations and workflows, existing topology-preserving methods suffer from (1) a significant throughput disparity, where topology correction speeds are on the order of MB/s, lagging orders of magnitude behind compression speeds on the order of GB/s, (2) limited support for diverse topological descriptors, and (3) a lack of theoretical convergence bounds. To address these challenges, EXaCTz introduces a high-performance, bounded-iteration algorithm that enforces topological consistency by deriving targeted edits for decompressed data. Unlike prior methods that rely on explicit topology reconstruction, EXaCTz enforces consistent min/max neighbors of all vertices, along with global ordering among critical points. As such, the algorithm enforces consistent critical-point classification, saddle extremum connectivity, and the preservation of merge/split events. We theoretically prove the convergence of our algorithm, bounded by the longest path in a vulnerability graph that characterizes potential cascading effects during correction. Experiments on real-world datasets show that EXaCTz achieves a single-GPU throughput of up to 4.52 GB/s, outperforming the state-of-the-art contour-tree-preserving method (Gorski et al.) by up to 213x (with a single-core CPU implementation for fair comparison) and 3,285x (with a single-GPU version). In distributed environments, EXaCTz scales to 128 GPUs with 55.6\% efficiency (compared with 6.4\% for a naive parallelization), processing datasets of up to 512 GB in under 48 seconds and achieving an aggregate correction throughput of up to 32.69 GB/s.
Submitted 1 April, 2026;
originally announced April 2026.
-
SMASH: Mastering Scalable Whole-Body Skills for Humanoid Ping-Pong with Egocentric Vision
Authors:
Junli Ren,
Yinghui Li,
Kai Zhang,
Penglin Fu,
Haoran Jiang,
Yixuan Pan,
Guangjun Zeng,
Tao Huang,
Weizhong Guo,
Peng Lu,
Tianyu Li,
Jingbo Wang,
Li Chen,
Hongyang Li,
Ping Luo
Abstract:
Existing humanoid table tennis systems remain limited by their reliance on external sensing and their inability to achieve agile whole-body coordination for precise task execution. These limitations stem from two core challenges: achieving low-latency and robust onboard egocentric perception under fast robot motion, and obtaining sufficiently diverse task-aligned strike motions for learning precise yet natural whole-body behaviors. In this work, we present SMASH, a modular system for agile humanoid table tennis that unifies scalable whole-body skill learning with onboard egocentric perception, eliminating the need for external cameras during deployment. Our work advances prior humanoid table-tennis systems in three key aspects. First, we achieve agile and precise ball interaction with tightly coordinated whole-body control, rather than relying on decoupled upper- and lower-body behaviors. This enables the system to exhibit diverse strike motions, including explosive whole-body smashes and low crouching shots. Second, by augmenting and diversifying strike motions with a generative model, our framework benefits from scalable motion priors and produces natural, robust striking behaviors across a wide workspace. Third, to the best of our knowledge, we demonstrate the first humanoid table-tennis system capable of consecutive strikes using onboard sensing alone, despite the challenges of low-latency perception, ego-motion-induced instability, and limited field of view. Extensive real-world experiments demonstrate stable and precise ball exchanges under high-speed conditions, validating scalable, perception-driven whole-body skill learning for dynamic humanoid interaction tasks.
Submitted 1 April, 2026;
originally announced April 2026.
-
Doctor-RAG: Failure-Aware Repair for Agentic Retrieval-Augmented Generation
Authors:
Shuguang Jiao,
Chengkai Huang,
Shuhan Qi,
Xuan Wang,
Yifan Li,
Lina Yao
Abstract:
Agentic Retrieval-Augmented Generation (Agentic RAG) has become a widely adopted paradigm for multi-hop question answering and complex knowledge reasoning, where retrieval and reasoning are interleaved at inference time. As reasoning trajectories grow longer, failures become increasingly common. Existing approaches typically address such failures by either stopping at diagnostic analysis or rerunning the entire retrieval-reasoning pipeline, which leads to substantial computational overhead and redundant reasoning. In this paper, we propose Doctor-RAG (DR-RAG), a unified diagnose-and-repair framework that corrects failures in Agentic RAG through explicit error localization and prefix reuse, enabling minimal-cost intervention. DR-RAG decomposes failure handling into two consecutive stages: (i) trajectory-level failure diagnosis and localization, which attributes errors to a coverage-gated taxonomy and identifies the earliest failure point in the reasoning trajectory; and (ii) tool-conditioned local repair, which intervenes only at the diagnosed failure point while maximally reusing validated reasoning prefixes and retrieved evidence. By explicitly separating error attribution from correction, DR-RAG enables precise error localization, thereby avoiding expensive full-pipeline reruns and enabling targeted, efficient repair. We evaluate DR-RAG across three multi-hop question answering benchmarks, multiple agentic RAG baselines, and different backbone models. Experimental results demonstrate that DR-RAG substantially improves answer accuracy while significantly reducing reasoning token consumption compared to rerun-based repair strategies.
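The diagnose-and-repair control flow can be sketched as follows (illustrative only; `diagnose` and `repair_step` are hypothetical stand-ins for the paper's failure-localization and tool-conditioned repair components):

```python
def repair_trajectory(steps, diagnose, repair_step):
    """Locate the earliest failing step in a reasoning trajectory,
    reuse the validated prefix verbatim, and intervene only at the
    diagnosed failure point, avoiding a full-pipeline rerun.
    """
    for i, step in enumerate(steps):
        if not diagnose(step):
            # steps[:i] is the validated prefix, reused at zero cost;
            # only the failing step is re-executed.
            return steps[:i] + [repair_step(step)] + steps[i + 1:]
    return steps  # no failure diagnosed: trajectory kept as-is
```

The token savings the abstract reports come from exactly this asymmetry: the prefix is reused, and only the local repair consumes new reasoning.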
Submitted 1 April, 2026;
originally announced April 2026.
-
Learning to Learn-at-Test-Time: Language Agents with Learnable Adaptation Policies
Authors:
Zhanzhi Lou,
Hui Chen,
Yibo Li,
Qian Wang,
Bryan Hooi
Abstract:
Test-Time Learning (TTL) enables language agents to iteratively refine their performance through repeated interactions with the environment at inference time. At the core of TTL is an adaptation policy that updates the actor policy based on experience from previous episodes, thereby improving future behavior. Existing methods rely on fixed, hand-crafted adaptation policies rather than optimizing them for downstream improvement. We argue that optimal adaptation policies should be learned from task environments, not hand-engineered based on human intuition. To achieve this, we introduce Meta-TTL, a framework that formulates the discovery of effective adaptation policies as a bi-level optimization problem. Within this framework, the inner loop executes the standard TTL process, measuring how effectively a candidate adaptation policy helps an agent correct errors across sequential episodes. Guided by the agent's performance, the outer loop employs evolutionary search over a diverse distribution of training tasks to iteratively refine the adaptation policy. We evaluate Meta-TTL on Jericho and WebArena-Lite across both in-distribution (ID) and out-of-distribution (OOD) settings, using multiple meta-agent backbones. Results on both benchmarks show that Meta-TTL consistently outperforms hand-crafted baselines, suggesting that the optimized adaptation policy encodes transferable strategies that generalize beyond the training task distribution.
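The bi-level loop can be sketched as a simple evolutionary search (a hedged illustration: `inner_eval` abstracts the inner TTL episodes, `mutate` the outer-loop proposal step, and all names are ours, not the paper's API):

```python
import random

def meta_ttl_search(inner_eval, init_policy, mutate,
                    generations=20, pop=8, seed=0):
    """Bi-level optimization sketch: the outer loop evolves an adaptation
    policy; the inner loop (inner_eval) runs test-time learning with a
    candidate policy on training tasks and returns a scalar score.
    """
    rng = random.Random(seed)
    best, best_score = init_policy, inner_eval(init_policy)
    for _ in range(generations):
        # Outer loop: propose a population of mutations of the incumbent.
        for cand in [mutate(best, rng) for _ in range(pop)]:
            score = inner_eval(cand)      # inner loop: full TTL episode(s)
            if score > best_score:
                best, best_score = cand, score
    return best, best_score
```

On a toy objective the search converges to the optimum, which is all this sketch is meant to show; the paper's inner loop is an agent correcting errors across sequential episodes.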
Submitted 2 April, 2026; v1 submitted 1 April, 2026;
originally announced April 2026.
-
Optimal Brain Decomposition for Accurate LLM Low-Rank Approximation
Authors:
Yuhang Li,
Donghyun Lee,
Ruokai Yin,
Priyadarshini Panda
Abstract:
Low-rank decomposition has emerged as an important problem in Large Language Model (LLM) fine-tuning and inference. Through Singular Value Decomposition (SVD), a weight matrix can be optimally factorized into low-rank spaces. A common prior practice is to decompose the weight in the activation-whitened space, which achieves satisfactory results. In this work, we propose Optimal Brain Decomposition LLM (OBD-LLM), which studies the decomposition problem in the model space by utilizing second-order Hessian information. Through a rigorous Kronecker factorization of the Hessian, we show that the decomposition needs to consider both the input and output information of a layer, achieving much better decomposition results than input-only methods. Our loss-aware decomposition method involves bi-directional whitening of the weight matrix. As a result, OBD-LLM is a closed-form solution for the optimal decomposition of weights in the language model. Remarkably, we achieve ~20-40\% better results than the previous state-of-the-art decomposition method, SVD-LLM.
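A bi-directionally whitened low-rank decomposition can be sketched as follows (a sketch under the assumption that the Kronecker factors reduce to symmetric positive-definite matrices `H_out` and `H_in`; the paper's exact factorization may differ):

```python
import numpy as np

def bidir_whitened_lowrank(W, H_out, H_in, rank):
    """Rank-r approximation of W via SVD in a bi-directionally whitened
    space: whiten on both sides, truncate where Eckart-Young makes the
    truncation optimal under the whitened norm, then map back.
    """
    def sqrt_and_inv(H):
        # Symmetric square root and its inverse via eigendecomposition.
        vals, vecs = np.linalg.eigh(H)
        s = vecs @ np.diag(np.sqrt(vals)) @ vecs.T
        s_inv = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
        return s, s_inv

    So, So_inv = sqrt_and_inv(H_out)
    Si, Si_inv = sqrt_and_inv(H_in)
    U, svals, Vt = np.linalg.svd(So @ W @ Si, full_matrices=False)
    W_tilde = U[:, :rank] @ np.diag(svals[:rank]) @ Vt[:rank, :]
    return So_inv @ W_tilde @ Si_inv   # back to the original model space
```

With identity factors this reduces to plain truncated SVD, which is the input-only baseline the abstract contrasts against.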
Submitted 1 April, 2026;
originally announced April 2026.
-
When Safe Models Merge into Danger: Exploiting Latent Vulnerabilities in LLM Fusion
Authors:
Jiaqing Li,
Zhibo Zhang,
Shide Zhou,
Yuxi Li,
Tianlong Yu,
Kailong Wang
Abstract:
Model merging has emerged as a powerful technique for combining specialized capabilities from multiple fine-tuned LLMs without additional training costs. However, the security implications of this widely-adopted practice remain critically underexplored. In this work, we reveal that model merging introduces a novel attack surface that can be systematically exploited to compromise safety alignment. We present TrojanMerge, a framework that embeds latent malicious components into source models that remain individually benign but produce severely misaligned models when merged. Our key insight is formulating this attack as a constrained optimization problem: we construct perturbations that preserve source model safety through directional consistency constraints, maintain capabilities via Frobenius directional alignment constraints, yet combine during merging to form pre-computed attack vectors. Extensive experiments across 9 LLMs from 3 model families demonstrate that TrojanMerge consistently achieves high harmful response rates in merged models while source models maintain safety scores comparable to unmodified versions. Our attack succeeds across diverse merging algorithms and remains effective under various hyperparameter configurations. These findings expose fundamental vulnerabilities in current model merging practices and highlight the urgent need for security-aware mechanisms.
Submitted 1 April, 2026;
originally announced April 2026.
-
On the average-case complexity landscape for Tensor-Isomorphism-complete problems over finite fields
Authors:
Tiange Li,
Yinan Li,
Youming Qiao,
Dacheng Tao,
Yingjie Wang
Abstract:
In Grochow and Qiao (SIAM J. Comput., 2021), the complexity class Tensor Isomorphism (TI) was introduced and isomorphism problems for groups, algebras, and polynomials were shown to be TI-complete. In this paper, we study average-case algorithms for several TI-complete problems over finite fields, including algebra isomorphism, matrix code conjugacy, and $4$-tensor isomorphism.
Our main results are as follows. Over the finite field of order $q$, we devise (1) average-case polynomial-time algorithms for algebra isomorphism and matrix code conjugacy that succeed on a $1/Θ(q)$ fraction of inputs and (2) an average-case polynomial-time algorithm for $4$-tensor isomorphism that succeeds on a $1/q^{Θ(1)}$ fraction of inputs. Prior to our work, algorithms for algebra isomorphism with rigorous average-case analyses ran in exponential time, albeit succeeding on a larger fraction of inputs (Li--Qiao, FOCS'17; Brooksbank--Li--Qiao--Wilson, ESA'20; Grochow--Qiao--Tang, STACS'21).
These results reveal a finer landscape of the average-case complexities of TI-complete problems, providing guidance for cryptographic systems based on isomorphism problems. Our main technical contribution is to introduce the spectral properties of random matrices into algorithms for TI-complete problems. This leads to not only new algorithms but also new questions in random matrix theory over finite fields. To settle these questions, we need to extend both the generating function approach as in Neumann and Praeger (J. London Math. Soc., 1998) and the characteristic sum method of Gorodetsky and Rodgers (Trans. Amer. Math. Soc., 2021).
Submitted 1 April, 2026;
originally announced April 2026.
-
Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning
Authors:
Eric Hanchen Jiang,
Levina Li,
Rui Sun,
Xiao Liang,
Yubei Li,
Yuchen Wu,
Haozheng Luo,
Hengli Li,
Zhi Zhang,
Zhaolu Kang,
Kai-Wei Chang,
Ying Nian Wu
Abstract:
Large Language Models (LLMs) have shown remarkable performance in completing various tasks. However, solving complex problems often requires the coordination of multiple agents, raising a fundamental question: how to effectively select and interconnect these agents. In this paper, we propose \textbf{Agent Q-Mix}, a reinforcement learning framework that reformulates topology selection as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. Our method learns decentralized communication decisions using QMIX value factorization, where each agent selects from a set of communication actions that jointly induce a round-wise communication graph. At its core, Agent Q-Mix combines a topology-aware GNN encoder, GRU memory, and per-agent Q-heads under a Centralized Training with Decentralized Execution (CTDE) paradigm. The framework optimizes a reward function that balances task accuracy with token cost. Across seven core benchmarks in coding, reasoning, and mathematics, Agent Q-Mix achieves the highest average accuracy compared to existing methods while demonstrating superior token efficiency and robustness against agent failure. Notably, on the challenging Humanity's Last Exam (HLE) using Gemini-3.1-Flash-Lite as a backbone, Agent Q-Mix achieves 20.8\% accuracy, outperforming Microsoft Agent Framework (19.2\%) and LangGraph (19.2\%), followed by AutoGen and Lobster by OpenClaw. These results underscore the effectiveness of learned, decentralized topology optimization in pushing the boundaries of multi-agent reasoning.
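The monotonic value-factorization idea that Agent Q-Mix builds on can be illustrated with a minimal QMIX-style mixer (a generic sketch, not the paper's architecture; the state-conditioned hypernetwork that generates the mixing weights is omitted, and ReLU stands in for the usual ELU):

```python
import numpy as np

def qmix_total(agent_qs, w1, b1, w2, b2):
    """QMIX-style mixing: per-agent Q-values pass through a network whose
    weights are constrained non-negative, so dQ_tot/dQ_i >= 0 and the
    joint argmax decomposes into per-agent argmaxes (CTDE-friendly).
    """
    w1 = np.abs(w1)                                # non-negativity gives
    w2 = np.abs(w2)                                # monotonicity in each Q_i
    hidden = np.maximum(agent_qs @ w1 + b1, 0.0)   # monotone nonlinearity
    return float(hidden @ w2 + b2)
```

The test below checks the defining property: raising any single agent's Q-value can never decrease the mixed total.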
Submitted 31 March, 2026;
originally announced April 2026.
-
Not Just Duolingo: Supporting Immigrant Language Preservation Through Family-Based Play
Authors:
Alejandro Ciuba,
Zheng YY Li,
Aakash Gautam
Abstract:
For immigrants, language preservation is crucial to maintaining their identity, but the process of immigration can put a strain on a community's ability to do so. We interviewed eight Nepali immigrants to understand barriers to language preservation across sociopolitical contexts in Nepal and immigrant life in the United States. Participants described strong motivation but limited institutional support, time and resource constraints, and English-dominant environments that widen parent-child language gaps. They envisioned technology that supports interactive, family-centered learning. In response, we are developing an audio-first, point-and-click language learning game based on the theory of comprehensible input, designed for parent-child co-playing. An early evaluation with four design experts reveals promising gameplay and a need to simplify the symbol-heavy UI. We conclude with implications for designing language technologies that support preservation through relationships while acknowledging the limits of design.
△ Less
Submitted 31 March, 2026;
originally announced April 2026.
-
The Mystery Deepens: On the Query Complexity of Tarski Fixed Points
Authors:
Xi Chen,
Yuhao Li,
Mihalis Yannakakis
Abstract:
We give an $O(\log^2 n)$-query algorithm for finding a Tarski fixed point over the $4$-dimensional lattice $[n]^4$, matching the $\Omega(\log^2 n)$ lower bound of [EPRY20]. Additionally, our algorithm yields an ${O(\log^{\lceil (k-1)/3\rceil+1} n)}$-query algorithm for any constant $k$, improving the previous best upper bound ${O(\log^{\lceil (k-1)/2\rceil+1} n)}$ of [CL22].
Our algorithm uses a new…
▽ More
We give an $O(\log^2 n)$-query algorithm for finding a Tarski fixed point over the $4$-dimensional lattice $[n]^4$, matching the $\Omega(\log^2 n)$ lower bound of [EPRY20]. Additionally, our algorithm yields an ${O(\log^{\lceil (k-1)/3\rceil+1} n)}$-query algorithm for any constant $k$, improving the previous best upper bound ${O(\log^{\lceil (k-1)/2\rceil+1} n)}$ of [CL22].
Our algorithm uses a new framework based on \emph{safe partial-information} functions. The latter were introduced in [CLY23] to give a reduction from the Tarski problem to its promised version with a unique fixed point. This is the first time they are directly used to design new algorithms for Tarski fixed points.
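For readers unfamiliar with the query model here: given a monotone function $f:[n]^k \to [n]^k$, the goal is to find $x$ with $f(x)=x$ using as few evaluations of $f$ as possible. The one-dimensional base case is a textbook $O(\log n)$ binary search, which the higher-dimensional algorithms build on (this sketch is the standard 1-D case, not the paper's $[n]^4$ algorithm):

```python
def tarski_1d(f, n):
    """Find a Tarski fixed point of a monotone f: {1..n} -> {1..n}
    in O(log n) queries. Invariant: f(lo) >= lo and f(hi) <= hi,
    so a fixed point exists in [lo, hi] by Knaster-Tarski."""
    lo, hi = 1, n
    while True:
        mid = (lo + hi) // 2
        v = f(mid)
        if v == mid:
            return mid
        if v > mid:
            # monotonicity gives f(mid+1) >= f(mid) >= mid+1,
            # so the invariant holds on [mid+1, hi]
            lo = mid + 1
        else:
            # symmetrically, f(mid-1) <= f(mid) <= mid-1
            hi = mid - 1

# Example: f(x) = x//2 + 5 is monotone on {1..100}; f(9) = 9.
x = tarski_1d(lambda x: x // 2 + 5, 100)
```

The technical difficulty the paper addresses is that in $k \geq 2$ dimensions, halving one coordinate does not cleanly preserve such an invariant, which is where the safe partial-information framework comes in.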
△ Less
Submitted 31 March, 2026;
originally announced April 2026.
-
ContextClaim: A Context-Driven Paradigm for Verifiable Claim Detection
Authors:
Yufeng Li,
Rrubaa Panchendrarajan,
Arkaitz Zubiaga
Abstract:
Verifiable claim detection asks whether a claim expresses a factual statement that can, in principle, be assessed against external evidence. As an early filtering stage in automated fact-checking, it plays an important role in reducing the burden on downstream verification components. However, existing approaches to claim detection, whether based on check-worthiness or verifiability, rely solely o…
▽ More
Verifiable claim detection asks whether a claim expresses a factual statement that can, in principle, be assessed against external evidence. As an early filtering stage in automated fact-checking, it plays an important role in reducing the burden on downstream verification components. However, existing approaches to claim detection, whether based on check-worthiness or verifiability, rely solely on the claim text itself. This is a notable limitation for verifiable claim detection in particular, where determining whether a claim is checkable may benefit from knowing what entities and events it refers to and whether relevant information exists to support verification. Inspired by the established role of evidence retrieval in later-stage claim verification, we propose Context-Driven Claim Detection (ContextClaim), a paradigm that advances retrieval to the detection stage. ContextClaim extracts entity mentions from the input claim, retrieves relevant information from Wikipedia as a structured knowledge source, and employs large language models to produce concise contextual summaries for downstream classification. We evaluate ContextClaim on two datasets covering different topics and text genres, the CheckThat! 2022 COVID-19 Twitter dataset and the PoliClaim political debate dataset, across encoder-only and decoder-only models under fine-tuning, zero-shot, and few-shot settings. Results show that context augmentation can improve verifiable claim detection, although its effectiveness varies across domains, model architectures, and learning settings. Through component analysis, human evaluation, and error analysis, we further examine when and why the retrieved context contributes to more reliable verifiability judgments.
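The pipeline the abstract describes decomposes cleanly into four stages. A toy sketch of that flow (all component names are hypothetical stand-ins, not the paper's interfaces; the real system uses an NER step, Wikipedia retrieval, and an LLM summarizer):

```python
def detect_verifiable(claim, extract_entities, retrieve, summarize, classify):
    """Context-driven claim detection sketch: extract entity mentions,
    retrieve background passages, summarize them into context, then
    classify the claim together with its context."""
    entities = extract_entities(claim)
    passages = [p for e in entities for p in retrieve(e)]
    context = summarize(claim, passages) if passages else ""
    return classify(claim, context)

# Toy stand-ins for the real components.
KB = {"Eiffel Tower": "The Eiffel Tower is a landmark in Paris, France."}

def toy_entities(claim):
    return [e for e in KB if e in claim]

def toy_retrieve(entity):
    return [KB[entity]]

def toy_summarize(claim, passages):
    return " ".join(passages)

def toy_classify(claim, context):
    # Stub decision rule: treat a claim as checkable when any
    # supporting context was retrieved for its entities.
    return "verifiable" if context else "unverifiable"

label = detect_verifiable("The Eiffel Tower is in Paris.",
                          toy_entities, toy_retrieve,
                          toy_summarize, toy_classify)
```

The point of the decomposition is that the classifier sees the claim plus retrieved context rather than the claim text alone, which is the paradigm shift the paper argues for.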
△ Less
Submitted 31 March, 2026;
originally announced March 2026.
-
Think Anywhere in Code Generation
Authors:
Xue Jiang,
Tianyu Zhang,
Ge Li,
Mengyang Liu,
Taozhi Chen,
Zhenhua Xu,
Binhua Li,
Wenpin Jiao,
Zhi Jin,
Yongbin Li,
Yihong Dong
Abstract:
Recent advances in reasoning Large Language Models (LLMs) have primarily relied on upfront thinking, where reasoning occurs before the final answer. However, this approach suffers from critical limitations in code generation, where upfront thinking is often insufficient because a problem's full complexity reveals itself only during code implementation. Moreover, it cannot adaptively allocate reasoning effort…
▽ More
Recent advances in reasoning Large Language Models (LLMs) have primarily relied on upfront thinking, where reasoning occurs before the final answer. However, this approach suffers from critical limitations in code generation, where upfront thinking is often insufficient because a problem's full complexity reveals itself only during code implementation. Moreover, it cannot adaptively allocate reasoning effort throughout a code generation process in which difficulty varies significantly. In this paper, we propose Think-Anywhere, a novel reasoning mechanism that enables LLMs to invoke thinking on-demand at any token position during code generation. We achieve Think-Anywhere by first teaching LLMs to imitate the reasoning patterns through cold-start training, then leveraging outcome-based RL rewards to drive the model's autonomous exploration of when and where to invoke reasoning. Extensive experiments on four mainstream code generation benchmarks (i.e., LeetCode, LiveCodeBench, HumanEval, and MBPP) show that Think-Anywhere achieves state-of-the-art performance, surpassing both existing reasoning methods and recent post-training approaches, while demonstrating consistent generalization across diverse LLMs. Our analysis further reveals that Think-Anywhere enables the model to adaptively invoke reasoning at high-entropy positions, providing enhanced interpretability.
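The abstract's observation that reasoning gets invoked at high-entropy positions suggests a simple mental model: a decoding loop that pauses to think whenever the next-token distribution is uncertain. A toy sketch of that control flow (`step_fn` and `think_fn` are hypothetical stand-ins for model calls; the paper's mechanism is learned via cold-start training and RL, not a fixed entropy threshold):

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def generate(step_fn, think_fn, max_tokens, tau=0.5):
    """Toy decoding loop for on-demand, position-wise reasoning:
    when next-token entropy exceeds tau, emit a thinking segment
    and re-decode before committing the token."""
    out = []
    for _ in range(max_tokens):
        probs, token = step_fn(out)
        if entropy(probs) > tau:
            out.extend(think_fn(out))    # inline reasoning tokens
            probs, token = step_fn(out)  # re-decode after thinking
        out.append(token)
    return out

# Deterministic stubs: the first step is uncertain, later steps are not.
def step(out):
    if not out:
        return ([0.5, 0.5], "x")  # entropy ~0.69 > tau: triggers thinking
    return ([1.0], "y")           # entropy 0: no thinking

def think(out):
    return ["<think>", "</think>"]

tokens = generate(step, think, max_tokens=2)
```

The contrast with upfront thinking is that the think/no-think decision is made per position during generation, rather than once before the answer begins.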
△ Less
Submitted 2 April, 2026; v1 submitted 31 March, 2026;
originally announced March 2026.
-
Four Generations of Quantum Biomedical Sensors
Authors:
Xin Jin,
Priyam Srivastava,
Ronghe Wang,
Yuqing Li,
Jonathan Beaumariage,
Tom Purdy,
M. V. Gurudev Dutt,
Kang Kim,
Kaushik Seshadreesan,
Junyu Liu
Abstract:
Quantum sensing technologies offer transformative potential for ultra-sensitive biomedical sensing, yet their clinical translation remains constrained by classical noise limits and a reliance on macroscopic ensembles. We propose a unifying generational framework to organize the evolving landscape of quantum biosensors based on their utilization of quantum resources. First-generation devices utiliz…
▽ More
Quantum sensing technologies offer transformative potential for ultra-sensitive biomedical sensing, yet their clinical translation remains constrained by classical noise limits and a reliance on macroscopic ensembles. We propose a unifying generational framework to organize the evolving landscape of quantum biosensors based on their utilization of quantum resources. First-generation devices utilize discrete energy levels for signal transduction but follow classical scaling laws. Second-generation sensors exploit quantum coherence to reach the standard quantum limit, while third-generation architectures leverage entanglement and spin squeezing to approach Heisenberg-limited precision. We further define an emerging fourth generation characterized by the end-to-end integration of quantum sensing with quantum learning and variational circuits, enabling adaptive inference directly within the quantum domain. By analyzing critical parameters such as bandwidth matching and sensor-tissue proximity, we identify key technological bottlenecks and propose a roadmap for transitioning from measuring physical observables to extracting structured biological information with quantum-enhanced intelligence.
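The quantitative core of the generational framework is the scaling gap between the standard quantum limit and the Heisenberg limit. A back-of-the-envelope comparison for an $n$-probe phase measurement (a textbook scaling relation, not a model of any specific device in the paper):

```python
import math

def phase_uncertainty(n, generation):
    """Illustrative precision scaling for n probes: second-generation
    coherent sensors are bounded by the standard quantum limit
    (~1/sqrt(n)); third-generation entangled sensors approach the
    Heisenberg limit (~1/n)."""
    if generation == 2:
        return 1.0 / math.sqrt(n)  # standard quantum limit
    if generation == 3:
        return 1.0 / n             # Heisenberg limit
    raise ValueError("scaling shown only for generations 2 and 3")

# With 10,000 probes, entanglement buys roughly a 100x precision gain.
gain = phase_uncertainty(10_000, 2) / phase_uncertainty(10_000, 3)
```

This sqrt(n)-factor gap is what motivates the third generation's use of entanglement and spin squeezing for weak biomedical signals.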
△ Less
Submitted 2 April, 2026; v1 submitted 31 March, 2026;
originally announced March 2026.