-
Many Ways to Be Fake: Benchmarking Fake News Detection Under Strategy-Driven AI Generation
Authors:
Xinyu Wang,
Sai Koneru,
Wenbo Zhang,
Wenliang Zheng,
Saksham Ranjan,
Sarah Rajtmajer
Abstract:
Recent advances in large language models (LLMs) have enabled the large-scale generation of highly fluent and deceptive news-like content. While prior work has often treated fake news detection as a binary classification problem, modern fake news increasingly arises through human-AI collaboration, where strategic inaccuracies are embedded within otherwise accurate and credible narratives. These mixed-truth cases represent a realistic and consequential threat, yet they remain underrepresented in existing benchmarks. To address this gap, we introduce MANYFAKE, a synthetic benchmark containing 6,798 fake news articles generated through multiple strategy-driven prompting pipelines that capture many ways fake news can be constructed and refined. Using this benchmark, we evaluate a range of state-of-the-art fake news detectors. Our results show that even advanced reasoning-enabled models approach saturation on fully fabricated stories, but remain brittle when falsehoods are subtle, optimized, and interwoven with accurate information.
Submitted 10 April, 2026;
originally announced April 2026.
-
Label Leakage Attacks in Machine Unlearning: A Parameter and Inversion-Based Approach
Authors:
Weidong Zheng,
Kongyang Chen,
Yao Huang,
Yuanwei Guo,
Yatie Xiao
Abstract:
With the widespread application of artificial intelligence technologies in face recognition and other fields, data privacy security issues have received extensive attention, especially the \textit{right to be forgotten} emphasized by numerous privacy protection laws. Existing technologies have proposed various unlearning methods, but they may inadvertently leak the categories of unlearned data. This paper focuses on the category unlearning scenario, analyzes the potential problems of category leakage of unlearned data in multiple scenarios, and proposes four attack methods from the perspectives of model parameters and model inversion based on attackers with different knowledge backgrounds. At the level of model parameters, we construct discriminative features by computing either dot products or vector differences between the parameters of the target model and those of auxiliary models trained on subsets of retained data and unrelated data, respectively. These features are then processed via k-means clustering, Youden's Index, and decision tree algorithms to achieve accurate identification of the forgotten class. In the model inversion domain, we design a gradient optimization-based white-box attack and a genetic algorithm-based black-box attack to reconstruct class-prototypical samples. The prediction profiles of these synthesized samples are subsequently analyzed using a threshold criterion and an information entropy criterion to infer the forgotten class. We evaluate the proposed attacks on four standard datasets against five state-of-the-art unlearning algorithms, providing a detailed analysis of the strengths and limitations of each method. Experimental results demonstrate that our approach can effectively infer the classes forgotten by the target model.
Submitted 7 April, 2026;
originally announced April 2026.
-
BLEG: LLM Functions as Powerful fMRI Graph-Enhancer for Brain Network Analysis
Authors:
Rui Dong,
Zitong Wang,
Jiaxing Li,
Weihuang Zheng,
Youyong Kong
Abstract:
Graph Neural Networks (GNNs) have been widely used in diverse brain network analysis tasks based on preprocessed functional magnetic resonance imaging (fMRI) data. However, their performance is constrained by high feature sparsity and the inherent limitations of domain knowledge within uni-modal neurographs. Meanwhile, large language models (LLMs) have demonstrated powerful representation capabilities. Combining LLMs with GNNs presents a promising direction for brain network analysis. While LLMs and MLLMs have emerged in neuroscience, integration of LLMs with graph-based data remains unexplored. In this work, we address these issues by incorporating the LLM's powerful representation and generalization capabilities. Considering the great cost of directly tuning LLMs, we instead use the LLM as an enhancer to boost the GNN's performance on downstream tasks. Our method, named BLEG, can be divided into three stages. We first prompt the LLM to obtain augmented texts for fMRI graph data, then design an LLM-LM instruction tuning method to obtain enhanced textual representations at a relatively lower cost. A GNN is trained together for coarsened alignment. Finally, we fine-tune an adapter after the GNN for given downstream tasks. An alignment loss between LM and GNN logits is designed to further enhance the GNN's representation. Extensive experiments on different datasets confirmed BLEG's superiority. Code is available at https://github.com/KamonRiderDR/BLEG.
Submitted 10 April, 2026; v1 submitted 1 April, 2026;
originally announced April 2026.
-
BiDexGrasp: Coordinated Bimanual Dexterous Grasps across Object Geometries and Sizes
Authors:
Mu Lin,
Yi-Lin Wei,
Jiaxuan Chen,
Yuhao Lin,
Shuoyu Chen,
Jiangran Lyu,
Jiayi Chen,
Yansong Tang,
He Wang,
Wei-Shi Zheng
Abstract:
Bimanual dexterous grasping is a fundamental and promising area in robotics, yet its progress is constrained by the lack of comprehensive datasets and powerful generation models. In this work, we propose BiDexGrasp, which consists of a large-scale bimanual dexterous grasp dataset and a novel generation model. For the dataset, we propose a novel bimanual grasp synthesis pipeline to efficiently annotate physically feasible data for dataset construction. This pipeline addresses the challenges of high-dimensional bimanual grasping through a two-stage synthesis strategy of efficient region-based grasp initialization and decoupled force-closure grasp optimization. Powered by this pipeline, we construct a large-scale bimanual dexterous grasp dataset, comprising 6351 diverse objects with sizes ranging from 30 to 80 cm, along with 9.7 million annotated grasp data. Based on this dataset, we further introduce a bimanual-coordinated and geometry-size-adaptive dexterous grasping generation framework. The framework rests on two key designs: a bimanual coordination module and a geometry-size-adaptive grasp generation strategy to generate coordinated and high-quality grasps on unseen objects. Extensive experiments conducted in both simulation and the real world demonstrate the superior performance of our proposed data synthesis pipeline and learned generative framework.
Submitted 7 April, 2026;
originally announced April 2026.
-
Mechanistic Circuit-Based Knowledge Editing in Large Language Models
Authors:
Tianyi Zhao,
Yinhan He,
Wendy Zheng,
Chen Chen
Abstract:
Deploying Large Language Models (LLMs) in real-world dynamic environments raises the challenge of updating their pre-trained knowledge. While existing knowledge editing methods can reliably patch isolated facts, they frequently suffer from a "Reasoning Gap", where the model recalls the edited fact but fails to utilize it in multi-step reasoning chains. To bridge this gap, we introduce MCircKE (\underline{M}echanistic \underline{Circ}uit-based \underline{K}nowledge \underline{E}diting), a novel framework that enables a precise "map-and-adapt" editing procedure. MCircKE first identifies the causal circuits responsible for a specific reasoning task, capturing both the storage of the fact and the routing of its logical consequences. It then surgically updates parameters exclusively within this mapped circuit. Extensive experiments on the MQuAKE-3K benchmark demonstrate the effectiveness of the proposed method for multi-hop reasoning in knowledge editing.
Submitted 7 April, 2026;
originally announced April 2026.
-
DAG Covers: The Steiner Point Effect
Authors:
Sujoy Bhore,
Hsien-Chih Chang,
Jonathan Conroy,
Arnold Filtser,
Eunjin Oh,
Nicole Wein,
Da Wei Zheng
Abstract:
Given a weighted digraph $G$, a $(t,g,μ)$-DAG cover is a collection of $g$ dominating DAGs $D_1,\dots,D_g$ such that all distances are approximately preserved: for every pair $(u,v)$ of vertices, $\min_id_{D_i}(u,v)\le t\cdot d_{G}(u,v)$, and the total number of non-$G$ edges is bounded by $|(\cup_i D_i)\setminus G|\le μ$. Assadi, Hoppenworth, and Wein [STOC 25] and Filtser [SODA 26] studied DAG covers for general digraphs. This paper initiates the study of \emph{Steiner} DAG cover, where the DAGs are allowed to contain Steiner points.
We obtain Steiner DAG covers on the important classes of planar digraphs and low-treewidth digraphs. Specifically, we show that any digraph with treewidth tw admits a $(1,2,\tilde{O}(n\cdot tw))$-Steiner DAG cover. For planar digraphs we provide a $(1+\varepsilon,2,\tilde{O}_\varepsilon(n))$-Steiner DAG cover.
We also demonstrate a stark difference between Steiner and non-Steiner DAG covers. As a lower bound, we show that any non-Steiner DAG cover for graphs with treewidth $1$ with stretch $t<2$ and sub-quadratic number of extra edges requires $Ω(\log n)$ DAGs.
Submitted 5 April, 2026;
originally announced April 2026.
-
AffectSpeech: A Large-Scale Emotional Speech Dataset with Fine-Grained Textual Descriptions for Speech Emotion Captioning and Synthesis
Authors:
Tianhua Qi,
Wenming Zheng,
Björn W. Schuller,
Zhaojie Luo,
Haizhou Li
Abstract:
Emotion is essential in spoken communication, yet most existing frameworks in speech emotion modeling rely on predefined categories or low-dimensional continuous attributes, which offer limited expressive capacity. Recent advances in speech emotion captioning and synthesis have shown that textual descriptions provide a more flexible and interpretable alternative for representing affective characteristics in speech. However, progress in this direction is hindered by the lack of an emotional speech dataset aligned with reliable and fine-grained natural language annotations. To tackle this, we introduce AffectSpeech, a large-scale corpus of human-recorded speech enriched with structured descriptions for fine-grained emotion analysis and generation. Each utterance is characterized across six complementary dimensions, including sentiment polarity, open-vocabulary emotion captions, intensity level, prosodic attributes, prominent segments, and semantic content, enabling multi-granular modeling of vocal expression. To balance annotation quality and scalability, we adopt a human-LLM collaborative annotation pipeline that integrates algorithmic pre-labeling, multi-LLM description generation, and human-in-the-loop verification. Furthermore, these annotations are reformulated into diverse descriptive styles to enhance linguistic diversity and reduce stylistic bias in downstream modeling. Experimental results on speech emotion captioning and synthesis demonstrate that models trained on AffectSpeech consistently achieve superior performance across multiple evaluation settings.
Submitted 5 April, 2026;
originally announced April 2026.
-
Wired for Overconfidence: A Mechanistic Perspective on Inflated Verbalized Confidence in LLMs
Authors:
Tianyi Zhao,
Yinhan He,
Wendy Zheng,
Yujie Zhang,
Chen Chen
Abstract:
Large language models are often not just wrong, but \emph{confidently wrong}: when they produce factually incorrect answers, they tend to verbalize overly high confidence rather than signal uncertainty. Such verbalized overconfidence can mislead users and weaken confidence scores as a reliable uncertainty signal, yet its internal mechanisms remain poorly understood. We present a circuit-level mechanistic analysis of this inflated verbalized confidence in LLMs, organized around three axes: capturing verbalized confidence as a differentiable internal signal, identifying the circuits that causally inflate it, and leveraging these insights for targeted inference-time recalibration. Across two instruction-tuned LLMs on three datasets, we find that a compact set of MLP blocks and attention heads, concentrated in middle-to-late layers, consistently writes the confidence-inflation signal at the final token position. We further show that targeted inference-time interventions on these circuits substantially improve calibration. Together, our results suggest that verbalized overconfidence in LLMs is driven by identifiable internal circuits and can be mitigated through targeted intervention.
Submitted 1 April, 2026;
originally announced April 2026.
-
DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
Authors:
Sicheng Zuo,
Zixun Xie,
Wenzhao Zheng,
Shaoqing Xu,
Fang Li,
Hanbing Li,
Long Chen,
Zhi-Xin Yang,
Jiwen Lu
Abstract:
End-to-end autonomous driving has evolved from the conventional paradigm based on sparse perception into vision-language-action (VLA) models, which focus on learning language descriptions as an auxiliary task to facilitate planning. In this paper, we propose an alternative Vision-Geometry-Action (VGA) paradigm that advocates dense 3D geometry as the critical cue for autonomous driving. As vehicles operate in a 3D world, we think dense 3D geometry provides the most comprehensive information for decision-making. However, most existing geometry reconstruction methods (e.g., DVGT) rely on computationally expensive batch processing of multi-frame inputs and cannot be applied to online planning. To address this, we introduce a streaming Driving Visual Geometry Transformer (DVGT-2), which processes inputs in an online manner and jointly outputs dense geometry and trajectory planning for the current frame. We employ temporal causal attention and cache historical features to support on-the-fly inference. To further enhance efficiency, we propose a sliding-window streaming strategy and use historical caches within a certain interval to avoid repetitive computations. Despite the faster speed, DVGT-2 achieves superior geometry reconstruction performance on various datasets. The same trained DVGT-2 can be directly applied to planning across diverse camera configurations without fine-tuning, including closed-loop NAVSIM and open-loop nuScenes benchmarks.
Submitted 7 April, 2026; v1 submitted 1 April, 2026;
originally announced April 2026.
-
Large Language Models in Game Development: Implications for Gameplay, Playability, and Player Experience
Authors:
Keeryn Johnson,
Muhammad Ahmed,
Charlie Lang,
Sahib Thethi,
Wilson Zheng,
Ronnie de Souza Santos
Abstract:
This paper investigates how the integration of large language models influences gameplay, playability, and player experience in game development. We report a collaborative autoethnographic study of two game projects in which LLMs were embedded as architectural components. Reflective narratives and development artifacts were analyzed using gameplay, playability, and player experience as guiding constructs. The findings suggest that LLM integration increases variability and personalization while introducing challenges related to correctness, difficulty calibration, and structural coherence across these concepts. The study provides preliminary empirical insight into how generative AI integration reshapes established game constructs and introduces new architectural and quality considerations within game engineering practice.
Submitted 29 March, 2026;
originally announced March 2026.
-
You Only Erase Once: Erasing Anything without Bringing Unexpected Content
Authors:
Yixing Zhu,
Qing Zhang,
Wenju Xu,
Wei-Shi Zheng
Abstract:
We present YOEO, an approach for object erasure. Unlike recent diffusion-based methods, which struggle to erase target objects without generating unexpected content within the masked regions due to a lack of sufficient paired training data and explicit constraints on content generation, our method produces high-quality object erasure results free of unwanted objects or artifacts while faithfully preserving overall context coherence with the surrounding content. We achieve this goal by training an object erasure diffusion model on unpaired data containing only large-scale real-world images, under the supervision of a sundries detector and a context coherence loss that are built upon an entity segmentation model. To enable more efficient training and inference, a diffusion distillation strategy is employed to train a few-step erasure diffusion model. Extensive experiments show that our method outperforms the state-of-the-art object erasure methods. Code will be available at https://zyxunh.github.io/YOEO-ProjectPage/.
Submitted 29 March, 2026;
originally announced March 2026.
-
SGS-Intrinsic: Semantic-Invariant Gaussian Splatting for Sparse-View Indoor Inverse Rendering
Authors:
Jiahao Niu,
Rongjia Zheng,
Wenju Xu,
Wei-Shi Zheng,
Qing Zhang
Abstract:
We present SGS-Intrinsic, an indoor inverse rendering framework that works well for sparse-view images. Unlike existing 3D Gaussian Splatting (3DGS) based methods that focus on object-centric reconstruction and fail to work under sparse-view settings, our method achieves high-quality geometry reconstruction and accurate disentanglement of material and illumination. The core idea is to construct a dense and geometry-consistent Gaussian semantic field guided by semantic and geometric priors, providing a reliable foundation for subsequent inverse rendering. Building upon this, we perform material-illumination disentanglement by combining a hybrid illumination model and a material prior to effectively capture illumination-material interactions. To mitigate the impact of cast shadows and enhance the robustness of material recovery, we introduce an illumination-invariant material constraint together with a deshadowing model. Extensive experiments on benchmark datasets show that our method consistently improves both reconstruction fidelity and inverse rendering quality over existing 3DGS-based inverse rendering approaches. Our code is available at https://github.com/GrumpySloths/SGS_Intrinsic.github.io.
Submitted 31 March, 2026; v1 submitted 29 March, 2026;
originally announced March 2026.
-
AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment
Authors:
Jianfei Xiao,
Xiang Yu,
Chengbing Wang,
Wuqiang Zheng,
Xinyu Lin,
Kaining Liu,
Hongxun Ding,
Yang Zhang,
Wenjie Wang,
Fuli Feng,
Xiangnan He
Abstract:
As Large Language Models (LLMs) evolve into lifelong AI assistants, LLM personalization has become a critical frontier. However, progress is currently bottlenecked by the absence of a gold-standard evaluation benchmark. Existing benchmarks either overlook personalized information management that is critical for personalization or rely heavily on synthetic dialogues, which exhibit an inherent distribution gap from real-world dialogue. To bridge this gap, we introduce AlpsBench, An LLM PerSonalization benchmark derived from real-world human-LLM dialogues. AlpsBench comprises 2,500 long-term interaction sequences curated from WildChat, paired with human-verified structured memories that encapsulate both explicit and implicit personalization signals. We define four pivotal tasks - personalized information extraction, updating, retrieval, and utilization - and establish protocols to evaluate the entire lifecycle of memory management. Our benchmarking of frontier LLMs and memory-centric systems reveals that: (i) models struggle to reliably extract latent user traits; (ii) memory updating faces a performance ceiling even in the strongest models; (iii) retrieval accuracy declines sharply in the presence of large distractor pools; and (iv) while explicit memory mechanisms improve recall, they do not inherently guarantee more preference-aligned or emotionally resonant responses. AlpsBench aims to provide a comprehensive framework.
Submitted 9 March, 2026;
originally announced March 2026.
-
SDDF: Specificity-Driven Dynamic Focusing for Open-Vocabulary Camouflaged Object Detection
Authors:
Jiaming Liang,
Yifeng Zhan,
Chunlin Liu,
Weihua Zheng,
Bingye Peng,
Qiwei Liang,
Boyang Cai,
Xiaochun Mai,
Qiang Nie
Abstract:
Open-vocabulary object detection (OVOD) aims to detect known and unknown objects in the open world by leveraging text prompts. Benefiting from the emergence of large-scale vision--language pre-trained models, OVOD has demonstrated strong zero-shot generalization capabilities. However, when dealing with camouflaged objects, the detector often fails to distinguish and localize objects because the visual features of the objects and the background are highly similar. To bridge this gap, we construct a benchmark named OVCOD-D by augmenting carefully selected camouflaged object images with fine-grained textual descriptions. Due to the limited scale of available camouflaged object datasets, we adopt detectors pre-trained on large-scale object detection datasets as our baseline methods, as they possess stronger zero-shot generalization ability. In the specificity-aware sub-descriptions generated by multimodal large models, there still exist confusing and overly decorative modifiers. To mitigate such interference, we design a sub-description principal component contrastive fusion strategy that reduces noisy textual components. Furthermore, to address the challenge that the visual features of camouflaged objects are highly similar to those of their surrounding environment, we propose a specificity-guided regional weak alignment and dynamic focusing method, which aims to strengthen the detector's ability to discriminate camouflaged objects from background. Under the open-set evaluation setting, the proposed method achieves an AP of 56.4 on the OVCOD-D benchmark.
Submitted 27 March, 2026;
originally announced March 2026.
-
Vega: Learning to Drive with Natural Language Instructions
Authors:
Sicheng Zuo,
Yuxuan Li,
Wenzhao Zheng,
Zheng Zhu,
Jie Zhou,
Jiwen Lu
Abstract:
Vision-language-action models have reshaped autonomous driving to incorporate languages into the decision-making process. However, most existing pipelines only utilize the language modality for scene descriptions or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions with the corresponding trajectories. We then propose a unified Vision-Language-World-Action model, Vega, for instruction-based generation and planning. We employ the autoregressive paradigm to process visual inputs (vision) and language instructions (language) and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). We perform joint attention to enable interactions between the modalities and use individual projection layers for different modalities for more capabilities. Extensive experiments demonstrate that our method not only achieves superior planning performance but also exhibits strong instruction-following abilities, paving the way for more intelligent and personalized driving systems.
Submitted 30 March, 2026; v1 submitted 26 March, 2026;
originally announced March 2026.
-
Learning Explicit Continuous Motion Representation for Dynamic Gaussian Splatting from Monocular Videos
Authors:
Xuankai Zhang,
Junjin Xiao,
Shangwei Huang,
Wei-shi Zheng,
Qing Zhang
Abstract:
We present an approach for high-quality dynamic Gaussian Splatting from monocular videos. To this end, we go one step beyond previous methods by explicitly modeling the continuous position and orientation deformation of dynamic Gaussians, using SE(3) B-spline motion bases with a compact set of control points. To improve computational efficiency while enhancing the ability to model complex motions, an adaptive control mechanism is devised to dynamically adjust the number of motion bases and control points. In addition, we develop a soft segment reconstruction strategy to mitigate long-interval motion interference, and employ a multi-view diffusion model to provide multi-view cues for avoiding overfitting to training views. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in novel view synthesis. Our code is available at https://github.com/hhhddddddd/se3bsplinegs.
Submitted 26 March, 2026;
originally announced March 2026.
-
UniQueR: Unified Query-based Feedforward 3D Reconstruction
Authors:
Chensheng Peng,
Quentin Herau,
Jiezhi Yang,
Yichen Xie,
Yihan Hu,
Wenzhao Zheng,
Matthew Strong,
Masayoshi Tomizuka,
Wei Zhan
Abstract:
We present UniQueR, a unified query-based feedforward framework for efficient and accurate 3D reconstruction from unposed images. Existing feedforward models such as DUSt3R, VGGT, and AnySplat typically predict per-pixel point maps or pixel-aligned Gaussians, which remain fundamentally 2.5D and limited to visible surfaces. In contrast, UniQueR formulates reconstruction as a sparse 3D query inference problem. Our model learns a compact set of 3D anchor points that act as explicit geometric queries, enabling the network to infer scene structure, including geometry in occluded regions, in a single forward pass. Each query encodes spatial and appearance priors directly in global 3D space (instead of per-frame camera space) and spawns a set of 3D Gaussians for differentiable rendering. By leveraging unified query interactions across multi-view features and a decoupled cross-attention design, UniQueR achieves strong geometric expressiveness while substantially reducing memory and computational cost. Experiments on Mip-NeRF 360 and VR-NeRF demonstrate that UniQueR surpasses state-of-the-art feedforward methods in both rendering quality and geometric accuracy, using an order of magnitude fewer primitives than dense alternatives.
Submitted 24 March, 2026;
originally announced March 2026.
-
Charting the Diameter Computation Landscape of Geometric Intersection Graphs in Three Dimensions and Higher
Authors:
Timothy M. Chan,
Hsien-Chih Chang,
Jie Gao,
Sándor Kisfaludi-Bak,
Hung Le,
Da Wei Zheng
Abstract:
Recent research on computing the diameter of geometric intersection graphs has made significant strides, primarily focusing on the 2D case where truly subquadratic-time algorithms were given for simple objects such as unit-disks and (axis-aligned) squares. However, in three or higher dimensions, there is no known truly subquadratic-time algorithm for any intersection graph of non-trivial objects, even basic ones such as unit balls or (axis-aligned) unit cubes. This was partially explained by the pioneering work of Bringmann et al. [SoCG '22] which gave several truly subquadratic lower bounds, notably for unit balls or unit cubes in 3D when the graph diameter $Δ$ is at least $Ω(\log n)$, hinting at a pessimistic outlook for the complexity of the diameter problem in higher dimensions. In this paper, we substantially extend the landscape of diameter computation for objects in three and higher dimensions, giving a few positive results. Our highlighted findings include:
- A truly subquadratic-time algorithm for deciding if the diameter of unit cubes in 3D is at most 3 (Diameter-3 hereafter), the first algorithm of its kind for objects in 3D or higher dimensions. Our algorithm is based on a novel connection to pseudolines, which is of independent interest.
- A truly subquadratic-time lower bound for Diameter-3 of unit balls in 3D under the Orthogonal Vector (OV) hypothesis, giving the first separation between unit balls and unit cubes in the small diameter regime. Previously, computing the diameter for both objects was known to be truly subquadratic hard when the diameter is $Ω(\log n)$.
- A near-linear-time algorithm for Diameter-2 of unit cubes in 3D, generalizing the previous result for unit squares in 2D.
- A truly subquadratic-time algorithm and lower bound for Diameter-2 and Diameter-3 of rectangular boxes (of arbitrary dimension and sizes), respectively.
Submitted 23 March, 2026;
originally announced March 2026.
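For context on the Diameter-k results in the entry above, the following is a minimal sketch of the quadratic baseline that a "truly subquadratic" algorithm improves on. It is not the authors' method: it builds the intersection graph of axis-aligned unit cubes explicitly and runs a breadth-first search from every vertex. The function names and the convention of giving each cube by its center are assumptions made for illustration.

```python
from collections import deque
from itertools import combinations


def unit_cube_graph(centers):
    """Adjacency lists of the intersection graph of axis-aligned unit cubes.

    Two cubes of side length 1 intersect exactly when their centers differ
    by at most 1 in every coordinate.
    """
    adj = [[] for _ in centers]
    for i, j in combinations(range(len(centers)), 2):
        if all(abs(a - b) <= 1 for a, b in zip(centers[i], centers[j])):
            adj[i].append(j)
            adj[j].append(i)
    return adj


def eccentricity(adj, source):
    """Largest BFS distance from source; infinity if the graph is disconnected."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    if len(dist) < len(adj):
        return float("inf")
    return max(dist.values())


def diameter_at_most(adj, k):
    """Naive Diameter-k decision for a non-empty graph: one BFS per vertex."""
    return max(eccentricity(adj, s) for s in range(len(adj))) <= k
```

Five cubes centered at (0,0,0) through (4,0,0) form a path, so `diameter_at_most(unit_cube_graph(centers), 3)` is false for them and true for the first four. The pairwise graph construction alone already takes quadratic time, which is the cost the listed algorithms avoid.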
-
Transparent Fragments Contour Estimation via Visual-Tactile Fusion for Autonomous Reassembly
Authors:
Qihao Lin,
Borui Chen,
Yuping Zhou,
Jianing Wu,
Yulan Guo,
Weishi Zheng,
Chongkun Xia
Abstract:
The contour estimation of transparent fragments is important for autonomous reassembly, especially in precision optical instrument repair, cultural relic restoration, and other cases in which precious devices have been broken. Unlike general intact transparent objects, the contour estimation of transparent fragments faces greater challenges due to their strict optical properties and irregular shapes and edges. To address this issue, a general transparent fragment contour estimation framework based on visual-tactile fusion is proposed in this paper. First, we construct a transparent fragment dataset named TransFrag27K, which includes multi-scene synthetic data of broken fragments from multiple types of transparent objects, together with a scalable synthetic data generation pipeline. Second, we propose a visual grasping position detection network named TransFragNet to identify, locate, and segment the sampling grasping position. We then use a two-finger gripper with GelSight Mini sensors to obtain reconstructed tactile information of the lateral edge of the fragments. By fusing this tactile information with visual cues, a visual-tactile fusion material classifier is proposed. Inspired by the way humans estimate a fragment's contour by combining vision and touch, we introduce a general transparent fragment contour estimation framework based on visual-tactile fusion, which demonstrates strong performance in real-world validation. Finally, a contour matching and reassembly algorithm based on multi-dimensional similarity metrics is proposed, providing a reproducible benchmark for evaluating visual-tactile contour estimation and fragment reassembly. The experimental results demonstrate the validity of the proposed framework. The dataset and code are available at https://github.com/Keithllin/Transparent-Fragments-Contour-Estimation.
Submitted 18 March, 2026;
originally announced March 2026.
-
DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding
Authors:
Dong Zhuo,
Wenzhao Zheng,
Sicheng Zuo,
Siming Yan,
Lu Hou,
Jie Zhou,
Jiwen Lu
Abstract:
With the growing adoption of vision-language-action models and world models in autonomous driving systems, scalable image tokenization becomes crucial as the interface for the visual modality. However, most existing tokenizers are designed for monocular and 2D scenes, leading to inefficiency and inter-view inconsistency when applied to high-resolution multi-view driving scenes. To address this, we propose DriveTok, an efficient 3D driving scene tokenizer for unified multi-view reconstruction and understanding. DriveTok first obtains semantically rich visual features from vision foundation models and then transforms them into the scene tokens with 3D deformable cross-attention. For decoding, we employ a multi-view transformer to reconstruct multi-view features from the scene tokens and use multiple heads to obtain RGB, depth, and semantic reconstructions. We also add a 3D head directly on the scene tokens for 3D semantic occupancy prediction for better spatial awareness. With the multiple training objectives, DriveTok learns unified scene tokens that integrate semantic, geometric, and textural information for efficient multi-view tokenization. Extensive experiments on the widely used nuScenes dataset demonstrate that the scene tokens from DriveTok perform well on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction tasks.
Submitted 19 March, 2026;
originally announced March 2026.
-
Measuring 3D Spatial Geometric Consistency in Dynamic Generated Videos
Authors:
Weijia Dou,
Wenzhao Zheng,
Weiliang Chen,
Yu Zheng,
Jie Zhou,
Jiwen Lu
Abstract:
Recent generative models can produce high-fidelity videos, yet they often exhibit 3D spatial geometric inconsistencies. Existing evaluation methods fail to accurately characterize these inconsistencies: fidelity-centric metrics like FVD are insensitive to geometric distortions, while consistency-focused benchmarks often penalize valid foreground dynamics. To address this gap, we introduce SGC, a metric for evaluating 3D Spatial Geometric Consistency in dynamically generated videos. We quantify geometric consistency by measuring the divergence among multiple camera poses estimated from distinct local regions. Our approach first separates static from dynamic regions, then partitions the static background into spatially coherent sub-regions. We predict depth for each pixel, estimate a local camera pose for each sub-region, and compute the divergence among these poses to quantify geometric consistency. Experiments on real and generative videos demonstrate that SGC robustly quantifies geometric inconsistencies, effectively identifying critical failures missed by existing metrics.
Submitted 19 March, 2026;
originally announced March 2026.
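The last step described in the SGC entry above, computing a divergence among per-region camera poses, can be illustrated with a small sketch. The abstract does not define the divergence, so the choice here (mean pairwise geodesic angle between rotation matrices, ignoring translation) is an assumption, as are all function names. Depth prediction, segmentation, and pose estimation are not shown.

```python
import math
from itertools import combinations


def rot_z(theta):
    """3x3 rotation about the z-axis, used here only to build toy poses."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def geodesic_angle(ra, rb):
    """Angle (radians) of the relative rotation ra^T rb.

    trace(ra^T rb) equals the element-wise product sum of ra and rb.
    """
    trace = sum(ra[i][j] * rb[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)


def pose_divergence(rotations):
    """Mean pairwise geodesic angle; 0.0 when fewer than two poses are given."""
    pairs = list(combinations(rotations, 2))
    if not pairs:
        return 0.0
    return sum(geodesic_angle(a, b) for a, b in pairs) / len(pairs)
```

Under this assumed definition, a geometrically consistent static background yields nearly identical local poses and a divergence close to zero, while sub-regions that imply different camera motions push the value up.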
-
UT-ACA: Uncertainty-Triggered Adaptive Context Allocation for Long-Context Inference
Authors:
Lang Zhou,
Shuxuan Li,
Zhuohao Li,
Shi Liu,
Zhilin Zhao,
Wei-Shi Zheng
Abstract:
Long-context inference remains challenging for large language models due to attention dilution and out-of-distribution degradation. Context selection mitigates this limitation by attending to a subset of key-value cache entries, yet most methods allocate a fixed context budget throughout decoding despite highly non-uniform token-level contextual demands. To address this issue, we propose Uncertainty-Triggered Adaptive Context Allocation (UT-ACA), an inference-time framework that dynamically adjusts the context window based on token-wise uncertainty. UT-ACA learns an uncertainty detector that combines semantic embeddings with logit-based confidence while accounting for uncertainty accumulation across decoding steps. When insufficient evidence is indicated, UT-ACA selectively rolls back, expands the context window, and regenerates the token with additional support. Experiments show that UT-ACA substantially reduces average context usage while preserving generation quality in long-context settings.
Submitted 18 March, 2026;
originally announced March 2026.
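The roll-back-and-expand loop described in the UT-ACA entry above can be sketched as follows. This is a simplified illustration, not the paper's method: it substitutes plain Shannon entropy of the next-token distribution for the learned uncertainty detector, doubles the budget on each retry, and uses a hypothetical `toy_predict` function in place of a language model.

```python
import math


def entropy(probs):
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def decode_step(predict, base_budget, max_budget, threshold):
    """Return (token, budget used) for one decoding step.

    predict(budget) is assumed to return next-token probabilities computed
    while attending to `budget` key-value cache entries.
    """
    budget = base_budget
    while True:
        probs = predict(budget)
        if entropy(probs) <= threshold or budget >= max_budget:
            token = max(range(len(probs)), key=probs.__getitem__)
            return token, budget
        # Uncertain: discard this attempt and retry with a larger context.
        budget = min(2 * budget, max_budget)


def toy_predict(budget):
    """Hypothetical stand-in model: more context gives a sharper prediction."""
    if budget >= 256:
        return [0.05, 0.9, 0.05]
    return [0.4, 0.3, 0.3]
```

With `decode_step(toy_predict, 64, 1024, 0.6)` the step starts at 64 entries, retries at 128, and settles at 256 once the distribution is confident, so confident tokens stay cheap and only uncertain ones pay for a wider context.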
-
DexGrasp-Zero: A Morphology-Aligned Policy for Zero-Shot Cross-Embodiment Dexterous Grasping
Authors:
Yuliang Wu,
Yanhan Lin,
WengKit Lao,
Yuhao Lin,
Yi-Lin Wei,
Wei-Shi Zheng,
Ancong Wu
Abstract:
To meet the demands of increasingly diverse dexterous hand hardware, it is crucial to develop a policy that enables zero-shot cross-embodiment grasping without redundant re-learning. Cross-embodiment alignment is challenging due to heterogeneous hand kinematics and physical constraints. Existing approaches typically predict intermediate motion targets and retarget them to each embodiment, which may introduce errors and violate embodiment-specific limits, hindering transfer across diverse hands. To overcome these limitations, we propose DexGrasp-Zero, a policy that learns universal grasping skills from diverse embodiments, enabling zero-shot transfer to unseen hands. We first introduce a morphology-aligned graph representation that maps each hand's kinematic keypoints to anatomically grounded nodes and equips each node with tri-axial orthogonal motion primitives, enabling structural and semantic alignment across different morphologies. Relying on this graph-based representation, we design a Morphology-Aligned Graph Convolutional Network (MAGCN) to encode the graph for policy learning. MAGCN incorporates a Physical Property Injection mechanism that fuses hand-specific physical constraints into the graph features, enabling adaptive compensation for varying link lengths and actuation limits for precise and stable grasping. Our extensive simulation evaluations on the YCB dataset demonstrate that our policy, jointly trained on four heterogeneous hands (Allegro, Shadow, Schunk, Ability), achieves an 85% zero-shot success rate on unseen hardware (LEAP, Inspire), outperforming the state-of-the-art method by 59.5%. Real-world experiments further evaluate our policy on three robot platforms (LEAP, Inspire, Revo2), achieving an 82% average success rate on unseen objects.
Submitted 17 March, 2026; v1 submitted 17 March, 2026;
originally announced March 2026.
-
Performance Isolation and Semantic Determinism in Efficient GPU Spatial Sharing
Authors:
Zhenyuan Yang,
Wenxin Zheng,
Mingyu Li,
Haibo Chen
Abstract:
Existing GPU spatial sharing systems face a three-way tradeoff: resource utilization, performance isolation, and semantic determinism. Hardware partitioning suffers from hardware under-utilization. Hardware multiplexing fails to avoid performance interference. Recently proposed software-based GPU kernel slicing reshapes floating-point reduction orders, destroying semantic determinism and inducing catastrophic token drift in generative models.
We present CoGPU, a transparent spatial sharing system that resolves this trilemma. CoGPU introduces the GPU coroutine, a novel abstraction that enables logical-to-physical resource decoupling. By dynamically mapping immutable virtual contexts to mutable physical resources via lightweight cooperative migration, CoGPU enables extensible, workload-aware scheduling without altering kernel semantics.
Evaluations demonstrate that CoGPU simultaneously achieves high utilization, strong isolation, and absolute semantic determinism (guaranteeing zero token mismatch). In multi-tenant co-location, it improves training throughput by up to 79.2% over temporal sharing and reduces P99 inference tail latency by 15.1%. Its pluggable architecture supports custom policies; compared to the default policy, a TPOT-FIRST policy further reduces SLO violations by 21.2% under dynamic traffic.
Submitted 3 April, 2026; v1 submitted 16 March, 2026;
originally announced March 2026.
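The CoGPU entry above notes that kernel slicing changes floating-point reduction order. A short CPU-side sketch shows why that alone can change results: floating-point addition is not associative, so the same values summed in differently shaped slices can differ in the last bit. The function names are illustrative and nothing here reflects CoGPU's implementation.

```python
from functools import reduce
from operator import add


def reduce_in_order(values):
    """Left-to-right floating-point sum, like one unsliced reduction."""
    return reduce(add, values, 0.0)


def reduce_sliced(slices):
    """Sum each slice separately, then sum the partial results."""
    return reduce_in_order([reduce_in_order(s) for s in slices])


whole = reduce_in_order([0.1, 0.2, 0.3])     # (0.1 + 0.2) + 0.3
sliced = reduce_sliced([[0.1], [0.2, 0.3]])  # 0.1 + (0.2 + 0.3)
print(whole == sliced)
```

The two totals differ by about one unit in the last place. In a generative model, a difference of that size in a logit can occasionally flip which token is selected, after which the outputs diverge.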
-
Integrating Weather Foundation Model and Satellite to Enable Fine-Grained Solar Irradiance Forecasting
Authors:
Ziqing Ma,
Kai Ying,
Xinyue Gu,
Tian Zhou,
Tianyu Zhu,
Haifan Zhang,
Peisong Niu,
Wang Zheng,
Cong Bai,
Liang Sun
Abstract:
Accurate day-ahead solar irradiance forecasting is essential for integrating solar energy into the power grid. However, it remains challenging due to the pronounced diurnal cycle and inherently complex cloud dynamics. Current methods either lack fine-scale resolution (e.g., numerical weather prediction, weather foundation models) or degrade at longer lead times (e.g., satellite extrapolation). We propose Baguan-solar, a two-stage multimodal framework that fuses forecasts from Baguan, a global weather foundation model, with high-resolution geostationary satellite imagery to produce 24-hour irradiance forecasts at kilometer scale. Its decoupled two-stage design first forecasts day-night continuous intermediates (e.g., cloud cover) and then infers irradiance, while its modality fusion jointly preserves fine-scale cloud structures from satellite imagery and large-scale constraints from Baguan forecasts. Evaluated over East Asia using CLDAS as ground truth, Baguan-solar outperforms strong baselines (including ECMWF IFS, vanilla Baguan, and SolarSeer), reducing RMSE by 16.08% and better resolving cloud-induced transients. An operational deployment of Baguan-solar has supported solar power forecasting in an eastern province of China since July 2025. Our code is accessible at https://github.com/DAMO-DI-ML/Baguansolar.git.
Submitted 17 March, 2026; v1 submitted 16 March, 2026;
originally announced March 2026.
-
UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark
Authors:
Yanlin Li,
Minghui Guo,
Kaiwen Zhang,
Shize Zhang,
Yiran Zhao,
Haodong Li,
Congyue Zhou,
Weijie Zheng,
Yushen Yan,
Shengqiong Wu,
Wei Ji,
Lei Cui,
Furu Wei,
Hao Fei,
Mong-Li Lee,
Wynne Hsu
Abstract:
In real-world multimodal applications, systems usually need to comprehend arbitrarily combined and interleaved multimodal inputs from users, while also generating outputs in any interleaved multimedia form. This capability defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, posing new challenges and opportunities for advancing Multimodal Large Language Models (MLLMs). To foster and benchmark this capability, this paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities: text, image, audio, video, document, code, and 3D, each requiring multiple intertwined reasoning and generation capabilities. We further introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence. In addition, we propose UniMA, an agentic baseline model equipped with traceable reasoning for structured interleaved generation. Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges and directions for advancing unified any-to-any multimodal intelligence. The project page is https://any2any-mllm.github.io/unim.
Submitted 5 March, 2026;
originally announced March 2026.
-
SparkTales: Facilitating Cross-Language Collaborative Storytelling through Coordinator-AI Collaboration
Authors:
Wenxin Zhao,
Peng Zhang,
Hansu Gu,
Haoxuan Zhou,
Xiaojie Huo,
Lin Wang,
Wen Zheng,
Tun Lu,
Ning Gu
Abstract:
Cross-language collaborative storytelling plays a vital role in children's language learning and cultural development, fostering both expressive ability and intercultural awareness. Yet, in practice, children's participation is often shallow, and facilitating such sessions places heavy cognitive and organizational burdens on coordinators, who must coordinate language support, maintain children's engagement, and navigate cultural differences. To address these challenges, we conducted a formative study with coordinators to identify their needs and pain points, which guided the design of SparkTales, an intelligent support system for cross-language collaborative storytelling. SparkTales leverages both individual and common characteristics of participating children to provide coordinators with story frameworks, diverse questions, and comprehension-oriented materials, aiming to reduce coordinators' workload while enhancing children's interactive engagement. Evaluation results show that SparkTales not only significantly increases coordinators' efficiency and quality of guidance but also improves children's participation, providing valuable insights for the design of future intelligent systems supporting cross-language collaboration.
Submitted 4 March, 2026;
originally announced March 2026.
-
NextAds: Towards Next-generation Personalized Video Advertising
Authors:
Yiyan Xu,
Ruoxuan Xia,
Wuqiang Zheng,
Fengbin Zhu,
Wenjie Wang,
Fuli Feng
Abstract:
With the rapid growth of online video consumption, video advertising has become increasingly dominant in the digital advertising landscape. Yet diverse users and viewing contexts make one-size-fits-all ad creatives insufficient for consistent effectiveness, underlining the importance of personalization. In practice, most personalized video advertising systems follow a retrieval-based paradigm, selecting the best creative for each user from a small set of professionally pre-produced ones. Such static and finite inventories limit both the granularity and the timeliness of personalization, and prevent the creatives from being continuously refined based on online user feedback. Recent advances in generative AI make it possible to move beyond retrieval toward optimizing video creatives in a continuous space at serving time.
In this light, we propose NextAds, a generation-based paradigm for next-generation personalized video advertising, and conceptualize NextAds with four core components. To enable comparable research progress, we formulate two representative tasks: personalized creative generation and personalized creative integration, and introduce corresponding lightweight benchmarks. To assess feasibility, we instantiate end-to-end pipelines for both tasks and conduct initial exploratory experiments, demonstrating that GenAI can generate and integrate personalized creatives with encouraging performance. Moreover, we discuss the key challenges and opportunities under this paradigm, aiming to provide actionable insights for both researchers and practitioners and to catalyze progress in personalized video advertising.
Submitted 2 March, 2026;
originally announced March 2026.
-
Trinity: A Scenario-Aware Recommendation Framework for Large-Scale Cold-Start Users
Authors:
Wenhao Zheng,
Wang Lu,
Fangshuang Tang,
Yiyang Lu,
Jun Yang,
Pengcheng Xiong,
Yulan Yan
Abstract:
Early-stage users in a new scenario intensify cold-start challenges, yet prior works often address only parts of the problem through model architecture. Launching a new user experience to replace an established product involves sparse behavioral signals, low-engagement cohorts, and unstable model performance. We argue that effective recommendations require the synergistic integration of feature engineering, model architecture, and stable model updating. We propose Trinity, a framework embodying this principle. Trinity extracts valuable information from existing scenarios while ensuring predictive effectiveness and accuracy in the new scenario. In this paper, we showcase Trinity applied to a billion-user Microsoft product transition. Both offline and online experiments demonstrate that our framework achieves substantial improvements in addressing the combined challenge of new users in new scenarios.
Submitted 28 February, 2026;
originally announced March 2026.
-
Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving
Authors:
Jiangxin Sun,
Feng Xue,
Teng Long,
Chang Liu,
Jian-Fang Hu,
Wei-Shi Zheng,
Nicu Sebe
Abstract:
With advances in imitation learning (IL) and large-scale driving datasets, end-to-end autonomous driving (E2E-AD) has made great progress recently. Currently, IL-based methods have become a mainstream paradigm: models rely on standard driving behaviors given by experts, and learn to minimize the discrepancy between their actions and expert actions. However, this objective of "only driving like the expert" suffers from limited generalization: when encountering rare or unseen long-tail scenarios outside the distribution of expert demonstrations, models tend to produce unsafe decisions in the absence of prior experience. This raises a fundamental question: Can an E2E-AD system make reliable decisions without any expert action supervision? Motivated by this, we propose a unified framework named Risk-aware World Model Predictive Control (RaWMPC) to address this generalization dilemma through robust control, without reliance on expert demonstrations. Practically, RaWMPC leverages a world model to predict the consequences of multiple candidate actions and selects low-risk actions through explicit risk evaluation. To endow the world model with the ability to predict the outcomes of risky driving behaviors, we design a risk-aware interaction strategy that systematically exposes the world model to hazardous behaviors, making catastrophic outcomes predictable and thus avoidable. Furthermore, to generate low-risk candidate actions at test time, we introduce a self-evaluation distillation method to distill risk-avoidance capabilities from the well-trained world model into a generative action proposal network without any expert demonstration. Extensive experiments show that RaWMPC outperforms state-of-the-art methods in both in-distribution and out-of-distribution scenarios, while providing superior decision interpretability.
Submitted 26 February, 2026;
originally announced February 2026.
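The selection step described in the RaWMPC entry above (predict the consequences of several candidate actions, then pick the lowest-risk one) can be sketched on a toy problem. Everything here is an assumption for illustration: a hand-written 1D kinematic rollout stands in for the learned world model, the scene is a vehicle approaching a static obstacle, and the risk terms and weights are made up.

```python
def rollout(state, accel, horizon=10, dt=0.5):
    """Toy stand-in for a world model.

    state is (gap to a static obstacle in m, speed in m/s). Returns the
    smallest gap seen over the horizon and the final speed.
    """
    gap, speed = state
    min_gap = gap
    for _ in range(horizon):
        speed = max(0.0, speed + accel * dt)
        gap -= speed * dt
        min_gap = min(min_gap, gap)
    return min_gap, speed


def risk(min_gap, final_speed, safe_gap=5.0, target_speed=10.0):
    """Explicit risk score: collision, proximity, and lost-progress terms."""
    collision = 1000.0 if min_gap <= 0.0 else 0.0
    proximity = max(0.0, safe_gap - min_gap)
    progress = 0.1 * abs(target_speed - final_speed)
    return collision + proximity + progress


def select_action(state, candidates):
    """Pick the candidate acceleration whose predicted outcome has lowest risk."""
    return min(candidates, key=lambda a: risk(*rollout(state, a)))
```

For a vehicle 40 m from the obstacle at 10 m/s, `select_action((40.0, 10.0), [-2.0, 0.0, 2.0])` chooses braking because the other two rollouts end in a collision; at 200 m it keeps the current speed. No expert action appears anywhere in the loop, which is the point the abstract makes.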
-
Revisiting RAG Retrievers: An Information Theoretic Benchmark
Authors:
Wenqing Zheng,
Dmitri Kalaev,
Noah Fatsi,
Daniel Barcklow,
Owen Reinert,
Igor Melnyk,
Senthil Kumar,
C. Bayan Bruss
Abstract:
Retrieval-Augmented Generation (RAG) systems rely critically on the retriever module to surface relevant context for large language models. Although numerous retrievers have recently been proposed, each built on different ranking principles such as lexical matching, dense embeddings, or graph citations, there remains a lack of systematic understanding of how these mechanisms differ and overlap. Existing benchmarks primarily compare entire RAG pipelines or introduce new datasets, providing little guidance on selecting or combining retrievers themselves. Those that do compare retrievers directly use a limited set of evaluation tools which fail to capture complementary and overlapping strengths. This work presents MIGRASCOPE, a Mutual Information based RAG Retriever Analysis Scope. We revisit state-of-the-art retrievers and introduce principled metrics grounded in information and statistical estimation theory to quantify retrieval quality, redundancy, synergy, and marginal contribution. We further show that if chosen carefully, an ensemble of retrievers outperforms any single retriever. We leverage the developed tools over major RAG corpora to provide unique insights on contribution levels of the state-of-the-art retrievers. Our findings provide a fresh perspective on the structure of modern retrieval techniques and actionable guidance for designing robust and efficient RAG systems.
Submitted 24 February, 2026;
originally announced February 2026.
-
From Perception to Action: An Interactive Benchmark for Vision Reasoning
Authors:
Yuhao Wu,
Maojia Song,
Yihuai Lan,
Lei Wang,
Zhiqiang Hu,
Yao Xiao,
Heng Zhou,
Weihua Zheng,
Dylan Raharja,
Soujanya Poria,
Roy Ka-Wei Lee
Abstract:
Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet, prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents' ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive 3D, physics-driven testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans or to robustly translate perceived structure into effective actions. The project is available at https://social-ai-studio.github.io/CHAIN/.
Submitted 24 February, 2026;
originally announced February 2026.
-
IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning
Authors:
Yinhan He,
Yaochen Zhu,
Mingjia Shi,
Wendy Zheng,
Lin Su,
Xiaoqing Wang,
Qi Guo,
Jundong Li
Abstract:
Large language models increasingly rely on long chains of thought to improve accuracy, yet such gains come with substantial inference-time costs. We revisit token-efficient post-training and argue that existing sequence-level reward-shaping methods offer limited control over how reasoning effort is allocated across tokens. To bridge the gap, we propose IAPO, an information-theoretic post-training framework that assigns token-wise advantages based on each token's conditional mutual information (MI) with the final answer. This yields an explicit, principled mechanism for identifying informative reasoning steps and suppressing low-utility exploration. We provide a theoretical analysis showing that our IAPO can induce monotonic reductions in reasoning verbosity without harming correctness. Empirically, IAPO consistently improves reasoning accuracy while reducing reasoning length by up to 36%, outperforming existing token-efficient RL methods across various reasoning datasets. Extensive empirical evaluations demonstrate that information-aware advantage shaping is a powerful and general direction for token-efficient post-training. The code is available at https://github.com/YinhanHe123/IAPO.
Submitted 22 February, 2026;
originally announced February 2026.
-
VLANeXt: Recipes for Building Strong VLA Models
Authors:
Xiao-Ming Wu,
Bin Fan,
Kang Liao,
Jian-Jian Jiang,
Runze Yang,
Yihang Luo,
Zhonghua Wu,
Wei-Shi Zheng,
Chen Change Loy
Abstract:
Following the rise of large foundation models, Vision-Language-Action models (VLAs) emerged, leveraging strong visual and language understanding for general-purpose policy learning. Yet, the current VLA landscape remains fragmented and exploratory. Although many groups have proposed their own VLA models, inconsistencies in training protocols and evaluation settings make it difficult to identify which design choices truly matter. To bring structure to this evolving space, we reexamine the VLA design space under a unified framework and evaluation setup. Starting from a simple VLA baseline similar to RT-2 and OpenVLA, we systematically dissect design choices along three dimensions: foundational components, perception essentials, and action modelling perspectives. From this study, we distill 12 key findings that together form a practical recipe for building strong VLA models. The outcome of this exploration is a simple yet effective model, VLANeXt. VLANeXt outperforms prior state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and demonstrates strong generalization in real-world experiments. We will release a unified, easy-to-use codebase that serves as a common platform for the community to reproduce our findings, explore the design space, and build new VLA variants on top of a shared foundation.
Submitted 20 February, 2026;
originally announced February 2026.
-
Deep Neural Network Architectures for Electrocardiogram Classification: A Comprehensive Evaluation
Authors:
Yun Song,
Wenjia Zheng,
Tiedan Chen,
Ziyu Wang,
Jiazhao Shi,
Yisong Chen
Abstract:
With the rising prevalence of cardiovascular diseases, electrocardiograms (ECG) remain essential for the non-invasive detection of cardiac abnormalities. This study presents a comprehensive evaluation of deep neural network architectures for automated arrhythmia classification, integrating temporal modeling, attention mechanisms, and ensemble strategies. To address data scarcity in minority classes, the MIT-BIH Arrhythmia dataset was augmented using a Generative Adversarial Network (GAN). We developed and compared four distinct architectures, including Convolutional Neural Networks (CNN), CNN combined with Long Short-Term Memory (CNN-LSTM), CNN-LSTM with Attention, and 1D Residual Networks (ResNet-1D), to capture both local morphological features and long-term temporal dependencies. Performance was rigorously evaluated using accuracy, F1-score, and Area Under the Curve (AUC) with 95% confidence intervals to ensure statistical robustness, while Gradient-weighted Class Activation Mapping (Grad-CAM) was employed to validate model interpretability. Experimental results indicate that the CNN-LSTM model achieved the optimal stand-alone balance between sensitivity and specificity, yielding an F1-score of 0.951. Conversely, the CNN-LSTM-Attention and ResNet-1D models exhibited higher sensitivity to class imbalance. To mitigate this, a dynamic ensemble fusion strategy was introduced; specifically, the Top2-Weighted ensemble achieved the highest overall performance with an F1-score of 0.958. These findings demonstrate that leveraging complementary deep architectures significantly enhances classification reliability, providing a robust and interpretable foundation for intelligent arrhythmia detection systems.
Submitted 7 February, 2026;
originally announced February 2026.
-
Fuse3D: Generating 3D Assets Controlled by Multi-Image Fusion
Authors:
Xuancheng Jin,
Rengan Xie,
Wenting Zheng,
Rui Wang,
Hujun Bao,
Yuchi Huo
Abstract:
Recently, generating 3D assets with the control of condition images has achieved impressive quality. However, existing 3D generation methods are limited to handling a single control objective and lack the ability to utilize multiple images to independently control different regions of a 3D asset, which hinders their flexibility in applications. We propose Fuse3D, a novel method that enables generating 3D assets under the control of multiple images, allowing for the seamless fusion of multi-level regional controls from global views to intricate local details. First, we introduce a Multi-Condition Fusion Module to integrate the visual features from multiple image regions. Then, we propose a method to automatically align user-selected 2D image regions with their associated 3D regions based on semantic cues. Finally, to resolve control conflicts and enhance local control features from multi-condition images, we introduce a Local Attention Enhancement Strategy that flexibly balances region-specific feature fusion. Overall, we introduce the first method capable of controllable 3D asset generation from multiple condition images. The experimental results indicate that Fuse3D can flexibly fuse multiple 2D image regions into coherent 3D structures, resulting in high-quality 3D assets. Code and data for this paper are at https://jinnmnm.github.io/Fuse3d.github.io/.
Submitted 12 November, 2025;
originally announced February 2026.
-
Rethinking ANN-based Retrieval: Multifaceted Learnable Index for Large-scale Recommendation System
Authors:
Jiang Zhang,
Yubo Wang,
Wei Chang,
Lu Han,
Xingying Cheng,
Feng Zhang,
Min Li,
Songhao Jiang,
Wei Zheng,
Harry Tran,
Zhen Wang,
Lei Chen,
Yueming Wang,
Benyu Zhang,
Xiangjun Fan,
Bi Xue,
Qifan Wang
Abstract:
Approximate nearest neighbor (ANN) search is widely used in the retrieval stage of large-scale recommendation systems. In this stage, candidate items are indexed using their learned embedding vectors, and ANN search is executed for each user (or item) query to retrieve a set of relevant items. However, ANN-based retrieval has two key limitations. First, item embeddings and their indices are typically learned in separate stages: indexing is often performed offline after embeddings are trained, which can yield suboptimal retrieval quality, especially for newly created items. Second, although ANN offers sublinear query time, it must still be run for every request, incurring substantial computation cost at industry scale. In this paper, we propose MultiFaceted Learnable Index (MFLI), a scalable, real-time retrieval paradigm that learns multifaceted item embeddings and indices within a unified framework and eliminates ANN search at serving time. Specifically, we construct a multifaceted hierarchical codebook via residual quantization of item embeddings and co-train the codebook with the embeddings. We further introduce an efficient multifaceted indexing structure and mechanisms that support real-time updates. At serving time, the learned hierarchical indices are used directly to identify relevant items, avoiding ANN search altogether. Extensive experiments on real-world data with billions of users show that MFLI improves recall on engagement tasks by up to 11.8%, cold-content delivery by up to 57.29%, and semantic relevance by 13.5% compared with prior state-of-the-art methods. We also deploy MFLI in the system and report online experimental results demonstrating improved engagement, less popularity bias, and higher serving efficiency.
Submitted 17 February, 2026;
originally announced February 2026.
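The MFLI abstract above mentions building a hierarchical codebook via residual quantization of item embeddings. As a rough illustration of that general technique only (not the authors' implementation), the following sketch greedily assigns one code per level and subtracts the chosen codeword before moving to the next level. The function names, the toy `CODEBOOKS`, and the use of fixed rather than co-trained codebooks are all assumptions made for this example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def residual_quantize(embedding: Vector, codebooks: List[List[Vector]]) -> List[int]:
    """Greedy residual quantization: return one codeword index per level."""
    residual = list(embedding)
    codes: List[int] = []
    for codebook in codebooks:
        # Pick the codeword closest to what is still unexplained.
        index = min(
            range(len(codebook)),
            key=lambda i: squared_distance(residual, codebook[i]),
        )
        codes.append(index)
        # Subtract the chosen codeword; the next level encodes the remainder.
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return codes


def reconstruct(codes: List[int], codebooks: List[List[Vector]]) -> List[float]:
    """Sum the selected codewords to approximate the original embedding."""
    total = [0.0] * len(codebooks[0][0])
    for level, index in enumerate(codes):
        total = [t + c for t, c in zip(total, codebooks[level][index])]
    return total


# Hypothetical two-level codebooks (coarse, then fine) in 2 dimensions.
CODEBOOKS = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.25, 0.0], [0.0, 0.25]],
]

codes = residual_quantize([1.0, 0.25], CODEBOOKS)
approximation = reconstruct(codes, CODEBOOKS)
```

The resulting tuple of codes acts as a hierarchical index key: items sharing a prefix of codes fall into the same coarse bucket, which is one way a lookup could replace a nearest-neighbor search at serving time.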
-
Is Online Linear Optimization Sufficient for Strategic Robustness?
Authors:
Yang Cai,
Haipeng Luo,
Chen-Yu Wei,
Weiqiang Zheng
Abstract:
We consider bidding in repeated Bayesian first-price auctions. Bidding algorithms that achieve optimal regret have been extensively studied, but their strategic robustness to the seller's manipulation remains relatively underexplored. Bidding algorithms based on no-swap-regret algorithms achieve both desirable properties, but are suboptimal in terms of statistical and computational efficiency. In contrast, online gradient ascent is the only algorithm that achieves $O(\sqrt{TK})$ regret and strategic robustness [KSS24], where $T$ denotes the number of auctions and $K$ the number of bids.
In this paper, we explore whether simple online linear optimization (OLO) algorithms suffice for bidding algorithms with both desirable properties. Our main result shows that sublinear linearized regret is sufficient for strategic robustness. Specifically, we construct simple black-box reductions that convert any OLO algorithm into a strategically robust no-regret bidding algorithm, in both known and unknown value distribution settings. For the known value distribution case, our reduction yields a bidding algorithm that achieves $O(\sqrt{T \log K})$ regret and strategic robustness (with exponential improvement on the $K$-dependence compared to [KSS24]). For the unknown value distribution case, our reduction gives a bidding algorithm with high-probability $O(\sqrt{T (\log K+\log(T/\delta))})$ regret and strategic robustness, while removing the bounded density assumption made in [KSS24].
Submitted 12 February, 2026;
originally announced February 2026.
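To make the "exponential improvement on the $K$-dependence" in the abstract above concrete, the sketch below compares the two regret rates, $\sqrt{TK}$ and $\sqrt{T \log K}$, numerically. It drops all constants hidden by the $O(\cdot)$ notation and uses the natural logarithm, so the numbers only illustrate growth rates and are not actual regret values; the function names are invented for this example.

```python
import math


def sqrt_tk_rate(num_auctions: int, num_bids: int) -> float:
    """Growth rate sqrt(T * K), constants omitted."""
    return math.sqrt(num_auctions * num_bids)


def sqrt_t_log_k_rate(num_auctions: int, num_bids: int) -> float:
    """Growth rate sqrt(T * log K), constants omitted."""
    return math.sqrt(num_auctions * math.log(num_bids))


T = 10_000
for K in (10, 100, 1_000, 10_000):
    ratio = sqrt_tk_rate(T, K) / sqrt_t_log_k_rate(T, K)
    # The ratio equals sqrt(K / log K), so it grows as the bid grid gets finer.
    print(f"K={K:>6}  sqrt(TK)={sqrt_tk_rate(T, K):>10.1f}  "
          f"sqrt(T log K)={sqrt_t_log_k_rate(T, K):>8.1f}  ratio={ratio:>6.1f}")
```

The dependence on $K$ moves from $\sqrt{K}$ to $\sqrt{\log K}$, which is why refining the bid grid costs far less under the second rate.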
-
3DGSNav: Enhancing Vision-Language Model Reasoning for Object Navigation via Active 3D Gaussian Splatting
Authors:
Wancai Zheng,
Hao Chen,
Xianlong Lu,
Linlin Ou,
Xinyi Yu
Abstract:
Object navigation is a core capability of embodied intelligence, enabling an agent to locate target objects in unknown environments. Recent advances in vision-language models (VLMs) have facilitated zero-shot object navigation (ZSON). However, existing methods often rely on scene abstractions that convert environments into semantic maps or textual representations, causing high-level decision making to be constrained by the accuracy of low-level perception. In this work, we present 3DGSNav, a novel ZSON framework that embeds 3D Gaussian Splatting (3DGS) as persistent memory for VLMs to enhance spatial reasoning. Through active perception, 3DGSNav incrementally constructs a 3DGS representation of the environment, enabling trajectory-guided free-viewpoint rendering of frontier-aware first-person views. Moreover, we design structured visual prompts and integrate them with Chain-of-Thought (CoT) prompting to further improve VLM reasoning. During navigation, a real-time object detector filters potential targets, while VLM-driven active viewpoint switching performs target re-verification, ensuring efficient and reliable recognition. Extensive evaluations across multiple benchmarks and real-world experiments on a quadruped robot demonstrate that our method achieves robust and competitive performance against state-of-the-art approaches. Project page: https://aczheng-cai.github.io/3dgsnav.github.io/
Submitted 12 February, 2026;
originally announced February 2026.
-
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Authors:
Ailin Huang,
Ang Li,
Aobo Kong,
Bin Wang,
Binxing Jiao,
Bo Dong,
Bojun Wang,
Boyu Chen,
Brian Li,
Buyun Ma,
Chang Su,
Changxin Miao,
Changyi Wan,
Chao Lou,
Chen Hu,
Chen Xu,
Chenfeng Yu,
Chengting Feng,
Chengyuan Yao,
Chunrui Han,
Dan Ma,
Dapeng Shi,
Daxin Jiang,
Dehua Ma,
Deshan Sun
, et al. (191 additional authors not shown)
Abstract:
We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency and cost of multi-round agentic interactions. To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use. Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro. By redefining the efficiency frontier, Step 3.5 Flash provides a high-density foundation for deploying sophisticated agents in real-world industrial environments.
Submitted 23 February, 2026; v1 submitted 11 February, 2026;
originally announced February 2026.
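The Step 3.5 Flash abstract above describes an interleaved 3:1 sliding-window/full attention layout and 11B active parameters out of 196B. The sketch below shows one possible reading of such a layout, where every fourth layer uses full attention. The position of the full-attention layer within each group, the layer count used in the example, and all names are assumptions; the abstract does not specify them.

```python
from typing import List


def attention_schedule(num_layers: int, window_per_full: int = 3) -> List[str]:
    """Label each layer, placing one full-attention layer after every
    `window_per_full` sliding-window layers (an assumed ordering)."""
    group_size = window_per_full + 1
    return [
        "full" if (layer + 1) % group_size == 0 else "sliding_window"
        for layer in range(num_layers)
    ]


def active_fraction(active_params_b: float, total_params_b: float) -> float:
    """Share of parameters used per token in a sparse MoE model."""
    return active_params_b / total_params_b


# Hypothetical 8-layer stack, purely for illustration.
schedule = attention_schedule(8)
fraction = active_fraction(11, 196)  # figures taken from the abstract
print(schedule)
print(f"active share per token: {fraction:.1%}")
```

Under this reading, only a quarter of the layers pay the full quadratic attention cost over the context, and roughly 5.6% of the parameters are active for any given token.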
-
TwinRL-VLA: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation
Authors:
Qinwen Xu,
Jiaming Liu,
Rui Zhou,
Shaojun Shi,
Nuowei Han,
Zhuoyang Liu,
Chenyang Gu,
Shuo Gu,
Yang Yue,
Gao Huang,
Wenzhao Zheng,
Sirui Han,
Peng Jia,
Shanghang Zhang
Abstract:
Despite strong generalization capabilities, Vision-Language-Action (VLA) models remain constrained by the high cost of expert demonstrations and insufficient real-world interaction. While online reinforcement learning (RL) has shown promise in improving general foundation models, applying RL to VLA manipulation in real-world settings is still hindered by low exploration efficiency and a restricted exploration space. Through systematic real-world experiments, we observe that the effective exploration space of online RL is closely tied to the data distribution of supervised fine-tuning (SFT). Motivated by this observation, we propose TwinRL, a digital twin-real-world collaborative RL framework designed to scale and guide exploration for VLA models. First, a high-fidelity digital twin is efficiently reconstructed from smartphone-captured scenes, enabling realistic bidirectional transfer between real and simulated environments. During the SFT warm-up stage, we introduce an exploration space expansion strategy using digital twins to broaden the support of the data trajectory distribution. Building on this enhanced initialization, we propose a sim-to-real guided exploration strategy to further accelerate online RL. Specifically, TwinRL performs efficient and parallel online RL in the digital twin prior to deployment, effectively bridging the gap between offline and online training stages. Subsequently, we exploit efficient digital twin sampling to identify failure-prone yet informative configurations, which are used to guide targeted human-in-the-loop rollouts on the real robot. In our experiments, TwinRL approaches 100% success in both in-distribution regions covered by real-world demonstrations and out-of-distribution regions, delivering at least a 30% speedup over prior real-world RL methods and requiring only about 20 minutes on average across four tasks.
Submitted 19 March, 2026; v1 submitted 9 February, 2026;
originally announced February 2026.
-
Reliable and Responsible Foundation Models: A Comprehensive Survey
Authors:
Xinyu Yang,
Junlin Han,
Rishi Bommasani,
Jinqi Luo,
Wenjie Qu,
Wangchunshu Zhou,
Adel Bibi,
Xiyao Wang,
Jaehong Yoon,
Elias Stengel-Eskin,
Shengbang Tong,
Lingfeng Shen,
Rafael Rafailov,
Runjia Li,
Zhaoyang Wang,
Yiyang Zhou,
Chenhang Cui,
Yu Wang,
Wenhao Zheng,
Huichi Zhou,
Jindong Gu,
Zhaorun Chen,
Peng Xia,
Tony Lee,
Thomas Zollo
, et al. (27 additional authors not shown)
Abstract:
Foundation models, including Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), Image Generative Models (i.e., Text-to-Image Models and Image-Editing Models), and Video Generative Models, have become essential tools with broad applications across various domains such as law, medicine, education, finance, science, and beyond. As these models see increasing real-world deployment, ensuring their reliability and responsibility has become critical for academia, industry, and government. This survey addresses the reliable and responsible development of foundation models. We explore critical issues, including bias and fairness, security and privacy, uncertainty, explainability, and distribution shift. Our research also covers model limitations, such as hallucinations, as well as methods like alignment and Artificial Intelligence-Generated Content (AIGC) detection. For each area, we review the current state of the field and outline concrete future research directions. Additionally, we discuss the intersections between these areas, highlighting their connections and shared challenges. We hope our survey fosters the development of foundation models that are not only powerful but also ethical, trustworthy, reliable, and socially responsible.
Submitted 4 February, 2026;
originally announced February 2026.
-
SurfAge-Net: A Hierarchical Surface-Based Network for Interpretable Fine-Grained Brain Age Prediction
Authors:
Rongzhao He,
Dalin Zhu,
Ying Wang,
Songhong Yue,
Leilei Zhao,
Yu Fu,
Dan Wu,
Bin Hu,
Weihao Zheng
Abstract:
Brain age prediction serves as a powerful framework for assessing brain status and detecting deviations associated with neurodevelopmental and neurodegenerative disorders. However, most existing approaches emphasize whole-brain age prediction and therefore overlook the pronounced regional heterogeneity of brain maturation that is crucial for detecting localized atypical trajectories. To address this limitation, we propose a novel spherical surface-based brain age prediction network (SurfAge-Net) that leverages multiple morphological metrics to capture region-specific developmental patterns with enhanced robustness and clinical interpretability. SurfAge-Net establishes a new modeling paradigm by incorporating the connectomic principles of cortical organization: it explicitly models both intra- and inter-hemispheric dependencies through a spatial-channel mixing and a lateralization-aware attention mechanism, enabling the network to characterize the coordinated maturation pattern uniquely associated with each target region. Validated on three fetal and neonatal datasets, SurfAge-Net outperforms existing approaches (global MAE = 0.54, regional MAE = 0.45 in gestational/postmenstrual weeks) and demonstrates strong generalizability across external cohorts. Importantly, it provides spatially precise and biologically interpretable maps of cortical maturation, effectively identifying heterogeneous delays and region-specific abnormalities in atypical developmental populations. These results establish fine-grained brain age prediction as a promising paradigm for advancing neurodevelopmental research and supporting early clinical assessment.
Submitted 28 January, 2026;
originally announced February 2026.
-
Seg-ReSearch: Segmentation with Interleaved Reasoning and External Search
Authors:
Tianming Liang,
Qirui Du,
Jian-Fang Hu,
Haichao Jiang,
Zicheng Lin,
Wei-Shi Zheng
Abstract:
Segmentation based on language has been a popular topic in computer vision. While recent advances in multimodal large language models (MLLMs) have endowed segmentation systems with reasoning capabilities, these efforts remain confined by the frozen internal knowledge of MLLMs, which limits their potential for real-world scenarios that involve up-to-date information or domain-specific concepts. In this work, we propose Seg-ReSearch, a novel segmentation paradigm that overcomes the knowledge bottleneck of existing approaches. By enabling interleaved reasoning and external search, Seg-ReSearch empowers segmentation systems to handle dynamic, open-world queries that extend beyond the frozen knowledge of MLLMs. To effectively train this capability, we introduce a hierarchical reward design that harmonizes initial guidance with progressive incentives, mitigating the dilemma between sparse outcome signals and rigid step-wise supervision. For evaluation, we construct OK-VOS, a challenging benchmark that explicitly requires outside knowledge for video object segmentation. Experiments on OK-VOS and two existing reasoning segmentation benchmarks demonstrate that our Seg-ReSearch improves state-of-the-art approaches by a substantial margin. Code and data will be released at https://github.com/iSEE-Laboratory/Seg-ReSearch.
Submitted 4 February, 2026;
originally announced February 2026.
-
Point2Insert: Video Object Insertion via Sparse Point Guidance
Authors:
Yu Zhou,
Xiaoyan Yang,
Bojia Zi,
Lihan Zhang,
Ruijie Sun,
Weishi Zheng,
Haibin Huang,
Chi Zhang,
Xuelong Li
Abstract:
This paper introduces Point2Insert, a sparse-point-based framework for flexible and user-friendly object insertion in videos, motivated by the growing popularity of accurate, low-effort object placement. Existing approaches face two major challenges: mask-based insertion methods require labor-intensive mask annotations, while instruction-based methods struggle to place objects at precise locations. Point2Insert addresses these issues by requiring only a small number of sparse points instead of dense masks, eliminating the need for tedious mask drawing. Specifically, it supports both positive and negative points to indicate regions that are suitable or unsuitable for insertion, enabling fine-grained spatial control over object locations. The training of Point2Insert consists of two stages. In Stage 1, we train an insertion model that generates objects in given regions conditioned on either sparse-point prompts or a binary mask. In Stage 2, we further train the model on paired videos synthesized by an object removal model, adapting it to video insertion. Moreover, motivated by the higher insertion success rate of mask-guided editing, we leverage a mask-guided insertion model as a teacher to distill reliable insertion behavior into the point-guided model. Extensive experiments demonstrate that Point2Insert consistently outperforms strong baselines and even surpasses models with 10$\times$ more parameters.
Submitted 3 February, 2026;
originally announced February 2026.
-
Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation
Authors:
Haichao Jiang,
Tianming Liang,
Wei-Shi Zheng,
Jian-Fang Hu
Abstract:
Referring Video Object Segmentation (RVOS) aims to segment objects in videos based on textual queries. Current methods mainly rely on large-scale supervised fine-tuning (SFT) of Multi-modal Large Language Models (MLLMs). However, this paradigm suffers from heavy data dependence and limited scalability against the rapid evolution of MLLMs. Although recent zero-shot approaches offer a flexible alternative, their performance remains significantly behind SFT-based methods, due to their straightforward workflow designs. To address these limitations, we propose Refer-Agent, a collaborative multi-agent system with alternating reasoning-reflection mechanisms. This system decomposes RVOS into a step-by-step reasoning process. During reasoning, we introduce a Coarse-to-Fine frame selection strategy to ensure frame diversity and textual relevance, along with a Dynamic Focus Layout that adaptively adjusts the agent's visual focus. Furthermore, we propose a Chain-of-Reflection mechanism, which employs a Questioner-Responder pair to generate a self-reflection chain, enabling the system to verify intermediate results and generate feedback for next-round reasoning refinement. Extensive experiments on five challenging benchmarks demonstrate that Refer-Agent significantly outperforms state-of-the-art methods, including both SFT-based models and zero-shot approaches. Moreover, Refer-Agent is flexible and enables fast integration of new MLLMs without any additional fine-tuning costs. Code will be released at https://github.com/iSEE-Laboratory/Refer-Agent.
Submitted 6 February, 2026; v1 submitted 3 February, 2026;
originally announced February 2026.
-
ObjEmbed: Towards Universal Multimodal Object Embeddings
Authors:
Shenghao Fu,
Yukun Su,
Fengyun Rao,
Jing Lyu,
Xiaohua Xie,
Wei-Shi Zheng
Abstract:
Aligning objects with corresponding textual descriptions is a fundamental challenge and a realistic requirement in vision-language understanding. While recent multimodal embedding models excel at global image-text alignment, they often struggle with fine-grained alignment between image regions and specific phrases. In this work, we present ObjEmbed, a novel MLLM embedding model that decomposes the input image into multiple regional embeddings, each corresponding to an individual object, along with global embeddings. It supports a wide range of visual understanding tasks like visual grounding, local image retrieval, and global image retrieval. ObjEmbed enjoys three key properties: (1) Object-Oriented Representation: It captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final object matching score combines semantic similarity with the predicted IoU, enabling more accurate retrieval. (2) Versatility: It seamlessly handles both region-level and image-level tasks. (3) Efficient Encoding: All objects in an image, along with the full image, are encoded in a single forward pass for high efficiency. Superior performance on 18 diverse benchmarks demonstrates its strong semantic discrimination.
Submitted 2 February, 2026; v1 submitted 2 February, 2026;
originally announced February 2026.
-
DABench-LLM: Standardized and In-Depth Benchmarking of Post-Moore Dataflow AI Accelerators for LLMs
Authors:
Ziyu Hu,
Zhiqing Zhong,
Weijian Zheng,
Zhijing Ye,
Xuwei Tan,
Xueru Zhang,
Zheng Xie,
Rajkumar Kettimuthu,
Xiaodong Yu
Abstract:
The exponential growth of large language models has outpaced the capabilities of traditional CPU and GPU architectures due to the slowdown of Moore's Law. Dataflow AI accelerators present a promising alternative; however, there remains a lack of in-depth performance analysis and standardized benchmarking methodologies for LLM training. We introduce DABench-LLM, the first benchmarking framework designed for evaluating LLM workloads on dataflow-based accelerators. By combining intra-chip performance profiling and inter-chip scalability analysis, DABench-LLM enables comprehensive evaluation across key metrics such as resource allocation, load balance, and resource efficiency. The framework helps researchers rapidly gain insights into underlying hardware and system behaviors, and provides guidance for performance optimizations. We validate DABench-LLM on three commodity dataflow accelerators: Cerebras WSE-2, SambaNova RDU, and Graphcore IPU. Our framework reveals performance bottlenecks and provides specific optimization strategies, demonstrating its generality and effectiveness across a diverse range of dataflow-based AI hardware platforms.
Submitted 4 December, 2025;
originally announced January 2026.
-
MindCine: Multimodal EEG-to-Video Reconstruction with Large-Scale Pretrained Models
Authors:
Tian-Yi Zhou,
Xuan-Hao Liu,
Bao-Liang Lu,
Wei-Long Zheng
Abstract:
Reconstructing human dynamic visual perception from electroencephalography (EEG) signals is of great research significance owing to EEG's non-invasiveness and high temporal resolution. However, EEG-to-video reconstruction remains challenging for two reasons: 1) Single Modality: existing studies align EEG signals solely with the text modality, which ignores other modalities and is prone to overfitting; 2) Data Scarcity: current methods often struggle to converge when trained on limited EEG-video data. To solve these problems, we propose MindCine, a novel framework that achieves high-fidelity video reconstruction from limited data. We employ a multimodal joint learning strategy to incorporate beyond-text modalities in the training stage and leverage a pre-trained large EEG model to relieve the data scarcity issue for decoding semantic information, while a Seq2Seq model with causal attention is specifically designed for decoding perceptual information. Extensive experiments demonstrate that our model outperforms state-of-the-art methods both qualitatively and quantitatively. Additionally, the results underscore the complementary strengths of different modalities and demonstrate that leveraging a large-scale EEG model can further enhance reconstruction performance by alleviating the challenges associated with limited data.
Submitted 26 January, 2026; v1 submitted 26 January, 2026;
originally announced January 2026.
-
ReViP: Mitigating False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance
Authors:
Zhuohao Li,
Yinghao Li,
Jian-Jian Jiang,
Lang Zhou,
Tianyu Zhang,
Jiadong Yin,
Mu Lin,
Yi-Lin Wei,
Wei-Shi Zheng
Abstract:
Vision-Language-Action (VLA) models have advanced robotic manipulation by combining vision, language, and proprioception to predict actions. However, previous methods fuse proprioceptive signals directly with vision-language features, resulting in state-dominant bias and false completions despite visible execution failures. We systematically analyze this failure mode, attributing it to modality imbalance, where policies rely too heavily on internal state progression and underuse visual evidence. To address this, we introduce the first False-Completion Benchmark Suite, featuring eight tasks with three controlled perturbations (Object Drop, Distractor Swap, Relayout) to comprehensively evaluate false completion. Moreover, we propose ReViP, a novel VLA framework with Vision-Proprioception Rebalance to enhance visual grounding and robustness under perturbations. The key insight is to introduce auxiliary progress-aware visual cues to adaptively modulate the coupling between semantic perception and proprioceptive dynamics. Specifically, progress-aware visual cues are extracted by an external Task-Stage Observer, which performs task-relevant reasoning on real-time observations to drive task-stage feature-wise linear modulation, enhancing environmental awareness and mitigating state-driven errors. Extensive experiments show that ReViP effectively mitigates false completion and improves success rates over strong VLA baselines, achieving a 26% gain over the $π_0$ model on our suite, with gains extending to LIBERO, RoboTwin 2.0, and real-world evaluations.
Submitted 11 March, 2026; v1 submitted 23 January, 2026;
originally announced January 2026.