-
CAGenMol: Condition-Aware Diffusion Language Model for Goal-Directed Molecular Generation
Authors:
Yanting Li,
Zhuoyang Jiang,
Enyan Dai,
Lei Wang,
Wen-Cai Ye,
Li Liu
Abstract:
Goal-directed molecular generation requires satisfying heterogeneous constraints such as protein--ligand compatibility and multi-objective drug-like properties, yet existing methods often optimize these constraints in isolation, fail to reconcile conflicting objectives (e.g., affinity vs. safety), and struggle to navigate the non-differentiable chemical space without compromising structural validity. To address these challenges, we propose CAGenMol, a condition-aware discrete diffusion framework over molecular sequences that formulates molecular design as conditional denoising guided by heterogeneous structural and property signals. By coupling discrete diffusion with reinforcement learning, the model aligns the generation trajectory with non-differentiable objectives while preserving chemical validity and diversity. The non-autoregressive nature of the diffusion language model further enables iterative refinement of molecular fragments at inference time. Experiments on structure-conditioned, property-conditioned, and dual-conditioned benchmarks demonstrate consistent improvements over state-of-the-art methods in binding affinity, drug-likeness, and success rate, highlighting the effectiveness of our framework.
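As a loose illustration of the non-autoregressive refinement the abstract describes, the sketch below runs a toy masked-denoising loop over a SMILES-like token sequence. The vocabulary, the unmasking schedule, and the random stand-in for the learned conditional denoiser are all assumptions for illustration, not CAGenMol's actual components:

```python
import random

MASK = "[MASK]"
VOCAB = ["C", "c", "O", "N", "(", ")", "1", "="]  # tiny illustrative SMILES vocab

def toy_denoiser(seq, position):
    """Stand-in for the learned conditional denoiser: here it just samples
    a token uniformly. A real model would condition on the partially
    denoised sequence plus structural and property signals."""
    return random.choice(VOCAB)

def denoise(length=8, steps=4, seed=0):
    """Non-autoregressive iterative refinement: each step unmasks a
    fraction of the remaining masked positions, so fragments anywhere
    in the sequence can be filled in, unlike left-to-right decoding."""
    random.seed(seed)
    seq = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(seq) if t == MASK]
        k = max(1, len(masked) // (steps - step))  # unmask k positions this step
        for i in random.sample(masked, k):
            seq[i] = toy_denoiser(seq, i)
    return seq
```

Because positions are committed in parallel rather than strictly left to right, any fragment can in principle be revisited at inference time, which is the property the abstract exploits for iterative refinement.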
Submitted 13 April, 2026;
originally announced April 2026.
-
Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning
Authors:
Bo Li,
Mingda Wang,
Gexiang Fang,
Shikun Zhang,
Wei Ye
Abstract:
We revisit retrieval-augmented generation (RAG) by embedding retrieval control directly into generation. Instead of treating retrieval as an external intervention, we express retrieval decisions within token-level decoding, enabling end-to-end coordination without additional controllers or classifiers. Under the paradigm of Retrieval as Generation, we propose \textbf{GRIP} (\textbf{G}eneration-guided \textbf{R}etrieval with \textbf{I}nformation \textbf{P}lanning), a unified framework in which the model regulates retrieval behavior through control-token emission. Central to GRIP is \textit{Self-Triggered Information Planning}, which allows the model to decide when to retrieve, how to reformulate queries, and when to terminate, all within a single autoregressive trajectory. This design tightly couples retrieval and reasoning and supports dynamic multi-step inference with on-the-fly evidence integration. To supervise these behaviors, we construct a structured training set covering answerable, partially answerable, and multi-hop queries, each aligned with specific token patterns. Experiments on five QA benchmarks show that GRIP surpasses strong RAG baselines and is competitive with GPT-4o while using substantially fewer parameters.
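A minimal sketch of the control-token decoding loop the abstract implies, with a scripted stand-in for the learned policy. The token names, the retriever, and the script are illustrative assumptions, not GRIP's actual interface:

```python
# Scripted "model" outputs for illustration: a real GRIP policy emits
# these control tokens autoregressively; here each tuple is one planned
# (control token, payload) step.
SCRIPT = [
    ("<retrieve>", "capital of France"),    # decide to retrieve, with a reformulated query
    ("<answer>", "Paris is the capital."),  # integrate evidence, produce an answer
    ("<stop>", None),                       # decide to terminate
]

def toy_retriever(query):
    """Hypothetical retriever; a real system would query an index."""
    return f"[doc about: {query}]"

def grip_decode(question):
    """One trajectory in which control-token emission regulates retrieval:
    when to retrieve, how to reformulate the query, and when to stop,
    all interleaved with generation."""
    context, answer = [question], None
    for token, payload in SCRIPT:
        if token == "<retrieve>":
            context.append(toy_retriever(payload))  # on-the-fly evidence integration
        elif token == "<answer>":
            answer = payload
        elif token == "<stop>":
            break
    return answer, context
```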
Submitted 13 April, 2026;
originally announced April 2026.
-
Learning from Contrasts: Synthesizing Reasoning Paths from Diverse Search Trajectories
Authors:
Peiyang Liu,
Zhirui Chen,
Xi Wang,
Di Liang,
Youru Li,
Zhi Cai,
Wei Ye
Abstract:
Monte Carlo Tree Search (MCTS) has been widely used for automated reasoning data exploration, but current supervision extraction methods remain inefficient. Standard approaches retain only the single highest-reward trajectory, discarding the comparative signals present in the many explored paths. Here we introduce \textbf{Contrastive Reasoning Path Synthesis (CRPS)}, a framework that transforms supervision extraction from a filtering process into a synthesis procedure. CRPS uses a structured reflective process to analyze the differences between high- and low-quality search trajectories, extracting explicit information about strategic pivots and local failure modes. These insights guide the synthesis of reasoning chains that incorporate success patterns while avoiding identified pitfalls. We show empirically that models fine-tuned on just 60K CRPS-synthesized examples match or exceed the performance of baselines trained on 590K examples derived from standard rejection sampling, a 20$\times$ reduction in dataset size. Furthermore, CRPS improves generalization on out-of-domain benchmarks, demonstrating that learning from the contrast between success and failure produces more transferable reasoning capabilities than learning from success alone.
Submitted 13 April, 2026;
originally announced April 2026.
-
Instruction Data Selection via Answer Divergence
Authors:
Bo Li,
Mingda Wang,
Shikun Zhang,
Wei Ye
Abstract:
Instruction tuning relies on large instruction-response corpora whose quality and composition strongly affect downstream performance. We propose Answer Divergence-Guided Selection (ADG), which selects instruction data based on the geometric structure of multi-sample outputs. ADG draws several high-temperature generations per instruction, maps responses into an embedding space, and computes an output divergence score that jointly encodes dispersion magnitude and shape anisotropy. High scores correspond to instructions whose answers are both far apart and multi-modal, rather than clustered paraphrases along a single direction. Across two backbones and three public instruction pools, fine-tuning on only 10K ADG-selected examples consistently outperforms strong selectors on six benchmarks spanning reasoning, knowledge, and coding. Analyses further show that both dispersion magnitude and shape anisotropy are necessary, supporting answer divergence as a practical signal for instruction data selection. Code and appendix are included in the supplementary materials.
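To make the two ingredients of the score concrete, here is a toy 2-D version: dispersion as total variance of the response embeddings, and shape anisotropy as the dominant-eigenvalue share of their covariance. The particular way the two are combined below is an assumption for illustration, not ADG's published formula:

```python
import math

def covariance_2d(points):
    """Sample covariance (population normalization) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return sxx, syy, sxy

def divergence_score(points):
    """Toy divergence score: total variance (dispersion magnitude)
    discounted by anisotropy (dominant-eigenvalue share). High scores
    require answers that are both far apart and spread in more than
    one direction, not paraphrases along a single line."""
    sxx, syy, sxy = covariance_2d(points)
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    lam1 = tr / 2 + disc                       # largest eigenvalue of 2x2 covariance
    dispersion = tr                            # total variance
    anisotropy = lam1 / tr if tr > 0 else 0.0  # 0.5 = isotropic, 1.0 = one line
    return dispersion * (1.0 - anisotropy)
```

Responses strung along a single direction give anisotropy near 1 and hence a near-zero score, while responses spread in several directions keep most of their dispersion, matching the abstract's distinction between multi-modal answers and clustered paraphrases.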
Submitted 12 April, 2026;
originally announced April 2026.
-
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
Authors:
Zunhai Su,
Hengyuan Zhang,
Wei Wu,
Yifan Zhang,
Yaxiu Liu,
He Xiao,
Qingyao Yang,
Yuxuan Sun,
Rui Yang,
Chao Zhang,
Keyu Fan,
Weihao Ye,
Jing Xiong,
Hui Shen,
Chaofan Tao,
Taiqiang Wu,
Zhongwei Wan,
Yulei Qian,
Yuchen Xie,
Ngai Wong
Abstract:
As the foundational architecture of modern machine learning, Transformers have driven remarkable progress across diverse AI domains. Despite their transformative impact, a persistent challenge across various Transformers is Attention Sink (AS), in which a disproportionate amount of attention is focused on a small subset of specific yet uninformative tokens. AS complicates interpretability, significantly affects training and inference dynamics, and exacerbates issues such as hallucinations. In recent years, substantial research has been dedicated to understanding and harnessing AS. However, a comprehensive survey that systematically consolidates AS-related research and offers guidance for future advancements remains lacking. To address this gap, we present the first survey on AS, structured around three key dimensions that define the current research landscape: Fundamental Utilization, Mechanistic Interpretation, and Strategic Mitigation. Our work provides a pivotal contribution by clarifying key concepts and guiding researchers through the evolution and trends of the field. We envision this survey as a definitive resource, empowering researchers and practitioners to effectively manage AS within the current Transformer paradigm, while simultaneously inspiring innovative advancements for the next generation of Transformers. The paper list of this work is available at https://github.com/ZunhaiSu/Awesome-Attention-Sink.
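One common way to quantify the phenomenon (the exact metric varies across the surveyed papers) is to measure how much attention mass lands on the first token. A minimal diagnostic, with an invented threshold, might look like:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of attention logits."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def sink_rate(scores_per_query, threshold=0.3):
    """Fraction of query positions whose attention to the first token
    exceeds a threshold; a high rate signals an attention sink on that
    token. The 0.3 threshold is an illustrative choice."""
    hits = 0
    for scores in scores_per_query:
        attn = softmax(scores)
        if attn[0] > threshold:
            hits += 1
    return hits / len(scores_per_query)
```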
Submitted 11 April, 2026;
originally announced April 2026.
-
GALA: Multimodal Graph Alignment for Bug Localization in Automated Program Repair
Authors:
Zhuoyao Liu,
Zhengran Zeng,
Shu-Dong Huang,
Yang Liu,
Shikun Zhang,
Wei Ye
Abstract:
Large Language Model (LLM)-based Automated Program Repair (APR) has shown strong potential on textual benchmarks, yet struggles in multimodal scenarios where bugs are reported with GUI screenshots. Existing methods typically convert images into plain text, which discards critical spatial relationships and causes a severe disconnect between visual observations and code components, leading localization to degrade into imprecise keyword matching. To bridge this gap, we propose GALA (Graph Alignment for Localization in APR), a framework that shifts multimodal APR from implicit semantic guessing to explicit structural reasoning. GALA operates in four stages: it first constructs an Image UI Graph to capture visual elements and their structural relationships; then performs file-level alignment by cross-referencing this UI graph with repository-level structures (e.g., file references) to locate candidate files; next conducts function-level alignment by reasoning over fine-grained code dependencies (e.g., call graphs) to precisely ground visual elements to corresponding code components; and finally performs patch generation within the grounded code context based on the aligned files and functions. By systematically enforcing both semantic and relational consistency across modalities, GALA establishes a highly accurate visual-to-code mapping. Evaluations on the SWE-bench Multimodal benchmark demonstrate that GALA achieves state-of-the-art performance, highlighting the effectiveness of hierarchical structural alignment.
Submitted 9 April, 2026;
originally announced April 2026.
-
Data Selection for Multi-turn Dialogue Instruction Tuning
Authors:
Bo Li,
Shikun Zhang,
Wei Ye
Abstract:
Instruction-tuned language models increasingly rely on large multi-turn dialogue corpora, but these datasets are often noisy and structurally inconsistent, with topic drift, repetitive chitchat, and mismatched answer formats across turns. We address this from a data selection perspective and propose \textbf{MDS} (Multi-turn Dialogue Selection), a dialogue-level framework that scores whole conversations rather than isolated turns. MDS combines a global coverage stage that performs bin-wise selection in the user-query trajectory space to retain representative yet non-redundant dialogues, with a local structural stage that evaluates within-dialogue reliability through entity-grounded topic grounding and information progress, together with query-answer form consistency for functional alignment. MDS outperforms strong single-turn selectors, dialogue-level LLM scorers, and heuristic baselines on three multi-turn benchmarks and an in-domain Banking test set, achieving the best overall rank across reference-free and reference-based metrics, and is more robust on long conversations under the same training budget. Code and resources are included in the supplementary materials.
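The global coverage stage can be caricatured in one dimension: bin dialogues along a scalar trajectory feature and keep the best-scoring dialogue per bin. The scalar feature and per-dialogue quality score below are stand-ins for MDS's embedding-space machinery, not its actual implementation:

```python
def binwise_select(dialogues, n_bins=4):
    """Toy bin-wise selection: split the feature range into equal-width
    bins and retain the highest-quality dialogue per occupied bin,
    trading redundancy for coverage of the trajectory space."""
    lo = min(d["feat"] for d in dialogues)
    hi = max(d["feat"] for d in dialogues)
    width = (hi - lo) / n_bins or 1.0  # guard against a degenerate range
    bins = {}
    for d in dialogues:
        b = min(int((d["feat"] - lo) / width), n_bins - 1)
        if b not in bins or d["quality"] > bins[b]["quality"]:
            bins[b] = d
    return list(bins.values())
```

Near-duplicate dialogues land in the same bin and only the best survives, while sparsely populated regions of the trajectory space stay represented.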
Submitted 11 April, 2026; v1 submitted 9 April, 2026;
originally announced April 2026.
-
An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models
Authors:
Chengli Xing,
Zhengran Zeng,
Gexiang Fang,
Rui Xie,
Wei Ye,
Shikun Zhang
Abstract:
Recent advancements in code large language models (Code-LLMs) have demonstrated remarkable capabilities in resolving programming-related tasks. Meanwhile, researchers have recognized that the quality of pre-training data is crucial for improving LLM performance. However, most existing research on pre-training data filtering has focused on general datasets, with little attention paid to programming datasets. In this paper, we aim to address this gap by exploring the effectiveness of a widely used general data filtering technique, i.e., data-influence-score filtering, within the context of programming-related datasets. To this end, we first introduce a method for calculating data-influence scores for generative programming tasks, which involves transforming a variety of downstream coding tasks into validation sets and using the model's loss on these sets as a performance metric. Next, we pre-train a Code-LLM with 1 billion parameters from scratch on a dataset of 100 billion code tokens. Based on it, we conduct an extensive empirical study to evaluate the effectiveness of data-influence-score filtering methods. Specifically, we examine how well this technique improves model performance, investigate how the characteristics of beneficial training data vary across different training stages and programming tasks, and assess the feasibility of a prediction-based data-influence-score filtering method. Our findings show that data-influence-score filtering based on validation-set loss can enhance models' programming performance. Moreover, we observe that the criteria for beneficial training data differ significantly across downstream programming tasks.
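The validation-loss-based influence idea can be illustrated with a one-parameter model: take one gradient step on a candidate training example and measure how much the validation loss (the stand-in for the downstream-task proxy) drops. The scalar linear-regression setup is a toy, not the paper's actual estimator:

```python
def val_loss(w, val_set):
    """Mean squared error of the scalar model y = w * x on a validation set."""
    return sum((w * x - y) ** 2 for x, y in val_set) / len(val_set)

def influence_score(w, example, val_set, lr=0.05):
    """First-order influence proxy: update on the candidate example,
    then report the validation-loss reduction. Positive scores mark
    training data that is beneficial for the validation tasks."""
    x, y = example
    grad = 2 * (w * x - y) * x  # d/dw of (w*x - y)^2
    w_after = w - lr * grad
    return val_loss(w, val_set) - val_loss(w_after, val_set)
```

An example consistent with the validation targets pulls the parameter toward them and scores positive; a conflicting example scores negative, which is the signal the filtering uses.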
Submitted 8 April, 2026;
originally announced April 2026.
-
A Gradual Probabilistic Lambda Calculus
Authors:
Wenjia Ye,
Matías Toro,
Federico Olmedo
Abstract:
Probabilistic programming languages have recently gained a lot of attention, in particular due to their applications in domains such as machine learning and differential privacy. To establish invariants of interest, many such languages include some form of static checking in the form of type systems. However, adopting such a type discipline can be cumbersome or overly conservative.
Gradual typing addresses this problem by supporting a smooth transition between static and dynamic checking, and has been successfully applied for languages with different constructs and type abstractions. Nevertheless, its benefits have never been explored in the context of probabilistic languages.
In this work, we present and formalize GPLC, a gradual source probabilistic lambda calculus. GPLC includes a binary probabilistic choice operator and allows programmers to gradually introduce/remove static type -- and probability -- annotations. The static semantics of GPLC heavily relies on the notion of probabilistic couplings, as required for defining several relations, such as consistency, precision, and consistent transitivity. The dynamic semantics of GPLC is given via elaboration to the target language TPLC, which features a distribution-based semantics interpreting programs as probability distributions over final values. Regarding the language metatheory, we establish that TPLC -- and therefore also GPLC -- is type safe and satisfies two of the so-called refined criteria for gradual languages, namely, that it is a conservative extension of a fully static variant and that it satisfies the gradual guarantee, behaving smoothly with respect to type precision.
Submitted 6 April, 2026;
originally announced April 2026.
-
Cognibit: From Digital Exhaustion to Real-World Connection Through Gamified Territory Control and LLM-Powered Twin Networking
Authors:
Wanghao Ye,
Sihan Chen,
Yiting Wang,
Shwai He,
Bowei Tian,
Guoheng Sun,
Ziyi Wang,
Ziyao Wang,
Yexiao He,
Zheyu Shen,
Meng Liu,
Yuning Zhang,
Meng Feng,
Yifei Dong,
Yanhong Qian,
Yang Wang,
Siyuan Peng,
Yilong Dai,
Zhenle Duan,
Joshua Liu,
Lang Xiong,
Hanzhang Qin,
Ang Li
Abstract:
We present an LLM-powered social discovery platform that uses digital twins to autonomously evaluate interpersonal compatibility through behavioral simulation. The platform unifies three key pillars: (1) digital twins that engage in autonomous multi-turn conversations on behalf of users to estimate compatibility, (2) gamified territory conquest mechanics that incentivize real-world exploration and create organic settings for in-person encounters, and (3) AI companions that preserve persistent shared memory across devices. Built upon CogniPair's cognitive architecture (Ye et al., 2026), validated on the Columbia Speed Dating dataset (551 participants), our system extends prior simulation-only matching into a fully deployed social discovery environment. Through deployment, we derive empirical cost-quality baselines and identify fundamental scaling bottlenecks that remain hidden in component-level testing alone.
Submitted 5 April, 2026;
originally announced April 2026.
-
Learning Dexterous Grasping from Sparse Taxonomy Guidance
Authors:
Juhan Park,
Taerim Yoon,
Seungmin Kim,
Joonggil Kim,
Wontae Ye,
Jeongeun Park,
Yoonbyung Chai,
Geonwoo Cho,
Geunwoo Cho,
Dohyeong Kim,
Kyungjae Lee,
Yongjae Kim,
Sungjoon Choi
Abstract:
Dexterous manipulation requires planning a grasp configuration suited to the object and task, which is then executed through coordinated multi-finger control. However, specifying grasp plans with dense pose or contact targets for every object and task is impractical. Meanwhile, end-to-end reinforcement learning from task rewards alone lacks controllability, making it difficult for users to intervene when failures occur. To this end, we present GRIT, a two-stage framework that learns dexterous control from sparse taxonomy guidance. GRIT first predicts a taxonomy-based grasp specification from the scene and task context. Conditioned on this sparse command, a policy generates continuous finger motions that accomplish the task while preserving the intended grasp structure. Our results show that certain grasp taxonomies are more effective for specific object geometries. By leveraging this relationship, GRIT improves generalization to novel objects over baselines and achieves an overall success rate of 87.9%. Moreover, real-world experiments demonstrate controllability, enabling grasp strategies to be adjusted through high-level taxonomy selection based on object geometry and task intent.
Submitted 5 April, 2026;
originally announced April 2026.
-
Recruiting Heterogeneous Crowdsource Vehicles for Updating a High-definition Map
Authors:
Wentao Ye,
Yuan Luo,
Bo Liu,
Jianwei Huang
Abstract:
The high-definition (HD) map is a cornerstone of autonomous driving. Unlike constructing a costly fleet of mapping vehicles, the crowdsourcing paradigm is a cost-effective way to keep an HD map up to date. Achieving practical success for crowdsourcing-based HD maps is contingent on addressing two critical issues: freshness and recruitment costs. Given that crowdsource vehicles are often heterogeneous in terms of operational costs and sensing capabilities, it is practical to recruit heterogeneous crowdsource vehicles to achieve the tradeoff between freshness and recruitment costs. However, existing works neglect this aspect. To address it, we formulate this problem as a Markov decision process. We demonstrate that the optimal policy is threshold-type and age-dependent. Additionally, our findings reveal some counter-intuitive insights: in some cases, the company should initiate vehicle recruitment earlier when vehicles arrive more frequently, or have higher operational costs or sensing capabilities. Besides, we propose an efficient algorithm, called the bound-based relative value iteration (BRVI) algorithm, to overcome the technical challenge that finding an optimal policy is time-consuming. Numerical simulations show that (i) the optimal policy reduces the average cost by $19.04\%$ compared to the state-of-the-art mechanism, and (ii) the proposed algorithm can reduce the convergence time by $13.66\%$ on average compared to the existing algorithm.
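To see why a threshold-type age-dependent policy arises, consider a toy recruitment MDP in which the state is the map's age. For simplicity the sketch uses plain discounted value iteration rather than the paper's average-cost BRVI, and the costs and deterministic dynamics are invented:

```python
def value_iteration(max_age=10, recruit_cost=5.0, gamma=0.95, iters=500):
    """Toy recruitment MDP: state = map age. 'wait' pays a staleness
    cost equal to the current age and ages the map by one step (capped
    at max_age); 'recruit' pays recruit_cost and resets the age to 0.
    Returns the greedy policy after value iteration."""
    V = [0.0] * (max_age + 1)
    for _ in range(iters):
        V = [min(age + gamma * V[min(age + 1, max_age)],
                 recruit_cost + gamma * V[0])
             for age in range(max_age + 1)]
    return ["recruit"
            if recruit_cost + gamma * V[0] < age + gamma * V[min(age + 1, max_age)]
            else "wait"
            for age in range(max_age + 1)]
```

Because the waiting cost grows with age while the recruitment cost is constant, the greedy policy flips from "wait" to "recruit" at a single age threshold, mirroring the threshold structure the abstract proves in its richer setting.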
Submitted 27 March, 2026;
originally announced March 2026.
-
Efficient and Cost-effective Vehicle Recruitment for HD Map Crowdsourcing
Authors:
Wentao Ye,
Yuan Luo,
Bo Liu,
Jianwei Huang
Abstract:
The high-definition (HD) map is a cornerstone of autonomous driving. The crowdsourcing paradigm is a cost-effective way to keep an HD map up-to-date. Current HD map crowdsourcing mechanisms aim to enhance HD map freshness within recruitment budgets. However, many overlook unique and critical traits of crowdsourcing vehicles, such as random arrival and heterogeneity, leading to either compromised map freshness or excessive recruitment costs. Furthermore, these characteristics complicate the characterization of the feasible space of the optimal recruitment policy, necessitating a method to compute it efficiently in dynamic transportation scenarios. To overcome these challenges, we propose an efficient and cost-effective vehicle recruitment (ENTER) mechanism. Specifically, the ENTER mechanism has a threshold structure and balances freshness with recruitment costs while accounting for the vehicles' random arrival and heterogeneity. It also integrates the bound-based relative value iteration (RVI) algorithm, which utilizes the threshold-type structure and upper bounds of thresholds to reduce the feasible space and expedite convergence. Numerical results show that the proposed ENTER mechanism increases the HD map company's payoff by 23.40% and 43.91% compared to state-of-the-art mechanisms that do not account for vehicle heterogeneity and random arrivals, respectively. Furthermore, the bound-based RVI algorithm in the ENTER mechanism reduces computation time by an average of 18.91% compared to the leading RVI-based algorithm.
Submitted 27 March, 2026;
originally announced March 2026.
-
SteerRM: Debiasing Reward Models via Sparse Autoencoders
Authors:
Mengyuan Sun,
Zhuohao Yu,
Weizheng Gu,
Shikun Zhang,
Wei Ye
Abstract:
Reward models (RMs) are critical components of alignment pipelines, yet they exhibit biases toward superficial stylistic cues, preferring better-presented responses over semantically superior ones. Existing debiasing methods typically require retraining or architectural modifications, while direct activation suppression degrades performance due to representation entanglement. We propose SteerRM, the first training-free method for debiasing reward models using Sparse Autoencoder (SAE)-based interventions. SteerRM isolates stylistic effects using contrastive paired responses, identifies bias-related SAE features with a strength-stability criterion, and suppresses them at inference time. Across six reward models on RM-Bench, SteerRM improves Hard-split accuracy by 7.3 points on average while preserving overall performance. Results on a Gemma-based reward model and a controlled non-format bias further suggest generalization across RM architectures and bias types. We further find that format-related features are concentrated in shallow layers and transfer across models, revealing shared architecture-level bias encoding patterns. These results show that SAE-based interventions can mitigate reward-model biases without retraining, providing a practical and interpretable solution for alignment pipelines.
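The inference-time intervention reduces to encode, zero, decode. A sketch with a linear toy "SAE", where the encoder/decoder matrices and the feature indices are placeholders for a trained sparse autoencoder and the strength-stability feature selection described above:

```python
def suppress_features(activation, enc, dec, bias_idx):
    """Encode an activation into SAE feature space (enc @ activation),
    zero the bias-related features, and reconstruct the activation
    with the decoder. Lists of lists stand in for weight matrices."""
    feats = [sum(w * a for w, a in zip(row, activation)) for row in enc]
    for i in bias_idx:
        feats[i] = 0.0  # suppress the style-tracking features
    return [sum(dec[j][i] * feats[j] for j in range(len(feats)))
            for i in range(len(activation))]
```

With an identity encoder/decoder the effect is easy to see: the suppressed feature's direction is removed from the activation while the rest passes through unchanged, which is why entangled components survive better than under direct activation suppression.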
Submitted 13 March, 2026;
originally announced March 2026.
-
Revealing Combinatorial Reasoning of GNNs via Graph Concept Bottleneck Layer
Authors:
Yue Niu,
Zhaokai Sun,
Jiayi Yang,
Xiaofeng Cao,
Rui Fan,
Xin Sun,
Hanli Wang,
Wei Ye
Abstract:
Despite their success in various domains, the growing dependence on GNNs raises a critical concern about the nature of the combinatorial reasoning underlying their predictions, which is often hidden within their black-box architectures. Addressing this challenge requires understanding how GNNs translate topological patterns into logical rules. However, current works only uncover hard logical rules over graph concepts, which cannot quantify the contribution of each concept to a prediction. Moreover, they are post-hoc interpretable methods that generate explanations after model training and may not accurately reflect the true combinatorial reasoning of GNNs, since they approximate it with a surrogate. In this work, we develop a graph concept bottleneck layer that can be integrated into any GNN architecture to guide it to predict selected discriminative global graph concepts. The predicted concept scores are further projected to class labels by a sparse linear layer. This enforces the combinatorial reasoning behind GNN predictions to fit a soft logical rule over graph concepts and thus makes the contribution of each concept quantifiable. To further improve the quality of the concept bottleneck, we treat concepts as "graph words" and graphs as "graph sentences", and leverage language models to learn graph concept embeddings. Extensive experiments on multiple datasets show that our method, GCBM, achieves state-of-the-art performance in both classification and interpretability.
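The interpretability claim rests on the sparse linear head: each concept's contribution to a class logit is simply its weight times its score. A toy version with invented dimensions and weights:

```python
def class_logits(concepts, weights, bias):
    """Sparse linear head mapping predicted concept scores to class
    logits, as in the bottleneck described above (toy dimensions)."""
    return [b + sum(w * c for w, c in zip(ws, concepts))
            for ws, b in zip(weights, bias)]

def concept_contributions(concepts, weights, cls):
    """Each concept's additive contribution to one class logit:
    weight * concept score, which is what makes the soft logical
    rule directly readable from the model."""
    return [w * c for w, c in zip(weights[cls], concepts)]
```

The contributions sum exactly to the logit (up to the bias), so no surrogate explainer is needed after training.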
Submitted 2 March, 2026;
originally announced March 2026.
-
Scaling Tasks, Not Samples: Mastering Humanoid Control through Multi-Task Model-Based Reinforcement Learning
Authors:
Shaohuai Liu,
Weirui Ye,
Yilun Du,
Le Xie
Abstract:
Developing generalist robots capable of mastering diverse skills remains a central challenge in embodied AI. While recent progress emphasizes scaling model parameters and offline datasets, such approaches are limited in robotics, where learning requires active interaction. We argue that effective online learning should scale the \emph{number of tasks}, rather than the number of samples per task. This regime reveals a structural advantage of model-based reinforcement learning (MBRL). Because physical dynamics are invariant across tasks, a shared world model can aggregate multi-task experience to learn robust, task-agnostic representations. In contrast, model-free methods suffer from gradient interference when tasks demand conflicting actions in similar states. Task diversity therefore acts as a regularizer for MBRL, improving dynamics learning and sample efficiency. We instantiate this idea with \textbf{EfficientZero-Multitask (EZ-M)}, a sample-efficient multi-task MBRL algorithm for online learning. Evaluated on \textbf{HumanoidBench}, a challenging whole-body control benchmark, EZ-M achieves state-of-the-art performance with significantly higher sample efficiency than strong baselines, without extreme parameter scaling. These results establish task scaling as a critical axis for scalable robotic learning. The project website is available \href{https://yewr.github.io/ez_m/}{here}.
△ Less
Submitted 2 March, 2026;
originally announced March 2026.
-
From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models
Authors:
Hongrui Jia,
Chaoya Jiang,
Shikun Zhang,
Wei Ye
Abstract:
As Large Multimodal Models (LMMs) scale up and reinforcement learning (RL) methods mature, LMMs have made notable progress in complex reasoning and decision making. Yet training still relies on static data and fixed recipes, making it difficult to diagnose capability blind spots or provide dynamic, targeted reinforcement. Motivated by findings that test driven error exposure and feedback based cor…
▽ More
As Large Multimodal Models (LMMs) scale up and reinforcement learning (RL) methods mature, LMMs have made notable progress in complex reasoning and decision making. Yet training still relies on static data and fixed recipes, making it difficult to diagnose capability blind spots or provide dynamic, targeted reinforcement. Motivated by findings that test-driven error exposure and feedback-based correction outperform repetitive practice, we propose Diagnostic-driven Progressive Evolution (DPE), a spiral loop where diagnosis steers data generation and reinforcement, and each iteration re-diagnoses the updated model to drive the next round of targeted improvement. DPE has two key components. First, multiple agents annotate and quality-control massive unlabeled multimodal data, using tools such as web search and image editing to produce diverse, realistic samples. Second, DPE attributes failures to specific weaknesses, dynamically adjusts the data mixture, and guides agents to generate weakness-focused data for targeted reinforcement. Experiments on Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct show stable, continual gains across eleven benchmarks, indicating DPE as a scalable paradigm for continual LMM training under open task distributions. Our code, models, and data are publicly available at https://github.com/hongruijia/DPE.
△ Less
Submitted 26 February, 2026;
originally announced February 2026.
-
Generative Recommendation for Large-Scale Advertising
Authors:
Ben Xue,
Dan Liu,
Lixiang Wang,
Mingjie Sun,
Peng Wang,
Pengfei Zhang,
Shaoyun Shi,
Tianyu Xu,
Yunhao Sha,
Zhiqiang Liu,
Bo Kong,
Bo Wang,
Hang Yang,
Jieting Xue,
Junhao Wang,
Shengyu Wang,
Shuping Hui,
Wencai Ye,
Xiao Lin,
Yongzhi Li,
Yuhang Chen,
Zhihui Yin,
Quan Chen,
Shiyang Wen,
Wenjin Wu
, et al. (5 additional authors not shown)
Abstract:
Generative recommendation has recently attracted widespread attention in industry due to its potential for scaling and stronger model capacity. However, deploying real-time generative recommendation in large-scale advertising requires designs beyond large-language-model (LLM)-style training and serving recipes. We present a production-oriented generative recommender co-designed across architecture…
▽ More
Generative recommendation has recently attracted widespread attention in industry due to its potential for scaling and stronger model capacity. However, deploying real-time generative recommendation in large-scale advertising requires designs beyond large-language-model (LLM)-style training and serving recipes. We present a production-oriented generative recommender co-designed across architecture, learning, and serving, named GR4AD (Generative Recommendation for ADvertising). As for tokenization, GR4AD proposes UA-SID (Unified Advertisement Semantic ID) to capture complicated business information. Furthermore, GR4AD introduces LazyAR, a lazy autoregressive decoder that relaxes layer-wise dependencies for short, multi-candidate generation, preserving effectiveness while reducing inference cost, which facilitates scaling under fixed serving budgets. To align optimization with business value, GR4AD employs VSL (Value-Aware Supervised Learning) and proposes RSPO (Ranking-Guided Softmax Preference Optimization), a ranking-aware, list-wise reinforcement learning algorithm that optimizes value-based rewards under list-level metrics for continual online updates. For online inference, we further propose dynamic beam serving, which adapts beam width across generation levels and online load to control compute. Large-scale online A/B tests show up to 4.2% ad revenue improvement over an existing DLRM-based stack, with consistent gains from both model scaling and inference-time scaling. GR4AD has been fully deployed in the Kuaishou advertising system with over 400 million users and achieves high-throughput real-time serving.
△ Less
Submitted 1 April, 2026; v1 submitted 26 February, 2026;
originally announced February 2026.
-
XStreamVGGT: Extremely Memory-Efficient Streaming Vision Geometry Grounded Transformer with KV Cache Compression
Authors:
Zunhai Su,
Weihao Ye,
Hansen Feng,
Keyu Fan,
Jing Zhang,
Dahai Yu,
Zhengwu Liu,
Ngai Wong
Abstract:
Learning-based 3D visual geometry models have significantly advanced with the advent of large-scale transformers. Among these, StreamVGGT leverages frame-wise causal attention to deliver robust and efficient streaming 3D reconstruction. However, it suffers from unbounded growth in the Key-Value (KV) cache due to the massive influx of vision tokens from multi-image and long-video inputs, leading to…
▽ More
Learning-based 3D visual geometry models have significantly advanced with the advent of large-scale transformers. Among these, StreamVGGT leverages frame-wise causal attention to deliver robust and efficient streaming 3D reconstruction. However, it suffers from unbounded growth in the Key-Value (KV) cache due to the massive influx of vision tokens from multi-image and long-video inputs, leading to increased memory consumption and inference latency as input frames accumulate. This ultimately limits its scalability for long-horizon applications. To address this gap, we propose XStreamVGGT, a tuning-free approach that seamlessly integrates pruning and quantization to systematically compress the KV cache, enabling extremely memory-efficient streaming inference. Specifically, redundant KVs generated from multi-frame inputs are initially pruned to conform to a fixed KV memory budget using an efficient token-importance identification mechanism that maintains full compatibility with high-performance attention kernels (e.g., FlashAttention). Additionally, leveraging the inherent distribution patterns of KV tensors, we apply dimension-adaptive KV quantization within the pruning pipeline to further minimize memory overhead while preserving numerical accuracy. Extensive evaluations show that XStreamVGGT achieves mostly negligible performance degradation while substantially reducing memory usage by 4.42$\times$ and accelerating inference by 5.48$\times$, enabling practical and scalable streaming 3D applications. The code is available at https://github.com/ywh187/XStreamVGGT/.
△ Less
Submitted 25 February, 2026;
originally announced February 2026.
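The fixed-budget KV-cache pruning idea in the XStreamVGGT abstract above can be illustrated with a toy sketch. This is not the paper's implementation: the function name, the NumPy arrays standing in for a real KV cache, and the use of a generic per-token importance score are all invented for illustration.

```python
import numpy as np

def prune_kv_cache(keys, values, importance, budget):
    """Keep only the `budget` most important KV pairs (toy sketch).

    keys, values: (T, d) arrays standing in for cached keys/values
    importance:   (T,) per-token importance scores (e.g., cumulative
                  attention received); higher = more important
    budget:       fixed KV memory budget, in tokens
    """
    if keys.shape[0] <= budget:
        return keys, values
    # Indices of the top-`budget` tokens, restored to temporal order so a
    # standard attention kernel can consume the compacted cache directly.
    keep = np.sort(np.argpartition(importance, -budget)[-budget:])
    return keys[keep], values[keep]

# Grow a cache past the budget, then prune it back down.
rng = np.random.default_rng(0)
K = rng.normal(size=(10, 4))
V = rng.normal(size=(10, 4))
score = rng.random(10)
Kp, Vp = prune_kv_cache(K, V, score, budget=6)
print(Kp.shape)  # (6, 4)
```

A real system would additionally quantize the surviving KV tensors (the paper's dimension-adaptive quantization step), which this sketch omits.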
-
Validated Code Translation for Projects with External Libraries
Authors:
Hanliang Zhang,
Arindam Sharma,
Cristina David,
Meng Wang,
Brandon Paulsen,
Daniel Kroening,
Wenjia Ye,
Taro Sekiyama
Abstract:
Large Language Models (LLMs) have shown promise for program translation, particularly for migrating systems code to memory-safe languages such as Rust. However, existing approaches struggle when source programs depend on external libraries: LLMs frequently hallucinate non-existent target APIs and fail to generate call-enabling imports; moreover, validating semantic equivalence is challenging when…
▽ More
Large Language Models (LLMs) have shown promise for program translation, particularly for migrating systems code to memory-safe languages such as Rust. However, existing approaches struggle when source programs depend on external libraries: LLMs frequently hallucinate non-existent target APIs and fail to generate call-enabling imports; moreover, validating semantic equivalence is challenging when the code manipulates opaque, library-defined types. We present a translation and validation framework for translating Go projects with external dependencies to Rust. Our approach combines (i) a retrieval mechanism that maps Go library APIs to Rust APIs, and (ii) a cross-language validation pipeline that establishes language interoperability in the presence of opaque library types by synthesising adapters exclusively from public library APIs, prior to validating I/O equivalence. We evaluate our system on six real-world Go repositories with non-trivial external dependencies. Our approach significantly increases both the compilation and equivalence success rates (up to 100% in the most dependency-heavy case; approx. 2x on average) by enabling validated translations that manipulate opaque, library-defined types.
△ Less
Submitted 20 February, 2026;
originally announced February 2026.
-
Two-Stream Interactive Joint Learning of Scene Parsing and Geometric Vision Tasks
Authors:
Guanfeng Tang,
Hongbo Zhao,
Ziwei Long,
Jiayao Li,
Bohong Xiao,
Wei Ye,
Hanli Wang,
Rui Fan
Abstract:
Inspired by the human visual system, which operates on two parallel yet interactive streams for contextual and spatial understanding, this article presents Two Interactive Streams (TwInS), a novel bio-inspired joint learning framework capable of simultaneously performing scene parsing and geometric vision tasks. TwInS adopts a unified, general-purpose architecture in which multi-level contextual f…
▽ More
Inspired by the human visual system, which operates on two parallel yet interactive streams for contextual and spatial understanding, this article presents Two Interactive Streams (TwInS), a novel bio-inspired joint learning framework capable of simultaneously performing scene parsing and geometric vision tasks. TwInS adopts a unified, general-purpose architecture in which multi-level contextual features from the scene parsing stream are infused into the geometric vision stream to guide its iterative refinement. In the reverse direction, decoded geometric features are projected into the contextual feature space for selective heterogeneous feature fusion via a novel cross-task adapter, which leverages rich cross-view geometric cues to enhance scene parsing. To eliminate the dependence on costly human-annotated correspondence ground truth, TwInS is further equipped with a tailored semi-supervised training strategy, which unleashes the potential of large-scale multi-view data and enables continuous self-evolution without requiring ground-truth correspondences. Extensive experiments conducted on three public datasets validate the effectiveness of TwInS's core components and demonstrate its superior performance over existing state-of-the-art approaches. The source code will be made publicly available upon publication.
△ Less
Submitted 13 February, 2026;
originally announced February 2026.
-
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
Authors:
Xiangyi Li,
Wenbo Chen,
Yimin Liu,
Shenghan Zheng,
Xiaokun Chen,
Yifeng He,
Yubo Li,
Bingran You,
Haotian Shen,
Jiankai Sun,
Shuyi Wang,
Binxu Li,
Qunhong Zeng,
Di Wang,
Xuandong Zhao,
Yuanli Wang,
Roey Ben Chaim,
Zonglin Di,
Yipeng Gao,
Junwei He,
Yizhuo He,
Liqiang Jing,
Luyang Kong,
Xin Lan,
Jiachen Li
, et al. (16 additional authors not shown)
Abstract:
Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-gen…
▽ More
Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. We test 7 agent-model configurations over 7,308 trajectories. Curated Skills raise average pass rate by 16.2 percentage points (pp), but effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare) and 16 of 84 tasks show negative deltas. Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills with 2--3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.
△ Less
Submitted 13 March, 2026; v1 submitted 13 February, 2026;
originally announced February 2026.
-
Advancing Block Diffusion Language Models for Test-Time Scaling
Authors:
Yi Lu,
Deyang Kong,
Jianing Wang,
Linsen Guo,
Xue Wang,
Qi Guo,
Tao Gui,
Xuanjing Huang,
Wei Ye,
Shikun Zhang,
Wei Wang
Abstract:
Recent advances in block diffusion language models have demonstrated competitive performance and strong scalability on reasoning tasks. However, existing BDLMs have limited exploration under the test-time scaling setting and face more severe decoding challenges in long Chain-of-Thought reasoning, particularly in balancing the decoding speed and effectiveness. In this work, we propose a unified fra…
▽ More
Recent advances in block diffusion language models (BDLMs) have demonstrated competitive performance and strong scalability on reasoning tasks. However, existing BDLMs remain underexplored under the test-time scaling setting and face severe decoding challenges in long Chain-of-Thought reasoning, particularly in balancing decoding speed and effectiveness. In this work, we propose a unified framework for test-time scaling in BDLMs that introduces adaptivity in both decoding and block-wise generation. At the decoding level, we propose Bounded Adaptive Confidence Decoding (BACD), a difficulty-aware sampling strategy that dynamically adjusts denoising based on model confidence, accelerating inference while controlling error accumulation. Beyond step-wise adaptivity, we introduce Think Coarse, Critic Fine (TCCF), a test-time scaling paradigm that allocates large block sizes to exploratory reasoning and smaller block sizes to refinement, achieving an effective efficiency-effectiveness balance. To enable efficient and effective decoding with a large block size, we adopt Progressive Block Size Extension, which mitigates performance degradation when scaling block sizes. Extensive experiments show that applying BACD and TCCF to TDAR-8B yields significant improvements over strong baselines such as TraDo-8B (2.26x speedup, +11.2 points on AIME24). These results mark an important step toward unlocking the potential of BDLMs for test-time scaling in complex reasoning tasks.
△ Less
Submitted 10 February, 2026; v1 submitted 10 February, 2026;
originally announced February 2026.
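The confidence-gated denoising that the BACD abstract above describes (commit tokens whose model confidence clears a threshold, with a bound that controls error accumulation) can be sketched in miniature. This is a toy illustration of the general idea, not the authors' algorithm: the function name, `MASK` convention, threshold, and per-step cap are all invented here.

```python
import numpy as np

MASK = -1  # marker for undecided positions (illustrative convention)

def adaptive_unmask_step(token_probs, tokens, thresh=0.9, max_unmask=4):
    """One confidence-gated parallel denoising step (toy sketch).

    token_probs: (T, V) per-position predictive distribution
    tokens:      (T,) current sequence; MASK marks undecided positions
    thresh:      confidence required to commit a token
    max_unmask:  cap on tokens committed per step, limiting the
                 error accumulation that aggressive parallelism causes
    """
    tokens = tokens.copy()
    conf = token_probs.max(axis=1)     # confidence = top-1 probability
    pred = token_probs.argmax(axis=1)  # greedy prediction per position
    masked = np.where(tokens == MASK)[0]
    # Candidates above threshold, most confident first, capped per step.
    cands = masked[conf[masked] >= thresh]
    cands = cands[np.argsort(-conf[cands])][:max_unmask]
    if cands.size == 0 and masked.size > 0:
        # Always make progress: commit the single most confident token.
        cands = masked[[np.argmax(conf[masked])]]
    tokens[cands] = pred[cands]
    return tokens

probs = np.array([[0.95, 0.05], [0.60, 0.40], [0.99, 0.01]])
seq = np.array([MASK, MASK, MASK])
seq = adaptive_unmask_step(probs, seq, thresh=0.9, max_unmask=2)
# positions 0 and 2 are committed; position 1 stays masked
```

The "adaptive" aspect is that easy (high-confidence) regions resolve in few steps while hard regions keep iterating, which is the speed/quality trade the abstract targets.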
-
LLaDA2.1: Speeding Up Text Diffusion via Token Editing
Authors:
Tiwei Bie,
Maosong Cao,
Xiang Cao,
Bingsen Chen,
Fuyuan Chen,
Kun Chen,
Lun Du,
Daozhuo Feng,
Haibo Feng,
Mingliang Gong,
Zhuocheng Gong,
Yanmei Gu,
Jian Guan,
Kaiyuan Guan,
Hongliang He,
Zenan Huang,
Juyong Jiang,
Zhonghui Jiang,
Zhenzhong Lan,
Chengxi Li,
Jianguo Li,
Zehuan Li,
Huabin Liu,
Lin Liu,
Guoshan Lu
, et al. (25 additional authors not shown)
Abstract:
While LLaDA2.0 showcased the scaling potential of 100B-level block-diffusion models and their inherent parallelization, the delicate equilibrium between decoding speed and generation quality has remained an elusive frontier. Today, we unveil LLaDA2.1, a paradigm shift designed to transcend this trade-off. By seamlessly weaving Token-to-Token (T2T) editing into the conventional Mask-to-Token (M2T)…
▽ More
While LLaDA2.0 showcased the scaling potential of 100B-level block-diffusion models and their inherent parallelization, the delicate equilibrium between decoding speed and generation quality has remained an elusive frontier. Today, we unveil LLaDA2.1, a paradigm shift designed to transcend this trade-off. By seamlessly weaving Token-to-Token (T2T) editing into the conventional Mask-to-Token (M2T) scheme, we introduce a joint, configurable threshold-decoding scheme. This structural innovation gives rise to two distinct personas: the Speedy Mode (S Mode), which audaciously lowers the M2T threshold to bypass traditional constraints while relying on T2T to refine the output; and the Quality Mode (Q Mode), which leans into conservative thresholds to secure superior benchmark performances with a manageable efficiency degradation. Furthering this evolution, underpinned by an expansive context window, we implement the first large-scale Reinforcement Learning (RL) framework specifically tailored for dLLMs, anchored by specialized techniques for stable gradient estimation. This alignment not only sharpens reasoning precision but also elevates instruction-following fidelity, bridging the chasm between diffusion dynamics and complex human intent. We culminate this work by releasing LLaDA2.1-Mini (16B) and LLaDA2.1-Flash (100B). Across 33 rigorous benchmarks, LLaDA2.1 delivers strong task performance and lightning-fast decoding speed. Despite its 100B volume, on coding tasks it attains an astounding 892 TPS on HumanEval+, 801 TPS on BigCodeBench, and 663 TPS on LiveCodeBench.
△ Less
Submitted 13 February, 2026; v1 submitted 9 February, 2026;
originally announced February 2026.
-
OPE: Overcoming Information Saturation in Parallel Thinking via Outline-Guided Path Exploration
Authors:
Qi Guo,
Jianing Wang,
Deyang Kong,
Xiangyu Xi,
Jianfei Zhang,
Yi Lu,
Jingang Wang,
Wei Wang,
Shikun Zhang,
Wei Ye
Abstract:
Parallel thinking has emerged as a new paradigm for large reasoning models (LRMs) in tackling complex problems. Recent methods leverage Reinforcement Learning (RL) to enhance parallel thinking, aiming to address the limitations in computational resources and effectiveness encountered with supervised fine-tuning. However, most existing studies primarily focus on optimizing the aggregation phase, wi…
▽ More
Parallel thinking has emerged as a new paradigm for large reasoning models (LRMs) in tackling complex problems. Recent methods leverage Reinforcement Learning (RL) to enhance parallel thinking, aiming to address the limitations in computational resources and effectiveness encountered with supervised fine-tuning. However, most existing studies primarily focus on optimizing the aggregation phase, with limited attention to the path exploration stage. In this paper, we theoretically analyze the optimization of parallel thinking under the Reinforcement Learning with Verifiable Rewards (RLVR) setting, and identify that the mutual information bottleneck among exploration paths fundamentally restricts overall performance. To address this, we propose Outline-Guided Path Exploration (OPE), which explicitly partitions the solution space by generating diverse reasoning outlines prior to parallel path reasoning, thereby reducing information redundancy and improving the diversity of information captured across exploration paths. We implement OPE with an iterative RL strategy that optimizes outline planning and outline-guided reasoning independently. Extensive experiments across multiple challenging mathematical benchmarks demonstrate that OPE effectively improves reasoning performance under different aggregation strategies, enabling LRMs to more reliably discover correct solutions.
△ Less
Submitted 9 February, 2026;
originally announced February 2026.
-
TraceCoder: A Trace-Driven Multi-Agent Framework for Automated Debugging of LLM-Generated Code
Authors:
Jiangping Huang,
Wenguang Ye,
Weisong Sun,
Jian Zhang,
Mingyue Zhang,
Yang Liu
Abstract:
Large Language Models (LLMs) often generate code with subtle but critical bugs, especially for complex tasks. Existing automated repair methods typically rely on superficial pass/fail signals, offering limited visibility into program behavior and hindering precise error localization. In addition, without a way to learn from prior failures, repair processes often fall into repetitive and inefficien…
▽ More
Large Language Models (LLMs) often generate code with subtle but critical bugs, especially for complex tasks. Existing automated repair methods typically rely on superficial pass/fail signals, offering limited visibility into program behavior and hindering precise error localization. In addition, without a way to learn from prior failures, repair processes often fall into repetitive and inefficient cycles. To overcome these challenges, we present TraceCoder, a collaborative multi-agent framework that emulates the observe-analyze-repair process of human experts. The framework first instruments the code with diagnostic probes to capture fine-grained runtime traces, enabling deep insight into its internal execution. It then conducts causal analysis on these traces to accurately identify the root cause of the failure. This process is further enhanced by a novel Historical Lesson Learning Mechanism (HLLM), which distills insights from prior failed repair attempts to inform subsequent correction strategies and prevent recurrence of similar mistakes. To ensure stable convergence, a Rollback Mechanism enforces that each repair iteration constitutes a strict improvement toward the correct solution. Comprehensive experiments across multiple benchmarks show that TraceCoder achieves up to a 34.43\% relative improvement in Pass@1 accuracy over existing advanced baselines. Ablation studies verify the significance of each system component, with the iterative repair process alone contributing a 65.61\% relative gain in accuracy. Furthermore, TraceCoder significantly outperforms leading iterative methods in terms of both accuracy and cost-efficiency.
△ Less
Submitted 6 February, 2026;
originally announced February 2026.
-
Rebenchmarking Unsupervised Monocular 3D Occupancy Prediction
Authors:
Zizhan Guo,
Yi Feng,
Mengtan Zhang,
Haoran Zhang,
Wei Ye,
Rui Fan
Abstract:
Inferring the 3D structure from a single image, particularly in occluded regions, remains a fundamental yet unsolved challenge in vision-centric autonomous driving. Existing unsupervised approaches typically train a neural radiance field and treat the network outputs as occupancy probabilities during evaluation, overlooking the inconsistency between training and evaluation protocols. Moreover, the…
▽ More
Inferring the 3D structure from a single image, particularly in occluded regions, remains a fundamental yet unsolved challenge in vision-centric autonomous driving. Existing unsupervised approaches typically train a neural radiance field and treat the network outputs as occupancy probabilities during evaluation, overlooking the inconsistency between training and evaluation protocols. Moreover, the prevalent use of 2D ground truth fails to reveal the inherent ambiguity in occluded regions caused by insufficient geometric constraints. To address these issues, this paper presents a reformulated benchmark for unsupervised monocular 3D occupancy prediction. We first interpret the variables involved in the volume rendering process and identify the most physically consistent representation of the occupancy probability. Building on these analyses, we improve existing evaluation protocols by aligning the newly identified representation with voxel-wise 3D occupancy ground truth, thereby enabling unsupervised methods to be evaluated in a manner consistent with that of supervised approaches. Additionally, to impose explicit constraints in occluded regions, we introduce an occlusion-aware polarization mechanism that incorporates multi-view visual cues to enhance discrimination between occupied and free spaces in these regions. Extensive experiments demonstrate that our approach not only significantly outperforms existing unsupervised approaches but also matches the performance of supervised ones. Our source code and evaluation protocol will be made available upon publication.
△ Less
Submitted 6 February, 2026;
originally announced February 2026.
-
AesRec: A Dataset for Aesthetics-Aligned Clothing Outfit Recommendation
Authors:
Wenxin Ye,
Lin Li,
Ming Li,
Yang Shen,
Kanghong Wang,
Jimmy Xiangji Huang
Abstract:
Clothing recommendation extends beyond merely generating personalized outfits; it serves as a crucial medium for aesthetic guidance. However, existing methods predominantly rely on user-item-outfit interaction behaviors while overlooking explicit representations of clothing aesthetics. To bridge this gap, we present the AesRec benchmark dataset featuring systematic quantitative aesthetic annotatio…
▽ More
Clothing recommendation extends beyond merely generating personalized outfits; it serves as a crucial medium for aesthetic guidance. However, existing methods predominantly rely on user-item-outfit interaction behaviors while overlooking explicit representations of clothing aesthetics. To bridge this gap, we present the AesRec benchmark dataset featuring systematic quantitative aesthetic annotations, thereby enabling the development of aesthetics-aligned recommendation systems. Grounded in professional apparel quality standards and fashion aesthetic principles, we define a multidimensional set of indicators. At the item level, six dimensions are independently assessed: silhouette, chromaticity, materiality, craftsmanship, wearability, and item-level impression. Transitioning to the outfit level, the evaluation retains the first five core attributes while introducing stylistic synergy, visual harmony, and outfit-level impression as distinct metrics to capture the collective aesthetic impact. Given the increasing human-like proficiency of Vision-Language Models in multimodal understanding and interaction, we leverage them for large-scale aesthetic scoring. We conduct rigorous human-machine consistency validation on a fashion dataset, confirming the reliability of the generated ratings. Experimental results based on AesRec further demonstrate that integrating quantified aesthetic information into clothing recommendation models can provide aesthetic guidance for users while fulfilling their personalized requirements.
△ Less
Submitted 3 February, 2026;
originally announced February 2026.
-
Kimi K2.5: Visual Agentic Intelligence
Authors:
Kimi Team,
Tongtong Bai,
Yifan Bai,
Yiping Bao,
S. H. Cai,
Yuan Cao,
Y. Charles,
H. S. Che,
Cheng Chen,
Guanduo Chen,
Huarong Chen,
Jia Chen,
Jiahao Chen,
Jianlong Chen,
Jun Chen,
Kefan Chen,
Liang Chen,
Ruijue Chen,
Xinhao Chen,
Yanru Chen,
Yanxu Chen,
Yicun Chen,
Yimin Chen,
Yingjiang Chen,
Yuankun Chen
, et al. (301 additional authors not shown)
Abstract:
We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that two modalities enhance each other. This includes a series of techniques such as joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Building on this multimodal foundation, K2.5…
▽ More
We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that two modalities enhance each other. This includes a series of techniques such as joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Building on this multimodal foundation, K2.5 introduces Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently. Extensive evaluations show that Kimi K2.5 achieves state-of-the-art results across various domains including coding, vision, reasoning, and agentic tasks. Agent Swarm also reduces latency by up to $4.5\times$ over single-agent baselines. We release the post-trained Kimi K2.5 model checkpoint to facilitate future research and real-world applications of agentic intelligence.
△ Less
Submitted 2 February, 2026;
originally announced February 2026.
-
What Do Agents Learn from Trajectory-SFT: Semantics or Interfaces?
Authors:
Weizheng Gu,
Chengze Li,
Zhuohao Yu,
Mengyuan Sun,
Zhibang Yang,
Wei Wang,
Hongrui Jia,
Shikun Zhang,
Wei Ye
Abstract:
Large language models are increasingly evaluated as interactive agents, yet standard agent benchmarks conflate two qualitatively distinct sources of success: semantic tool-use and interface-specific interaction pattern memorization. Because both mechanisms can yield identical task success on the original interface, benchmark scores alone are not identifiable evidence of environment-invariant capab…
▽ More
Large language models are increasingly evaluated as interactive agents, yet standard agent benchmarks conflate two qualitatively distinct sources of success: semantic tool-use and interface-specific interaction pattern memorization. Because both mechanisms can yield identical task success on the original interface, benchmark scores alone are not identifiable evidence of environment-invariant capability. We propose PIPE, a protocol-level evaluation augmentation for diagnosing interface reliance by minimally rewriting environment interfaces while preserving task semantics and execution behavior. Across 16 environments from AgentBench and AgentGym and a range of open-source and API-based agents, PIPE reveals that trajectory-SFT substantially amplifies interface shortcutting: trained agents degrade sharply under minimal interface rewrites, while non-trajectory-trained models remain largely stable. We further introduce Interface Reliance (IR), a counterbalanced alias-based metric that quantifies preference for training-time interfaces, and show that interface shortcutting exhibits environment-dependent, non-monotonic training dynamics that remain invisible under standard evaluation. Our code is available at https://anonymous.4open.science/r/What-Do-Agents-Learn-from-Trajectory-SFT-Semantics-or-Interfaces--0831/.
△ Less
Submitted 1 February, 2026;
originally announced February 2026.
-
Supervised Fine-Tuning Needs to Unlock the Potential of Token Priority
Authors:
Zhanming Shen,
Zeyu Qin,
Jiaqi Hu,
Wentao Ye,
Hao Chen,
Xiaomeng Hu,
Haokai Xu,
Gang Chen,
Yi R. Fung,
Haobo Wang
Abstract:
The transition from fitting empirical data to achieving true human utility is fundamentally constrained by a granularity mismatch, where fine-grained autoregressive generation is often supervised by coarse or uniform signals. This position paper advocates Token Priority as the essential bridge, formalizing Supervised Fine-Tuning (SFT) not as simple optimization but as a precise distribution reshap…
▽ More
The transition from fitting empirical data to achieving true human utility is fundamentally constrained by a granularity mismatch, where fine-grained autoregressive generation is often supervised by coarse or uniform signals. This position paper advocates Token Priority as the essential bridge, formalizing Supervised Fine-Tuning (SFT) not as simple optimization but as a precise distribution reshaping process that aligns raw data with the ideal alignment manifold. We analyze recent breakthroughs through this unified lens, categorizing them into two distinct regimes: Positive Priority for noise filtration and Signed Priority for unlearning toxic modes. We revisit existing progress and limitations, identify key challenges, and suggest directions for future research.
△ Less
Submitted 9 February, 2026; v1 submitted 1 February, 2026;
originally announced February 2026.
-
Bitcoin Price Prediction using Machine Learning and Combinatorial Fusion Analysis
Authors:
Yuanhong Wu,
Wei Ye,
Jingyan Xu,
D. Frank Hsu
Abstract:
In this work, we propose to apply a new model fusion and learning paradigm, known as Combinatorial Fusion Analysis (CFA), to the field of Bitcoin price prediction. Price prediction of financial products has long been a major topic in finance, as successful prediction can yield significant profit. Every machine learning model has its own strengths and weaknesses, which hinders progress toward robustness. CFA has been used to enhance models by leveraging the rank-score characteristic (RSC) function and cognitive diversity in the combination of a moderate set of diverse and relatively well-performing models. Our method utilizes both score and rank combinations as well as other weighted combination techniques. Key metrics such as RMSE and MAPE are used to evaluate our methodology's performance. Our proposal achieves a notable MAPE of 0.19%. The proposed method greatly improves upon individual model performance and outperforms other Bitcoin price prediction models.
Submitted 8 March, 2026; v1 submitted 18 January, 2026;
originally announced February 2026.
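The score and rank combinations at the heart of CFA can be sketched as follows. This is an illustrative reading of rank-score characteristic (RSC) fusion, not the authors' code; the min-max normalization scheme is an assumption.

```python
import numpy as np

def rank_score_characteristic(scores):
    """RSC function: map each rank r (1 = best) to the normalized score of
    the item holding that rank, giving a monotonically decreasing curve."""
    s = (scores - scores.min()) / (scores.max() - scores.min())
    return np.sort(s)[::-1]

def fuse(score_matrix):
    """Score and rank combination across models.
    score_matrix: rows = models, columns = items being scored."""
    norm = [(m - m.min()) / (m.max() - m.min()) for m in score_matrix]
    score_comb = np.mean(norm, axis=0)          # average normalized scores
    # Rank combination: average each item's rank across models (1 = best).
    ranks = [np.argsort(np.argsort(-m)) + 1 for m in score_matrix]
    rank_comb = np.mean(ranks, axis=0)
    return score_comb, rank_comb
```

In CFA, model pairs whose RSC curves differ most (high cognitive diversity) are the preferred fusion partners.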
-
Omni-fMRI: A Universal Atlas-Free fMRI Foundation Model
Authors:
Mo Wang,
Wenhao Ye,
Junfeng Xia,
Junxiang Zhang,
Xuanye Pan,
Minghao Xu,
Haotian Deng,
Hongkai Wen,
Quanying Liu
Abstract:
Self-supervised fMRI foundation models have shown promising transfer performance, yet most rely on predefined region-level parcellations that discard fine-grained voxel information and introduce atlas-dependent biases. We propose Omni-fMRI, an atlas-free foundation model that operates directly on voxel-level signals. To enable scalable pretraining on 49,497 fMRI sessions across nine datasets, Omni-fMRI introduces a dynamic patching mechanism that substantially reduces computational cost while preserving informative spatial structure. To support reproducibility and fair comparison, we establish a comprehensive benchmark suite spanning 11 datasets and a diverse set of resting-state and task-based fMRI tasks. Experimental results demonstrate that Omni-fMRI consistently outperforms existing foundation models, providing a scalable and reproducible framework for atlas-free brain representation learning. Code and logs are available.
Submitted 30 January, 2026;
originally announced January 2026.
-
ToolWeaver: Weaving Collaborative Semantics for Scalable Tool Use in Large Language Models
Authors:
Bowen Fang,
Wen Ye,
Yunyue Su,
Jinghao Zhang,
Qiang Liu,
Yesheng Liu,
Xin Sun,
Shu Wu,
Jiabing Yang,
Baole Wei,
Liang Wang
Abstract:
Prevalent retrieval-based tool-use pipelines struggle with a dual semantic challenge: their retrievers often employ encoders that fail to capture complex semantics, while the Large Language Model (LLM) itself lacks intrinsic tool knowledge from its natural language pretraining. Generative methods offer a powerful alternative by unifying selection and execution, tasking the LLM to directly learn and generate tool identifiers. However, the common practice of mapping each tool to a unique new token introduces substantial limitations: it creates a scalability and generalization crisis, as the vocabulary size explodes and each tool is assigned a semantically isolated token. This approach also creates a semantic bottleneck that hinders the learning of collaborative tool relationships, as the model must infer them from sparse co-occurrences of monolithic tool IDs within a vast library. To address these limitations, we propose ToolWeaver, a novel generative tool learning framework that encodes tools into hierarchical sequences. This approach makes vocabulary expansion logarithmic to the number of tools. Crucially, it enables the model to learn collaborative patterns from the dense co-occurrence of shared codes, rather than the sparse co-occurrence of monolithic tool IDs. We generate these structured codes through a novel tokenization process designed to weave together a tool's intrinsic semantics with its extrinsic co-usage patterns. These structured codes are then integrated into the LLM through a generative alignment stage, where the model is fine-tuned to produce the hierarchical code sequences. Evaluation results with nearly 47,000 tools show that ToolWeaver significantly outperforms state-of-the-art methods, establishing a more scalable, generalizable, and semantically-aware foundation for advanced tool-augmented agents.
Submitted 29 January, 2026;
originally announced January 2026.
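The scalability argument, hierarchical code sequences instead of one token per tool, can be illustrated with a toy encoding. The codebook size and depth below are hypothetical, and ToolWeaver's real codes are learned from tool semantics and co-usage patterns rather than assigned by arithmetic.

```python
def tool_to_codes(tool_id: int, codebook_size: int = 16, depth: int = 3) -> list:
    """Toy hierarchical tool identifier: a depth-length sequence over a
    shared codebook. The vocabulary grows by codebook_size * depth tokens,
    yet codebook_size ** depth tools are addressable, so vocabulary growth
    is logarithmic in the number of tools."""
    codes = []
    for _ in range(depth):
        codes.append(tool_id % codebook_size)
        tool_id //= codebook_size
    return codes[::-1]  # coarsest code first, as in a hierarchy
```

With codebook_size=16 and depth=3, just 48 new tokens address 4,096 distinct tools, and tools sharing code prefixes co-occur densely during training rather than as sparse monolithic IDs.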
-
MPF-Net: Exposing High-Fidelity AI-Generated Video Forgeries via Hierarchical Manifold Deviation and Micro-Temporal Fluctuations
Authors:
Xinan He,
Kaiqing Lin,
Yue Zhou,
Jiaming Zhong,
Wei Ye,
Wenhui Yi,
Bing Fan,
Feng Ding,
Haodong Li,
Bo Cao,
Bin Li
Abstract:
With the rapid advancement of video generation models such as Veo and Wan, the visual quality of synthetic content has reached a level where macro-level semantic errors and temporal inconsistencies are no longer prominent. However, this does not imply that the distinction between real content and cutting-edge high-fidelity fakes is untraceable. We argue that AI-generated videos are essentially products of a manifold-fitting process rather than a physical recording. Consequently, the pixel composition logic of residuals between consecutive frames in AI videos exhibits a structured and homogeneous characteristic. We term this phenomenon 'Manifold Projection Fluctuations' (MPF). Driven by this insight, we propose a hierarchical dual-path framework that operates as a sequential filtering process. The first, the Static Manifold Deviation Branch, leverages the refined perceptual boundaries of Large-Scale Vision Foundation Models (VFMs) to capture residual spatial anomalies or physical violations that deviate from the natural real-world manifold (off-manifold). For the remaining high-fidelity videos that successfully reside on-manifold and evade spatial detection, we introduce the Micro-Temporal Fluctuation Branch as a secondary, fine-grained filter. By analyzing the structured MPF that persists even in visually perfect sequences, our framework ensures that forgeries are exposed regardless of whether they manifest as global real-world manifold deviations or subtle computational fingerprints.
Submitted 2 February, 2026; v1 submitted 29 January, 2026;
originally announced January 2026.
-
StreetDesignAI: A Multi-Persona Evaluation System for Inclusive Infrastructure Design
Authors:
Ziyi Wang,
Yilong Dai,
Duanya Lyu,
Mateo Nader,
Sihan Chen,
Wanghao Ye,
Zjian Ding,
Xiang Yan
Abstract:
Designing cycling infrastructure requires balancing the competing needs of diverse user groups, yet designers often struggle to anticipate how different cyclists experience the same street environment. We investigate how persona-based evaluation can support cycling infrastructure design by making experiential conflicts explicit during the design process. Informed by a formative study with 12 domain experts and crowdsourced bikeability assessments from 427 cyclists, we present StreetDesignAI, an interactive system that enables designers to (1) ground evaluation in real street context through imagery and map data, (2) receive parallel feedback from simulated cyclist personas spanning confident to cautious users, and (3) iteratively modify designs while the system surfaces conflicts across perspectives. A within-subjects study with 26 transportation professionals comparing StreetDesignAI against a general-purpose AI chatbot demonstrates that structured multi-perspective feedback significantly broadens designers' understanding of various cyclists' perspectives and strengthens their ability to identify diverse persona needs and their confidence in translating those needs into design decisions. Participants also reported significantly higher overall satisfaction and stronger intention to use the system in professional practice. Qualitative findings further illuminate how explicit conflict surfacing transforms design exploration from single-perspective optimization toward deliberate trade-off reasoning. We discuss implications for AI-assisted tools that scaffold persona-aware design through disagreement as an interaction primitive.
Submitted 12 April, 2026; v1 submitted 22 January, 2026;
originally announced January 2026.
-
QKVQA: Question-Focused Filtering for Knowledge-based VQA
Authors:
Wei Ye,
Yixin Su,
Yueguo Chen,
Longxiang Gao,
Jianjun Li,
Ruixuan Li,
Rui Zhang
Abstract:
Visual Question Answering (VQA) is the task of answering questions based on image content. Building upon this, Knowledge-Based VQA (KB-VQA) requires models to answer questions that depend on external knowledge beyond the visual content of an image. In such settings, effective knowledge filtering is essential for achieving high question answering accuracy. Typical filtering methods suffer from two issues: they fail to focus on parts relevant to the question during candidate section encoding, and they use similarity metrics to locate a section from a single article, resulting in limited information. To address these issues, this paper proposes a question-focused, cross-article filtering method. Specifically, we design a trainable Question-Focused Filter (QFF) and a Chunk-based Dynamic Cross-Article Selection module (CDA). This approach maintains inference time comparable to the best existing method while using a shorter context length, efficiently obtaining high-quality filtered knowledge. The accuracy outperforms current state-of-the-art methods by 3.2 and 2.2 percentage points on Encyclopedic-VQA and InfoSeek, respectively. The code is publicly available at: https://github.com/leaffeall/QKVQA.
Submitted 7 April, 2026; v1 submitted 20 January, 2026;
originally announced January 2026.
-
A Model Fusion Approach for Enhancing Credit Approval Decision Making
Authors:
Yuanhong Wu,
Jingyan Xu,
Wei Ye,
Christina Schweikert,
D. Frank Hsu
Abstract:
Credit default poses significant challenges to financial institutions and consumers, resulting in substantial financial losses and diminished trust. As such, credit default risk management has been a critical topic in the financial industry. In this paper, we present Combinatorial Fusion Analysis (CFA), a model fusion framework that combines multiple machine learning algorithms to predict credit card approval decisions with high accuracy. We present the design methodology and implementation using five pre-trained models. The CFA results show an accuracy of 89.13%, which is better than conventional machine learning and ensemble methods.
Submitted 18 January, 2026;
originally announced January 2026.
-
A Unified Masked Jigsaw Puzzle Framework for Vision and Language Models
Authors:
Weixin Ye,
Wei Wang,
Yahui Liu,
Yue Song,
Bin Ren,
Wei Bi,
Rita Cucchiara,
Nicu Sebe
Abstract:
In federated learning, Transformer, as a popular architecture, faces critical challenges in defending against gradient attacks and improving model performance in both Computer Vision (CV) and Natural Language Processing (NLP) tasks. It has been revealed that the gradient of Position Embeddings (PEs) in Transformer contains sufficient information, which can be used to reconstruct the input data. To mitigate this issue, we introduce a Masked Jigsaw Puzzle (MJP) framework. MJP starts with random token shuffling to break the token order, and then a learnable unknown (unk) position embedding is used to mask out the PEs of the shuffled tokens. In this manner, the local spatial information which is encoded in the position embeddings is disrupted, and the models are forced to learn feature representations that are less reliant on the local spatial information. Notably, with the careful use of MJP, we can not only improve models' robustness against gradient attacks, but also boost their performance in both vision and text application scenarios, such as classification for images (e.g., ImageNet-1K) and sentiment analysis for text (e.g., Yelp and Amazon). Experimental results suggest that MJP is a unified framework for different Transformer-based models in both vision and language tasks. Code is publicly available via https://github.com/ywxsuperstar/transformerattack
Submitted 17 January, 2026;
originally announced January 2026.
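The two MJP operations, token shuffling followed by masking the shuffled tokens' position embeddings, can be sketched as below. This is an illustrative NumPy rendering under assumed shapes, not the paper's code; in the real model the unknown position embedding is a learnable parameter.

```python
import numpy as np

def mjp_transform(tokens, pos_embed, unk_pos, mask_ratio=0.5, rng=None):
    """Masked Jigsaw Puzzle sketch: shuffle a random subset of tokens, then
    replace the position embeddings of exactly those tokens with a single
    shared 'unknown' embedding.
    tokens, pos_embed: (N, D); unk_pos: (D,)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = tokens.shape[0]
    idx = rng.choice(n, size=int(n * mask_ratio), replace=False)
    tokens = tokens.copy()
    tokens[idx] = tokens[rng.permutation(idx)]   # break the token order
    pos = pos_embed.copy()
    pos[idx] = unk_pos                           # mask the shuffled PEs
    return tokens + pos
```

Because the masked positions all share one embedding, the shuffled tokens carry no recoverable order information, which is what blunts PE-gradient reconstruction attacks.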
-
Training-Trajectory-Aware Token Selection
Authors:
Zhanming Shen,
Jiaqi Hu,
Zeyu Qin,
Hao Chen,
Wentao Ye,
Zenan Huang,
Yihong Zhuang,
Guoshan Lu,
Junlin Zhou,
Junbo Zhao
Abstract:
Efficient distillation is a key pathway for converting expensive reasoning capability into deployable efficiency, yet in the frontier regime where the student already has strong reasoning ability, naive continual distillation often yields limited gains or even degradation. We observe a characteristic training phenomenon: even as loss decreases monotonically, all performance metrics can drop sharply at almost the same bottleneck, before gradually recovering. We further uncover a token-level mechanism: confidence bifurcates into steadily increasing Imitation-Anchor Tokens that quickly anchor optimization and other yet-to-learn tokens whose confidence is suppressed until after the bottleneck. The inability of these two types of tokens to coexist is the root cause of the failure of continual distillation. To this end, we propose Training-Trajectory-Aware Token Selection (T3S) to reconstruct the training objective at the token level, clearing the optimization path for yet-to-learn tokens. T3S yields consistent gains in both AR and dLLM settings: with only hundreds of examples, Qwen3-8B surpasses DeepSeek-R1 on competitive reasoning benchmarks, Qwen3-32B approaches Qwen3-235B, and T3S-trained LLaDA-2.0-Mini exceeds its AR baseline, achieving state-of-the-art performance among 16B-scale no-think models.
Submitted 15 January, 2026;
originally announced January 2026.
-
ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback
Authors:
Yutao Mou,
Zhangchi Xue,
Lijun Li,
Peiyang Liu,
Shikun Zhang,
Wei Ye,
Jing Shao
Abstract:
While LLM-based agents can interact with environments via invoking external tools, their expanded capabilities also amplify security risks. Monitoring step-level tool invocation behaviors in real time and proactively intervening before unsafe execution is critical for agent deployment, yet remains under-explored. In this work, we first construct TS-Bench, a novel benchmark for step-level tool invocation safety detection in LLM agents. We then develop a guardrail model, TS-Guard, using multi-task reinforcement learning. The model proactively detects unsafe tool invocation actions before execution by reasoning over the interaction history. It assesses request harmfulness and action-attack correlations, producing interpretable and generalizable safety judgments and feedback. Furthermore, we introduce TS-Flow, a guardrail-feedback-driven reasoning framework for LLM agents, which reduces harmful tool invocations of ReAct-style agents by 65 percent on average and improves benign task completion by approximately 10 percent under prompt injection attacks.
Submitted 15 January, 2026;
originally announced January 2026.
-
RISER: Orchestrating Latent Reasoning Skills for Adaptive Activation Steering
Authors:
Wencheng Ye,
Xiaoyang Yuan,
Yi Bin,
Pengpeng Zeng,
Hengyu Jin,
Liang Peng,
Heng Tao Shen
Abstract:
Recent work on domain-specific reasoning with large language models (LLMs) often relies on training-intensive approaches that require parameter updates. While activation steering has emerged as a parameter efficient alternative, existing methods apply static, manual interventions that fail to adapt to the dynamic nature of complex reasoning. To address this limitation, we propose RISER (Router-based Intervention for Steerable Enhancement of Reasoning), a plug-and-play intervention framework that adaptively steers LLM reasoning in activation space. RISER constructs a library of reusable reasoning vectors and employs a lightweight Router to dynamically compose them for each input. The Router is optimized via reinforcement learning under task-level rewards, activating latent cognitive primitives in an emergent and compositional manner. Across seven diverse benchmarks, RISER yields 3.4-6.5% average zero-shot accuracy improvements over the base model while surpassing CoT-style reasoning with 2-3x higher token efficiency and robust accuracy gains. Further analysis shows that RISER autonomously combines multiple vectors into interpretable, precise control strategies, pointing toward more controllable and efficient LLM reasoning.
Submitted 19 January, 2026; v1 submitted 14 January, 2026;
originally announced January 2026.
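The core RISER mechanism, a router producing weights over a library of reusable reasoning vectors whose combination is added to the activations, can be sketched as follows. The shapes and the softmax router are assumptions for illustration; the actual Router is a lightweight network optimized by reinforcement learning.

```python
import numpy as np

def riser_steer(hidden, vector_library, router_logits):
    """Adaptive activation steering sketch: softmax the router's scores
    over the reasoning-vector library, then add the weighted combination
    to the hidden state at the intervention layer.
    hidden: (D,); vector_library: (K, D); router_logits: (K,)."""
    w = np.exp(router_logits - router_logits.max())
    w /= w.sum()                        # router weights over the K vectors
    return hidden + w @ vector_library  # compose and inject steering vector
```

Because only the router is trained, the base model's parameters stay frozen, which is what makes the intervention plug-and-play.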
-
VINO: A Unified Visual Generator with Interleaved OmniModal Context
Authors:
Junyi Chen,
Tong He,
Zhoujie Fu,
Pengfei Wan,
Kun Gai,
Weicai Ye
Abstract:
We present VINO, a unified visual generator that performs image and video generation and editing within a single framework. Instead of relying on task-specific models or independent modules for each modality, VINO uses a shared diffusion backbone that conditions on text, images and videos, enabling a broad range of visual creation and editing tasks under one model. Specifically, VINO couples a vision-language model (VLM) with a Multimodal Diffusion Transformer (MMDiT), where multimodal inputs are encoded as interleaved conditioning tokens, and then used to guide the diffusion process. This design supports multi-reference grounding, long-form instruction following, and coherent identity preservation across static and dynamic content, while avoiding modality-specific architectural components. To train such a unified system, we introduce a multi-stage training pipeline that progressively expands a video generation base model into a unified, multi-task generator capable of both image and video input and output. Across diverse generation and editing benchmarks, VINO demonstrates strong visual quality, faithful instruction following, improved reference and attribute preservation, and more controllable multi-identity edits. Our results highlight a practical path toward scalable unified visual generation, and the promise of interleaved, in-context computation as a foundation for general-purpose visual creation.
Submitted 16 January, 2026; v1 submitted 5 January, 2026;
originally announced January 2026.
-
XStreamVGGT: Extremely Memory-Efficient Streaming Vision Geometry Grounded Transformer with KV Cache Compression
Authors:
Zunhai Su,
Weihao Ye,
Hansen Feng,
Keyu Fan,
Jing Zhang,
Dahai Yu,
Zhengwu Liu,
Ngai Wong
Abstract:
Learning-based 3D visual geometry models have benefited substantially from large-scale transformers. Among these, StreamVGGT leverages frame-wise causal attention for strong streaming reconstruction, but suffers from unbounded KV cache growth, leading to escalating memory consumption and inference latency as input frames accumulate. We propose XStreamVGGT, a tuning-free approach that systematically compresses the KV cache through joint pruning and quantization, enabling extremely memory-efficient streaming inference. Specifically, redundant KVs originating from multi-view inputs are pruned through efficient token importance identification, enabling a fixed memory budget. Leveraging the unique distribution of KV tensors, we incorporate KV quantization to further reduce memory consumption. Extensive evaluations show that XStreamVGGT achieves mostly negligible performance degradation while substantially reducing memory usage by 4.42× and accelerating inference by 5.48×, enabling scalable and practical streaming 3D applications. The code is available at https://github.com/ywh187/XStreamVGGT/.
Submitted 3 January, 2026;
originally announced January 2026.
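The two compression steps, importance-based pruning to a fixed token budget followed by quantization of what remains, can be sketched as below. The attention-sum saliency measure and per-tensor int8 scheme are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def compress_kv(keys, values, attn_weights, budget):
    """Prune the KV cache to a fixed token budget by attention-based
    importance, then quantize the kept tensors to int8.
    keys, values: (N, D); attn_weights: (Q, N) recent attention rows."""
    importance = attn_weights.sum(axis=0)             # per-token saliency
    keep = np.sort(np.argsort(importance)[-budget:])  # top-k, original order

    def quantize(x):                                  # per-tensor int8
        scale = np.abs(x).max() / 127.0 + 1e-12
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, scale

    return quantize(keys[keep]), quantize(values[keep]), keep
```

The fixed budget caps memory regardless of how many frames stream in, while int8 storage with a float scale cuts the per-token footprint roughly 4× versus float32.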
-
SLIM-Brain: A Data- and Training-Efficient Foundation Model for fMRI Data Analysis
Authors:
Mo Wang,
Junfeng Xia,
Wenhao Ye,
Enyu Liu,
Kaining Peng,
Jianfeng Feng,
Quanying Liu,
Hongkai Wen
Abstract:
Foundation models are emerging as a powerful paradigm for fMRI analysis, but current approaches face a dual bottleneck of data- and training-efficiency. Atlas-based methods aggregate voxel signals into fixed regions of interest, reducing data dimensionality but discarding fine-grained spatial details, and requiring extremely large cohorts to train effectively as general-purpose foundation models. Atlas-free methods, on the other hand, operate directly on voxel-level information, preserving spatial fidelity but prohibitively memory- and compute-intensive, making large-scale pre-training infeasible. We introduce SLIM-Brain (Sample-efficient, Low-memory fMRI Foundation Model for Human Brain), a new atlas-free foundation model that simultaneously improves both data- and training-efficiency. SLIM-Brain adopts a two-stage adaptive design: (i) a lightweight temporal extractor captures global context across full sequences and ranks data windows by saliency, and (ii) a 4D hierarchical encoder (Hiera-JEPA) learns fine-grained voxel-level representations only from the top-k selected windows, while discarding about 70% of masked patches. Extensive experiments across seven public benchmarks show that SLIM-Brain establishes new state-of-the-art performance on diverse tasks, while requiring only 4,000 pre-training sessions and approximately 30% of the GPU memory compared to traditional voxel-level methods.
Submitted 30 January, 2026; v1 submitted 26 December, 2025;
originally announced December 2025.
-
TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning
Authors:
Saisai Yang,
Qingyi Huang,
Jing Yuan,
Liangyu Zha,
Kai Tang,
Yuhang Yang,
Ning Wang,
Yucheng Wei,
Liyao Li,
Wentao Ye,
Hao Chen,
Tao Zhang,
Junlin Zhou,
Haobo Wang,
Gang Chen,
Junbo Zhao
Abstract:
Tabular data serves as the backbone of modern data analysis and scientific research. While Large Language Models (LLMs) fine-tuned via Supervised Fine-Tuning (SFT) have significantly improved natural language interaction with such structured data, they often fall short in handling the complex, multi-step reasoning and robust code execution required for real-world table tasks. Reinforcement Learning (RL) offers a promising avenue to enhance these capabilities, yet its application in the tabular domain faces three critical hurdles: the scarcity of high-quality agentic trajectories with closed-loop code execution and environment feedback on diverse table structures, the extreme heterogeneity of feedback signals ranging from rigid SQL execution to open-ended data interpretation, and the risk of catastrophic forgetting of general knowledge during vertical specialization. To overcome these challenges and unlock advanced reasoning on complex tables, we introduce TableGPT-R1, a specialized tabular model built on a systematic RL framework. Our approach integrates a comprehensive data engineering pipeline that synthesizes difficulty-stratified agentic trajectories for both supervised alignment and RL rollouts, a task-adaptive reward system that combines rule-based verification with a criteria-injected reward model and incorporates process-level step reward shaping with behavioral regularization, and a multi-stage training framework that progressively stabilizes reasoning before specializing in table-specific tasks. Extensive evaluations demonstrate that TableGPT-R1 achieves state-of-the-art performance on authoritative benchmarks, significantly outperforming baseline models while retaining robust general capabilities. Our model is available at https://huggingface.co/tablegpt/TableGPT-R1.
Submitted 25 December, 2025; v1 submitted 23 December, 2025;
originally announced December 2025.
-
OmniMER: Auxiliary-Enhanced LLM Adaptation for Indonesian Multimodal Emotion Recognition
Authors:
Xueming Yan,
Boyan Xu,
Yaochu Jin,
Lixian Xiao,
Wenlong Ye,
Runyang Cai,
Zeqi Zheng,
Jingfa Liu,
Aimin Yang,
Yongduan Song
Abstract:
Indonesian, spoken by over 200 million people, remains underserved in multimodal emotion recognition research despite its dominant presence on Southeast Asian social media platforms. We introduce IndoMER, the first multimodal emotion recognition benchmark for Indonesian, comprising 1,944 video segments from 203 speakers with temporally aligned text, audio, and visual annotations across seven emotion categories. The dataset exhibits realistic challenges including cross-modal inconsistency and long-tailed class distributions shaped by Indonesian cultural communication norms. To address these challenges, we propose OmniMER, a multimodal adaptation framework built upon Qwen2.5-Omni that enhances emotion recognition through three auxiliary modality-specific perception tasks: emotion keyword extraction for text, facial expression analysis for video, and prosody analysis for audio. These auxiliary tasks help the model identify emotion-relevant cues in each modality before fusion, reducing reliance on spurious correlations in low-resource settings. Experiments on IndoMER show that OmniMER achieves 0.582 Macro-F1 on sentiment classification and 0.454 on emotion recognition, outperforming the base model by 7.6 and 22.1 absolute points respectively. Cross-lingual evaluation on the Chinese CH-SIMS dataset further demonstrates the generalizability of the proposed framework. The dataset and code are publicly available at https://github.com/yanxm01/INDOMER.
Submitted 10 February, 2026; v1 submitted 22 December, 2025;
originally announced December 2025.
-
In-Context Audio Control of Video Diffusion Transformers
Authors:
Wenze Liu,
Weicai Ye,
Minghong Cai,
Quande Liu,
Xintao Wang,
Xiangyu Yue
Abstract:
Recent advancements in video generation have seen a shift towards unified, transformer-based foundation models that can handle multiple conditional inputs in-context. However, these models have primarily focused on modalities like text, images, and depth maps, while strictly time-synchronous signals like audio have been underexplored. This paper introduces In-Context Audio Control of video diffusion transformers (ICAC), a framework that investigates the integration of audio signals for speech-driven video generation within a unified full-attention architecture, akin to FullDiT. We systematically explore three distinct mechanisms for injecting audio conditions: standard cross-attention, 2D self-attention, and unified 3D self-attention. Our findings reveal that while 3D attention offers the highest potential for capturing spatio-temporal audio-visual correlations, it presents significant training challenges. To overcome this, we propose a Masked 3D Attention mechanism that constrains the attention pattern to enforce temporal alignment, enabling stable training and superior performance. Our experiments demonstrate that this approach achieves strong lip synchronization and video quality, conditioned on an audio stream and reference images.
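The abstract's Masked 3D Attention constrains which audio tokens each video token may attend to, enforcing temporal alignment. As a hedged sketch of this general idea (the mask shape, token layout, and window parameter are illustrative assumptions, not the ICAC implementation), one can build a boolean attention mask over a joint [video tokens | audio tokens] sequence:

```python
import numpy as np

def temporal_audio_mask(n_frames, vid_per_frame, aud_per_frame, window=0):
    """Boolean attention mask for a joint [video | audio] token sequence:
    video-video attention is unrestricted, but video<->audio attention is
    allowed only between tokens whose frame indices differ by at most
    `window`. A sketch of temporally constrained attention, not ICAC code."""
    n_vid = n_frames * vid_per_frame
    n_aud = n_frames * aud_per_frame
    n = n_vid + n_aud
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_vid, :n_vid] = True                 # full video-video attention
    frame_of_vid = np.arange(n_vid) // vid_per_frame
    frame_of_aud = np.arange(n_aud) // aud_per_frame
    aligned = np.abs(frame_of_vid[:, None] - frame_of_aud[None, :]) <= window
    mask[:n_vid, n_vid:] = aligned              # video -> audio, local in time
    mask[n_vid:, :n_vid] = aligned.T            # audio -> video, symmetric
    mask[n_vid:, n_vid:] = True                 # audio tokens see all audio
    return mask
```

Such a mask would be passed to the attention operator so that disallowed pairs receive -inf logits; the point of the sketch is that restricting cross-modal attention to temporally nearby tokens is what turns unconstrained 3D attention into a trainable, alignment-enforcing variant.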
Submitted 21 December, 2025;
originally announced December 2025.
-
Kling-Omni Technical Report
Authors:
Kling Team,
Jialu Chen,
Yuanzheng Ci,
Xiangyu Du,
Zipeng Feng,
Kun Gai,
Sainan Guo,
Feng Han,
Jingbin He,
Kang He,
Xiao Hu,
Xiaohua Hu,
Boyuan Jiang,
Fangyuan Kong,
Hang Li,
Jie Li,
Qingyu Li,
Shen Li,
Xiaohan Li,
Yan Li,
Jiajun Liang,
Borui Liao,
Yiqiao Liao,
Weihong Lin,
Quande Liu
, et al. (43 additional authors not shown)
Abstract:
We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality, intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. More than a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning about, generating, and interacting with dynamic, complex worlds.
Submitted 18 December, 2025;
originally announced December 2025.
-
UniBYD: A Unified Framework for Learning Robotic Manipulation Across Embodiments Beyond Imitation of Human Demonstrations
Authors:
Tingyu Yuan,
Biaoliang Guan,
Wen Ye,
Ziyan Tian,
Yi Yang,
Weijie Zhou,
Zhaowen Li,
Yan Huang,
Peng Wang,
Chaoyang Zhao,
Jinqiao Wang
Abstract:
In embodied intelligence, the embodiment gap between robotic and human hands brings significant challenges for learning from human demonstrations. Although some studies have attempted to bridge this gap using reinforcement learning, they remain confined to merely reproducing human manipulation, resulting in limited task performance. Moreover, current methods struggle to support diverse robotic hand configurations. In this paper, we propose UniBYD, a unified framework that uses a dynamic reinforcement learning algorithm to discover manipulation policies aligned with the robot's physical characteristics. To enable consistent modeling across diverse robotic hand morphologies, UniBYD incorporates a unified morphological representation (UMR). Building on UMR, we design a dynamic PPO with an annealed reward schedule, enabling reinforcement learning to transition from offline-informed imitation of human demonstrations to online-adaptive exploration of policies better adapted to diverse robotic morphologies, thereby going beyond mere imitation of human hands. To address the severe state drift caused by the limited capability of early-stage policies, we design a hybrid Markov-based shadow engine that provides fine-grained guidance to anchor the imitation within the expert's manifold. To evaluate UniBYD, we propose UniManip, the first benchmark for cross-embodiment manipulation spanning diverse robotic morphologies. Experiments demonstrate a 44.08% average improvement in success rate over the current state-of-the-art. Upon acceptance, we will release our code and benchmark.
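The annealed reward schedule described above shifts the learning signal from imitating human demonstrations early in training toward task-driven exploration later. A minimal sketch of such a schedule (the linear decay and its parameters are assumptions for illustration, not UniBYD's exact formulation):

```python
def annealed_reward(imitation_r, task_r, step, total_steps, floor=0.0):
    """Blend an imitation reward with a task reward using a weight that
    decays linearly over training: early on the agent is anchored to the
    human demonstrations, later it is free to explore policies suited to
    its own morphology. Illustrative schedule, not the UniBYD code."""
    alpha = max(floor, 1.0 - step / total_steps)  # imitation weight decays
    return alpha * imitation_r + (1.0 - alpha) * task_r
```

Keeping a nonzero `floor` would retain a residual imitation term throughout training, one plausible way to prevent the policy from drifting entirely off the demonstrated manifold.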
Submitted 10 March, 2026; v1 submitted 12 December, 2025;
originally announced December 2025.