
Showing 1–50 of 357 results for author: Yue, X

Searching in archive cs.
  1. arXiv:2604.07239  [pdf, ps, other]

    cs.CL cs.IT cs.LG

    Efficient Learned Data Compression via Dual-Stream Feature Decoupling

    Authors: Huidong Ma, Xinyan Shi, Hui Sun, Xiaofei Yue, Xiaoguang Liu, Gang Wang, Wentong Cai

    Abstract: While Learned Data Compression (LDC) has achieved superior compression ratios, balancing precise probability modeling with system efficiency remains challenging. Crucially, uniform single-stream architectures struggle to simultaneously capture micro-syntactic and macro-semantic features, necessitating deep serial stacking that exacerbates latency. Compounding this, heterogeneous systems are constr…

    Submitted 8 April, 2026; originally announced April 2026.

    Comments: Accepted to ACL 2026

  2. arXiv:2604.02029  [pdf, ps, other]

    cs.AI

    The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook

    Authors: Xinlei Yu, Zhangquan Chen, Yongbo He, Tianyu Fu, Cheng Yang, Chengming Xu, Yue Ma, Xiaobin Hu, Zhe Cao, Jie Xu, Guibin Zhang, Jiale Tao, Jiayi Zhang, Siyuan Ma, Kaituo Feng, Haojie Huang, Youxing Li, Ronghao Chen, Huacan Wang, Chenglin Wu, Zikun Su, Xiaogang Xu, Kelu Yao, Kun Wang, Chen Gao , et al. (12 additional authors not shown)

    Abstract: Latent space is rapidly emerging as a native substrate for language-based models. While modern systems are still commonly understood through explicit token-level generation, an increasing body of work shows that many critical internal processes are more naturally carried out in continuous latent space than in human-readable verbal traces. This shift is driven by the structural limitations of expli…

    Submitted 2 April, 2026; originally announced April 2026.

  3. arXiv:2603.28767  [pdf, ps, other]

    cs.CV

    Gen-Searcher: Reinforcing Agentic Search for Image Generation

    Authors: Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jiang, Hongyu Li, Dian Zheng, Chenyang Wang, Xiangyu Yue

    Abstract: Recent image generation models have shown strong capabilities in generating high-fidelity and photorealistic images. However, they are fundamentally constrained by frozen internal knowledge, thus often failing on real-world scenarios that are knowledge-intensive or require up-to-date information. In this paper, we present Gen-Searcher, the first attempt to train a search-augmented image generat…

    Submitted 30 March, 2026; originally announced March 2026.

    Comments: Project page: https://gen-searcher.vercel.app Code: https://github.com/tulerfeng/Gen-Searcher

  4. arXiv:2603.22831  [pdf, ps, other]

    cs.CE q-fin.MF

    Option pricing model under the G-expectation framework

    Authors: Ziting Pei, Xingye Yue, Xiaotao Zheng

    Abstract: G-expectation, as a sublinear expectation, provides a powerful framework for modeling uncertainty in financial markets. Motivated by the need for robust valuation under model uncertainty, this work develops a unified risk-neutral valuation approach within the G-expectation environment, yielding a nonlinear generalization of the Black-Scholes model, termed the G-Black-Scholes equation. To enhance c…

    Submitted 24 March, 2026; originally announced March 2026.

  5. arXiv:2603.21176  [pdf, ps, other]

    cs.CV

    GIDE: Unlocking Diffusion LLMs for Precise Training-Free Image Editing

    Authors: Zifeng Zhu, Jiaming Han, Jiaxiang Zhao, Minnan Luo, Xiangyu Yue

    Abstract: While Diffusion Large Language Models (DLLMs) have demonstrated remarkable capabilities in multi-modal generation, performing precise, training-free image editing remains an open challenge. Unlike continuous diffusion models, the discrete tokenization inherent in DLLMs hinders the application of standard noise inversion techniques, often leading to structural degradation during editing. In this pa…

    Submitted 22 March, 2026; originally announced March 2026.

    Comments: 25 pages, 7 figures

  6. arXiv:2603.20611  [pdf, ps, other]

    cs.CV

    GaussianPile: A Unified Sparse Gaussian Splatting Framework for Slice-based Volumetric Reconstruction

    Authors: Di Kong, Yikai Wang, Wenjie Guo, Yifan Bu, Boya Zhang, Yuexin Duan, Xiawei Yue, Wenbiao Du, Yiman Zhong, Yuwen Chen, Cheng Ma

    Abstract: Slice-based volumetric imaging is widely applied and it demands representations that compress aggressively while preserving internal structure for analysis. We introduce GaussianPile, unifying 3D Gaussian splatting with an imaging system-aware focus model to address this challenge. Our proposed method introduces three key innovations: (i) a slice-aware piling strategy that positions anisotropic 3D…

    Submitted 20 March, 2026; originally announced March 2026.

    Comments: Accepted by IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 (CVPR 2026)

  7. arXiv:2603.17826  [pdf, ps, other]

    cs.SE cs.AI

    FailureMem: A Failure-Aware Multimodal Framework for Autonomous Software Repair

    Authors: Ruize Ma, Yilei Jiang, Shilin Zhang, Zheng Ma, Yi Feng, Vincent Ng, Zhi Wang, Xiangyu Yue, Chuanyi Li, Lewei Lu

    Abstract: Multimodal Automated Program Repair (MAPR) extends traditional program repair by requiring models to jointly reason over source code, textual issue descriptions, and visual artifacts such as GUI screenshots. While recent LLM-based repair systems have shown promising results, existing approaches face several limitations: rigid workflow pipelines restrict exploration during debugging, visual reasoni…

    Submitted 18 March, 2026; originally announced March 2026.

  8. arXiv:2603.00976  [pdf, ps, other]

    cs.CV

    PreciseCache: Precise Feature Caching for Efficient and High-fidelity Video Generation

    Authors: Jiangshan Wang, Kang Zhao, Jiayi Guo, Jiayu Wang, Hang Guo, Chenyang Zhu, Xiu Li, Xiangyu Yue

    Abstract: High computational costs and slow inference hinder the practical application of video generation models. While prior works accelerate the generation process through feature caching, they often suffer from notable quality degradation. In this work, we reveal that this issue arises from their inability to distinguish truly redundant features, which leads to the unintended skipping of computations on…

    Submitted 2 March, 2026; v1 submitted 1 March, 2026; originally announced March 2026.

    Comments: ICLR 2026

  9. arXiv:2603.00563  [pdf, ps, other]

    cs.SD cs.AI

    Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion

    Authors: Sen Zhang, Jianguo Wei, Wenhuan Lu, Xianghu Yue, Wei Li, Qiang Li, Pengcheng Zhao, Ming Cai, Luo Si

    Abstract: The Transformer-based Whisper model has achieved state-of-the-art performance in Automatic Speech Recognition (ASR). However, its Multi-Head Attention (MHA) mechanism results in significant GPU memory consumption due to the linearly growing Key-Value (KV) cache usage, which is problematic for many applications especially with long-form audio. To address this, we introduce Whisper-MLA, a novel arch…

    Submitted 28 February, 2026; originally announced March 2026.

    Comments: 5 pages, 3 figures, accepted at ICASSP 2026

  10. arXiv:2602.14178  [pdf, ps, other]

    cs.CV cs.AI

    UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model

    Authors: Shaobin Zhuang, Yuang Ai, Jiaming Han, Weijia Mao, Xiaohui Li, Fangyikang Wang, Xiao Wang, Yan Li, Shanchuan Lin, Kun Xu, Zhenheng Yang, Huaibo Huang, Xiangyu Yue, Hao Chen, Yali Wang

    Abstract: Unified Multimodal Large Language Models (MLLMs) require a visual representation that simultaneously supports high-fidelity reconstruction, complex semantic extraction, and generative suitability. However, existing visual tokenizers typically struggle to satisfy these conflicting objectives within a single framework. In this paper, we introduce UniWeTok, a unified discrete tokenizer designed to br…

    Submitted 11 March, 2026; v1 submitted 15 February, 2026; originally announced February 2026.

    Comments: 29 pages, 9 figures, 33 tables

  11. arXiv:2602.14041  [pdf, ps, other]

    cs.CV cs.AI

    BitDance: Scaling Autoregressive Generative Models with Binary Tokens

    Authors: Yuang Ai, Jiaming Han, Shaobin Zhuang, Weijia Mao, Xuefeng Hu, Ziyan Yang, Zhenheng Yang, Yali Wang, Huaibo Huang, Xiangyu Yue, Hao Chen

    Abstract: We present BitDance, a scalable autoregressive (AR) image generator that predicts binary visual tokens instead of codebook indices. With high-entropy binary latents, BitDance lets each token represent up to $2^{256}$ states, yielding a compact yet highly expressive discrete representation. Sampling from such a huge token space is difficult with standard classification. To resolve this, BitDance us…

    Submitted 13 March, 2026; v1 submitted 15 February, 2026; originally announced February 2026.

    Comments: Code and models: https://github.com/shallowdream204/BitDance

  12. arXiv:2602.13993  [pdf, ps, other]

    cs.CV

    Elastic Diffusion Transformer

    Authors: Jiangshan Wang, Zeqiang Lai, Jiarui Chen, Jiayi Guo, Hang Guo, Xiu Li, Xiangyu Yue, Chunchao Guo

    Abstract: Diffusion Transformers (DiT) have demonstrated remarkable generative capabilities but remain highly computationally expensive. Previous acceleration methods, such as pruning and distillation, typically rely on a fixed computational capacity, leading to insufficient acceleration and degraded generation quality. To address this limitation, we propose \textbf{Elastic Diffusion Transformer (E-DiT)}, a…

    Submitted 15 February, 2026; originally announced February 2026.

  13. arXiv:2602.11075  [pdf, ps, other]

    cs.RO

    RISE: Self-Improving Robot Policy with Compositional World Model

    Authors: Jiazhi Yang, Kunyang Lin, Jinwei Li, Wencong Zhang, Tianwei Lin, Longyan Wu, Zhizhong Su, Hao Zhao, Ya-Qin Zhang, Li Chen, Ping Luo, Xiangyu Yue, Hongyang Li

    Abstract: Despite the sustained scaling on model capacity and data acquisition, Vision-Language-Action (VLA) models remain brittle in contact-rich and dynamic manipulation tasks, where minor execution deviations can compound into failures. While reinforcement learning (RL) offers a principled path to robustness, on-policy RL in the physical world is constrained by safety risk, hardware cost, and environment…

    Submitted 11 February, 2026; originally announced February 2026.

    Comments: Project page: https://opendrivelab.com/kai0-rl/

  14. arXiv:2602.09878  [pdf, ps, other]

    cs.CV

    MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation

    Authors: Jiaxu Wang, Yicheng Jiang, Tianlun He, Jingkai Sun, Qiang Zhang, Junhao He, Jiahang Cao, Zesen Gan, Mingyuan Sun, Qiming Shao, Xiangyu Yue

    Abstract: World-model-based imagine-then-act becomes a promising paradigm for robotic manipulation, yet existing approaches typically support either purely image-based forecasting or reasoning over partial 3D geometry, limiting their ability to predict complete 4D scene dynamics. This work proposes a novel embodied 4D world model that enables geometrically consistent, arbitrary-view RGBD generation: given o…

    Submitted 10 February, 2026; originally announced February 2026.

  15. arXiv:2602.09207  [pdf, ps, other]

    cs.LG cs.AI

    CausalGDP: Causality-Guided Diffusion Policies for Reinforcement Learning

    Authors: Xiaofeng Xiao, Xiao Hu, Yang Ye, Xubo Yue

    Abstract: Reinforcement learning (RL) has achieved remarkable success in a wide range of sequential decision-making problems. Recent diffusion-based policies further improve RL by modeling complex, high-dimensional action distributions. However, existing diffusion policies primarily rely on statistical associations and fail to explicitly account for causal relationships among states, actions, and rewards, l…

    Submitted 9 February, 2026; originally announced February 2026.

  16. arXiv:2602.08990  [pdf, ps, other]

    cs.AI

    InternAgent-1.5: A Unified Agentic Framework for Long-Horizon Autonomous Scientific Discovery

    Authors: Shiyang Feng, Runmin Ma, Xiangchao Yan, Yue Fan, Yusong Hu, Songtao Huang, Shuaiyu Zhang, Zongsheng Cao, Tianshuo Peng, Jiakang Yuan, Zijie Guo, Zhijie Zhong, Shangheng Du, Weida Wang, Jinxin Shi, Yuhao Zhou, Xiaohan He, Zhiyin Yu, Fangchen Yu, Qihao Zheng, Jiamin Wu, Mianxin Liu, Chi Zhang, Shaowei Hou, Shuya Li , et al. (32 additional authors not shown)

    Abstract: We introduce InternAgent-1.5, a unified system designed for end-to-end scientific discovery across computational and empirical domains. The system is built on a structured architecture composed of three coordinated subsystems for generation, verification, and evolution. These subsystems are supported by foundational capabilities for deep research, solution optimization, and long-horizon memory. Th…

    Submitted 9 February, 2026; originally announced February 2026.

    Comments: Code and project page: https://github.com/InternScience/InternAgent

  17. arXiv:2602.04907  [pdf, ps, other]

    cs.LG cs.AI stat.ME

    Physics as the Inductive Bias for Causal Discovery

    Authors: Jianhong Chen, Naichen Shi, Xubo Yue

    Abstract: Causal discovery is often a data-driven paradigm to analyze complex real-world systems. In parallel, physics-based models such as ordinary differential equations (ODEs) provide mechanistic structure for many dynamical processes. Integrating these paradigms potentially allows physical knowledge to act as an inductive bias, improving identifiability, stability, and robustness of causal discovery in…

    Submitted 3 February, 2026; originally announced February 2026.

  18. arXiv:2602.03570  [pdf, ps, other]

    cs.LG

    Asymmetric Hierarchical Anchoring for Audio-Visual Joint Representation: Resolving Information Allocation Ambiguity for Robust Cross-Modal Generalization

    Authors: Bixing Wu, Yuhong Zhao, Zongli Ye, Jiachen Lian, Xiangyu Yue, Gopala Anumanchipalli

    Abstract: Audio-visual joint representation learning under Cross-Modal Generalization (CMG) aims to transfer knowledge from a labeled source modality to an unlabeled target modality through a unified discrete representation space. Existing symmetric frameworks often suffer from information allocation ambiguity, where the absence of structural inductive bias leads to semantic-specific leakage across modaliti…

    Submitted 3 February, 2026; originally announced February 2026.

    Comments: 18 pages, 11 figures

  19. arXiv:2602.03012  [pdf, ps, other]

    cs.CR cs.AI

    CVE-Factory: Scaling Expert-Level Agentic Tasks for Code Security Vulnerability

    Authors: Xianzhen Luo, Jingyuan Zhang, Shiqi Zhou, Rain Huang, Chuan Xiao, Qingfu Zhu, Zhiyuan Ma, Xing Yue, Yang Yue, Wencong Zeng, Wanxiang Che

    Abstract: Evaluating and improving the security capabilities of code agents requires high-quality, executable vulnerability tasks. However, existing works rely on costly, unscalable manual reproduction and suffer from outdated data distributions. To address these, we present CVE-Factory, the first multi-agent framework to achieve expert-level quality in automatically transforming sparse CVE metadata into fu…

    Submitted 2 February, 2026; originally announced February 2026.

    Comments: Under Review

  20. arXiv:2602.02244  [pdf, ps, other]

    cs.LG cs.CL

    Learning While Staying Curious: Entropy-Preserving Supervised Fine-Tuning via Adaptive Self-Distillation for Large Reasoning Models

    Authors: Hao Wang, Hao Gu, Hongming Piao, Kaixiong Gong, Yuxiao Ye, Xiangyu Yue, Sirui Han, Yike Guo, Dapeng Wu

    Abstract: The standard post-training recipe for large reasoning models, supervised fine-tuning followed by reinforcement learning (SFT-then-RL), may limit the benefits of the RL stage: while SFT imitates expert demonstrations, it often causes overconfidence and reduces generation diversity, leaving RL with a narrowed solution space to explore. Adding entropy regularization during SFT is not a cure-all; it t…

    Submitted 7 February, 2026; v1 submitted 2 February, 2026; originally announced February 2026.

  21. arXiv:2601.22154  [pdf, ps, other]

    cs.AI cs.CL

    Exploring Reasoning Reward Model for Agents

    Authors: Kaixuan Fan, Kaituo Feng, Manyuan Zhang, Tianshuo Peng, Zhixun Li, Yilei Jiang, Shuang Chen, Peng Pei, Xunliang Cai, Xiangyu Yue

    Abstract: Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still rely on sparse outcome-based rewards for training. Such feedback fails to differentiate intermediate reasoning quality, leading to suboptimal training results. In this paper, we introduce Agent Reasoning Reward Model (Agent-RRM), a mult…

    Submitted 29 January, 2026; originally announced January 2026.

    Comments: Project page: https://github.com/kxfan2002/Reagent

  22. arXiv:2601.06443  [pdf, ps, other]

    cs.CV

    How to Build Robust, Scalable Models for GSV-Based Indicators in Neighborhood Research

    Authors: Xiaoya Tang, Xiaohe Yue, Heran Mane, Dapeng Li, Quynh Nguyen, Tolga Tasdizen

    Abstract: A substantial body of health research demonstrates a strong link between neighborhood environments and health outcomes. Recently, there has been increasing interest in leveraging advances in computer vision to enable large-scale, systematic characterization of neighborhood built environments. However, the generalizability of vision models across fundamentally different domains remains uncertain, f…

    Submitted 10 January, 2026; originally announced January 2026.

  23. arXiv:2601.05014  [pdf, ps, other]

    cs.RO

    The RoboSense Challenge: Sense Anything, Navigate Anywhere, Adapt Across Platforms

    Authors: Lingdong Kong, Shaoyuan Xie, Zeying Gong, Ye Li, Meng Chu, Ao Liang, Yuhao Dong, Tianshuai Hu, Ronghe Qiu, Rong Li, Hanjiang Hu, Dongyue Lu, Wei Yin, Wenhao Ding, Linfeng Li, Hang Song, Wenwei Zhang, Yuexin Ma, Junwei Liang, Zhedong Zheng, Lai Xing Ng, Benoit R. Cottereau, Wei Tsang Ooi, Ziwei Liu, Zhanpeng Zhang , et al. (114 additional authors not shown)

    Abstract: Autonomous systems are increasingly deployed in open and dynamic environments -- from city streets to aerial and indoor spaces -- where perception models must remain reliable under sensor noise, environmental variation, and platform shifts. However, even state-of-the-art methods often degrade under unseen conditions, highlighting the need for robust and generalizable robot sensing. The RoboSense 2…

    Submitted 8 January, 2026; originally announced January 2026.

    Comments: Official IROS 2025 RoboSense Challenge Report; 51 pages, 37 figures, 5 tables; Competition Website at https://robosense2025.github.io/

  24. arXiv:2601.00625  [pdf, ps, other]

    cs.CV

    RePose: A Real-Time 3D Human Pose Estimation and Biomechanical Analysis Framework for Rehabilitation

    Authors: Junxiao Xue, Pavel Smirnov, Ziao Li, Yunyun Shi, Shi Chen, Xinyi Yin, Xiaohan Yue, Lei Wang, Yiduo Wang, Feng Lin, Yijia Chen, Xiao Ma, Xiaoran Yan, Qing Zhang, Fengjian Xue, Xuecheng Wu

    Abstract: We propose a real-time 3D human pose estimation and motion analysis method termed RePose for rehabilitation training. It is capable of real-time monitoring and evaluation of patients' motion during rehabilitation, providing immediate feedback and guidance to assist patients in executing rehabilitation exercises correctly. Firstly, we introduce a unified pipeline for end-to-end real-time human pose…

    Submitted 2 January, 2026; originally announced January 2026.

  25. High Dimensional Data Decomposition for Anomaly Detection of Textured Images

    Authors: Ji Song, Xing Wang, Jianguo Wu, Xiaowei Yue

    Abstract: In the realm of diverse high-dimensional data, images play a significant role across various processes of manufacturing systems where efficient image anomaly detection has emerged as a core technology of utmost importance. However, when applied to textured defect images, conventional anomaly detection methods have limitations including non-negligible misidentification, low robustness, and excessiv…

    Submitted 23 December, 2025; originally announced December 2025.

  26. arXiv:2512.18772  [pdf, ps, other]

    cs.CV

    In-Context Audio Control of Video Diffusion Transformers

    Authors: Wenze Liu, Weicai Ye, Minghong Cai, Quande Liu, Xintao Wang, Xiangyu Yue

    Abstract: Recent advancements in video generation have seen a shift towards unified, transformer-based foundation models that can handle multiple conditional inputs in-context. However, these models have primarily focused on modalities like text, images, and depth maps, while strictly time-synchronous signals like audio have been underexplored. This paper introduces In-Context Audio Control of video diffusi…

    Submitted 21 December, 2025; originally announced December 2025.

  27. arXiv:2512.16918  [pdf, ps, other]

    cs.CV

    AdaTooler-V: Adaptive Tool-Use for Images and Videos

    Authors: Chaoyang Wang, Kaituo Feng, Dongyang Chen, Zhongyu Wang, Zhixun Li, Sicheng Gao, Meng Meng, Xu Zhou, Manyuan Zhang, Yuzhang Shang, Xiangyu Yue

    Abstract: Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we…

    Submitted 19 December, 2025; v1 submitted 18 December, 2025; originally announced December 2025.

    Comments: Project page: https://github.com/CYWang735/AdaTooler-V

  28. arXiv:2512.16295  [pdf, ps, other]

    cs.AI

    OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models

    Authors: Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Qiushi Sun, Zhaoyang Liu, Zhoumianze Liu, Yu Qiao, Xiangyu Yue, Zun Wang, Zichen Ding

    Abstract: With VLM-powered computer-using agents (CUAs) becoming increasingly capable at graphical user interface (GUI) navigation and manipulation, reliable step-level decision-making has emerged as a key bottleneck for real-world deployment. In long-horizon workflows, errors accumulate quickly and irreversible actions can cause unintended consequences, motivating critic models that assess each action befo…

    Submitted 18 December, 2025; originally announced December 2025.

  29. arXiv:2512.16279  [pdf, ps, other]

    cs.AI cs.CL

    QuadSentinel: Sequent Safety for Machine-Checkable Control in Multi-agent Systems

    Authors: Yiliu Yang, Yilei Jiang, Qunzhong Wang, Yingshui Tan, Xiaoyong Zhu, Sherman S. M. Chow, Bo Zheng, Xiangyu Yue

    Abstract: Safety risks arise as large language model-based agents solve complex tasks with tools, multi-step plans, and inter-agent messages. However, deployer-written policies in natural language are ambiguous and context dependent, so they map poorly to machine-checkable rules, and runtime enforcement is unreliable. Expressing safety policies as sequents, we propose \textsc{QuadSentinel}, a four-agent gua…

    Submitted 18 December, 2025; originally announced December 2025.

    Comments: Preprint

  30. arXiv:2512.10258  [pdf, ps, other]

    cs.LG

    R^2-HGP: A Double-Regularized Gaussian Process for Heterogeneous Transfer Learning

    Authors: Duo Wang, Xinming Wang, Chao Wang, Xiaowei Yue, Jianguo Wu

    Abstract: Multi-output Gaussian process (MGP) models have attracted significant attention for their flexibility and uncertainty-quantification capabilities, and have been widely adopted in multi-source transfer learning scenarios due to their ability to capture inter-task correlations. However, they still face several challenges in transfer learning. First, the input spaces of the source and target domains…

    Submitted 10 December, 2025; originally announced December 2025.

    Comments: 17 pages, 9 figures. Under review for IEEE TPAMI

  31. arXiv:2512.07783  [pdf, ps, other]

    cs.CL

    On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models

    Authors: Charlie Zhang, Graham Neubig, Xiang Yue

    Abstract: Recent reinforcement learning (RL) techniques have yielded impressive reasoning improvements in language models, yet it remains unclear whether post-training truly extends a model's reasoning ability beyond what it acquires during pre-training. A central challenge is the lack of control in modern training pipelines: large-scale pre-training corpora are opaque, mid-training is often underexamined,…

    Submitted 8 December, 2025; originally announced December 2025.

  32. arXiv:2512.05530  [pdf, ps, other]

    cs.AI

    MIND: Multi-rationale INtegrated Discriminative Reasoning Framework for Multi-modal Large Models

    Authors: Chuang Yu, Jinmiao Zhao, Mingxuan Zhao, Yunpeng Liu, Xiujun Shu, Yuanhao Feng, Bo Wang, Xiangyu Yue

    Abstract: Recently, multimodal large language models (MLLMs) have been widely applied to reasoning tasks. However, they suffer from limited multi-rationale semantic modeling, insufficient logical robustness, and are susceptible to misleading interpretations in complex scenarios. Therefore, we propose a Multi-rationale INtegrated Discriminative (MIND) reasoning framework, which is designed to endow MLLMs wit…

    Submitted 5 December, 2025; originally announced December 2025.

  33. arXiv:2512.05511  [pdf, ps, other]

    cs.CV

    Rethinking Infrared Small Target Detection: A Foundation-Driven Efficient Paradigm

    Authors: Chuang Yu, Jinmiao Zhao, Yunpeng Liu, Yaokun Li, Xiujun Shu, Yuanhao Feng, Bo Wang, Yimian Dai, Xiangyu Yue

    Abstract: While large-scale visual foundation models (VFMs) exhibit strong generalization across diverse visual domains, their potential for single-frame infrared small target (SIRST) detection remains largely unexplored. To fill this gap, we systematically introduce the frozen representations from VFMs into the SIRST task for the first time and propose a Foundation-Driven Efficient Paradigm (FDEP), which c…

    Submitted 5 December, 2025; originally announced December 2025.

  34. Edged Weisfeiler-Lehman Algorithm

    Authors: Xiao Yue, Bo Liu, Feng Zhang, Guangzhi Qu

    Abstract: As a classical approach to graph learning, the propagation-aggregation methodology is widely exploited by many Graph Neural Networks (GNNs), wherein the representation of a node is updated by aggregating representations from itself and neighbor nodes recursively. Similar to the propagation-aggregation methodology, the Weisfeiler-Lehman (1-WL) algorithm tests isomorphism through color refinement…

    Submitted 4 December, 2025; originally announced December 2025.

    Comments: Author's Accepted Manuscript (AAM) of ICANN 2024 paper published in LNCS (Springer). Final version available at: https://link.springer.com/chapter/10.1007/978-3-031-72344-5_7

    Journal ref: ICANN 2024, LNCS 15020, pp. 93-109, Springer, 2024

  35. arXiv:2512.03052  [pdf, ps, other]

    cs.GR cs.CV

    LATTICE: Democratize High-Fidelity 3D Generation at Scale

    Authors: Zeqiang Lai, Yunfei Zhao, Zibo Zhao, Haolin Liu, Qingxiang Lin, Jingwei Huang, Chunchao Guo, Xiangyu Yue

    Abstract: We present LATTICE, a new framework for high-fidelity 3D asset generation that bridges the quality and scalability gap between 3D and 2D generative models. While 2D image synthesis benefits from fixed spatial grids and well-established transformer architectures, 3D generation remains fundamentally more challenging due to the need to predict both spatial structure and detailed geometric surfaces fr…

    Submitted 23 November, 2025; originally announced December 2025.

    Comments: Technical Report

  36. arXiv:2512.03043  [pdf, ps, other]

    cs.CV

    OneThinker: All-in-one Reasoning Model for Image and Video

    Authors: Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, Yan Feng, Peng Pei, Xunliang Cai, Xiangyu Yue

    Abstract: Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This results in limited scalability toward a multimodal reasoning generalist, which restricts practical versatilit…

    Submitted 3 December, 2025; v1 submitted 2 December, 2025; originally announced December 2025.

    Comments: Project page: https://github.com/tulerfeng/OneThinker

  37. arXiv:2511.23476  [pdf, ps, other]

    cs.AI

    Thinking by Doing: Building Efficient World Model Reasoning in LLMs via Multi-turn Interaction

    Authors: Bao Shu, Yan Cai, Jianjian Sun, Chunrui Han, En Yu, Liang Zhao, Jingcheng Hu, Yinmin Zhang, Haoran Lv, Yuang Peng, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Xiangyu Yue

    Abstract: Developing robust world model reasoning is crucial for large language model (LLM) agents to plan and interact in complex environments. While multi-turn interaction offers a superior understanding of environmental dynamics via authentic feedback, current approaches often impose a rigid reasoning process, which constrains the model's active learning, ultimately hindering efficient world model reason…

    Submitted 28 November, 2025; originally announced November 2025.

    Comments: 17 pages, 9 figures

  38. arXiv:2511.21444  [pdf, ps, other]

    cs.AI physics.ao-ph

    EWE: An Agentic Framework for Extreme Weather Analysis

    Authors: Zhe Jiang, Jiong Wang, Xiaoyu Yue, Zijie Guo, Wenlong Zhang, Fenghua Ling, Wanli Ouyang, Lei Bai

    Abstract: Extreme weather events pose escalating risks to global society, underscoring the urgent need to unravel their underlying physical mechanisms. Yet the prevailing expert-driven, labor-intensive diagnostic paradigm has created a critical analytical bottleneck, stalling scientific progress. While AI for Earth Science has achieved notable advances in prediction, the equally essential challenge of autom…

    Submitted 26 November, 2025; originally announced November 2025.

  39. arXiv:2511.20646  [pdf, ps, other]

    cs.CV

    3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding

    Authors: Xiaoye Wang, Chen Tang, Xiangyu Yue, Wei-Hong Li

    Abstract: This paper addresses the challenge of training a single network to jointly perform multiple dense prediction tasks, such as segmentation and depth estimation, i.e., multi-task learning (MTL). Current approaches mainly capture cross-task relations in the 2D image space, often leading to unstructured features lacking 3D-awareness. We argue that 3D-awareness is vital for modeling cross-task correlati…

    Submitted 25 November, 2025; originally announced November 2025.

    Comments: 3D-aware Multi-task Learning, Cross-view Correlations, Code will be available at https://github.com/WeiHongLee/CrossView3DMTL

  40. arXiv:2511.18780  [pdf, ps, other]

    cs.CV cs.AI

    ConceptGuard: Proactive Safety in Text-and-Image-to-Video Generation through Multimodal Risk Detection

    Authors: Ruize Ma, Minghong Cai, Yilei Jiang, Jiaming Han, Yi Feng, Yingshui Tan, Xiaoyong Zhu, Bo Zhang, Bo Zheng, Xiangyu Yue

    Abstract: Recent progress in video generative models has enabled the creation of high-quality videos from multimodal prompts that combine text and images. While these systems offer enhanced controllability, they also introduce new safety risks, as harmful content can emerge from individual modalities or their interaction. Existing safety methods are often text-only, require prior knowledge of the risk categ…

    Submitted 26 November, 2025; v1 submitted 24 November, 2025; originally announced November 2025.

  41. arXiv:2511.17731  [pdf, ps, other

    cs.CV cs.LG

    VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning

    Authors: Lingxiao Li, Yifan Wang, Xinyan Gao, Chen Tang, Xiangyu Yue, Chenyu You

    Abstract: Chain-of-Thought (CoT) prompting has proven remarkably effective for eliciting complex reasoning in large language models (LLMs). Yet, its potential in multimodal large language models (MLLMs) remains largely untapped, hindered by the absence of large-scale datasets that capture the rich, spatially grounded reasoning intrinsic to visual understanding. Existing visual-CoT resources are typically sm…

    Submitted 21 November, 2025; originally announced November 2025.

  42. arXiv:2511.16317  [pdf, ps, other

    cs.CV

    NaTex: Seamless Texture Generation as Latent Color Diffusion

    Authors: Zeqiang Lai, Yunfei Zhao, Zibo Zhao, Xin Yang, Xin Huang, Jingwei Huang, Xiangyu Yue, Chunchao Guo

    Abstract: We present NaTex, a native texture generation framework that predicts texture color directly in 3D space. In contrast to previous approaches that rely on baking 2D multi-view images synthesized by geometry-conditioned Multi-View Diffusion models (MVDs), NaTex avoids several inherent limitations of the MVD pipeline. These include difficulties in handling occluded regions that require inpainting, ac…

    Submitted 20 November, 2025; originally announced November 2025.

    Comments: Technical Report

  43. arXiv:2511.01824  [pdf, ps, other

    cs.AI cs.LG

    Simulating Environments with Reasoning Models for Agent Training

    Authors: Yuetai Li, Huseyin A Inan, Xiang Yue, Wei-Ning Chen, Lukas Wutschitz, Janardhan Kulkarni, Radha Poovendran, Robert Sim, Saravan Rajmohan

    Abstract: LLM agents excel in compact environments requiring deep reasoning but remain brittle when operating in broader, more complex contexts that demand robustness across diverse tools and schemas. Building bespoke environments for training is heavy, brittle, and limits progress. In this paper, we demonstrate that LLMs can simulate realistic environment feedback without access to actual testbed data or A…

    Submitted 3 November, 2025; originally announced November 2025.

  44. arXiv:2510.25726  [pdf, ps, other

    cs.CL cs.AI

    The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

    Authors: Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang, Rui Ge, Yuxuan Cao, Yuzhen Huang, Wei Liu, Junteng Liu, Zhaochen Su, Yiyang Guo, Fan Zhou, Lueyang Zhang, Juan Michelini, Xingyao Wang, Xiang Yue, Shuyan Zhou, Graham Neubig, Junxian He

    Abstract: Real-world language agents must handle complex, multi-step workflows across diverse Apps. For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database to detect anomalies and generate reports following an operating manual. However, existing language agent benchmarks often focus on narrow domains or simplified tasks that lack the diversi…

    Submitted 26 February, 2026; v1 submitted 29 October, 2025; originally announced October 2025.

    Comments: ICLR 2026, Website: https://toolathlon.xyz/

  45. arXiv:2510.24702  [pdf, ps, other

    cs.CL cs.AI

    Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents

    Authors: Yueqi Song, Ketan Ramaneti, Zaid Sheikh, Ziru Chen, Boyu Gou, Tianbao Xie, Yiheng Xu, Danyang Zhang, Apurva Gandhi, Fan Yang, Joseph Liu, Tianyue Ou, Zhihao Yuan, Frank Xu, Shuyan Zhou, Xingyao Wang, Xiang Yue, Tao Yu, Huan Sun, Yu Su, Graham Neubig

    Abstract: Public research results on large-scale supervised finetuning of AI agents remain relatively rare, since the collection of agent training data presents unique challenges. In this work, we argue that the bottleneck is not a lack of underlying data sources, but that a large variety of data is fragmented across heterogeneous formats, tools, and interfaces. To this end, we introduce the agent data prot…

    Submitted 3 March, 2026; v1 submitted 28 October, 2025; originally announced October 2025.

  46. arXiv:2510.23642  [pdf, ps, other

    cs.SE cs.AI cs.CL cs.PL

    VisCoder2: Building Multi-Language Visualization Coding Agents

    Authors: Yuansheng Ni, Songcheng Cai, Xiangchao Chen, Jiarong Liang, Zhiheng Lyu, Jiaqi Deng, Kai Zou, Ping Nie, Fei Yuan, Xiang Yue, Wenhu Chen

    Abstract: Large language models (LLMs) have recently enabled coding agents capable of generating, executing, and revising visualization code. However, existing models often fail in practical workflows due to limited language coverage, unreliable execution, and lack of iterative correction mechanisms. Progress has been constrained by narrow datasets and benchmarks that emphasize single-round generation and s…

    Submitted 7 April, 2026; v1 submitted 24 October, 2025; originally announced October 2025.

  47. arXiv:2510.17332  [pdf, ps, other

    cs.CV

    iDETEX: Empowering MLLMs for Intelligent DETailed EXplainable IQA

    Authors: Zhaoran Zhao, Xinli Yue, Jianhui Sun, Yuhao Xie, Tao Shao, Liangchao Yao, Fan Xia, Yuetang Deng

    Abstract: Image Quality Assessment (IQA) has progressed from scalar quality prediction to more interpretable, human-aligned evaluation paradigms. In this work, we address the emerging challenge of detailed and explainable IQA by proposing iDETEX-a unified multimodal large language model (MLLM) capable of simultaneously performing three key tasks: quality grounding, perception, and description. To facilitate…

    Submitted 20 October, 2025; originally announced October 2025.

    Comments: Accepted to ICCV 2025 Workshop

  48. arXiv:2510.15895  [pdf

    cs.HC cs.AI cs.SD

    BREATH: A Bio-Radar Embodied Agent for Tonal and Human-Aware Diffusion Music Generation

    Authors: Yunzhe Wang, Xinyu Tang, Zhixun Huang, Xiaolong Yue, Yuxin Zeng

    Abstract: We present a multimodal system for personalized music generation that integrates physiological sensing, LLM-based reasoning, and controllable audio synthesis. A millimeter-wave radar sensor non-invasively captures heart rate and respiration rate. These physiological signals, combined with environmental state, are interpreted by a reasoning agent to infer symbolic musical descriptors, such as tempo…

    Submitted 9 September, 2025; originally announced October 2025.

    Comments: Accepted by LLM4Music @ ISMIR 2025

  49. arXiv:2510.10518  [pdf, ps, other

    cs.CV

    VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning

    Authors: Qunzhong Wang, Jie Liu, Jiajun Liang, Yilei Jiang, Yuanxing Zhang, Yaozhi Zheng, Xintao Wang, Pengfei Wan, Xiangyu Yue, Jiaheng Liu

    Abstract: Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing fewer frames and causing loss of fine-grained details; and (2) all visual information is packed into the initial prompt, exacerbating hallucination and forgetting during…

    Submitted 19 March, 2026; v1 submitted 12 October, 2025; originally announced October 2025.

  50. arXiv:2510.09606  [pdf, ps, other

    cs.CV

    SpaceVista: All-Scale Visual Spatial Reasoning from mm to km

    Authors: Peiwen Sun, Shiqiang Lang, Dongming Wu, Yi Ding, Kaituo Feng, Huadai Liu, Zhen Ye, Rui Liu, Yun-Hui Liu, Jianan Wang, Xiangyu Yue

    Abstract: With the current surge in spatial reasoning explorations, researchers have made significant progress in understanding indoor scenes, but still struggle with diverse applications such as robotics and autonomous driving. This paper aims to advance all-scale spatial reasoning across diverse scenarios by tackling two key challenges: 1) the heavy reliance on indoor 3D scans and labor-intensive manual a…

    Submitted 10 October, 2025; originally announced October 2025.

    Comments: Project Page: https://peiwensun2000.github.io/mm2km/