Showing 1–50 of 2,311 results for author: Jiang, Y

  1. arXiv:2512.21302  [pdf, ps, other]

    cs.CV

    AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents

    Authors: Yue Cao, Yingyao Wang, Pi Bu, Jingxuan Xing, Wei Jiang, Zekun Zhu, Junpeng Ma, Sashuai Zhou, Tong Lu, Jun Song, Yu Cheng, Yuning Jiang, Bo Zheng

    Abstract: Graphical user interface (GUI) agents can substantially improve productivity by automating frequently executed long-latency tasks on mobile devices. However, existing evaluation benchmarks are still constrained to limited applications, simple tasks, and coarse-grained metrics. To address this, we introduce AndroidLens, a challenging evaluation framework for mobile GUI agents, comprising 571 long-l… ▽ More

    Submitted 24 December, 2025; originally announced December 2025.

    Comments: 23 pages, 13 figures, 8 tables

  2. arXiv:2512.21257  [pdf, ps, other]

    cs.IR cs.CL

    ReaSeq: Unleashing World Knowledge via Reasoning for Sequential Modeling

    Authors: Chuan Wang, Gaoming Yang, Han Wu, Jiakai Tang, Jiahao Yu, Jian Wu, Jianwu Hu, Junjun Zheng, Shuwen Xiao, Yeqiu Yang, Yuning Jiang, Ahjol Nurlanbek, Binbin Cao, Bo Zheng, Fangmei Zhu, Gaoming Zhou, Huimin Yi, Huiping Chu, Jin Huang, Jinzhe Shan, Kenan Cui, Longbin Li, Silu Zhou, Wen Chen, Xia Ming, et al. (8 additional authors not shown)

    Abstract: Industrial recommender systems face two fundamental limitations under the log-driven paradigm: (1) knowledge poverty in ID-based item representations that causes brittle interest modeling under data sparsity, and (2) systemic blindness to beyond-log user interests that constrains model performance within platform boundaries. These limitations stem from an over-reliance on shallow interaction stati… ▽ More

    Submitted 24 December, 2025; originally announced December 2025.

  3. arXiv:2512.21095  [pdf, ps, other]

    cs.CV

    UniRec-0.1B: Unified Text and Formula Recognition with 0.1B Parameters

    Authors: Yongkun Du, Zhineng Chen, Yazhen Xie, Weikang Bai, Hao Feng, Wei Shi, Yuchen Su, Can Huang, Yu-Gang Jiang

    Abstract: Text and formulas constitute the core informational components of many documents. Accurately and efficiently recognizing both is crucial for developing robust and generalizable document parsing systems. Recently, vision-language models (VLMs) have achieved impressive unified recognition of text and formulas. However, they are large-sized and computationally demanding, restricting their usage in ma… ▽ More

    Submitted 24 December, 2025; originally announced December 2025.

  4. arXiv:2512.20092  [pdf, ps, other]

    cs.CL

    Memory-T1: Reinforcement Learning for Temporal Reasoning in Multi-session Agents

    Authors: Yiming Du, Baojun Wang, Yifan Xiang, Zhaowei Wang, Wenyu Huang, Boyang Xue, Bin Liang, Xingshan Zeng, Fei Mi, Haoli Bai, Lifeng Shang, Jeff Z. Pan, Yuxin Jiang, Kam-Fai Wong

    Abstract: Temporal reasoning over long, multi-session dialogues is a critical capability for conversational agents. However, existing works and our pilot study have shown that as dialogue histories grow in length and accumulate noise, current long-context models struggle to accurately identify temporally pertinent information, significantly impairing reasoning performance. To address this, we introduce Memo… ▽ More

    Submitted 23 December, 2025; originally announced December 2025.

  5. arXiv:2512.20034  [pdf, ps, other]

    cs.IR

    VSA: Visual-Structural Alignment for UI-to-Code

    Authors: Xian Wu, Ming Zhang, Zhiyu Fang, Fei Li, Bin Wang, Yong Jiang, Hao Zhou

    Abstract: The automation of user interface development has the potential to accelerate software delivery by mitigating intensive manual implementation. Despite the advancements in Large Multimodal Models for design-to-code translation, existing methodologies predominantly yield unstructured, flat codebases that lack compatibility with component-oriented libraries such as React or Angular. Such outputs typic… ▽ More

    Submitted 22 December, 2025; originally announced December 2025.

  6. arXiv:2512.19753  [pdf, ps, other]

    cond-mat.mtrl-sci cs.AI

    QMBench: A Research Level Benchmark for Quantum Materials Research

    Authors: Yanzhen Wang, Yiyang Jiang, Diana Golovanova, Kamal Das, Hyeonhu Bae, Yufei Zhao, Huu-Thong Le, Abhinava Chatterjee, Yunzhe Liu, Chao-Xing Liu, Felipe H. da Jornada, Binghai Yan, Xiao-Liang Qi

    Abstract: We introduce QMBench, a comprehensive benchmark designed to evaluate the capability of large language model agents in quantum materials research. This specialized benchmark assesses the model's ability to apply condensed matter physics knowledge and computational techniques such as density functional theory to solve research problems in quantum materials science. QMBench encompasses different doma… ▽ More

    Submitted 18 December, 2025; originally announced December 2025.

    Comments: 20 pages, 1 figure

  7. arXiv:2512.18232  [pdf, ps, other]

    cs.SD cs.LG

    AutoSchA: Automatic Hierarchical Music Representations via Multi-Relational Node Isolation

    Authors: Stephen Ni-Hahn, Rico Zhu, Jerry Yin, Yue Jiang, Cynthia Rudin, Simon Mak

    Abstract: Hierarchical representations provide powerful and principled approaches for analyzing many musical genres. Such representations have been broadly studied in music theory, for instance via Schenkerian analysis (SchA). Hierarchical music analyses, however, are highly cost-intensive; the analysis of a single piece of music requires a great deal of time and effort from trained experts. The representat… ▽ More

    Submitted 20 December, 2025; originally announced December 2025.

  8. arXiv:2512.17730  [pdf, ps, other]

    cs.CV

    AdaptPrompt: Parameter-Efficient Adaptation of VLMs for Generalizable Deepfake Detection

    Authors: Yichen Jiang, Mohammed Talha Alam, Sohail Ahmed Khan, Duc-Tien Dang-Nguyen, Fakhri Karray

    Abstract: Recent advances in image generation have led to the widespread availability of highly realistic synthetic media, increasing the difficulty of reliable deepfake detection. A key challenge is generalization, as detectors trained on a narrow class of generators often fail when confronted with unseen models. In this work, we address the pressing need for generalizable detection by leveraging large vis… ▽ More

    Submitted 19 December, 2025; originally announced December 2025.

    Comments: Under Review

  9. arXiv:2512.16848  [pdf, ps, other]

    cs.LG cs.AI

    Meta-RL Induces Exploration in Language Agents

    Authors: Yulun Jiang, Liangze Jiang, Damien Teney, Michael Moor, Maria Brbic

    Abstract: Reinforcement learning (RL) has enabled the training of large language model (LLM) agents to interact with the environment and to solve multi-turn long-horizon tasks. However, the RL-trained agents often struggle in tasks that require active exploration and fail to efficiently adapt from trial-and-error experiences. In this paper, we present LaMer, a general Meta-RL framework that enables LLM agen… ▽ More

    Submitted 18 December, 2025; originally announced December 2025.

  10. arXiv:2512.16676  [pdf, ps, other]

    cs.LG cs.CL

    DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

    Authors: Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han, Meiyi Qiang, Yalin Feng, Tianyi Bai, Zewei Pan, Ziyi Guo, Yizhen Jiang, Jingwen Deng, Qijie You, Peichao Lai, Tianyu Guo, Chi Hsu Tsai, Hengyi Feng, Rui Hu, Wenkai Yu, Junbo Niu, et al. (10 additional authors not shown)

    Abstract: The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc scripts and loosely specified workflows, which lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation.… ▽ More

    Submitted 18 December, 2025; originally announced December 2025.

  11. arXiv:2512.16279  [pdf, ps, other]

    cs.AI cs.CL

    QuadSentinel: Sequent Safety for Machine-Checkable Control in Multi-agent Systems

    Authors: Yiliu Yang, Yilei Jiang, Qunzhong Wang, Yingshui Tan, Xiaoyong Zhu, Sherman S. M. Chow, Bo Zheng, Xiangyu Yue

    Abstract: Safety risks arise as large language model-based agents solve complex tasks with tools, multi-step plans, and inter-agent messages. However, deployer-written policies in natural language are ambiguous and context dependent, so they map poorly to machine-checkable rules, and runtime enforcement is unreliable. Expressing safety policies as sequents, we propose QuadSentinel, a four-agent gua…

    Submitted 18 December, 2025; originally announced December 2025.

    Comments: Preprint

  12. arXiv:2512.16248  [pdf, ps, other]

    cs.CL cs.AI

    Sigma-MoE-Tiny Technical Report

    Authors: Qingguo Hu, Zhenghao Lin, Ziyue Yang, Yucheng Ding, Xiao Liu, Yuting Jiang, Ruizhe Wang, Tianyu Chen, Zhongxin Guo, Yifan Xiong, Rui Gao, Lei Qu, Jinsong Su, Peng Cheng, Yeyun Gong

    Abstract: Mixture-of-Experts (MoE) has emerged as a promising paradigm for foundation models due to its efficient and powerful scalability. In this work, we present Sigma-MoE-Tiny, an MoE language model that achieves the highest sparsity compared to existing open-source models. Sigma-MoE-Tiny employs fine-grained expert segmentation with up to 96 experts per layer, while activating only one expert for each… ▽ More

    Submitted 19 December, 2025; v1 submitted 18 December, 2025; originally announced December 2025.

  13. arXiv:2512.16164  [pdf, ps, other]

    cs.CV cs.AI

    C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation

    Authors: Chao Li, Dasha Hu, Chengyang Li, Yuming Jiang, Yuncheng Shen

    Abstract: Unsupervised Domain Adaptation transfers knowledge from a labeled source domain to an unlabeled target domain. Directly deploying Vision-Language Models (VLMs) with prompt tuning in downstream UDA tasks faces the significant challenge of mitigating domain discrepancies. Existing prompt-tuning strategies primarily align marginal distribution, but neglect conditional distribution discrepancies, le…

    Submitted 17 December, 2025; originally announced December 2025.

  14. arXiv:2512.15818  [pdf, ps, other]

    cs.CR

    Unveiling the Attribute Misbinding Threat in Identity-Preserving Models

    Authors: Junming Fu, Jishen Zeng, Yi Jiang, Peiyu Zhuang, Baoying Chen, Siyu Lu, Jianquan Yang

    Abstract: Identity-preserving models have led to notable progress in generating personalized content. Unfortunately, such models also exacerbate risks when misused, for instance, by generating threatening content targeting specific individuals. This paper introduces the Attribute Misbinding Attack, a novel method that poses a threat to identity-preserving models by inducing them to produce Not-Safe…

    Submitted 17 December, 2025; originally announced December 2025.

  15. arXiv:2512.15699  [pdf, ps, other]

    cs.LG cs.SE

    FrontierCS: Evolving Challenges for Evolving Intelligence

    Authors: Qiuyang Mang, Wenhao Chai, Zhifei Li, Huanzhi Mao, Shang Zhou, Alexander Du, Hanchen Li, Shu Liu, Edwin Chen, Yichuan Wang, Xieting Chu, Zerui Cheng, Yuan Xu, Tian Xia, Zirui Wang, Tianneng Shi, Jianzhu Yao, Yilong Zhao, Qizheng Zhang, Charlie Ruan, Zeyu Shen, Kaiyuan Liu, Runyuan He, Dong Xing, Zerui Li, et al. (26 additional authors not shown)

    Abstract: We introduce FrontierCS, a benchmark of 156 open-ended problems across diverse areas of computer science, designed and reviewed by experts, including CS PhDs and top-tier competitive programming participants and problem setters. Unlike existing benchmarks that focus on tasks with known optimal solutions, FrontierCS targets problems where the optimal solution is unknown, but the quality of a soluti… ▽ More

    Submitted 17 December, 2025; originally announced December 2025.

    Comments: Code with instruction: https://github.com/FrontierCS/Frontier-CS

  16. arXiv:2512.15326  [pdf]

    cs.CV

    A Masked Reverse Knowledge Distillation Method Incorporating Global and Local Information for Image Anomaly Detection

    Authors: Yuxin Jiang, Yunkang Cao, Weiming Shen

    Abstract: Knowledge distillation is an effective image anomaly detection and localization scheme. However, a major drawback of this scheme is its tendency to overly generalize, primarily due to the similarities between input and supervisory signals. In order to address this issue, this paper introduces a novel technique called masked reverse knowledge distillation (MRKD). By employing image-level masking (I… ▽ More

    Submitted 17 December, 2025; originally announced December 2025.

  17. arXiv:2512.15319  [pdf]

    cs.CV

    Prototypical Learning Guided Context-Aware Segmentation Network for Few-Shot Anomaly Detection

    Authors: Yuxin Jiang, Yunkang Cao, Weiming Shen

    Abstract: Few-shot anomaly detection (FSAD) denotes the identification of anomalies within a target category with a limited number of normal samples. Existing FSAD methods largely rely on pre-trained feature representations to detect anomalies, but the inherent domain gap between pre-trained representations and target FSAD scenarios is often overlooked. This study proposes a Prototypical Learning Guided Con… ▽ More

    Submitted 17 December, 2025; originally announced December 2025.

  18. arXiv:2512.14503  [pdf, ps, other]

    cs.IR cs.CL

    RecGPT-V2 Technical Report

    Authors: Chao Yi, Dian Chen, Gaoyang Guo, Jiakai Tang, Jian Wu, Jing Yu, Mao Zhang, Wen Chen, Wenjun Yang, Yujie Luo, Yuning Jiang, Zhujin Gao, Bo Zheng, Binbin Cao, Changfa Wu, Dixuan Wang, Han Wu, Haoyi Hu, Kewei Zhu, Lang Tian, Lin Yang, Qiqi Huang, Siqi Yang, Wenbo Su, Xiaoxiao He, et al. (10 additional authors not shown)

    Abstract: Large language models (LLMs) have demonstrated remarkable potential in transforming recommender systems from implicit behavioral pattern matching to explicit intent reasoning. While RecGPT-V1 successfully pioneered this paradigm by integrating LLM-based reasoning into user interest mining and item tag prediction, it suffers from four fundamental limitations: (1) computational inefficiency and cogn… ▽ More

    Submitted 16 December, 2025; originally announced December 2025.

  19. arXiv:2512.14157  [pdf, ps, other]

    cs.AI cs.CV

    Incentivizing Tool-augmented Thinking with Images for Medical Image Analysis

    Authors: Yankai Jiang, Yujie Zhang, Peng Zhang, Yichen Li, Jintai Chen, Xiaoming Shi, Shihui Zhen

    Abstract: Recent reasoning based medical MLLMs have made progress in generating step by step textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on fine-grained visual regions to achieve precise grounding and diagnosis. We introduce Ophiuchus, a versatile, tool-augmented framework that equips an MLLM to (i) decide when additional visual e… ▽ More

    Submitted 16 December, 2025; originally announced December 2025.

  20. arXiv:2512.14133  [pdf, ps, other]

    cs.GR

    AnimaMimic: Imitating 3D Animation from Video Priors

    Authors: Tianyi Xie, Yunuo Chen, Yaowei Guo, Yin Yang, Bolei Zhou, Demetri Terzopoulos, Ying Jiang, Chenfanfu Jiang

    Abstract: Creating realistic 3D animation remains a time-consuming and expertise-dependent process, requiring manual rigging, keyframing, and fine-tuning of complex motions. Meanwhile, video diffusion models have recently demonstrated remarkable motion imagination in 2D, generating dynamic and visually coherent motion from text or image prompts. However, their results lack explicit 3D structure and cannot b… ▽ More

    Submitted 16 December, 2025; originally announced December 2025.

  21. arXiv:2512.14068  [pdf, ps, other]

    cs.CV cs.AI

    SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding

    Authors: Shuang Cheng, Yuhua Jiang, Zineng Zhou, Dawei Liu, Wang Tao, Linfeng Zhang, Biqing Qi, Bowen Zhou

    Abstract: Block-wise discrete diffusion offers an attractive balance between parallel generation and causal dependency modeling, making it a promising backbone for vision-language modeling. However, its practical adoption has been limited by high training cost, slow convergence, and instability, which have so far kept it behind strong autoregressive (AR) baselines. We present SDAR-VL, the first sys…

    Submitted 15 December, 2025; originally announced December 2025.

  22. arXiv:2512.13564  [pdf, ps, other]

    cs.CL cs.AI

    Memory in the Age of AI Agents

    Authors: Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, et al. (22 additional authors not shown)

    Abstract: Memory has emerged, and will continue to remain, a core capability of foundation model-based agents. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, and evaluation protocols, while the prol… ▽ More

    Submitted 15 December, 2025; originally announced December 2025.

  23. arXiv:2512.13507  [pdf, ps, other]

    cs.CV

    Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

    Authors: Team Seedance, Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, Xuyan Chi, Jian Cong, Jing Cui, Qinpeng Cui, Qide Dong, Junliang Fan, Jing Fang, Zetao Fang, Chengjian Feng, Han Feng, Mingyuan Gao, Yu Gao, Dong Guo, Qiushan Guo, Boyang Hao, et al. (172 additional authors not shown)

    Abstract: Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional au… ▽ More

    Submitted 23 December, 2025; v1 submitted 15 December, 2025; originally announced December 2025.

    Comments: Seedance 1.5 pro Technical Report

  24. arXiv:2512.13488  [pdf, ps, other]

    cs.DC cs.CL

    SIGMA: An AI-Empowered Training Stack on Early-Life Hardware

    Authors: Lei Qu, Lianhai Ren, Peng Cheng, Rui Gao, Ruizhe Wang, Tianyu Chen, Xiao Liu, Xingjian Zhang, Yeyun Gong, Yifan Xiong, Yucheng Ding, Yuting Jiang, Zhenghao Lin, Zhongxin Guo, Ziyue Yang

    Abstract: An increasing variety of AI accelerators is being considered for large-scale training. However, enabling large-scale training on early-life AI accelerators faces three core challenges: frequent system disruptions and undefined failure modes that undermine reliability; numerical errors and training instabilities that threaten correctness and convergence; and the complexity of parallelism optimizati… ▽ More

    Submitted 15 December, 2025; originally announced December 2025.

    Comments: 22 pages, 7 figures

  25. arXiv:2512.12984  [pdf, ps, other]

    cs.CG cs.CV cs.GR cs.LG math.OC

    VoroLight: Learning Quality Volumetric Voronoi Meshes from General Inputs

    Authors: Jiayin Lu, Ying Jiang, Yin Yang, Chenfanfu Jiang

    Abstract: We present VoroLight, a differentiable framework for 3D shape reconstruction based on Voronoi meshing. Our approach generates smooth, watertight surfaces and topologically consistent volumetric meshes directly from diverse inputs, including images, implicit shape level-set fields, point clouds and meshes. VoroLight operates in three stages: it first initializes a surface using a differentiable Vor… ▽ More

    Submitted 15 December, 2025; originally announced December 2025.

  26. arXiv:2512.12756  [pdf, ps, other]

    cs.CV

    FysicsWorld: A Unified Full-Modality Benchmark for Any-to-Any Understanding, Generation, and Reasoning

    Authors: Yue Jiang, Dingkang Yang, Minghao Han, Jinghang Han, Zizhi Chen, Yizhou Liu, Mingcheng Li, Peng Zhai, Lihua Zhang

    Abstract: Despite rapid progress in multimodal large language models (MLLMs) and emerging omni-modal architectures, current benchmarks remain limited in scope and integration, suffering from incomplete modality coverage, restricted interaction to text-centric outputs, and weak interdependence and complementarity among modalities. To bridge these gaps, we introduce FysicsWorld, the first unified full-modalit… ▽ More

    Submitted 14 December, 2025; originally announced December 2025.

    Comments: The omni-modal benchmark report from Fysics AI

  27. arXiv:2512.11395  [pdf, ps, other]

    cs.CV

    FlowDC: Flow-Based Decoupling-Decay for Complex Image Editing

    Authors: Yilei Jiang, Zhen Wang, Yanghao Wang, Jun Yu, Yueting Zhuang, Jun Xiao, Long Chen

    Abstract: With the surge of pre-trained text-to-image flow matching models, text-based image editing performance has gained remarkable improvement, especially for simple editing that only contains a single editing target. To satisfy the exploding editing requirements, the complex editing which contains multiple editing targets has posed as a more challenging task. However, current co…

    Submitted 12 December, 2025; originally announced December 2025.

  28. arXiv:2512.11143  [pdf, ps, other]

    cs.CR

    Automated Penetration Testing with LLM Agents and Classical Planning

    Authors: Lingzhi Wang, Xinyi Shi, Ziyu Li, Yi Jiang, Shiyu Tan, Yuhao Jiang, Junjie Cheng, Wenyuan Chen, Xiangmin Shen, Zhenyuan Li, Yan Chen

    Abstract: While penetration testing plays a vital role in cybersecurity, achieving fully automated, hands-off-the-keyboard execution remains a significant research challenge. In this paper, we introduce the "Planner-Executor-Perceptor (PEP)" design paradigm and use it to systematically review existing work and identify the key challenges in this area. We also evaluate existing penetration testing systems, w… ▽ More

    Submitted 11 December, 2025; originally announced December 2025.

  29. arXiv:2512.10978  [pdf, ps, other]

    q-bio.NC cs.AI

    Cognitive Mirrors: Exploring the Diverse Functional Roles of Attention Heads in LLM Reasoning

    Authors: Xueqi Ma, Jun Wang, Yanbei Jiang, Sarah Monazam Erfani, Tongliang Liu, James Bailey

    Abstract: Large language models (LLMs) have achieved state-of-the-art performance in a variety of tasks, but remain largely opaque in terms of their internal mechanisms. Understanding these mechanisms is crucial to improve their reasoning abilities. Drawing inspiration from the interplay between neural processes and human cognition, we propose a novel interpretability framework to systematically analyze the… ▽ More

    Submitted 3 December, 2025; originally announced December 2025.

    Comments: Accepted to NeurIPS 2025

  30. arXiv:2512.10971  [pdf, ps, other]

    q-fin.CP cs.CE

    AI-Trader: Benchmarking Autonomous Agents in Real-Time Financial Markets

    Authors: Tianyu Fan, Yuhao Yang, Yangqin Jiang, Yifei Zhang, Yuxuan Chen, Chao Huang

    Abstract: Large Language Models (LLMs) have demonstrated remarkable potential as autonomous agents, approaching human-expert performance through advanced reasoning and tool orchestration. However, decision-making in fully dynamic and live environments remains highly challenging, requiring real-time information integration and adaptive responses. While existing efforts have explored live evaluation mechanism… ▽ More

    Submitted 30 November, 2025; originally announced December 2025.

  31. MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation

    Authors: Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, Yu-Gang Jiang

    Abstract: This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language description of objects' motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be ide… ▽ More

    Submitted 11 December, 2025; originally announced December 2025.

    Comments: IEEE TPAMI, Project Page: https://henghuiding.com/MeViS/

    Journal ref: in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 12, pp. 11400-11416, 2025

  32. arXiv:2512.10313  [pdf, ps, other]

    cs.AI cs.CY

    EpiPlanAgent: Agentic Automated Epidemic Response Planning

    Authors: Kangkun Mao, Fang Xu, Jinru Ding, Yidong Jiang, Yujun Yao, Yirong Chen, Junming Liu, Xiaoqin Wu, Qian Wu, Xiaoyan Huang, Jie Xu

    Abstract: Epidemic response planning is essential yet traditionally reliant on labor-intensive manual methods. This study aimed to design and evaluate EpiPlanAgent, an agent-based system using large language models (LLMs) to automate the generation and validation of digital emergency response plans. The multi-agent framework integrated task decomposition, knowledge grounding, and simulation modules. Public… ▽ More

    Submitted 11 December, 2025; v1 submitted 11 December, 2025; originally announced December 2025.

  33. arXiv:2512.10300  [pdf, ps, other]

    cs.AI

    Investigating The Functional Roles of Attention Heads in Vision Language Models: Evidence for Reasoning Modules

    Authors: Yanbei Jiang, Xueqi Ma, Shu Liu, Sarah Monazam Erfani, Tongliang Liu, James Bailey, Jey Han Lau, Krista A. Ehinger

    Abstract: Despite excelling on multimodal benchmarks, vision-language models (VLMs) largely remain a black box. In this paper, we propose a novel interpretability framework to systematically analyze the internal mechanisms of VLMs, focusing on the functional roles of attention heads in multimodal reasoning. To this end, we introduce CogVision, a dataset that decomposes complex multimodal questions into step… ▽ More

    Submitted 11 December, 2025; originally announced December 2025.

  34. arXiv:2512.09506  [pdf, ps, other]

    cs.CE

    CNFinBench: A Benchmark for Safety and Compliance of Large Language Models in Finance

    Authors: Jinru Ding, Chao Ding, Wenrao Pang, Boyi Xiao, Zhiqiang Liu, Pengcheng Chen, Jiayuan Chen, Tiantian Yuan, Junming Guan, Yidong Jiang, Dawei Cheng, Jie Xu

    Abstract: Large language models (LLMs) are increasingly deployed across the financial sector for tasks like investment research and algorithmic trading. Their high-stakes nature demands rigorous evaluation of models' safety and regulatory alignment. However, there is a significant gap between evaluation capabilities and safety requirements. Current financial benchmarks mainly focus on textbook-style questio… ▽ More

    Submitted 19 December, 2025; v1 submitted 10 December, 2025; originally announced December 2025.

  35. arXiv:2512.09469  [pdf, ps, other]

    quant-ph cs.CV

    LiePrune: Lie Group and Quantum Geometric Dual Representation for One-Shot Structured Pruning of Quantum Neural Networks

    Authors: Haijian Shao, Bowen Yang, Wei Liu, Xing Deng, Yingtao Jiang

    Abstract: Quantum neural networks (QNNs) and parameterized quantum circuits (PQCs) are key building blocks for near-term quantum machine learning. However, their scalability is constrained by excessive parameters, barren plateaus, and hardware limitations. We propose LiePrune, the first mathematically grounded one-shot structured pruning framework for QNNs that leverages Lie group structure and quantum geom… ▽ More

    Submitted 10 December, 2025; originally announced December 2025.

    Comments: 7 pages, 2 figures

  36. arXiv:2512.09402  [pdf, ps, other]

    cs.CV

    Wasserstein-Aligned Hyperbolic Multi-View Clustering

    Authors: Rui Wang, Yuting Jiang, Xiaoqing Luo, Xiao-Jun Wu, Nicu Sebe, Ziheng Chen

    Abstract: Multi-view clustering (MVC) aims to uncover the latent structure of multi-view data by learning view-common and view-specific information. Although recent studies have explored hyperbolic representations for better tackling the representation gap between different views, they focus primarily on instance-level alignment and neglect global semantic consistency, rendering them vulnerable to view-spec… ▽ More

    Submitted 10 December, 2025; originally announced December 2025.

    Comments: 14 pages

  37. arXiv:2512.08868  [pdf, ps, other]

    cs.AI

    EcomBench: Towards Holistic Evaluation of Foundation Agents in E-commerce

    Authors: Rui Min, Zile Qiao, Ze Xu, Jiawen Zhai, Wenyu Gao, Xuanzhong Chen, Haozhen Sun, Zhen Zhang, Xinyu Wang, Hong Zhou, Wenbiao Yin, Bo Zhang, Xuan Zhou, Ming Yan, Yong Jiang, Haicheng Liu, Liang Ding, Ling Zou, Yi R. Fung, Yalong Li, Pengjun Xie

    Abstract: Foundation agents have rapidly advanced in their ability to reason and interact with real environments, making the evaluation of their core capabilities increasingly important. While many benchmarks have been developed to assess agent performance, most concentrate on academic settings or artificially designed scenarios while overlooking the challenges that arise in real applications. To address th… ▽ More

    Submitted 11 December, 2025; v1 submitted 9 December, 2025; originally announced December 2025.

  38. arXiv:2512.07884  [pdf, ps, other]

    cs.LG cs.AI cs.CV

    GSPN-2: Efficient Parallel Sequence Modeling

    Authors: Hongjun Wang, Yitong Jiang, Collin McCarthy, David Wehr, Hanrong Ye, Xinhao Li, Ka Chun Cheung, Wonmin Byeon, Jinwei Gu, Ke Chen, Kai Han, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Sifei Liu

    Abstract: Efficient vision transformer remains a bottleneck for high-resolution images and long-video related real-world applications. Generalized Spatial Propagation Network (GSPN) addresses this by replacing quadratic self-attention with a line-scan propagation scheme, bringing the cost close to linear in the number of rows or columns, while retaining accuracy. Despite this advancement, the existing GSPN… ▽ More

    Submitted 28 November, 2025; originally announced December 2025.

    Comments: NeurIPS 2025

  39. arXiv:2512.07792  [pdf]

    cs.DC

    Designing Co-operation in Systems of Hierarchical, Multi-objective Schedulers for Stream Processing

    Authors: Animesh Dangwal, Yufeng Jiang, Charlie Arnold, Jun Fan, Mohamed Bassem, Aish Rajagopal

    Abstract: Stream processing is a computing paradigm that supports real-time data processing for a wide variety of applications. At Meta, it's used across the company for various tasks such as deriving product insights, providing and improving user services, and enabling AI at scale for our ever-growing user base. Meta's current stream processing framework supports processing terabytes (TBs) of data in mere…

    Submitted 8 December, 2025; originally announced December 2025.

  40. arXiv:2512.06865  [pdf, ps, other]

    cs.CV

    Spatial Retrieval Augmented Autonomous Driving

    Authors: Xiaosong Jia, Chenhe Zhang, Yule Jiang, Songbur Wong, Zhiyuan Zhang, Chen Chen, Shaofeng Zhang, Xuanhe Zhou, Xue Yang, Junchi Yan, Yu-Gang Jiang

    Abstract: Existing autonomous driving systems rely on onboard sensors (cameras, LiDAR, IMU, etc) for environmental perception. However, this paradigm is limited by the drive-time perception horizon and often fails under limited view scope, occlusion or extreme conditions such as darkness and rain. In contrast, human drivers are able to recall road structure even under poor visibility. To endow models with t… ▽ More

    Submitted 7 December, 2025; originally announced December 2025.

    Comments: Demo Page: https://spatialretrievalad.github.io/ with open sourced code, dataset, and checkpoints

  41. arXiv:2512.06571  [pdf, ps, other]

    cs.RO

    Learning Agile Striker Skills for Humanoid Soccer Robots from Noisy Sensory Input

    Authors: Zifan Xu, Myoungkyu Seo, Dongmyeong Lee, Hao Fu, Jiaheng Hu, Jiaxun Cui, Yuqian Jiang, Zhihan Wang, Anastasiia Brund, Joydeep Biswas, Peter Stone

    Abstract: Learning fast and robust ball-kicking skills is a critical capability for humanoid soccer robots, yet it remains a challenging problem due to the need for rapid leg swings, postural stability on a single support foot, and robustness under noisy sensory input and external perturbations (e.g., opponents). This paper presents a reinforcement learning (RL)-based system that enables humanoid robots to…

    Submitted 10 December, 2025; v1 submitted 6 December, 2025; originally announced December 2025.

  42. arXiv:2512.06381  [pdf, ps, other]

    cs.IR

    Beyond Existing Retrievals: Cross-Scenario Incremental Sample Learning Framework

    Authors: Tao Wang, Xun Luo, Jinlong Guo, Yuliang Yan, Jian Wu, Yuning Jiang, Bo Zheng

    Abstract: The parallelized multi-retrieval architecture has been widely adopted in large-scale recommender systems for its computational efficiency and comprehensive coverage of user interests. Many retrieval methods typically integrate additional cross-scenario samples to enhance the overall performance ceiling. However, those model designs neglect the fact that a part of the cross-scenario samples have al…

    Submitted 6 December, 2025; originally announced December 2025.

  43. arXiv:2512.05693  [pdf, ps, other]

    cs.RO cs.AI

    HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies

    Authors: Zhiying Du, Bei Liu, Yaobo Liang, Yichao Shen, Haidong Cao, Xiangyu Zheng, Zhiyuan Feng, Zuxuan Wu, Jiaolong Yang, Yu-Gang Jiang

    Abstract: The development of foundation models for embodied intelligence critically depends on access to large-scale, high-quality robot demonstration data. Recent approaches have sought to address this challenge by training on large collections of heterogeneous robotic datasets. However, unlike vision or language data, robotic demonstrations exhibit substantial heterogeneity across embodiments and action s…

    Submitted 5 December, 2025; originally announced December 2025.

  44. arXiv:2512.05300  [pdf, ps, other]

    cs.DS

    Approximating Directed Minimum Cut and Arborescence Packing via Directed Expander Hierarchies

    Authors: Yonggang Jiang, Yaowei Long, Thatchaphol Saranurak, Benyu Wang

    Abstract: We give almost-linear-time algorithms for approximating rooted minimum cut and maximum arborescence packing in directed graphs, two problems that are dual to each other [Edm73]. More specifically, for an $n$-vertex, $m$-edge directed graph $G$ whose $s$-rooted minimum cut value is $k$, our first algorithm computes an $s$-rooted cut of size at most $O(k\log^{5} n)$ in $m^{1+o(1)}$ time, and our sec…

    Submitted 17 December, 2025; v1 submitted 4 December, 2025; originally announced December 2025.

  45. arXiv:2512.04354  [pdf]

    cs.LG cs.HC

    SmartAlert: Implementing Machine Learning-Driven Clinical Decision Support for Inpatient Lab Utilization Reduction

    Authors: April S. Liang, Fatemeh Amrollahi, Yixing Jiang, Conor K. Corbin, Grace Y. E. Kim, David Mui, Trevor Crowell, Aakash Acharya, Sreedevi Mony, Soumya Punnathanam, Jack McKeown, Margaret Smith, Steven Lin, Arnold Milstein, Kevin Schulman, Jason Hom, Michael A. Pfeffer, Tho D. Pham, David Svec, Weihan Chu, Lisa Shieh, Christopher Sharp, Stephen P. Ma, Jonathan H. Chen

    Abstract: Repetitive laboratory testing unlikely to yield clinically useful information is a common practice that burdens patients and increases healthcare costs. Education and feedback interventions have limited success, while general test ordering restrictions and electronic alerts impede appropriate clinical care. We introduce and evaluate SmartAlert, a machine learning (ML)-driven clinical decision supp…

    Submitted 3 December, 2025; originally announced December 2025.

    Comments: 22 pages, 5 figures

  46. arXiv:2512.03837  [pdf, ps, other]

    cs.CV

    Heatmap Pooling Network for Action Recognition from RGB Videos

    Authors: Mengyuan Liu, Jinfu Liu, Yongkang Jiang, Bin He

    Abstract: Human action recognition (HAR) in videos has garnered widespread attention due to the rich information in RGB videos. Nevertheless, existing methods for extracting deep features from RGB videos face challenges such as information redundancy, susceptibility to noise and high storage costs. To address these issues and fully harness the useful information in videos, we propose a novel heatmap pooling…

    Submitted 3 December, 2025; originally announced December 2025.

    Comments: Final Version of IEEE Transactions on Pattern Analysis and Machine Intelligence

    Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

  47. arXiv:2512.03043  [pdf, ps, other]

    cs.CV

    OneThinker: All-in-one Reasoning Model for Image and Video

    Authors: Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, Yan Feng, Peng Pei, Xunliang Cai, Xiangyu Yue

    Abstract: Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This results in limited scalability toward a multimodal reasoning generalist, which restricts practical versatilit…

    Submitted 3 December, 2025; v1 submitted 2 December, 2025; originally announced December 2025.

    Comments: Project page: https://github.com/tulerfeng/OneThinker

  48. arXiv:2512.01948  [pdf, ps, other]

    cs.CL

    How Far Are We from Genuinely Useful Deep Research Agents?

    Authors: Dingling Zhang, He Zhu, Jincheng Ren, Kangqi Song, Xinran Zhou, Boyu Feng, Shudong Liu, Jiabin Luo, Weihao Xie, Zhaohui Wang, Tianrui Qin, King Zhu, Yuqing Wang, Qianben Chen, Yuchen Eleanor Jiang, Wei Wang, Jiaheng Liu, Wangchunshu Zhou

    Abstract: Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. However, most existing DRAs were validated on question-answering benchmarks, while research on generating comprehensive reports remains overlooked. Worse, current benchmarks for report synthesis suffer from task complexity and subjective metrics -- this fails to ref…

    Submitted 15 December, 2025; v1 submitted 1 December, 2025; originally announced December 2025.

    Comments: 34 pages

  49. arXiv:2512.01629  [pdf, ps, other]

    cs.CV cs.RO

    SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge

    Authors: Yumeng He, Ying Jiang, Jiayin Lu, Yin Yang, Chenfanfu Jiang

    Abstract: Articulated 3D objects are critical for embodied AI, robotics, and interactive scene understanding, yet creating simulation-ready assets remains labor-intensive and requires expert modeling of part hierarchies and motion structures. We introduce SPARK, a framework for reconstructing physically consistent, kinematic part-level articulated objects from a single RGB image. Given an input image, we fi…

    Submitted 2 December, 2025; v1 submitted 1 December, 2025; originally announced December 2025.

    Comments: Project page: https://heyumeng.com/SPARK/index.html. 17 pages, 7 figures

  50. arXiv:2512.01422  [pdf, ps, other]

    cs.CV

    MDiff4STR: Mask Diffusion Model for Scene Text Recognition

    Authors: Yongkun Du, Miaomiao Zhao, Songlin Fan, Zhineng Chen, Caiyan Jia, Yu-Gang Jiang

    Abstract: Mask Diffusion Models (MDMs) have recently emerged as a promising alternative to auto-regressive models (ARMs) for vision-language tasks, owing to their flexible balance of efficiency and accuracy. In this paper, for the first time, we introduce MDMs into the Scene Text Recognition (STR) task. We show that vanilla MDM lags behind ARMs in terms of accuracy, although it improves recognition efficien…

    Submitted 1 December, 2025; originally announced December 2025.

    Comments: Accepted by AAAI 2026 (Oral)