
Showing 1–50 of 1,064 results for author: Gu, Y

Searching in archive cs.
  1. arXiv:2604.13323  [pdf, ps, other]

    cs.RO

    Vectorizing Projection in Manifold-Constrained Motion Planning for Real-Time Whole-Body Control

    Authors: Shrutheesh R Iyer, I-Chia Chang, Andrew Z. Liu, Yan Gu, Zachary Kingston

    Abstract: Many robot planning tasks require satisfaction of one or more constraints throughout the entire trajectory. For geometric constraints, manifold-constrained motion planning algorithms are capable of planning collision-free paths between start and goal configurations on the constraint submanifolds specified by the task. Current state-of-the-art methods can take tens of seconds to solve these tasks for co…

    Submitted 14 April, 2026; originally announced April 2026.

    Comments: 8 pages, 8 figures, 3 tables. Under review

  2. arXiv:2604.10545  [pdf, ps, other]

    cs.HC

    Enhanced Self-Learning with Epistemologically-Informed LLM Dialogue

    Authors: Yi-Fan Cao, Kento Shigyo, Yitong Gu, Xiyuan Wang, Weijia Liu, Yang Wang, David Gotz, Zhilan Zhou, Huamin Qu

    Abstract: Large Language Models (LLMs) have advanced self-learning tools, enabling more personalized interactions. However, learners struggle to engage in meaningful dialogue and process complex information. To alleviate this, we incorporate epistemological frameworks within an LLM-based approach to self-learning, reducing the cognitive load on learners and fostering deeper engagement and holistic understan…

    Submitted 12 April, 2026; originally announced April 2026.

    Comments: Submitted to IJHCI

  3. arXiv:2604.07419  [pdf, ps, other]

    cs.IR

    ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment

    Authors: Hao Yang, Yifan Ji, Zhipeng Xu, Zhenghao Liu, Yukun Yan, Zulong Chen, Shuo Wang, Yu Gu, Ge Yu

    Abstract: Visual document retrieval aims to retrieve a set of document pages relevant to a query from visually rich collections. Existing methods often employ Vision-Language Models (VLMs) to encode queries and visual pages into a shared embedding space, which is then optimized via contrastive training. However, during visual document representation, localized evidence is usually scattered across complex do…

    Submitted 8 April, 2026; originally announced April 2026.

  4. arXiv:2604.04135  [pdf, ps, other]

    cs.CV

    NTIRE 2026 3D Restoration and Reconstruction in Real-world Adverse Conditions: RealX3D Challenge Results

    Authors: Shuhong Liu, Chenyu Bao, Ziteng Cui, Xuangeng Chu, Bin Ren, Lin Gu, Xiang Chen, Mingrui Li, Long Ma, Marcos V. Conde, Radu Timofte, Yun Liu, Ryo Umagami, Tomohiro Hashimoto, Zijian Hu, Yuan Gan, Tianhan Xu, Yusuke Kurose, Tatsuya Harada, Junwei Yuan, Gengjia Chang, Xining Ge, Mache You, Qida Cao, Zeliang Li , et al. (81 additional authors not shown)

    Abstract: This paper presents a comprehensive review of the NTIRE 2026 3D Restoration and Reconstruction (3DRR) Challenge, detailing the proposed methods and results. The challenge seeks to identify reconstruction pipelines that are robust under real-world adverse conditions, specifically extreme low-light and smoke-degraded environments, as captured by our RealX3D benchmark. A total of 279 participa…

    Submitted 5 April, 2026; originally announced April 2026.

  5. arXiv:2604.04125  [pdf, ps, other]

    cs.IT math.OC

    Mechanism and Communication Co-Design for Differentially Private Energy Sharing

    Authors: Yingshuo Gu, Xi Weng, Yue Chen

    Abstract: Integrating distributed energy resources (DERs) is a critical step toward addressing the global climate crisis. This transformation has driven the transition from traditional consumers to prosumers and given rise to new energy sharing business models. Existing works have extensively studied prosumer energy sharing mechanisms, yet little attention has been paid to privacy protection, particularly w…

    Submitted 5 April, 2026; originally announced April 2026.

    Comments: 11 pages, 7 figures

  6. arXiv:2604.03298  [pdf, ps, other]

    cs.AR cs.DC cs.LG

    ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs

    Authors: Jinwu Yang, Jiaan Wu, Zedong Liu, Xinyang Ma, Hairui Zhao, Yida Gu, Yuanhong Huang, Xingchen Liu, Wenjing Huang, Zheng Wei, Jing Xing, Yili Ma, Qingyi Zhang, Baoyi An, Zhongzhe Hu, Shaoteng Liu, Xia Zhu, Jiaxun Lu, Guangming Tan, Dingwen Tao

    Abstract: The rapid scaling of Large Language Models presents significant challenges for their deployment and inference, particularly on resource-constrained specialized AI hardware accelerators such as Huawei's Ascend NPUs, where weight data transfer has become a critical performance bottleneck. While lossless compression can preserve model accuracy and reduce data volume, existing lossless compression alg…

    Submitted 7 April, 2026; v1 submitted 28 March, 2026; originally announced April 2026.

    Comments: Accepted by ISCA 2026, 17 pages, 13 figures, 7 tables

  7. arXiv:2604.02880  [pdf, ps, other]

    cs.CV

    InstructTable: Improving Table Structure Recognition Through Instructions

    Authors: Boming Chen, Zining Wang, Zhentao Guo, Jianqiang Liu, Chen Duan, Yu Gu, Kai zhou, Pengfei Yan

    Abstract: Table structure recognition (TSR) holds widespread practical importance by parsing tabular images into structured representations, yet encounters significant challenges when processing complex layouts involving merged or empty cells. Traditional visual-centric models rely exclusively on visual information while lacking crucial semantic support, thereby impeding accurate structural recognition in c…

    Submitted 3 April, 2026; originally announced April 2026.

    Comments: 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition- FINDINGS Track (CVPRF)

  8. arXiv:2603.27694  [pdf, ps, other]

    cs.CL

    Can Large Language Models Simulate Human Cognition Beyond Behavioral Imitation?

    Authors: Yuxuan Gu, Lunjun Liu, Xiaocheng Feng, Kun Zhu, Weihong Zhong, Lei Huang, Bing Qin

    Abstract: An essential problem in artificial intelligence is whether LLMs can simulate human cognition or merely imitate surface-level behaviors, while existing datasets suffer from either synthetic reasoning traces or population-level aggregation, failing to capture authentic individual cognitive patterns. We introduce a benchmark grounded in the longitudinal research trajectories of 217 researchers across…

    Submitted 29 March, 2026; originally announced March 2026.

  9. arXiv:2603.27646  [pdf, ps, other]

    cs.CL hep-lat hep-ph physics.comp-ph physics.optics

    PRBench: End-to-end Paper Reproduction in Physics Research

    Authors: Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu, Mao Li, Zeyu Li, Zhaolong Zhang, Huiwen Zheng, Leidong Bao, Anqi Lv, Zihan Mo, Yadi Niu, Yiyang Peng, Yu Tian, Yili Wang, Ziyu Wang, Zi-Yu Wang, Jiashen Wei, Liuheng Wu, Aoran Xue, Leyi Yang, Guanglu Yuan, Xiarui Zhan, Jingjun Zhang , et al. (26 additional authors not shown)

    Abstract: AI agents powered by large language models exhibit strong reasoning and problem-solving capabilities, enabling them to assist scientific research tasks such as formula derivation and code generation. However, whether these agents can reliably perform end-to-end reproduction from real scientific papers remains an open question. We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 1…

    Submitted 29 March, 2026; originally announced March 2026.

    Comments: 17 pages, 3 figures

    Report number: RISE-AGI-2026-002

  10. arXiv:2603.27460  [pdf, ps, other]

    cs.CV cs.AI

    Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development

    Authors: Zhongying Deng, Cheng Tang, Ziyan Huang, Jiashi Lin, Ying Chen, Junzhi Ning, Chenglong Ma, Jiyao Liu, Wei Li, Yinghao Zhu, Shujian Gao, Yanyan Huang, Sibo Ju, Yanzhou Su, Pengcheng Chen, Wenhao Tang, Tianbin Li, Haoyu Wang, Yuanfeng Ji, Hui Sun, Shaobo Min, Liang Peng, Feilong Tang, Haochen Xue, Rulin Zhou , et al. (102 additional authors not shown)

    Abstract: Foundation models have demonstrated remarkable success across diverse domains and tasks, primarily due to the rise of large-scale, diverse, and high-quality datasets. However, in the field of medical imaging, the curation and assembling of such medical datasets are highly challenging due to the reliance on clinical expertise and strict ethical and privacy constraints, resulting in a scarcity of…

    Submitted 28 March, 2026; originally announced March 2026.

    Comments: 157 pages, 19 figures, 26 tables. Project repo: https://github.com/uni-medical/Project-Imaging-X

  11. arXiv:2603.26720  [pdf, ps, other]

    cs.RO cs.AI

    SutureAgent: Learning Surgical Trajectories via Goal-conditioned Offline RL in Pixel Space

    Authors: Huanrong Liu, Chunlin Tian, Tongyu Jia, Tailai Zhou, Qin Liu, Yu Gao, Yutong Ban, Yun Gu, Guy Rosman, Xin Ma, Qingbiao Li

    Abstract: Predicting surgical needle trajectories from endoscopic video is critical for robot-assisted suturing, enabling anticipatory planning, real-time guidance, and safer motion execution. Existing methods that directly learn motion distributions from visual observations tend to overlook the sequential dependency among adjacent motion steps. Moreover, sparse waypoint annotations often fail to provide su…

    Submitted 18 March, 2026; originally announced March 2026.

  12. arXiv:2603.25887  [pdf, ps, other]

    cs.CV

    World Reasoning Arena

    Authors: PAN Team, Qiyue Gao, Kun Zhou, Jiannan Xiang, Zihan Liu, Dequan Yang, Junrong Chen, Arif Ahmad, Cong Zeng, Ganesh Bannur, Xinqi Huang, Zheqi Liu, Yi Gu, Yichi Yang, Guangyi Liu, Zhiting Hu, Zhengzhong Liu, Eric Xing

    Abstract: World models (WMs) are intended to serve as internal simulators of the real world that enable agents to understand, anticipate, and act upon complex environments. Existing WM benchmarks remain narrowly focused on next-state prediction and visual fidelity, overlooking the richer simulation capabilities required for intelligent behavior. To address this gap, we introduce WR-Arena, a comprehensive be…

    Submitted 26 March, 2026; originally announced March 2026.

  13. arXiv:2603.25040  [pdf, ps, other]

    cs.LG cs.CL cs.CV

    Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale

    Authors: Yicheng Zou, Dongsheng Zhu, Lin Zhu, Tong Zhu, Yunhua Zhou, Peiheng Zhou, Xinyu Zhou, Dongzhan Zhou, Zhiwang Zhou, Yuhao Zhou, Bowen Zhou, Zhanping Zhong, Zhijie Zhong, Haiteng Zhao, Penghao Zhao, Xiaomeng Zhao, Zhiyuan Zhao, Yechen Zhang, Jin Zhang, Wenwei Zhang, Hongjie Zhang, Zhuo Zhang, Wenlong Zhang, Bo Zhang, Chao Zhang , et al. (152 additional authors not shown)

    Abstract: We introduce Intern-S1-Pro, the first one-trillion-parameter scientific multimodal foundation model. Scaling to this unprecedented size, the model delivers a comprehensive enhancement across both general and scientific domains. Beyond stronger reasoning and image-text understanding capabilities, its intelligence is augmented with advanced agent capabilities. Simultaneously, its scientific expertis…

    Submitted 2 April, 2026; v1 submitted 26 March, 2026; originally announced March 2026.

  14. arXiv:2603.23909  [pdf, ps, other]

    cs.AI

    DUPLEX: Agentic Dual-System Planning via LLM-Driven Information Extraction

    Authors: Keru Hua, Ding Wang, Yaoying Gu, Xiaoguang Ma

    Abstract: While Large Language Models (LLMs) provide semantic flexibility for robotic task planning, their susceptibility to hallucination and logical inconsistency limits their reliability in long-horizon domains. To bridge the gap between unstructured environments and rigorous plan synthesis, we propose DUPLEX, an agentic dual-system neuro-symbolic architecture that strictly confines the LLM to schema-gui…

    Submitted 24 March, 2026; originally announced March 2026.

  15. arXiv:2603.20300  [pdf, ps, other]

    cs.SE cs.AI

    From Human Interfaces to Agent Interfaces: Rethinking Software Design in the Age of AI-Native Systems

    Authors: Shaolin Wang, Yi Mei, Haoyang Che, He Jiang, Shui Yu, Ying Gu

    Abstract: Software systems have traditionally been designed for human interaction, emphasizing graphical user interfaces, usability, and cognitive alignment with end users. However, recent advances in large language model (LLM)-based agents are changing the primary consumers of software systems. Increasingly, software is no longer only used by humans, but also invoked autonomously by AI agents through struc…

    Submitted 19 March, 2026; originally announced March 2026.

    Comments: 4 pages, 1 figure, 1 table

  16. arXiv:2603.19709  [pdf, ps, other]

    cs.RO

    Morphology-Consistent Humanoid Interaction through Robot-Centric Video Synthesis

    Authors: Weisheng Xu, Jian Li, Yi Gu, Bin Yang, Haodong Chen, Shuyi Lin, Mingqian Zhou, Jing Tan, Qiwei Wu, Xiangrui Jiang, Taowen Wang, Jiawen Wen, Qiwei Liang, Jiaxi Zhang, Renjing Xu

    Abstract: Equipping humanoid robots with versatile interaction skills typically requires either extensive policy training or explicit human-to-robot motion retargeting. However, learning-based policies face prohibitive data collection costs. Meanwhile, retargeting relies on human-centric pose estimation (e.g., SMPL), introducing a morphology gap. Skeletal scale mismatches result in severe spatial misalignme…

    Submitted 24 March, 2026; v1 submitted 20 March, 2026; originally announced March 2026.

  17. arXiv:2603.19602  [pdf, ps, other]

    cs.RO

    CeRLP: A Cross-embodiment Robot Local Planning Framework for Visual Navigation

    Authors: Haoyu Xi, Mingao Tan, Xinming Zhang, Siwei Cheng, Shanze Wang, Yin Gu, Xiaoyu Shen, Wei Zhang

    Abstract: Visual navigation for cross-embodiment robots is challenging due to variations in robot and camera configurations, which can lead to the failure of navigation tasks. Previous approaches typically rely on collecting massive datasets across different robots, which is highly data-intensive, or fine-tuning models, which is time-consuming. Furthermore, both methods often lack explicit consideration of…

    Submitted 19 March, 2026; originally announced March 2026.

  18. arXiv:2603.19274  [pdf, ps, other]

    cs.CL cs.AI

    CURE: A Multimodal Benchmark for Clinical Understanding and Retrieval Evaluation

    Authors: Yannian Gu, Zhongzhen Huang, Linjie Mu, Xizhuo Zhang, Shaoting Zhang, Xiaofan Zhang

    Abstract: Multimodal large language models (MLLMs) demonstrate considerable potential in clinical diagnostics, a domain that inherently requires synthesizing complex visual and textual data alongside consulting authoritative medical literature. However, existing benchmarks primarily evaluate MLLMs in end-to-end answering scenarios. This limits the ability to disentangle a model's foundational multimodal rea…

    Submitted 27 February, 2026; originally announced March 2026.

  19. arXiv:2603.16822  [pdf, ps, other]

    cs.AI

    Surg$Σ$: A Spectrum of Large-Scale Multimodal Data and Foundation Models for Surgical Intelligence

    Authors: Zhitao Zeng, Mengya Xu, Jian Jiang, Pengfei Guo, Yunqiu Xu, Zhu Zhuo, Chang Han Low, Yufan He, Dong Yang, Chenxi Lin, Yiming Gu, Jiaxin Guo, Yutong Ban, Daguang Xu, Qi Dou, Yueming Jin

    Abstract: Surgical intelligence has the potential to improve the safety and consistency of surgical care, yet most existing surgical AI frameworks remain task-specific and struggle to generalize across procedures and institutions. Although multimodal foundation models, particularly multimodal large language models, have demonstrated strong cross-task capabilities across various medical domains, their advanc…

    Submitted 17 March, 2026; originally announced March 2026.

    MSC Class: 68T45; ACM Class: I.2.10

  20. arXiv:2603.16777  [pdf, ps, other]

    cs.AI

    Anticipatory Planning for Multimodal AI Agents

    Authors: Yongyuan Liang, Shijie Zhou, Yu Gu, Hao Tan, Gang Wu, Franck Dernoncourt, Jihyung Kil, Ryan A. Rossi, Ruiyi Zhang

    Abstract: Recent advances in multimodal agents have improved computer-use interaction and tool-usage, yet most existing systems remain reactive, optimizing actions in isolation without reasoning about future states or long-term goals. This limits planning coherence and prevents agents from reliably solving high-level, multi-step tasks. We introduce TraceR1, a two-stage reinforcement learning framework that…

    Submitted 17 March, 2026; originally announced March 2026.

    Comments: Published at CVPR 2026 Findings Track

  21. arXiv:2603.16436  [pdf, ps, other]

    cs.LG

    DISCOVER: A Solver for Distributional Counterfactual Explanations

    Authors: Yikai Gu, Lele Cao, Bo Zhao, Lei Lei, Lei You

    Abstract: Counterfactual explanations (CE) explain model decisions by identifying input modifications that lead to different predictions. Most existing methods operate at the instance level. Distributional Counterfactual Explanations (DCE) extend this setting by optimizing an optimal transport objective that balances proximity to a factual input distribution and alignment to a target output distribution, wi…

    Submitted 17 March, 2026; originally announced March 2026.

    Comments: 20 pages, 8 figures, 4 tables

  22. arXiv:2603.15650  [pdf, ps, other]

    cs.LG cs.CV

    How to Achieve Prototypical Birth and Death for OOD Detection?

    Authors: Ningkang Peng, Qianfeng Yu, Xiaoqian Peng, Linjing Qian, Yafei Liu, Canran Xiao, Xinyu Lu, Tingyu Lu, Zhichao Zheng, Yanhui Gu

    Abstract: Out-of-Distribution (OOD) detection is crucial for the secure deployment of machine learning models, and prototype-based learning methods are among the mainstream strategies for achieving OOD detection. Existing prototype-based learning methods generally rely on a fixed number of prototypes. This static assumption fails to adapt to the inherent complexity differences across various categories. Cur…

    Submitted 6 March, 2026; originally announced March 2026.

  23. arXiv:2603.15469  [pdf, ps, other]

    cs.RO cs.AI

    RoCo Challenge at AAAI 2026: Benchmarking Robotic Collaborative Manipulation for Assembly Towards Industrial Automation

    Authors: Haichao Liu, Yuheng Zhou, Zhenyu Wu, Ziheng Ji, Ziyu Shan, Qianzhun Wang, Ruixuan Liu, Zhiyuan Yang, Yejun Gu, Shalman Khan, Shijun Yan, Jun Liu, Haiyue Zhu, Changliu Liu, Jianfei Yang, Jingbing Zhang, Ziwei Wang

    Abstract: Embodied Artificial Intelligence (EAI) is rapidly developing, gradually subverting previous autonomous systems' paradigms from isolated perception to integrated, continuous action. This transition is highly significant for industrial robotic manipulation, promising to free human workers from repetitive, dangerous daily labor. To benchmark and advance this capability, we introduce the Robotic Colla…

    Submitted 16 March, 2026; originally announced March 2026.

    Comments: 16 pages, 8 figures

  24. arXiv:2603.15237  [pdf, ps, other]

    cs.CV

    Multi-turn Physics-informed Vision-language Model for Physics-grounded Anomaly Detection

    Authors: Yao Gu, Xiaohao Xu, Yingna Wu

    Abstract: Vision-Language Models (VLMs) demonstrate strong general-purpose reasoning but remain limited in physics-grounded anomaly detection, where causal understanding of dynamics is essential. Existing VLMs, trained predominantly on appearance-centric correlations, fail to capture kinematic constraints, leading to poor performance on anomalies such as irregular rotations or violated mechanical motions. W…

    Submitted 16 March, 2026; originally announced March 2026.

    Comments: Accepted by IEEE ICASSP2026

  25. arXiv:2603.14909  [pdf, ps, other]

    cs.CV

    TopoVST: Toward Topology-fidelitous Vessel Skeleton Tracking

    Authors: Yaoyu Liu, Minghui Zhang, Junjun He, Yun Gu

    Abstract: Automatic extraction of vessel skeletons is crucial for many clinical applications. However, achieving topologically faithful delineation of thin vessel skeletons remains highly challenging, primarily due to frequent discontinuities and the presence of spurious skeleton segments. To address these difficulties, we propose TopoVST, a topology-fidelitous vessel skeleton tracker. TopoVST constructs m…

    Submitted 16 March, 2026; originally announced March 2026.

    Comments: 10 pages, 9 figures. Under Review

  26. arXiv:2603.12516  [pdf, ps, other]

    cs.LG physics.flu-dyn

    Learning Pore-scale Multiphase Flow from 4D Velocimetry

    Authors: Chunyang Wang, Linqi Zhu, Yuxuan Gu, Robert van der Merwe, Xin Ju, Catherine Spurin, Samuel Krevor, Rex Ying, Tobias Pfaff, Martin J. Blunt, Tom Bultreys, Gege Wen

    Abstract: Multiphase flow in porous media underpins subsurface energy and environmental technologies, including geological CO$_2$ storage and underground hydrogen storage, yet pore-scale dynamics in realistic three-dimensional materials remain difficult to characterize and predict. Here we introduce a multimodal learning framework that infers multiphase pore-scale flow directly from time-resolved four-dimen…

    Submitted 12 March, 2026; originally announced March 2026.

  27. arXiv:2603.12430  [pdf, ps, other]

    cs.CV

    Surg-R1: A Hierarchical Reasoning Foundation Model for Scalable and Interpretable Surgical Decision Support with Multi-Center Clinical Validation

    Authors: Jian Jiang, Chenxi Lin, Yiming Gu, Zengyi Qin, Zhitao Zeng, Kun Yuan, Yonghao Long, Xiang Xia, Cheng Yuan, Yuqi Wang, Zijie Yue, Kunyi Yang, Yuting Zhang, Zhu Zhuo, Dian Qin, Xin Wang, NG Chi Fai, Brian Anthony, Daguang Xu, Guy Rosman, Ozanan Meireles, Zizhen Zhang, Nicolas Padoy, Hesheng Wang, Qi Dou , et al. (2 additional authors not shown)

    Abstract: Surgical scene understanding demands not only accurate predictions but also interpretable reasoning that surgeons can verify against clinical expertise. However, existing surgical vision-language models generate predictions without reasoning chains, and general-purpose reasoning models fail on compositional surgical tasks without domain-specific knowledge. We present Surg-R1, a surgical Vision-Lan…

    Submitted 12 March, 2026; originally announced March 2026.

  28. arXiv:2603.11689  [pdf, ps, other]

    cs.AI

    Explicit Logic Channel for Validation and Enhancement of MLLMs on Zero-Shot Tasks

    Authors: Mei Chee Leong, Ying Gu, Hui Li Tan, Liyuan Li, Nancy Chen

    Abstract: Frontier Multimodal Large Language Models (MLLMs) exhibit remarkable capabilities in Visual-Language Comprehension (VLC) tasks. However, they are often deployed as zero-shot solutions to new tasks in a black-box manner. Validating and understanding the behavior of these models becomes important for application to new tasks. We propose an Explicit Logic Channel, in parallel with the black-box model ch…

    Submitted 12 March, 2026; originally announced March 2026.

  29. arXiv:2603.09217  [pdf, ps, other]

    cs.CV

    TubeMLLM: A Foundation Model for Topology Knowledge Exploration in Vessel-like Anatomy

    Authors: Yaoyu Liu, Minghui Zhang, Xin You, Hanxiao Zhang, Yun Gu

    Abstract: Modeling medical vessel-like anatomy is challenging due to its intricate topology and sensitivity to dataset shifts. Consequently, task-specific models often suffer from topological inconsistencies, including artificial disconnections and spurious merges. Motivated by the promise of multimodal large language models (MLLMs) for zero-shot generalization, we propose TubeMLLM, a unified foundation mod…

    Submitted 13 March, 2026; v1 submitted 10 March, 2026; originally announced March 2026.

    Comments: 18 pages, 12 figures

  30. arXiv:2603.08324  [pdf, ps, other]

    cs.RO cs.AI

    EndoSERV: A Vision-based Endoluminal Robot Navigation System

    Authors: Junyang Wu, Fangfang Xie, Minghui Zhang, Hanxiao Zhang, Jiayuan Sun, Yun Gu, Guang-Zhong Yang

    Abstract: Robot-assisted endoluminal procedures are increasingly used for early cancer intervention. However, the intricate, narrow and tortuous pathways within the luminal anatomy pose substantial difficulties for robot navigation. Vision-based navigation offers a promising solution, but existing localization approaches are error-prone due to tissue deformation, in vivo artifacts and a lack of distinctive…

    Submitted 9 March, 2026; originally announced March 2026.

  31. arXiv:2603.08108  [pdf, ps, other]

    cs.CE cs.LG

    Tau-BNO: Brain Neural Operator for Tau Transport Model

    Authors: Nuutti Barron, Heng Rao, Urmi Saha, Yu Gu, Zhenghao Liu, Ge Yu, Defu Yang, Ashish Raj, Minghan Chen

    Abstract: Mechanistic modeling provides a biophysically grounded framework for studying the spread of pathological tau protein in tauopathies like Alzheimer's disease. Existing approaches typically model tau propagation as a diffusive process on the brain's structural connectome, reproducing macroscopic patterns but neglecting microscale cellular transport and reaction mechanisms. The Network Transport Mode…

    Submitted 16 March, 2026; v1 submitted 9 March, 2026; originally announced March 2026.

  32. arXiv:2603.07909  [pdf, ps, other]

    cs.RO cs.AI

    Long-Short Term Agents for Pure-Vision Bronchoscopy Robotic Autonomy

    Authors: Junyang Wu, Mingyi Luo, Fangfang Xie, Minghui Zhang, Hanxiao Zhang, Chunxi Zhang, Junhao Wang, Jiayuan Sun, Yun Gu, Guang-Zhong Yang

    Abstract: Accurate intraoperative navigation is essential for robot-assisted endoluminal intervention, but remains difficult because of limited endoscopic field of view and dynamic artifacts. Existing navigation platforms often rely on external localization technologies, such as electromagnetic tracking or shape sensing, which increase hardware complexity and remain vulnerable to intraoperative anatomical m…

    Submitted 8 March, 2026; originally announced March 2026.

  33. arXiv:2603.06775  [pdf, ps, other]

    cs.RO

    HybridMimic: Hybrid RL-Centroidal Control for Humanoid Motion Mimicking

    Authors: Ludwig Chee-Ying Tay, I-Chia Chang, Yan Gu

    Abstract: Motion mimicking, i.e., encouraging the control policy to mimic human motion, facilitates the learning of complex tasks via reinforcement learning (RL) for humanoid robots. Although standard RL frameworks demonstrate impressive locomotion agility, they often bypass explicit reasoning about robot dynamics during deployment, which is a design choice that can lead to physically infeasible commands wh…

    Submitted 6 March, 2026; originally announced March 2026.

  34. arXiv:2603.04945  [pdf, ps, other]

    cs.CL

    Federated Heterogeneous Language Model Optimization for Hybrid Automatic Speech Recognition

    Authors: Mengze Hong, Yi Gu, Di Jiang, Hanlin Gu, Chen Jason Zhang, Lu Wang, Zhiyang Su

    Abstract: Training automatic speech recognition (ASR) models increasingly relies on decentralized federated learning to ensure data privacy and accessibility, producing multiple local models that require effective merging. In hybrid ASR systems, while acoustic models can be merged using established methods, the language model (LM) for rescoring the N-best speech recognition list faces challenges due to the…

    Submitted 5 March, 2026; originally announced March 2026.

    Comments: Accepted by ICASSP 2026

  35. arXiv:2603.04597  [pdf, ps, other]

    cs.CL cs.AI

    Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning

    Authors: Lei Huang, Xiang Cheng, Chenxiao Zhao, Guobin Shen, Junjie Yang, Xiaocheng Feng, Yuxuan Gu, Xing Yu, Bing Qin

    Abstract: Large language models (LLMs) typically receive diverse natural language (NL) feedback through interaction with the environment. However, current reinforcement learning (RL) algorithms rely solely on scalar rewards, leaving the rich information in NL feedback underutilized and leading to inefficient exploration. In this work, we propose GOLF, an RL framework that explicitly exploits group-level lan…

    Submitted 4 March, 2026; originally announced March 2026.

  36. arXiv:2603.03827  [pdf, ps, other]

    cs.MM

    Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition

    Authors: Qianrui Zhou, Hua Xu, Yunjin Gu, Yifan Wang, Songze Li, Hanlei Zhang

    Abstract: Multimodal intent recognition aims to infer human intents by jointly modeling various modalities, playing a pivotal role in real-world dialogue systems. However, current methods struggle to model hierarchical semantics underlying complex intents and lack the capacity for self-evolving reasoning over multimodal representations. To address these issues, we propose HIER, a novel method that integrate…

    Submitted 4 March, 2026; originally announced March 2026.

    Comments: Accepted by CVPR 2026

  37. arXiv:2603.01685  [pdf, ps, other]

    cs.CV

    FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters

    Authors: Shitong Shao, Yufei Gu, Zeke Xie

    Abstract: The recent advent of powerful video generation models, such as Hunyuan, WanX, Veo3, and Kling, has inaugurated a new era in the field. However, the practical deployment of these models is severely impeded by their substantial computational overhead, which stems from enormous parameter counts and the iterative, multi-step sampling process required during inference. Prior research on accelerating ge…

    Submitted 12 March, 2026; v1 submitted 2 March, 2026; originally announced March 2026.

    Comments: Accepted by CVPR 2026

  38. arXiv:2603.01603  [pdf, ps, other]

    cs.CV

    Sparse View Distractor-Free Gaussian Splatting

    Authors: Yi Gu, Zhaorui Wang, Jiahang Cao, Jiaxu Wang, Mingle Zhao, Dongjun Ye, Renjing Xu

    Abstract: 3D Gaussian Splatting (3DGS) enables efficient training and fast novel view synthesis in static environments. To address challenges posed by transient objects, distractor-free 3DGS methods have emerged and shown promising results when dense image captures are available. However, their performance degrades significantly under sparse input conditions. This limitation primarily stems from the relianc…

    Submitted 2 March, 2026; originally announced March 2026.

  39. arXiv:2603.01565  [pdf, ps, other]

    eess.AS cs.SD

    Investigating Group Relative Policy Optimization for Diffusion Transformer based Text-to-Audio Generation

    Authors: Yi Gu, Yanqing Liu, Chen Yang, Sheng Zhao

    Abstract: Text-to-audio (T2A) generation has advanced considerably in recent years, yet existing methods continue to face challenges in accurately rendering complex text prompts, particularly those involving intricate audio effects, and achieving precise text-audio alignment. While prior approaches have explored data augmentation, explicit timing conditioning, and reinforcement learning, overall synthesis q…

    Submitted 2 March, 2026; originally announced March 2026.

  40. arXiv:2603.00479  [pdf, ps, other]

    cs.CV

    U-VLM: Hierarchical Vision Language Modeling for Report Generation

    Authors: Pengcheng Shi, Minghui Zhang, Kehan Song, Jiaqi Liu, Yun Gu, Xinglin Zhang

    Abstract: Automated radiology report generation is key for reducing radiologist workload and improving diagnostic consistency, yet generating accurate reports for 3D medical imaging remains challenging. Existing vision-language models face two limitations: they do not leverage segmentation-pretrained encoders, and they inject visual features only at the input layer of language models, losing multi-scale inf…

    Submitted 28 February, 2026; originally announced March 2026.

  41. arXiv:2603.00123  [pdf, ps, other]

    cs.CV cs.AI

    CT-Flow: Orchestrating CT Interpretation Workflow with Model Context Protocol Servers

    Authors: Yannian Gu, Xizhuo Zhang, Linjie Mu, Yongrui Yu, Zhongzhen Huang, Shaoting Zhang, Xiaofan Zhang

    Abstract: Recent advances in Large Vision-Language Models (LVLMs) have shown strong potential for multi-modal radiological reasoning, particularly in tasks like diagnostic visual question answering (VQA) and radiology report generation. However, most existing approaches for 3D CT analysis largely rely on static, single-pass inference. In practice, clinical interpretation is a dynamic, tool-mediated workflow…

    Submitted 23 February, 2026; originally announced March 2026.

    Comments: submitting to ACL 2026

  42. arXiv:2602.20630  [pdf, ps, other]

    cs.CV

    From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection

    Authors: Yepeng Liu, Hao Li, Liwen Yang, Fangzhen Li, Xudi Ge, Yuliang Gu, kuang Gao, Bing Wang, Guang Chen, Hangjun Ye, Yongchao Xu

    Abstract: Keypoint-based matching is a fundamental component of modern 3D vision systems, such as Structure-from-Motion (SfM) and SLAM. Most existing learning-based methods are trained on image pairs, a paradigm that fails to explicitly optimize for the long-term trackability of keypoints across sequences under challenging viewpoint and illumination changes. In this paper, we reframe keypoint detection as a…

    Submitted 3 March, 2026; v1 submitted 24 February, 2026; originally announced February 2026.

    Comments: Accepted by CVPR 2026

  43. arXiv:2602.19444  [pdf, ps, other]

    cs.LG

    PIS: A Physics-Informed System for Accurate State Partitioning of $Aβ_{42}$ Protein Trajectories

    Authors: Qianfeng Yu, Ningkang Peng, Yanhui Gu

    Abstract: Understanding the conformational evolution of $β$-amyloid ($Aβ$), particularly the $Aβ_{42}$ isoform, is fundamental to elucidating the pathogenic mechanisms underlying Alzheimer's disease. However, existing end-to-end deep learning models often struggle to capture subtle state transitions in protein trajectories due to a lack of explicit physical constraints. In this work, we introduce PIS, a Phy…

    Submitted 22 February, 2026; originally announced February 2026.

  44. arXiv:2602.15841  [pdf, ps, other]

    eess.SY cs.NI math.OC

    Close-enough general routing problem for multiple unmanned aerial vehicles in monitoring missions

    Authors: Huan Liu, Michel Gendreau, Binjie Xu, Guohua Wu, Yi Gu

    Abstract: In this paper, we introduce a close-enough multi-UAV general routing problem (CEMUAVGRP) where a fleet of homogeneous UAVs conduct monitoring tasks containing nodes, each of which has its disk neighborhood, and edges, aiming to minimize the total distance. A two-phase iterative method is proposed, partitioning the CEMUAVGRP into a general routing phase where a satisfactory route including required…

    Submitted 20 January, 2026; originally announced February 2026.

  45. arXiv:2602.15669  [pdf, ps, other]

    cs.AI

    PERSONA: Dynamic and Compositional Inference-Time Personality Control via Activation Vector Algebra

    Authors: Xiachong Feng, Liang Zhao, Weihong Zhong, Yichong Huang, Yuxuan Gu, Lingpeng Kong, Xiaocheng Feng, Bing Qin

    Abstract: Current methods for personality control in Large Language Models rely on static prompting or expensive fine-tuning, failing to capture the dynamic and compositional nature of human traits. We introduce PERSONA, a training-free framework that achieves fine-tuning level performance through direct manipulation of personality vectors in activation space. Our key insight is that personality traits appe…

    Submitted 17 February, 2026; originally announced February 2026.

    Comments: ICLR 2026

  46. arXiv:2602.14662  [pdf, ps, other]

    cs.CV cs.RO

    Advances in Global Solvers for 3D Vision

    Authors: Zhenjun Zhao, Heng Yang, Bangyan Liao, Yingping Zeng, Shaocheng Yan, Yingdong Gu, Peidong Liu, Yi Zhou, Haoang Li, Javier Civera

    Abstract: Global solvers have emerged as a powerful paradigm for 3D vision, offering certifiable solutions to nonconvex geometric optimization problems traditionally addressed by local or heuristic methods. This survey presents the first systematic review of global solvers in geometric vision, unifying the field through a comprehensive taxonomy of three core paradigms: Branch-and-Bound (BnB), Convex Relaxat…

    Submitted 16 February, 2026; originally announced February 2026.

    Comments: Comprehensive survey; 37 pages, 7 figures, 3 tables. Project page with literature tracking and code tutorials: https://github.com/ericzzj1989/Awesome-Global-Solvers-for-3D-Vision

  47. arXiv:2602.13273  [pdf, ps, other]

    cs.DB cs.AI cs.DC

    MergePipe: A Budget-Aware Parameter Management System for Scalable LLM Merging

    Authors: Yuanyi Wang, Yanggan Gu, Zihao Wang, Kunxi Li, Yifan Yang, Zhaoyi Yan, Congkai Xie, Jianmin Wu, Hongxia Yang

    Abstract: Large language model (LLM) merging has become a key technique in modern LLM development pipelines, enabling the integration of multiple task- or domain-specific expert models without retraining. However, as the number of experts grows, existing merging implementations treat model parameters as unstructured files and execute merges in a stateless, one-shot manner, leading to excessive disk I/O, red…

    Submitted 5 February, 2026; originally announced February 2026.

  48. arXiv:2602.13235  [pdf, ps, other]

    cs.AI cs.CV

    Lang2Act: Fine-Grained Visual Reasoning through Self-Emergent Linguistic Toolchains

    Authors: Yuqi Xiong, Chunyi Peng, Zhipeng Xu, Zhenghao Liu, Zulong Chen, Yukun Yan, Shuo Wang, Yu Gu, Ge Yu

    Abstract: Visual Retrieval-Augmented Generation (VRAG) enhances Vision-Language Models (VLMs) by incorporating external visual documents to address a given query. Existing VRAG frameworks usually depend on rigid, pre-defined external tools to extend the perceptual capabilities of VLMs, typically by explicitly separating visual perception from subsequent reasoning processes. However, this decoupled design ca…

    Submitted 9 April, 2026; v1 submitted 29 January, 2026; originally announced February 2026.

  49. arXiv:2602.11584  [pdf, ps, other]

    cs.LG cs.AI

    Gradient Compression May Hurt Generalization: A Remedy by Synthetic Data Guided Sharpness Aware Minimization

    Authors: Yujie Gu, Richeng Jin, Zhaoyang Zhang, Huaiyu Dai

    Abstract: It is commonly believed that gradient compression in federated learning (FL) enjoys significant improvement in communication efficiency with negligible performance degradation. In this paper, we find that gradient compression induces sharper loss landscapes in federated learning, particularly under non-IID data distributions, which suggests hindered generalization capability. The recently emerging…

    Submitted 12 February, 2026; originally announced February 2026.

  50. arXiv:2602.11513  [pdf, ps, other]

    cs.CR cs.AI

    Differentially Private and Communication Efficient Large Language Model Split Inference via Stochastic Quantization and Soft Prompt

    Authors: Yujie Gu, Richeng Jin, Xiaoyu Ji, Yier Jin, Wenyuan Xu

    Abstract: Large Language Models (LLMs) have achieved remarkable performance and received significant research interest. The enormous computational demands, however, hinder the local deployment on devices with limited resources. The current prevalent LLM inference paradigms require users to send queries to the service providers for processing, which raises critical privacy concerns. Existing approaches propo…

    Submitted 11 February, 2026; originally announced February 2026.