Showing 1–50 of 1,735 results for author: Xie, Y

Searching in archive cs.
  1. arXiv:2604.08184  [pdf, ps, other

    cs.SD cs.AI

    AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan

    Authors: Yuankun Xie, Haonan Cheng, Jiayi Zhou, Xiaoxuan Guo, Tao Wang, Jian Liu, Weiqiang Wang, Ruibo Fu, Xiaopeng Wang, Hengyan Huang, Xiaoying Huang, Long Ye, Guangtao Zhai

    Abstract: The rapid advancement of Audio Large Language Models (ALLMs) has enabled cost-effective, high-fidelity generation and manipulation of both speech and non-speech audio, including sound effects, singing voices, and music. While these capabilities foster creativity and content production, they also introduce significant security and trust challenges, as realistic audio deepfakes can now be generated… ▽ More

    Submitted 9 April, 2026; originally announced April 2026.

    Comments: Accepted to the ACM Multimedia 2026 Grand Challenge

  2. arXiv:2604.08159  [pdf, ps, other

    cs.CV cs.AI

    Face-D$^2$CL: Multi-Domain Synergistic Representation with Dual Continual Learning for Facial DeepFake Detection

    Authors: Yushuo Zhang, Yu Cheng, Yongkang Hu, Jiuan Zhou, Jiawei Chen, Yuan Xie, Zhaoxia Yin

    Abstract: The rapid advancement of facial forgery techniques poses severe threats to public trust and information security, making facial DeepFake detection a critical research priority. Continual learning provides an effective approach to adapt facial DeepFake detection models to evolving forgery patterns. However, existing methods face two key bottlenecks in real-world continual learning scenarios: insuff… ▽ More

    Submitted 9 April, 2026; originally announced April 2026.

  3. arXiv:2604.08044  [pdf, ps, other

    cs.AR

    A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators

    Authors: Cong Li, Chenhao Xue, Yi Ren, Xiping Dong, Yu Cheng, Yinbo Hu, Fujun Bai, Yixin Guo, Xiping Jiang, Qiang Wu, Zhi Yang, Zhe Cheng, Yuan Xie, Guangyu Sun

    Abstract: Large language models (LLMs) exhibit memory-intensive behavior during decoding, making it a key bottleneck in LLM inference. To accelerate decoding execution, hybrid-bonding-based 3D-DRAM has been adopted in LLM accelerators. While this emerging technology provides strong performance gains over existing hardware, current 3D-DRAM accelerators (3D-Accelerators) rely on closed-source evaluation tools… ▽ More

    Submitted 9 April, 2026; originally announced April 2026.

  4. arXiv:2604.08003  [pdf, ps, other

    eess.AS cs.CL cs.SD

    Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs

    Authors: Yuan Xie, Jiaqi Song, Guang Qiu, Xianliang Wang, Ming Lei, Jie Gao, Jie Wu

    Abstract: Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a dominant paradigm. Although recent LLM-based ASR models have shown promising performance on public benchmarks, it remains challenging to balance recognition quality with latency and overhead, while hallucinations further limit real-world deployment. In this study, we revisit LLM-based ASR from an entropy… ▽ More

    Submitted 9 April, 2026; originally announced April 2026.

  5. arXiv:2604.07815  [pdf, ps, other

    cs.CL

    AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention

    Authors: Yuxuan Hu, Jianchao Tan, Jiaqi Zhang, Wen Zan, Pingwei Sun, Yifan Lu, Yerui Sun, Yuchen Xie, Xunliang Cai, Jing Zhang

    Abstract: Long-context inference in LLMs faces the dual challenges of quadratic attention complexity and prohibitive KV cache memory. While token-level sparse attention offers superior accuracy, its indexing overhead is costly; block-level methods improve efficiency but sacrifice precision. We propose AsyncTLS, a hierarchical sparse attention system that combines coarse-grained block filtering with fine-gra… ▽ More

    Submitted 9 April, 2026; originally announced April 2026.

    ACM Class: I.2.7

  6. arXiv:2604.07758  [pdf, ps, other

    cs.CV cs.AI

    DailyArt: Discovering Articulation from Single Static Images via Latent Dynamics

    Authors: Hang Zhang, Qijian Tian, Jingyu Gong, Daoguo Dong, Xuhong Wang, Yuan Xie, Xin Tan

    Abstract: Articulated objects are essential for embodied AI and world models, yet inferring their kinematics from a single closed-state image remains challenging because crucial motion cues are often occluded. Existing methods either require multi-state observations or rely on explicit part priors, retrieval, or other auxiliary inputs that partially expose the structure to be inferred. In this work, we pres… ▽ More

    Submitted 8 April, 2026; originally announced April 2026.

  7. arXiv:2604.07723  [pdf, ps, other

    cs.CV

    Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation

    Authors: Jiahao Li, Yang Lu, Yachao Zhang, Fangyong Wang, Yuan Xie, Yanyun Qu

    Abstract: Open-vocabulary semantic segmentation (OVSS) aims to segment arbitrary category regions in images using open-vocabulary prompts, necessitating that existing methods possess pixel-level vision-language alignment capability. Typically, this capability involves computing the cosine similarity, i.e., logits, between visual and linguistic features, and minimizing the distribution discrepancy between the… ▽ More

    Submitted 8 April, 2026; originally announced April 2026.

    Comments: Accepted by CVPR 2026

  8. arXiv:2604.07468  [pdf, ps, other

    cs.AI

    M-ArtAgent: Evidence-Based Multimodal Agent for Implicit Art Influence Discovery

    Authors: Hanyi Liu, Zhonghao Jiu, Minghao Wang, Yuhang Xie, Heran Yang

    Abstract: Implicit artistic influence, although visually plausible, is often undocumented and thus poses a historically constrained attribution problem: resemblance is necessary but not sufficient evidence. Most prior systems reduce influence discovery to embedding similarity or label-driven graph completion, while recent multimodal large language models (LLMs) remain vulnerable to temporal inconsistency an… ▽ More

    Submitted 8 April, 2026; originally announced April 2026.

    Comments: 13 pages, 5 figures, submitted to IEEE Access

  9. arXiv:2604.06628  [pdf, ps, other

    cs.AI

    Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

    Authors: Qihan Ren, Peng Wang, Ruikun Cai, Shuai Shao, Dadi Guo, Yuejin Xie, Yafu Li, Quanshi Zhang, Xia Hu, Jing Shao, Dongrui Liu

    Abstract: A prevailing narrative in LLM post-training holds that supervised finetuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jointly shaped by optimization dynamics, training data, and base-model capability. Some reported failu… ▽ More

    Submitted 7 April, 2026; originally announced April 2026.

    Comments: Preprint. Under review

  10. arXiv:2604.05939  [pdf, ps, other

    cs.AI cs.HC

    Context-Value-Action Architecture for Value-Driven Large Language Model Agents

    Authors: TianZe Zhang, Sirui Sun, Yuhang Xie, Xin Zhang, Zhiqiang Wu, Guojie Song

    Abstract: Large Language Models (LLMs) have shown promise in simulating human behavior, yet existing agents often exhibit behavioral rigidity, a flaw frequently masked by the self-referential bias of current "LLM-as-a-judge" evaluations. By evaluating against empirical ground truth, we reveal a counter-intuitive phenomenon: increasing the intensity of prompt-driven reasoning does not enhance fidelity but ra… ▽ More

    Submitted 7 April, 2026; originally announced April 2026.

    Comments: Accepted to Findings of the Association for Computational Linguistics: ACL 2026

  11. arXiv:2604.05015  [pdf, ps, other

    cs.CV

    Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

    Authors: Chaoyou Fu, Haozhi Yuan, Yuhao Dong, Yi-Fan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, Yongkang Xie, Xiawu Zheng, Xue Yang, Haoyu Cao, Yunsheng Wu, Ziwei Liu, Xing Sun, Caifeng Shan, Ran He

    Abstract: With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically… ▽ More

    Submitted 6 April, 2026; originally announced April 2026.

    Comments: Homepage: https://video-mme-v2.netlify.app/

  12. arXiv:2604.05013  [pdf, ps, other

    cs.SE cs.AI

    Scaling Coding Agents via Atomic Skills

    Authors: Yingwei Ma, Yue Liu, Xinlong Yang, Yanhao Li, Kelin Fu, Yibo Miao, Yuchong Xie, Zhexu Wang, Shing-Chi Cheung

    Abstract: Current LLM coding agents are predominantly trained on composite benchmarks (e.g., bug fixing), which often leads to task-specific overfitting and limited generalization. To address this, we propose a novel scaling paradigm that shifts the focus from task-level optimization to atomic skill mastery. We first formalize five fundamental atomic skills, code localization, code editing, unit-test genera… ▽ More

    Submitted 6 April, 2026; originally announced April 2026.

  13. arXiv:2604.04921  [pdf, ps, other

    cs.CL cs.CV

    TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

    Authors: Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, Yukang Chen

    Abstract: Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, queries rotate with position during RoPE, making representative queries very few, leading to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-Ro… ▽ More

    Submitted 6 April, 2026; originally announced April 2026.

    Comments: Code is available at https://github.com/WeianMao/triattention

  14. arXiv:2604.04863  [pdf, ps, other

    cs.CV

    Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations

    Authors: Tuan Dung Nguyen, Minh Khoi Ho, Qi Chen, Yutong Xie, Nguyen Cam-Tu, Minh Khoi Nguyen, Dang Huy Pham Nguyen, Anton van den Hengel, Johan W. Verjans, Phi Le Nguyen, Vu Minh Hieu Phan

    Abstract: Large vision-language models (LVLMs) achieve strong performance on visual reasoning tasks but remain highly susceptible to hallucination. Existing detection methods predominantly rely on coarse, whole-image measures of how an object token relates to the input image. This global strategy is limited: hallucinated tokens may exhibit weak but widely scattered correlations across many local regions, wh… ▽ More

    Submitted 6 April, 2026; originally announced April 2026.

    Comments: Accepted at CVPR2026 Main Track

  15. arXiv:2604.04503  [pdf, ps, other

    cs.AI cs.MA

    Memory Intelligence Agent

    Authors: Jingyang Qiao, Weicheng Meng, Yu Cheng, Zhihang Lin, Zhizhong Zhang, Xin Tan, Jingyu Gong, Kun Shao, Yuan Xie

    Abstract: Deep research agents (DRAs) integrate LLM reasoning with external tools. Memory systems enable DRAs to leverage historical experiences, which are essential for efficient reasoning and autonomous evolution. Existing methods rely on retrieving similar trajectories from memory to aid reasoning, while suffering from key limitations of ineffective memory evolution and increasing storage and retrieval c… ▽ More

    Submitted 7 April, 2026; v1 submitted 6 April, 2026; originally announced April 2026.

  16. arXiv:2604.04342  [pdf, ps, other

    cs.LG stat.ML

    Generative models for decision-making under distributional shift

    Authors: Xiuyuan Cheng, Yunqin Zhu, Yao Xie

    Abstract: Many data-driven decision problems are formulated using a nominal distribution estimated from historical data, while performance is ultimately determined by a deployment distribution that may be shifted, context-dependent, partially observed, or stress-induced. This tutorial presents modern generative models, particularly flow- and score-based methods, as mathematical tools for constructing decisi… ▽ More

    Submitted 5 April, 2026; originally announced April 2026.

    Comments: Under review for INFORMS TutORials in Operations Research, 2026

  17. arXiv:2604.03941  [pdf, ps, other

    cs.CV

    SafeCtrl: Region-Aware Safety Control for Text-to-Image Diffusion via Detect-Then-Suppress

    Authors: Lingyun Zhang, Yu Xie, Zhongli Fang, Yu Liu, Ping Chen

    Abstract: The widespread deployment of text-to-image diffusion models is significantly challenged by the generation of visually harmful content, such as sexually explicit content, violence, and horror imagery. Common safety interventions, ranging from input filtering to model concept erasure, often suffer from two critical limitations: (1) a severe trade-off between safety and context preservation, where re… ▽ More

    Submitted 4 April, 2026; originally announced April 2026.

    Comments: 6 pages, 5 figures, accepted to 2026 IEEE International Conference on Multimedia and Expo (ICME)

  18. arXiv:2604.03286  [pdf

    cs.AI cond-mat.mtrl-sci cs.HC

    Toward Full Autonomous Laboratory Instrumentation Control with Large Language Models

    Authors: Yong Xie, Kexin He, Andres Castellanos-Gomez

    Abstract: The control of complex laboratory instrumentation often requires significant programming expertise, creating a barrier for researchers lacking computational skills. This work explores the potential of large language models (LLMs), such as ChatGPT, and LLM-based artificial intelligence (AI) agents to enable efficient programming and automation of scientific equipment. Through a case study involving… ▽ More

    Submitted 25 March, 2026; originally announced April 2026.

    Comments: 16 pages, 5 figures. Accepted manuscript published in Small Structures. Supporting data and code available at https://doi.org/10.5281/zenodo.15065601

    Journal ref: Small Structures, 2025, 6(8), 2500173

  19. arXiv:2604.02022  [pdf, ps, other

    cs.AI

    ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis

    Authors: Yu Li, Haoyu Luo, Yuejin Xie, Yuqian Fu, Zhonghao Yang, Shuai Shao, Qihan Ren, Wanying Qu, Yanwei Fu, Yujiu Yang, Jing Shao, Xia Hu, Dongrui Liu

    Abstract: Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than isolated prompts or final responses. Existing trajectory-level benchmarks remain limited by insufficient interaction diversity, coarse observability of safety failures, and weak long-horizon realism. We introduce ATBench, a trajectory-leve… ▽ More

    Submitted 8 April, 2026; v1 submitted 2 April, 2026; originally announced April 2026.

  20. arXiv:2604.01171  [pdf, ps, other

    cs.CV

    Open-Set Supervised 3D Anomaly Detection: An Industrial Dataset and a Generalisable Framework for Unknown Defects

    Authors: Hanzhe Liang, Luocheng Zhang, Junyang Xia, HanLiang Zhou, Bingyang Guo, Yingxi Xie, Can Gao, Ruiyun Yu, Jinbao Wang, Pan Li

    Abstract: Although self-supervised 3D anomaly detection assumes that acquiring high-precision point clouds is computationally expensive, in real manufacturing scenarios it is often feasible to collect a limited number of anomalous samples. Therefore, we study open-set supervised 3D anomaly detection, where the model is trained with only normal samples and a small number of known anomalous samples, aiming to… ▽ More

    Submitted 1 April, 2026; originally announced April 2026.

    Comments: Resources: https://github.com/hzzzzzhappy/open-industry

    ACM Class: F.2.2; I.2.7

  21. arXiv:2604.01052  [pdf, ps, other

    cs.CR cs.AI

    VibeGuard: A Security Gate Framework for AI-Generated Code

    Authors: Ying Xie

    Abstract: "Vibe coding," in which developers delegate code generation to AI assistants and accept the output with little manual review, has gained rapid adoption in production settings. On March 31, 2026, Anthropic's Claude Code CLI shipped a 59.8 MB source map file in its npm package, exposing roughly 512,000 lines of proprietary TypeScript. The tool had itself been largely vibe-coded, and the leak traced… ▽ More

    Submitted 1 April, 2026; originally announced April 2026.

  22. arXiv:2604.00557  [pdf, ps, other

    cs.RO cs.CV cs.LG

    Multi-Camera View Scaling for Data-Efficient Robot Imitation Learning

    Authors: Yichen Xie, Yixiao Wang, Shuqi Zhao, Cheng-En Wu, Masayoshi Tomizuka, Jianwen Xie, Hao-Shu Fang

    Abstract: The generalization ability of imitation learning policies for robotic manipulation is fundamentally constrained by the diversity of expert demonstrations, while collecting demonstrations across varied environments is costly and difficult in practice. In this paper, we propose a practical framework that exploits inherent scene diversity without additional human effort by scaling camera views during… ▽ More

    Submitted 1 April, 2026; originally announced April 2026.

  23. arXiv:2604.00509  [pdf, ps, other

    cs.GR cs.CV

    RT-GS: Gaussian Splatting with Reflection and Transmittance Primitives

    Authors: Kunnong Zeng, Chensheng Peng, Yichen Xie, Masayoshi Tomizuka, Cem Yuksel

    Abstract: Gaussian Splatting is a powerful tool for reconstructing diffuse scenes, but it struggles to simultaneously model specular reflections and the appearance of objects behind semi-transparent surfaces. These specular reflections and transmittance are essential for realistic novel view synthesis, and existing methods do not properly incorporate the underlying physical processes to simulate them. To ad… ▽ More

    Submitted 1 April, 2026; originally announced April 2026.

  24. arXiv:2604.00505  [pdf, ps, other

    cs.LG cs.AI

    Towards Initialization-dependent and Non-vacuous Generalization Bounds for Overparameterized Shallow Neural Networks

    Authors: Yunwen Lei, Yufeng Xie

    Abstract: Overparameterized neural networks often show a benign overfitting property in the sense of achieving excellent generalization behavior despite the number of parameters exceeding the number of training examples. A promising direction to explain benign overfitting is to relate generalization to the norm of distance from initialization, motivated by the empirical observations that this distance is of… ▽ More

    Submitted 1 April, 2026; originally announced April 2026.

  25. arXiv:2604.00055  [pdf, ps, other

    cs.RO cs.CV cs.LG

    Generalizable Dense Reward for Long-Horizon Robotic Tasks

    Authors: Silong Yong, Stephen Sheng, Carl Qi, Xiaojie Wang, Evan Sheehan, Anurag Shivaprasad, Yaqi Xie, Katia Sycara, Yesh Dattatreya

    Abstract: Existing robotic foundation policies are trained primarily via large-scale imitation learning. While such models demonstrate strong capabilities, they often struggle with long-horizon tasks due to distribution shift and error accumulation. While reinforcement learning (RL) can finetune these models, it cannot work well across diverse tasks without manual reward engineering. We propose VLLR, a dens… ▽ More

    Submitted 30 March, 2026; originally announced April 2026.

    Comments: Project page: https://silongyong.github.io/vllr_project_page/

  26. DeepEye: A Steerable Self-driving Data Agent System

    Authors: Boyan Li, Yiran Peng, Yupeng Xie, Sirong Lu, Yizhang Zhu, Xing Mu, Xinyu Liu, Yuyu Luo

    Abstract: Large Language Models (LLMs) have revolutionized natural language interaction with data. The "holy grail" of data analytics is to build autonomous Data Agents that can self-drive complex data analysis workflows. However, current implementations are still limited to linear "ChatBI" systems. These systems struggle with joint analysis across heterogeneous data sources (e.g., databases, documents, and… ▽ More

    Submitted 30 March, 2026; originally announced March 2026.

    Comments: SIGMOD Demo (2026)

  27. arXiv:2603.28152  [pdf, ps, other

    cs.CV

    ObjectMorpher: 3D-Aware Image Editing via Deformable 3DGS Models

    Authors: Yuhuan Xie, Aoxuan Pan, Yi-Hua Huang, Chirui Chang, Peng Dai, Xin Yu, Xiaojuan Qi

    Abstract: Achieving precise, object-level control in image editing remains challenging: 2D methods lack 3D awareness and often yield ambiguous or implausible results, while existing 3D-aware approaches rely on heavy optimization or incomplete monocular reconstructions. We present ObjectMorpher, a unified, interactive framework that converts ambiguous 2D edits into geometry-grounded operations. ObjectMorpher… ▽ More

    Submitted 30 March, 2026; originally announced March 2026.

    Comments: 11 pages, 8 figures

  28. arXiv:2603.28134  [pdf, ps, other

    cs.CV

    Robust Remote Sensing Image-Text Retrieval with Noisy Correspondence

    Authors: Qiya Song, Yiqiang Xie, Yuan Sun, Renwei Dian, Xudong Kang

    Abstract: As a pivotal task that bridges remote visual and linguistic understanding, Remote Sensing Image-Text Retrieval (RSITR) has attracted considerable research interest in recent years. However, almost all RSITR methods implicitly assume that image-text pairs are matched perfectly. In practice, acquiring a large set of well-aligned data pairs is often prohibitively expensive or even infeasible. In addi… ▽ More

    Submitted 30 March, 2026; originally announced March 2026.

  29. arXiv:2603.27991  [pdf, ps, other

    cs.HC cs.AI

    ViviDoc: Generating Interactive Documents through Human-Agent Collaboration

    Authors: Yinghao Tang, Yupeng Xie, Yingchaojie Feng, Tingfeng Lan, Jiale Lao, Yue Cheng, Wei Chen

    Abstract: Interactive documents help readers engage with complex ideas through dynamic visualization, interactive animations, and exploratory interfaces. However, creating such documents remains costly, as it requires both domain expertise and web development skills. Recent Large Language Model (LLM)-based agents can automate content creation, but directly applying them to interactive document generation of… ▽ More

    Submitted 29 March, 2026; originally announced March 2026.

  30. arXiv:2603.27538  [pdf, ps, other

    cs.CV cs.CL

    LongCat-Next: Lexicalizing Modalities as Discrete Tokens

    Authors: Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, Haozhe Zhao, Hong Liu, Hui Su, Jiaqi Zhang, Jiawei Wang, Jing Li, Kefeng Zhang, Manyuan Zhang, Minhao Jing, Peng Pei, Quan Chen, Taofeng Xue, Tongxin Pan, Xiaotong Li, Xiaoyang Li, et al. (64 additional authors not shown)

    Abstract: The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Aut… ▽ More

    Submitted 29 March, 2026; originally announced March 2026.

    Comments: LongCat-Next Technical Report

  31. arXiv:2603.25768  [pdf, ps, other

    cs.SE cs.AI cs.AR cs.MA

    UCAgent: An End-to-End Agent for Block-Level Functional Verification

    Authors: Junyue Wang, Zhicheng Yao, Yan Pi, Xiaolong Li, Fangyuan Song, Jinru Wang, Yunlong Xie, Sa Wang, Yungang Bao

    Abstract: Functional verification remains a critical bottleneck in modern IC development cycles, accounting for approximately 70% of total development time in many projects. However, traditional methods, including constrained-random and formal verification, struggle to keep pace with the growing complexity of modern semiconductor designs. While recent advances in Large Language Models (LLMs) have shown pr… ▽ More

    Submitted 26 March, 2026; originally announced March 2026.

  32. arXiv:2603.24577  [pdf, ps, other

    cs.CV cs.AI

    EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction

    Authors: Falong Fan, Yi Xie, Arnis Lektauers, Bo Liu, Jerzy Rozenblit

    Abstract: Accurate 3D reconstruction of deformable soft tissues is essential for surgical robotic perception. However, low-texture surfaces, specular highlights, and instrument occlusions often fragment geometric continuity, posing a challenge for existing fixed-topology approaches. To address this, we propose EndoVGGT, a geometry-centric framework equipped with a Deformation-aware Graph Attention (DeGAT) m… ▽ More

    Submitted 25 March, 2026; originally announced March 2026.

  33. arXiv:2603.23501  [pdf, ps, other

    cs.CV cs.AI cs.CL

    MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage

    Authors: Ufaq Khan, Umair Nawaz, L D M S S Teja, Numaan Saeed, Muhammad Bilal, Yutong Xie, Mohammad Yaqub, Muhammad Haris Khan

    Abstract: Vision Language Models (VLMs) are increasingly used for tasks like medical report generation and visual question answering. However, fluent diagnostic text does not guarantee safe visual understanding. In clinical practice, interpretation begins with pre-diagnostic sanity checks: verifying that the input is valid to read (correct modality and anatomy, plausible viewpoint and orientation, and no ob… ▽ More

    Submitted 24 March, 2026; originally announced March 2026.

    Comments: 11 Pages

  34. arXiv:2603.22851  [pdf, ps, other

    cs.CV cs.AI

    UniQueR: Unified Query-based Feedforward 3D Reconstruction

    Authors: Chensheng Peng, Quentin Herau, Jiezhi Yang, Yichen Xie, Yihan Hu, Wenzhao Zheng, Matthew Strong, Masayoshi Tomizuka, Wei Zhan

    Abstract: We present UniQueR, a unified query-based feedforward framework for efficient and accurate 3D reconstruction from unposed images. Existing feedforward models such as DUSt3R, VGGT, and AnySplat typically predict per-pixel point maps or pixel-aligned Gaussians, which remain fundamentally 2.5D and limited to visible surfaces. In contrast, UniQueR formulates reconstruction as a sparse 3D query inferen… ▽ More

    Submitted 24 March, 2026; originally announced March 2026.

  35. arXiv:2603.22673  [pdf, ps, other

    cs.HC

    Design Implications for Student and Educator Needs in AI-Supported Programming Learning Tools

    Authors: Boxuan Ma, Yinjie Xie, Huiyong Li, Gen Li, Li Chen, Atsushi Shimada, Shin'Ichi Konomi

    Abstract: AI-powered coding assistants can support students in programming courses by providing on-demand explanations and debugging help. However, existing research often focuses on individual tools, leaving a gap in evidence-based design recommendations that reflect both educator and student perspectives in education settings. To ground the design of learning-oriented AI coding assistants for both sides'… ▽ More

    Submitted 23 March, 2026; originally announced March 2026.

  36. arXiv:2603.22300  [pdf, ps, other

    cs.LG cs.AI

    Scaling Attention via Feature Sparsity

    Authors: Yan Xie, Tiansheng Wen, Tangda Huang, Bo Chen, Chenyu You, Stefanie Jegelka, Yifei Wang

    Abstract: Scaling Transformers to ultra-long contexts is bottlenecked by the $O(n^2 d)$ cost of self-attention. Existing methods reduce this cost along the sequence axis through local windows, kernel approximations, or token-level sparsity, but these approaches consistently degrade accuracy. In this paper, we instead explore an orthogonal axis: feature sparsity. We propose Sparse Feature Attention (SFA), wh… ▽ More

    Submitted 30 March, 2026; v1 submitted 17 March, 2026; originally announced March 2026.

    Comments: 26 pages, 11 figures; Accepted at ICLR 2026

  37. arXiv:2603.22293  [pdf, ps, other

    cs.CL cs.AI cs.LG

    TIPS: Turn-Level Information-Potential Reward Shaping for Search-Augmented LLMs

    Authors: Yutao Xie, Nathaniel Thomas, Nicklas Hansen, Yang Fu, Li Erran Li, Xiaolong Wang

    Abstract: Search-augmented large language models (LLMs) trained with reinforcement learning (RL) have achieved strong results on open-domain question answering (QA), but training still remains a significant challenge. The optimization is often unstable due to sparse rewards and difficult credit assignments across reasoning and tool calls. To address this, we introduce Turn-Level Information Potential Reward… ▽ More

    Submitted 11 March, 2026; originally announced March 2026.

    Comments: Code: https://github.com/ucsd-wang-lab-lm/tips

  38. arXiv:2603.21547  [pdf, ps, other

    cs.CV

    PROBE: Diagnosing Residual Concept Capacity in Erased Text-to-Video Diffusion Models

    Authors: Yiwei Xie, Zheng Zhang, Ping Liu

    Abstract: Concept erasure techniques for text-to-video (T2V) diffusion models report substantial suppression of sensitive content, yet current evaluation is limited to checking whether the target concept is absent from generated frames, treating output-level suppression as evidence of representational removal. We introduce PROBE, a diagnostic protocol that quantifies the reactivation potential of e… ▽ More

    Submitted 23 March, 2026; originally announced March 2026.

    Comments: This preprint was posted after submission to IEEE Transactions

  39. arXiv:2603.21334  [pdf, ps, other

    cs.HC

    Software as Content: Dynamic Applications as the Human-Agent Interaction Layer

    Authors: Mulong Xie, Yang Xie

    Abstract: Chat-based natural language interfaces have emerged as the dominant paradigm for human-agent interaction, yet they fundamentally constrain engagement with structured information and complex tasks. We identify three inherent limitations: the mismatch between structured data and linear text, the high entropy of unconstrained natural language input, and the lack of persistent, evolving interaction st… ▽ More

    Submitted 25 March, 2026; v1 submitted 22 March, 2026; originally announced March 2026.

    Comments: 37 pages, 10 figures

  40. arXiv:2603.21280  [pdf, ps, other

    cs.CY cs.AI

    WARBENCH: A Comprehensive Benchmark for Evaluating LLMs in Military Decision-Making

    Authors: Zongjie Li, Chaozheng Wang, Yuchong Xie, Pingchuan Ma, Shuai Wang

    Abstract: Large Language Models are increasingly being considered for deployment in safety-critical military applications. However, current benchmarks suffer from structural blindspots that systematically overestimate model capabilities in real-world tactical scenarios. Existing frameworks typically ignore strict legal constraints based on International Humanitarian Law (IHL), omit edge computing limitation… ▽ More

    Submitted 22 March, 2026; originally announced March 2026.

  41. arXiv:2603.20587  [pdf, ps, other

    cs.LG cs.IT math.MG

    Neural collapse in the orthoplex regime

    Authors: James Alcala, Rayna Andreeva, Vladimir A. Kobzar, Dustin G. Mixon, Sanghoon Na, Shashank Sule, Yangxinyu Xie

    Abstract: When training a neural network for classification, the feature vectors of the training set are known to collapse to the vertices of a regular simplex, provided the dimension $d$ of the feature space and the number $n$ of classes satisfies $n\leq d+1$. This phenomenon is known as neural collapse. For other applications like language models, one instead takes $n\gg d$. Here, the neural collapse phen… ▽ More

    Submitted 20 March, 2026; originally announced March 2026.

  42. arXiv:2603.20435  [pdf

    cs.AI

    Deep reflective reasoning in interdependence constrained structured data extraction from clinical notes for digital health

    Authors: Jingwei Huang, Kuroush Nezafati, Zhikai Chi, Ruichen Rong, Colin Treager, Tingyi Wanyan, Yueshuang Xu, Xiaowei Zhan, Patrick Leavey, Guanghua Xiao, Wenqi Shi, Yang Xie

    Abstract: Extracting structured information from clinical notes requires navigating a dense web of interdependent variables where the value of one attribute logically constrains others. Existing Large Language Model (LLM)-based extraction pipelines often struggle to capture these dependencies, leading to clinically inconsistent outputs. We propose deep reflective reasoning, a large language model agent fram…

    Submitted 20 March, 2026; originally announced March 2026.

    Comments: 12 figures and 2 tables

  43. arXiv:2603.19621  [pdf, ps, other]

    cs.LG cs.AI

    DeepStock: Reinforcement Learning with Policy Regularizations for Inventory Management

    Authors: Yaqi Xie, Xinru Hao, Jiaxi Liu, Will Ma, Linwei Xin, Lei Cao, Yidong Zhang

    Abstract: Deep Reinforcement Learning (DRL) provides a general-purpose methodology for training inventory policies that can leverage big data and compute. However, off-the-shelf implementations of DRL have seen mixed success, often plagued by high sensitivity to the hyperparameters used during training. In this paper, we show that by imposing policy regularizations, grounded in classical inventory concepts…

    Submitted 19 March, 2026; originally announced March 2026.

  44. arXiv:2603.19582  [pdf, ps, other]

    cs.RO cs.AI

    Evolving Embodied Intelligence: Graph Neural Network--Driven Co-Design of Morphology and Control in Soft Robotics

    Authors: Jianqiang Wang, Shuaiqun Pan, Alvaro Serra-Gomez, Xiaohan Wei, Yue Xie

    Abstract: The intelligent behavior of robots does not emerge solely from control systems, but from the tight coupling between body and brain, a principle known as embodied intelligence. Designing soft robots that leverage this interaction remains a significant challenge, particularly when morphology and control require simultaneous optimization. A significant obstacle in this co-design process is that morph…

    Submitted 19 March, 2026; originally announced March 2026.

  45. arXiv:2603.19552  [pdf, ps, other]

    cs.CV

    StreetForward: Perceiving Dynamic Street with Feedforward Causal Attention

    Authors: Zhongrui Yu, Zhao Wang, Yijia Xie, Yida Wang, Xueyang Zhang, Yifei Zhan, Kun Zhan

    Abstract: Feedforward reconstruction is crucial for autonomous driving applications, where rapid scene reconstruction enables efficient utilization of large-scale driving datasets in closed-loop simulation and other downstream tasks, eliminating the need for time-consuming per-scene optimization. We present StreetForward, a pose-free and tracker-free feedforward framework for dynamic street reconstruction.…

    Submitted 19 March, 2026; originally announced March 2026.

  46. arXiv:2603.19364  [pdf, ps, other]

    cs.CV

    AURORA: Adaptive Unified Representation for Robust Ultrasound Analysis

    Authors: Ufaq Khan, L. D. M. S. Sai Teja, Ayuba Shakiru, Mai A. Shaaban, Yutong Xie, Muhammad Bilal, Muhammad Haris Khan

    Abstract: Ultrasound images vary widely across scanners, operators, and anatomical targets, which often causes models trained in one setting to generalize poorly to new hospitals and clinical conditions. The Foundation Model Challenge for Ultrasound Image Analysis (FMC-UIA) reflects this difficulty by requiring a single model to handle multiple tasks, including segmentation, detection, classification, and l…

    Submitted 19 March, 2026; originally announced March 2026.

  47. arXiv:2603.19310  [pdf, ps, other]

    cs.LG cs.AI

    MemReward: Graph-Based Experience Memory for LLM Reward Prediction with Limited Labels

    Authors: Tianyang Luo, Tao Feng, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You

    Abstract: Recent advances in large language models (LLMs) have been driven by reinforcement-learning-based post-training, which requires multiple rollouts with rewards. However, obtaining ground truth labels for the calculation of rewards at scale often requires expensive human labeling or time-consuming verification procedures. For instance, evaluating mathematical proofs demands expert review, and open-…

    Submitted 24 March, 2026; v1 submitted 13 March, 2026; originally announced March 2026.

  48. arXiv:2603.18604  [pdf, ps, other]

    cs.NI cs.AI

    AutORAN: LLM-driven Natural Language Programming for Agile xApp Development

    Authors: Xin Li, Shiming Yu, Leming Shen, Jianing Zhang, Yuanqing Zheng, Yaxiong Xie

    Abstract: Traditional RAN systems are closed and monolithic, stifling innovation. The openness and programmability enabled by Open Radio Access Network (O-RAN) are envisioned to revolutionize cellular networks with control-plane applications (xApps). The development of xApps (typically by third-party developers), however, remains time-consuming and cumbersome, often requiring months of manual coding and inte…

    Submitted 19 March, 2026; originally announced March 2026.

  49. arXiv:2603.17895  [pdf, ps, other]

    cs.CV

    A Creative Agent is Worth a 64-Token Template

    Authors: Ruixiao Shi, Fu Feng, Yucheng Xie, Xu Yang, Jing Wang, Xin Geng

    Abstract: Text-to-image (T2I) models have substantially improved image fidelity and prompt adherence, yet their creativity remains constrained by reliance on discrete natural language prompts. When presented with fuzzy prompts such as "a creative vinyl record-inspired skyscraper", these models often fail to infer the underlying creative intent, leaving creative ideation and prompt design largely to human…

    Submitted 18 March, 2026; originally announced March 2026.

  50. arXiv:2603.17753  [pdf, ps, other]

    cs.CV

    PC-CrossDiff: Point-Cluster Dual-Level Cross-Modal Differential Attention for Unified 3D Referring and Segmentation

    Authors: Wenbin Tan, Jiawen Lin, Fangyong Wang, Yuan Xie, Yong Xie, Yachao Zhang, Yanyun Qu

    Abstract: 3D Visual Grounding (3DVG) aims to localize the referent of natural language referring expressions through two core tasks: Referring Expression Comprehension (3DREC) and Segmentation (3DRES). While existing methods achieve high accuracy in simple, single-object scenes, they suffer from severe performance degradation in complex, multi-object scenes that are common in real-world settings, hindering…

    Submitted 18 March, 2026; originally announced March 2026.