
Showing 1–50 of 423 results for author: Yao, H

Searching in archive cs.
  1. arXiv:2512.14499  [pdf, ps, other]

    cs.CV

    Native Intelligence Emerges from Large-Scale Clinical Practice: A Retinal Foundation Model with Deployment Efficiency

    Authors: Jia Guo, Jiawei Du, Shengzhu Yang, Shuai Lu, Wenquan Cheng, Kaiwen Zhang, Yihua Sun, Chuhong Yang, Weihang Zhang, Fang Chen, Yilan Wu, Lie Ju, Guochen Ning, Longfei Ma, Huiping Yao, Jinyuan Wang, Peilun Shi, Yukun Zhou, Jie Xu, Pearse A. Keane, Hanruo Liu, Hongen Liao, Ningli Wang, Huiqi Li

    Abstract: Current retinal foundation models remain constrained by curated research datasets that lack authentic clinical context, and require extensive task-specific optimization for each application, limiting their deployment efficiency in low-resource settings. Here, we show that these barriers can be overcome by building clinical native intelligence directly from real-world medical practice. Our key insi…

    Submitted 16 December, 2025; originally announced December 2025.

  2. arXiv:2512.11802  [pdf, ps, other]

    cs.RO cs.CV cs.HC

    Benchmarking Tesla's Traffic Light and Stop Sign Control: Field Dataset and Behavior Insights

    Authors: Zheng Li, Peng Zhang, Shixiao Liang, Hang Zhou, Chengyuan Ma, Handong Yao, Qianwen Li, Xiaopeng Li

    Abstract: Understanding how Advanced Driver-Assistance Systems (ADAS) interact with Traffic Control Devices (TCDs) is critical for assessing their influence on traffic operations, yet this interaction has received little focused empirical study. This paper presents a field dataset and behavioral analysis of Tesla's Traffic Light and Stop Sign Control (TLSSC), a mature ADAS that perceives traffic lights and…

    Submitted 31 October, 2025; originally announced December 2025.

  3. arXiv:2512.07228  [pdf, ps, other]

    cs.CV cs.AI cs.CR cs.LG

    Towards Robust Protective Perturbation against DeepFake Face Swapping

    Authors: Hengyang Yao, Lin Li, Ke Sun, Jianing Qiu, Huiping Chen

    Abstract: DeepFake face swapping enables highly realistic identity forgeries, posing serious privacy and security risks. A common defence embeds invisible perturbations into images, but these are fragile and often destroyed by basic transformations such as compression or resizing. In this paper, we first conduct a systematic analysis of 30 transformations across six categories and show that protection robus…

    Submitted 8 December, 2025; originally announced December 2025.

  4. arXiv:2512.06258  [pdf, ps, other]

    cs.CV

    Knowing the Answer Isn't Enough: Fixing Reasoning Path Failures in LVLMs

    Authors: Chaoyang Wang, Yangfan He, Yiyang Zhou, Yixuan Wang, Jiaqi Liu, Peng Xia, Zhengzhong Tu, Mohit Bansal, Huaxiu Yao

    Abstract: We reveal a critical yet underexplored flaw in Large Vision-Language Models (LVLMs): even when these models know the correct answer, they frequently arrive there through incorrect reasoning paths. The core issue is not a lack of knowledge, but a path selection bias within the vast reasoning search space. Although LVLMs are often capable of sampling correct solution trajectories, they disproportion…

    Submitted 5 December, 2025; originally announced December 2025.

  5. arXiv:2511.22935  [pdf, ps, other]

    cs.LG cs.AI

    EnECG: Efficient Ensemble Learning for Electrocardiogram Multi-task Foundation Model

    Authors: Yuhao Xu, Xiaoda Wang, Jiaying Lu, Sirui Ding, Defu Cao, Huaxiu Yao, Yan Liu, Xiao Hu, Carl Yang

    Abstract: Electrocardiogram (ECG) analysis plays a vital role in the early detection, monitoring, and management of various cardiovascular conditions. While existing models have achieved notable success in ECG interpretation, they fail to leverage the interrelated nature of various cardiac abnormalities. Conversely, developing a specific model capable of extracting all relevant features for multiple ECG tas…

    Submitted 28 November, 2025; originally announced November 2025.

  6. arXiv:2511.19900  [pdf, ps, other]

    cs.CV cs.AI

    Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

    Authors: Jiaqi Liu, Kaiwen Xiong, Peng Xia, Yiyang Zhou, Haonian Ji, Lu Feng, Siwei Han, Mingyu Ding, Huaxiu Yao

    Abstract: Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet, purely text-based self-evaluation struggles to verify complex…

    Submitted 26 November, 2025; v1 submitted 24 November, 2025; originally announced November 2025.

  7. arXiv:2511.18608  [pdf, ps, other]

    cs.SE cs.CR

    From Reviewers' Lens: Understanding Bug Bounty Report Invalid Reasons with LLMs

    Authors: Jiangrui Zheng, Yingming Zhou, Ali Abdullah Ahmad, Hanqing Yao, Xueqing Liu

    Abstract: Bug bounty platforms (e.g., HackerOne, BugCrowd) leverage crowd-sourced vulnerability discovery to improve continuous coverage, reduce the cost of discovery, and serve as an integral complement to internal red teams. With the rise of AI-generated bug reports, little work exists to help bug hunters understand why these reports are labeled as invalid. To improve report quality and reduce reviewers'…

    Submitted 23 November, 2025; originally announced November 2025.

    Comments: 10 pages, 4 figures

  8. arXiv:2511.17400  [pdf, ps, other]

    cs.CV cs.AI

    Sparse Mixture-of-Experts for Multi-Channel Imaging: Are All Channel Interactions Required?

    Authors: Sukwon Yun, Heming Yao, Burkhard Hoeckendorf, David Richmond, Aviv Regev, Russell Littman

    Abstract: Vision Transformers ($\text{ViTs}$) have become the backbone of vision foundation models, yet their optimization for multi-channel domains - such as cell painting or satellite imagery - remains underexplored. A key challenge in these domains is capturing interactions between channels, as each channel carries different information. While existing works have shown efficacy by treating each channel i…

    Submitted 21 November, 2025; originally announced November 2025.

    Comments: This has been accepted at the NeurIPS AI4Science Workshop 2025

  9. arXiv:2511.16043  [pdf, ps, other]

    cs.LG

    Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning

    Authors: Peng Xia, Kaide Zeng, Jiaqi Liu, Can Qin, Fang Wu, Yiyang Zhou, Caiming Xiong, Huaxiu Yao

    Abstract: Large Language Model (LLM) Agents, often trained with Reinforcement Learning (RL), are constrained by a dependency on human-curated data, limiting scalability and tethering AI to human knowledge. Existing self-evolution frameworks offer an alternative but are typically restricted by the model's inherent capabilities and single-round interactions, hindering the development of complex curricula invo…

    Submitted 20 November, 2025; originally announced November 2025.

  10. arXiv:2511.11944  [pdf, ps, other]

    cs.CV

    From Events to Clarity: The Event-Guided Diffusion Framework for Dehazing

    Authors: Ling Wang, Yunfan Lu, Wenzong Ma, Huizai Yao, Pengteng Li, Hui Xiong

    Abstract: Clear imaging under hazy conditions is a critical task. Prior-based and neural methods have improved results. However, they operate on RGB frames, which suffer from limited dynamic range. Therefore, dehazing remains ill-posed and can erase structure and illumination details. To address this, we use event cameras for dehazing for the \textbf{first time}. Event cameras offer much higher HDR (…

    Submitted 14 November, 2025; originally announced November 2025.

    Comments: 11 pages, 8 figures. Completed in April 2025

  11. arXiv:2511.10020  [pdf, ps, other]

    cs.CV cs.AI

    Anomagic: Crossmodal Prompt-driven Zero-shot Anomaly Generation

    Authors: Yuxin Jiang, Wei Luo, Hui Zhang, Qiyu Chen, Haiming Yao, Weiming Shen, Yunkang Cao

    Abstract: We propose Anomagic, a zero-shot anomaly generation method that produces semantically coherent anomalies without requiring any exemplar anomalies. By unifying both visual and textual cues through a crossmodal prompt encoding scheme, Anomagic leverages rich contextual information to steer an inpainting-based generation pipeline. A subsequent contrastive refinement strategy enforces precise alignmen…

    Submitted 13 November, 2025; originally announced November 2025.

  12. arXiv:2511.08892  [pdf, ps, other]

    cs.AI

    Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds

    Authors: Weihao Tan, Xiangyang Li, Yunhao Fang, Heyuan Yao, Shi Yan, Hao Luo, Tenglong Ao, Huihui Li, Hongbin Ren, Bairen Yi, Yujia Qin, Bo An, Libin Liu, Guang Shi

    Abstract: We introduce Lumine, the first open recipe for developing generalist agents capable of completing hours-long complex missions in real time within challenging 3D open-world environments. Lumine adopts a human-like interaction paradigm that unifies perception, reasoning, and action in an end-to-end manner, powered by a vision-language model. It processes raw pixels at 5 Hz to produce precise 30 Hz k…

    Submitted 11 November, 2025; originally announced November 2025.

  13. arXiv:2511.07301  [pdf, ps, other]

    cs.CV cs.AI

    Beyond Boundaries: Leveraging Vision Foundation Models for Source-Free Object Detection

    Authors: Huizai Yao, Sicheng Zhao, Pengteng Li, Yi Cui, Shuo Lu, Weiyu Guo, Yunfan Lu, Yijie Xu, Hui Xiong

    Abstract: Source-Free Object Detection (SFOD) aims to adapt a source-pretrained object detector to a target domain without access to source data. However, existing SFOD methods predominantly rely on internal knowledge from the source model, which limits their capacity to generalize across domains and often results in biased pseudo-labels, thereby hindering both transferability and discriminability. In contr…

    Submitted 10 November, 2025; originally announced November 2025.

    Comments: Accepted to AAAI 2026. Extended version with full Appendix

  14. arXiv:2511.06101  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Adapting Web Agents with Synthetic Supervision

    Authors: Zhaoyang Wang, Yiming Liang, Xuchao Zhang, Qianhui Wu, Siwei Han, Anson Bastos, Rujia Wang, Chetan Bansal, Baolin Peng, Jianfeng Gao, Saravan Rajmohan, Huaxiu Yao

    Abstract: Web agents struggle to adapt to new websites due to the scarcity of environment-specific tasks and demonstrations. Recent works have explored synthetic data generation to address this challenge; however, they suffer from data quality issues where synthesized tasks contain hallucinations that cannot be executed, and collected trajectories are noisy with redundant or misaligned actions. In this pape…

    Submitted 8 November, 2025; originally announced November 2025.

    Comments: 19 pages, 6 figures

  15. arXiv:2511.04215  [pdf, ps, other]

    cs.CR cs.CL

    Black-Box Guardrail Reverse-engineering Attack

    Authors: Hongwei Yao, Yun Xia, Shuo Shao, Haoran Shi, Tong Qiao, Cong Wang

    Abstract: Large language models (LLMs) increasingly employ guardrails to enforce ethical, legal, and application-specific constraints on their outputs. While effective at mitigating harmful responses, these guardrails introduce a new class of vulnerabilities by exposing observable decision patterns. In this work, we present the first study of black-box LLM guardrail reverse-engineering attacks. We propose G…

    Submitted 6 November, 2025; originally announced November 2025.

  16. arXiv:2511.03773  [pdf, ps, other]

    cs.AI

    Scaling Agent Learning via Experience Synthesis

    Authors: Zhaorun Chen, Zhuokai Zhao, Kai Zhang, Bo Liu, Qi Qi, Yifan Wu, Tarun Kalluri, Sara Cao, Yuanhao Xiong, Haibo Tong, Huaxiu Yao, Hengduo Li, Jiacheng Zhu, Xian Li, Dawn Song, Bo Li, Jason Weston, Dat Huynh

    Abstract: While reinforcement learning (RL) can empower autonomous agents by enabling self-improvement through interaction, its practical adoption remains challenging due to costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity, all of which obstruct the collection of scalable experience data. To address these challenges, we introduce DreamGym, the first unified f…

    Submitted 10 November, 2025; v1 submitted 5 November, 2025; originally announced November 2025.

  17. arXiv:2511.02779  [pdf, ps, other]

    cs.CV

    When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought

    Authors: Yiyang Zhou, Haoqin Tu, Zijun Wang, Zeyu Wang, Niklas Muennighoff, Fan Nie, Yejin Choi, James Zou, Chaorui Deng, Shen Yan, Haoqi Fan, Cihang Xie, Huaxiu Yao, Qinghao Ye

    Abstract: We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images - such as sketches, structural diagrams, or path drawings - to guide their reasoning process. This setup closely…

    Submitted 4 November, 2025; originally announced November 2025.

    Comments: 28 pages, 15 figures

  18. arXiv:2511.00269  [pdf, ps, other]

    cs.CV cs.AI

    FedReplay: A Feature Replay Assisted Federated Transfer Learning Framework for Efficient and Privacy-Preserving Smart Agriculture

    Authors: Long Li, Jiajia Li, Dong Chen, Lina Pu, Haibo Yao, Yanbo Huang

    Abstract: Accurate classification plays a pivotal role in smart agriculture, enabling applications such as crop monitoring, fruit recognition, and pest detection. However, conventional centralized training often requires large-scale data collection, which raises privacy concerns, while standard federated learning struggles with non-independent and identically distributed (non-IID) data and incurs high commu…

    Submitted 31 October, 2025; originally announced November 2025.

  19. arXiv:2510.26184  [pdf, ps, other]

    cs.LG cs.CY

    A Game-Theoretic Spatio-Temporal Reinforcement Learning Framework for Collaborative Public Resource Allocation

    Authors: Songxin Lei, Qiongyan Wang, Yanchen Zhu, Hanyu Yao, Sijie Ruan, Weilin Ruan, Yuyu Luo, Huaming Wu, Yuxuan Liang

    Abstract: Public resource allocation involves the efficient distribution of resources, including urban infrastructure, energy, and transportation, to effectively meet societal demands. However, existing methods focus on optimizing the movement of individual resources independently, without considering their capacity constraints. To address this limitation, we propose a novel and more practical problem: Coll…

    Submitted 30 October, 2025; originally announced October 2025.

  20. arXiv:2510.26043  [pdf, ps, other]

    stat.ML cs.LG

    $L_1$-norm Regularized Indefinite Kernel Logistic Regression

    Authors: Shaoxin Wang, Hanjing Yao

    Abstract: Kernel logistic regression (KLR) is a powerful classification method widely applied across diverse domains. In many real-world scenarios, indefinite kernels capture more domain-specific structural information than positive definite kernels. This paper proposes a novel $L_1$-norm regularized indefinite kernel logistic regression (RIKLR) model, which extends the existing IKLR framework by introducin…

    Submitted 29 October, 2025; originally announced October 2025.

    Comments: 17 pages, 1 figure

  21. arXiv:2510.23691  [pdf, ps, other]

    cs.AI

    Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents

    Authors: Zihao Wang, Xujing Li, Yining Ye, Junjie Fang, Haoming Wang, Longxiang Liu, Shihao Liang, Junting Lu, Zhiyong Wu, Jiazhan Feng, Wanjun Zhong, Zili Li, Yu Wang, Yu Miao, Bo Zhou, Yuanfan Li, Hao Wang, Zhongkai Zhao, Faming Wu, Zhengxuan Jiang, Weihao Tan, Heyuan Yao, Shi Yan, Xiangyang Li, Yitao Liang, et al. (2 additional authors not shown)

    Abstract: We present Game-TARS, a generalist game agent trained with a unified, scalable action space anchored to human-aligned native keyboard-mouse inputs. Unlike API- or GUI-based approaches, this paradigm enables large-scale continual pre-training across heterogeneous domains, including OS, web, and simulation games. Game-TARS is pre-trained on over 500B tokens with diverse trajectories and multimodal d…

    Submitted 27 October, 2025; originally announced October 2025.

  22. arXiv:2510.14251  [pdf, ps, other]

    cs.CV

    MACE: Mixture-of-Experts Accelerated Coordinate Encoding for Large-Scale Scene Localization and Rendering

    Authors: Mingkai Liu, Dikai Fan, Haohua Que, Haojia Gao, Xiao Liu, Shuxue Peng, Meixia Lin, Shengyu Gu, Ruicong Ye, Wanli Qiu, Handong Yao, Ruopeng Zhang, Xianliang Huang

    Abstract: Efficient localization and high-quality rendering in large-scale scenes remain a significant challenge due to the computational cost involved. While Scene Coordinate Regression (SCR) methods perform well in small-scale localization, they are limited by the capacity of a single network when extended to large-scale scenes. To address these challenges, we propose the Mixed Expert-based Accelerated Co…

    Submitted 15 October, 2025; originally announced October 2025.

    Comments: 8 pages

  23. Source-Free Object Detection with Detection Transformer

    Authors: Huizai Yao, Sicheng Zhao, Shuo Lu, Hui Chen, Yangyang Li, Guoping Liu, Tengfei Xing, Chenggang Yan, Jianhua Tao, Guiguang Ding

    Abstract: Source-Free Object Detection (SFOD) enables knowledge transfer from a source domain to an unsupervised target domain for object detection without access to source data. Most existing SFOD approaches are either confined to conventional object detection (OD) models like Faster R-CNN or designed as general solutions without tailored adaptations for novel OD architectures, especially Detection Transfo…

    Submitted 13 October, 2025; originally announced October 2025.

    Comments: IEEE Transactions on Image Processing

  24. arXiv:2510.10991  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    A Survey on Agentic Multimodal Large Language Models

    Authors: Huanjin Yao, Ruifei Zhang, Jiaxing Huang, Jingyi Zhang, Yibo Wang, Bo Fang, Ruolin Zhu, Yongcheng Jing, Shunyu Liu, Guanbin Li, Dacheng Tao

    Abstract: With the recent emergence of revolutionary autonomous agentic systems, the research community is witnessing a significant shift from traditional static, passive, and domain-specific AI agents toward more dynamic, proactive, and generalizable agentic AI. Motivated by the growing interest in agentic AI and its potential trajectory toward AGI, we present a comprehensive survey on Agentic Multimodal Large…

    Submitted 13 October, 2025; originally announced October 2025.

  25. arXiv:2510.10223  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    You only need 4 extra tokens: Synergistic Test-time Adaptation for LLMs

    Authors: Yijie Xu, Huizai Yao, Zhiyu Guo, Weiyu Guo, Pengteng Li, Aiwei Liu, Xuming Hu, Hui Xiong

    Abstract: Large language models (LLMs) are increasingly deployed in specialized domains such as finance, medicine, and agriculture, where they face significant distribution shifts from their training data. Domain-specific fine-tuning can mitigate this challenge but relies on high-quality labeled data that is expensive and slow to collect in expertise-limited settings. We study label-free test-time adaptatio…

    Submitted 11 October, 2025; originally announced October 2025.

    Comments: Under Review

  26. arXiv:2510.08987  [pdf, ps, other]

    cs.AI

    Towards Efficient Multimodal Unified Reasoning Model via Model Merging

    Authors: Qixiang Yin, Huanjin Yao, Jianghao Chen, Jiaxing Huang, Zhicheng Zhao, Fei Su

    Abstract: Although Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse tasks, they encounter challenges in terms of reasoning efficiency, large model size and overthinking. However, existing lightweight MLLMs lack the capability to balance high efficiency and performance at a small scale. To this end, we propose Tiny-R1V, a novel lightweight 3B model that achiev…

    Submitted 20 November, 2025; v1 submitted 10 October, 2025; originally announced October 2025.

    Comments: Technical report, Code will be available at https://github.com/buptyqx/Tiny-R1V

  27. arXiv:2510.06605  [pdf, ps, other]

    cs.CR cs.AI cs.CL

    Reading Between the Lines: Towards Reliable Black-box LLM Fingerprinting via Zeroth-order Gradient Estimation

    Authors: Shuo Shao, Yiming Li, Hongwei Yao, Yifei Chen, Yuchen Yang, Zhan Qin

    Abstract: The substantial investment required to develop Large Language Models (LLMs) makes them valuable intellectual property, raising significant concerns about copyright protection. LLM fingerprinting has emerged as a key technique to address this, which aims to verify a model's origin by extracting an intrinsic, unique signature (a "fingerprint") and comparing it to that of a source model to identify i…

    Submitted 7 October, 2025; originally announced October 2025.

  28. arXiv:2510.04860  [pdf, ps, other]

    cs.LG cs.AI

    Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails

    Authors: Siwei Han, Jiaqi Liu, Yaofeng Su, Wenbo Duan, Xinyuan Liu, Cihang Xie, Mohit Bansal, Mingyu Ding, Linjun Zhang, Huaxiu Yao

    Abstract: As Large Language Model (LLM) agents increasingly gain self-evolutionary capabilities to adapt and refine their strategies through real-world interaction, their long-term reliability becomes a critical concern. We identify the Alignment Tipping Process (ATP), a critical post-deployment risk unique to self-evolving LLM agents. Unlike training-time failures, ATP arises when continual interaction dri…

    Submitted 6 October, 2025; originally announced October 2025.

  29. Social Agent: Mastering Dyadic Nonverbal Behavior Generation via Conversational LLM Agents

    Authors: Zeyi Zhang, Yanju Zhou, Heyuan Yao, Tenglong Ao, Xiaohang Zhan, Libin Liu

    Abstract: We present Social Agent, a novel framework for synthesizing realistic and contextually appropriate co-speech nonverbal behaviors in dyadic conversations. In this framework, we develop an agentic system driven by a Large Language Model (LLM) to direct the conversation flow and determine appropriate interactive behaviors for both participants. Additionally, we propose a novel dual-person gesture gen…

    Submitted 6 October, 2025; originally announced October 2025.

    Comments: SIGGRAPH ASIA 2025 (Conference Track); Project page: https://pku-mocca.github.io/Social-Agent-Page/

  30. arXiv:2509.24896  [pdf, ps, other]

    cs.CV

    DAM: Dual Active Learning with Multimodal Foundation Model for Source-Free Domain Adaptation

    Authors: Xi Chen, Hongxun Yao, Zhaopan Xu, Kui Jiang

    Abstract: Source-free active domain adaptation (SFADA) enhances knowledge transfer from a source model to an unlabeled target domain using limited manual labels selected via active learning. While recent domain adaptation studies have introduced Vision-and-Language (ViL) models to improve pseudo-label quality or feature alignment, they often treat ViL-based and data supervision as separate sources, lacking…

    Submitted 29 September, 2025; originally announced September 2025.

    Comments: 5 pages

  31. arXiv:2509.22888  [pdf, ps, other]

    cs.AI cs.CL

    JE-IRT: A Geometric Lens on LLM Abilities through Joint Embedding Item Response Theory

    Authors: Louie Hong Yao, Nicholas Jarvis, Tiffany Zhan, Saptarshi Ghosh, Linfeng Liu, Tianyu Jiang

    Abstract: Standard LLM evaluation practices compress diverse abilities into single scores, obscuring their inherently multidimensional nature. We present JE-IRT, a geometric item-response framework that embeds both LLMs and questions in a shared space. For question embeddings, the direction encodes semantics and the norm encodes difficulty, while correctness on each question is determined by the geometric i…

    Submitted 26 September, 2025; originally announced September 2025.

    Comments: 22 pages, 10 figures, 5 tables

  32. arXiv:2509.21882  [pdf, ps, other]

    cs.LG cs.AI

    Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards

    Authors: Aaron Tu, Weihao Xuan, Heli Qi, Xu Huang, Qingcheng Zeng, Shayan Talaei, Yijia Xiao, Peng Xia, Xiangru Tang, Yuchen Zhuang, Bing Hu, Hanqun Cao, Wenqi Shi, Tianang Leng, Rui Yang, Yingjian Chen, Ziqi Wang, Irene Li, Nan Liu, Huaxiu Yao, Li Erran Li, Ge Liu, Amin Saberi, Naoto Yokoya, Jure Leskovec, et al. (2 additional authors not shown)

    Abstract: Reinforcement learning with verifiable rewards (RLVR) is a practical and scalable approach to enhancing large language models in areas such as math, code, and other structured tasks. Two questions motivate this paper: how much of the reported gains survive under strictly parity-controlled evaluation, and whether RLVR is cost-free or exacts a measurable tax. We argue that progress is real, but gain…

    Submitted 26 September, 2025; originally announced September 2025.

  33. arXiv:2509.20271  [pdf, ps, other]

    cs.CV

    A Versatile Foundation Model for AI-enabled Mammogram Interpretation

    Authors: Fuxiang Huang, Jiayi Zhu, Yunfang Yu, Yu Xie, Yuan Guo, Qingcong Kong, Mingxiang Wu, Xinrui Jiang, Shu Yang, Jiabo Ma, Ziyi Liu, Zhe Xu, Zhixuan Chen, Yujie Tan, Zifan He, Luhui Mao, Xi Wang, Junlin Hou, Lei Zhang, Qiong Luo, Zhenhui Li, Herui Yao, Hao Chen

    Abstract: Breast cancer is the most commonly diagnosed cancer and the leading cause of cancer-related mortality in women globally. Mammography is essential for the early detection and diagnosis of breast lesions. Despite recent progress in foundation models (FMs) for mammogram analysis, their clinical translation remains constrained by several fundamental limitations, including insufficient diversity in tra…

    Submitted 24 September, 2025; originally announced September 2025.

    Comments: 64 pages, 7 figures, 40 tables

  34. arXiv:2509.19261  [pdf, ps, other]

    cs.RO

    Imitation-Guided Bimanual Planning for Stable Manipulation under Changing External Forces

    Authors: Kuanqi Cai, Chunfeng Wang, Zeqi Li, Haowen Yao, Weinan Chen, Luis Figueredo, Aude Billard, Arash Ajoudani

    Abstract: Robotic manipulation in dynamic environments often requires seamless transitions between different grasp types to maintain stability and efficiency. However, achieving smooth and adaptive grasp transitions remains a challenge, particularly when dealing with external forces and complex motion constraints. Existing grasp transition strategies often fail to account for varying external forces and do…

    Submitted 23 September, 2025; originally announced September 2025.

    Journal ref: IROS 2025

  35. arXiv:2509.18898  [pdf, ps, other]

    cs.CV

    DeblurSplat: SfM-free 3D Gaussian Splatting with Event Camera for Robust Deblurring

    Authors: Pengteng Li, Yunfan Lu, Pinhao Song, Weiyu Guo, Huizai Yao, F. Richard Yu, Hui Xiong

    Abstract: In this paper, we propose the first Structure-from-Motion (SfM)-free deblurring 3D Gaussian Splatting method via event camera, dubbed DeblurSplat. We address the motion-deblurring problem in two ways. First, we leverage the pretrained capability of the dense stereo module (DUSt3R) to directly obtain accurate initial point clouds from blurred images. Without calculating camera poses as an intermedi…

    Submitted 23 September, 2025; originally announced September 2025.

  36. arXiv:2509.18849  [pdf, ps, other]

    cs.AI

    MAPO: Mixed Advantage Policy Optimization

    Authors: Wenke Huang, Quan Zhang, Yiyang Fang, Jian Liang, Xuankun Rong, Huanjin Yao, Guancheng Wan, Ke Liang, Wenwen He, Mingjun Li, Leszek Rutkowski, Mang Ye, Bo Du, Dacheng Tao

    Abstract: Recent advances in reinforcement learning for foundation models, such as Group Relative Policy Optimization (GRPO), have significantly improved the performance of foundation models on reasoning tasks. Notably, the advantage function serves as a central mechanism in GRPO for ranking the trajectory importance. However, existing explorations encounter both advantage reversion and advantage mirror pro…

    Submitted 24 September, 2025; v1 submitted 23 September, 2025; originally announced September 2025.

  37. arXiv:2509.16087  [pdf, ps, other]

    cs.CV cs.AI

    See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model

    Authors: Pengteng Li, Pinhao Song, Wuyang Li, Weiyu Guo, Huizai Yao, Yijie Xu, Dugang Liu, Hui Xiong

    Abstract: We introduce SEE&TREK, the first training-free prompting framework tailored to enhance the spatial understanding of Multimodal Large Language Models (MLLMs) under vision-only constraints. While prior efforts have incorporated modalities like depth or point clouds to improve spatial reasoning, purely visual-spatial understanding remains underexplored. SEE&TREK addresses this gap by focusing on two c…

    Submitted 19 September, 2025; originally announced September 2025.

    Comments: Accepted by NeurIPS 2025

  38. arXiv:2509.15473  [pdf, ps, other]

    eess.AS cs.CL cs.LG cs.SD

    Breathing and Semantic Pause Detection and Exertion-Level Classification in Post-Exercise Speech

    Authors: Yuyu Wang, Wuyue Xia, Huaxiu Yao, Jingping Nie

    Abstract: Post-exercise speech contains rich physiological and linguistic cues, often marked by semantic pauses, breathing pauses, and combined breathing-semantic pauses. Detecting these events enables assessment of recovery rate, lung function, and exertion-related abnormalities. However, existing works on identifying and distinguishing different types of pauses in this context are limited. In this work, b…

    Submitted 18 September, 2025; originally announced September 2025.

    Comments: 6 pages, 3rd ACM International Workshop on Intelligent Acoustic Systems and Applications (IASA 25)

  39. arXiv:2509.03117  [pdf, ps, other]

    cs.CR

    PromptCOS: Towards Content-only System Prompt Copyright Auditing for LLMs

    Authors: Yuchen Yang, Yiming Li, Hongwei Yao, Enhao Huang, Shuo Shao, Yuyi Wang, Zhibo Wang, Dacheng Tao, Zhan Qin

    Abstract: System prompts are critical for shaping the behavior and output quality of large language model (LLM)-based applications, driving substantial investment in optimizing high-quality prompts beyond traditional handcrafted designs. However, as system prompts become valuable intellectual property, they are increasingly vulnerable to prompt theft and unauthorized use, highlighting the urgent need for ef…

    Submitted 18 November, 2025; v1 submitted 3 September, 2025; originally announced September 2025.

  40. arXiv:2508.20655  [pdf, ps, other]

    cs.CV cs.CL

    Improving Alignment in LVLMs with Debiased Self-Judgment

    Authors: Sihan Yang, Chenhang Cui, Zihao Zhao, Yiyang Zhou, Weilong Yan, Ying Wei, Huaxiu Yao

    Abstract: The rapid advancements in Large Language Models (LLMs) and Large Visual-Language Models (LVLMs) have opened up new opportunities for integrating visual and linguistic modalities. However, effectively aligning these modalities remains challenging, often leading to hallucinations--where generated outputs are not grounded in the visual input--and raising safety concerns across various domains. Existi…

    Submitted 11 September, 2025; v1 submitted 28 August, 2025; originally announced August 2025.

    Comments: EMNLP 2025 Findings

  41. arXiv:2508.19843  [pdf, ps, other]

    cs.CR cs.AI cs.CL

    SoK: Large Language Model Copyright Auditing via Fingerprinting

    Authors: Shuo Shao, Yiming Li, Yu He, Hongwei Yao, Wenyuan Yang, Dacheng Tao, Zhan Qin

    Abstract: The broad capabilities and substantial resources required to train Large Language Models (LLMs) make them valuable intellectual property, yet they remain vulnerable to copyright infringement, such as unauthorized use and model theft. LLM fingerprinting, a non-intrusive technique that compares the distinctive features (i.e., fingerprint) of LLMs to identify whether an LLM is derived from another, o…

    Submitted 17 November, 2025; v1 submitted 27 August, 2025; originally announced August 2025.

  42. arXiv:2508.19611  [pdf, ps, other]

    cs.AI cs.CL

    Instructional Agents: LLM Agents on Automated Course Material Generation for Teaching Faculties

    Authors: Huaiyuan Yao, Wanpeng Xu, Justin Turnau, Nadia Kellam, Hua Wei

    Abstract: Preparing high-quality instructional materials remains a labor-intensive process that often requires extensive coordination among teaching faculty, instructional designers, and teaching assistants. In this work, we present Instructional Agents, a multi-agent large language model (LLM) framework designed to automate end-to-end course material generation, including syllabus creation, lecture scripts…

    Submitted 31 August, 2025; v1 submitted 27 August, 2025; originally announced August 2025.

    Comments: 18 pages, 9 figures

    ACM Class: I.2.7

  43. arXiv:2508.17580  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    UQ: Assessing Language Models on Unsolved Questions

    Authors: Fan Nie, Ken Ziyu Liu, Zihao Wang, Rui Sun, Wei Liu, Weijia Shi, Huaxiu Yao, Linjun Zhang, Andrew Y. Ng, James Zou, Sanmi Koyejo, Yejin Choi, Percy Liang, Niklas Muennighoff

    Abstract: Benchmarks shape progress in AI research. A useful benchmark should be both difficult and realistic: questions should challenge frontier models while also reflecting real-world usage. Yet, current paradigms face a difficulty-realism tension: exam-style benchmarks are often made artificially difficult with limited real-world value, while benchmarks based on real user interaction often skew toward e…

    Submitted 24 August, 2025; originally announced August 2025.

    Comments: FN, KZL, and NM are project co-leads and contributed equally. Project website: https://uq.stanford.edu

  44. arXiv:2508.17380  [pdf, ps, other]

    cs.AI

    Mimicking the Physicist's Eye: A VLM-centric Approach for Physics Formula Discovery

    Authors: Jiaqi Liu, Songning Lai, Pengze Li, Di Yu, Wenjie Zhou, Yiyang Zhou, Peng Xia, Zijun Wang, Xi Chen, Shixiang Tang, Lei Bai, Wanli Ouyang, Mingyu Ding, Huaxiu Yao, Aoran Wang

    Abstract: Automated discovery of physical laws from observational data in the real world is a grand challenge in AI. Current methods, relying on symbolic regression or LLMs, are limited to uni-modal data and overlook the rich, visual phenomenological representations of motion that are indispensable to physicists. This "sensory deprivation" severely weakens their ability to interpret the inherent spatio-temp…

    Submitted 24 August, 2025; originally announced August 2025.

  45. arXiv:2508.10566  [pdf, ps, other]

    cs.CV

    HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis

    Authors: Shiyu Liu, Kui Jiang, Xianming Liu, Hongxun Yao, Xiaocheng Feng

    Abstract: Audio-driven talking head video generation enhances user engagement in human-computer interaction. However, current methods frequently produce videos with motion blur and lip jitter, primarily due to their reliance on implicit modeling of audio-facial motion correlations--an approach lacking explicit articulatory priors (i.e., anatomical guidance for speech-related facial movements). To overcome t…

    Submitted 30 October, 2025; v1 submitted 14 August, 2025; originally announced August 2025.

  46. arXiv:2508.09123  [pdf, ps, other]

    cs.AI cs.CV

    OpenCUA: Open Foundations for Computer-Use Agents

    Authors: Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Boyuan Zheng, Peihang Li, Fangyu Lei, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Jiarui Hu, Yuyan Wang, Jixuan Chen, et al. (17 additional authors not shown)

    Abstract: Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. As these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open…

    Submitted 4 October, 2025; v1 submitted 12 August, 2025; originally announced August 2025.

    Comments: Update author list, modify first page format, correct typos

  47. arXiv:2508.06904  [pdf, ps, other]

    cs.CV

    An Instance-Aware Prompting Framework for Training-free Camouflaged Object Segmentation

    Authors: Chao Yin, Jide Li, Hang Yao, Xiaoqiang Li

    Abstract: Training-free Camouflaged Object Segmentation (COS) seeks to segment camouflaged objects without task-specific training, by automatically generating visual prompts to guide the Segment Anything Model (SAM). However, existing pipelines mostly yield semantic-level prompts, which drive SAM to coarse semantic masks and struggle to handle multiple discrete camouflaged instances effectively. To address…

    Submitted 12 November, 2025; v1 submitted 9 August, 2025; originally announced August 2025.

    Comments: under review

  48. arXiv:2508.06418  [pdf, ps, other]

    cs.CL

    Quantifying Conversation Drift in MCP via Latent Polytope

    Authors: Haoran Shi, Hongwei Yao, Shuo Shao, Shaopeng Jiao, Ziqi Peng, Zhan Qin, Cong Wang

    Abstract: The Model Context Protocol (MCP) enhances large language models (LLMs) by integrating external tools, enabling dynamic aggregation of real-time data to improve task execution. However, its non-isolated execution context introduces critical security and privacy risks. In particular, adversarially crafted content can induce tool poisoning or indirect prompt injection, leading to conversation hijacki…

    Submitted 8 August, 2025; originally announced August 2025.

  49. arXiv:2508.04945  [pdf, ps, other]

    cs.CL cs.AI cs.CV

    Towards Robust Evaluation of Visual Activity Recognition: Resolving Verb Ambiguity with Sense Clustering

    Authors: Louie Hong Yao, Nicholas Jarvis, Tianyu Jiang

    Abstract: Evaluating visual activity recognition systems is challenging due to inherent ambiguities in verb semantics and image interpretation. When describing actions in images, synonymous verbs can refer to the same event (e.g., brushing vs. grooming), while different perspectives can lead to equally valid but distinct verb choices (e.g., piloting vs. operating). Standard exact-match evaluation, which rel…

    Submitted 6 August, 2025; originally announced August 2025.

    Comments: 18 pages, 5 figures

  50. arXiv:2508.03447  [pdf, ps, other]

    cs.CV

    CoPS: Conditional Prompt Synthesis for Zero-Shot Anomaly Detection

    Authors: Qiyu Chen, Zhen Qu, Wei Luo, Haiming Yao, Yunkang Cao, Yuxin Jiang, Yinan Duan, Huiyuan Luo, Chengkan Lv, Zhengtao Zhang

    Abstract: Recently, large pre-trained vision-language models have shown remarkable performance in zero-shot anomaly detection (ZSAD). With fine-tuning on a single auxiliary dataset, the model enables cross-category anomaly detection on diverse datasets covering industrial defects and medical lesions. Compared to manually designed prompts, prompt learning eliminates the need for expert knowledge and trial-an…

    Submitted 5 August, 2025; originally announced August 2025.

    Comments: 19 pages, 33 figures, 14 tables