Showing 1–50 of 121 results for author: Ban, Y

Searching in archive cs.

  1. arXiv:2604.00860

    cs.LG

    Policy Improvement Reinforcement Learning

    Authors: Huaiyang Wang, Xiaojie Li, Deqing Wang, Haoyi Zhou, Zixuan Huang, Yaodong Yang, Jianxin Li, Yikun Ban

    Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a central post-training paradigm for improving the reasoning capabilities of large language models. Yet existing methods share a common blind spot: they optimize policies based on instantaneous group-level or batch-level statistics without ever verifying whether the resulting update actually improved the model. This open-loop design…

    Submitted 1 April, 2026; originally announced April 2026.

  2. arXiv:2603.26720

    cs.RO cs.AI

    SutureAgent: Learning Surgical Trajectories via Goal-conditioned Offline RL in Pixel Space

    Authors: Huanrong Liu, Chunlin Tian, Tongyu Jia, Tailai Zhou, Qin Liu, Yu Gao, Yutong Ban, Yun Gu, Guy Rosman, Xin Ma, Qingbiao Li

    Abstract: Predicting surgical needle trajectories from endoscopic video is critical for robot-assisted suturing, enabling anticipatory planning, real-time guidance, and safer motion execution. Existing methods that directly learn motion distributions from visual observations tend to overlook the sequential dependency among adjacent motion steps. Moreover, sparse waypoint annotations often fail to provide su…

    Submitted 18 March, 2026; originally announced March 2026.

  3. arXiv:2603.25565

    cs.CV

    GeoHeight-Bench: Towards Height-Aware Multimodal Reasoning in Remote Sensing

    Authors: Xuran Hu, Zhitong Xiong, Zhongcheng Hong, Yifang Ban, Xiaoxiang Zhu, Wufan Zhao

    Abstract: Current Large Multimodal Models (LMMs) in Earth Observation typically neglect the critical "vertical" dimension, limiting their reasoning capabilities in complex remote sensing geometries and disaster scenarios where physical spatial structures often outweigh planar visual textures. To bridge this gap, we introduce a comprehensive evaluation framework dedicated to height-aware remote sensing under…

    Submitted 26 March, 2026; originally announced March 2026.

    Comments: 18 pages, 4 figures

    ACM Class: I.2.10

  4. arXiv:2603.21574

    cs.AI

    Adaptive Robust Estimator for Multi-Agent Reinforcement Learning

    Authors: Zhongyi Li, Wan Tian, Jingyu Chen, Kangyao Huang, Huiming Zhang, Hui Yang, Tao Ren, Jinyang Jiang, Yijie Peng, Yikun Ban, Fuzhen Zhuang

    Abstract: Multi-agent collaboration has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models, yet it suffers from interaction-level ambiguity that blurs generation, critique, and revision, making credit assignment across agents difficult. Moreover, policy optimization in this setting is vulnerable to heavy-tailed and noisy rewards, which can bias advantage estimat…

    Submitted 23 March, 2026; originally announced March 2026.

  5. arXiv:2603.21563

    cs.AI

    Counterfactual Credit Policy Optimization for Multi-Agent Collaboration

    Authors: Zhongyi Li, Wan Tian, Yikun Ban, Jinju Chen, Huiming Zhang, Yang Liu, Fuzhen Zhuang

    Abstract: Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles and aggregating diverse hypotheses. Yet, reinforcement learning (RL) for such systems is often undermined by credit assignment: a shared global reward obscures individual contributions, inflating update variance and encouraging free-riding. We introduce Counterfactual Credit Policy Optimiz…

    Submitted 23 March, 2026; originally announced March 2026.

  6. arXiv:2603.18544

    eess.IV cs.AI cs.CV

    SCISSR: Scribble-Conditioned Interactive Surgical Segmentation and Refinement

    Authors: Haonan Ping, Jian Jiang, Cheng Yuan, Qizhen Sun, Lv Wu, Yutong Ban

    Abstract: Accurate segmentation of tissues and instruments in surgical scenes is annotation-intensive due to irregular shapes, thin structures, specularities, and frequent occlusions. While SAM models support point, box, and mask prompts, points are often too sparse and boxes too coarse to localize such challenging targets. We present SCISSR, a scribble-promptable framework for interactive surgical scene se…

    Submitted 19 March, 2026; originally announced March 2026.

  7. arXiv:2603.16822

    cs.AI

    SurgΣ: A Spectrum of Large-Scale Multimodal Data and Foundation Models for Surgical Intelligence

    Authors: Zhitao Zeng, Mengya Xu, Jian Jiang, Pengfei Guo, Yunqiu Xu, Zhu Zhuo, Chang Han Low, Yufan He, Dong Yang, Chenxi Lin, Yiming Gu, Jiaxin Guo, Yutong Ban, Daguang Xu, Qi Dou, Yueming Jin

    Abstract: Surgical intelligence has the potential to improve the safety and consistency of surgical care, yet most existing surgical AI frameworks remain task-specific and struggle to generalize across procedures and institutions. Although multimodal foundation models, particularly multimodal large language models, have demonstrated strong cross-task capabilities across various medical domains, their advanc…

    Submitted 17 March, 2026; originally announced March 2026.

    MSC Class: 68T45 ACM Class: I.2.10

  8. arXiv:2603.12787

    cs.CV

    Generalized Recognition of Basic Surgical Actions Enables Skill Assessment and Vision-Language-Model-based Surgical Planning

    Authors: Mengya Xu, Daiyun Shen, Jie Zhang, Hon Chi Yip, Yujia Gao, Cheng Chen, Dillan Imans, Yonghao Long, Yiru Ye, Yixiao Liu, Rongyun Mai, Kai Chen, Hongliang Ren, Yutong Ban, Guangsuo Wang, Francis Wong, Chi-Fai Ng, Kee Yuan Ngiam, Russell H. Taylor, Daguang Xu, Yueming Jin, Qi Dou

    Abstract: Artificial intelligence, imaging, and large language models have the potential to transform surgical practice, training, and automation. Understanding and modeling of basic surgical actions (BSA), the fundamental unit of operation in any surgery, is important to drive the evolution of this field. In this paper, we present a BSA dataset comprising 10 basic actions across 6 surgical specialties with…

    Submitted 13 March, 2026; originally announced March 2026.

    Comments: 34 pages, 8 figures

  9. arXiv:2603.12430

    cs.CV

    Surg-R1: A Hierarchical Reasoning Foundation Model for Scalable and Interpretable Surgical Decision Support with Multi-Center Clinical Validation

    Authors: Jian Jiang, Chenxi Lin, Yiming Gu, Zengyi Qin, Zhitao Zeng, Kun Yuan, Yonghao Long, Xiang Xia, Cheng Yuan, Yuqi Wang, Zijie Yue, Kunyi Yang, Yuting Zhang, Zhu Zhuo, Dian Qin, Xin Wang, NG Chi Fai, Brian Anthony, Daguang Xu, Guy Rosman, Ozanan Meireles, Zizhen Zhang, Nicolas Padoy, Hesheng Wang, Qi Dou, et al. (2 additional authors not shown)

    Abstract: Surgical scene understanding demands not only accurate predictions but also interpretable reasoning that surgeons can verify against clinical expertise. However, existing surgical vision-language models generate predictions without reasoning chains, and general-purpose reasoning models fail on compositional surgical tasks without domain-specific knowledge. We present Surg-R1, a surgical Vision-Lan…

    Submitted 12 March, 2026; originally announced March 2026.

  10. arXiv:2603.10682

    cs.RO

    OnFly: Onboard Zero-Shot Aerial Vision-Language Navigation toward Safety and Efficiency

    Authors: Guiyong Zheng, Yueting Ban, Mingjie Zhang, Juepeng Zheng, Boyu Zhou

    Abstract: Aerial vision-language navigation (AVLN) enables UAVs to follow natural-language instructions in complex 3D environments. However, existing zero-shot AVLN methods often suffer from unstable single-stream Vision-Language Model decision-making, unreliable long-horizon progress monitoring, and a trade-off between safety and efficiency. We propose OnFly, a fully onboard, real-time framework for zero-s…

    Submitted 11 March, 2026; originally announced March 2026.

  11. arXiv:2603.09715

    cs.AI

    Does the Question Really Matter? Training-Free Data Selection for Vision-Language SFT

    Authors: Peng Sun, Huawen Shen, Yi Ban, Tianfan Fu, Yanbo Wang, Yuqiang Li

    Abstract: Visual instruction tuning is crucial for improving vision-language large models (VLLMs). However, many samples can be solved via linguistic patterns or common-sense shortcuts, without genuine cross-modal reasoning, limiting the effectiveness of multimodal learning. Prior data selection methods often rely on costly proxy model training and focus on difficulty or diversity, failing to capture a samp…

    Submitted 10 March, 2026; originally announced March 2026.

  12. arXiv:2603.08035

    cs.AI cs.LG

    CDRRM: Contrast-Driven Rubric Generation for Reliable and Interpretable Reward Modeling

    Authors: Dengcan Liu, Fengkai Yang, Xiaohan Wang, Shurui Yan, Jiajun Chai, Jiahao Li, Yikun Ban, Zhendong Mao, Wei Lin, Guojun Yin

    Abstract: Reward modeling is essential for aligning Large Language Models (LLMs) with human preferences, yet conventional reward models suffer from poor interpretability and heavy reliance on costly expert annotations. While recent rubric-based approaches enhance evaluation transparency, they lack systematic quality control, yielding noisy and redundant criteria, failing to mitigate persistent biases (e.g.,…

    Submitted 9 March, 2026; originally announced March 2026.

  13. arXiv:2603.07032

    cs.RO

    SSP: Safety-guaranteed Surgical Policy via Joint Optimization of Behavioral and Spatial Constraints

    Authors: Jianshu Hu, ZhiYuan Guan, Lei Song, Kantaphat Leelakunwet, Hesheng Wang, Wei Xiao, Qi Dou, Yutong Ban

    Abstract: The paradigm of robot-assisted surgery is shifting toward data-driven autonomy, where policies learned via Reinforcement Learning (RL) or Imitation Learning (IL) enable the execution of complex tasks. However, these "black-box" policies often lack formal safety guarantees, a critical requirement for clinical deployment. In this paper, we propose the Safety-guaranteed Surgical Policy (SSP) framewo…

    Submitted 6 March, 2026; originally announced March 2026.

  14. arXiv:2603.02604

    cs.LG

    Heterogeneous Agent Collaborative Reinforcement Learning

    Authors: Zhixia Zhang, Zixuan Huang, Xin Xia, Deqing Wang, Fuzhen Zhuang, Shuai Ma, Ning Ding, Yaodong Yang, Jianxin Li, Yikun Ban

    Abstract: We introduce Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new learning paradigm that addresses the inefficiencies of isolated on-policy optimization. HACRL enables collaborative optimization with independent execution: heterogeneous agents share verified rollouts during training to mutually improve, while operating independently at inference time. Unlike LLM-based multi-agen…

    Submitted 3 March, 2026; originally announced March 2026.

  15. arXiv:2602.23766

    cs.IR

    UniFAR: A Unified Facet-Aware Retrieval Framework for Scientific Documents

    Authors: Zheng Dou, Zhao Zhang, Deqing Wang, Yikun Ban, Fuzhen Zhuang

    Abstract: Existing scientific document retrieval (SDR) methods primarily rely on document-centric representations learned from inter-document relationships for document-document (doc-doc) retrieval. However, the rise of LLMs and RAG has shifted SDR toward question-driven retrieval, where documents are retrieved in response to natural-language questions (q-doc). This change has led to systematic mismatches b…

    Submitted 27 February, 2026; originally announced February 2026.

  16. arXiv:2602.18749

    cs.AI

    Federated Reasoning Distillation Framework with Model Learnability-Aware Data Allocation

    Authors: Wei Guo, Siyuan Lu, Xiangdong Ran, Yiqi Tong, Yikun Ban, Zelong Xu, Jing Fan, Zixuan Huang, Xiao Zhang, Zhaojun Hu, Fuzhen Zhuang

    Abstract: Data allocation plays a critical role in federated large language model (LLM) and small language models (SLMs) reasoning collaboration. Nevertheless, existing data allocation methods fail to address an under-explored challenge in collaboration: bidirectional model learnability gap, where client-side SLMs cannot identify high-reward samples matching their learnability constraints for effective know…

    Submitted 21 February, 2026; originally announced February 2026.

  17. arXiv:2602.09538

    cs.CL

    UniARM: Towards a Unified Autoregressive Reward Model for Multi-Objective Test-Time Alignment

    Authors: Hongyan Xie, Yikun Ban, Ruiyu Fang, Zixuan Huang, Deqing Wang, Jianxin Li, Yitong Yao, Chao Wang, Shuangyong Song

    Abstract: Multi-objective alignment aims to align LLM responses with multiple human preference objectives. Among existing methods, guiding the generation of frozen LLMs through autoregressive reward models (ARMs) to accomplish multi-objective test-time alignment is a low-cost solution. However, these methods typically rely on independent parameters for each preference objective, either by training ARMs inde…

    Submitted 10 February, 2026; originally announced February 2026.

    Comments: Under Review

  18. Camel: Frame-Level Bandwidth Estimation for Low-Latency Live Streaming under Video Bitrate Undershooting

    Authors: Liming Liu, Zhidong Jia, Li Jiang, Wei Zhang, Lan Xie, Feng Qian, Leju Yan, Bing Yan, Qiang Ma, Zhou Sha, Wei Yang, Yixuan Ban, Xinggong Zhang

    Abstract: Low-latency live streaming (LLS) has emerged as a popular web application, with many platforms adopting real-time protocols such as WebRTC to minimize end-to-end latency. However, we observe a counter-intuitive phenomenon: even when the actual encoded bitrate does not fully utilize the available bandwidth, stalling events remain frequent. This insufficient bandwidth utilization arises from the int…

    Submitted 10 February, 2026; originally announced February 2026.

    Comments: 8 pages, 20 figures, to appear in WWW 2026

    Journal ref: Proceedings of the ACM Web Conference 2026 (WWW '26)

  19. arXiv:2602.08499

    cs.LG cs.AI

    Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards

    Authors: Xiaodong Lu, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Zhijun Chen, Yu Luo, Fuzhen Zhuang, Yikun Ban, Deqing Wang

    Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models. However, existing RLVR methods utilize rollouts in an indiscriminate and short-horizon manner: responses of heterogeneous quality within each prompt are treated uniformly, and historical rollouts are discarded after a single use. This leads to noisy supe…

    Submitted 9 February, 2026; originally announced February 2026.

  20. arXiv:2602.08354

    cs.AI

    Does Your Reasoning Model Implicitly Know When to Stop Thinking?

    Authors: Zixuan Huang, Xin Xia, Yuxi Ren, Jianbin Zheng, Xuanda Wang, Zhixia Zhang, Hongyan Xie, Songshi Liang, Zehao Chen, Xuefeng Xiao, Fuzhen Zhuang, Jianxin Li, Yikun Ban, Deqing Wang

    Abstract: Recent advancements in large reasoning models (LRMs) have greatly improved their capabilities on complex reasoning tasks through Long Chains of Thought (CoTs). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. Recent studies show that longer reasoning chains are frequently uncorrelated with c…

    Submitted 27 February, 2026; v1 submitted 9 February, 2026; originally announced February 2026.

  21. arXiv:2602.08222

    cs.AI

    Weak-Driven Learning: How Weak Agents make Strong Agents Stronger

    Authors: Zehao Chen, Gongxun Li, Tianxiang Ai, Yifei Li, Zixuan Huang, Wang Zhou, Fuzhen Zhuang, Xianglong Liu, Jianxin Li, Deqing Wang, Yikun Ban

    Abstract: As post-training optimization becomes central to improving large language models, we observe a persistent saturation bottleneck: once models grow highly confident, further training yields diminishing returns. While existing methods continue to reinforce target predictions, we find that informative supervision signals remain latent in models' own historical weak states. Motivated by this observatio…

    Submitted 8 February, 2026; originally announced February 2026.

  22. arXiv:2601.22664

    cs.AI

    Real-Time Aligned Reward Model beyond Semantics

    Authors: Zixuan Huang, Xin Xia, Yuxi Ren, Jianbin Zheng, Xuefeng Xiao, Hongyan Xie, Li Huaqiu, Songshi Liang, Zhongxiang Dai, Fuzhen Zhuang, Jianxin Li, Yikun Ban, Deqing Wang

    Abstract: Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique for aligning large language models (LLMs) with human preferences, yet it is susceptible to reward overoptimization, in which policy models overfit to the reward model, exploiting spurious reward patterns instead of faithfully capturing human intent. Prior mitigations primarily rely on surface semantic information and fail to…

    Submitted 9 March, 2026; v1 submitted 30 January, 2026; originally announced January 2026.

  23. arXiv:2601.16933

    cs.CV cs.LG

    Reward-Forcing: Autoregressive Video Generation with Reward Feedback

    Authors: Jingran Zhang, Ning Li, Yuanhao Ban, Andrew Bai, Justin Cui

    Abstract: While most prior work in video generation relies on bidirectional architectures, recent efforts have sought to adapt these models into autoregressive variants to support near real-time generation. However, such adaptations often depend heavily on teacher models, which can limit performance, particularly in the absence of a strong autoregressive teacher, resulting in output quality that typically l…

    Submitted 2 April, 2026; v1 submitted 23 January, 2026; originally announced January 2026.

    Comments: https://openreview.net/forum?id=K8Qjsxxl7y&noteId=K8Qjsxxl7y

  24. arXiv:2601.16914

    cs.CV cs.AI

    LoL: Longer than Longer, Scaling Video Generation to Hour

    Authors: Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, Cho-Jui Hsieh

    Abstract: Recent research in long-form video generation has shifted from bidirectional to autoregressive models, yet these methods commonly suffer from error accumulation and a loss of long-term coherence. While attention sink frames have been introduced to mitigate this performance decay, they often induce a critical failure mode we term sink-collapse: the generated content repeatedly reverts to the sink f…

    Submitted 23 January, 2026; originally announced January 2026.

    Comments: preprint

  25. arXiv:2601.08521

    cs.LG

    Your Group-Relative Advantage Is Biased

    Authors: Fengkai Yang, Zherui Chen, Xiaohan Wang, Xiaodong Lu, Jiajun Chai, Guojun Yin, Wei Lin, Shuai Ma, Fuzhen Zhuang, Deqing Wang, Yaodong Yang, Jianxin Li, Yikun Ban

    Abstract: Reinforcement Learning from Verifier Rewards (RLVR) has emerged as a widely used approach for post-training large language models on reasoning tasks, with group-based methods such as GRPO and its variants gaining broad adoption. These methods rely on group-relative advantage estimation to avoid learned critics, yet its theoretical properties remain poorly understood. In this work, we uncover a f…

    Submitted 21 January, 2026; v1 submitted 13 January, 2026; originally announced January 2026.

  26. arXiv:2512.23213

    cs.CL cs.AI

    Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

    Authors: Zhijun Chen, Zeyu Ji, Qianren Mao, Hao Wu, Junhang Cheng, Bangjie Qin, Zhuoran Li, Jingzheng Li, Kai Sun, Zizhe Wang, Yikun Ban, Zhu Sun, Xiangyang Ji, Hailong Sun

    Abstract: We propose LLM-PeerReview, an unsupervised LLM Ensemble method that selects the most ideal response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a transparent and interpretable mechanism, while remaining fully unsupervised for flexible…

    Submitted 6 February, 2026; v1 submitted 29 December, 2025; originally announced December 2025.

  27. arXiv:2512.22309

    cs.LG cs.AI

    LLMBoost: Make Large Language Models Stronger with Boosting

    Authors: Zehao Chen, Tianxiang Ai, Yifei Li, Gongxun Li, Yuyang Wei, Wang Zhou, Guanghui Li, Bin Yu, Zhijun Chen, Hailong Sun, Fuzhen Zhuang, Jianxin Li, Deqing Wang, Yikun Ban

    Abstract: Ensemble learning of LLMs has emerged as a promising alternative to enhance performance, but existing approaches typically treat models as black boxes, combining the inputs or final outputs while overlooking the rich internal representations and interactions across models. In this work, we introduce LLMBoost, a novel ensemble fine-tuning framework that breaks this barrier by explicitly leveraging i…

    Submitted 26 December, 2025; originally announced December 2025.

  28. arXiv:2512.19347

    cs.RO

    OMP: One-step Meanflow Policy with Directional Alignment

    Authors: Han Fang, Yize Huang, Yuheng Zhao, Paul Weng, Xiao Li, Yutong Ban

    Abstract: Robot manipulation has increasingly adopted data-driven generative policy frameworks, yet the field faces a persistent trade-off: diffusion models suffer from high inference latency, while flow-based methods often require complex architectural constraints. Although in the image generation domain the MeanFlow paradigm offers a path to single-step inference, its direct application to robotics is impede…

    Submitted 29 January, 2026; v1 submitted 22 December, 2025; originally announced December 2025.

  29. arXiv:2512.06343

    cs.LG cs.AI cs.CL

    When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models

    Authors: Tong Xie, Andrew Bai, Yuanhao Ban, Yunqi Hong, Haoyu Li, Cho-jui Hsieh

    Abstract: Reward models are central to Large Language Model (LLM) alignment within the framework of RLHF. The standard objective used in reward modeling is the Bradley-Terry (BT) loss, which learns from pairwise data consisting of chosen and rejected responses. In this work, we analyze the per-sample gradient of BT-loss and show spurious learning signals due to representation distance. In particular, BT gra…

    Submitted 31 January, 2026; v1 submitted 6 December, 2025; originally announced December 2025.

  30. arXiv:2512.05529

    cs.CV cs.AI

    See in Depth: Training-Free Surgical Scene Segmentation with Monocular Depth Priors

    Authors: Kunyi Yang, Qingyu Wang, Cheng Yuan, Yutong Ban

    Abstract: Pixel-wise segmentation of laparoscopic scenes is essential for computer-assisted surgery but difficult to scale due to the high cost of dense annotations. We propose depth-guided surgical scene segmentation (DepSeg), a training-free framework that utilizes monocular depth as a geometric prior together with pretrained vision foundation models. DepSeg first estimates a relative depth map with a pre…

    Submitted 5 December, 2025; originally announced December 2025.

    Comments: The first two authors contributed equally

  31. arXiv:2512.02231

    cs.CV cs.AI cs.LG

    See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

    Authors: Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, Yong Jae Lee

    Abstract: Multimodal large language models (MLLMs) are expected to jointly interpret vision, audio, and language, yet existing video benchmarks rarely assess fine-grained reasoning about human speech. Many tasks remain visually solvable or only coarsely evaluate speech, offering limited insight into whether models can align who speaks, what is said, and when it occurs. We introduce AV-SpeakerBench, a curate…

    Submitted 1 December, 2025; originally announced December 2025.

    Comments: preprint

  32. arXiv:2511.16778

    cs.LG

    GCL-OT: Graph Contrastive Learning with Optimal Transport for Heterophilic Text-Attributed Graphs

    Authors: Yating Ren, Yikun Ban, Huobin Tan

    Abstract: Recently, structure-text contrastive learning has shown promising performance on text-attributed graphs by leveraging the complementary strengths of graph neural networks and language models. However, existing methods typically rely on homophily assumptions in similarity estimation and hard optimization objectives, which limit their applicability to heterophilic graphs. Although existing methods c…

    Submitted 28 January, 2026; v1 submitted 20 November, 2025; originally announced November 2025.

    Comments: AAAI 2026

  33. arXiv:2511.11672

    cs.DC

    OSGym: Scalable OS Infra for Computer Use Agents

    Authors: Zengyi Qin, Jinyuan Chen, Yunze Man, Shengcao Cao, Ziqi Pang, Zhuoyuan Wang, Han Fang, Ling Zhu, Zixin Xie, Zibu Wei, Tianshu Ran, Haoran Geng, Ray Pan, Qizhen Sun, Zachary Bright, Yuyang Cai, Chongye Yang, Jiace Zhao, Tianrui Liu, Han Cao, Yeyang Zhou, Rui Wang, Song Wang, Xiang Ren, Bo Zhang, et al. (3 additional authors not shown)

    Abstract: Training computer use agents requires full-featured OS sandboxes with GUI environments, which consume substantial hardware resources as the number of sandboxes scales. Stochastic errors arising from diverse software execution within these sandboxes further demand robust infrastructure design and reliable error recovery. We present OSGym, a scalable OS environment infrastructure for computer use ag…

    Submitted 1 April, 2026; v1 submitted 11 November, 2025; originally announced November 2025.

  34. arXiv:2510.23786

    cs.LG

    Relaxed Sequence Sampling for Diverse Protein Design

    Authors: Joohwan Ko, Aristofanis Rontogiannis, Yih-En Andrew Ban, Axel Elaldi, Nicholas Franklin

    Abstract: Protein design using structure prediction models such as AlphaFold2 has shown remarkable success, but existing approaches like relaxed sequence optimization (RSO) rely on single-path gradient descent and ignore sequence-space constraints, limiting diversity and designability. We introduce Relaxed Sequence Sampling (RSS), a Markov chain Monte Carlo (MCMC) framework that integrates structural and ev…

    Submitted 27 October, 2025; originally announced October 2025.

  35. arXiv:2510.02283

    cs.CV cs.AI

    Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

    Authors: Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, Cho-Jui Hsieh

    Abstract: Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional tea…

    Submitted 2 October, 2025; originally announced October 2025.

    Comments: preprint

  36. arXiv:2509.25562

    cs.AI cs.CL cs.CV cs.LG

    IRIS: Intrinsic Reward Image Synthesis

    Authors: Yihang Chen, Yuanhao Ban, Yunqi Hong, Cho-Jui Hsieh

    Abstract: Despite the success of Reinforcement Learning from Human Feedback (RLHF) in language reasoning, its application to autoregressive Text-to-Image (T2I) generation is often constrained by the limited availability of human preference data. This paper explores how an autoregressive T2I model can learn from internal signals without relying on external rewards or labeled data. Contrary to recent findings…

    Submitted 29 January, 2026; v1 submitted 29 September, 2025; originally announced September 2025.

  37. arXiv:2509.24701

    cs.LG cs.AI

    FedPOB: Sample-Efficient Federated Prompt Optimization via Bandits

    Authors: Pingchen Lu, Zhi Hong, Zhiwei Shang, Zhiyong Wang, Yikun Ban, Yao Shu, Min Zhang, Shuang Qiu, Zhongxiang Dai

    Abstract: The performance of large language models (LLMs) is highly sensitive to the input prompt, making prompt optimization a critical task. However, real-world application is hindered by three major challenges: (1) the black-box nature of powerful proprietary LLMs, (2) the need for high sample efficiency due to query costs, and (3) the desire for privacy-preserving collaboration among multiple users. To…

    Submitted 29 September, 2025; originally announced September 2025.

    Comments: Preprint

  38. arXiv:2509.24696

    cs.LG cs.AI

    T-POP: Test-Time Personalization with Online Preference Feedback

    Authors: Zikun Qu, Min Zhang, Mingze Kong, Xiang Li, Zhiwei Shang, Zhiyong Wang, Yikun Ban, Shuang Qiu, Yao Shu, Zhongxiang Dai

    Abstract: Personalizing large language models (LLMs) to individual user preferences is a critical step beyond generating generically helpful responses. However, current personalization methods are ill-suited for new users, as they typically require either slow, resource-intensive fine-tuning or a substantial amount of pre-existing user data, creating a significant cold-start problem. To address this challen…

    Submitted 29 September, 2025; originally announced September 2025.

    Comments: Preprint

  39. arXiv:2509.23575

    cs.RO

    Generalizable Coarse-to-Fine Robot Manipulation via Language-Aligned 3D Keypoints

    Authors: Jianshu Hu, Lidi Wang, Shujia Li, Yunpeng Jiang, Xiao Li, Paul Weng, Yutong Ban

    Abstract: Hierarchical coarse-to-fine policy, where a coarse branch predicts a region of interest to guide a fine-grained action predictor, has demonstrated significant potential in robotic 3D manipulation tasks by especially enhancing sample efficiency and enabling more precise manipulation. However, even augmented with pre-trained models, these hierarchical policies still suffer from generalization issues…

    Submitted 20 February, 2026; v1 submitted 27 September, 2025; originally announced September 2025.

    Comments: Published in ICLR 2026

  40. arXiv:2509.17100

    cs.CV

    The SAGES Critical View of Safety Challenge: A Global Benchmark for AI-Assisted Surgical Quality Assessment

    Authors: Deepak Alapatt, Jennifer Eckhoff, Zhiliang Lyu, Yutong Ban, Jean-Paul Mazellier, Sarah Choksi, Kunyi Yang, Po-Hsing Chiang, Noemi Zorzetti, Samuele Cannas, Daniel Neimark, Omri Bar, Amine Yamlahi, Jakob Hennighausen, Xiaohan Wang, Rui Li, Long Liang, Yuxian Wang, Saurabh Koju, Binod Bhattarai, Tim Jaspers, Zhehua Mao, Anjana Wijekoon, Jun Ma, Yinan Xu, et al. (16 additional authors not shown)

    Abstract: Advances in artificial intelligence (AI) for surgical quality assessment promise to democratize access to expertise, with applications in training, guidance, and accreditation. This study presents the SAGES Critical View of Safety (CVS) Challenge, the first AI competition organized by a surgical society, using the CVS in laparoscopic cholecystectomy, a universally recommended yet inconsistently pe…

    Submitted 28 January, 2026; v1 submitted 21 September, 2025; originally announced September 2025.

    Comments: 21 pages, 10 figures

    MSC Class: 68T07 ACM Class: I.2.10; J.3

  41. arXiv:2509.05602  [pdf, ps, other]

    cs.CL

    Mitigating Spurious Correlations Between Question and Answer via Chain-of-Thought Correctness Perception Distillation

    Authors: Hongyan Xie, Yitong Yao, Yikun Ban, Zixuan Huang, Deqing Wang, Zhenhe Wu, Haoxiang Su, Chao Wang, Shuangyong Song

    Abstract: Large language models (LLMs) excel at reasoning tasks but are expensive to deploy. Thus small language models (SLMs) are fine-tuned on CoT data generated by LLMs to copy LLMs' abilities. However, these CoT data may include noisy rationales that either fail to substantiate the answers or contribute no additional information to support answer prediction, which leads SLMs to capture spurious correlat…

    Submitted 9 September, 2025; v1 submitted 6 September, 2025; originally announced September 2025.

    Comments: PrePrint

  42. arXiv:2506.21330  [pdf, ps, other]

    cs.CV cs.AI

    Holistic Surgical Phase Recognition with Hierarchical Input Dependent State Space Models

    Authors: Haoyang Wu, Tsun-Hsuan Wang, Mathias Lechner, Ramin Hasani, Jennifer A. Eckhoff, Paul Pak, Ozanan R. Meireles, Guy Rosman, Yutong Ban, Daniela Rus

    Abstract: Surgical workflow analysis is essential in robot-assisted surgeries, yet the long duration of such procedures poses significant challenges for comprehensive video analysis. Recent approaches have predominantly relied on transformer models; however, their quadratic attention mechanism restricts efficient processing of lengthy surgical videos. In this paper, we propose a novel hierarchical input-dep…

    Submitted 26 June, 2025; originally announced June 2025.

  43. arXiv:2506.17252  [pdf, ps, other]

    cs.LG cs.AI

    Adaptive Batch-Wise Sample Scheduling for Direct Preference Optimization

    Authors: Zixuan Huang, Yikun Ban, Lean Fu, Xiaojie Li, Zhongxiang Dai, Jianxin Li, Deqing Wang

    Abstract: Direct Preference Optimization (DPO) has emerged as an effective approach for aligning large language models (LLMs) with human preferences. However, its performance is highly dependent on the quality of the underlying human preference data. To address this bottleneck, prior work has explored various data selection strategies, but these methods often overlook the impact of the evolving states of th…

    Submitted 9 March, 2026; v1 submitted 8 June, 2025; originally announced June 2025.

  44. arXiv:2506.15483  [pdf, ps, other]

    cs.CV cs.AI

    GenHOI: Generalizing Text-driven 4D Human-Object Interaction Synthesis for Unseen Objects

    Authors: Shujia Li, Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Yutong Ban

    Abstract: While diffusion models and large-scale motion datasets have advanced text-driven human motion synthesis, extending these advances to 4D human-object interaction (HOI) remains challenging, mainly due to the limited availability of large-scale 4D HOI datasets. In our study, we introduce GenHOI, a novel two-stage framework aimed at achieving two key objectives: 1) generalization to unseen objects and…

    Submitted 18 June, 2025; originally announced June 2025.

  45. arXiv:2506.02555  [pdf, other]

    cs.CV

    SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence

    Authors: Zhitao Zeng, Zhu Zhuo, Xiaojun Jia, Erli Zhang, Junde Wu, Jiaan Zhang, Yuxuan Wang, Chang Han Low, Jian Jiang, Zilong Zheng, Xiaochun Cao, Yutong Ban, Qi Dou, Yang Liu, Yueming Jin

    Abstract: Foundation models have achieved transformative success across biomedical domains by enabling holistic understanding of multimodal data. However, their application in surgery remains underexplored. Surgical intelligence presents unique challenges - requiring surgical visual perception, temporal analysis, and reasoning. Existing general-purpose vision-language models fail to address these needs due…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: 29 pages, 5 figures

    MSC Class: 68T45 ACM Class: I.2.10

  46. arXiv:2505.16270  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Transformer Copilot: Learning from The Mistake Log in LLM Fine-tuning

    Authors: Jiaru Zou, Yikun Ban, Zihao Li, Yunzhe Qi, Ruizhong Qiu, Ling Yang, Jingrui He

    Abstract: Large language models are typically adapted to downstream tasks through supervised fine-tuning on domain-specific data. While standard fine-tuning focuses on minimizing generation loss to optimize model parameters, we take a deeper step by retaining and leveraging the model's own learning signals, analogous to how human learners reflect on past mistakes to improve future performance. We first intr…

    Submitted 13 November, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

    Comments: NeurIPS 2025 Spotlight

  47. arXiv:2505.13925  [pdf, ps, other]

    cs.RO cs.LG

    Time Reversal Symmetry for Efficient Robotic Manipulations in Deep Reinforcement Learning

    Authors: Yunpeng Jiang, Jianshu Hu, Paul Weng, Yutong Ban

    Abstract: Symmetry is pervasive in robotics and has been widely exploited to improve sample efficiency in deep reinforcement learning (DRL). However, existing approaches primarily focus on spatial symmetries, such as reflection, rotation, and translation, while largely neglecting temporal symmetries. To address this gap, we explore time reversal symmetry, a form of temporal symmetry commonly found in roboti…

    Submitted 21 October, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

    Comments: Accepted in NeurIPS 2025

  48. arXiv:2505.11580  [pdf, ps, other]

    cs.LG cs.AI q-bio.BM

    Flash Invariant Point Attention

    Authors: Andrew Liu, Axel Elaldi, Nicholas T Franklin, Nathan Russell, Gurinder S Atwal, Yih-En A Ban, Olivia Viessmann

    Abstract: Invariant Point Attention (IPA) is a key algorithm for geometry-aware modeling in structural biology, central to many protein and RNA models. However, its quadratic complexity limits the input sequence length. We introduce FlashIPA, a factorized reformulation of IPA that leverages hardware-efficient FlashAttention to achieve linear scaling in GPU memory and wall-clock time with sequence length. Fl…

    Submitted 16 May, 2025; originally announced May 2025.

  49. arXiv:2504.13440  [pdf, other]

    cs.CV

    Temporal Propagation of Asymmetric Feature Pyramid for Surgical Scene Segmentation

    Authors: Cheng Yuan, Yutong Ban

    Abstract: Surgical scene segmentation is crucial for robot-assisted laparoscopic surgery understanding. Current approaches face two challenges: (i) static image limitations including ambiguous local feature similarities and fine-grained structural details, and (ii) dynamic video complexities arising from rapid instrument motion and persistent visual occlusions. While existing methods mainly focus on spatial…

    Submitted 17 April, 2025; originally announced April 2025.

  50. arXiv:2503.02558  [pdf, other]

    cs.CV

    Tracking-Aware Deformation Field Estimation for Non-rigid 3D Reconstruction in Robotic Surgeries

    Authors: Zeqing Wang, Han Fang, Yihong Xu, Yutong Ban

    Abstract: Minimally invasive procedures have been advanced rapidly by the robotic laparoscopic surgery. The latter greatly assists surgeons in sophisticated and precise operations with reduced invasiveness. Nevertheless, it is still safety critical to be aware of even the least tissue deformation during instrument-tissue interactions, especially in 3D space. To address this, recent works rely on NeRF to ren…

    Submitted 4 March, 2025; originally announced March 2025.