
Showing 1–50 of 1,229 results for author: Ma, C

Searching in archive cs.
  1. arXiv:2604.09059  [pdf, ps, other]

    cs.CV cs.AI

    Learning Vision-Language-Action World Models for Autonomous Driving

    Authors: Guoqing Wang, Pin Tang, Xiangxuan Ren, Guodongfang Zhao, Bailan Feng, Chao Ma

    Abstract: Vision-Language-Action (VLA) models have recently achieved notable progress in end-to-end autonomous driving by integrating perception, reasoning, and control within a unified multimodal framework. However, they often lack explicit modeling of temporal dynamics and global world consistency, which limits their foresight and safety. In contrast, world models can simulate plausible future scenes but…

    Submitted 10 April, 2026; originally announced April 2026.

    Comments: Accepted by CVPR 2026 findings

  2. arXiv:2604.08048  [pdf, ps, other]

    cs.CV

    Guiding a Diffusion Model by Swapping Its Tokens

    Authors: Weijia Zhang, Yuehao Liu, Shanyan Guan, Wu Ran, Yanhao Ge, Wei Li, Chao Ma

    Abstract: Classifier-Free Guidance (CFG) is a widely used inference-time technique to boost the image quality of diffusion models. Yet, its reliance on text conditions prevents its use in unconditional generation. We propose a simple method to enable CFG-like guidance for both conditional and unconditional generation. The key idea is to generate a perturbed prediction via simple token swap operations, and u…

    Submitted 9 April, 2026; originally announced April 2026.

    Comments: Accepted by CVPR 2026 (Oral)

  3. arXiv:2604.06551  [pdf, ps, other]

    cs.CL

    CCD-CBT: Multi-Agent Therapeutic Interaction for CBT Guided by Cognitive Conceptualization Diagram

    Authors: Chang Liu, Changsheng Ma, Yongfeng Tao, Bin Hu, Minqiang Yang

    Abstract: Large language models show potential for scalable mental-health support by simulating Cognitive Behavioral Therapy (CBT) counselors. However, existing methods often rely on static cognitive profiles and omniscient single-agent simulation, failing to capture the dynamic, information-asymmetric nature of real therapy. We introduce CCD-CBT, a multi-agent framework that shifts CBT simulation along two…

    Submitted 7 April, 2026; originally announced April 2026.

  4. arXiv:2604.06111  [pdf, ps, other]

    cs.AI cs.CL

    AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

    Authors: Wang Yang, Chaoda Song, Xinpeng Li, Debargha Ganguly, Chuang Ma, Shouren Wang, Zhihao Dou, Yuli Zhou, Vipin Chaudhary, Xiaotian Han

    Abstract: Existing Agent benchmarks suffer from two critical limitations: high environment interaction overhead (up to 41% of total evaluation time) and imbalanced task horizon and difficulty distributions that make aggregate scores unreliable. To address these issues, we propose AgentCE-Bench built around a unified grid-based planning task, where agents must fill hidden slots in a partially completed sche…

    Submitted 9 April, 2026; v1 submitted 7 April, 2026; originally announced April 2026.

  5. arXiv:2604.05379  [pdf, ps, other]

    cs.IR cs.LG

    Retrieve-then-Adapt: Retrieval-Augmented Test-Time Adaptation for Sequential Recommendation

    Authors: Xing Tang, Jingyang Bin, Ziqiang Cui, Xiaokun Zhang, Fuyuan Lyu, Jingyan Jiang, Dugang Liu, Chen Ma, Xiuqiang He

    Abstract: The sequential recommendation (SR) task aims to predict the next item based on users' historical interaction sequences. Typically trained on historical data, SR models often struggle to adapt to real-time preference shifts during inference due to challenges posed by distributional divergence and parameterized constraints. Existing approaches to address this issue include test-time training, test-t…

    Submitted 6 April, 2026; originally announced April 2026.

  6. arXiv:2604.04623  [pdf, ps, other]

    cs.HC

    On Optimizing Electrode Configuration for Wrist-Worn sEMG-Based Thumb Gesture Recognition

    Authors: Wenjuan Zhong, Chenfei Ma, Kianoush Nazarpour

    Abstract: Thumb gestures provide an effective and unobtrusive input modality for wearable and always-available human-machine interaction. Wrist-worn surface electromyography (sEMG) has emerged as a promising approach for compact and wearable human-machine interfaces. However, compared to forearm sEMG, the impact of electrode configuration on wrist-based decoding performance remains understudied. We systemat…

    Submitted 6 April, 2026; originally announced April 2026.

  7. arXiv:2604.04500  [pdf, ps, other]

    cs.CV

    Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

    Authors: Shizhan Gong, Minda Hu, Qiyuan Zhang, Chen Ma, Qi Dou

    Abstract: Vision-language models (VLMs) have achieved remarkable success across diverse tasks. However, concerns about their trustworthiness persist, particularly regarding tendencies to lean more on textual cues than visual evidence and the risk of producing ungrounded or fabricated responses. To address these issues, we propose Saliency-R1, a framework for improving the interpretability and faithfulness o…

    Submitted 6 April, 2026; originally announced April 2026.

    Comments: CVPR 2026

  8. arXiv:2604.03014  [pdf, ps, other]

    cs.IR cs.AI

    User-Aware Conditional Generative Total Correlation Learning for Multi-Modal Recommendation

    Authors: Jing Du, Zesheng Ye, Congbo Ma, Feng Liu, Flora D. Salim

    Abstract: Multi-modal recommendation (MMR) enriches item representations by introducing item content, e.g., visual and textual descriptions, to improve upon interaction-only recommenders. The success of MMR hinges on aligning these content modalities with user preferences derived from interaction data, yet dominant practices based on disentangling modality-invariant preference-driving signals from modality-…

    Submitted 3 April, 2026; originally announced April 2026.

    Comments: 11 pages, 7 figures, 3 tables

  9. arXiv:2604.02795  [pdf, ps, other]

    cs.CL cs.AI

    Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks

    Authors: Tianze Xu, Yanzhao Zheng, Pengrui Lu, Lyumanshan Ye, Yong Wu, Zhentao Zhang, Yuanqiang Yu, Chao Ma, Jihuai Zhu, Pengfei Liu, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu

    Abstract: Rubric-based Reinforcement Learning (RL) has emerged as a promising approach for aligning Large Language Models (LLMs) with complex, open-domain instruction following tasks. However, existing methods predominantly rely on response-level rewards, introducing severe reward sparsity and reward ambiguity problems. To address these issues, we propose Rubrics to Tokens (RTT), a novel rubric-based RL fra…

    Submitted 3 April, 2026; originally announced April 2026.

  10. arXiv:2604.01664  [pdf, ps, other]

    cs.AI

    ContextBudget: Budget-Aware Context Management for Long-Horizon Search Agents

    Authors: Yong Wu, YanZhao Zheng, TianZe Xu, ZhenTao Zhang, YuanQiang Yu, JiHuai Zhu, Chao Ma, BinBin Lin, BaoHua Dong, HangCheng Zhu, RuoHui Huang, Gang Yu

    Abstract: LLM-based agents show strong potential for long-horizon reasoning, yet their context size is limited by deployment factors (e.g., memory, latency, and cost), yielding a constrained context budget. As interaction histories grow, this induces a trade-off between retaining past information and staying within the context limit. To address this challenge, we propose Budget-Aware Context Management (BAC…

    Submitted 2 April, 2026; originally announced April 2026.

  11. arXiv:2604.01053  [pdf, ps, other]

    cs.CV

    PHASOR: Anatomy- and Phase-Consistent Volumetric Diffusion for CT Virtual Contrast Enhancement

    Authors: Zilong Li, Dongyang Li, Chenglong Ma, Zhan Feng, Dakai Jin, Junping Zhang, Hao Luo, Fan Wang, Hongming Shan

    Abstract: Contrast-enhanced computed tomography (CECT) is pivotal for highlighting tissue perfusion and vascularity, yet its clinical ubiquity is impeded by the invasive nature of contrast agents and radiation risks. While virtual contrast enhancement (VCE) offers an alternative to synthesizing CECT from non-contrast CT (NCCT), existing methods struggle with anatomical heterogeneity and spatial misalignment…

    Submitted 1 April, 2026; originally announced April 2026.

  12. arXiv:2604.00927  [pdf, ps, other]

    cs.CV cs.AI

    Learning Quantised Structure-Preserving Motion Representations for Dance Fingerprinting

    Authors: Arina Kharlamova, Bowei He, Chen Ma, Xue Liu

    Abstract: We present DANCEMATCH, an end-to-end framework for motion-based dance retrieval, the task of identifying semantically similar choreographies directly from raw video, defined as DANCE FINGERPRINTING. While existing motion analysis and retrieval methods can compare pose sequences, they rely on continuous embeddings that are difficult to index, interpret, or scale. In contrast, DANCEMATCH constructs…

    Submitted 1 April, 2026; originally announced April 2026.

  13. arXiv:2603.29295  [pdf, ps, other]

    cs.CV

    GazeCLIP: Gaze-Guided CLIP with Adaptive-Enhanced Fine-Grained Language Prompt for Deepfake Attribution and Detection

    Authors: Yaning Zhang, Linlin Shen, Zitong Yu, Chunjie Ma, Zan Gao

    Abstract: Current deepfake attribution or deepfake detection works tend to exhibit poor generalization to novel generative methods due to the limited exploration in visual modalities alone. They tend to assess the attribution or detection performance of models on unseen advanced generators, coarsely, and fail to consider the synergy of the two tasks. To this end, we propose a novel gaze-guided CLIP with ada…

    Submitted 31 March, 2026; originally announced March 2026.

  14. arXiv:2603.27460  [pdf, ps, other]

    cs.CV cs.AI

    Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development

    Authors: Zhongying Deng, Cheng Tang, Ziyan Huang, Jiashi Lin, Ying Chen, Junzhi Ning, Chenglong Ma, Jiyao Liu, Wei Li, Yinghao Zhu, Shujian Gao, Yanyan Huang, Sibo Ju, Yanzhou Su, Pengcheng Chen, Wenhao Tang, Tianbin Li, Haoyu Wang, Yuanfeng Ji, Hui Sun, Shaobo Min, Liang Peng, Feilong Tang, Haochen Xue, Rulin Zhou , et al. (102 additional authors not shown)

    Abstract: Foundation models have demonstrated remarkable success across diverse domains and tasks, primarily due to the availability of large-scale, diverse, and high-quality datasets. However, in the field of medical imaging, the curation and assembly of such medical datasets are highly challenging due to the reliance on clinical expertise and strict ethical and privacy constraints, resulting in a scarcity of…

    Submitted 28 March, 2026; originally announced March 2026.

    Comments: 157 pages, 19 figures, 26 tables. Project repo: https://github.com/uni-medical/Project-Imaging-X

  15. arXiv:2603.25720  [pdf, ps, other]

    cs.AI cs.CV

    R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning

    Authors: Zirui Zhang, Haoyu Dong, Kexin Pei, Chengzhi Mao

    Abstract: Robust perception and reasoning require consistency across sensory modalities. Yet current multimodal models often violate this principle, yielding contradictory predictions for visual and textual representations of the same concept. Rather than masking these failures with standard voting mechanisms, which can amplify systematic biases, we show that cross-modal inconsistency provides a rich and na…

    Submitted 26 March, 2026; originally announced March 2026.

  16. arXiv:2603.25706  [pdf, ps, other]

    cs.CV

    Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training

    Authors: Jinbo Xing, Zeyinzi Jiang, Yuxiang Tuo, Chaojie Mao, Xiaotang Gai, Xi Chen, Jingfeng Zhang, Yulin Pan, Zhen Han, Jie Xiao, Keyu Yan, Chenwei Xie, Chongyang Zhong, Kai Zhu, Tong Shen, Lianghua Huang, Yu Liu, Yujiu Yang

    Abstract: Recent unified models have made unprecedented progress in both understanding and generation. However, while most of them accept multi-modal inputs, they typically produce only single-modality outputs. This challenge of producing interleaved content is mainly due to training data scarcity and the difficulty of modeling long-range cross-modal context. To address this issue, we decompose interleaved…

    Submitted 29 March, 2026; v1 submitted 26 March, 2026; originally announced March 2026.

    Comments: CVPR 2026 Camera-ready, Webpage: https://doubiiu.github.io/projects/WanWeaver

  17. arXiv:2603.25405  [pdf, ps, other]

    cs.RO cs.AI

    System Design for Maintaining Internal State Consistency in Long-Horizon Robotic Tabletop Games

    Authors: Guangyu Zhao, Ceyao Zhang, Chengdong Ma, Tao Wu, Yiyang Song, Haoxuan Ru, Yifan Zhong, Ruilin Yan, Lingfeng Li, Ruochong Li, Yu Li, Xuyuan Han, Yun Ding, Ruizhang Jiang, Xiaochuan Zhang, Yichao Li, Yuanpei Chen, Yaodong Yang, Yitao Liang

    Abstract: Long-horizon tabletop games pose a distinct systems challenge for robotics: small perceptual or execution errors can invalidate accumulated task state, propagate across decision-making modules, and ultimately derail interaction. This paper studies how to maintain internal state consistency in turn-based, multi-human robotic tabletop games through deliberate system design rather than isolated compo…

    Submitted 26 March, 2026; originally announced March 2026.

  18. arXiv:2603.24958  [pdf, ps, other]

    cs.IR

    DIET: Learning to Distill Dataset Continually for Recommender Systems

    Authors: Jiaqing Zhang, Hao Wang, Mingjia Yin, Bo Chen, Qinglin Jia, Rui Zhou, Ruiming Tang, ChaoYi Ma, Enhong Chen

    Abstract: Modern deep recommender models are trained under a continual learning paradigm, relying on massive and continuously growing streaming behavioral logs. In large-scale platforms, retraining models on full historical data for architecture comparison or iteration is prohibitively expensive, severely slowing down model development. This challenge calls for data-efficient approaches that can faithfully…

    Submitted 25 March, 2026; originally announced March 2026.

  19. arXiv:2603.24025  [pdf, ps, other]

    cs.LG stat.ME

    i-IF-Learn: Iterative Feature Selection and Unsupervised Learning for High-Dimensional Complex Data

    Authors: Chen Ma, Wanjie Wang, Shuhao Fan

    Abstract: Unsupervised learning of high-dimensional data is challenging due to irrelevant or noisy features obscuring underlying structures. It's common that only a few features, called the influential features, meaningfully define the clusters. Recovering these influential features is helpful in data interpretation and clustering. We propose i-IF-Learn, an iterative unsupervised framework that jointly perf…

    Submitted 25 March, 2026; originally announced March 2026.

    Comments: 28 pages, 5 figures, including appendix. Accepted at AISTATS

  20. arXiv:2603.23906  [pdf, ps, other]

    cs.CV

    GenMask: Adapting DiT for Segmentation via Direct Mask Generation

    Authors: Yuhuan Yang, Xianwei Zhuang, Yuxuan Cai, Chaofan Ma, Shuai Bai, Jiangchao Yao, Ya Zhang, Junyang Lin, Yanfeng Wang

    Abstract: Recent approaches for segmentation have leveraged pretrained generative models as feature extractors, treating segmentation as a downstream adaptation task via indirect feature retrieval. This implicit use suffers from a fundamental misalignment in representation. It also depends heavily on indirect feature extraction pipelines, which complicate the workflow and limit adaptation. In this paper, we…

    Submitted 26 March, 2026; v1 submitted 24 March, 2026; originally announced March 2026.

    Comments: Accepted by CVPR 2026

  21. arXiv:2603.23896  [pdf, ps, other]

    cs.CV

    MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation

    Authors: Gengluo Li, Chengquan Zhang, Yupu Liang, Huawen Shen, Yaping Zhang, Pengyuan Lyu, Weinong Wang, Xingyu Wan, Gangyan Zeng, Han Hu, Can Ma, Yu Zhou

    Abstract: End-to-end text-image machine translation (TIMT), which directly translates textual content in images across languages, is crucial for real-world multilingual scene understanding. Despite advances in vision-language large models (VLLMs), robustness across diverse visual scenes and low-resource languages remains underexplored due to limited evaluation resources. We present MMTIT-Bench, a human-veri…

    Submitted 24 March, 2026; originally announced March 2026.

    Comments: Accepted to CVPR 2026

  22. arXiv:2603.23885  [pdf, ps, other]

    cs.CV

    Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training

    Authors: Gengluo Li, Pengyuan Lyu, Chengquan Zhang, Huawen Shen, Liang Wu, Xingyu Wan, Gangyan Zeng, Han Hu, Can Ma, Yu Zhou

    Abstract: Document parsing has recently advanced with multimodal large language models (MLLMs) that directly map document images to structured outputs. Traditional cascaded pipelines depend on precise layout analysis and often fail under casually captured or non-standard conditions. Although end-to-end approaches mitigate this dependency, they still exhibit repetitive, hallucinated, and structurally inconsi…

    Submitted 27 March, 2026; v1 submitted 24 March, 2026; originally announced March 2026.

    Comments: Accepted to CVPR 2026

  23. arXiv:2603.22455  [pdf, ps, other]

    cs.LG

    SkillRouter: Skill Routing for LLM Agents at Scale

    Authors: YanZhao Zheng, ZhenTao Zhang, Chao Ma, YuanQiang Yu, JiHuai Zhu, Yong Wu, Tianze Xu, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu

    Abstract: Reusable skills let LLM agents package task-specific procedures, tool affordances, and execution guidance into modular building blocks. As skill ecosystems grow to tens of thousands of entries, exposing every skill at inference time becomes infeasible. This creates a skill-routing problem: given a user task, the system must identify relevant skills before downstream planning or execution. Existing…

    Submitted 1 April, 2026; v1 submitted 23 March, 2026; originally announced March 2026.

  24. arXiv:2603.22446  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs

    Authors: Haoming Meng, Kexin Huang, Shaohang Wei, Chiyu Ma, Shuo Yang, Xue Wang, Guoyin Wang, Bolin Ding, Jingren Zhou

    Abstract: Reinforcement learning with verifiable rewards (RLVR) has significantly improved reasoning in large language models (LLMs), yet the token-level mechanisms underlying these improvements remain unclear. We present a systematic empirical study of RLVR's distributional effects organized around three main analyses: (1) token-level characterization of distributional shifts between base and RL models, (2…

    Submitted 23 March, 2026; originally announced March 2026.

    Comments: Published as a conference paper at the International Conference on Learning Representations (ICLR 2026)

  25. arXiv:2603.22117  [pdf, ps, other]

    cs.LG cs.AI

    On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation

    Authors: Kexin Huang, Haoming Meng, Junkang Wu, Jinda Lu, Chiyu Ma, Ziqian Chen, Xue Wang, Bolin Ding, Jiancan Wu, Xiang Wang, Xiangnan He, Guoyin Wang, Jingren Zhou

    Abstract: Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning capabilities of large language models. While existing analyses identify that RLVR-induced changes are sparse, they primarily focus on the magnitude of these updates, largely overlooking their direction. In this work, we argue that the direction of updates is a more critical lens for und…

    Submitted 23 March, 2026; originally announced March 2026.

  26. arXiv:2603.20611  [pdf, ps, other]

    cs.CV

    GaussianPile: A Unified Sparse Gaussian Splatting Framework for Slice-based Volumetric Reconstruction

    Authors: Di Kong, Yikai Wang, Wenjie Guo, Yifan Bu, Boya Zhang, Yuexin Duan, Xiawei Yue, Wenbiao Du, Yiman Zhong, Yuwen Chen, Cheng Ma

    Abstract: Slice-based volumetric imaging is widely applied and it demands representations that compress aggressively while preserving internal structure for analysis. We introduce GaussianPile, unifying 3D Gaussian splatting with an imaging system-aware focus model to address this challenge. Our proposed method introduces three key innovations: (i) a slice-aware piling strategy that positions anisotropic 3D…

    Submitted 20 March, 2026; originally announced March 2026.

    Comments: Accepted by IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 (CVPR 2026)

  27. arXiv:2603.19863  [pdf, ps, other]

    cs.CV

    MedQ-Engine: A Closed-Loop Data Engine for Evolving MLLMs in Medical Image Quality Assessment

    Authors: Jiyao Liu, Junzhi Ning, Wanying Qu, Lihao Liu, Chenglong Ma, Junjun He, Ningsheng Xu

    Abstract: Medical image quality assessment (Med-IQA) is a prerequisite for clinical AI deployment, yet multimodal large language models (MLLMs) still fall substantially short of human experts, particularly when required to provide descriptive assessments with clinical reasoning beyond simple quality scores. However, improving them is hindered by the high cost of acquiring descriptive annotations and by the…

    Submitted 20 March, 2026; originally announced March 2026.

  28. arXiv:2603.19835  [pdf, ps, other]

    cs.LG

    FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

    Authors: Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, Soroush Vosoughi, Guoyin Wang, Jingren Zhou

    Abstract: We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO style training scales effectively, it typically relies on outcome-based rewards (ORM) that distribute a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment impose…

    Submitted 31 March, 2026; v1 submitted 20 March, 2026; originally announced March 2026.

    Comments: Moved related work to the main paper and added more background information to the Preliminary section

  29. arXiv:2603.19693  [pdf, ps, other]

    cs.IR

    From Token to Item: Enhancing Large Language Models for Recommendation via Item-aware Attention Mechanism

    Authors: Xiaokun Zhang, Bowei He, Jiamin Chen, Ziqiang Cui, Chen Ma

    Abstract: Large Language Models (LLMs) have recently gained increasing attention in the field of recommendation. Existing LLM-based methods typically represent items as token sequences, and apply attention layers on these tokens to generate recommendations. However, by inheriting the standard attention mechanism, these methods focus on modeling token-level relations. This token-centric focus overlooks the i…

    Submitted 20 March, 2026; originally announced March 2026.

    Comments: This work has been accepted by WWW 2026

  30. arXiv:2603.19232  [pdf, ps, other]

    cs.CV

    Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens

    Authors: Yuqing Wang, Chuofan Ma, Zhijie Lin, Yao Teng, Lijun Yu, Shuai Wang, Jiaming Han, Jiashi Feng, Yi Jiang, Xihui Liu

    Abstract: Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pr…

    Submitted 19 March, 2026; originally announced March 2026.

    Comments: Accepted by CVPR 2026 main track; Code: https://github.com/YuqingWang1029/CubiD

  31. arXiv:2603.18465  [pdf, ps, other]

    cs.CV

    MedQ-UNI: Toward Unified Medical Image Quality Assessment and Restoration via Vision-Language Modeling

    Authors: Jiyao Liu, Junzhi Ning, Wanying Qu, Lihao Liu, Chenglong Ma, Junjun He, Ningsheng Xu

    Abstract: Existing medical image restoration (Med-IR) methods are typically modality-specific or degradation-specific, failing to generalize across the heterogeneous degradations encountered in clinical practice. We argue this limitation stems from the isolation of Med-IR from medical image quality assessment (Med-IQA), as restoration models without explicit quality understanding struggle to adapt to divers… ▽ More

    Submitted 18 March, 2026; originally announced March 2026.

  32. arXiv:2603.16620  [pdf, ps, other]

    cs.CV

    TCATSeg: A Tooth Center-Wise Attention Network for 3D Dental Model Semantic Segmentation

    Authors: Qiang He, Wentian Qu, Jiajia Dai, Changsong Lei, Shaofeng Wang, Feifei Zuo, Yajie Wang, Yaqian Liang, Xiaoming Deng, Cuixia Ma, Yong-Jin Liu, Hongan Wang

    Abstract: Accurate semantic segmentation of 3D dental models is essential for digital dentistry applications such as orthodontics and dental implants. However, due to complex tooth arrangements and similarities in shape among adjacent teeth, existing methods struggle with accurate segmentation, because they often focus on local geometry while neglecting global contextual information. To address this, we pro…

    Submitted 17 March, 2026; originally announced March 2026.

    Comments: 6 pages, 4 figures, ICASSP 2026

  33. arXiv:2603.16292  [pdf, ps, other]

    cs.CL cs.AI

    Attention-guided Evidence Grounding for Spoken Question Answering

    Authors: Ke Yang, Bolin Chen, Yuejie Li, Yueying Hua, Jianhao Nie, Yueping He, Bowen Li, Chengjun Mao

    Abstract: Spoken Question Answering (Spoken QA) presents a challenging cross-modal problem: effectively aligning acoustic queries with textual knowledge while avoiding the latency and error propagation inherent in cascaded ASR-based systems. In this paper, we introduce Attention-guided Evidence Grounding (AEG), a novel end-to-end framework that leverages the internal cross-modal attention of Speech Large La…

    Submitted 17 March, 2026; v1 submitted 17 March, 2026; originally announced March 2026.

    Comments: Accepted to ICME 2026

  34. arXiv:2603.15689  [pdf, ps, other]

    cs.LG cs.AI cs.CV

    Transition Flow Matching

    Authors: Chenrui Ma

    Abstract: Mainstream flow matching methods typically focus on learning the local velocity field, which inherently requires multiple integration steps during generation. In contrast, Mean Velocity Flow models establish a relationship between the local velocity field and the global mean velocity, enabling the latter to be learned through a mathematically grounded formulation and allowing generation to be tran…

    Submitted 15 March, 2026; originally announced March 2026.

  35. arXiv:2603.14251  [pdf, ps, other]

    cs.CL cs.AI

    Mitigating Overthinking in Large Reasoning Language Models via Reasoning Path Deviation Monitoring

    Authors: Weixin Guan, Liang Li, Jiapeng Liu, Bing Li, Peng Fu, Chengyang Fang, Xiaoshuai Hao, Can Ma, Weiping Wang

    Abstract: Large Reasoning Language Models (LRLMs) demonstrate impressive capabilities on complex tasks by utilizing long Chain-of-Thought reasoning. However, they are prone to overthinking, which generates redundant reasoning steps that degrade both performance and efficiency. Recently, early-exit strategies are proposed to mitigate overthinking by dynamically and adaptively terminating redundant reasoning.…

    Submitted 15 March, 2026; originally announced March 2026.

  36. arXiv:2603.14228  [pdf, ps, other]

    cs.CV

    Not All Directions Matter: Toward Structured and Task-Aware Low-Rank Adaptation

    Authors: Xi Xiao, Chenrui Ma, Yunbei Zhang, Chen Liu, Zhuxuanzi Wang, Yanshu Li, Lin Zhao, Guosheng Hu, Tianyang Wang, Hao Xu

    Abstract: Low-Rank Adaptation (LoRA) has become a cornerstone of parameter-efficient fine-tuning (PEFT). Yet, its efficacy is hampered by two fundamental limitations: semantic drift, by treating all update directions with equal importance, and structural incoherence, from adapting layers independently, resulting in suboptimal, uncoordinated updates. To remedy these, we propose StructLoRA, a framework that a…

    Submitted 15 March, 2026; originally announced March 2026.

  37. Enhancing Eye Feature Estimation from Event Data Streams through Adaptive Inference State Space Modeling

    Authors: Viet Dung Nguyen, Mobina Ghorbaninejad, Chengyi Ma, Reynold Bailey, Gabriel J. Diaz, Alexander Fix, Ryan J. Suess, Alexander Ororbia

    Abstract: Eye feature extraction from event-based data streams can be performed efficiently and with low energy consumption, offering great utility to real-world eye tracking pipelines. However, few eye feature extractors are designed to handle sudden changes in event density caused by the changes between gaze behaviors that vary in their kinematics, leading to degraded prediction performance. In this work,…

    Submitted 30 March, 2026; v1 submitted 14 March, 2026; originally announced March 2026.

    Comments: 8 pages, 3 figures, 1 table, accepted to ETRA 2026

    ACM Class: I.2.0; I.2.6; I.4.6; I.4.5

  38. arXiv:2603.12627  [pdf, ps, other]

    stat.ML cs.IT cs.LG

    Batched Kernelized Bandits: Refinements and Extensions

    Authors: Chenkai Ma, Keqin Chen, Jonathan Scarlett

    Abstract: In this paper, we consider the problem of black-box optimization with noisy feedback revealed in batches, where the unknown function to optimize has a bounded norm in some Reproducing Kernel Hilbert Space (RKHS). We refer to this as the Batched Kernelized Bandits problem, and refine and extend existing results on regret bounds. For algorithmic upper bounds, (Li and Scarlett, 2022) shows that… ▽ More

    Submitted 12 March, 2026; originally announced March 2026.

  39. arXiv:2603.10495  [pdf, ps, other]

    cs.CV

    IMTBench: A Multi-Scenario Cross-Modal Collaborative Evaluation Benchmark for In-Image Machine Translation

    Authors: Jiahao Lyu, Pei Fu, Zhenhang Li, Weichao Zeng, Shaojie Zhang, Jiahui Yang, Can Ma, Yu Zhou, Zhenbo Luo, Jian Luan

    Abstract: End-to-end In-Image Machine Translation (IIMT) aims to convert text embedded within an image into a target language while preserving the original visual context, layout, and rendering style. However, existing IIMT benchmarks are largely synthetic and thus fail to reflect real-world complexity, while current evaluation protocols focus on single-modality metrics and overlook cross-modal faithfulness… ▽ More

    Submitted 1 April, 2026; v1 submitted 11 March, 2026; originally announced March 2026.

  40. arXiv:2603.09245  [pdf, ps, other]

    cs.CV

    Towards Instance Segmentation with Polygon Detection Transformers

    Authors: Jiacheng Sun, Jiaqi Lin, Wenlong Hu, Haoyang Li, Xinghong Zhou, Chenghai Mao, Yan Peng, Xiaomao Li

    Abstract: One of the bottlenecks for instance segmentation today lies in the conflicting requirements of high-resolution inputs and lightweight, real-time inference. To address this bottleneck, we present a Polygon Detection Transformer (Poly-DETR) to reformulate instance segmentation as sparse vertex regression via Polar Representation, thereby eliminating the reliance on dense pixel-wise mask prediction.… ▽ More

    Submitted 10 March, 2026; originally announced March 2026.

  41. arXiv:2603.08862  [pdf, ps, other]

    cs.RO cs.LG

    APPLV: Adaptive Planner Parameter Learning from Vision-Language-Action Model

    Authors: Yuanjie Lu, Beichen Wang, Zhengqi Wu, Yang Li, Xiaomin Lin, Chengzhi Mao, Xuesu Xiao

    Abstract: Autonomous navigation in highly constrained environments remains challenging for mobile robots. Classical navigation approaches offer safety assurances but require environment-specific parameter tuning; end-to-end learning bypasses parameter tuning but struggles with precise control in constrained spaces. To this end, recent robot learning approaches automate parameter tuning while retaining class…

    Submitted 9 March, 2026; originally announced March 2026.

  42. arXiv:2603.08850  [pdf, ps, other

    cs.CV

    HECTOR: Hybrid Editable Compositional Object References for Video Generation

    Authors: Guofeng Zhang, Angtian Wang, Jacob Zhiyuan Fang, Liming Jiang, Haotian Yang, Alan Yuille, Chongyang Ma

    Abstract: Real-world videos naturally portray complex interactions among distinct physical objects, effectively forming dynamic compositions of visual elements. However, most current video generation models synthesize scenes holistically and therefore lack mechanisms for explicit compositional manipulation. To address this limitation, we propose HECTOR, a generative pipeline that enables fine-grained compos…

    Submitted 9 March, 2026; originally announced March 2026.

  43. arXiv:2603.07769  [pdf, ps, other

    cs.CV

    MedQ-Deg: A Multidimensional Benchmark for Evaluating MLLMs Across Medical Image Quality Degradations

    Authors: Jiyao Liu, Junzhi Ning, Chenglong Ma, Wanying Qu, Jianghan Shen, Siqi Luo, Jinjie Wei, Jin Ye, Pengze Li, Tianbin Li, Jiashi Lin, Hongming Shan, Xinzhe Luo, Xiaohong Liu, Lihao Liu, Junjun He, Ningsheng Xu

    Abstract: Despite impressive performance on standard benchmarks, multimodal large language models (MLLMs) face critical challenges in real-world clinical environments where medical images inevitably suffer various quality degradations. Existing benchmarks exhibit two key limitations: (1) absence of large-scale, multidimensional assessment across medical image quality gradients and (2) no systematic confiden…

    Submitted 8 March, 2026; originally announced March 2026.

    Comments: 29 pages, 11 figures

  44. arXiv:2603.02231  [pdf, ps, other

    cs.LG cs.AI

    Physics-Informed Neural Networks with Architectural Physics Embedding for Large-Scale Wave Field Reconstruction

    Authors: Huiwen Zhang, Feng Ye, Chu Ma

    Abstract: Large-scale wave field reconstruction requires precise solutions but faces challenges with computational efficiency and accuracy. Physics-based numerical methods such as the Finite Element Method (FEM) provide high accuracy but struggle with large-scale or high-frequency problems due to prohibitive computational costs. Pure data-driven approaches excel in speed but often lack sufficient labeled data…

    Submitted 12 February, 2026; originally announced March 2026.

    Comments: 20 pages, 17 figures

  45. arXiv:2603.01571  [pdf, ps, other

    cs.AI

    Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models

    Authors: Qiyuan Zhang, Yufei Wang, Tianhe Wu, Can Xu, Qingfeng Sun, Kai Zheng, Xue Liu, Chen Ma

    Abstract: Recent advancements in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances the reliability of evaluation. However, current works predominantly rely on unstructured length scaling, ignoring the divergent efficacy of different reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-dimensional principle coverage) and Depth…

    Submitted 2 March, 2026; originally announced March 2026.

  46. arXiv:2603.01562  [pdf, ps, other

    cs.AI

    RubricBench: Aligning Model-Generated Rubrics with Human Standards

    Authors: Qiyuan Zhang, Junyi Zhou, Yufei Wang, Fuyuan Lyu, Yidong Ming, Can Xu, Qingfeng Sun, Kai Zheng, Peng Kang, Xue Liu, Chen Ma

    Abstract: As Large Language Model (LLM) alignment evolves from simple completions to complex, highly sophisticated generation, Reward Models are increasingly shifting toward rubric-guided evaluation to mitigate surface-level biases. However, the community lacks a unified benchmark to assess this evaluation paradigm, as existing benchmarks lack both the discriminative complexity and the ground-truth rubric a…

    Submitted 3 March, 2026; v1 submitted 2 March, 2026; originally announced March 2026.

  47. arXiv:2602.23203  [pdf, ps, other

    cs.CV cs.AI

    ColoDiff: Integrating Dynamic Consistency With Content Awareness for Colonoscopy Video Generation

    Authors: Junhu Fu, Shuyu Liang, Wutong Li, Chen Ma, Peng Huang, Kehao Wang, Ke Chen, Shengli Lin, Pinghong Zhou, Zeju Li, Yuanyuan Wang, Yi Guo

    Abstract: Colonoscopy video generation delivers dynamic, information-rich data critical for diagnosing intestinal diseases, particularly in data-scarce scenarios. High-quality video generation demands temporal consistency and precise control over clinical attributes, but faces challenges from irregular intestinal structures, diverse disease representations, and various imaging modalities. To this end, we pr…

    Submitted 26 February, 2026; originally announced February 2026.

  48. arXiv:2602.21657  [pdf, ps, other

    cs.CV cs.AI

    Following the Diagnostic Trace: Visual Cognition-guided Cooperative Network for Chest X-Ray Diagnosis

    Authors: Shaoxuan Wu, Jingkun Chen, Chong Ma, Cong Shen, Xiao Zhang, Jun Feng

    Abstract: Computer-aided diagnosis (CAD) has significantly advanced automated chest X-ray diagnosis but remains isolated from clinical workflows and lacks reliable decision support and interpretability. Human-AI collaboration seeks to enhance the reliability of diagnostic models by integrating the behaviors of controllable radiologists. However, the absence of interactive tools seamlessly embedded within di…

    Submitted 25 February, 2026; originally announced February 2026.

  49. arXiv:2602.20735  [pdf, ps, other

    cs.IR cs.AI cs.CL

    RMIT-ADM+S at the MMU-RAG NeurIPS 2025 Competition

    Authors: Kun Ran, Marwah Alaofi, Danula Hettiachchi, Chenglong Ma, Khoi Nguyen Dinh Anh, Khoi Vo Nguyen, Sachin Pathiyan Cherumanal, Lida Rashidi, Falk Scholer, Damiano Spina, Shuoqi Sun, Oleg Zendel

    Abstract: This paper presents the award-winning RMIT-ADM+S system for the Text-to-Text track of the NeurIPS 2025 MMU-RAG Competition. We introduce Routing-to-RAG (R2RAG), a research-focused retrieval-augmented generation (RAG) architecture composed of lightweight components that dynamically adapt the retrieval strategy based on inferred query complexity and evidence sufficiency. The system uses sm…

    Submitted 24 February, 2026; originally announced February 2026.

    Comments: MMU-RAG NeurIPS 2025 winning system

  50. arXiv:2602.18500  [pdf, ps, other

    cs.CV cs.ET cs.HC

    Scaling Ultrasound Volumetric Reconstruction via Mobile Augmented Reality

    Authors: Kian Wei Ng, Yujia Gao, Deborah Khoo, Ying Zhen Tan, Chengzheng Mao, Haojie Cheng, Andrew Makmur, Kee Yuan Ngiam, Serene Goh, Eng Tat Khoo

    Abstract: Accurate volumetric characterization of lesions is essential for oncologic diagnosis, risk stratification, and treatment planning. While imaging modalities such as Computed Tomography provide high-quality 3D data, 2D ultrasound (2D-US) remains the preferred first-line modality for breast and thyroid imaging due to cost, portability, and safety factors. However, volume estimates derived from 2D-US…

    Submitted 17 February, 2026; originally announced February 2026.

    Comments: Submitted to MICCAI 2026