Showing 1–50 of 9,925 results for author: Li, J

Searching in archive cs.

  1. arXiv:2604.11804

    cs.CV

    OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

    Authors: Donghao Zhou, Guisheng Liu, Hao Yang, Jiatong Li, Jingyu Lin, Xiaohu Huang, Yichen Liu, Xin Gao, Cunjian Chen, Shilei Wen, Chi-Wing Fu, Pheng-Ann Heng

    Abstract: In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. Howeve…

    Submitted 13 April, 2026; originally announced April 2026.

    Comments: Project page: https://correr-zhou.github.io/OmniShow/

  2. arXiv:2604.11742

    cs.CL cs.AI

    Discourse Diversity in Multi-Turn Empathic Dialogue

    Authors: Hongli Zhan, Emma S. Gueorguieva, Javier Hernandez, Jina Suh, Desmond C. Ong, Junyi Jessy Li

    Abstract: Large language models (LLMs) produce responses rated as highly empathic in single-turn settings (Ayers et al., 2023; Lee et al., 2024), yet they are also known to be formulaic generators that reuse the same lexical patterns, syntactic templates, and discourse structures across tasks (Jiang et al., 2025; Shaib et al., 2024; Namuduri et al., 2025). Less attention has been paid to whether this formul…

    Submitted 13 April, 2026; originally announced April 2026.

  3. arXiv:2604.11554

    cs.CL

    Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale

    Authors: Liujie Zhang, Benzhe Ning, Rui Yang, Xiaoyan Yu, Jiaxing Li, Lumeng Wu, Jia Liu, Minghao Li, Weihang Chen, Weiqi Hu, Lei Zhang

    Abstract: Reinforcement learning (RL) post-training has proven effective at unlocking reasoning, self-reflection, and tool-use capabilities in large language models. As models extend to omni-modal inputs and agentic multi-turn workflows, RL training systems face three interdependent challenges: heterogeneous data flows, operational robustness at scale, and the staleness -- throughput tradeoff. We present \t…

    Submitted 13 April, 2026; originally announced April 2026.

    Comments: 17 pages, 22 figures

  4. arXiv:2604.11487

    cs.CV

    NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild

    Authors: Aleksandr Gushchin, Khaled Abud, Ekaterina Shumitskaya, Artem Filippov, Georgii Bychkov, Sergey Lavrushkin, Mikhail Erofeev, Anastasia Antsiferova, Changsheng Chen, Shunquan Tan, Radu Timofte, Dmitry Vatolin, Chuanbiao Song, Zijian Yu, Hao Tan, Jun Lan, Zhiqiang Yang, Yongwei Tang, Zhiqiang Wu, Jia Wen Seow, Hong Vin Koay, Haodong Ren, Feng Xu, Shuai Chen, Ruiyang Xia, et al. (29 additional authors not shown)

    Abstract: This paper presents an overview of the NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild, held in conjunction with the NTIRE workshop at CVPR 2026. The goal of this challenge was to develop detection models capable of distinguishing real images from generated ones in realistic scenarios: the images are often transformed (cropped, resized, compressed, blurred) for practical us…

    Submitted 13 April, 2026; originally announced April 2026.

    Comments: CVPR 2026 NTIRE Workshop Paper, Robust AI-Generated Image Detection Technical Report

  5. arXiv:2604.11259

    cs.AI cs.CR

    Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization

    Authors: Zhixin Lin, Jungang Li, Dongliang Xu, Shidong Pan, Yibo Shi, Yuchi Liu, Yuecong Min, Yue Yao

    Abstract: Mobile GUI agents powered by Multimodal Large Language Models (MLLMs) can execute complex tasks on mobile devices. Despite this progress, most existing systems still optimize task success or efficiency, neglecting users' privacy personalization. In this paper, we study the often-overlooked problem of agent personalization. We observe that personalization can induce systematic structural heterogene…

    Submitted 13 April, 2026; originally announced April 2026.

    Comments: 10 pages, 6 figures, 3 tables

  6. arXiv:2604.11197

    cs.CV

    MedP-CLIP: Medical CLIP with Region-Aware Prompt Integration

    Authors: Jiahui Peng, He Yao, Jingwen Li, Yanzhou Su, Sibo Ju, Yujie Lu, Jin Ye, Hongchun Lu, Xue Li, Lincheng Jiang, Min Zhu, Junlong Cheng

    Abstract: Contrastive Language-Image Pre-training (CLIP) has demonstrated outstanding performance in global image understanding and zero-shot transfer through large-scale text-image alignment. However, the core of medical image analysis often lies in the fine-grained understanding of specific anatomical structures or lesion regions. Therefore, precisely comprehending region-of-interest (RoI) information pro…

    Submitted 13 April, 2026; originally announced April 2026.

  7. arXiv:2604.10962

    cs.RO

    ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching

    Authors: Xiaotian Qiu, Lukai Chen, Jinhao Li, Qi Sun, Cheng Zhuo, Guohao Dai

    Abstract: Flow Matching (FM) policies have emerged as an efficient backbone for robotic control, offering fast and expressive action generation that underpins recent large-scale embodied AI systems. However, FM policies trained via imitation learning inherit the limitations of demonstration data; surpassing suboptimal behaviors requires reinforcement learning (RL) fine-tuning. Recent methods convert determi…

    Submitted 12 April, 2026; originally announced April 2026.

    Comments: 20 pages, 19 figures

    ACM Class: I.2.6; I.2.9

  8. arXiv:2604.10892

    cs.RO cs.MA

    HECTOR: Human-centric Hierarchical Coordination and Supervision of Robotic Fleets under Continual Temporal Tasks

    Authors: Shen Wang, Yinhang Luo, Jie Li, Meng Guo

    Abstract: Robotic fleets can be extremely efficient when working concurrently and collaboratively, e.g., for delivery, surveillance, search and rescue. However, it can be demanding or even impractical for an operator to directly control each robot. Thus, autonomy of the fleet and its online interaction with the operator are both essential, particularly in dynamic and partially unknown environments. The oper…

    Submitted 12 April, 2026; originally announced April 2026.

  9. arXiv:2604.10627

    cs.CL cs.AI cs.CE

    Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment

    Authors: Yang Cui, Jingyuan Sun, Yizheng Sun, Yifan Wang, Yunhao Zhang, Jixing Li, Shaonan Wang, Hongpeng Zhou, John Hale, Chengqing Zong, Goran Nenadic

    Abstract: How the brain supports language across different languages is a basic question in neuroscience and a useful test for multilingual artificial intelligence. Neuroimaging has identified language-responsive brain regions across languages, but it cannot by itself show whether the underlying processing is shared or language-specific. Here we use six multilingual large language models (LLMs) as controlla…

    Submitted 12 April, 2026; originally announced April 2026.

    Comments: 23 pages, 5 figures, Journal format

  10. arXiv:2604.10598

    cs.RO

    AWARE: Adaptive Whole-body Active Rotating Control for Enhanced LiDAR-Inertial Odometry under Human-in-the-Loop Interaction

    Authors: Yizhe Zhang, Jianping Li, Liangliang Yin, Zhen Dong, Bisheng Yang

    Abstract: Human-in-the-loop (HITL) UAV operation is essential in complex and safety-critical aerial surveying environments, where human operators provide navigation intent while onboard autonomy must maintain accurate and robust state estimation. A key challenge in this setting is that resource-constrained UAV platforms are often limited to narrow-field-of-view LiDAR sensors. In geometrically degenerate or…

    Submitted 12 April, 2026; originally announced April 2026.

  11. arXiv:2604.10577

    cs.CR cs.AI

    The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

    Authors: Xuwei Ding, Skylar Zhai, Linxin Song, Jiate Li, Taiwei Shi, Nicholas Meade, Siva Reddy, Jian Kang, Jieyu Zhao

    Abstract: Computer-use agents (CUAs) can now autonomously complete complex tasks in real digital environments, but when misled, they can also be used to automate harmful actions programmatically. Existing safety evaluations largely target explicit threats such as misuse and prompt injection, but overlook a subtle yet critical setting where user instructions are entirely benign and harm arises from the task…

    Submitted 12 April, 2026; originally announced April 2026.

    Comments: 63 pages

  12. arXiv:2604.10541

    cs.CV

    Bidirectional Learning of Facial Action Units and Expressions via Structured Semantic Mapping across Heterogeneous Datasets

    Authors: Jia Li, Yu Zhang, Yin Chen, Zhenzhen Hu, Yong Li, Richang Hong, Shiguang Shan, Meng Wang

    Abstract: Facial action unit (AU) detection and facial expression (FE) recognition can be jointly viewed as affective facial behavior tasks, representing fine-grained muscular activations and coarse-grained holistic affective states, respectively. Despite their inherent semantic correlation, existing studies predominantly focus on knowledge transfer from AUs to FEs, while bidirectional learning remains insu…

    Submitted 12 April, 2026; originally announced April 2026.

    Comments: 18 pages, 11 figures

  13. arXiv:2604.10532

    cs.CV

    The Second Challenge on Real-World Face Restoration at NTIRE 2026: Methods and Results

    Authors: Jingkai Wang, Jue Gong, Zheng Chen, Kai Liu, Jiatong Li, Yulun Zhang, Radu Timofte, Jiachen Tu, Yaokun Shi, Guoyi Xu, Yaoxin Jiang, Jiajia Liu, Yingsi Chen, Yijiao Liu, Hui Li, Yu Wang, Congchao Zhu, Alexandru-Gabriel Lefterache, Anamaria Radoi, Chuanyue Yan, Tao Lu, Yanduo Zhang, Kanghui Zhao, Jiaming Wang, Yuqi Li, et al. (28 additional authors not shown)

    Abstract: This paper provides a review of the NTIRE 2026 challenge on real-world face restoration, highlighting the proposed solutions and the resulting outcomes. The challenge focuses on generating natural and realistic outputs while maintaining identity consistency. Its goal is to advance state-of-the-art solutions for perceptual quality and realism, without imposing constraints on computational resources…

    Submitted 12 April, 2026; originally announced April 2026.

    Comments: NTIRE 26: https://cvlai.net/ntire/2026 . NTIRE Real-World Face Restoration: https://ntire-face.github.io/2026/ . CVPR 2026 Workshop

  14. arXiv:2604.10484

    cs.AR

    Strix: Re-thinking NPU Reliability from a System Perspective

    Authors: Jiapeng Guan, Jie Zhang, Hao Zhou, Ran Wei, Dean You, Hui Wang, Yingquan Wang, Tinglue Wang, Xudong Zhao, Jing Li, Zhe Jiang

    Abstract: DNNs and LLMs increasingly rely on hardware accelerators, including in safety-critical domains, while technology scaling and growing model complexity make hardware faults more frequent. Existing system-level mechanisms typically treat the NPU as a monolithic unit, using coarse-grained replication that incurs prohibitive performance and hardware overheads, leaving a gap between reliability requirem…

    Submitted 12 April, 2026; originally announced April 2026.

    Comments: This paper has been accepted for publication at DAC 2026

  15. arXiv:2604.10409

    cs.CV cs.AI

    IMPACT: A Dataset for Multi-Granularity Human Procedural Action Understanding in Industrial Assembly

    Authors: Di Wen, Zeyun Zhong, David Schneider, Manuel Zaremski, Linus Kunzmann, Yitian Shi, Ruiping Liu, Yufan Chen, Junwei Zheng, Jiahang Li, Jonas Hemmerich, Qiyi Tong, Patric Grauberger, Arash Ajoudani, Danda Pani Paudel, Sven Matthiesen, Barbara Deml, Jürgen Beyerer, Luc Van Gool, Rainer Stiefelhagen, Kunyu Peng

    Abstract: We introduce IMPACT, a synchronized five-view RGB-D dataset for deployment-oriented industrial procedural understanding, built around real assembly and disassembly of a commercial angle grinder with professional-grade tools. To our knowledge, IMPACT is the first real industrial assembly benchmark that jointly provides synchronized ego-exo RGB-D capture, decoupled bimanual annotation, compliance-aw…

    Submitted 11 April, 2026; originally announced April 2026.

    Comments: 9 pages, 2 figures, benchmark and dataset are available at https://github.com/Kratos-Wen/IMPACT

  16. arXiv:2604.10321

    cs.CV

    NTIRE 2026 Challenge on Single Image Reflection Removal in the Wild: Datasets, Results, and Methods

    Authors: Jie Cai, Kangning Yang, Zhiyuan Li, Florin-Alexandru Vasluianu, Radu Timofte, Jinlong Li, Jinglin Shen, Zibo Meng, Junyan Cao, Lu Zhao, Pengwei Liu, Yuyi Zhang, Fengjun Guo, Jiagao Hu, Zepeng Wang, Fei Wang, Daiguo Zhou, Yi'ang Chen, Honghui Zhu, Mengru Yang, Yan Luo, Kui Jiang, Jin Guo, Jonghyuk Park, Jae-Young Sim, et al. (28 additional authors not shown)

    Abstract: In this paper, we review the NTIRE 2026 challenge on single-image reflection removal (SIRR) in the Wild. SIRR is a fundamental task in image restoration. Despite progress in academic research, most methods are tested on synthetic images or limited real-world images, creating a gap in real-world applications. In this challenge, we provide participants with the OpenRR-5k dataset, which requires them…

    Submitted 11 April, 2026; originally announced April 2026.

  17. arXiv:2604.10299

    cs.CV cs.CL

    Seeing No Evil: Blinding Large Vision-Language Models to Safety Instructions via Adversarial Attention Hijacking

    Authors: Jingru Li, Wei Ren, Tianqing Zhu

    Abstract: Large Vision-Language Models (LVLMs) rely on attention-based retrieval of safety instructions to maintain alignment during generation. Existing attacks typically optimize image perturbations to maximize harmful output likelihood, but suffer from slow convergence due to gradient conflict between adversarial objectives and the model's safety-retrieval mechanism. We propose Attention-Guided Visual Ja…

    Submitted 11 April, 2026; originally announced April 2026.

    Comments: Accepted to ACL 2026. Code: https://github.com/Landsayy/AttentionJailbreak

  18. arXiv:2604.10233

    cs.CV cs.AI

    Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis

    Authors: Yang Yu, Dunyuan Xu, Yaoqian Li, Xiaomeng Li, Jinpeng Li, Pheng-Ann Heng

    Abstract: 3D medical image analysis is of great importance in disease diagnosis and treatment. Recently, multimodal large language models (MLLMs) have exhibited robust perceptual capacity, strong cross-modal alignment, and promising generalizability. Therefore, they have great potential to improve the performance of medical report generation (MRG) and medical visual question answering (MVQA), which serve as…

    Submitted 11 April, 2026; originally announced April 2026.

  19. arXiv:2604.10125

    cs.CV

    PhyMix: Towards Physically Consistent Single-Image 3D Indoor Scene Generation with Implicit--Explicit Optimization

    Authors: Dongli Wu, Jingyu Hu, Ka-Hei Hui, Xiaobao Wei, Chengwen Luo, Jianqiang Li, Zhengzhe Liu

    Abstract: Existing single-image 3D indoor scene generators often produce results that look visually plausible but fail to obey real-world physics, limiting their reliability in robotics, embodied AI, and design. To examine this gap, we introduce a unified Physics Evaluator that measures four main aspects: geometric priors, contact, stability, and deployability, which are further decomposed into nine sub-con…

    Submitted 11 April, 2026; originally announced April 2026.

  20. arXiv:2604.10101

    cs.CL

    Who Wrote This Line? Evaluating the Detection of LLM-Generated Classical Chinese Poetry

    Authors: Jiang Li, Tian Lan, Shanshan Wang, Dongxing Zhang, Dianqing Lin, Guanglai Gao, Derek F. Wong, Xiangdong Su

    Abstract: The rapid development of large language models (LLMs) has extended text generation tasks into the literary domain. However, AI-generated literary creations has raised increasingly prominent issues of creative authenticity and ethics in literary world, making the detection of LLM-generated literary texts essential and urgent. While previous works have made significant progress in detecting AI-gener…

    Submitted 11 April, 2026; originally announced April 2026.

    Comments: Accepted to ACL 2026 Main Conference

  21. arXiv:2604.10062

    cs.LG

    When Can You Poison Rewards? A Tight Characterization of Reward Poisoning in Linear MDPs

    Authors: Jose Efraim Aguilar Escamilla, Haoyang Hong, Jiawei Li, Haoyu Zhao, Xuezhou Zhang, Sanghyun Hong, Huazheng Wang

    Abstract: We study reward poisoning attacks in reinforcement learning (RL), where an adversary manipulates rewards within constrained budgets to force the target RL agent to adopt a policy that aligns with the attacker's objectives. Prior works on reward poisoning mainly focused on sufficient conditions to design a successful attacker, while only a few studies discussed the infeasibility of targeted attacks…

    Submitted 11 April, 2026; originally announced April 2026.

  22. arXiv:2604.10044

    cs.AI

    LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention

    Authors: Dongjie Xu, Hao Wu, Weijie Shi, Yue Cui, Yuanjun Liu, Jiawei Li, Haolun Ma, An Liu, Jia Zhu, Jiajie Xu

    Abstract: Through systematic experiments on long-context generation, we observe a damaging failure mode in which decoding can collapse into persistent repetition loops. We find that this degeneration is driven by collapsed attention patterns, where a subset of heads locks onto a narrow suffix of the history, and is further stabilized by inference-time KV cache reuse. Crucially, since many existing KV cache…

    Submitted 11 April, 2026; originally announced April 2026.

  23. arXiv:2604.10016

    astro-ph.SR cs.LG

    Predicting Associations between Solar Flares and Coronal Mass Ejections Using SDO/HMI Magnetograms and a Hybrid Neural Network

    Authors: Jialiang Li, Vasyl Yurchyshyn, Jason T. L. Wang, Haimin Wang, Manolis K. Georgoulis, Wen He, Yasser Abduallah, Hameedullah A. Farooki, Yan Xu

    Abstract: Solar eruptions, including flares and coronal mass ejections (CMEs), have a significant impact on Earth. Some flares are associated with CMEs, and some flares are not. The association between flares and CMEs is not always obvious. In this study, we propose a new deep learning method, specifically a hybrid neural network (HNN) that combines a vision transformer with long short-term memory, to predi…

    Submitted 11 April, 2026; originally announced April 2026.

    Comments: 14 pages, 8 figures

  24. arXiv:2604.09814

    cs.CV

    RobustMedSAM: Degradation-Resilient Medical Image Segmentation via Robust Foundation Model Adaptation

    Authors: Jieru Li, Matthew Chen, Micky C. Nnamdi, J. Ben Tamo, Benoit L. Marteau, May D. Wang

    Abstract: Medical image segmentation models built on Segment Anything Model (SAM) achieve strong performance on clean benchmarks, yet their reliability often degrades under realistic image corruptions such as noise, blur, motion artifacts, and modality-specific distortions. Existing approaches address either medical-domain adaptation or corruption robustness, but not both jointly. In SAM, we find that these…

    Submitted 10 April, 2026; originally announced April 2026.

    Comments: 14 pages, 9 figures

  25. arXiv:2604.09748

    cs.CR cs.AI

    Backdoors in RLVR: Jailbreak Backdoors in LLMs From Verifiable Reward

    Authors: Weiyang Guo, Zesheng Shi, Zeen Zhu, Yuan Zhou, Min Zhang, Jing Li

    Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is an emerging paradigm that significantly boosts a Large Language Model's (LLM's) reasoning abilities on complex logical tasks, such as mathematics and programming. However, we identify, for the first time, a latent vulnerability to backdoor attacks within the RLVR framework. This attack can implant a backdoor without modifying the reward veri…

    Submitted 10 April, 2026; originally announced April 2026.

    Comments: 20 pages, 8 figures, published in ACL 2026

  26. arXiv:2604.09710

    cs.CV cs.LG

    Robust Fair Disease Diagnosis in CT Images

    Authors: Justin Li, Daniel Ding, Asmita Yuki Pritha, Aryana Hou, Xin Wang, Shu Hu

    Abstract: Automated diagnosis from chest CT has improved considerably with deep learning, but models trained on skewed datasets tend to perform unevenly across patient demographics. However, the situation is worse than simple demographic bias. In clinical data, class imbalance and group underrepresentation often coincide, creating compound failure modes that neither standard rebalancing nor fairness correct…

    Submitted 7 April, 2026; originally announced April 2026.

    Comments: 8 pages, 3 figures, 2 tables. Accepted at the 3rd Workshop on New Trends in AI-Generated Media and Security (AIMS) @ CVPR 2026

  27. arXiv:2604.09553

    cs.IR cs.AI

    SRBench: A Comprehensive Benchmark for Sequential Recommendation with Large Language Models

    Authors: Jianhong Li, Zeheng Qian, Wangze Ni, Haoyang Li, Hongwei Yao, Yang Bai, Kui Ren

    Abstract: LLM development has aroused great interest in Sequential Recommendation (SR) applications. However, comprehensive evaluation of SR models remains lacking due to the limitations of the existing benchmarks: 1) an overemphasis on accuracy, ignoring other real-world demands (e.g., fairness); 2) existing datasets fail to unleash LLMs' potential, leading to unfair comparison between Neural-Network-based…

    Submitted 30 January, 2026; originally announced April 2026.

  28. arXiv:2604.09532

    cs.CV cs.AI

    Seeing is Believing: Robust Vision-Guided Cross-Modal Prompt Learning under Label Noise

    Authors: Zibin Geng, Xuefeng Jiang, Jia Li, Zheng Li, Tian Wen, Lvhua Wu, Sheng Sun, Yuwei Wang, Min Liu

    Abstract: Prompt learning is a parameter-efficient approach for vision-language models, yet its robustness under label noise is less investigated. Visual content contains richer and more reliable semantic information, which remains more robust under label noise. However, the prompt itself is highly susceptible to label noise. Motivated by this intuition, we propose VisPrompt, a lightweight and robust vision…

    Submitted 10 April, 2026; originally announced April 2026.

  29. arXiv:2604.09455

    cs.AI

    E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning

    Authors: Weiyang Guo, Zesheng Shi, Liye Zhao, Jiayuan Ma, Zeen Zhu, Junxian He, Min Zhang, Jing Li

    Abstract: While Large Language Models (LLMs) have demonstrated significant potential in Tool-Integrated Reasoning (TIR), existing training paradigms face significant limitations: Zero-RL suffers from inefficient exploration and mode degradation due to a lack of prior guidance, while SFT-then-RL is limited by high data costs and capability plateaus caused by low-entropy collapse. To address these challenges,…

    Submitted 10 April, 2026; originally announced April 2026.

    Comments: 22 pages, 10 figures, published in ACL 2026

  30. arXiv:2604.09415

    cs.CV cs.AI cs.LG cs.RO

    PhysInOne: Visual Physics Learning and Reasoning in One Suite

    Authors: Siyuan Zhou, Hejun Wang, Hu Cheng, Jinxi Li, Dongsheng Wang, Junwei Jiang, Yixiao Jin, Jiayue Huang, Shiwei Mao, Shangjia Liu, Yafei Yang, Hongkang Song, Shenxing Wei, Zihui Zhang, Peng Huang, Shijie Liu, Zhengli Hao, Hao Li, Yitian Li, Wenqi Zhou, Zhihan Zhao, Zongqi He, Hongtao Wen, Shouwang Huang, Peng Yun, et al. (14 additional authors not shown)

    Abstract: We present PhysInOne, a large-scale synthetic dataset addressing the critical scarcity of physically-grounded training data for AI systems. Unlike existing datasets limited to merely hundreds or thousands of examples, PhysInOne provides 2 million videos across 153,810 dynamic 3D scenes, covering 71 basic physical phenomena in mechanics, optics, fluid dynamics, and magnetism. Distinct from previous…

    Submitted 10 April, 2026; originally announced April 2026.

    Comments: CVPR 2026. Siyuan, Hejun, Hu, Jinxi, Dongsheng, Junwei, Yixiao, Jiayue, and Shiwei are co-first authors. Project page: https://vlar-group.github.io/PhysInOne.html

  31. arXiv:2604.09231

    cs.CV

    Hitem3D 2.0: Multi-View Guided Native 3D Texture Generation

    Authors: Huiang He, Shengchu Zhao, Jianwen Huang, Jie Li, Jiaqi Wu, Hu Zhang, Pei Tang, Heliang Zheng, Yukun Li, Rongfei Jia

    Abstract: Although recent advances have improved the quality of 3D texture generation, existing methods still struggle with incomplete texture coverage, cross-view inconsistency, and misalignment between geometry and texture. To address these limitations, we propose Hitem3D 2.0, a multi-view guided native 3D texture generation framework that enhances texture quality through the integration of 2D multi-view…

    Submitted 10 April, 2026; originally announced April 2026.

    Comments: 13 pages

  32. arXiv:2604.09199

    cs.CV

    Globally Optimal Pose from Orthographic Silhouettes

    Authors: Agniva Sengupta, Dilara Kuş, Jianning Li, Stefan Zachow

    Abstract: We solve the problem of determining the pose of known shapes in $\mathbb{R}^3$ from their unoccluded silhouettes. The pose is determined up to global optimality using a simple yet under-explored property of the area-of-silhouette: its continuity w.r.t trajectories in the rotation space. The proposed method utilises pre-computed silhouette-signatures, modelled as a response surface of the area-of-s…

    Submitted 10 April, 2026; originally announced April 2026.

    Journal ref: IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026. Denver, Colorado

  33. arXiv:2604.09144

    quant-ph cs.NI

    QuIKS: Near-Zero Latency Key Supply with Adaptive Buffering for Resource-Efficient Quantum Key Distribution Networks

    Authors: Yuxin Chen, Zite Xia, Jian Li, Kaiping Xue, Zhonghui Li, Lutong Chen, Ruidong Li

    Abstract: Quantum key distribution (QKD) networks provide information-theoretically secure keys for distant parties, emerging as a vital alternative to classical cryptography infrastructures threatened by quantum computing. In QKD networks, the immediacy of key supply service is crucial to the security and performance of applications, as their data must be encrypted before transmission. While key buffering…

    Submitted 10 April, 2026; originally announced April 2026.

  34. arXiv:2604.09142

    cs.CV

    Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching

    Authors: Jiahao Li, Xinhong Chen, Zhengmin Jiang, Cheng Huang, Yung-Hui Li, Jianping Wang

    Abstract: Despite remarkable advances in image-driven stereo matching over the past decade, Synthetic-to-Realistic Zero-Shot (Syn-to-Real) generalization remains an open challenge. This suboptimal generalization performance mainly stems from cross-domain shifts and ill-posed ambiguities inherent in image textures, particularly in occluded, textureless, repetitive, and non-Lambertian (specular/transparent) r…

    Submitted 10 April, 2026; originally announced April 2026.

  35. arXiv:2604.09000

    cs.CV

    StreamMeCo: Long-Term Agent Memory Compression for Efficient Streaming Video Understanding

    Authors: Junxi Wang, Te Sun, Jiayi Zhu, Junxian Li, Haowen Xu, Zichen Wen, Xuming Hu, Zhiyu Li, Linfeng Zhang

    Abstract: Vision agent memory has shown remarkable effectiveness in streaming video understanding. However, storing such memory for videos incurs substantial memory overhead, leading to high costs in both storage and computation. To address this issue, we propose StreamMeCo, an efficient Stream Agent Memory Compression framework. Specifically, based on the connectivity of the memory graph, StreamMeCo introd…

    Submitted 10 April, 2026; originally announced April 2026.

    Comments: ACL 2026 Findings

  36. arXiv:2604.08995

    cs.CV

    Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

    Authors: Zile Wang, Zexiang Liu, Jiaxing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, Yidan Xietian, Jiangbo Pei, Liang Hu, Boyi Jiang, Hua Xue, Zidong Wang, Haofeng Sun, Wei Li, Wanli Ouyang, Xianglong He, Yang Liu, Yangguang Li, Yahui Zhou

    Abstract: With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to simultaneously achieve memory-enabled long-term temporal consistency and high-resolution real-time generation, limiting their applicability in real-world scenarios. To address this, we present Matrix-Game 3.0, a memory…

    Submitted 12 April, 2026; v1 submitted 10 April, 2026; originally announced April 2026.

    Comments: Project page: https://matrix-game-v3.github.io/

  37. arXiv:2604.08479

    cs.CL

    AI generates well-liked but templatic empathic responses

    Authors: Emma Gueorguieva, Hongli Zhan, Jina Suh, Javier Hernandez, Tatiana Lau, Junyi Jessy Li, Desmond C. Ong

    Abstract: Recent research shows that greater numbers of people are turning to Large Language Models (LLMs) for emotional support, and that people rate LLM responses as more empathic than human-written responses. We suggest a reason for this success: LLMs have learned and consistently deploy a well-liked template for expressing empathy. We develop a taxonomy of 10 empathic language "tactics" that include val…

    Submitted 9 April, 2026; originally announced April 2026.

  38. arXiv:2604.08384

    eess.AS cs.AI

    TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs

    Authors: Jing Peng, Chenghao Wang, Yi Yang, Lirong Qian, Junjie Li, Yu Xi, Shuai Wang, Kai Yu

    Abstract: Speech LLM post-training increasingly relies on efficient cross-modal alignment and robust low-resource adaptation, yet collecting large-scale audio-text pairs remains costly. Text-only alignment methods such as TASU reduce this burden by simulating CTC posteriors from transcripts, but they provide limited control over uncertainty and error rate, making curriculum design largely heuristic. We prop…

    Submitted 9 April, 2026; originally announced April 2026.

  39. arXiv:2604.08364

    cs.CV

    MegaStyle: Constructing Diverse and Scalable Style Dataset via Consistent Text-to-Image Style Mapping

    Authors: Junyao Gao, Sibo Liu, Jiaxing Li, Yanan Sun, Yuanpeng Tu, Fei Shen, Weidong Zhang, Cairong Zhao, Jun Zhang

    Abstract: In this paper, we introduce MegaStyle, a novel and scalable data curation pipeline that constructs an intra-style consistent, inter-style diverse and high-quality style dataset. We achieve this by leveraging the consistent text-to-image style mapping capability of current large generative models, which can generate images in the same style from a given style description. Building on this foundatio…

    Submitted 9 April, 2026; originally announced April 2026.

    Comments: project website https://jeoyal.github.io/MegaStyle/

  40. arXiv:2604.08281  [pdf, ps, other]

    cs.CL

    When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning

    Authors: Ruotao Xu, Yixin Ji, Yu Luo, Jinpeng Li, Dong Li, Peifeng Li, Juntao Li, Min Zhang

    Abstract: Large reasoning models (LRMs) have achieved strong performance enhancement through scaling test time computation, but due to the inherent limitations of the underlying language models, they still have shortcomings in tasks that require precise computation and extensive knowledge reserves. Tool-Integrated Reasoning (TIR) has emerged as a promising paradigm that incorporates tool call and execution…

    Submitted 9 April, 2026; originally announced April 2026.

  41. arXiv:2604.08168  [pdf, ps, other]

    cs.RO cs.AI

    ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

    Authors: Jindi Lv, Hao Li, Jie Li, Yifei Nie, Fankun Kong, Yang Wang, Xiaofeng Wang, Zheng Zhu, Chaojun Ni, Qiuping Deng, Hengtao Li, Jiancheng Lv, Guan Huang

    Abstract: Vision-language-action (VLA) models have advanced robot manipulation through large-scale pretraining, but real-world deployment remains challenging due to partial observability and delayed feedback. Reinforcement learning addresses this via value functions, which assess task progress and guide policy improvement. However, existing value models built on vision-language models (VLMs) struggle to cap…

    Submitted 9 April, 2026; originally announced April 2026.

  42. arXiv:2604.08038  [pdf, ps, other]

    cs.CV

    Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection

    Authors: Jun Li, Yingying Shi, Zhixuan Ruan, Nan Guo, Jianhua Xu

    Abstract: In a real-world traffic scenario, varying-scale objects are usually distributed in a cluttered background, which poses great challenges to accurate detection. Although current Mamba-based methods can efficiently model long-range dependencies, they still struggle to capture small objects with abundant local details, which hinders joint modeling of local structures and global semantics. Moreover, st…

    Submitted 9 April, 2026; originally announced April 2026.

  43. arXiv:2604.07728  [pdf, ps, other]

    cs.CV cs.GR cs.RO

    GEAR: GEometry-motion Alternating Refinement for Articulated Object Modeling with Gaussian Splatting

    Authors: Jialin Li, Bin Fu, Ruiping Wang, Xilin Chen

    Abstract: High-fidelity interactive digital assets are essential for embodied intelligence and robotic interaction, yet articulated objects remain challenging to reconstruct due to their complex structures and coupled geometry-motion relationships. Existing methods suffer from instability in geometry-motion joint optimization, while their generalization remains limited on complex multi-joint or out-of-distr…

    Submitted 8 April, 2026; originally announced April 2026.

    Comments: Accepted to CVPRF2026

  44. arXiv:2604.07723  [pdf, ps, other]

    cs.CV

    Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation

    Authors: Jiahao Li, Yang Lu, Yachao Zhang, Fangyong Wang, Yuan Xie, Yanyun Qu

    Abstract: Open-vocabulary semantic segmentation (OVSS) aims to segment arbitrary category regions in images using open-vocabulary prompts, necessitating that existing methods possess pixel-level vision-language alignment capability. Typically, this capability involves computing the cosine similarity, i.e., logits, between visual and linguistic features, and minimizing the distribution discrepancy between the…

    Submitted 8 April, 2026; originally announced April 2026.

    Comments: Accepted by CVPR 2026

  45. arXiv:2604.07394  [pdf, ps, other]

    cs.LG cs.CL

    Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference

    Authors: Quantong Qiu, Zhiyi Hong, Yi Yang, Haitian Wang, Kebin Liu, Qingqing Dang, Juntao Li, Min Zhang

    Abstract: The quadratic computational complexity of standard attention mechanisms presents a severe scalability bottleneck for LLMs in long-context scenarios. While hybrid attention mechanisms combining Full Attention (FA) and Sparse Attention (SA) offer a potential solution, existing methods typically rely on static allocation ratios that fail to accommodate the variable retrieval demands of different task…

    Submitted 8 April, 2026; originally announced April 2026.

  46. arXiv:2604.07361  [pdf, ps, other]

    cs.LG

    BLEG: LLM Functions as Powerful fMRI Graph-Enhancer for Brain Network Analysis

    Authors: Rui Dong, Zitong Wang, Jiaxing Li, Weihuang Zheng, Youyong Kong

    Abstract: Graph Neural Networks (GNNs) have been widely used in diverse brain network analysis tasks based on preprocessed functional magnetic resonance imaging (fMRI) data. However, their performances are constrained due to high feature sparsity and inherent limitations of domain knowledge within uni-modal neurographs. Meanwhile, large language models (LLMs) have demonstrated powerful representation capabi…

    Submitted 10 April, 2026; v1 submitted 1 April, 2026; originally announced April 2026.

  47. arXiv:2604.07273  [pdf, ps, other]

    cs.CV

    GenLCA: 3D Diffusion for Full-Body Avatars from In-the-Wild Videos

    Authors: Yiqian Wu, Rawal Khirodkar, Egor Zakharov, Timur Bagautdinov, Lei Xiao, Zhaoen Su, Shunsuke Saito, Xiaogang Jin, Junxuan Li

    Abstract: We present GenLCA, a diffusion-based generative model for generating and editing photorealistic full-body avatars from text and image inputs. The generated avatars are faithful to the inputs, while supporting high-fidelity facial and full-body animations. The core idea is a novel paradigm that enables training a full-body 3D diffusion model from partially observable 2D data, allowing the training…

    Submitted 9 April, 2026; v1 submitted 8 April, 2026; originally announced April 2026.

  48. arXiv:2604.07171  [pdf, ps, other]

    cs.LG

    Smart Commander: A Hierarchical Reinforcement Learning Framework for Fleet-Level PHM Decision Optimization

    Authors: Yong Si, Mingfei Lu, Jing Li, Yang Hu, Guijiang Li, Yueheng Song, Zhaokui Wang

    Abstract: Decision-making in military aviation Prognostics and Health Management (PHM) faces significant challenges due to the "curse of dimensionality" in large-scale fleet operations, combined with sparse feedback and stochastic mission profiles. To address these issues, this paper proposes Smart Commander, a novel Hierarchical Reinforcement Learning (HRL) framework designed to optimize sequential mainten…

    Submitted 8 April, 2026; originally announced April 2026.

    Comments: 21 pages, 6 figures, 4 tables

  49. arXiv:2604.06987  [pdf, ps, other]

    cs.CV cs.AI cs.CR

    CAAP: Capture-Aware Adversarial Patch Attacks on Palmprint Recognition Models

    Authors: Renyang Liu, Jiale Li, Jie Zhang, Cong Wu, Xiaojun Jia, Shuxin Li, Wei Zhou, Kwok-Yan Lam, See-kiong Ng

    Abstract: Palmprint recognition is deployed in security-critical applications, including access control and palm-based payment, due to its contactless acquisition and highly discriminative ridge-and-crease textures. However, the robustness of deep palmprint recognition systems against physically realizable attacks remains insufficiently understood. Existing studies are largely confined to the digital settin…

    Submitted 8 April, 2026; originally announced April 2026.

  50. arXiv:2604.06925  [pdf, ps, other]

    cs.MM

    LungCURE: Benchmarking Multimodal Real-World Clinical Reasoning for Precision Lung Cancer Diagnosis and Treatment

    Authors: Fangyu Hao, Jiayu Yang, Yifan Zhu, Zijun Yu, Qicen Wu, Wang Yunlong, Jiawei Li, Yulin Liu, Xu Zeng, Guanting Chen, Shihao Li, Zhonghong Ou, Meina Song, Mengyang Sun, Haoran Luo, Yu Shi, Yingyi Wang

    Abstract: Lung cancer clinical decision support demands precise reasoning across complex, multi-stage oncological workflows. Existing multimodal large language models (MLLMs) fail to handle guideline-constrained staging and treatment reasoning. We formalize three oncological precision treatment (OPT) tasks for lung cancer, spanning TNM staging, treatment recommendation, and end-to-end clinical decision supp…

    Submitted 9 April, 2026; v1 submitted 8 April, 2026; originally announced April 2026.

    Comments: 20 pages, 22 figures