Showing 1–50 of 2,866 results for author: Zha, Z

Searching in archive cs.
  1. arXiv:2604.14932

    cs.AI

    WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training

    Authors: Yifu Chen, Shengpeng Ji, Qian Chen, Tianle Liang, Yangzhuo Li, Ziqing Wang, Wen Wang, Jingyu Lu, Haoxiao Wang, Xueyi Pu, Fan Zhuo, Zhou Zhao

    Abstract: End-to-end spoken dialogue models have garnered significant attention because they offer a higher potential ceiling in expressiveness and perceptual ability than cascaded systems. However, the intelligence and expressiveness of current open-source spoken dialogue models often remain below expectations. Motivated by the success of online reinforcement learning (RL) in other domains, one might attemp…

    Submitted 16 April, 2026; originally announced April 2026.

  2. arXiv:2604.14920

    cs.AI

    Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models

    Authors: Yifu Chen, Shengpeng Ji, Zhengqing Liu, Qian Chen, Wen Wang, Ziqing Wang, Yangzhuo Li, Tianle Liang, Zhou Zhao

    Abstract: Achieving seamless, human-like interaction remains a key challenge for full-duplex spoken dialogue models (SDMs). Reinforcement learning (RL) has substantially enhanced text- and vision-language models, while well-designed reward signals are crucial for the performance of RL. We consider RL a promising strategy to address the key challenge for SDMs. However, a fundamental barrier persists: prevail…

    Submitted 16 April, 2026; originally announced April 2026.

  3. arXiv:2604.14172

    cs.CL cs.AI

    Tug-of-War within A Decade: Conflict Resolution in Vulnerability Analysis via Teacher-Guided Retrieval-Augmented Generations

    Authors: Ziyin Zhou, Jianyi Zhang, Xu ji, Yilong Li, Jiameng Han, Zhangchi Zhao

    Abstract: Large Language Models (LLMs) are essential for analyzing and addressing vulnerabilities in cybersecurity. However, among the more than 200,000 vulnerabilities discovered in the past decade, more than 30,000 have been changed or updated. This necessitates frequent updates to the training datasets and internal knowledge bases of LLMs to maintain knowledge consistency. In this paper, we focus on the prob…

    Submitted 25 March, 2026; originally announced April 2026.

  4. arXiv:2604.13938

    cs.CV

    ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding

    Authors: Tianze Xia, Zijian Ning, Zonglin Zhao, Mingjia Wang

    Abstract: Subject-driven image generation has shown great success in creating personalized content, but its capabilities are largely confined to single subjects in common poses. Current approaches face a fundamental conflict when handling multiple subjects with complex, distinct actions: preserving individual identities while enforcing precise pose structures. This challenge often leads to identity fusion a…

    Submitted 15 April, 2026; originally announced April 2026.

  5. arXiv:2604.13804

    cs.LG

    Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning

    Authors: Dongjie Fu, Fangming Feng, Xize Cheng, Linjun Li, Zhou Zhao, Tao Jin

    Abstract: The rapid evolution of multimodal large models has revolutionized the simulation of diverse characters in speech dialogue systems, enabling a novel interactive paradigm. Character attributes are manifested not only in textual responses but also through vocal features, as speech conveys rich paralinguistic information that is challenging to quantify. This poses significant difficulties in evaluatin…

    Submitted 15 April, 2026; originally announced April 2026.

  6. arXiv:2604.13648

    cs.SE

    Figma2Code: Automating Multimodal Design to Code in the Wild

    Authors: Yi Gui, Jiawan Zhang, Yina Wang, Tianran Ma, Yao Wan, Shilin He, Dongping Chen, Zhou Zhao, Wenbin Jiang, Xuanhua Shi, Hai Jin, Philip S Yu

    Abstract: Front-end development constitutes a substantial portion of software engineering, yet converting design mockups into production-ready User Interface (UI) code remains tedious and costly. While recent work has explored automating this process with Multimodal Large Language Models (MLLMs), existing approaches typically rely solely on design images. As a result, they must infer complex UI details from…

    Submitted 15 April, 2026; originally announced April 2026.

    Comments: ICLR 2026

  7. arXiv:2604.13472

    cs.LG cs.AI cs.MA

    Bridging MARL to SARL: An Order-Independent Multi-Agent Transformer via Latent Consensus

    Authors: Zijian Zhao, Jing Gao, Sen Li

    Abstract: Cooperative multi-agent reinforcement learning (MARL) is widely used to address large joint observation and action spaces by decomposing a centralized control problem into multiple interacting agents. However, such decomposition often introduces additional challenges, including non-stationarity, unstable training, weak coordination, and limited theoretical guarantees. In this paper, we propose the…

    Submitted 15 April, 2026; originally announced April 2026.

  8. arXiv:2604.12332

    cs.IT math.CO

    Turán-Theoretic Bounds on Several Elementary Trapping Sets in LDPC Codes

    Authors: Ziyang Zhao, Haoran Xiong, Zicheng Ye, Guiying Yan

    Abstract: LDPC codes have attracted significant attention because of their superior performance close to the Shannon limit. Elementary trapping sets are the main cause of the error floor phenomenon in LDPC codes. We consider typical graphs related to trapping sets, including theta graphs, dumbbell graphs, and short cycles with chords. Based on the Turán numbers of $θ(2,2,2)$, $θ(1,3,3)$ and $D(4,4;0)$, we p…

    Submitted 14 April, 2026; originally announced April 2026.

  9. arXiv:2604.11950

    cs.SE cs.AI cs.CL cs.CR

    AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection

    Authors: Zijie Zhao, Chenyuan Yang, Weidong Wang, Yihan Yang, Ziqi Zhang, Lingming Zhang

    Abstract: While recent LLM-based agents can identify many candidate bugs in source code, their reports remain static hypotheses that require manual validation, limiting the practicality of automated bug detection. We frame this challenge as a test generation task: given a candidate report, synthesizing an executable proof-of-concept test, or simply a PoC - such as a script, command sequence, or crafted inpu…

    Submitted 13 April, 2026; originally announced April 2026.

  10. arXiv:2604.11627

    cs.CV

    POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

    Authors: Haicheng Wang, Yuan Liu, Yikun Liu, Zhemeng Yu, Zhongyin Zhao, Yangxiu You, Zilin Yu, Le Tian, Xiao Zhou, Jie Zhou, Weidi Xie, Yanfeng Wang

    Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in cross-modal understanding and generation. However, the rapid growth of visual token sequences--especially in long-video and streaming scenarios--poses a major challenge to their scalability and real-world deployment. Thus, we introduce POINTS-Long, a native dual-mode MLLM featuring dynamic visual token s…

    Submitted 13 April, 2026; originally announced April 2026.

  11. arXiv:2604.11594

    eess.AS cs.SD

    HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models

    Authors: Shuiyuan Wang, Zhixian Zhao, Hongfei Yue, Chengyou Wang, Shuai Wang, Hui Bu, Xin Xu, Lei Xie

    Abstract: Evaluating the emotional intelligence (EI) of audio language models (ALMs) is critical. However, existing benchmarks mostly rely on synthesized speech, are limited to single-turn interactions, and depend heavily on open-ended scoring. This paper proposes HumDial-EIBench, a comprehensive benchmark for evaluating ALMs' EI. Using real-recorded human dialogues from the ICASSP 2026 HumDial Challenge, i…

    Submitted 13 April, 2026; originally announced April 2026.

  12. arXiv:2604.11576

    cs.CV

    Finetune Like You Pretrain: Boosting Zero-shot Adversarial Robustness in Vision-language Models

    Authors: Songlong Xing, Weijie Wang, Zhengyu Zhao, Jindong Gu, Philip Torr, Nicu Sebe

    Abstract: Despite their impressive zero-shot abilities, vision-language models such as CLIP have been shown to be susceptible to adversarial attacks. To enhance CLIP's adversarial robustness, recent studies finetune its pretrained vision encoder with adversarial examples on a proxy dataset such as ImageNet by aligning adversarial images with correct class labels. However, these methods overlook the imp…

    Submitted 13 April, 2026; originally announced April 2026.

    Comments: Accepted to CVPR Findings Track 2026

  13. arXiv:2604.11411

    cs.CV

    Online Reasoning Video Object Segmentation

    Authors: Jinyuan Liu, Yang Wang, Zeyu Zhao, Weixin Li, Song Wang, Ruize Han

    Abstract: Reasoning video object segmentation predicts pixel-level masks in videos from natural-language queries that may involve implicit and temporally grounded references. However, existing methods are developed and evaluated in an offline regime, where the entire video is available at inference time and future frames can be exploited for retrospective disambiguation, deviating from real-world deployment…

    Submitted 13 April, 2026; originally announced April 2026.

  14. arXiv:2604.11064

    cs.LG cs.CV

    A Faster Path to Continual Learning

    Authors: Wei Li, Hangjie Yuan, Zixiang Zhao, Borui Kang, Ziwei Liu, Tao Feng

    Abstract: Continual Learning (CL) aims to train neural networks on a dynamic stream of tasks without forgetting previously learned knowledge. Among optimization-based approaches, C-Flat has emerged as a promising solution due to its plug-and-play nature and its ability to encourage uniformly low-loss regions for both new and old tasks. However, C-Flat requires three additional gradient computations per iter…

    Submitted 14 April, 2026; v1 submitted 13 April, 2026; originally announced April 2026.

    Comments: Update Author Affiliations

  15. arXiv:2604.10856

    cs.RO cs.AI

    BridgeSim: Unveiling the OL-CL Gap in End-to-End Autonomous Driving

    Authors: Seth Z. Zhao, Luobin Wang, Hongwei Ruan, Yuxin Bao, Yilan Chen, Ziyang Leng, Abhijit Ravichandran, Honglin He, Zewei Zhou, Xu Han, Abhishek Peri, Zhiyu Huang, Pranav Desai, Henrik Christensen, Jiaqi Ma, Bolei Zhou

    Abstract: An open-loop (OL) to closed-loop (CL) gap (OL-CL gap) exists when OL-pretrained policies scoring high in OL evaluations fail to transfer effectively in CL deployment. In this paper, we unveil the root causes of this systemic failure and propose a practical remedy. Specifically, we demonstrate that OL policies suffer from Observational Domain Shift and Objective Mismatch. We show that wh…

    Submitted 12 April, 2026; originally announced April 2026.

  16. arXiv:2604.10517

    cs.AI

    From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning

    Authors: Xiaoda Yang, Yuxiang Liu, Shenzhou Gao, Can Wang, Jingyang Xue, Lixin Yang, Yao Mu, Tao Jin, Shuicheng Yan, Zhimeng Zhang, Zhou Zhao

    Abstract: Modern vision-language models achieve strong performance in static perception, but remain limited in the complex spatiotemporal reasoning required for embodied, egocentric tasks. A major source of failure is their reliance on temporal priors learned from passive video data, which often leads to spatiotemporal hallucinations and poor generalization in dynamic environments. To address this, we prese…

    Submitted 12 April, 2026; originally announced April 2026.

  17. arXiv:2604.10506

    cs.AI

    A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning

    Authors: Xiaoda Yang, Shuai Yang, Can Wang, Jingyang Xue, Menglan Tang, Checheng Yu, Xunzhe Zhou, Sashuai Zhou, Tao Jin, Lixin Yang, Xiangyu Yue, Zhou Zhao

    Abstract: Vision-Language Models (VLMs) have made significant strides in static image understanding but continue to face critical hurdles in spatiotemporal reasoning. A major bottleneck is "multi-image reasoning hallucination", where a massive performance drop between forward and reverse temporal queries reveals a dependence on superficial shortcuts instead of genuine causal understanding. To mitigate this,…

    Submitted 12 April, 2026; originally announced April 2026.

  18. arXiv:2604.10164

    cs.AI

    Inductive Reasoning for Temporal Knowledge Graphs with Emerging Entities

    Authors: Ze Zhao, Yuhui He, Lyuwen Wu, Gu Tang, Bin Lu, Xiaoying Gan, Luoyi Fu, Xinbing Wang, Chenghu Zhou

    Abstract: Reasoning on Temporal Knowledge Graphs (TKGs) is essential for predicting future events and time-aware facts. While existing methods are effective at capturing relational dynamics, their performance is limited by a closed-world assumption, which fails to account for emerging entities not present during training. Notably, these entities continuously join the network without historical interactions.…

    Submitted 11 April, 2026; originally announced April 2026.

    Comments: 24 pages, accepted by ICLR 2026

  19. arXiv:2604.09415

    cs.CV cs.AI cs.LG cs.RO

    PhysInOne: Visual Physics Learning and Reasoning in One Suite

    Authors: Siyuan Zhou, Hejun Wang, Hu Cheng, Jinxi Li, Dongsheng Wang, Junwei Jiang, Yixiao Jin, Jiayue Huang, Shiwei Mao, Shangjia Liu, Yafei Yang, Hongkang Song, Shenxing Wei, Zihui Zhang, Peng Huang, Shijie Liu, Zhengli Hao, Hao Li, Yitian Li, Wenqi Zhou, Zhihan Zhao, Zongqi He, Hongtao Wen, Shouwang Huang, Peng Yun, et al. (14 additional authors not shown)

    Abstract: We present PhysInOne, a large-scale synthetic dataset addressing the critical scarcity of physically-grounded training data for AI systems. Unlike existing datasets limited to merely hundreds or thousands of examples, PhysInOne provides 2 million videos across 153,810 dynamic 3D scenes, covering 71 basic physical phenomena in mechanics, optics, fluid dynamics, and magnetism. Distinct from previous…

    Submitted 10 April, 2026; originally announced April 2026.

    Comments: CVPR 2026. Siyuan, Hejun, Hu, Jinxi, Dongsheng, Junwei, Yixiao, Jiayue, and Shiwei are co-first authors. Project page: https://vlar-group.github.io/PhysInOne.html

  20. arXiv:2604.09313

    eess.IV cs.CV

    Compositional-Degradation UAV Image Restoration: Conditional Decoupled MoE Network and A Benchmark

    Authors: Jinquan Yan, Zhicheng Zhao, Zhengzheng Tu, Chenglong Li, Jin Tang, Bin Luo

    Abstract: UAV images are critical for applications such as large-area mapping, infrastructure inspection, and emergency response. However, in real-world flight environments, a single image is often affected by multiple degradation factors, including rain, haze, and noise, undermining downstream task performance. Current unified restoration approaches typically rely on implicit degradation representations th…

    Submitted 10 April, 2026; originally announced April 2026.

  21. arXiv:2604.08046

    cs.CL

    Guaranteeing Knowledge Integration with Joint Decoding for Retrieval-Augmented Generation

    Authors: Zhengyi Zhao, Shubo Zhang, Zezhong Wang, Yuxi Zhang, Huimin Wang, Yutian Zhao, Yefeng Zheng, Binyang Li, Kam-Fai Wong, Xian Wu

    Abstract: Retrieval-Augmented Generation (RAG) significantly enhances Large Language Models (LLMs) by providing access to external knowledge. However, current research primarily focuses on retrieval quality, often overlooking the critical "integration bottleneck": even when relevant documents are retrieved, LLMs frequently fail to utilize them effectively due to conflicts with their internal parametric kn…

    Submitted 15 April, 2026; v1 submitted 9 April, 2026; originally announced April 2026.

  22. arXiv:2604.07958

    cs.CV

    ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks

    Authors: Jiayang Xu, Fan Zhuo, Majun Zhang, Changhao Pan, Zehan Wang, Siyu Chen, Xiaoda Yang, Tao Jin, Zhou Zhao

    Abstract: Current video editing models often rely on expensive paired video data, which limits their practical scalability. In essence, most video editing tasks can be formulated as a decoupled spatiotemporal process, where the temporal dynamics of the pretrained model are preserved while spatial content is selectively and precisely modified. Based on this insight, we propose ImVideoEdit, an efficient frame…

    Submitted 9 April, 2026; originally announced April 2026.

  23. arXiv:2604.07343

    cs.CL cs.LG

    Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

    Authors: Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou, Junshan Zhang, Zhe Zhao

    Abstract: Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values. While benchmarks for general response quality are prevalent, evaluating how well reward models account for individual user preferences remains an open challenge. To bridge this gap, we introduce Pers…

    Submitted 8 April, 2026; originally announced April 2026.

  24. arXiv:2604.07209

    cs.CV

    INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

    Authors: InSpatio Team, Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji, Hujun Bao, Hongjia Zhai, Jialin Liu, Jing Guo, Nan Wang, Siji Pan, Weihong Pan, Weijian Xie, Xianbin Liu, Xiaojun Xiang, Xiaoyu Zhang, Xinyu Chen, Yifu Wang, Yipeng Chen, Zhenzhou Fan, Zhewen Le, Zhichao Ye, Ziqiang Zhao

    Abstract: Building world models with spatial consistency and real-time interactivity remains a fundamental challenge in computer vision. Current video generation paradigms often struggle with a lack of spatial persistence and insufficient visual realism, making it difficult to support seamless navigation in complex environments. To address these challenges, we propose INSPATIO-WORLD, a novel real-time frame…

    Submitted 13 April, 2026; v1 submitted 8 April, 2026; originally announced April 2026.

  25. arXiv:2604.06798

    cs.LG cs.AI

    MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization

    Authors: Zhixiong Zhao, Zukang Xu, Zhixuan Chen, Dawei Yang

    Abstract: Mixture-of-Experts (MoE) based large language models (LLMs) offer strong performance but suffer from high memory and computation costs. Weight binarization provides extreme efficiency, yet existing binary methods designed for dense LLMs struggle with MoE-specific issues, including cross-expert redundancy, task-agnostic importance estimation, and quantization-induced routing shifts. To this end, we…

    Submitted 13 April, 2026; v1 submitted 8 April, 2026; originally announced April 2026.

    Comments: Accepted at ACL 2026 Findings

  26. arXiv:2604.06728

    cs.CV cs.AI cs.MM

    URMF: Uncertainty-aware Robust Multimodal Fusion for Multimodal Sarcasm Detection

    Authors: Zhenyu Wang, Weichen Cheng, Weijia Li, Junjie Mou, Zongyou Zhao, Guoying Zhang

    Abstract: Multimodal sarcasm detection (MSD) aims to identify sarcastic intent from semantic incongruity between text and image. Although recent methods have improved MSD through cross-modal interaction and incongruity reasoning, they often assume that all modalities are equally reliable. In real-world social media, however, textual content may be ambiguous and visual content may be weakly relevant or even…

    Submitted 8 April, 2026; originally announced April 2026.

  27. arXiv:2604.06176

    cs.IR cs.AI cs.CL

    Robustness Risk of Conversational Retrieval: Identifying and Mitigating Noise Sensitivity in Qwen3-Embedding Model

    Authors: Weishu Chen, Zhouhui Hou, Mingjie Zhan, Zhicheng Zhao, Fei Su

    Abstract: We present an empirical study of embedding-based retrieval under realistic conversational settings, where queries are short, dialogue-like, and weakly specified, and retrieval corpora contain structured conversational artifacts. Focusing on Qwen3-embedding models, we identify a deployment-relevant robustness vulnerability: under conversational retrieval without query prompting, structured dialogue…

    Submitted 3 February, 2026; originally announced April 2026.

  28. arXiv:2604.05965

    cs.AI

    Beyond Compromise: Pareto-Lenient Consensus for Efficient Multi-Preference LLM Alignment

    Authors: Renxuan Tan, Rongpeng Li, Zhifeng Zhao, Honggang Zhang

    Abstract: Transcending the single-preference paradigm, aligning LLMs with diverse human values is pivotal for robust deployment. Contemporary Multi-Objective Preference Alignment (MPA) approaches predominantly rely on static linear scalarization or rigid gradient projection to navigate these trade-offs. However, by enforcing strict conflict avoidance or simultaneous descent, these paradigms often prematurel…

    Submitted 7 April, 2026; originally announced April 2026.

  29. arXiv:2604.05587

    cs.AI math.OC

    ResearchEVO: An End-to-End Framework for Automated Scientific Discovery and Documentation

    Authors: Zhe Zhao, Haibin Wen, Jiaming Ma, Jiachang Zhan, Tianyi Xu, Ye Wei, Qingfu Zhang

    Abstract: An important recurring pattern in scientific breakthroughs is a two-stage process: an initial phase of undirected experimentation that yields an unexpected finding, followed by a retrospective phase that explains why the finding works and situates it within existing theory. We present ResearchEVO, an end-to-end framework that computationally instantiates this discover-then-explain paradigm. The Ev…

    Submitted 7 April, 2026; originally announced April 2026.

  30. arXiv:2604.05517

    cs.AI

    UniCreative: Unifying Long-form Logic and Short-form Sparkle via Reference-Free Reinforcement Learning

    Authors: Xiaolong Wei, Zerun Zhu, Simin Niu, Xingyu Zhang, Peiying Yu, Changxuan Xiao, Yuchen Li, Jicheng Yang, Zhejun Zhao, Chong Meng, Long Xia, Daiting Shi

    Abstract: A fundamental challenge in creative writing lies in reconciling the inherent tension between maintaining global coherence in long-form narratives and preserving local expressiveness in short-form texts. While long-context generation necessitates explicit macroscopic planning, short-form creativity often demands spontaneous, constraint-free expression. Existing alignment paradigms, however, typical…

    Submitted 7 April, 2026; originally announced April 2026.

    Comments: Accepted to Findings of ACL 2026

  31. arXiv:2604.04872

    cs.CL cs.LG

    Synthetic Sandbox for Training Machine Learning Engineering Agents

    Authors: Yuhang Zhou, Lizhu Zhang, Yifan Wu, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao, Hong Yan

    Abstract: As large language model agents advance beyond software engineering (SWE) tasks toward machine learning engineering (MLE), verifying agent behavior becomes orders of magnitude more expensive: while SWE tasks can be verified via fast-executing unit tests, MLE verification requires running full ML pipelines -- data preprocessing, model training, and metric evaluation -- on large datasets at each roll…

    Submitted 6 April, 2026; originally announced April 2026.

    Comments: 28 pages, 9 tables, 8 figures

  32. arXiv:2604.04771

    cs.CV cs.CL

    MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

    Authors: Bin Wang, Tianyao He, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Tao Chu, Yuan Qu, Zhenjiang Jin, Weijun Zeng, Ziyang Miao, Bangrui Xu, Junbo Niu, Mengzhang Cai, Jiantao Qiu, Qintong Zhang, Dongsheng Ma, Yuefeng Sun, Hejun Dong, Wenzheng Zhang, Jutao Xiao, Jiayong Shi, Pengyu Liao, Xiaomeng Zhao, Huaping Zhong, Liqun Wei, et al. (18 additional authors not shown)

    Abstract: Current document parsing methods advance primarily through model architecture innovation, while systematic engineering of training data remains underexplored. Yet state-of-the-art models spanning diverse architectures and parameter scales exhibit highly consistent failure patterns on the same set of hard samples, suggesting that the performance bottleneck stems from shared deficiencies in training…

    Submitted 9 April, 2026; v1 submitted 6 April, 2026; originally announced April 2026.

    Comments: Technical Report

  33. arXiv:2604.04387

    cs.AI cs.CY cs.ET cs.HC cs.LG

    Gradual Cognitive Externalization: From Modeling Cognition to Constituting It

    Authors: Zhimin Zhao

    Abstract: Developers are publishing AI agent skills that replicate a colleague's communication style, encode a supervisor's mentoring heuristics, or preserve a person's behavioral repertoire beyond biological death. To explain why, we propose Gradual Cognitive Externalization (GCE), a framework arguing that ambient AI systems, through sustained causal coupling with users, transition from modeling cognitive…

    Submitted 6 April, 2026; v1 submitted 5 April, 2026; originally announced April 2026.

  34. arXiv:2604.04264

    stat.ML cs.IT cs.LG eess.SP stat.AP

    Avoiding Non-Integrable Beliefs in Expectation Propagation

    Authors: Zilu Zhao, Jichao Chen, Dirk Slock

    Abstract: Expectation Propagation (EP) is a widely used iterative message-passing algorithm that decomposes a global inference problem into multiple local ones. It approximates marginal distributions as "beliefs" using intermediate functions called "messages". It has been shown that the stationary points of EP are the same as those of the corresponding constrained Bethe Free Energy (BFE) optimization problem. Theref…

    Submitted 5 April, 2026; originally announced April 2026.

  35. arXiv:2604.04177

    cs.CL

    Position: Logical Soundness is not a Reliable Criterion for Neurosymbolic Fact-Checking with LLMs

    Authors: Jason Chan, Robert Gaizauskas, Zhixue Zhao

    Abstract: As large language models (LLMs) are increasingly integrated into fact-checking pipelines, formal logic is often proposed as a rigorous means by which to mitigate bias, errors and hallucinations in these models' outputs. For example, some neurosymbolic systems verify claims by using LLMs to translate natural language into logical formulae and then checking whether the proposed claims are logically so…

    Submitted 5 April, 2026; originally announced April 2026.

    Comments: Preprint

  36. arXiv:2604.04135

    cs.CV

    NTIRE 2026 3D Restoration and Reconstruction in Real-world Adverse Conditions: RealX3D Challenge Results

    Authors: Shuhong Liu, Chenyu Bao, Ziteng Cui, Xuangeng Chu, Bin Ren, Lin Gu, Xiang Chen, Mingrui Li, Long Ma, Marcos V. Conde, Radu Timofte, Yun Liu, Ryo Umagami, Tomohiro Hashimoto, Zijian Hu, Yuan Gan, Tianhan Xu, Yusuke Kurose, Tatsuya Harada, Junwei Yuan, Gengjia Chang, Xining Ge, Mache You, Qida Cao, Zeliang Li, et al. (81 additional authors not shown)

    Abstract: This paper presents a comprehensive review of the NTIRE 2026 3D Restoration and Reconstruction (3DRR) Challenge, detailing the proposed methods and results. The challenge seeks to identify reconstruction pipelines that are robust under real-world adverse conditions, specifically extreme low-light and smoke-degraded environments, as captured by our RealX3D benchmark. A total of 279 participa…

    Submitted 5 April, 2026; originally announced April 2026.

  37. arXiv:2604.03586

    cs.CL

    MultiPress: A Multi-Agent Framework for Interpretable Multimodal News Classification

    Authors: Tailong Luo, Hao Li, Rong Fu, Xinyue Jiang, Huaxuan Ding, Yiduo Zhang, Zilin Zhao, Simon Fong, Guangyin Jin, Jianyuan Ni

    Abstract: With the growing prevalence of multimodal news content, effective news topic classification demands models capable of jointly understanding and reasoning over heterogeneous data such as text and images. Existing methods often process modalities independently or employ simplistic fusion strategies, limiting their ability to capture complex cross-modal interactions and leverage external knowledge. T…

    Submitted 4 April, 2026; originally announced April 2026.

    Comments: Accepted in International Joint Conference on Neural Networks (IJCNN) 2026

  38. arXiv:2604.02794

    cs.AI

    CharTool: Tool-Integrated Visual Reasoning for Chart Understanding

    Authors: Situo Zhang, Yifan Zhang, Zichen Zhu, Da Ma, Lei Pan, Danyang Zhang, Zihan Zhao, Lu Chen, Kai Yu

    Abstract: Charts are ubiquitous in scientific and financial literature for presenting structured data. However, chart reasoning remains challenging for multimodal large language models (MLLMs) due to the lack of high-quality training data, as well as the need for fine-grained visual grounding and precise numerical computation. To address these challenges, we first propose DuoChart, a scalable dual-source da…

    Submitted 3 April, 2026; originally announced April 2026.

  39. arXiv:2604.02752

    cs.CV

    Differentiable Stroke Planning with Dual Parameterization for Efficient and High-Fidelity Painting Creation

    Authors: Jinfan Liu, Wuze Zhang, Zhangli Hu, Zhehan Zhao, Ye Chen, Bingbing Ni

    Abstract: In stroke-based rendering, search methods often get trapped in local minima due to discrete stroke placement, while differentiable optimizers lack structural awareness and produce unstructured layouts. To bridge this gap, we propose a dual representation that couples discrete polylines with continuous Bézier control points via a bidirectional mapping mechanism. This enables collaborative optimizat…

    Submitted 3 April, 2026; originally announced April 2026.

  40. arXiv:2604.02381

    cs.NI cs.IT cs.MA

    Agentic AI-Empowered Wireless Agent Networks With Semantic-Aware Collaboration via ILAC

    Authors: Zhouxiang Zhao, Jiaxiang Wang, Zhaohui Yang, Kun Yang, Zhaoyang Zhang, Mingzhe Chen, Kaibin Huang

    Abstract: The rapid development of agentic artificial intelligence (AI) is driving future wireless networks to evolve from passive data pipes into intelligent collaborative ecosystems under the emerging paradigm of integrated learning and communication (ILAC). However, realizing efficient agentic collaboration faces challenges not only in handling semantic redundancy but also in the lack of an integrated me…

    Submitted 1 April, 2026; originally announced April 2026.

  41. arXiv:2604.02158

    cs.DC cs.LG cs.PF

    A Practical Two-Stage Framework for GPU Resource and Power Prediction in Heterogeneous HPC Systems

    Authors: Beste Oztop, Dhruva Kulkarni, Zhengji Zhao, Ayse Kivilcim Coskun, Kadidia Konate

    Abstract: Efficient utilization of GPU resources and power has become critical with the growing demand for GPUs in high-performance computing (HPC). In this paper, we analyze GPU utilization and GPU memory utilization, as well as the power consumption of the Vienna ab initio Simulation Package (VASP), using the Slurm workload manager historical logs and GPU performance metrics collected by NVIDIA's Data Cen…

    Submitted 2 April, 2026; originally announced April 2026.

    Comments: 9 pages, 6 figures

  42. arXiv:2604.01905  [pdf, ps, other]

    cs.CR cs.SE

    From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers

    Authors: Yiheng Huang, Zhijia Zhao, Bihuan Chen, Susheng Wu, Zhuotong Zhou, Yiheng Cao, Xin Hu, Xin Peng

    Abstract: The model context protocol (MCP) standardizes how LLMs connect to external tools and data sources, enabling faster integration but introducing new attack vectors. Despite the growing adoption of MCP, existing MCP security studies classify attacks by their observable effects, obscuring how attacks behave across different MCP server components and overlooking multi-component attack chains. Meanwhile…

    Submitted 2 April, 2026; originally announced April 2026.

  43. arXiv:2604.01676  [pdf, ps, other]

    cs.CV cs.AI cs.SE

    GPA: Learning GUI Process Automation from Demonstrations

    Authors: Zirui Zhao, Jun Hao Liew, Yan Yang, Wenzhuo Yang, Ziyang Luo, Doyen Sahoo, Silvio Savarese, Junnan Li

    Abstract: GUI Process Automation (GPA) is a lightweight but general vision-based Robotic Process Automation (RPA), which enables fast and stable process replay with only a single demo. Addressing the fragility of traditional RPA and the non-deterministic risks of current vision language model-based GUI agents, GPA introduces three core benefits: (1) Robustness via Sequential Monte Carlo-based localization t…

    Submitted 3 April, 2026; v1 submitted 2 April, 2026; originally announced April 2026.

  44. arXiv:2604.01479  [pdf, ps, other]

    cs.CV

    UniRecGen: Unifying Multi-View 3D Reconstruction and Generation

    Authors: Zhisheng Huang, Jiahao Chen, Cheng Lin, Chenyu Hu, Hanzhuo Huang, Zhengming Yu, Mengfei Li, Yuheng Liu, Zekai Gu, Zibo Zhao, Yuan Liu, Xin Li, Wenping Wang

    Abstract: Sparse-view 3D modeling represents a fundamental tension between reconstruction fidelity and generative plausibility. While feed-forward reconstruction excels in efficiency and input alignment, it often lacks the global priors needed for structural completeness. Conversely, diffusion-based generation provides rich geometric details but struggles with multi-view consistency. We present UniRecGen, a…

    Submitted 2 April, 2026; v1 submitted 1 April, 2026; originally announced April 2026.

  45. arXiv:2604.01128  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers

    Authors: Atsuyuki Miyai, Mashiro Toyooka, Zaiying Zhao, Kenta Watanabe, Toshihiko Yamasaki, Kiyoharu Aizawa

    Abstract: This paper introduces the first systematic evaluation framework for quantifying the quality and risks of papers written by modern coding agents. While AI-driven paper writing has become a growing concern, rigorous evaluation of the quality and potential risks of AI-written papers remains limited, and a unified understanding of their reliability is still lacking. We introduce Paper Reconstruction E…

    Submitted 1 April, 2026; originally announced April 2026.

    Comments: Project Page: https://agent4science-utokyo.github.io/PaperRecon_HP/

  46. arXiv:2604.00778  [pdf, ps, other]

    cs.CL

    From Early Encoding to Late Suppression: Interpreting LLMs on Character Counting Tasks

    Authors: Ayan Datta, Mounika Marreddy, Alexander Mehler, Zhixue Zhao, Radhika Mamidi

    Abstract: Large language models (LLMs) exhibit failures on elementary symbolic tasks such as character counting in a word, despite excelling on complex benchmarks. Although this limitation has been noted, the internal reasons remain unclear. We use character counting (e.g., "How many p's are in apple?") as a minimal, controlled probe that isolates token-level reasoning from higher-level confounds. Using thi…

    Submitted 1 April, 2026; originally announced April 2026.

  47. arXiv:2603.29931  [pdf, ps, other]

    cs.CV

    Gloria: Consistent Character Video Generation via Content Anchors

    Authors: Yuhang Yang, Fan Zhang, Huaijin Pi, Shuai Guo, Guowei Xu, Wei Zhai, Yang Cao, Zheng-Jun Zha

    Abstract: Digital characters are central to modern media, yet generating character videos with long-duration, consistent multi-view appearance and expressive identity remains challenging. Existing approaches either provide insufficient context to preserve identity or leverage non-character-centric information as the memory, leading to suboptimal consistency. Recognizing that character video generation inher…

    Submitted 31 March, 2026; originally announced March 2026.

    Comments: Accepted by CVPR2026 Main, project: https://yyvhang.github.io/Gloria_Page/

  48. arXiv:2603.29632  [pdf, ps, other]

    cs.MA cs.AI

    An Empirical Study of Multi-Agent Collaboration for Automated Research

    Authors: Yang Shen, Zhenyi Yi, Ziyi Zhao, Lijun Sun, Dongyang Li, Chin-Teng Lin, Yuhui Shi

    Abstract: As AI agents evolve, the community is rapidly shifting from single Large Language Models (LLMs) to Multi-Agent Systems (MAS) to overcome cognitive bottlenecks in automated research. However, the optimal multi-agent coordination framework for these autonomous agents remains largely unexplored. In this paper, we present a systematic empirical study investigating the comparative efficacy of distinct…

    Submitted 31 March, 2026; originally announced March 2026.

  49. arXiv:2603.29213  [pdf, ps, other]

    cs.RO

    Kilohertz-Safe: A Scalable Framework for Constrained Dexterous Retargeting

    Authors: Yinxiao Tian, Ziyi Yang, Zinan Zhao, Zhen Kan

    Abstract: Dexterous hand teleoperation requires motion re-targeting methods that simultaneously achieve high-frequency real-time performance and enforcement of heterogeneous kinematic and safety constraints. Existing nonlinear optimization-based approaches often incur prohibitive computational cost, limiting their applicability to kilohertz-level control, while learning-based methods typically lack formal s…

    Submitted 30 March, 2026; originally announced March 2026.

    Comments: 8 pages, 6 figures, under review

  50. arXiv:2603.27507  [pdf, ps, other]

    cs.CV

    Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM

    Authors: Haifeng Huang, Yilun Chen, Zehan Wang, Jiangmiao Pang, Zhou Zhao

    Abstract: Recent advancements in multi-modal large language models (MLLMs) have shown strong potential for 3D scene understanding. However, existing methods struggle with fine-grained object grounding and contextual reasoning, limiting their ability to interpret and interact with complex 3D environments. In this paper, we present Chat-Scene++, an MLLM framework that represents 3D scenes as context-rich obje…

    Submitted 29 March, 2026; originally announced March 2026.