
Showing 1–50 of 375 results for author: Du, Z

Searching in archive cs.
  1. arXiv:2512.15577 [pdf, ps, other]

    cs.CV

    MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors

    Authors: Zhipeng Du, Duolikun Danier, Jan Eric Lenssen, Hakan Bilen

    Abstract: In this paper, we focus on online zero-shot monocular 3D instance segmentation, a novel practical setting where existing approaches fail to perform because they rely on posed RGB-D sequences. To overcome this limitation, we leverage CUT3R, a recent Reconstructive Foundation Model (RFM), to provide reliable geometric priors from a single RGB stream. We propose MoonSeg3R, which introduces three key…

    Submitted 17 December, 2025; originally announced December 2025.

  2. arXiv:2512.13368 [pdf, ps, other]

    cs.IR

    BlossomRec: Block-level Fused Sparse Attention Mechanism for Sequential Recommendations

    Authors: Mengyang Ma, Xiaopeng Li, Wanyu Wang, Zhaocheng Du, Jingtong Gao, Pengyue Jia, Yuyang Ye, Yiqi Wang, Yunpeng Weng, Weihong Luo, Xiao Han, Xiangyu Zhao

    Abstract: Transformer structures have been widely used in sequential recommender systems (SRS). However, as user interaction histories increase, computational time and memory requirements also grow. This is mainly caused by the standard attention mechanism. Although there exist many methods employing efficient attention and SSM-based models, these approaches struggle to effectively model long sequences and…

    Submitted 15 December, 2025; originally announced December 2025.

  3. arXiv:2512.13300 [pdf, ps, other]

    cs.LG cs.AI cs.IR

    No One Left Behind: How to Exploit the Incomplete and Skewed Multi-Label Data for Conversion Rate Prediction

    Authors: Qinglin Jia, Zhaocheng Du, Chuhan Wu, Huifeng Guo, Ruiming Tang, Shuting Shi, Muyu Zhang

    Abstract: In most real-world online advertising systems, advertisers typically have diverse customer acquisition goals. A common solution is to use multi-task learning (MTL) to train a unified model on post-click data to estimate the conversion rate (CVR) for these diverse targets. In practice, CVR prediction often encounters missing conversion data as many advertisers submit only a subset of user conversio…

    Submitted 15 December, 2025; originally announced December 2025.

  4. arXiv:2512.10421 [pdf, ps, other]

    cs.CV

    Neural Collapse in Test-Time Adaptation

    Authors: Xiao Chen, Zhongjing Du, Jiazhen Huang, Xu Jiang, Li Lu, Jingyan Jiang, Zhi Wang

    Abstract: Test-Time Adaptation (TTA) enhances model robustness to out-of-distribution (OOD) data by updating the model online during inference, yet existing methods lack theoretical insights into the fundamental causes of performance degradation under domain shifts. Recently, Neural Collapse (NC) has been proposed as an emergent geometric property of deep neural networks (DNNs), providing valuable insights…

    Submitted 11 December, 2025; originally announced December 2025.

    Comments: 10 pages, 8 figures

  5. arXiv:2512.10349 [pdf, ps, other]

    cs.RO

    Design and Validation of an Under-actuated Robotic Finger with Synchronous Tendon Routing

    Authors: Quan Yuan, Zhenting Du, Daqian Cao, Weibang Bai

    Abstract: Tendon-driven under-actuated robotic fingers provide advantages for dexterous manipulation through reduced actuator requirements and simplified mechanical design. However, achieving both high load capacity and adaptive compliance in a compact form remains challenging. This paper presents an under-actuated tendon-driven robotic finger (UTRF) featuring a synchronous tendon routing that mechanically…

    Submitted 11 December, 2025; originally announced December 2025.

    Comments: 7 pages and 11 figures

  6. arXiv:2512.07951 [pdf, ps, other]

    cs.CV

    Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality

    Authors: Zekai Luo, Zongze Du, Zhouhang Zhu, Hao Zhong, Muzhi Zhu, Wen Wang, Yuling Xi, Chenchen Jing, Hao Chen, Chunhua Shen

    Abstract: Video face swapping is crucial in film and entertainment production, where achieving high fidelity and temporal consistency over long and complex video sequences remains a significant challenge. Inspired by recent advances in reference-guided image editing, we explore whether rich visual attributes from source videos can be similarly leveraged to enhance both fidelity and temporal coherence in vid…

    Submitted 8 December, 2025; originally announced December 2025.

    Comments: Project webpage: https://aim-uofa.github.io/LivingSwap

  7. arXiv:2512.06963 [pdf, ps, other]

    cs.RO cs.AI cs.CV

    VideoVLA: Video Generators Can Be Generalizable Robot Manipulators

    Authors: Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, Baining Guo

    Abstract: Generalization in robot manipulation is essential for deploying robots in open-world environments and advancing toward artificial general intelligence. While recent Vision-Language-Action (VLA) models leverage large pre-trained understanding models for perception and instruction following, their ability to generalize to novel tasks, objects, and settings remains limited. In this work, we present V…

    Submitted 7 December, 2025; originally announced December 2025.

    Comments: Project page: https://videovla-nips2025.github.io

    Journal ref: The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)

  8. arXiv:2512.05693 [pdf, ps, other]

    cs.RO cs.AI

    HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies

    Authors: Zhiying Du, Bei Liu, Yaobo Liang, Yichao Shen, Haidong Cao, Xiangyu Zheng, Zhiyuan Feng, Zuxuan Wu, Jiaolong Yang, Yu-Gang Jiang

    Abstract: The development of foundation models for embodied intelligence critically depends on access to large-scale, high-quality robot demonstration data. Recent approaches have sought to address this challenge by training on large collections of heterogeneous robotic datasets. However, unlike vision or language data, robotic demonstrations exhibit substantial heterogeneity across embodiments and action s…

    Submitted 5 December, 2025; originally announced December 2025.

  9. arXiv:2512.04552 [pdf, ps, other]

    cs.SD cs.AI eess.AS

    RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS

    Authors: Cong Wang, Changfeng Gao, Yang Xiang, Zhihao Du, Keyu An, Han Zhao, Qian Chen, Xiangang Li, Yingming Gao, Ya Li

    Abstract: Differentiable reinforcement learning (RL) frameworks like DiffRO offer a powerful approach for controllable text-to-speech (TTS), but are vulnerable to reward hacking, particularly for nuanced tasks like emotion control. The policy model can exploit a vanilla Reward Model (RM) by generating acoustic artifacts to achieve spurious rewards, but at the cost of degrading perceptual quality. To address…

    Submitted 4 December, 2025; originally announced December 2025.

    Comments: Submitted to ICASSP 2026. Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  10. arXiv:2512.02421 [pdf, ps, other]

    cs.CV

    Generalizing Vision-Language Models with Dedicated Prompt Guidance

    Authors: Xinyao Li, Yinjie Min, Hongbo Chen, Zhekai Du, Fengling Li, Jingjing Li

    Abstract: Fine-tuning large pretrained vision-language models (VLMs) has emerged as a prevalent paradigm for downstream adaptation, yet it faces a critical trade-off between domain specificity and domain generalization (DG) ability. Current methods typically fine-tune a universal model on the entire dataset, which potentially compromises the ability to generalize to unseen domains. To fill this gap, we prov…

    Submitted 2 December, 2025; originally announced December 2025.

    Comments: Accepted to AAAI 2026

  11. arXiv:2512.00907 [pdf, ps, other]

    cs.RO

    Magnetic Tactile-Driven Soft Actuator for Intelligent Grasping and Firmness Evaluation

    Authors: Chengjin Du, Federico Bernabei, Zhengyin Du, Sergio Decherchi, Matteo Lo Preti, Lucia Beccai

    Abstract: Soft robots are powerful tools for manipulating delicate objects, yet their adoption is hindered by two gaps: the lack of integrated tactile sensing and sensor signal distortion caused by actuator deformations. This paper addresses these challenges by introducing the SoftMag actuator: a magnetic tactile-sensorized soft actuator. Unlike systems relying on attached sensors or treating sensing and ac…

    Submitted 2 December, 2025; v1 submitted 30 November, 2025; originally announced December 2025.

    Comments: 25 pages, 24 figures

  12. arXiv:2511.21054 [pdf, ps, other]

    cs.LG

    Efficient Diffusion Planning with Temporal Diffusion

    Authors: Jiaming Guo, Rui Zhang, Zerun Li, Yunkai Gao, Shaohui Peng, Siming Lan, Xing Hu, Zidong Du, Xishan Zhang, Ling Li

    Abstract: Diffusion planning is a promising method for learning high-performance policies from offline data. To avoid the impact of discrepancies between planning and reality on performance, previous works generate new plans at each time step. However, this incurs significant computational overhead and leads to lower decision frequencies, and frequent plan switching may also affect performance. In contrast,…

    Submitted 25 November, 2025; originally announced November 2025.

    Comments: Accepted by the AAAI 2026 Conference Main Track

  13. arXiv:2511.20099 [pdf, ps, other]

    cs.LG cs.AR cs.PL

    QiMeng-CRUX: Narrowing the Gap between Natural Language and Verilog via Core Refined Understanding eXpression

    Authors: Lei Huang, Rui Zhang, Jiaming Guo, Yang Zhang, Di Huang, Shuyao Cheng, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Qi Guo, Yunji Chen

    Abstract: Large language models (LLMs) have shown promising capabilities in hardware description language (HDL) generation. However, existing approaches often rely on free-form natural language descriptions that are ambiguous, redundant, and unstructured, which poses significant challenges for downstream Verilog code generation. We treat hardware code generation as a complex transformation from an ope…

    Submitted 26 November, 2025; v1 submitted 25 November, 2025; originally announced November 2025.

    Comments: Accepted by the AAAI 2026 Conference Main Track

  14. arXiv:2511.20056 [pdf, ps, other]

    cs.CL

    Online-PVLM: Advancing Personalized VLMs with Online Concept Learning

    Authors: Huiyu Bai, Runze Wang, Zhuoyun Du, Yiyang Zhao, Fengji Zhang, Haoyu Chen, Xiaoyong Zhu, Bo Zheng, Xuejiao Zhao

    Abstract: Personalized Visual Language Models (VLMs) are gaining increasing attention for their formidable ability in user-specific concept-aligned interactions (e.g., identifying a user's bike). Existing methods typically require the learning of separate embeddings for each new concept, which fails to support real-time adaptation during testing. This limitation becomes particularly pronounced in large-sca…

    Submitted 18 December, 2025; v1 submitted 25 November, 2025; originally announced November 2025.

    Comments: Work in Progress

  15. arXiv:2511.17001 [pdf, ps, other]

    cs.RO

    Stable Offline Hand-Eye Calibration for any Robot with Just One Mark

    Authors: Sicheng Xie, Lingchen Meng, Zhiying Du, Shuyuan Tu, Haidong Cao, Jiaqi Leng, Zuxuan Wu, Yu-Gang Jiang

    Abstract: Imitation learning has achieved remarkable success in a variety of robotic tasks by learning a mapping function from camera-space observations to robot-space actions. Recent work indicates that the use of robot-to-camera transformation information (i.e., camera extrinsics) benefits the learning process and produces better results. However, camera extrinsics are oftentimes unavailable and estimati…

    Submitted 21 November, 2025; originally announced November 2025.

  16. arXiv:2511.11438 [pdf, ps, other]

    cs.CV

    VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models

    Authors: Mingjie Xu, Jinpeng Chen, Yuzhi Zhao, Jason Chun Lok Li, Yue Qiu, Zekang Du, Mengyang Wu, Pingping Zhang, Kun Li, Hongzheng Yang, Wenao Ma, Jiaheng Wei, Qinbin Li, Kangcheng Liu, Wenqiang Lei

    Abstract: Multimodal large language models (MLLMs) have enabled a wide range of advanced vision-language applications, including fine-grained object recognition and contextual understanding. When querying specific regions or objects in an image, human users naturally use "visual prompts" (VPs), such as bounding boxes, to provide reference. However, no existing benchmark systematically evaluates the ability…

    Submitted 14 November, 2025; originally announced November 2025.

    Comments: This is the extended version of the paper accepted at AAAI 2026, which includes all technical appendices and additional experimental details

  17. arXiv:2511.09149 [pdf, ps, other]

    cs.LG cs.AI cs.MA

    Enabling Agents to Communicate Entirely in Latent Space

    Authors: Zhuoyun Du, Runze Wang, Huiyu Bai, Zouying Cao, Xiaoyong Zhu, Bo Zheng, Wei Chen, Haochao Ying

    Abstract: While natural language is the de facto communication medium for LLM-based agents, it presents a fundamental constraint. The process of downsampling rich, internal latent states into discrete tokens inherently limits the depth and nuance of information that can be transmitted, thereby hindering collaborative problem-solving. Inspired by human mind-reading, we propose Interlat (Inter-agent Latent Sp…

    Submitted 12 November, 2025; originally announced November 2025.

    Comments: Work in progress

  18. arXiv:2511.05722 [pdf, ps, other]

    cs.CL cs.AI

    OckBench: Measuring the Efficiency of LLM Reasoning

    Authors: Zheng Du, Hao Kang, Song Han, Tushar Krishna, Ligeng Zhu

    Abstract: Large language models such as GPT-4, Claude 3, and the Gemini series have improved automated reasoning and code generation. However, existing benchmarks mainly focus on accuracy and output quality, and they ignore an important factor: decoding token efficiency. In real systems, generating 10,000 tokens versus 100,000 tokens leads to large differences in latency, cost, and energy. In this work, we…

    Submitted 7 November, 2025; originally announced November 2025.

  19. arXiv:2511.00256 [pdf, ps, other]

    eess.AS cs.LG cs.SD

    NaturalVoices: A Large-Scale, Spontaneous and Emotional Podcast Dataset for Voice Conversion

    Authors: Zongyang Du, Shreeram Suresh Chandra, Ismail Rasim Ulgen, Aurosweta Mahapatra, Ali N. Salman, Carlos Busso, Berrak Sisman

    Abstract: Everyday speech conveys far more than words: it reflects who we are, how we feel, and the circumstances surrounding our interactions. Yet, most existing speech datasets are acted, limited in scale, and fail to capture the expressive richness of real-life communication. With the rise of large neural networks, several large-scale speech corpora have emerged and been widely adopted across various spe…

    Submitted 31 October, 2025; originally announced November 2025.

    Comments: Under review for IEEE Transactions on Affective Computing

  20. arXiv:2510.24320 [pdf, ps, other]

    cs.CL cs.AI

    Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning

    Authors: Zhiheng Xi, Jixuan Huang, Xin Guo, Boyang Hong, Dingwen Yang, Xiaoran Fan, Shuo Li, Zehui Chen, Junjie Ye, Siyu Yuan, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang

    Abstract: Training critiquing language models to assess and provide feedback on model outputs is a promising way to improve LLMs for complex reasoning tasks. However, existing approaches typically rely on stronger supervisors for annotating critique data. To address this, we propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. Our approach operat…

    Submitted 28 October, 2025; originally announced October 2025.

    Comments: Preprint, 25 pages, 9 figures. Code: https://github.com/WooooDyy/Critique-RL

  21. arXiv:2510.19944 [pdf, ps, other]

    eess.IV cs.CV

    Seed3D 1.0: From Images to High-Fidelity Simulation-Ready 3D Assets

    Authors: Jiashi Feng, Xiu Li, Jing Lin, Jiahang Liu, Gaohong Liu, Weiqiang Lou, Su Ma, Guang Shi, Qinlong Wang, Jun Wang, Zhongcong Xu, Xuanyu Yi, Zihao Yu, Jianfeng Zhang, Yifan Zhu, Rui Chen, Jinxin Chi, Zixian Du, Li Han, Lixin Huang, Kaihua Jiang, Yuhan Li, Guan Luo, Shuguang Wang, Qianyi Wu , et al. (3 additional authors not shown)

    Abstract: Developing embodied AI agents requires scalable training environments that balance content diversity with physics accuracy. World simulators provide such environments but face distinct limitations: video-based methods generate diverse content but lack real-time physics feedback for interactive learning, while physics-based engines provide accurate dynamics but face scalability limitations from cos…

    Submitted 22 October, 2025; originally announced October 2025.

    Comments: Seed3D 1.0 Technical Report; Official Page on https://seed.bytedance.com/seed3d

  22. arXiv:2510.19400 [pdf, ps, other]

    cs.CV

    Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes

    Authors: Zhiyuan Feng, Zhaolu Kang, Qijie Wang, Zhiying Du, Jiongrui Yan, Shubin Shi, Chengbo Yuan, Huizhi Liang, Yu Deng, Qixiu Li, Rushuai Yang, Arctanx An, Leqi Zheng, Weijie Wang, Shawn Chen, Sicheng Xu, Yaobo Liang, Jiaolong Yang, Baining Guo

    Abstract: Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. They also serve as the foundation for the recent Vision-Language-Action (VLA) models. Yet most evaluations of VLMs focus on single-view settings, leaving their ability to integrate multi-view information underexplored. At the same time, multi-camera setups are increasin…

    Submitted 22 October, 2025; originally announced October 2025.

    Comments: The project and benchmark are publicly available at https://github.com/microsoft/MV-RoboBench

  23. arXiv:2510.19296 [pdf, ps, other]

    cs.LG cs.AR cs.PL

    QiMeng-SALV: Signal-Aware Learning for Verilog Code Generation

    Authors: Yang Zhang, Rui Zhang, Jiaming Guo, Lei Huang, Di Huang, Yunpu Zhao, Shuyao Cheng, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Qi Guo, Yunji Chen

    Abstract: The remarkable progress of Large Language Models (LLMs) presents promising opportunities for Verilog code generation, which is significantly important for automated circuit design. The lack of meaningful functional rewards hinders preference optimization based on Reinforcement Learning (RL) for producing functionally correct Verilog code. In this paper, we propose Signal-Aware Learning for V…

    Submitted 8 December, 2025; v1 submitted 22 October, 2025; originally announced October 2025.

    Comments: Accepted to NeurIPS 2025

  24. arXiv:2510.17950 [pdf, ps, other]

    cs.RO

    RoboChallenge: Large-scale Real-robot Evaluation of Embodied Policies

    Authors: Adina Yakefu, Bin Xie, Chongyang Xu, Enwen Zhang, Erjin Zhou, Fan Jia, Haitao Yang, Haoqiang Fan, Haowei Zhang, Hongyang Peng, Jing Tan, Junwen Huang, Kai Liu, Kaixin Liu, Kefan Gu, Qinglun Zhang, Ruitao Zhang, Saike Huang, Shen Cheng, Shuaicheng Liu, Tiancai Wang, Tiezhen Wang, Wei Sun, Wenbin Tang, Yajun Wei , et al. (12 additional authors not shown)

    Abstract: Testing on real machines is indispensable for robotic control algorithms. In the context of learning-based algorithms, especially VLA models, demand for large-scale evaluation, i.e., testing a large number of models on a large number of tasks, is becoming increasingly urgent. However, doing this right is highly non-trivial, especially when scalability and reproducibility are taken into account. In t…

    Submitted 20 October, 2025; originally announced October 2025.

    Comments: Authors are listed in alphabetical order. The official website is located at https://robochallenge.ai

  25. Knowledge-Decoupled Functionally Invariant Path with Synthetic Personal Data for Personalized ASR

    Authors: Yue Gu, Zhihao Du, Ying Shi, Jiqing Han, Yongjun He

    Abstract: Fine-tuning generic ASR models with large-scale synthetic personal data can enhance the personalization of ASR models, but it introduces challenges in adapting to synthetic personal data without forgetting real knowledge, and in adapting to personal data without forgetting generic knowledge. Considering that the functionally invariant path (FIP) framework enables model adaptation while preserving…

    Submitted 11 October, 2025; originally announced October 2025.

    Comments: Accepted for publication in IEEE Signal Processing Letters, 2025

  26. arXiv:2510.09016 [pdf, ps, other]

    cs.SD cs.AI eess.AS

    DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment

    Authors: Zongcai Du, Guilin Deng, Xiaofeng Guo, Xin Gao, Linke Li, Kaichang Cheng, Fubo Han, Siyu Yang, Peng Liu, Pan Zhong, Qiang Fu

    Abstract: Recent progress in diffusion-based Singing Voice Synthesis (SVS) demonstrates strong expressiveness but remains limited by data scarcity and model scalability. We introduce a two-stage pipeline: a compact seed set of human-sung recordings is constructed by pairing fixed melodies with diverse LLM-generated lyrics, and melody-specific models are trained to synthesize over 500 hours of high-quality C…

    Submitted 10 October, 2025; originally announced October 2025.

    Comments: under review

  27. arXiv:2510.03896 [pdf, ps, other]

    cs.CV cs.RO

    Bridge Thinking and Acting: Unleashing Physical Potential of VLM with Generalizable Action Expert

    Authors: Mingyu Liu, Zheng Huang, Xiaoyi Lin, Muzhi Zhu, Canyu Zhao, Zongze Du, Yating Wang, Haoyi Zhu, Hao Chen, Chunhua Shen

    Abstract: Although Vision-Language Models (VLMs) have demonstrated impressive planning and reasoning capabilities, translating these abilities into the physical world introduces significant challenges. Conventional Vision-Language-Action (VLA) models, which integrate reasoning and action into a monolithic architecture, generalize poorly because they are constrained by scarce, narrow-domain data. While recent…

    Submitted 4 October, 2025; originally announced October 2025.

  28. arXiv:2510.03895 [pdf, ps, other]

    cs.RO cs.CV

    NoTVLA: Narrowing of Dense Action Trajectories for Generalizable Robot Manipulation

    Authors: Zheng Huang, Mingyu Liu, Xiaoyi Lin, Muzhi Zhu, Canyu Zhao, Zongze Du, Xiaoman Li, Yiduo Jia, Hao Zhong, Hao Chen, Chunhua Shen

    Abstract: Vision-Language-Action (VLA) models represent a pivotal advance in embodied intelligence, yet they confront critical barriers to real-world deployment, most notably catastrophic forgetting. This issue stems from their overreliance on continuous action sequences or action chunks, which inadvertently create isolated data silos that disrupt knowledge retention across tasks. To tackle these challenges…

    Submitted 4 October, 2025; originally announced October 2025.

  29. arXiv:2510.03038 [pdf, ps, other]

    cs.LG cs.AI cs.IR

    CHORD: Customizing Hybrid-precision On-device Model for Sequential Recommendation with Device-cloud Collaboration

    Authors: Tianqi Liu, Kairui Fu, Shengyu Zhang, Wenyan Fan, Zhaocheng Du, Jieming Zhu, Fan Wu, Fei Wu

    Abstract: With the advancement of mobile device capabilities, deploying reranking models directly on devices has become feasible, enabling real-time contextual recommendations. When migrating models from cloud to devices, resource heterogeneity inevitably necessitates model compression. Recent quantization methods show promise for efficient deployment, yet they overlook device-specific user interests, resul…

    Submitted 3 October, 2025; originally announced October 2025.

    Comments: accepted by ACM MM'25

  30. arXiv:2509.21896 [pdf, ps, other]

    cs.AI

    GenesisGeo: Technical Report

    Authors: Minfeng Zhu, Zi Wang, Sizhe Ji, Zhengtong Du, Junming Ke, Xiao Deng, Zanlang Yin, Xiuqi Huang, Heyu Wang, Wei Chen

    Abstract: We present GenesisGeo, an automated theorem prover in Euclidean geometry. We have open-sourced a large-scale geometry dataset of 21.8 million geometric problems, over 3 million of which contain auxiliary constructions. Specifically, we significantly accelerate the symbolic deduction engine DDARN by 120x through theorem matching, combined with a C++ implementation of its core components. Furthermore,…

    Submitted 26 September, 2025; originally announced September 2025.

  31. arXiv:2509.21365 [pdf]

    cs.CV cs.AI

    MAJORScore: A Novel Metric for Evaluating Multimodal Relevance via Joint Representation

    Authors: Zhicheng Du, Qingyang Shi, Jiasheng Lu, Yingshan Liang, Xinyu Zhang, Yiran Wang, Peiwu Qin

    Abstract: The multimodal relevance metric is usually borrowed from the embedding ability of pretrained contrastive learning models for bimodal data (e.g., CLIP), which is used to evaluate the correlation between cross-modal data. However, the commonly used evaluation metrics are only suitable for the associated analysis between two modalities, which greatly limits the evaluation of multimodal similarity. He…

    Submitted 22 September, 2025; originally announced September 2025.

  32. arXiv:2509.20485 [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Objective Evaluation of Prosody and Intelligibility in Speech Synthesis via Conditional Prediction of Discrete Tokens

    Authors: Ismail Rasim Ulgen, Zongyang Du, Junchen Lu, Philipp Koehn, Berrak Sisman

    Abstract: Objective evaluation of synthesized speech is critical for advancing speech generation systems, yet existing metrics for intelligibility and prosody remain limited in scope and weakly correlated with human perception. Word Error Rate (WER) provides only a coarse text-based measure of intelligibility, while F0-RMSE and related pitch-based metrics offer a narrow, reference-dependent view of prosody.…

    Submitted 24 September, 2025; originally announced September 2025.

    Comments: Under review for IEEE OJSP

  33. arXiv:2509.19852 [pdf, ps, other]

    cs.SD cs.AI

    Eliminating Stability Hallucinations in LLM-based TTS Models via Attention Guidance

    Authors: Shiming Wang, Zhihao Du, Yang Xiang, Tianyu Zhao, Han Zhao, Qian Chen, Xiangang Li, Hanjie Guo, Zhenhua Ling

    Abstract: This paper focuses on resolving stability hallucinations (e.g., repetitive or omitted speech) in LLM-based Text-to-Speech (TTS) models by improving and leveraging the attention mechanism. First, we analyzed the alignment mechanism between text tokens and speech tokens in LLMs. We then proposed a metric termed the Optimal Alignment Score (OAS), which employs the Viterbi algorithm to evaluate text-s…

    Submitted 24 September, 2025; originally announced September 2025.

    Comments: 5 pages, submitted to ICASSP2026

  34. Frequency-domain Multi-modal Fusion for Language-guided Medical Image Segmentation

    Authors: Bo Yu, Jianhua Yang, Zetao Du, Yan Huang, Chenglong Li, Liang Wang

    Abstract: Automatically segmenting infected areas in radiological images is essential for diagnosing pulmonary infectious diseases. Recent studies have demonstrated that the accuracy of medical image segmentation can be improved by incorporating clinical text reports as semantic guidance. However, the complex morphological changes of lesions and the inherent semantic gap between vision-language modaliti…

    Submitted 23 September, 2025; originally announced September 2025.

    Comments: Accepted by MICCAI 2025

  35. arXiv:2509.19077 [pdf, ps, other]

    cs.AI

    Code Driven Planning with Domain-Adaptive Critic

    Authors: Zikang Tian, Shaohui Peng, Du Huang, Jiaming Guo, Ruizhi Chen, Rui Zhang, Xishan Zhang, Yuxuan Guo, Zidong Du, Qi Guo, Ling Li, Yewen Pu, Xing Hu, Yunji Chen

    Abstract: Large Language Models (LLMs) have been widely adopted as task planners for AI agents in sequential decision-making problems, leveraging their extensive world knowledge. However, the gap between their general knowledge and environment-specific requirements often leads to inaccurate plans. To address this, existing approaches rely on frequent LLM queries to iteratively refine plans based on immediat…

    Submitted 23 September, 2025; originally announced September 2025.

  36. arXiv:2509.18569 [pdf, ps, other]

    cs.SD cs.AI eess.AS

    Explore the Reinforcement Learning for the LLM based ASR and TTS system

    Authors: Changfeng Gao, Yabin Li, Keyu An, Zhifu Gao, Zhihao Du, Han Zhao, Xiangang Li

    Abstract: In recent years, large language models (LLMs) have played an important role in automatic speech recognition (ASR) and text-to-speech (TTS) systems. While reinforcement learning (RL) has significantly enhanced LLM performance in text-based tasks, its application to ASR and TTS remains underexplored due to the complexity of training audio-based models. In this study, we propose a lightweight RL fram…

    Submitted 22 September, 2025; originally announced September 2025.

  37. arXiv:2509.16293 [pdf, ps, other]

    cs.LG cs.AI cs.DC

    Robust LLM Training Infrastructure at ByteDance

    Authors: Borui Wan, Gaohong Liu, Zuquan Song, Jun Wang, Yun Zhang, Guangming Sheng, Shuguang Wang, Houmin Wei, Chenyuan Wang, Weiqiang Lou, Xi Yang, Mofan Zhang, Kaihua Jiang, Cheng Ren, Xiaoyun Zhi, Menghan Yu, Zhe Nan, Zhuolin Zheng, Baoquan Zhong, Qinlong Wang, Huan Yu, Jinxin Chi, Wang Zhang, Yuhan Li, Zixian Du , et al. (10 additional authors not shown)

    Abstract: The training scale of large language models (LLMs) has reached tens of thousands of GPUs and is still continuously expanding, enabling faster learning of larger models. Accompanying the expansion of the resource scale is the prevalence of failures (CUDA error, NaN values, job hang, etc.), which poses significant challenges to training stability. Any large-scale LLM training infrastructure should s…

    Submitted 20 October, 2025; v1 submitted 19 September, 2025; originally announced September 2025.

  38. arXiv:2509.15940 [pdf, ps, other]

    cs.DC

    Efficient Pre-Training of LLMs via Topology-Aware Communication Alignment on More Than 9600 GPUs

    Authors: Guoliang He, Youhe Jiang, Wencong Xiao, Kaihua Jiang, Shuguang Wang, Jun Wang, Zixian Du, Zhuo Jiang, Xinlei Zhang, Binhang Yuan, Eiko Yoneki

    Abstract: The scaling law for large language models (LLMs) depicts that the path towards machine intelligence necessitates training at large scale. Thus, companies continuously build large-scale GPU clusters, and launch training jobs that span over thousands of computing nodes. However, LLM pre-training presents unique challenges due to its complex communication patterns, where GPUs exchange data in sparse…

    Submitted 19 September, 2025; originally announced September 2025.

    Comments: NeurIPS 2025

  39. arXiv:2509.14142 [pdf, ps, other]

    cs.CV

    MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook

    Authors: Peng Xu, Shengwu Xiong, Jiajun Zhang, Yaxiong Chen, Bowen Zhou, Chen Change Loy, David A. Clifton, Kyoung Mu Lee, Luc Van Gool, Ruiming He, Ruilin Yao, Xinwei Long, Jirui Huang, Kai Tian, Sa Yang, Yihua Shao, Jin Feng, Yue Zhong, Jiakai Zhou, Cheng Tang, Tianyu Zou, Yifang Zhang, Junming Liang, Guoyou Li, Zhaoxiang Wang , et al. (103 additional authors not shown)

    Abstract: This paper reviews the MARS2 2025 Challenge on Multimodal Reasoning. We aim to bring together different approaches in multimodal machine learning and LLMs via a large benchmark. We hope it better allows researchers to follow the state-of-the-art in this very dynamic area. Meanwhile, a growing number of testbeds have boosted the evolution of general-purpose large language models. Thus, this year's…

    Submitted 17 September, 2025; originally announced September 2025.

    Comments: ICCV 2025 MARS2 Workshop and Challenge "Multimodal Reasoning and Slow Thinking in the Large Model Era: Towards System 2 and Beyond"

  40. arXiv:2509.12508 [pdf, ps, other]

    cs.CL cs.AI cs.SD eess.AS

    Fun-ASR Technical Report

    Authors: Keyu An, Yanni Chen, Zhigao Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Bo Gong, Xiangang Li, Yabin Li, Ying Liu, Xiang Lv, Yunjie Ji, Yiheng Jiang, Bin Ma, Haoneng Luo, Chongjia Ni, Zexu Pan, Yiping Peng, Zhendong Peng, Peiyao Wang, Hao Wang, Haoxu Wang, Wen Wang, Wupeng Wang , et al. (13 additional authors not shown)

    Abstract: In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present Fun-ASR, a large-scale, LLM…

    Submitted 19 December, 2025; v1 submitted 15 September, 2025; originally announced September 2025.

    Comments: Authors are listed in alphabetical order. Work in progress

  41. arXiv:2509.08863 [pdf]

    cs.SE

    GeoJSON Agents: A Multi-Agent LLM Architecture for Geospatial Analysis-Function Calling vs Code Generation

    Authors: Qianqian Luo, Qingming Lin, Liuchang Xu, Sensen Wu, Ruichen Mao, Chao Wang, Hailin Feng, Bo Huang, Zhenhong Du

    Abstract: Large Language Models (LLMs) have demonstrated substantial progress in task automation and natural language understanding. However, without domain expertise in geographic information science (GIS), they continue to encounter limitations including reduced accuracy and unstable performance when processing complex tasks. To address these challenges, we propose GeoJSON Agents-a novel multi-agent LLM a…

    Submitted 3 December, 2025; v1 submitted 9 September, 2025; originally announced September 2025.

  42. arXiv:2509.08755 [pdf, ps, other]

    cs.LG cs.AI cs.CL

    AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning

    Authors: Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye, Jiazheng Zhang, Wenxiang Chen, Wei He, Yiwen Ding, Guanyu Li, Zehui Chen, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Tao Gui, Zuxuan Wu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang

    Abstract: Developing autonomous LLM agents capable of making a series of intelligent decisions to solve complex, real-world tasks is a fast-evolving frontier. Like human cognitive development, agents are expected to acquire knowledge and skills through exploration and interaction with the environment. Despite advances, the community still lacks a unified, interactive reinforcement learning (RL) framework th…

    Submitted 10 September, 2025; originally announced September 2025.

    Comments: preprint, 39 pages, 16 figures. Project: https://AgentGym-RL.github.io/. Framework and Code: https://github.com/woooodyy/AgentGym, https://github.com/woooodyy/AgentGym-RL

  43. arXiv:2509.05908 [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Enhancing the Robustness of Contextual ASR to Varying Biasing Information Volumes Through Purified Semantic Correlation Joint Modeling

    Authors: Yue Gu, Zhihao Du, Ying Shi, Shiliang Zhang, Qian Chen, Jiqing Han

    Abstract: Recently, cross-attention-based contextual automatic speech recognition (ASR) models have made notable advancements in recognizing personalized biasing phrases. However, the effectiveness of cross-attention is affected by variations in biasing information volume, especially when the length of the biasing list increases significantly. We find that, regardless of the length of the biasing list, only…

    Submitted 6 September, 2025; originally announced September 2025.

    Comments: Accepted by IEEE Transactions on Audio, Speech and Language Processing, 2025 (https://ieeexplore.ieee.org/document/11150731). DOI: 10.1109/TASLPRO.2025.3606198

  44. arXiv:2508.16151 [pdf, ps, other]

    cs.AR cs.CL

    Hardwired-Neurons Language Processing Units as General-Purpose Cognitive Substrates

    Authors: Yang Liu, Yi Chen, Yongwei Zhao, Yifan Hao, Zifu Zheng, Weihao Kong, Zhangmai Li, Dongchen Jiang, Ruiyang Xia, Zhihong Ma, Zisheng Liu, Zhaoyong Wan, Yunqi Lu, Ximing Liu, Hongrui Guo, Zhihao Yang, Zhe Wang, Tianrui Ma, Mo Zou, Rui Zhang, Ling Li, Xing Hu, Zidong Du, Zhiwei Xu, Qi Guo , et al. (2 additional authors not shown)

    Abstract: The rapid advancement of Large Language Models (LLMs) has established language as a core general-purpose cognitive substrate, driving the demand for specialized Language Processing Units (LPUs) tailored for LLM inference. To overcome the growing energy consumption of LLM inference systems, this paper proposes a Hardwired-Neurons Language Processing Unit (HNLPU), which physically hardwires LLM weig…

    Submitted 22 August, 2025; originally announced August 2025.

  45. arXiv:2508.12604 [pdf, ps, other]

    cs.LG cs.AI

    SSPO: Self-traced Step-wise Preference Optimization for Process Supervision and Reasoning Compression

    Authors: Yuyang Xu, Yi Cheng, Haochao Ying, Zhuoyun Du, Renjun Hu, Xing Shi, Wei Lin, Jian Wu

    Abstract: Test-time scaling has proven effective in further enhancing the performance of pretrained Large Language Models (LLMs). However, mainstream post-training methods (i.e., reinforcement learning (RL) with chain-of-thought (CoT) reasoning) often incur substantial computational overhead due to auxiliary models and overthinking. In this paper, we empirically reveal that the incorrect answers partially s…

    Submitted 18 August, 2025; originally announced August 2025.

    Comments: Work in progress

  46. arXiv:2508.11944 [pdf, ps, other]

    cs.AI cs.CL cs.HC

    CHBench: A Cognitive Hierarchy Benchmark for Evaluating Strategic Reasoning Capability of LLMs

    Authors: Hongtao Liu, Zhicheng Du, Zihe Wang, Weiran Shen

    Abstract: Game-playing ability serves as an indicator for evaluating the strategic reasoning capability of large language models (LLMs). However, most existing studies rely on utility performance metrics, which are not robust enough due to variations in opponent behavior and game structure. To address this limitation, we propose Cognitive Hierarchy Benchmark (CHBench), a novel evaluation framework ins…

    Submitted 16 August, 2025; originally announced August 2025.

  47. arXiv:2508.11737 [pdf, ps, other]

    cs.CV cs.AI cs.CL cs.LG

    Ovis2.5 Technical Report

    Authors: Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, Zhichao Wei, Yinglun Li, Lunhao Duan, Jianshan Zhao, Yuxuan Han, Haijun Li, Wanying Chen, Junke Tang, Chengkun Hou, Zhixing Du, Tianli Zhou, Wenjie Zhang, Huping Ding, Jiahe Li, Wen Li, Gui Hu, Yiliang Gu, Siran Yang, Jiamang Wang , et al. (17 additional authors not shown)

    Abstract: We present Ovis2.5, a successor to Ovis2 designed for native-resolution visual perception and strong multimodal reasoning. Ovis2.5 integrates a native-resolution vision transformer that processes images at their native, variable resolutions, avoiding the degradation from fixed-resolution tiling and preserving both fine detail and global layout -- crucial for visually dense content like complex cha…

    Submitted 15 August, 2025; originally announced August 2025.

  48. arXiv:2508.10409 [pdf, ps, other]

    cs.AR cs.AI

    AnalogSeeker: An Open-source Foundation Language Model for Analog Circuit Design

    Authors: Zihao Chen, Ji Zhuang, Jinyi Shen, Xiaoyue Ke, Xinyi Yang, Mingjie Zhou, Zhuoyao Du, Xu Yan, Zhouyang Wu, Zhenyu Xu, Jiangli Huang, Li Shang, Xuan Zeng, Fan Yang

    Abstract: In this paper, we propose AnalogSeeker, an effort toward an open-source foundation language model for analog circuit design, with the aim of integrating domain knowledge and giving design assistance. To overcome the scarcity of data in this field, we employ a corpus collection strategy based on the domain knowledge framework of analog circuits. High-quality, accessible textbooks across relevant su…

    Submitted 5 November, 2025; v1 submitted 14 August, 2025; originally announced August 2025.

  49. arXiv:2508.08791 [pdf, ps, other]

    cs.CL cs.AI

    Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments

    Authors: Junjie Ye, Changhao Jiang, Zhengyin Du, Yufei Xu, Xuesong Yao, Zhiheng Xi, Xiaoran Fan, Qi Zhang, Tao Gui, Xuanjing Huang, Jiecao Chen

    Abstract: Effective tool use is essential for large language models (LLMs) to interact meaningfully with their environment. However, progress is limited by the lack of efficient reinforcement learning (RL) frameworks specifically designed for tool use, due to challenges in constructing stable training environments and designing verifiable reward mechanisms. To address this, we propose an automated environme…

    Submitted 11 September, 2025; v1 submitted 12 August, 2025; originally announced August 2025.

  50. arXiv:2508.06471 [pdf, ps, other]

    cs.CL

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    Authors: GLM-4.5 Team: Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, Kedong Wang, Lucen Zhong, Mingdao Liu, Rui Lu, Shulin Cao, Xiaohan Zhang, Xuancheng Huang, Yao Wei, Yean Cheng, Yifan An, Yilin Niu, Yuanhao Wen, Yushi Bai, et al. (147 additional authors not shown)

    Abstract: We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes. Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning, GLM-4.5 achieves strong performance acro…

    Submitted 8 August, 2025; originally announced August 2025.