Showing 1–50 of 6,453 results for author: Zhang, H

Searching in archive cs.

  1. arXiv:2512.19583  [pdf, ps, other]

    cs.RO cs.GR

    Learning Generalizable Hand-Object Tracking from Synthetic Demonstrations

    Authors: Yinhuai Wang, Runyi Yu, Hok Wai Tsui, Xiaoyi Lin, Hui Zhang, Qihan Zhao, Ke Fan, Miao Li, Jie Song, Jingbo Wang, Qifeng Chen, Ping Tan

    Abstract: We present a system for learning generalizable hand-object tracking controllers purely from synthetic data, without requiring any human demonstrations. Our approach makes two key contributions: (1) HOP, a Hand-Object Planner, which can synthesize diverse hand-object trajectories; and (2) HOT, a Hand-Object Tracker that bridges synthetic-to-physical transfer through reinforcement learning and inter…

    Submitted 22 December, 2025; originally announced December 2025.

  2. arXiv:2512.19093  [pdf, ps, other]

    cs.AI

    Tool-Augmented Hybrid Ensemble Reasoning with Distillation for Bilingual Mathematical Problem Solving

    Authors: Peiqing Lu, Yuan Zhang, Haoyun Zhang, Jiasen Zheng, Kejian Tong, Wenjun Wu

    Abstract: Bilingual mathematical problem solving needs a clear link between language reasoning and symbolic calculation. Large language models often handle language well but are weak in accurate computation. This paper presents HERALD (Hybrid Ensemble Reasoning with Adaptive Learning and Distillation), a framework that joins reasoning and calculation using NuminaMath-7B-TIR, GPT-4o, and Mistral-7B. HERALD u…

    Submitted 22 December, 2025; originally announced December 2025.

  3. arXiv:2512.19020  [pdf, ps, other]

    cs.CV cs.LG

    CETCAM: Camera-Controllable Video Generation via Consistent and Extensible Tokenization

    Authors: Zelin Zhao, Xinyu Gong, Bangya Liu, Ziyang Song, Jun Zhang, Suhui Wu, Yongxin Chen, Hao Zhang

    Abstract: Achieving precise camera control in video generation remains challenging, as existing methods often rely on camera pose annotations that are difficult to scale to large and dynamic datasets and are frequently inconsistent with depth estimation, leading to train-test discrepancies. We introduce CETCAM, a camera-controllable video generation framework that eliminates the need for camera annotations…

    Submitted 21 December, 2025; originally announced December 2025.

  4. arXiv:2512.18706  [pdf, ps, other]

    cs.SD

    X-Talk: On the Underestimated Potential of Modular Speech-to-Speech Dialogue System

    Authors: Zhanxun Liu, Yifan Duan, Mengmeng Wang, Pengchao Feng, Haotian Zhang, Xiaoyu Xing, Yijia Shan, Haina Zhu, Yuhang Dai, Chaochao Lu, Xipeng Qiu, Lei Xie, Lan Wang, Nan Yan, Zilong Zheng, Ziyang Ma, Kai Yu, Xie Chen

    Abstract: We present X-Talk, an open-source framework that champions a decoupled, modular design for LLM-driven speech-to-speech (S2S) systems. While the dominant trend favors end-to-end (E2E) modeling to optimize information flow, these "omni-models" often struggle to balance the competing objectives of complex speech tasks within a single network. X-Talk challenges this paradigm by demonstrating that a sy…

    Submitted 21 December, 2025; originally announced December 2025.

    Comments: 14 pages

  5. arXiv:2512.18595  [pdf, ps, other]

    cs.LG

    Benchmarking neural surrogates on realistic spatiotemporal multiphysics flows

    Authors: Runze Mao, Rui Zhang, Xuan Bai, Tianhao Wu, Teng Zhang, Zhenyi Chen, Minqi Lin, Bocheng Zeng, Yangchen Xu, Yingxuan Xiang, Haoze Zhang, Shubham Goswami, Pierre A. Dawe, Yifan Xu, Zhenhua An, Mengtao Yan, Xiaoyi Lu, Yi Wang, Rongbo Bai, Haobu Gao, Xiaohang Fang, Han Li, Hao Sun, Zhi X. Chen

    Abstract: Predicting multiphysics dynamics is computationally expensive and challenging due to the severe coupling of multi-scale, heterogeneous physical processes. While neural surrogates promise a paradigm shift, the field currently suffers from an "illusion of mastery", as repeatedly emphasized in top-tier commentaries: existing evaluations overly rely on simplified, low-dimensional proxies, which fail t…

    Submitted 21 December, 2025; originally announced December 2025.

    Comments: 52 pages, 20 figures. Code and data available at https://github.com/deepflame-ai/REALM. Companion website and leaderboard at https://realm-bench.org

  6. arXiv:2512.18238  [pdf, ps, other]

    cs.DB

    Sync Without Guesswork: Incomplete Time Series Alignment

    Authors: Ding Jia, Jingyu Zhu, Yu Sun, Aoqian Zhang, Shaoxu Song, Haiwei Zhang, Xiaojie Yuan

    Abstract: Multivariate time series alignment is critical for ensuring coherent analysis across variables, but missing values and timestamp inconsistencies make this task highly challenging. Existing approaches often rely on prior imputation, which can introduce errors and lead to suboptimal alignments. To address these limitations, we propose a constraint-based alignment framework for incomplete multivariat…

    Submitted 20 December, 2025; originally announced December 2025.

  7. arXiv:2512.18204  [pdf, ps, other]

    cs.DS

    Learning Dependency Models for Subset Repair

    Authors: Haoda Li, Jiahui Chen, Yu Sun, Shaoxu Song, Haiwei Zhang, Xiaojie Yuan

    Abstract: Inconsistent values are commonly encountered in real-world applications, which can negatively impact data analysis and decision-making. While existing research primarily focuses on identifying the smallest removal set to resolve inconsistencies, recent studies have shown that multiple minimum removal sets may exist, making it difficult to make further decisions. While some approaches use the most…

    Submitted 19 December, 2025; originally announced December 2025.

  8. arXiv:2512.17909  [pdf, ps, other]

    cs.CV

    Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing

    Authors: Shilong Zhang, He Zhang, Zhifei Zhang, Chongjian Ge, Shuchen Xue, Shaoteng Liu, Mengwei Ren, Soo Ye Kim, Yuqian Zhou, Qing Liu, Daniil Pakhomov, Kai Zhang, Zhe Lin, Ping Luo

    Abstract: Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. However, we empirically identify two fundamental obstacles in this para…

    Submitted 19 December, 2025; originally announced December 2025.

    Comments: Project Page: https://jshilong.github.io/PS-VAE-PAGE/

  9. arXiv:2512.17901  [pdf, ps, other]

    cs.AI cs.CL

    When Reasoning Meets Its Laws

    Authors: Junyu Zhang, Yifan Sun, Tianang Leng, Jingyan Shen, Liu Ziyin, Paul Pu Liang, Huan Zhang

    Abstract: Despite the superior performance of Large Reasoning Models (LRMs), their reasoning behaviors are often counterintuitive, leading to suboptimal reasoning capabilities. To theoretically formalize the desired reasoning behaviors, this paper presents the Laws of Reasoning (LoRe), a unified framework that characterizes intrinsic reasoning patterns in LRMs. We first propose compute law with the hypothes…

    Submitted 19 December, 2025; originally announced December 2025.

  10. arXiv:2512.17796  [pdf, ps, other]

    cs.CV cs.AI

    Animate Any Character in Any World

    Authors: Yitong Wang, Fangyun Wei, Hongyang Zhang, Bo Dai, Yan Lu

    Abstract: Recent advances in world models have greatly enhanced interactive environment simulation. Existing methods mainly fall into two categories: (1) static world generation models, which construct 3D environments without active agents, and (2) controllable-entity models, which allow a single entity to perform limited actions in an otherwise uncontrollable environment. In this work, we introduce AniX, l…

    Submitted 18 December, 2025; originally announced December 2025.

    Comments: Project page: https://snowflakewang.github.io/AniX/

  11. arXiv:2512.17733  [pdf, ps, other]

    cs.IR cs.AI

    Diversity Recommendation via Causal Deconfounding of Co-purchase Relations and Counterfactual Exposure

    Authors: Jingmao Zhang, Zhiting Zhao, Yunqi Lin, Jianghong Ma, Tianjun Wei, Haijun Zhang, Xiaofeng Zhang

    Abstract: Beyond user-item modeling, item-to-item relationships are increasingly used to enhance recommendation. However, common methods largely rely on co-occurrence, making them prone to item popularity bias and user attributes, which degrades embedding quality and performance. Meanwhile, although diversity is acknowledged as a key aspect of recommendation quality, existing research offers limited attenti…

    Submitted 19 December, 2025; originally announced December 2025.

  12. arXiv:2512.16891  [pdf, ps, other]

    cs.CV cs.AI cs.IR cs.LG cs.MM

    LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation

    Authors: Haichao Zhang, Yao Lu, Lichen Wang, Yunzhe Li, Daiwei Chen, Yunpeng Xu, Yun Fu

    Abstract: Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data and have already shown promise on tasks such as movie analysis and video question answering. However, deploying VLLMs for downstream tasks such as video recommendation remains challenging, since real systems require multi-video inputs, lightweight backbones, low-latency s…

    Submitted 18 December, 2025; originally announced December 2025.

    MSC Class: 68T05; 68T07; 68T10; 68T45; 68T50; 68U10; 68P20; 62H30; 62H35 ACM Class: I.2.4; I.2.6; I.2.7; I.2.8; I.2.10; I.4; I.5; I.7; H.3.1; H.3.3; H.3.4; H.3.5

  13. arXiv:2512.16727  [pdf, ps, other]

    cs.CV cs.HC

    OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition

    Authors: Haochen Chang, Pengfei Ren, Buyuan Zhang, Da Li, Tianhao Han, Haoyang Zhang, Liang Xie, Hongbo Chen, Erwei Yin

    Abstract: Online micro gesture recognition from hand skeletons is critical for VR/AR interaction but faces challenges due to limited public datasets and task-specific algorithms. Micro gestures involve subtle motion patterns, which make constructing datasets with precise skeletons and frame-level annotations difficult. To this end, we develop a multi-view self-supervised pipeline to automatically generate s…

    Submitted 18 December, 2025; originally announced December 2025.

    Comments: Project page: https://omg-bench.github.io/

  14. arXiv:2512.16077  [pdf, ps, other]

    cs.CV

    Auto-Vocabulary 3D Object Detection

    Authors: Haomeng Zhang, Kuan-Chuan Peng, Suhas Lohit, Raymond A. Yeh

    Abstract: Open-vocabulary 3D object detection methods are able to localize 3D boxes of classes unseen during training. Despite the name, existing methods rely on user-specified classes both at training and inference. We propose to study Auto-Vocabulary 3D Object Detection (AV3DOD), where the classes are automatically generated for the detected objects without any user input. To this end, we introduce Semant…

    Submitted 17 December, 2025; originally announced December 2025.

    Comments: technical report

  15. arXiv:2512.15716  [pdf, ps, other]

    cs.CV cs.AI

    Spatia: Video Generation with Updatable Spatial Memory

    Authors: Jinjing Zhao, Fangyun Wei, Zhening Liu, Hongyang Zhang, Chang Xu, Yan Lu

    Abstract: Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory-aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spati…

    Submitted 17 December, 2025; originally announced December 2025.

    Comments: Project page: https://zhaojingjing713.github.io/Spatia/

  16. arXiv:2512.15601  [pdf, ps, other]

    cs.CL

    You Never Know a Person, You Only Know Their Defenses: Detecting Levels of Psychological Defense Mechanisms in Supportive Conversations

    Authors: Hongbin Na, Zimu Wang, Zhaoming Chen, Peilin Zhou, Yining Hua, Grace Ziqi Zhou, Haiyang Zhang, Tao Shen, Wei Wang, John Torous, Shaoxiong Ji, Ling Chen

    Abstract: Psychological defenses are strategies, often automatic, that people use to manage distress. Rigid or overuse of defenses is negatively linked to mental health and shapes what speakers disclose and how they accept or resist help. However, defenses are complex and difficult to reliably measure, particularly in clinical dialogues. We introduce PsyDefConv, a dialogue corpus with help seeker utterances…

    Submitted 17 December, 2025; originally announced December 2025.

    Comments: Under Review

  17. arXiv:2512.15173  [pdf, ps, other]

    cs.NI

    UAV-enabled Computing Power Networks: Task Completion Probability Analysis

    Authors: Yiqin Deng, Zhengru Fang, Senkang Hu, Yanan Ma, Haixia Zhang, Yuguang Fang

    Abstract: This paper presents an innovative framework that synergistically enhances computing performance through ubiquitous computing power distribution and dynamic computing node accessibility control via adaptive unmanned aerial vehicle (UAV) positioning, establishing UAV-enabled Computing Power Networks (UAV-CPNs). In UAV-CPNs, UAVs function as dynamic aerial relays, outsourcing tasks generated in the r…

    Submitted 17 December, 2025; originally announced December 2025.

    Comments: Accepted by IEEE Global Communications Conference (GLOBECOM) 2025

  18. arXiv:2512.15160  [pdf, ps, other]

    cs.CV

    EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence

    Authors: Jiaxu Wan, Xu Wang, Mengwei Xie, Hang Zhang, Mu Xu, Yang Han, Hong Zhang, Ding Yuan, Yifan Yang

    Abstract: Recent spatial intelligence approaches typically attach 3D cues to 2D reasoning pipelines or couple MLLMs with black-box reconstruction modules, leading to weak spatial consistency, limited viewpoint diversity, and evidence chains that cannot be traced back to supporting views. Frameworks for "thinking with images" (e.g., ChatGPT-o3 and DeepEyes) show that stepwise multimodal reasoning can emerge…

    Submitted 17 December, 2025; originally announced December 2025.

    Comments: 13 pages, 7 figures, 6 tables

  19. arXiv:2512.15006  [pdf, ps, other]

    cs.CV cs.AI

    Evaluating the Capability of Video Question Generation for Expert Knowledge Elicitation

    Authors: Huaying Zhang, Atsushi Hashimoto, Tosho Hirasawa

    Abstract: Skilled human interviewers can extract valuable information from experts. This raises a fundamental question: what makes some questions more effective than others? To address this, a quantitative evaluation of question-generation models is essential. Video question generation (VQG) is a topic for video question answering (VideoQA), where questions are generated for given answers. Their evaluation…

    Submitted 16 December, 2025; originally announced December 2025.

    Comments: WACV 2026 accepted

  20. arXiv:2512.14930  [pdf, ps, other]

    stat.AP cs.AI eess.SY

    Restless Multi-Process Multi-Armed Bandits with Applications to Self-Driving Microscopies

    Authors: Jaume Anguera Peris, Songtao Cheng, Hanzhao Zhang, Wei Ouyang, Joakim Jaldén

    Abstract: High-content screening microscopy generates large amounts of live-cell imaging data, yet its potential remains constrained by the inability to determine when and where to image most effectively. Optimally balancing acquisition time, computational capacity, and photobleaching budgets across thousands of dynamically evolving regions of interest remains an open challenge, further complicated by limit…

    Submitted 16 December, 2025; originally announced December 2025.

  21. arXiv:2512.14681  [pdf, ps, other]

    cs.CL

    Fast and Accurate Causal Parallel Decoding using Jacobi Forcing

    Authors: Lanxiang Hu, Siqi Kou, Yichao Fu, Samyam Rajbhandari, Tajana Rosing, Yuxiong He, Zhijie Deng, Hao Zhang

    Abstract: Multi-token generation has emerged as a promising paradigm for accelerating transformer-based large model inference. Recent efforts primarily explore diffusion Large Language Models (dLLMs) for parallel decoding to reduce inference latency. To achieve AR-level generation quality, many techniques adapt AR models into dLLMs to enable parallel decoding. However, they suffer from limited speedup compa…

    Submitted 16 December, 2025; originally announced December 2025.

  22. arXiv:2512.14614  [pdf, ps, other]

    cs.CV cs.GR

    WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

    Authors: Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, Chunchao Guo

    Abstract: This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key innovations. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse in…

    Submitted 16 December, 2025; originally announced December 2025.

    Comments: project page: https://3d-models.hunyuan.tencent.com/world/, demo: https://3d.hunyuan.tencent.com/sceneTo3D

  23. arXiv:2512.14442  [pdf, ps, other]

    cs.CV cs.RO

    A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning

    Authors: Zixin Zhang, Kanghao Chen, Hanqing Wang, Hongfei Zhang, Harold Haodong Chen, Chenfei Liao, Litao Guo, Ying-Cong Chen

    Abstract: Affordance prediction, which identifies interaction regions on objects based on language instructions, is critical for embodied AI. Prevailing end-to-end models couple high-level reasoning and low-level grounding into a single monolithic pipeline and rely on training over annotated datasets, which leads to poor generalization on novel objects and unseen environments. In this paper, we move beyond…

    Submitted 16 December, 2025; originally announced December 2025.

  24. arXiv:2512.14099  [pdf, ps, other]

    cs.CV

    ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Diffusion Models

    Authors: Ruishu Zhu, Zhihao Huang, Jiacheng Sun, Ping Luo, Hongyuan Zhang, Xuelong Li

    Abstract: Multi-view image generation from a single image and text description remains challenging due to the difficulty of maintaining geometric consistency across different viewpoints. Existing approaches typically rely on 3D-aware architectures or specialized diffusion models that require extensive multi-view training data and complex geometric priors. In this work, we introduce ViewMask-1-to-3, a pionee…

    Submitted 16 December, 2025; originally announced December 2025.

  25. arXiv:2512.14067  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

    Authors: Yonggan Fu, Lexington Whalen, Zhifan Ye, Xin Dong, Shizhe Diao, Jingyu Liu, Chengyue Wu, Hao Zhang, Enze Xie, Song Han, Maksim Khadkevich, Jan Kautz, Yingyan Celine Lin, Pavlo Molchanov

    Abstract: Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To this end, we study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models' task accuracy. We…

    Submitted 15 December, 2025; originally announced December 2025.

  26. arXiv:2512.13837  [pdf, ps, other]

    cs.LG

    Explainable reinforcement learning from human feedback to improve alignment

    Authors: Shicheng Liu, Siyuan Xu, Wenjie Qiu, Hangfan Zhang, Minghui Zhu

    Abstract: A common and effective strategy for humans to improve an unsatisfactory outcome in daily life is to find a cause of this outcome and correct the cause. In this paper, we investigate whether this human improvement strategy can be applied to improving reinforcement learning from human feedback (RLHF) for alignment of language models (LMs). In particular, it is observed in the literature that LMs tun…

    Submitted 15 December, 2025; originally announced December 2025.

  27. arXiv:2512.13600  [pdf]

    cs.CV cs.AI

    DA-SSL: self-supervised domain adaptor to leverage foundational models in turbt histopathology slides

    Authors: Haoyue Zhang, Meera Chappidi, Erolcan Sayar, Helen Richards, Zhijun Chen, Lucas Liu, Roxanne Wadia, Peter A Humphrey, Fady Ghali, Alberto Contreras-Sanz, Peter Black, Jonathan Wright, Stephanie Harmon, Michael Haffner

    Abstract: Recent deep learning frameworks in histopathology, particularly multiple instance learning (MIL) combined with pathology foundational models (PFMs), have shown strong performance. However, PFMs exhibit limitations on certain cancer or specimen types due to domain shifts - these cancer types were rarely used for pretraining or specimens contain tissue-based artifacts rarely seen within the pretrain…

    Submitted 15 December, 2025; originally announced December 2025.

  28. arXiv:2512.13592  [pdf, ps, other]

    cs.LG cs.CV

    Image Diffusion Preview with Consistency Solver

    Authors: Fu-Yun Wang, Hao Zhou, Liangzhe Yuan, Sanghyun Woo, Boqing Gong, Bohyung Han, Ming-Hsuan Yang, Han Zhang, Yukun Zhu, Ting Liu, Long Zhao

    Abstract: The slow inference process of image diffusion models significantly degrades interactive user experiences. To address this, we introduce Diffusion Preview, a novel paradigm employing rapid, low-step sampling to generate preliminary outputs for user evaluation, deferring full-step refinement until the preview is deemed satisfactory. Existing acceleration methods, including training-free solvers and…

    Submitted 15 December, 2025; originally announced December 2025.

  29. arXiv:2512.13507  [pdf, ps, other]

    cs.CV

    Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

    Authors: Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, Xuyan Chi, Jian Cong, Jing Cui, Qinpeng Cui, Qide Dong, Junliang Fan, Jing Fang, Zetao Fang, Chengjian Feng, Han Feng, Mingyuan Gao, Yu Gao, Dong Guo, Qiushan Guo, Boyang Hao, Qingkai Hao, et al. (171 additional authors not shown)

    Abstract: Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional au…

    Submitted 16 December, 2025; v1 submitted 15 December, 2025; originally announced December 2025.

    Comments: Seedance 1.5 pro Technical Report

  30. arXiv:2512.13438  [pdf, ps, other]

    cs.SE cs.AI

    From User Interface to Agent Interface: Efficiency Optimization of UI Representations for LLM Agents

    Authors: Dezhi Ran, Zhi Gong, Yuzhe Guo, Mengzhou Wu, Yuan Cao, Haochuan Lu, Hengyu Zhang, Xia Zeng, Gang Cao, Liangchao Yao, Yuetang Deng, Wei Yang, Tao Xie

    Abstract: While Large Language Model (LLM) agents show great potential for automated UI navigation such as automated UI testing and AI assistants, their efficiency has been largely overlooked. Our motivating study reveals that inefficient UI representation creates a critical performance bottleneck. However, UI representation optimization, formulated as the task of automatically generating programs that tran…

    Submitted 15 December, 2025; originally announced December 2025.

  31. arXiv:2512.13313  [pdf, ps, other]

    cs.CV

    KlingAvatar 2.0 Technical Report

    Authors: Kling Team, Jialu Chen, Yikang Ding, Zhixue Fang, Kun Gai, Yuan Gao, Kang He, Jingyun Hua, Boyuan Jiang, Mingming Lao, Xiaohan Li, Hui Liu, Jiwen Liu, Xiaoqiang Liu, Yuan Liu, Shun Lu, Yongsen Mao, Yingchao Shao, Huafeng Shi, Xiaoyu Shi, Peiqin Sun, Songlin Tang, Pengfei Wan, Chao Wang, Xuebo Wang, et al. (3 additional authors not shown)

    Abstract: Avatar video generation models have achieved remarkable progress in recent years. However, prior work exhibits limited efficiency in generating long-duration high-resolution videos, suffering from temporal drifting, quality degradation, and weak prompt following as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that performs ups…

    Submitted 15 December, 2025; originally announced December 2025.

    Comments: 14 pages, 7 figures

  32. arXiv:2512.13047  [pdf, ps, other]

    cs.OS cs.SE

    Sharpen the Spec, Cut the Code: A Case for Generative File System with SYSSPEC

    Authors: Qingyuan Liu, Zou Mo, Hengbin Zhang, Dong Du, Yubin Xia, Haibo Chen

    Abstract: File systems are critical OS components that require constant evolution to support new hardware and emerging application needs. However, the traditional paradigm of developing features, fixing bugs, and maintaining the system incurs significant overhead, especially as systems grow in complexity. This paper proposes a new paradigm, generative file systems, which leverages Large Language Models (LLM…

    Submitted 15 December, 2025; originally announced December 2025.

  33. arXiv:2512.12500  [pdf]

    cs.HC cs.AI

    Explainable AI as a Double-Edged Sword in Dermatology: The Impact on Clinicians versus The Public

    Authors: Xuhai Xu, Haoyu Hu, Haoran Zhang, Will Ke Wang, Reina Wang, Luis R. Soenksen, Omar Badri, Sheharbano Jafry, Elise Burger, Lotanna Nwandu, Apoorva Mehta, Erik P. Duhaime, Asif Qasim, Hause Lin, Janis Pereira, Jonathan Hershon, Paulius Mui, Alejandro A. Gru, Noémie Elhadad, Lena Mamykina, Matthew Groh, Philipp Tschandl, Roxana Daneshjou, Marzyeh Ghassemi

    Abstract: Artificial intelligence (AI) is increasingly permeating healthcare, from physician assistants to consumer applications. Since AI algorithm's opacity challenges human interaction, explainable AI (XAI) addresses this by providing AI decision-making insight, but evidence suggests XAI can paradoxically induce over-reliance or bias. We present results from two large-scale experiments (623 lay people; 1…

    Submitted 13 December, 2025; originally announced December 2025.

  34. arXiv:2512.12492  [pdf, ps, other]

    cs.CV cs.CL

    Adaptive Detector-Verifier Framework for Zero-Shot Polyp Detection in Open-World Settings

    Authors: Shengkai Xu, Hsiang Lun Kao, Tianxiang Xu, Honghui Zhang, Junqiao Wang, Runmeng Ding, Guanyu Liu, Tianyu Shi, Zhenyu Yu, Guofeng Pan, Ziqian Bi, Yuqi Ouyang

    Abstract: Polyp detectors trained on clean datasets often underperform in real-world endoscopy, where illumination changes, motion blur, and occlusions degrade image quality. Existing approaches struggle with the domain gap between controlled laboratory conditions and clinical practice, where adverse imaging conditions are prevalent. In this work, we propose AdaptiveDetector, a novel two-stage detector-veri…

    Submitted 15 December, 2025; v1 submitted 13 December, 2025; originally announced December 2025.

  35. arXiv:2512.12425  [pdf, ps, other]

    cs.CV

    BokehDepth: Enhancing Monocular Depth Estimation through Bokeh Generation

    Authors: Hangwei Zhang, Armando Teles Fortes, Tianyi Wei, Xingang Pan

    Abstract: Bokeh and monocular depth estimation are tightly coupled through the same lens imaging geometry, yet current methods exploit this connection in incomplete ways. High-quality bokeh rendering pipelines typically depend on noisy depth maps, which amplify estimation errors into visible artifacts, while modern monocular metric depth models still struggle on weakly textured, distant and geometrically am…

    Submitted 13 December, 2025; originally announced December 2025.

  36. DCAF-Net: Dual-Channel Attentive Fusion Network for Lower Limb Motion Intention Prediction in Stroke Rehabilitation Exoskeletons

    Authors: Liangshou Zhang, Yanbin Liu, Hanchi Liu, Zheng Sun, Haozhi Zhang, Yang Zhang, Xin Ma

    Abstract: Rehabilitation exoskeletons have shown promising results in promoting recovery for stroke patients. Accurately and timely identifying the motion intentions of patients is a critical challenge in enhancing active participation during lower limb exoskeleton-assisted rehabilitation training. This paper proposes a Dual-Channel Attentive Fusion Network (DCAF-Net) that synergistically integrates pre-mov…

    Submitted 13 December, 2025; originally announced December 2025.

    Comments: 6 pages, 6 figures

    ACM Class: I.2.9; J.3; I.2.8

    Journal ref: 2025 44th Chinese Control Conference (CCC), Chongqing, China, 2025, pp. 9102-9107

  37. arXiv:2512.12059  [pdf, ps, other]

    cs.AI

    The Forecast Critic: Leveraging Large Language Models for Poor Forecast Identification

    Authors: Luke Bhan, Hanyu Zhang, Andrew Gordon Wilson, Michael W. Mahoney, Chuck Arvin

    Abstract: Monitoring forecasting systems is critical for customer satisfaction, profitability, and operational efficiency in large-scale retail businesses. We propose The Forecast Critic, a system that leverages Large Language Models (LLMs) for automated forecast monitoring, taking advantage of their broad world knowledge and strong "reasoning" capabilities. As a prerequisite for this, we systematically e…

    Submitted 12 December, 2025; originally announced December 2025.

    Comments: Presented at AAAI 2026 AI4TS workshop and AABA4ET workshop

  38. Safe Learning for Contact-Rich Robot Tasks: A Survey from Classical Learning-Based Methods to Safe Foundation Models

    Authors: Heng Zhang, Rui Dai, Gokhan Solak, Pokuang Zhou, Yu She, Arash Ajoudani

    Abstract: Contact-rich tasks pose significant challenges for robotic systems due to inherent uncertainty, complex dynamics, and the high risk of damage during interaction. Recent advances in learning-based control have shown great potential in enabling robots to acquire and generalize complex manipulation skills in such environments, but ensuring safety, both during exploration and execution, remains a crit…

    Submitted 10 December, 2025; originally announced December 2025.

  39. arXiv:2512.11503  [pdf, ps, other]

    cs.CV

    TSkel-Mamba: Temporal Dynamic Modeling via State Space Model for Human Skeleton-based Action Recognition

    Authors: Yanan Liu, Jun Liu, Hao Zhang, Dan Xu, Hossein Rahmani, Mohammed Bennamoun, Qiuhong Ke

    Abstract: Skeleton-based action recognition has garnered significant attention in the computer vision community. Inspired by the recent success of the selective state-space model (SSM) Mamba in modeling 1D temporal sequences, we propose TSkel-Mamba, a hybrid Transformer-Mamba framework that effectively captures both spatial and temporal dynamics. In particular, our approach leverages Spatial Transformer for…

    Submitted 12 December, 2025; originally announced December 2025.

  40. arXiv:2512.11423  [pdf, ps, other]

    cs.CV

    JoyAvatar: Real-time and Infinite Audio-Driven Avatar Generation with Autoregressive Diffusion

    Authors: Chaochao Li, Ruikui Wang, Liangbo Zhou, Jinheng Feng, Huaishao Luo, Huan Zhang, Youzheng Wu, Xiaodong He

    Abstract: Existing DiT-based audio-driven avatar generation methods have achieved considerable progress, yet their broader application is constrained by limitations such as high computational overhead and the inability to synthesize long-duration videos. Autoregressive methods address this problem by applying block-wise autoregressive diffusion methods. However, these methods suffer from the problem of erro…

    Submitted 12 December, 2025; originally announced December 2025.

  41. arXiv:2512.11286  [pdf, ps, other]

    quant-ph cs.CR

    A Survey of OAM-Encoded High-Dimensional Quantum Key Distribution: Foundations, Experiments, and Recent Trends

    Authors: Huan Zhang, Zhenyu Cao, Yu Sun, Hu Jin

    Abstract: High-dimensional quantum key distribution (HD-QKD) enhances information efficiency and noise tolerance by encoding data in large Hilbert spaces. The orbital angular momentum (OAM) of light provides a scalable basis for such encoding and supports high-dimensional photonic communication. Practical OAM-based implementations remain constrained by challenges in state generation, transmission, and detec…

    Submitted 12 December, 2025; originally announced December 2025.

    Comments: 20 pages, 5 figures, submitted to ICT Express

    MSC Class: 81P94

  42. arXiv:2512.11254  [pdf, ps, other]

    cs.IR

    FAIR: Focused Attention Is All You Need for Generative Recommendation

    Authors: Longtao Xiao, Haolin Zhang, Guohao Cai, Jieming Zhu, Yifan Wang, Heng Chang, Zhenhua Dong, Xiu Li, Ruixuan Li

    Abstract: Recently, transformer-based generative recommendation has garnered significant attention for user behavior modeling. However, it often requires discretizing items into multi-code representations (e.g., typically four code tokens or more), which sharply increases the length of the original item sequence. This expansion poses challenges to transformer-based models for modeling user behavior sequence…

    Submitted 16 December, 2025; v1 submitted 11 December, 2025; originally announced December 2025.

  43. arXiv:2512.11187  [pdf, ps, other]

    cs.AI

    Deep Learning--Accelerated Multi-Start Large Neighborhood Search for Real-time Freight Bundling

    Authors: Haohui Zhang, Wouter van Heeswijk, Xinyu Hu, Neil Yorke-Smith, Martijn Mes

    Abstract: Online Freight Exchange Systems (OFEX) play a crucial role in modern freight logistics by facilitating real-time matching between shippers and carriers. However, efficient combinatorial bundling of transportation jobs remains a bottleneck. We model the OFEX combinatorial bundling problem as a multi-commodity one-to-one pickup-and-delivery selective traveling salesperson problem (m1-PDSTSP), which op…

    Submitted 11 December, 2025; originally announced December 2025.

  44. arXiv:2512.11094  [pdf, ps, other]

    cs.NI

    SHIFT: An RDMA Failure-Resilient Layer for Distributed Training

    Authors: Shengkai Lin, Kairui Zhou, Yibo Wu, Hongtao Zhang, Qinwei Yang, Wei Zhang, Arvind Krishnamurthy, Shizhen Zhao

    Abstract: With gang scheduling in large-scale distributed Large Language Model training, a single network anomaly can propagate and cause complete task failure. The frequency of such anomalies increases with network scale. However, existing fault-tolerance mechanisms, such as checkpointing and runtime resilience methods, primarily operate at the application layer and inevitably cause disruptions in training…

    Submitted 11 December, 2025; originally announced December 2025.

  45. arXiv:2512.11087  [pdf, ps, other]

    cs.LG cs.AI cs.CR math.OC

    Clip-and-Verify: Linear Constraint-Driven Domain Clipping for Accelerating Neural Network Verification

    Authors: Duo Zhou, Jorge Chavez, Hesun Chen, Grani A. Hanasusanto, Huan Zhang

    Abstract: State-of-the-art neural network (NN) verifiers demonstrate that applying the branch-and-bound (BaB) procedure with fast bounding techniques plays a key role in tackling many challenging verification properties. In this work, we introduce the linear constraint-driven clipping framework, a class of scalable and efficient methods designed to enhance the efficacy of NN verifiers. Under this framework,…

    Submitted 11 December, 2025; originally announced December 2025.

    Comments: Accepted to NeurIPS 2025

  46. arXiv:2512.10739  [pdf, ps, other]

    cs.CL cs.AI

    Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving

    Authors: Songyang Gao, Yuzhe Gu, Zijian Wu, Lingkai Kong, Wenwei Zhang, Zhongrui Cai, Fan Zheng, Tianyou Ma, Junhao Shen, Haiteng Zhao, Duanyang Zhang, Huilun Zhang, Kuikun Liu, Chengqi Lyu, Yanhui Duan, Chiyu Chen, Ningsheng Ma, Jianfei Gao, Han Lyu, Dahua Lin, Kai Chen

    Abstract: Large Reasoning Models (LRMs) have expanded the mathematical reasoning frontier through Chain-of-Thought (CoT) techniques and Reinforcement Learning with Verifiable Rewards (RLVR), capable of solving AIME-level problems. However, the performance of LRMs is heavily dependent on the extended reasoning context length. For solving ultra-hard problems like those in the International Mathematical Olympi…

    Submitted 11 December, 2025; v1 submitted 11 December, 2025; originally announced December 2025.

  47. arXiv:2512.10501  [pdf, ps, other]

    cs.AI

    Zero-shot 3D Map Generation with LLM Agents: A Dual-Agent Architecture for Procedural Content Generation

    Authors: Lim Chien Her, Ming Yan, Yunshu Bai, Ruihao Li, Hao Zhang

    Abstract: Procedural Content Generation (PCG) offers scalable methods for algorithmically creating complex, customizable worlds. However, controlling these pipelines requires the precise configuration of opaque technical parameters. We propose a training-free architecture that utilizes LLM agents for zero-shot PCG parameter configuration. While Large Language Models (LLMs) promise a natural language interfa…

    Submitted 12 December, 2025; v1 submitted 11 December, 2025; originally announced December 2025.

    Comments: 12 pages, 6 figures

  48. arXiv:2512.10493  [pdf, ps, other]

    cs.SE

    Decoding Human-LLM Collaboration in Coding: An Empirical Study of Multi-Turn Conversations in the Wild

    Authors: Binquan Zhang, Li Zhang, Haoyuan Zhang, Fang Liu, Song Wang, Bo Shen, An Fu, Lin Shi

    Abstract: Large language models (LLMs) are increasingly acting as dynamic conversational interfaces, supporting multi-turn interactions that mimic human-like conversation and facilitate complex tasks like coding. While datasets such as LMSYS-Chat-1M and WildChat capture real-world user-LLM conversations, few studies systematically explore the mechanisms of human-LLM collaboration in coding scenarios. What t…

    Submitted 12 December, 2025; v1 submitted 11 December, 2025; originally announced December 2025.

  49. arXiv:2512.10480  [pdf, ps, other]

    cs.RO

    Seamless Outdoor-Indoor Pedestrian Positioning System with GNSS/UWB/IMU Fusion: A Comparison of EKF, FGO, and PF

    Authors: Jiaqiang Zhang, Xianjia Yu, Sier Ha, Paola Torrico Moron, Sahar Salimpour, Farhad Kerama, Haizhou Zhang, Tomi Westerlund

    Abstract: Accurate and continuous pedestrian positioning across outdoor-indoor environments remains challenging because GNSS, UWB, and inertial PDR are complementary yet individually fragile under signal blockage, multipath, and drift. This paper presents a unified GNSS/UWB/IMU fusion framework for seamless pedestrian localization and provides a controlled comparison of three probabilistic back-ends: an err…

    Submitted 11 December, 2025; originally announced December 2025.

    Comments: 8 pages, 4 figures, submitted to The 17th International Conference on Ambient Systems, Networks and Technologies

  50. arXiv:2512.09276  [pdf, ps, other]

    cs.CV

    Dynamic Facial Expressions Analysis Based Parkinson's Disease Auxiliary Diagnosis

    Authors: Xiaochen Huang, Xiaochen Bi, Cuihua Lv, Xin Wang, Haoyan Zhang, Wenjing Jiang, Xin Ma, Yibin Li

    Abstract: Parkinson's disease (PD), a prevalent neurodegenerative disorder, significantly affects patients' daily functioning and social interactions. To facilitate a more efficient and accessible diagnostic approach for PD, we propose a dynamic facial expression analysis-based PD auxiliary diagnosis method. This method targets hypomimia, a characteristic clinical symptom of PD, by analyzing two manifestati…

    Submitted 9 December, 2025; originally announced December 2025.