
Showing 1–50 of 588 results for author: Cheng, L

Searching in archive cs.
  1. arXiv:2512.20156  [pdf, ps, other]

    cs.CL cs.AI cs.SD eess.AS

    Fun-Audio-Chat Technical Report

    Authors: Qian Chen, Luyao Cheng, Chong Deng, Xiangang Li, Jiaqing Liu, Chao-Hong Tan, Wen Wang, Junhao Xu, Jieping Ye, Qinglin Zhang, Qiquan Zhang, Jingren Zhou

    Abstract: Recent advancements in joint speech-text models show great potential for seamless voice interactions. However, existing models face critical challenges: temporal resolution mismatch between speech tokens (25Hz) and text tokens (~3Hz) dilutes semantic information, incurs high computational costs, and causes catastrophic forgetting of text LLM knowledge. We introduce Fun-Audio-Chat, a Large Audio La…

    Submitted 23 December, 2025; originally announced December 2025.

    Comments: 21 pages, https://github.com/FunAudioLLM/Fun-Audio-Chat

  2. arXiv:2512.19134  [pdf, ps, other]

    cs.CL cs.IR

    QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation

    Authors: Dehai Min, Kailin Zhang, Tongtong Wu, Lu Cheng

    Abstract: Dynamic Retrieval-Augmented Generation adaptively determines when to retrieve during generation to mitigate hallucinations in large language models (LLMs). However, existing methods rely on model-internal signals (e.g., logits, entropy), which are fundamentally unreliable because LLMs are typically ill-calibrated and often exhibit high confidence in erroneous outputs. We propose QuCo-RAG, which sh…

    Submitted 22 December, 2025; originally announced December 2025.

  3. arXiv:2512.18352  [pdf, ps, other]

    cs.CL cs.AI

    LLM-based Few-Shot Early Rumor Detection with Imitation Agent

    Authors: Fengzhu Zeng, Qian Shao, Ling Cheng, Wei Gao, Shih-Fen Cheng, Jing Ma, Cheng Niu

    Abstract: Early Rumor Detection (EARD) aims to identify the earliest point at which a claim can be accurately classified based on a sequence of social media posts. This is especially challenging in data-scarce settings. While Large Language Models (LLMs) perform well in few-shot NLP tasks, they are not well-suited for time-series data and are computationally expensive for both training and inference. In thi…

    Submitted 20 December, 2025; originally announced December 2025.

  4. arXiv:2512.18279  [pdf, ps, other]

    cs.CV

    UniMPR: A Unified Framework for Multimodal Place Recognition with Heterogeneous Sensor Configurations

    Authors: Zhangshuo Qi, Jingyi Xu, Luqi Cheng, Shichen Wen, Yiming Ma, Guangming Xiong

    Abstract: Place recognition is a critical component of autonomous vehicles and robotics, enabling global localization in GPS-denied environments. Recent advances have spurred significant interest in multimodal place recognition (MPR), which leverages complementary strengths of multiple modalities. Despite its potential, most existing MPR methods still face three key challenges: (1) dynamically adapting to v…

    Submitted 23 December, 2025; v1 submitted 20 December, 2025; originally announced December 2025.

    Comments: 14 pages, 9 figures

  5. arXiv:2512.16969  [pdf, ps, other]

    cs.AI cs.CL cs.LG

    Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

    Authors: Wanghan Xu, Yuhao Zhou, Yifan Zhou, Qinglong Cao, Shuo Li, Jia Bu, Bo Liu, Yixin Chen, Xuming He, Xiangyu Zhao, Xiang Zhuang, Fengxiang Wang, Zhiwang Zhou, Qiantai Feng, Wenxuan Huang, Jiaqi Wei, Hao Wu, Yuejin Yang, Guangshuai Wang, Sheng Xu, Ziyan Huang, Xinyao Liu, Jiyao Liu, Cheng Tang, Wei Li, et al. (82 additional authors not shown)

    Abstract: Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI)-the ability to autonomously conceive, investigate, and reason across scientific domains-remains lacking. We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep res…

    Submitted 18 December, 2025; originally announced December 2025.

  6. arXiv:2512.16924  [pdf, ps, other]

    cs.CV

    The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

    Authors: Hanlin Wang, Hao Ouyang, Qiuyu Wang, Yue Yu, Yihao Meng, Wen Wang, Ka Leong Cheng, Shuailei Ma, Qingyan Bai, Yixuan Li, Cheng Chen, Yanhong Zeng, Xing Zhu, Yujun Shen, Qifeng Chen

    Abstract: We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories -- encoding motion, timing, and visibility -- with natural language for semantic intent and reference im…

    Submitted 18 December, 2025; originally announced December 2025.

    Comments: Project page and code: https://worldcanvas.github.io/

  7. arXiv:2512.15699  [pdf, ps, other]

    cs.LG cs.SE

    FrontierCS: Evolving Challenges for Evolving Intelligence

    Authors: Qiuyang Mang, Wenhao Chai, Zhifei Li, Huanzhi Mao, Shang Zhou, Alexander Du, Hanchen Li, Shu Liu, Edwin Chen, Yichuan Wang, Xieting Chu, Zerui Cheng, Yuan Xu, Tian Xia, Zirui Wang, Tianneng Shi, Jianzhu Yao, Yilong Zhao, Qizheng Zhang, Charlie Ruan, Zeyu Shen, Kaiyuan Liu, Runyuan He, Dong Xing, Zerui Li, et al. (26 additional authors not shown)

    Abstract: We introduce FrontierCS, a benchmark of 156 open-ended problems across diverse areas of computer science, designed and reviewed by experts, including CS PhDs and top-tier competitive programming participants and problem setters. Unlike existing benchmarks that focus on tasks with known optimal solutions, FrontierCS targets problems where the optimal solution is unknown, but the quality of a soluti…

    Submitted 17 December, 2025; originally announced December 2025.

    Comments: Code with instruction: https://github.com/FrontierCS/Frontier-CS

  8. arXiv:2512.15567  [pdf, ps, other]

    cs.AI cond-mat.mtrl-sci cs.LG physics.chem-ph

    Evaluating Large Language Models in Scientific Discovery

    Authors: Zhangde Song, Jieyu Lu, Yuanqi Du, Botao Yu, Thomas M. Pruyn, Yue Huang, Kehan Guo, Xiuzhe Luo, Yuanhao Qu, Yi Qu, Yinkai Wang, Haorui Wang, Jeff Guo, Jingru Gan, Parshin Shojaee, Di Luo, Andres M Bran, Gen Li, Qiyuan Zhao, Shao-Xiong Lennon Luo, Yuxuan Zhang, Xiang Zou, Wanru Zhao, Yifan F. Zhang, Wucheng Zhang, et al. (31 additional authors not shown)

    Abstract: Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge and overlook the iterative reasoning, hypothesis generation, and observation interpretation that drive scientific discovery. We introduce a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics, where domain exp…

    Submitted 17 December, 2025; originally announced December 2025.

  9. arXiv:2512.06417  [pdf, ps, other]

    cs.LG cs.SD

    Hankel-FNO: Fast Underwater Acoustic Charting Via Physics-Encoded Fourier Neural Operator

    Authors: Yifan Sun, Lei Cheng, Jianlong Li, Peter Gerstoft

    Abstract: Fast and accurate underwater acoustic charting is crucial for downstream tasks such as environment-aware sensor placement optimization and autonomous vehicle path planning. Conventional methods rely on computationally expensive while accurate numerical solvers, which are not scalable for large-scale or real-time applications. Although deep learning-based surrogate models can accelerate these compu…

    Submitted 6 December, 2025; originally announced December 2025.

  10. arXiv:2512.06363  [pdf, ps, other]

    cs.CV

    Spoofing-aware Prompt Learning for Unified Physical-Digital Facial Attack Detection

    Authors: Jiabao Guo, Yadian Wang, Hui Ma, Yuhao Fu, Ju Jia, Hui Liu, Shengeng Tang, Lechao Cheng, Yunfeng Diao, Ajian Liu

    Abstract: Real-world face recognition systems are vulnerable to both physical presentation attacks (PAs) and digital forgery attacks (DFs). We aim to achieve comprehensive protection of biometric data by implementing a unified physical-digital defense framework with advanced detection. Existing approaches primarily employ CLIP with regularization constraints to enhance model generalization across both tasks…

    Submitted 6 December, 2025; originally announced December 2025.

  11. arXiv:2512.04678  [pdf, ps, other]

    cs.CV

    Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

    Authors: Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, Yujun Shen, Min Zhang

    Abstract: Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and dimini…

    Submitted 4 December, 2025; originally announced December 2025.

  12. arXiv:2512.04585  [pdf, ps, other]

    cs.CV

    SAM3-I: Segment Anything with Instructions

    Authors: Jingjing Li, Yue Feng, Yuchen Guo, Jincai Huang, Yongri Piao, Qi Bi, Miao Zhang, Xiaoqi Zhao, Qiang Chen, Shihao Zou, Wei Ji, Huchuan Lu, Li Cheng

    Abstract: Segment Anything Model 3 (SAM3) has advanced open-vocabulary segmentation through promptable concept segmentation, allowing users to segment all instances corresponding to a given concept, typically specified with short noun-phrase (NP) prompts. While this marks the first integration of language-level concepts within the SAM family, real-world usage typically requires far richer expressions that i…

    Submitted 16 December, 2025; v1 submitted 4 December, 2025; originally announced December 2025.

    Comments: Preliminary results; work in progress

  13. arXiv:2512.03046  [pdf, ps, other]

    cs.CV

    MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues

    Authors: Zichen Liu, Yue Yu, Hao Ouyang, Qiuyu Wang, Shuailei Ma, Ka Leong Cheng, Wen Wang, Qingyan Bai, Yuxuan Zhang, Yanhong Zeng, Yixuan Li, Xing Zhu, Yujun Shen, Qifeng Chen

    Abstract: We propose MagicQuill V2, a novel system that introduces a layered composition paradigm to generative image editing, bridging the gap between the semantic power of diffusion models and the granular control of traditional graphics software. While diffusion transformers excel at holistic generation, their use of singular, monolithic prompts fails to disentangle distinct user intentions for…

    Submitted 2 December, 2025; originally announced December 2025.

    Comments: Code and demo available at https://magicquill.art/v2/

  14. arXiv:2511.22038  [pdf, ps, other]

    cs.CL

    Early Risk Prediction with Temporally and Contextually Grounded Clinical Language Processing

    Authors: Rochana Chaturvedi, Yue Zhou, Andrew Boyd, Brian T. Layden, Mudassir Rashid, Lu Cheng, Ali Cinar, Barbara Di Eugenio

    Abstract: Clinical notes in Electronic Health Records (EHRs) capture rich temporal information on events, clinician reasoning, and lifestyle factors often missing from structured data. Leveraging them for predictive modeling can be impactful for timely identification of chronic diseases. However, they present core natural language processing (NLP) challenges: long text, irregular event distribution, complex…

    Submitted 26 November, 2025; originally announced November 2025.

  15. arXiv:2511.20993  [pdf, ps, other]

    cs.LG cs.AI

    Subgoal Graph-Augmented Planning for LLM-Guided Open-World Reinforcement Learning

    Authors: Shanwei Fan, Bin Zhang, Zhiwei Xu, Yingxuan Teng, Siqi Dai, Lin Cheng, Guoliang Fan

    Abstract: Large language models (LLMs) offer strong high-level planning capabilities for reinforcement learning (RL) by decomposing tasks into subgoals. However, their practical utility is limited by poor planning-execution alignment, which reflects a critical gap between abstract plans and actionable, environment-compatible behaviors. This misalignment arises from two interrelated limitations: (1) LLMs oft…

    Submitted 7 December, 2025; v1 submitted 25 November, 2025; originally announced November 2025.

  16. arXiv:2511.18051  [pdf, ps, other]

    eess.SY cs.LG

    Sparse Kalman Identification for Partially Observable Systems via Adaptive Bayesian Learning

    Authors: Jilan Mei, Tengjie Zheng, Lin Cheng, Shengping Gong, Xu Huang

    Abstract: Sparse dynamics identification is an essential tool for discovering interpretable physical models and enabling efficient control in engineering systems. However, existing methods rely on batch learning with full historical data, limiting their applicability to real-time scenarios involving sequential and partially observable data. To overcome this limitation, this paper proposes an online Sparse K…

    Submitted 22 November, 2025; originally announced November 2025.

  17. arXiv:2511.17217  [pdf, ps, other]

    cs.CV

    Dual-domain Adaptation Networks for Realistic Image Super-resolution

    Authors: Chaowei Fang, Bolin Fu, De Cheng, Lechao Cheng, Guanbin Li

    Abstract: Realistic image super-resolution (SR) focuses on transforming real-world low-resolution (LR) images into high-resolution (HR) ones, handling more complex degradation patterns than synthetic SR tasks. This is critical for applications like surveillance, medical imaging, and consumer electronics. However, current methods struggle with limited real-world LR-HR data, impacting the learning of basic im…

    Submitted 21 November, 2025; originally announced November 2025.

  18. arXiv:2511.14159  [pdf, ps, other]

    cs.CV

    MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

    Authors: Huiyi Chen, Jiawei Peng, Dehai Min, Changchang Sun, Kaijie Chen, Yan Yan, Xu Yang, Lu Cheng

    Abstract: Evaluating the robustness of Large Vision-Language Models (LVLMs) is essential for their continued development and responsible deployment in real-world applications. However, existing robustness benchmarks typically focus on hallucination or misleading textual inputs, while largely overlooking the equally critical challenge posed by misleading visual inputs in assessing visual understanding. To fi…

    Submitted 18 November, 2025; originally announced November 2025.

    Comments: 16 pages, 8 figures

  19. arXiv:2511.12030  [pdf, ps, other]

    cs.CV

    VPHO: Joint Visual-Physical Cue Learning and Aggregation for Hand-Object Pose Estimation

    Authors: Jun Zhou, Chi Xu, Kaifeng Tang, Yuting Ge, Tingrui Guo, Li Cheng

    Abstract: Estimating the 3D poses of hands and objects from a single RGB image is a fundamental yet challenging problem, with broad applications in augmented reality and human-computer interaction. Existing methods largely rely on visual cues alone, often producing results that violate physical constraints such as interpenetration or non-contact. Recent efforts to incorporate physics reasoning typically dep…

    Submitted 14 November, 2025; originally announced November 2025.

    Comments: 14 pages, 9 figures, extended version of the AAAI 2026 paper "VPHO: Joint Visual-Physical Cue Learning and Aggregation for Hand-Object Pose Estimation"

  20. arXiv:2511.10698  [pdf, ps, other]

    cs.CR

    Transferable Hypergraph Attack via Injecting Nodes into Pivotal Hyperedges

    Authors: Meixia He, Peican Zhu, Le Cheng, Yangming Guo, Manman Yuan, Keke Tang

    Abstract: Recent studies have demonstrated that hypergraph neural networks (HGNNs) are susceptible to adversarial attacks. However, existing methods rely on the specific information mechanisms of target HGNNs, overlooking the common vulnerability caused by the significant differences in hyperedge pivotality along aggregation paths in most HGNNs, thereby limiting the transferability and effectiveness of atta…

    Submitted 12 November, 2025; originally announced November 2025.

    Comments: AAAI 2026, Accept

  21. arXiv:2511.07980  [pdf]

    cs.AI

    Capturing Complex Spatial-Temporal Dependencies in Traffic Forecasting: A Self-Attention Approach

    Authors: Zheng Chenghong, Zongyin Deng, Liu Cheng, Xiong Simin, Di Deshi, Li Guanyao

    Abstract: We study the problem of traffic forecasting, aiming to predict the inflow and outflow of a region in the subsequent time slot. The problem is complex due to the intricate spatial and temporal interdependence among regions. Prior works study the spatial and temporal dependency in a decouple manner, failing to capture their joint effect. In this work, we propose ST-SAM, a novel and efficient Spatial…

    Submitted 11 November, 2025; originally announced November 2025.

    Comments: 5 pages

  22. arXiv:2511.07659  [pdf, ps, other]

    cs.CL cs.AI

    Revisiting NLI: Towards Cost-Effective and Human-Aligned Metrics for Evaluating LLMs in Question Answering

    Authors: Sai Shridhar Balamurali, Lu Cheng

    Abstract: Evaluating answers from state-of-the-art large language models (LLMs) is challenging: lexical metrics miss semantic nuances, whereas "LLM-as-Judge" scoring is computationally expensive. We re-evaluate a lightweight alternative -- off-the-shelf Natural Language Inference (NLI) scoring augmented by a simple lexical-match flag and find that this decades-old technique matches GPT-4o's accuracy (89.9%)…

    Submitted 10 November, 2025; originally announced November 2025.

  23. arXiv:2511.06408  [pdf, ps, other]

    cs.CV

    VDNeRF: Vision-only Dynamic Neural Radiance Field for Urban Scenes

    Authors: Zhengyu Zou, Jingfeng Li, Hao Li, Xiaolei Hou, Jinwen Hu, Jingkun Chen, Lechao Cheng, Dingwen Zhang

    Abstract: Neural Radiance Fields (NeRFs) implicitly model continuous three-dimensional scenes using a set of images with known camera poses, enabling the rendering of photorealistic novel views. However, existing NeRF-based methods encounter challenges in applications such as autonomous driving and robotic perception, primarily due to the difficulty of capturing accurate camera poses and limitations in hand…

    Submitted 9 November, 2025; originally announced November 2025.

  24. arXiv:2511.05894  [pdf, ps, other]

    cs.CV

    Open-World 3D Scene Graph Generation for Retrieval-Augmented Reasoning

    Authors: Fei Yu, Quan Deng, Shengeng Tang, Yuehua Li, Lechao Cheng

    Abstract: Understanding 3D scenes in open-world settings poses fundamental challenges for vision and robotics, particularly due to the limitations of closed-vocabulary supervision and static annotations. To address this, we propose a unified framework for Open-World 3D Scene Graph Generation with Retrieval-Augmented Reasoning, which enables generalizable and interactive 3D scene understanding. Our method in…

    Submitted 8 November, 2025; originally announced November 2025.

    Comments: Accepted by AAAI 2026

  25. arXiv:2511.04973  [pdf, ps, other]

    cs.LG

    Less Is More: Generating Time Series with LLaMA-Style Autoregression in Simple Factorized Latent Spaces

    Authors: Siyuan Li, Yifan Sun, Lei Cheng, Lewen Wang, Yang Liu, Weiqing Liu, Jianlong Li, Jiang Bian, Shikai Fang

    Abstract: Generative models for multivariate time series are essential for data augmentation, simulation, and privacy preservation, yet current state-of-the-art diffusion-based approaches are slow and limited to fixed-length windows. We propose FAR-TS, a simple yet effective framework that combines disentangled factorization with an autoregressive Transformer over a discrete, quantized latent space to gener…

    Submitted 6 November, 2025; originally announced November 2025.

  26. arXiv:2511.03475  [pdf, ps, other]

    cs.LG

    RAGBoost: Efficient Retrieval-Augmented Generation with Accuracy-Preserving Context Reuse

    Authors: Yinsicheng Jiang, Yeqi Huang, Liang Cheng, Cheng Deng, Xuan Sun, Luo Mai

    Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with retrieved context but often suffers from downgraded prefill performance as modern applications demand longer and more complex inputs. Existing caching techniques either preserve accuracy with low cache reuse or improve reuse at the cost of degraded reasoning quality. We present RAGBoost, an efficient RAG system that ac…

    Submitted 5 November, 2025; originally announced November 2025.

  27. arXiv:2511.02119  [pdf, ps, other]

    cs.AI cs.CL

    InsurAgent: A Large Language Model-Empowered Agent for Simulating Individual Behavior in Purchasing Flood Insurance

    Authors: Ziheng Geng, Jiachen Liu, Ran Cao, Lu Cheng, Dan M. Frangopol, Minghui Cheng

    Abstract: Flood insurance is an effective strategy for individuals to mitigate disaster-related losses. However, participation rates among at-risk populations in the United States remain strikingly low. This gap underscores the need to understand and model the behavioral mechanisms underlying insurance decisions. Large language models (LLMs) have recently exhibited human-like intelligence across wide-rangin…

    Submitted 3 November, 2025; originally announced November 2025.

  28. arXiv:2511.00540  [pdf, ps, other]

    cs.CV

    Real-IAD Variety: Pushing Industrial Anomaly Detection Dataset to a Modern Era

    Authors: Wenbing Zhu, Chengjie Wang, Bin-Bin Gao, Jiangning Zhang, Guannan Jiang, Jie Hu, Zhenye Gan, Lidong Wang, Ziqing Zhou, Linjie Cheng, Yurui Pan, Bo Peng, Mingmin Chi, Lizhuang Ma

    Abstract: Industrial Anomaly Detection (IAD) is critical for enhancing operational safety, ensuring product quality, and optimizing manufacturing efficiency across global industries. However, the IAD algorithms are severely constrained by the limitations of existing public benchmarks. Current datasets exhibit restricted category diversity and insufficient scale, frequently resulting in metric saturation and…

    Submitted 1 November, 2025; originally announced November 2025.

    Comments: 13 pages, 4 figures and 5 tables

  29. arXiv:2510.22931  [pdf, ps, other]

    cs.LG cs.AI

    Robust Uncertainty Quantification for Self-Evolving Large Language Models via Continual Domain Pretraining

    Authors: Xiaofan Zhou, Lu Cheng

    Abstract: Continual Learning (CL) is essential for enabling self-evolving large language models (LLMs) to adapt and remain effective amid rapid knowledge growth. Yet, despite its importance, little attention has been given to establishing statistical reliability guarantees for LLMs under CL, particularly in the setting of continual domain pretraining (CDP). Conformal Prediction (CP) has shown promise in off…

    Submitted 28 October, 2025; v1 submitted 26 October, 2025; originally announced October 2025.

  30. arXiv:2510.20822  [pdf, ps, other]

    cs.CV

    HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives

    Authors: Yihao Meng, Hao Ouyang, Yue Yu, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Hanlin Wang, Yixuan Li, Cheng Chen, Yanhong Zeng, Yujun Shen, Huamin Qu

    Abstract: State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating the coherent, multi-shot narratives, which are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Wi…

    Submitted 23 October, 2025; originally announced October 2025.

    Comments: Project page and code: https://holo-cine.github.io/

  31. Radar-Camera Fused Multi-Object Tracking: Online Calibration and Common Feature

    Authors: Lei Cheng, Siyang Cao

    Abstract: This paper presents a Multi-Object Tracking (MOT) framework that fuses radar and camera data to enhance tracking efficiency while minimizing manual interventions. Contrary to many studies that underutilize radar and assign it a supplementary role--despite its capability to provide accurate range/depth information of targets in a world 3D coordinate system--our approach positions radar in a crucial…

    Submitted 23 October, 2025; originally announced October 2025.

    Comments: accepted to IEEE Transactions on Intelligent Transportation Systems (T-ITS)

  32. arXiv:2510.15742  [pdf, ps, other]

    cs.CV

    Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

    Authors: Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, Yinghao Xu, Yujun Shen, Qifeng Chen

    Abstract: Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context…

    Submitted 16 December, 2025; v1 submitted 17 October, 2025; originally announced October 2025.

    Comments: Project page: https://ezioby.github.io/Ditto_page Code: https://github.com/EzioBy/Ditto

  33. arXiv:2510.15390  [pdf, ps, other]

    stat.ML cs.LG eess.SY

    Recursive Inference for Heterogeneous Multi-Output GP State-Space Models with Arbitrary Moment Matching

    Authors: Tengjie Zheng, Jilan Mei, Di Wu, Lin Cheng, Shengping Gong

    Abstract: Accurate learning of system dynamics is becoming increasingly crucial for advanced control and decision-making in engineering. However, real-world systems often exhibit multiple channels and highly nonlinear transition dynamics, challenging traditional modeling methods. To enable online learning for these systems, this paper formulates the system as Gaussian process state-space models (GPSSMs) and…

    Submitted 17 October, 2025; originally announced October 2025.

  34. arXiv:2510.11496  [pdf, ps, other]

    cs.CV cs.AI

    AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model

    Authors: Zhiwei Jin, Xiaohui Song, Nan Wang, Yafei Liu, Chao Li, Xin Li, Ruichen Wang, Zhihao Li, Qi Qi, Long Cheng, Dongze Hao, Quanlong Zheng, Yanhao Zhang, Haobo Ji, Jian Ma, Zhitong Zheng, Zhenyi Lin, Haolin Deng, Xin Zou, Xiaojie Yin, Ruilin Wang, Liankai Cai, Haijing Liu, Yuqing Qiu, Ke Chen, et al. (15 additional authors not shown)

    Abstract: In recent years, while cloud-based MLLMs such as QwenVL, InternVL, GPT-4o, Gemini, and Claude Sonnet have demonstrated outstanding performance with enormous model sizes reaching hundreds of billions of parameters, they significantly surpass the limitations in memory, power consumption, and computing capacity of edge devices such as mobile phones. This paper introduces AndesVL, a suite of mobile-si…

    Submitted 21 December, 2025; v1 submitted 13 October, 2025; originally announced October 2025.

    Comments: Tech report of OPPO AndesVL Team

  35. arXiv:2510.08918  [pdf, ps, other]

    cs.CR

    Psyzkaller: Learning from Historical and On-the-Fly Execution Data for Smarter Seed Generation in OS kernel Fuzzing

    Authors: Boyu Liu, Yang Zhang, Liang Cheng, Yi Zhang, Junjie Fan, Yu Fu

    Abstract: Fuzzing has become a cornerstone technique for uncovering vulnerabilities and enhancing the security of OS kernels. However, state-of-the-art kernel fuzzers, including the de facto standard Syzkaller, struggle to generate valid syscall sequences that respect implicit Syscall Dependency Relations (SDRs). Consequently, many generated seeds either fail kernel validation or cannot penetrate deep execu…

    Submitted 9 October, 2025; originally announced October 2025.

  36. arXiv:2510.05414  [pdf, ps, other]

    cs.CL

    A Lightweight Large Language Model-Based Multi-Agent System for 2D Frame Structural Analysis

    Authors: Ziheng Geng, Jiachen Liu, Ran Cao, Lu Cheng, Haifeng Wang, Minghui Cheng

    Abstract: Large language models (LLMs) have recently been used to empower autonomous agents in engineering, significantly improving automation and efficiency in labor-intensive workflows. However, their potential remains underexplored in structural engineering, particularly for finite element modeling tasks requiring geometric modeling, complex reasoning, and domain knowledge. To bridge this gap, this paper…

    Submitted 6 October, 2025; originally announced October 2025.

  37. arXiv:2510.04712  [pdf, ps, other]

    cs.CV cs.HC cs.MM

    ReactDiff: Fundamental Multiple Appropriate Facial Reaction Diffusion Model

    Authors: Luo Cheng, Song Siyang, Yan Siyuan, Yu Zhen, Ge Zongyuan

    Abstract: The automatic generation of diverse and human-like facial reactions in dyadic dialogue remains a critical challenge for human-computer interaction systems. Existing methods fail to model the stochasticity and dynamics inherent in real human reactions. To address this, we propose ReactDiff, a novel temporal diffusion framework for generating diverse facial reactions that are appropriate for respond…

    Submitted 6 October, 2025; originally announced October 2025.

    Comments: Accepted to ACM Multimedia

  38. arXiv:2510.04009  [pdf, ps, other]

    cs.AI cs.CL

    What Shapes a Creative Machine Mind? Comprehensively Benchmarking Creativity in Foundation Models

    Authors: Zicong He, Boxuan Zhang, Weihao Liu, Ruixiang Tang, Lu Cheng

    Abstract: The meteoric rise of foundation models (FMs) has expanded their capabilities far beyond conventional tasks. Creativity, long regarded as a hallmark of human intelligence and a driver of innovation, is now increasingly recognized as a critical dimension of machine intelligence in the era of generative FMs, complementing traditional measures of accuracy. However, existing evaluation frameworks for c…

    Submitted 4 October, 2025; originally announced October 2025.

    Comments: 22 pages

  39. arXiv:2509.25991  [pdf, ps, other]

    cs.AI cs.CV

    Towards Unified Multimodal Misinformation Detection in Social Media: A Benchmark Dataset and Baseline

    Authors: Haiyang Li, Yaxiong Wang, Shengeng Tang, Lianwei Wu, Lechao Cheng, Zhun Zhong

    Abstract: In recent years, detecting fake multimodal content on social media has drawn increasing attention. Two major forms of deception dominate: human-crafted misinformation (e.g., rumors and misleading posts) and AI-generated content produced by image synthesis models or vision-language models (VLMs). Although both share deceptive intent, they are typically studied in isolation. NLP research focuses on…

    Submitted 15 October, 2025; v1 submitted 30 September, 2025; originally announced September 2025.

  40. arXiv:2509.25822  [pdf, ps, other]

    cs.RO

    Act to See, See to Act: Diffusion-Driven Perception-Action Interplay for Adaptive Policies

    Authors: Jing Wang, Weiting Peng, Jing Tang, Zeyu Gong, Xihua Wang, Bo Tao, Li Cheng

    Abstract: Existing imitation learning methods decouple perception and action, which overlooks the causal reciprocity between sensory representations and action execution that humans naturally leverage for adaptive behaviors. To bridge this gap, we introduce Action-Guided Diffusion Policy (DP-AG), a unified representation learning that explicitly models a dynamic interplay between perception and action throu…

    Submitted 11 November, 2025; v1 submitted 30 September, 2025; originally announced September 2025.

    Comments: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

  41. arXiv:2509.24231  [pdf]

    cs.CV

    EVLF-FM: Explainable Vision Language Foundation Model for Medicine

    Authors: Yang Bai, Haoran Cheng, Yang Zhou, Jun Zhou, Arun Thirunavukarasu, Yuhe Ke, Jie Yao, Kanae Fukutsu, Chrystie Wan Ning Quek, Ashley Hong, Laura Gutierrez, Zhen Ling Teo, Darren Shu Jeng Ting, Brian T. Soetikno, Christopher S. Nielsen, Tobias Elze, Zengxiang Li, Linh Le Dinh, Hiok Hong Chan, Victor Koh, Marcus Tan, Kelvin Z. Li, Leonard Yip, Ching Yu Cheng, Yih Chung Tham, et al. (18 additional authors not shown)

    Abstract: Despite the promise of foundation models in medical AI, current systems remain limited - they are modality-specific and lack transparent reasoning processes, hindering clinical adoption. To address this gap, we present EVLF-FM, a multimodal vision-language foundation model (VLM) designed to unify broad diagnostic capability with fine-grain explainability. The development and testing of EVLF-FM enc…

    Submitted 28 September, 2025; originally announced September 2025.

  42. arXiv:2509.21905  [pdf, ps, other]

    cs.CV

    TDEdit: A Unified Diffusion Framework for Text-Drag Guided Image Manipulation

    Authors: Qihang Wang, Yaxiong Wang, Lechao Cheng, Zhun Zhong

    Abstract: This paper explores image editing under the joint control of text and drag interactions. While recent advances in text-driven and drag-driven editing have achieved remarkable progress, they suffer from complementary limitations: text-driven methods excel in texture manipulation but lack precise spatial control, whereas drag-driven approaches primarily modify shape and structure without fine-graine…

    Submitted 26 September, 2025; originally announced September 2025.

  43. arXiv:2509.21079  [pdf, ps, other]

    cs.CL

    SoM-1K: A Thousand-Problem Benchmark Dataset for Strength of Materials

    Authors: Qixin Wan, Zilong Wang, Jingwen Zhou, Wanting Wang, Ziheng Geng, Jiachen Liu, Ran Cao, Minghui Cheng, Lu Cheng

    Abstract: Foundation models have shown remarkable capabilities in various domains, but their performance on complex, multimodal engineering problems remains largely unexplored. We introduce SoM-1K, the first large-scale multimodal benchmark dataset dedicated to evaluating foundation models on problems in the strength of materials (SoM). The dataset, which contains 1,065 annotated SoM problems, mirrors real-…

    Submitted 25 September, 2025; originally announced September 2025.

  44. arXiv:2509.13736  [pdf, ps, other]

    cs.RO

    Motion Adaptation Across Users and Tasks for Exoskeletons via Meta-Learning

    Authors: Muyuan Ma, Long Cheng, Lijun Han, Xiuze Xia, Houcheng Li

    Abstract: Wearable exoskeletons can augment human strength and reduce muscle fatigue during specific tasks. However, developing personalized and task-generalizable assistance algorithms remains a critical challenge. To address this, a meta-imitation learning approach is proposed. This approach leverages a task-specific neural network to predict human elbow joint movements, enabling effective assistance whil…

    Submitted 17 September, 2025; originally announced September 2025.

  45. A Learnable Fully Interacted Two-Tower Model for Pre-Ranking System

    Authors: Chao Xiong, Xianwen Yu, Wei Xu, Lei Cheng, Chuan Yuan, Linjian Mo

    Abstract: Pre-ranking plays a crucial role in large-scale recommender systems by significantly improving the efficiency and scalability within the constraints of providing high-quality candidate sets in real time. The two-tower model is widely used in pre-ranking systems due to a good balance between efficiency and effectiveness with decoupled architecture, which independently processes user and item inputs…

    Submitted 16 September, 2025; originally announced September 2025.

    Journal ref: SIGIR2025

  46. arXiv:2509.12653  [pdf, ps, other]

    cs.CV cs.AI

    Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations

    Authors: Jinjie Shen, Yaxiong Wang, Lechao Cheng, Nan Pu, Zhun Zhong

    Abstract: The detection and grounding of manipulated content in multimodal data has emerged as a critical challenge in media forensics. While existing benchmarks demonstrate technical progress, they suffer from misalignment artifacts that poorly reflect real-world manipulation patterns: practical attacks typically maintain semantic consistency across modalities, whereas current datasets artificially disrupt…

    Submitted 16 September, 2025; originally announced September 2025.

  47. arXiv:2509.10948  [pdf, ps, other]

    cs.RO cs.AI cs.CR eess.SY math.OC

    ViSTR-GP: Online Cyberattack Detection via Vision-to-State Tensor Regression and Gaussian Processes in Automated Robotic Operations

    Authors: Navid Aftabi, Philip Samaha, Jin Ma, Long Cheng, Ramy Harik, Dan Li

    Abstract: Industrial robotic systems are central to automating smart manufacturing operations. Connected and automated factories face growing cybersecurity risks that can potentially cause interruptions and damages to physical operations. Among these attacks, data-integrity attacks often involve sophisticated exploitation of vulnerabilities that enable an attacker to access and manipulate the operational da…

    Submitted 13 September, 2025; originally announced September 2025.

  48. arXiv:2509.10814  [pdf, ps, other]

    cs.CR

    Automatic Generation of a Cryptography Misuse Taxonomy Using Large Language Models

    Authors: Yang Zhang, Wenyi Ouyang, Yi Zhang, Liang Cheng, Chen Wu, Wenxin Hu

    Abstract: The prevalence of cryptographic API misuse (CAM) is compromising the effectiveness of cryptography and in turn the security of modern systems and applications. Despite extensive efforts to develop CAM detection tools, these tools typically rely on a limited set of predefined rules from human-curated knowledge. This rigid, rule-based approach hinders adaptation to evolving CAM patterns in real prac…

    Submitted 13 September, 2025; originally announced September 2025.

    Comments: 23 pages, 9 figures

  49. arXiv:2509.10584  [pdf, ps, other]

    cs.CY cs.AI cs.CL

    Smart Trial: Evaluating the Use of Large Language Models for Recruiting Clinical Trial Participants via Social Media

    Authors: Xiaofan Zhou, Zisu Wang, Janice Krieger, Mohan Zalake, Lu Cheng

    Abstract: Clinical trials (CT) are essential for advancing medical research and treatment, yet efficiently recruiting eligible participants -- each of whom must meet complex eligibility criteria -- remains a significant challenge. Traditional recruitment approaches, such as advertisements or electronic health record screening within hospitals, are often time-consuming and geographically constrained. This wo…

    Submitted 11 September, 2025; originally announced September 2025.

  50. arXiv:2509.07887  [pdf, ps, other]

    cs.LG

    A Survey of Graph Neural Networks for Drug Discovery: Recent Developments and Challenges

    Authors: Katherine Berry, Liang Cheng

    Abstract: Graph Neural Networks (GNNs) have gained traction in the complex domain of drug discovery because of their ability to process graph-structured data such as drug molecule models. This approach has resulted in a myriad of methods and models in published literature across several categories of drug discovery research. This paper covers the research categories comprehensively with recent papers, namel…

    Submitted 9 September, 2025; originally announced September 2025.

    Comments: 16 pages, 1 figure

    ACM Class: I.2; I.2.1; J.3