Showing 1–50 of 333 results for author: Cheng, K

Searching in archive cs.
  1. arXiv:2512.19311  [pdf, ps, other]

    cs.CV cs.AI

    MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture

    Authors: Hui Li, Jiayue Lyu, Fu-Yun Wang, Kaihui Cheng, Siyu Zhu, Jingdong Wang

    Abstract: This paper studies the training-testing discrepancy (a.k.a. exposure bias) problem for improving the diffusion models. During training, the input of a prediction network at one training timestep is the corresponding ground-truth noisy data that is an interpolation of the noise and the data, and during testing, the input is the generated noisy data. We present a novel training approach, named MixFl…

    Submitted 22 December, 2025; originally announced December 2025.

  2. arXiv:2512.18684  [pdf, ps, other]

    cs.CV

    A Study of Finetuning Video Transformers for Multi-view Geometry Tasks

    Authors: Huimin Wu, Kwang-Ting Cheng, Stephen Lin, Zhirong Wu

    Abstract: This paper presents an investigation of vision transformer learning for multi-view geometry tasks, such as optical flow estimation, by fine-tuning video foundation models. Unlike previous methods that involve custom architectural designs and task-specific pretraining, our research finds that general-purpose models pretrained on videos can be readily transferred to multi-view problems with minimal…

    Submitted 21 December, 2025; originally announced December 2025.

    Comments: AAAI 2026, Project website: geovit-aaai26.github.io

  3. arXiv:2512.18597  [pdf, ps, other]

    cs.CV cs.GR

    Commercial Vehicle Braking Optimization: A Robust SIFT-Trajectory Approach

    Authors: Zhe Li, Kun Cheng, Hanyue Mo, Jintao Lu, Ziwen Kuang, Jianwen Ye, Lixu Xu, Xinya Meng, Jiahui Zhao, Shengda Ji, Shuyuan Liu, Mengyu Wang

    Abstract: A vision-based trajectory analysis solution is proposed to address the "zero-speed braking" issue caused by inaccurate Controller Area Network (CAN) signals in commercial vehicle Automatic Emergency Braking (AEB) systems during low-speed operation. The algorithm utilizes the NVIDIA Jetson AGX Xavier platform to process sequential video frames from a blind spot camera, employing self-adaptive Contr…

    Submitted 21 December, 2025; originally announced December 2025.

    Comments: 5 figures, 16 pages

  4. arXiv:2512.16924  [pdf, ps, other]

    cs.CV

    The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

    Authors: Hanlin Wang, Hao Ouyang, Qiuyu Wang, Yue Yu, Yihao Meng, Wen Wang, Ka Leong Cheng, Shuailei Ma, Qingyan Bai, Yixuan Li, Cheng Chen, Yanhong Zeng, Xing Zhu, Yujun Shen, Qifeng Chen

    Abstract: We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories -- encoding motion, timing, and visibility -- with natural language for semantic intent and reference im…

    Submitted 18 December, 2025; originally announced December 2025.

    Comments: Project page and code: https://worldcanvas.github.io/

  5. arXiv:2512.12875  [pdf, ps, other]

    cs.CV cs.MM cs.SD

    Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal

    Authors: Weihan Xu, Kan Jen Cheng, Koichi Saito, Muhammad Jehanzeb Mirza, Tingle Li, Yisi Liu, Alexander H. Liu, Liming Wang, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji, Gopala Anumanchipalli, Paul Pu Liang

    Abstract: Joint editing of audio and visual content is crucial for precise and controllable content creation. This new task poses challenges due to the limitations of paired audio-visual data before and after targeted edits, and the heterogeneity across modalities. To address the data and modeling challenges in joint audio-visual editing, we introduce SAVEBench, a paired audiovisual dataset with text and ma…

    Submitted 14 December, 2025; originally announced December 2025.

  6. arXiv:2512.04678  [pdf, ps, other]

    cs.CV

    Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

    Authors: Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, Yujun Shen, Min Zhang

    Abstract: Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and dimini…

    Submitted 4 December, 2025; originally announced December 2025.

  7. arXiv:2512.03046  [pdf, ps, other]

    cs.CV

    MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues

    Authors: Zichen Liu, Yue Yu, Hao Ouyang, Qiuyu Wang, Shuailei Ma, Ka Leong Cheng, Wen Wang, Qingyan Bai, Yuxuan Zhang, Yanhong Zeng, Yixuan Li, Xing Zhu, Yujun Shen, Qifeng Chen

    Abstract: We propose MagicQuill V2, a novel system that introduces a layered composition paradigm to generative image editing, bridging the gap between the semantic power of diffusion models and the granular control of traditional graphics software. While diffusion transformers excel at holistic generation, their use of singular, monolithic prompts fails to disentangle distinct user intentions for…

    Submitted 2 December, 2025; originally announced December 2025.

    Comments: Code and demo available at https://magicquill.art/v2/

  8. arXiv:2511.19952  [pdf, ps, other]

    cs.LG

    Hierarchical Spatio-Temporal Attention Network with Adaptive Risk-Aware Decision for Forward Collision Warning in Complex Scenarios

    Authors: Haoran Hu, Junren Shi, Shuo Jiang, Kun Cheng, Xia Yang, Changhao Piao

    Abstract: Forward Collision Warning systems are crucial for vehicle safety and autonomous driving, yet current methods often fail to balance precise multi-agent interaction modeling with real-time decision adaptability, evidenced by the high computational cost for edge deployment and the unreliability stemming from simplified interaction models. To overcome these dual challenges-computational complexity and…

    Submitted 25 November, 2025; originally announced November 2025.

  9. arXiv:2511.17106  [pdf, ps, other]

    cs.CV

    ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better

    Authors: Yuan Zhang, Ming Lu, Junwen Pan, Tao Huang, Kuan Cheng, Qi She, Shanghang Zhang

    Abstract: Recent advances in multimodal reasoning models have demonstrated impressive capabilities across text and vision. However, even leading models exhibit redundant self-reflection when generating lengthy reasoning chains. While training-free CoT compression methods have emerged in the LLMs domain, they rely on static visual references and thus provide limited gains for multimodal reasoning. Therefore,…

    Submitted 21 November, 2025; originally announced November 2025.

    Comments: 16 pages

  10. arXiv:2511.16719  [pdf, ps, other]

    cs.CV cs.AI

    SAM 3: Segment Anything with Concepts

    Authors: Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane Momeni, et al. (13 additional authors not shown)

    Abstract: We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., "yellow school bus"), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object…

    Submitted 20 November, 2025; originally announced November 2025.

  11. arXiv:2511.16024  [pdf, ps, other]

    cs.CV

    Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution

    Authors: Xiao He, Zhijun Tu, Kun Cheng, Mingrui Zhu, Jie Hu, Nannan Wang, Xinbo Gao

    Abstract: The demonstrated success of sparsely-gated Mixture-of-Experts (MoE) architectures, exemplified by models such as DeepSeek and Grok, has motivated researchers to investigate their adaptation to diverse domains. In real-world image super-resolution (Real-ISR), existing approaches mainly rely on fine-tuning pre-trained diffusion models through Low-Rank Adaptation (LoRA) module to reconstruct high-res…

    Submitted 1 December, 2025; v1 submitted 19 November, 2025; originally announced November 2025.

    Comments: 16 pages, Accepted by AAAI 2026, v2: corrected typos

  12. arXiv:2511.12044  [pdf, ps, other]

    cs.CV

    FedSDA: Federated Stain Distribution Alignment for Non-IID Histopathological Image Classification

    Authors: Cheng-Chang Tsai, Kai-Wen Cheng, Chun-Shien Lu

    Abstract: Federated learning (FL) has shown success in collaboratively training a model among decentralized data resources without directly sharing privacy-sensitive training data. Despite recent advances, non-IID (non-independent and identically distributed) data poses an inevitable challenge that hinders the use of FL. In this work, we address the issue of non-IID histopathological images with feature dis…

    Submitted 15 November, 2025; originally announced November 2025.

    Comments: Extended version. 22 pages, 18 figures, 6 tables

  13. arXiv:2511.11019  [pdf, ps, other]

    cs.CR cs.SE

    PATCHEVAL: A New Benchmark for Evaluating LLMs on Patching Real-World Vulnerabilities

    Authors: Zichao Wei, Jun Zeng, Ming Wen, Zeliang Yu, Kai Cheng, Yiding Zhu, Jingyi Guo, Shiqi Zhou, Le Yin, Xiaodong Su, Zhechao Ma

    Abstract: Software vulnerabilities are increasing at an alarming rate. However, manual patching is both time-consuming and resource-intensive, while existing automated vulnerability repair (AVR) techniques remain limited in effectiveness. Recent advances in large language models (LLMs) have opened a new paradigm for AVR, demonstrating remarkable progress. To examine the capability of LLMs in AVR, several vu…

    Submitted 14 November, 2025; originally announced November 2025.

  14. arXiv:2511.00293  [pdf, ps, other]

    cs.CV

    MagicView: Multi-View Consistent Identity Customization via Priors-Guided In-Context Learning

    Authors: Hengjia Li, Jianjin Xu, Keli Cheng, Lei Wang, Ning Bi, Boxi Wu, Fernando De la Torre, Deng Cai

    Abstract: Recent advances in personalized generative models have demonstrated impressive capabilities in producing identity-consistent images of the same individual across diverse scenes. However, most existing methods lack explicit viewpoint control and fail to ensure multi-view consistency of generated identities. To address this limitation, we present MagicView, a lightweight adaptation framework that eq…

    Submitted 3 December, 2025; v1 submitted 31 October, 2025; originally announced November 2025.

  15. arXiv:2510.27107  [pdf, ps, other]

    cs.AR

    A Memory-Efficient Retrieval Architecture for RAG-Enabled Wearable Medical LLMs-Agents

    Authors: Zhipeng Liao, Kunming Shao, Jiangnan Yu, Liang Zhao, Tim Kwang-Ting Cheng, Chi-Ying Tsui, Jie Yang, Mohamad Sawan

    Abstract: With powerful and integrative large language models (LLMs), medical AI agents have demonstrated unique advantages in providing personalized medical consultations, continuous health monitoring, and precise treatment plans. Retrieval-Augmented Generation (RAG) integrates personal medical documents into LLMs by an external retrievable database to address the costly retraining or fine-tuning issues in…

    Submitted 30 October, 2025; originally announced October 2025.

    Comments: Accepted by BioCAS2025

  16. arXiv:2510.25278  [pdf, ps, other]

    cs.AR

    DIRC-RAG: Accelerating Edge RAG with Robust High-Density and High-Loading-Bandwidth Digital In-ReRAM Computation

    Authors: Kunming Shao, Zhipeng Liao, Jiangnan Yu, Liang Zhao, Qiwei Li, Xijie Huang, Jingyu He, Fengshi Tian, Yi Zou, Xiaomeng Wang, Tim Kwang-Ting Cheng, Chi-Ying Tsui

    Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge retrieval but faces challenges on edge devices due to high storage, energy, and latency demands. Computing-in-Memory (CIM) offers a promising solution by storing document embeddings in CIM macros and enabling in-situ parallel retrievals but is constrained by either low memory density or lim…

    Submitted 29 October, 2025; originally announced October 2025.

    Comments: Accepted by 2025 IEEE/ACM ISLPED

  17. arXiv:2510.24411  [pdf, ps, other]

    cs.AI cs.CL cs.CV cs.HC

    OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows

    Authors: Qiushi Sun, Mukai Li, Zhoumianze Liu, Zhihui Xie, Fangzhi Xu, Zhangyue Yin, Kanzhi Cheng, Zehao Li, Zichen Ding, Qi Liu, Zhiyong Wu, Zhuosheng Zhang, Ben Kao, Lingpeng Kong

    Abstract: Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. While these agents hold great promise for advancing digital automation, their potential for unsafe operations, such as system compromise and privacy leakage, is raising significant concerns. Detecting these safety concerns across the vast…

    Submitted 9 December, 2025; v1 submitted 28 October, 2025; originally announced October 2025.

    Comments: work in progress

  18. arXiv:2510.20822  [pdf, ps, other]

    cs.CV

    HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives

    Authors: Yihao Meng, Hao Ouyang, Yue Yu, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Hanlin Wang, Yixuan Li, Cheng Chen, Yanhong Zeng, Yujun Shen, Huamin Qu

    Abstract: State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating the coherent, multi-shot narratives, which are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Wi…

    Submitted 23 October, 2025; originally announced October 2025.

    Comments: Project page and code: https://holo-cine.github.io/

  19. arXiv:2510.18347  [pdf, ps, other]

    cs.RO eess.SY math.OC

    Coverage-Recon: Coordinated Multi-Drone Image Sampling with Online Map Feedback

    Authors: Muhammad Hanif, Reiji Terunuma, Takumi Sumino, Kelvin Cheng, Takeshi Hatanaka

    Abstract: This article addresses collaborative 3D map reconstruction using multiple drones. Achieving high-quality reconstruction requires capturing images of keypoints within the target scene from diverse viewing angles, and coverage control offers an effective framework to meet this requirement. Meanwhile, recent advances in real-time 3D reconstruction algorithms make it possible to render an evolving map…

    Submitted 21 October, 2025; originally announced October 2025.

    Comments: Submitted to IEEE Transactions on Control Systems Technology (under review). Project page: https://htnk-lab.github.io/coverage-recon/

  20. arXiv:2510.15742  [pdf, ps, other]

    cs.CV

    Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

    Authors: Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, Yinghao Xu, Yujun Shen, Qifeng Chen

    Abstract: Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context…

    Submitted 16 December, 2025; v1 submitted 17 October, 2025; originally announced October 2025.

    Comments: Project page: https://ezioby.github.io/Ditto_page Code: https://github.com/EzioBy/Ditto

  21. arXiv:2510.15110  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning

    Authors: Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov

    Abstract: Reasoning language models such as OpenAI-o1, DeepSeek-R1, and Qwen achieve strong performance via extended chains of thought but often generate unnecessarily long outputs. Maximizing intelligence per token--accuracy relative to response length--remains an open problem. We revisit reinforcement learning (RL) with the simplest length penalty--truncation--and show that accuracy degradation arises not…

    Submitted 16 October, 2025; originally announced October 2025.

    Comments: NVIDIA-Tech Report

  22. arXiv:2510.09016  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment

    Authors: Zongcai Du, Guilin Deng, Xiaofeng Guo, Xin Gao, Linke Li, Kaichang Cheng, Fubo Han, Siyu Yang, Peng Liu, Pan Zhong, Qiang Fu

    Abstract: Recent progress in diffusion-based Singing Voice Synthesis (SVS) demonstrates strong expressiveness but remains limited by data scarcity and model scalability. We introduce a two-stage pipeline: a compact seed set of human-sung recordings is constructed by pairing fixed melodies with diverse LLM-generated lyrics, and melody-specific models are trained to synthesize over 500 hours of high-quality C…

    Submitted 10 October, 2025; originally announced October 2025.

    Comments: under review

  23. arXiv:2510.08759  [pdf, ps, other]

    cs.CV cs.RO

    BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities

    Authors: Yu Qi, Haibo Zhao, Ziyu Guo, Siyuan Ma, Ziyan Chen, Yaokun Han, Renrui Zhang, Zitiantao Lin, Shiji Xin, Yijian Huang, Kai Cheng, Peiheng Wang, Jiazheng Liu, Jiayi Zhang, Yizhe Zhu, Wenqing Wang, Yiran Qin, Xupeng Zhu, Haojie Huang, Lawson L. S. Wong

    Abstract: Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. While multimodal large language models (MLLMs) show promise as embodied agents, a thorough and systematic evaluation of their embodied capabilities remains underexplored, as existing benchmarks primarily focus on specific domains such as planning or spatial un…

    Submitted 9 October, 2025; originally announced October 2025.

  24. arXiv:2510.07355  [pdf, ps, other]

    cs.MM cs.SD

    AV-EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Omni-modal LLMS with Audio-visual Cues

    Authors: Krish Patel, Dingkun Zhou, Ajay Kankipati, Akshaj Gupta, Zeyi Austin Li, Mohul Shukla, Vibhor Narang, Sara Kofman, Zongli Ye, Grace Wang, Xiaoyu Shi, Tingle Li, Guan-Ting Lin, Kan Jen Cheng, Huang-Cheng Chou, Jiachen Lian, Gopala Anumanchipalli

    Abstract: Emotions conveyed through voice and face shape engagement and context in human-AI interaction. Despite rapid progress in omni-modal large language models (LLMs), the holistic evaluation of emotional reasoning with audiovisual cues remains limited. To address this gap, we introduce AV-EMO-Reasoning, a benchmark designed to systematically assess emotional coherence in LLMs. The framework leverages a…

    Submitted 8 October, 2025; originally announced October 2025.

  25. arXiv:2509.23672  [pdf, ps, other]

    cs.CV

    Token Merging via Spatiotemporal Information Mining for Surgical Video Understanding

    Authors: Xixi Jiang, Chen Yang, Dong Zhang, Pingcheng Dong, Xin Yang, Kwang-Ting Cheng

    Abstract: Vision Transformer models have shown impressive effectiveness in the surgical video understanding tasks through long-range dependency modeling. However, current methods suffer from prohibitive computational costs due to processing massive spatiotemporal tokens across video frames. While prior work on token merging has advanced model efficiency, they fail to adequately consider the inherent spatiot…

    Submitted 28 September, 2025; originally announced September 2025.

  26. arXiv:2509.03505  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence

    Authors: Xingxuan Zhang, Gang Ren, Han Yu, Hao Yuan, Hui Wang, Jiansheng Li, Jiayun Wu, Lang Mo, Li Mao, Mingchao Hao, Ningbo Dai, Renzhe Xu, Shuyang Li, Tianyang Zhang, Yue He, Yuanrui Wang, Yunjia Zhang, Zijing Xu, Dongzhe Li, Fang Gao, Hao Zou, Jiandong Liu, Jiashuo Liu, Jiawei Xu, Kaijie Cheng, et al. (13 additional authors not shown)

    Abstract: We argue that progress toward general intelligence requires complementary foundation models grounded in language, the physical world, and structured data. This report presents LimiX-16M and LimiX-2M, two instantiations of our large structured-data models (LDMs). Both models treat structured data as a joint distribution over variables and missingness, thus capable of addressing a wide range of tabu…

    Submitted 7 November, 2025; v1 submitted 3 September, 2025; originally announced September 2025.

    Comments: 61 pages

  27. arXiv:2508.20427   

    cs.IR cs.AI

    Rethinking Purity and Diversity in Multi-Behavior Sequential Recommendation from the Frequency Perspective

    Authors: Yongqiang Han, Kai Cheng, Kefan Wang, Enhong Chen

    Abstract: In recommendation systems, users often exhibit multiple behaviors, such as browsing, clicking, and purchasing. Multi-behavior sequential recommendation (MBSR) aims to consider these different behaviors in an integrated manner to improve the recommendation performance of the target behavior. However, some behavior data will also bring inevitable noise to the modeling of user interests. Some researc…

    Submitted 16 October, 2025; v1 submitted 28 August, 2025; originally announced August 2025.

    Comments: Some experiments in the paper have not been sufficiently validated, leading to conclusions that lack robustness. Additionally, there has been significant progress in follow-up work that requires revisions to the manuscript

  28. arXiv:2508.18384  [pdf, ps, other]

    cs.CL cs.AI

    Backprompting: Leveraging Synthetic Production Data for Health Advice Guardrails

    Authors: Kellen Tan Cheng, Anna Lisa Gentile, Chad DeLuca, Guang-Jie Ren

    Abstract: The pervasiveness of large language models (LLMs) in enterprise settings has also brought forth a significant amount of risks associated with their usage. Guardrails technologies aim to mitigate this risk by filtering LLMs' input/output text through various detectors. However, developing and maintaining robust detectors faces many challenges, one of which is the difficulty in acquiring production-…

    Submitted 25 August, 2025; originally announced August 2025.

  29. arXiv:2508.17623  [pdf, ps, other]

    cs.CL eess.AS

    EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Spoken Dialogue Systems

    Authors: Jingwen Liu, Kan Jen Cheng, Jiachen Lian, Akshay Anand, Rishi Jain, Faith Qiao, Robin Netzorg, Huang-Cheng Chou, Tingle Li, Guan-Ting Lin, Gopala Anumanchipalli

    Abstract: Speech emotions play a crucial role in human-computer interaction, shaping engagement and context-aware communication. Despite recent advances in spoken dialogue systems, a holistic system for evaluating emotional reasoning is still lacking. To address this, we introduce EMO-Reasoning, a benchmark for assessing emotional coherence in dialogue systems. It leverages a curated dataset generated via t…

    Submitted 25 August, 2025; v1 submitted 24 August, 2025; originally announced August 2025.

    Comments: Accepted at (ASRU 2025) 2025 IEEE Automatic Speech Recognition and Understanding Workshop

  30. arXiv:2508.09035  [pdf, ps, other]

    cs.DC cs.CL cs.LG

    P/D-Device: Disaggregated Large Language Model between Cloud and Devices

    Authors: Yibo Jin, Yixu Xu, Yue Chen, Chengbin Wang, Tao Wang, Jiaqi Huang, Rongfei Zhang, Yiming Dong, Yuting Yan, Ke Cheng, Yingjie Zhu, Shulan Wang, Qianqian Tang, Shuaishuai Meng, Guanxin Cheng, Ze Wang, Shuyan Miao, Ketao Wang, Wen Liu, Yifan Yang, Tong Zhang, Anran Wang, Chengzhou Lu, Tiantian Dong, Yongsheng Zhang, et al. (5 additional authors not shown)

    Abstract: Serving disaggregated large language models has been widely adopted in industrial practice for enhanced performance. However, too many tokens generated in decoding phase, i.e., occupying the resources for a long time, essentially hamper the cloud from achieving a higher throughput. Meanwhile, due to limited on-device resources, the time to first token (TTFT), i.e., the latency of prefill phase, in…

    Submitted 12 August, 2025; originally announced August 2025.

  31. It's a Complete Haystack: Understanding Dependency Management Needs in Computer-Aided Design

    Authors: Kathy Cheng, Alison Olechowski, Shurui Zhou

    Abstract: In today's landscape, hardware development teams face increasing demands for better quality products, greater innovation, and shorter manufacturing lead times. Despite the need for more efficient and effective processes, hardware designers continue to struggle with a lack of awareness of design changes and other collaborators' actions, a persistent issue in decades of CSCW research. One significan…

    Submitted 7 August, 2025; originally announced August 2025.

    Comments: To be published in the Proceedings of the ACM on Human-Computer Interaction, Volume 9, Issue CSCW2

  32. arXiv:2508.00590  [pdf, ps, other]

    cs.CV eess.IV

    A Novel Modeling Framework and Data Product for Extended VIIRS-like Artificial Nighttime Light Image Reconstruction (1986-2024)

    Authors: Yihe Tian, Kwan Man Cheng, Zhengbo Zhang, Tao Zhang, Suju Li, Dongmei Yan, Bing Xu

    Abstract: Artificial Night-Time Light (NTL) remote sensing is a vital proxy for quantifying the intensity and spatial distribution of human activities. Although the NPP-VIIRS sensor provides high-quality NTL observations, its temporal coverage, which begins in 2012, restricts long-term time-series studies that extend to earlier periods. Despite the progress in extending VIIRS-like NTL time-series, current m…

    Submitted 1 August, 2025; originally announced August 2025.

  33. arXiv:2507.21170  [pdf, ps, other]

    cs.CR cs.AI cs.CL

    OneShield -- the Next Generation of LLM Guardrails

    Authors: Chad DeLuca, Anna Lisa Gentile, Shubhi Asthana, Bing Zhang, Pawan Chowdhary, Kellen Cheng, Basel Shbita, Pengyuan Li, Guang-Jie Ren, Sandeep Gopisetty

    Abstract: The rise of Large Language Models has created a general excitement about the great potential for a myriad of applications. While LLMs offer many possibilities, questions about safety, privacy, and ethics have emerged, and all the key actors are working to address these issues with protective measures for their own models and standalone solutions. The constantly evolving nature of LLMs makes it ext…

    Submitted 31 July, 2025; v1 submitted 25 July, 2025; originally announced July 2025.

  34. arXiv:2507.20688  [pdf, ps, other]

    cs.CR

    Guard-GBDT: Efficient Privacy-Preserving Approximated GBDT Training on Vertical Dataset

    Authors: Anxiao Song, Shujie Cui, Jianli Bai, Ke Cheng, Yulong Shen, Giovanni Russello

    Abstract: In light of increasing privacy concerns and stringent legal regulations, using secure multiparty computation (MPC) to enable collaborative GBDT model training among multiple data owners has garnered significant attention. Despite this, existing MPC-based GBDT frameworks face efficiency challenges due to high communication costs and the computation burden of non-linear operations, such as division…

    Submitted 21 December, 2025; v1 submitted 28 July, 2025; originally announced July 2025.

    Comments: Accepted by The 28th International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2025)

  35. arXiv:2506.24123  [pdf, ps, other]

    cs.CV

    Calligrapher: Freestyle Text Image Customization

    Authors: Yue Ma, Qingyan Bai, Hao Ouyang, Ka Leong Cheng, Qiuyu Wang, Hongyu Liu, Zichen Liu, Haofan Wang, Jingye Chen, Yujun Shen, Qifeng Chen

    Abstract: We introduce Calligrapher, a novel diffusion-based framework that innovatively integrates advanced text customization with artistic typography for digital calligraphy and design applications. Addressing the challenges of precise style control and data dependency in typographic customization, our framework incorporates three key technical contributions. First, we develop a self-distillation mechani…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Project page: https://calligrapher2025.github.io/Calligrapher Code: https://github.com/Calligrapher2025/Calligrapher

  36. arXiv:2506.19199  [pdf, ps, other]

    eess.SY cs.MA cs.RO

    Low-Cost Infrastructure-Free 3D Relative Localization with Sub-Meter Accuracy in Near Field

    Authors: Qiangsheng Gao, Ka Ho Cheng, Li Qiu, Zijun Gong

    Abstract: Relative localization in the near-field scenario is critically important for unmanned vehicle (UxV) applications. Although related works addressing 2D relative localization problem have been widely studied for unmanned ground vehicles (UGVs), the problem in 3D scenarios for unmanned aerial vehicles (UAVs) involves more uncertainties and remains to be investigated. Inspired by the phenomenon that a…

    Submitted 23 June, 2025; originally announced June 2025.

  37. arXiv:2506.17294  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    From Multimodal Perception to Strategic Reasoning: A Survey on AI-Generated Game Commentary

    Authors: Qirui Zheng, Xingbo Wang, Keyuan Cheng, Muhammad Asif Ali, Yunlong Lu, Wenxin Li

    Abstract: The advent of artificial intelligence has propelled AI-Generated Game Commentary (AI-GGC) into a rapidly expanding field, offering benefits such as unlimited availability and personalized narration. However, current researches in this area remain fragmented, and a comprehensive survey that systematically unifies existing efforts is still missing. To bridge this gap, our survey introduces a unified…

    Submitted 18 October, 2025; v1 submitted 17 June, 2025; originally announced June 2025.

  38. arXiv:2506.16112  [pdf, ps, other]

    cs.CV

    Loss-Oriented Ranking for Automated Visual Prompting in LVLMs

    Authors: Yuan Zhang, Chun-Kai Fan, Tao Huang, Ming Lu, Sicheng Yu, Junwen Pan, Kuan Cheng, Qi She, Shanghang Zhang

    Abstract: Inspired by text prompts in large language models (LLMs), visual prompts have been explored to enhance the reasoning capabilities of large vision-language models (LVLMs). Current methods design heuristic visual prompts, such as overlaying a text-query-guided attention heatmap on the original input image. However, designing effective prompts manually is challenging and time-consuming, and it often…

    Submitted 21 November, 2025; v1 submitted 19 June, 2025; originally announced June 2025.

    Comments: 17 pages

  39. arXiv:2506.03143  [pdf, ps, other]

    cs.CL cs.AI cs.CV

    GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

    Authors: Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, Si Qin, Lars Liden, Qingwei Lin, Huan Zhang, Tong Zhang, Jianbing Zhang, Dongmei Zhang, Jianfeng Gao

    Abstract: One of the principal challenges in building VLM-powered GUI agents is visual grounding, i.e., localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment, inability to hand…

    Submitted 3 June, 2025; originally announced June 2025.

  40. arXiv:2506.00894  [pdf, ps, other]

    cs.SE cs.AI cs.CL cs.LG

    CODEMENV: Benchmarking Large Language Models on Code Migration

    Authors: Keyuan Cheng, Xudong Shen, Yihao Yang, Tengyue Wang, Yang Cao, Muhammad Asif Ali, Hanbin Wang, Lijie Hu, Di Wang

    Abstract: Large language models (LLMs) have shown remarkable capabilities across various software engineering tasks; however, their effectiveness in code migration, adapting code to run in different environments, remains insufficiently studied. In this work, we introduce CODEMENV: Code Migration Across Environment, a new benchmark specifically designed to assess LLMs' abilities in code migration scenarios.…

    Submitted 1 June, 2025; originally announced June 2025.

    Comments: Accepted by ACL 2025 Findings

  41. arXiv:2506.00829  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    COMPKE: Complex Question Answering under Knowledge Editing

    Authors: Keyuan Cheng, Zijian Kan, Zhixian He, Zhuoran Zhang, Muhammad Asif Ali, Ke Xu, Lijie Hu, Di Wang

    Abstract: Knowledge Editing, which efficiently modifies the knowledge in large language models, has gathered great attention. Current benchmarks primarily use multi-hop question answering to assess and analyze newly injected or updated knowledge. However, we argue that these benchmarks fail to effectively evaluate how well the updated models apply this knowledge in real-life scenarios, particularly when que…

    Submitted 3 June, 2025; v1 submitted 1 June, 2025; originally announced June 2025.

    Comments: Accepted by ACL 2025 Findings

  42. arXiv:2505.19897  [pdf, ps, other]

    cs.AI cs.CL cs.CV cs.HC

    ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

    Authors: Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, Zhiyong Wu

    Abstract: Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery progress across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are pavin…

    Submitted 27 June, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

    Comments: work in progress

  43. arXiv:2505.17641  [pdf, ps, other]

    cs.DC

    DecLock: A Case of Decoupled Locking for Disaggregated Memory

    Authors: Hanze Zhang, Ke Cheng, Rong Chen, Xingda Wei, Haibo Chen

    Abstract: This paper reveals that locking can significantly degrade the performance of applications on disaggregated memory (DM), sometimes by several orders of magnitude, due to contention on the NICs of memory nodes (MN-NICs). To address this issue, we present DecLock, a locking mechanism for DM that employs decentralized coordination for ownership transfer across compute nodes (CNs) while retaining centr…

    Submitted 23 May, 2025; originally announced May 2025.

  44. arXiv:2505.04652  [pdf, other]

    eess.IV cs.CV

    Rethinking Boundary Detection in Deep Learning-Based Medical Image Segmentation

    Authors: Yi Lin, Dong Zhang, Xiao Fang, Yufan Chen, Kwang-Ting Cheng, Hao Chen

    Abstract: Medical image segmentation is a pivotal task within the realms of medical image analysis and computer vision. While current methods have shown promise in accurately segmenting major regions of interest, the precise segmentation of boundary areas remains challenging. In this study, we propose a novel network architecture named CTO, which combines Convolutional Neural Networks (CNNs), Vision Transfo…

    Submitted 6 May, 2025; originally announced May 2025.

    Comments: Accepted by Medical Image Analysis

  45. arXiv:2505.03748  [pdf, ps, other]

    cs.AR cs.AI

    APSQ: Additive Partial Sum Quantization with Algorithm-Hardware Co-Design

    Authors: Yonghao Tan, Pingcheng Dong, Yongkun Wu, Yu Liu, Xuejiao Liu, Peng Luo, Shih-Yang Liu, Xijie Huang, Dong Zhang, Luhong Liang, Kwang-Ting Cheng

    Abstract: DNN accelerators, significantly advanced by model compression and specialized dataflow techniques, have marked considerable progress. However, the frequent access of high-precision partial sums (PSUMs) leads to excessive memory demands in architectures utilizing input/weight stationary dataflows. Traditional compression strategies have typically overlooked PSUM quantization, which may account for…

    Submitted 10 April, 2025; originally announced May 2025.

    Comments: 62nd ACM/IEEE Design Automation Conference (DAC) 2025

  46. arXiv:2504.14642  [pdf, ps, other]

    cs.CV

    Relation-R1: Progressively Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relation Comprehension

    Authors: Lin Li, Wei Chen, Jiahui Li, Kwang-Ting Cheng, Long Chen

    Abstract: Recent advances in multi-modal large language models (MLLMs) have significantly improved object-level grounding and region captioning. However, they remain limited in visual relation understanding, struggling even with binary relation detection, let alone N-ary relations involving multiple semantic roles. The core reason is the lack of modeling for structural semantic dependencies…

    Submitted 13 December, 2025; v1 submitted 20 April, 2025; originally announced April 2025.

    Comments: AAAI 2026

  47. arXiv:2504.12048  [pdf, other]

    cs.CV

    Modular-Cam: Modular Dynamic Camera-view Video Generation with LLM

    Authors: Zirui Pan, Xin Wang, Yipeng Zhang, Hong Chen, Kwan Man Cheng, Yaofei Wu, Wenwu Zhu

    Abstract: Text-to-Video generation, which utilizes the provided text prompt to generate high-quality videos, has drawn increasing attention and achieved great success due to the development of diffusion models recently. Existing methods mainly rely on a pre-trained text encoder to capture the semantic information and perform cross attention with the encoded text prompt to guide the generation of video. Howe…

    Submitted 16 April, 2025; originally announced April 2025.

    Comments: AAAI 2025 Poster

  48. arXiv:2504.09160  [pdf, other]

    cs.CV

    SCFlow2: Plug-and-Play Object Pose Refiner with Shape-Constraint Scene Flow

    Authors: Qingyuan Wang, Rui Song, Jiaojiao Li, Kerui Cheng, David Ferstl, Yinlin Hu

    Abstract: We introduce SCFlow2, a plug-and-play refinement framework for 6D object pose estimation. Most recent 6D object pose methods rely on refinement to get accurate results. However, most existing refinement methods either suffer from noises in establishing correspondences, or rely on retraining for novel objects. SCFlow2 is based on the SCFlow model designed for refinement with shape constraint, but f…

    Submitted 12 April, 2025; originally announced April 2025.

    Comments: Accepted by CVPR 2025

  49. arXiv:2504.08672  [pdf, other]

    cs.CL cs.AI cs.LG

    Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning

    Authors: Fangzhi Xu, Hang Yan, Chang Ma, Haiteng Zhao, Qiushi Sun, Kanzhi Cheng, Junxian He, Jun Liu, Zhiyong Wu

    Abstract: Advancing LLM reasoning skills has captivated wide interest. However, current post-training techniques rely heavily on supervisory signals, such as outcome supervision or auxiliary reward models, which face the problem of scalability and high annotation costs. This motivates us to enhance LLM reasoning without the need for external supervision. We introduce a generalizable and purely unsupervised…

    Submitted 11 April, 2025; originally announced April 2025.

    Comments: 14 pages, 7 figures

  50. arXiv:2503.22926  [pdf, other]

    cs.RO

    SR-LIO++: Efficient LiDAR-Inertial Odometry and Quantized Mapping with Sweep Reconstruction

    Authors: Zikang Yuan, Ruiye Ming, Chengwei Zhao, Yonghao Tan, Pingcheng Dong, Hongcheng Luo, Yuzhong Jiao, Xin Yang, Kwang-Ting Cheng

    Abstract: Addressing the inherent low acquisition frequency limitation of 3D LiDAR to achieve high-frequency output has become a critical research focus in the LiDAR-Inertial Odometry (LIO) domain. To ensure real-time performance, frequency-enhanced LIO systems must process each sweep within significantly reduced timeframe, which presents substantial challenges for deployment on low-computational-power plat…

    Submitted 8 April, 2025; v1 submitted 28 March, 2025; originally announced March 2025.

    Comments: 10 pages, 12 figures