-
MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture
Authors:
Hui Li,
Jiayue Lyu,
Fu-Yun Wang,
Kaihui Cheng,
Siyu Zhu,
Jingdong Wang
Abstract:
This paper studies the training-testing discrepancy (a.k.a. exposure bias) problem for improving diffusion models. During training, the input of a prediction network at one training timestep is the corresponding ground-truth noisy data that is an interpolation of the noise and the data, whereas during testing, the input is the generated noisy data. We present a novel training approach, named MixFlow, for improving the performance. Our approach is motivated by the Slow Flow phenomenon: the ground-truth interpolation nearest to the generated noisy data at a given sampling timestep is observed to correspond to a higher-noise timestep (termed the slowed timestep), i.e., the corresponding ground-truth timestep is slower than the sampling timestep. MixFlow leverages the interpolations at the slowed timesteps, named the slowed interpolation mixture, for post-training the prediction network at each training timestep. Experiments on class-conditional image generation (including SiT, REPA, and RAE) and text-to-image generation validate the effectiveness of our approach. Applied to RAE models, MixFlow achieves strong generation results on ImageNet: 1.43 FID (without guidance) and 1.10 (with guidance) at 256 x 256, and 1.55 FID (without guidance) and 1.10 (with guidance) at 512 x 512.
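As a rough illustration of the Slow Flow observation, the sketch below searches for the ground-truth interpolation timestep whose interpolation of a (data, noise) pair lies nearest to a generated noisy sample. The linear interpolation convention, the grid search, and all names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def nearest_gt_timestep(x_gen, x0, eps, n_grid=1000):
    """Return the timestep t whose ground-truth interpolation
    x_t = (1 - t) * x0 + t * eps (assumed convention) is closest to x_gen."""
    ts = np.linspace(0.0, 1.0, n_grid)
    dists = [np.linalg.norm((1 - t) * x0 + t * eps - x_gen) for t in ts]
    return ts[int(np.argmin(dists))]

# Toy usage: compare the nearest ground-truth timestep with the sampling
# timestep that produced x_gen; per the paper, the former tends to be the
# noisier ("slowed") one for real sampler outputs.
rng = np.random.default_rng(0)
x0, eps = rng.normal(size=256), rng.normal(size=256)
t_sample = 0.4
x_gen = (1 - t_sample) * x0 + t_sample * eps + 0.05 * rng.normal(size=256)
print(t_sample, nearest_gt_timestep(x_gen, x0, eps))
```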
Submitted 22 December, 2025;
originally announced December 2025.
-
A Study of Finetuning Video Transformers for Multi-view Geometry Tasks
Authors:
Huimin Wu,
Kwang-Ting Cheng,
Stephen Lin,
Zhirong Wu
Abstract:
This paper presents an investigation of vision transformer learning for multi-view geometry tasks, such as optical flow estimation, by fine-tuning video foundation models. Unlike previous methods that involve custom architectural designs and task-specific pretraining, our research finds that general-purpose models pretrained on videos can be readily transferred to multi-view problems with minimal adaptation. The core insight is that general-purpose attention between patches learns temporal and spatial information for geometric reasoning. We demonstrate that appending a linear decoder to the Transformer backbone produces satisfactory results, and iterative refinement can further elevate performance to state-of-the-art levels. This conceptually simple approach achieves top cross-dataset generalization results for optical flow estimation with end-point error (EPE) of 0.69, 1.78, and 3.15 on the Sintel clean, Sintel final, and KITTI datasets, respectively. Our method additionally establishes a new record on the online test benchmark with EPE values of 0.79 and 1.88 and an F1 value of 3.79. Applications to 3D depth estimation and stereo matching also show strong performance, illustrating the versatility of video-pretrained models in addressing geometric vision tasks.
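A minimal sketch of the "linear decoder on a video transformer backbone" recipe described above, assuming a ViT-style backbone that outputs one feature vector per 16x16 patch; the backbone itself and the iterative refinement stage are omitted, and all dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class LinearFlowDecoder(nn.Module):
    """Map per-patch features to a dense 2-channel optical flow field."""

    def __init__(self, feat_dim=768, patch=16):
        super().__init__()
        self.patch = patch
        self.head = nn.Linear(feat_dim, 2 * patch * patch)   # (dx, dy) per pixel

    def forward(self, feats, h_patches, w_patches):
        # feats: [batch, h_patches * w_patches, feat_dim] from the backbone
        b = feats.size(0)
        flow = self.head(feats)                               # [B, N, 2*p*p]
        flow = flow.view(b, h_patches, w_patches, 2, self.patch, self.patch)
        flow = flow.permute(0, 3, 1, 4, 2, 5).reshape(
            b, 2, h_patches * self.patch, w_patches * self.patch)
        return flow

dec = LinearFlowDecoder()
print(dec(torch.randn(1, 14 * 14, 768), 14, 14).shape)   # torch.Size([1, 2, 224, 224])
```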
Submitted 21 December, 2025;
originally announced December 2025.
-
Commercial Vehicle Braking Optimization: A Robust SIFT-Trajectory Approach
Authors:
Zhe Li,
Kun Cheng,
Hanyue Mo,
Jintao Lu,
Ziwen Kuang,
Jianwen Ye,
Lixu Xu,
Xinya Meng,
Jiahui Zhao,
Shengda Ji,
Shuyuan Liu,
Mengyu Wang
Abstract:
A vision-based trajectory analysis solution is proposed to address the "zero-speed braking" issue caused by inaccurate Controller Area Network (CAN) signals in commercial vehicle Automatic Emergency Braking (AEB) systems during low-speed operation. The algorithm utilizes the NVIDIA Jetson AGX Xavier platform to process sequential video frames from a blind spot camera, employing self-adaptive Contrast Limited Adaptive Histogram Equalization (CLAHE)-enhanced Scale-Invariant Feature Transform (SIFT) feature extraction and K-Nearest Neighbors (KNN)-Random Sample Consensus (RANSAC) matching. This allows for precise classification of the vehicle's motion state (static, vibration, moving). Key innovations include 1) multi-frame trajectory displacement statistics (5-frame sliding window), 2) a dual-threshold state decision matrix, and 3) OBD-II driven dynamic Region of Interest (ROI) configuration. The system effectively suppresses environmental interference and false detection of dynamic objects, directly addressing the challenge of low-speed false activation in commercial vehicle safety systems. Evaluation on a real-world dataset (32,454 video segments from 1,852 vehicles) demonstrates an F1-score of 99.96% for static detection, 97.78% for moving state recognition, and a processing delay of 14.2 milliseconds (resolution 704x576). On-site deployment shows an 89% reduction in false braking events, a 100% success rate in emergency braking, and a fault rate below 5%.
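A condensed OpenCV sketch of the pipeline described above (CLAHE-enhanced SIFT, KNN matching with a ratio test, RANSAC outlier rejection, a 5-frame displacement window, and a dual-threshold state decision). The threshold values and the exact decision rule are illustrative assumptions, not the deployed calibration.

```python
import cv2
import numpy as np
from collections import deque

clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
sift = cv2.SIFT_create()
matcher = cv2.BFMatcher()
window = deque(maxlen=5)            # 5-frame sliding window of displacements
T_STATIC, T_MOVING = 0.5, 2.0       # dual thresholds in pixels (illustrative)

def frame_displacement(prev_gray, curr_gray):
    """Median inlier keypoint displacement between two consecutive grayscale frames."""
    kp1, des1 = sift.detectAndCompute(clahe.apply(prev_gray), None)
    kp2, des2 = sift.detectAndCompute(clahe.apply(curr_gray), None)
    if des1 is None or des2 is None:
        return 0.0
    pairs = matcher.knnMatch(des1, des2, k=2)
    good = [p[0] for p in pairs if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    if len(good) < 4:
        return 0.0
    src = np.float32([kp1[m.queryIdx].pt for m in good])
    dst = np.float32([kp2[m.trainIdx].pt for m in good])
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)   # RANSAC outlier rejection
    if mask is None or mask.sum() == 0:
        return 0.0
    inliers = mask.ravel().astype(bool)
    return float(np.median(np.linalg.norm(dst[inliers] - src[inliers], axis=1)))

def motion_state(prev_gray, curr_gray):
    """Classify the motion state from the windowed displacement statistic."""
    window.append(frame_displacement(prev_gray, curr_gray))
    d = float(np.mean(window))
    if d < T_STATIC:
        return "static"
    return "vibration" if d < T_MOVING else "moving"
```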
Submitted 21 December, 2025;
originally announced December 2025.
-
The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text
Authors:
Hanlin Wang,
Hao Ouyang,
Qiuyu Wang,
Yue Yu,
Yihao Meng,
Wen Wang,
Ka Leong Cheng,
Shuailei Ma,
Qingyan Bai,
Yixuan Li,
Cheng Chen,
Yanhong Zeng,
Xing Zhu,
Yujun Shen,
Qifeng Chen
Abstract:
We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories -- encoding motion, timing, and visibility -- with natural language for semantic intent and reference images for visual grounding of object identity, enabling the generation of coherent, controllable events that include multi-agent interactions, object entry/exit, reference-guided appearance, and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene content despite temporary disappearance. By supporting expressive world-event generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at: https://worldcanvas.github.io/.
Submitted 18 December, 2025;
originally announced December 2025.
-
Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal
Authors:
Weihan Xu,
Kan Jen Cheng,
Koichi Saito,
Muhammad Jehanzeb Mirza,
Tingle Li,
Yisi Liu,
Alexander H. Liu,
Liming Wang,
Masato Ishii,
Takashi Shibuya,
Yuki Mitsufuji,
Gopala Anumanchipalli,
Paul Pu Liang
Abstract:
Joint editing of audio and visual content is crucial for precise and controllable content creation. This new task poses challenges due to the limited availability of paired audio-visual data before and after targeted edits, and the heterogeneity across modalities. To address the data and modeling challenges in joint audio-visual editing, we introduce SAVEBench, a paired audiovisual dataset with text and mask conditions to enable object-grounded source-to-target learning. With SAVEBench, we train the Schrodinger Audio-Visual Editor (SAVE), an end-to-end flow-matching model that edits audio and video in parallel while keeping them aligned throughout processing. SAVE incorporates a Schrodinger Bridge that learns a direct transport from source to target audiovisual mixtures. Our evaluation demonstrates that the proposed SAVE model is able to remove the target objects in audio and visual content while preserving the remaining content, with stronger temporal synchronization and audiovisual semantic correspondence compared with pairwise combinations of an audio editor and a video editor.
Submitted 14 December, 2025;
originally announced December 2025.
-
Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
Authors:
Yunhong Lu,
Yanhong Zeng,
Haobo Li,
Hao Ouyang,
Qiuyu Wang,
Ka Leong Cheng,
Jiapeng Zhu,
Hengyuan Cao,
Zhipeng Zhang,
Xing Zhu,
Yujun Shen,
Min Zhang
Abstract:
Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computation cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model's ability to prioritize dynamic content. Instead, Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. We include both quantitative and qualitative experiments to show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.
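A tiny sketch of the EMA-Sink idea described above: fixed-size sink tokens are initialized from the first frames and then updated with an exponential moving average of tokens evicted from the sliding attention window. The decay value and the way evicted tokens are fused (mean pooling) are assumptions for illustration.

```python
import torch

class EMASink:
    """Fixed-size sink tokens updated by an exponential moving average of
    tokens evicted from the sliding attention window (illustrative sketch)."""

    def __init__(self, init_tokens: torch.Tensor, decay: float = 0.9):
        self.sink = init_tokens.clone()       # [num_sink, dim], from initial frames
        self.decay = decay

    def update(self, evicted: torch.Tensor) -> torch.Tensor:
        # Fuse the evicted tokens (here: mean-pooled) into every sink slot.
        fused = evicted.mean(dim=0)           # [dim]
        self.sink = self.decay * self.sink + (1 - self.decay) * fused
        return self.sink

sink = EMASink(torch.randn(4, 64))
for _ in range(8):                            # tokens of frames leaving the window
    sink.update(torch.randn(16, 64))
```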
Submitted 4 December, 2025;
originally announced December 2025.
-
MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues
Authors:
Zichen Liu,
Yue Yu,
Hao Ouyang,
Qiuyu Wang,
Shuailei Ma,
Ka Leong Cheng,
Wen Wang,
Qingyan Bai,
Yuxuan Zhang,
Yanhong Zeng,
Yixuan Li,
Xing Zhu,
Yujun Shen,
Qifeng Chen
Abstract:
We propose MagicQuill V2, a novel system that introduces a \textbf{layered composition} paradigm to generative image editing, bridging the gap between the semantic power of diffusion models and the granular control of traditional graphics software. While diffusion transformers excel at holistic generation, their use of singular, monolithic prompts fails to disentangle distinct user intentions for content, position, and appearance. To overcome this, our method deconstructs creative intent into a stack of controllable visual cues: a content layer for what to create, a spatial layer for where to place it, a structural layer for how it is shaped, and a color layer for its palette. Our technical contributions include a specialized data generation pipeline for context-aware content integration, a unified control module to process all visual cues, and a fine-tuned spatial branch for precise local editing, including object removal. Extensive experiments validate that this layered approach effectively resolves the user intention gap, granting creators direct, intuitive control over the generative process.
Submitted 2 December, 2025;
originally announced December 2025.
-
Hierarchical Spatio-Temporal Attention Network with Adaptive Risk-Aware Decision for Forward Collision Warning in Complex Scenarios
Authors:
Haoran Hu,
Junren Shi,
Shuo Jiang,
Kun Cheng,
Xia Yang,
Changhao Piao
Abstract:
Forward Collision Warning (FCW) systems are crucial for vehicle safety and autonomous driving, yet current methods often fail to balance precise multi-agent interaction modeling with real-time decision adaptability, as evidenced by the high computational cost of edge deployment and the unreliability stemming from simplified interaction models. To overcome these dual challenges of computational complexity and modeling insufficiency, along with the high false alarm rates of traditional static-threshold warnings, this paper introduces an integrated FCW framework that pairs a Hierarchical Spatio-Temporal Attention Network (HSTAN) with a Dynamic Risk Threshold Adjustment (DTRA) algorithm. HSTAN employs a decoupled architecture (a Graph Attention Network for spatial modeling and a cascaded GRU with self-attention for temporal modeling) to achieve superior performance and efficiency, requiring only 12.3 ms inference time (73% faster than Transformer methods) and reducing the Average Displacement Error (ADE) to 0.73 m (42.2% better than Social_LSTM) on the NGSIM dataset. Furthermore, Conformalized Quantile Regression enhances reliability by generating prediction intervals (91.3% coverage at 90% confidence), which the DTRA module then converts into timely warnings via a physics-informed risk potential function and an adaptive threshold mechanism inspired by statistical process control. Tested across multi-scenario datasets, the complete system demonstrates high efficacy, achieving an F1 score of 0.912, a low false alarm rate of 8.2%, and an ample warning lead time of 2.8 seconds, validating the framework's superior performance and practical deployment feasibility in complex environments.
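A small sketch of an adaptive warning threshold in the spirit of the statistical-process-control idea mentioned above: the threshold tracks the recent distribution of risk scores rather than staying fixed. The window size, the k factor, and the floor value are illustrative assumptions, not the paper's DTRA parameters.

```python
import numpy as np
from collections import deque

class DynamicRiskThreshold:
    """Adaptive warning threshold driven by recent risk-score statistics."""

    def __init__(self, window=200, k=3.0, floor=0.5):
        self.scores = deque(maxlen=window)
        self.k, self.floor = k, floor

    def threshold(self) -> float:
        # Control-chart style limit: recent mean plus k standard deviations.
        mu, sigma = np.mean(self.scores), np.std(self.scores)
        return max(self.floor, mu + self.k * sigma)

    def update(self, risk_score: float) -> bool:
        """Return True if a forward-collision warning should fire for this score."""
        warn = len(self.scores) > 10 and risk_score > self.threshold()
        self.scores.append(risk_score)
        return warn

dtra = DynamicRiskThreshold()
alerts = [dtra.update(s) for s in np.random.default_rng(0).normal(1.0, 0.2, 500)]
```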
Submitted 25 November, 2025;
originally announced November 2025.
-
ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better
Authors:
Yuan Zhang,
Ming Lu,
Junwen Pan,
Tao Huang,
Kuan Cheng,
Qi She,
Shanghang Zhang
Abstract:
Recent advances in multimodal reasoning models have demonstrated impressive capabilities across text and vision. However, even leading models exhibit redundant self-reflection when generating lengthy reasoning chains. While training-free CoT compression methods have emerged in the LLMs domain, they rely on static visual references and thus provide limited gains for multimodal reasoning. Therefore, we propose ChainV, a framework that dynamically integrates visual hints into the reasoning process, thereby making multimodal reasoning shorter and better. Specifically, ChainV first performs a coarse visual patch selection based on the previous reasoning step, then refines it by identifying the most representative atomic visual hint according to the averaged attention intensity. Additionally, ChainV introduces a consistency-based evaluation mechanism to assess the reliability of the chosen hint, guiding the model to adaptively adjust its level of self-reflection. Eventually, the pixel coordinates of the selected visual hint and its reliability are incorporated into thinking with a Bernoulli stochastic process. Experiments indicate that our method significantly improves reasoning accuracy and efficiency, especially on math-intensive benchmarks where visual hints are crucial for multi-step symbolic reasoning. For example, ChainV achieves $2.3\%$ improvement on the MathVista within MIMO-VL-RL, while reducing inference latency by $51.4\%$ and shortening output token length by $24.5\%$.
Submitted 21 November, 2025;
originally announced November 2025.
-
SAM 3: Segment Anything with Concepts
Authors:
Nicolas Carion,
Laura Gustafson,
Yuan-Ting Hu,
Shoubhik Debnath,
Ronghang Hu,
Didac Suris,
Chaitanya Ryali,
Kalyan Vasudev Alwala,
Haitham Khedr,
Andrew Huang,
Jie Lei,
Tengyu Ma,
Baishan Guo,
Arpit Kalla,
Markus Marks,
Joseph Greer,
Meng Wang,
Peize Sun,
Roman Rädle,
Triantafyllos Afouras,
Effrosyni Mavroudi,
Katherine Xu,
Tsung-Han Wu,
Yu Zhou,
Liliane Momeni
, et al. (13 additional authors not shown)
Abstract:
We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., "yellow school bus"), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.
Submitted 20 November, 2025;
originally announced November 2025.
-
Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution
Authors:
Xiao He,
Zhijun Tu,
Kun Cheng,
Mingrui Zhu,
Jie Hu,
Nannan Wang,
Xinbo Gao
Abstract:
The demonstrated success of sparsely-gated Mixture-of-Experts (MoE) architectures, exemplified by models such as DeepSeek and Grok, has motivated researchers to investigate their adaptation to diverse domains. In real-world image super-resolution (Real-ISR), existing approaches mainly rely on fine-tuning pre-trained diffusion models through a Low-Rank Adaptation (LoRA) module to reconstruct high-resolution (HR) images. However, these dense Real-ISR models are limited in their ability to adaptively capture the heterogeneous characteristics of complex real-world degraded samples or to enable knowledge sharing between inputs under equivalent computational budgets. To address this, we investigate the integration of sparse MoE into Real-ISR and propose a Mixture-of-Ranks (MoR) architecture for single-step image super-resolution. We introduce a fine-grained expert partitioning strategy that treats each rank in LoRA as an independent expert. This design enables flexible knowledge recombination while isolating fixed-position ranks as shared experts to preserve common-sense features and minimize routing redundancy. Furthermore, we develop a degradation estimation module leveraging CLIP embeddings and predefined positive-negative text pairs to compute relative degradation scores, dynamically guiding expert activation. To better accommodate varying sample complexities, we incorporate zero-expert slots and propose a degradation-aware load-balancing loss, which dynamically adjusts the number of active experts based on degradation severity, ensuring optimal computational resource allocation. Comprehensive experiments validate our framework's effectiveness and state-of-the-art performance.
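A schematic PyTorch sketch of the Mixture-of-Ranks idea described above: each LoRA rank acts as an expert gated by a router driven by a degradation score, with a few fixed ranks kept as always-active shared experts. Dimensions, the sigmoid gating, and the routing input are illustrative assumptions; the zero-expert slots and the load-balancing loss are omitted.

```python
import torch
import torch.nn as nn

class MixtureOfRanksLinear(nn.Module):
    """LoRA layer whose individual ranks are treated as routable experts."""

    def __init__(self, dim, rank=8, n_shared=2):
        super().__init__()
        self.base = nn.Linear(dim, dim, bias=False)    # stands in for a frozen weight
        self.A = nn.Parameter(torch.randn(rank, dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(dim, rank))
        self.router = nn.Linear(1, rank)               # degradation score -> rank gates
        self.n_shared = n_shared

    def forward(self, x, degradation_score):
        # x: [batch, dim], degradation_score: [batch, 1]
        raw = torch.sigmoid(self.router(degradation_score))        # [batch, rank]
        shared = torch.ones_like(raw[:, : self.n_shared])          # shared experts stay on
        gates = torch.cat([shared, raw[:, self.n_shared:]], dim=1)
        delta = ((x @ self.A.t()) * gates) @ self.B.t()            # gated low-rank update
        return self.base(x) + delta

layer = MixtureOfRanksLinear(dim=64)
out = layer(torch.randn(2, 64), torch.rand(2, 1))
```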
Submitted 1 December, 2025; v1 submitted 19 November, 2025;
originally announced November 2025.
-
FedSDA: Federated Stain Distribution Alignment for Non-IID Histopathological Image Classification
Authors:
Cheng-Chang Tsai,
Kai-Wen Cheng,
Chun-Shien Lu
Abstract:
Federated learning (FL) has shown success in collaboratively training a model among decentralized data resources without directly sharing privacy-sensitive training data. Despite recent advances, non-IID (non-independent and identically distributed) data poses an inevitable challenge that hinders the use of FL. In this work, we address the issue of non-IID histopathological images with feature distribution shifts from an intuitive perspective that has only received limited attention. Specifically, we address this issue from the perspective of data distribution by solely adjusting the data distributions of all clients. Building on the success of diffusion models in fitting data distributions and leveraging stain separation to extract the pivotal features that are closely related to the non-IID properties of histopathological images, we propose a Federated Stain Distribution Alignment (FedSDA) method. FedSDA aligns the stain distribution of each client with a target distribution in an FL framework to mitigate distribution shifts among clients. Furthermore, considering that training diffusion models on raw data in FL has been shown to be susceptible to privacy leakage risks, we circumvent this problem while still effectively achieving alignment. Extensive experimental results show that FedSDA is not only effective in improving baselines that focus on mitigating disparities across clients' model updates but also outperforms baselines that address the non-IID data issues from the perspective of data distribution. We show that FedSDA provides valuable and practical insights for the computational pathology community.
Submitted 15 November, 2025;
originally announced November 2025.
-
PATCHEVAL: A New Benchmark for Evaluating LLMs on Patching Real-World Vulnerabilities
Authors:
Zichao Wei,
Jun Zeng,
Ming Wen,
Zeliang Yu,
Kai Cheng,
Yiding Zhu,
Jingyi Guo,
Shiqi Zhou,
Le Yin,
Xiaodong Su,
Zhechao Ma
Abstract:
Software vulnerabilities are increasing at an alarming rate. However, manual patching is both time-consuming and resource-intensive, while existing automated vulnerability repair (AVR) techniques remain limited in effectiveness. Recent advances in large language models (LLMs) have opened a new paradigm for AVR, demonstrating remarkable progress. To examine the capability of LLMs in AVR, several vulnerability benchmarks have been proposed recently. However, they still suffer from key limitations of outdated vulnerabilities, limited language coverage, unreliable patch validation, and insufficient reproducibility. To overcome these challenges, we introduce PATCHEVAL, a multilingual benchmark for Go, JavaScript, and Python, languages for which existing benchmarks remain unexplored. PATCHEVAL curates a dataset of 1,000 vulnerabilities drawn from CVEs reported between 2015 and 2025, covering 65 distinct CWEs. A subset of 230 CVEs is further equipped with runtime sandbox environments, enabling patch verification through both security tests and functionality tests. To provide a systematic comparison of LLM-based vulnerability repair, we evaluate a series of state-of-the-art LLMs and agents, presenting an in-depth analysis that empirically yields key insights to guide future research in AVR.
Submitted 14 November, 2025;
originally announced November 2025.
-
MagicView: Multi-View Consistent Identity Customization via Priors-Guided In-Context Learning
Authors:
Hengjia Li,
Jianjin Xu,
Keli Cheng,
Lei Wang,
Ning Bi,
Boxi Wu,
Fernando De la Torre,
Deng Cai
Abstract:
Recent advances in personalized generative models have demonstrated impressive capabilities in producing identity-consistent images of the same individual across diverse scenes. However, most existing methods lack explicit viewpoint control and fail to ensure multi-view consistency of generated identities. To address this limitation, we present MagicView, a lightweight adaptation framework that equips existing generative models with multi-view generation capability through 3D priors-guided in-context learning. While prior studies have shown that in-context learning preserves identity consistency across grid samples, its effectiveness in multi-view settings remains unexplored. Building upon this insight, we conduct an in-depth analysis of the multi-view in-context learning ability, and design a conditioning architecture that leverages 3D priors to activate this capability for multi-view consistent identity customization. On the other hand, acquiring robust multi-view capability typically requires large-scale multi-dimensional datasets, which makes incorporating multi-view contextual learning under limited data regimes prone to textual controllability degradation. To address this issue, we introduce a novel Semantic Correspondence Alignment loss, which effectively preserves semantic alignment while maintaining multi-view consistency. Extensive experiments demonstrate that MagicView substantially outperforms recent baselines in multi-view consistency, text alignment, identity similarity, and visual quality, achieving strong results with only 100 multi-view training samples.
Submitted 3 December, 2025; v1 submitted 31 October, 2025;
originally announced November 2025.
-
A Memory-Efficient Retrieval Architecture for RAG-Enabled Wearable Medical LLMs-Agents
Authors:
Zhipeng Liao,
Kunming Shao,
Jiangnan Yu,
Liang Zhao,
Tim Kwang-Ting Cheng,
Chi-Ying Tsui,
Jie Yang,
Mohamad Sawan
Abstract:
With powerful and integrative large language models (LLMs), medical AI agents have demonstrated unique advantages in providing personalized medical consultations, continuous health monitoring, and precise treatment plans. Retrieval-Augmented Generation (RAG) integrates personal medical documents into LLMs via an external retrievable database to address the costly retraining or fine-tuning issues in deploying customized agents. While deploying medical agents on edge devices ensures privacy protection, RAG implementations impose substantial memory access and energy consumption during the retrieval stage. This paper presents a hierarchical retrieval architecture for edge RAG, leveraging a two-stage retrieval scheme that combines approximate retrieval for candidate set generation, followed by high-precision retrieval on pre-selected document embeddings. The proposed architecture significantly reduces energy consumption and external memory access while maintaining retrieval accuracy. Simulation results show that, under TSMC 28nm technology, the proposed hierarchical retrieval architecture reduces overall memory access by nearly 50% and computation by 75% compared to pure INT8 retrieval, and the total energy consumption for 1 MB data retrieval is 177.76 μJ/query.
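A numpy sketch of the two-stage retrieval scheme described above: a cheap approximate pass over quantized (here INT8) embeddings selects a candidate set, and only those candidates are re-scored at full precision. The quantization scheme and the candidate-set size are illustrative assumptions.

```python
import numpy as np

def hierarchical_retrieve(query, docs_fp32, docs_int8, scale, top_m=64, top_k=5):
    """Stage 1: approximate INT8 scores (low memory traffic) pick candidates.
    Stage 2: exact full-precision re-ranking over the candidates only."""
    q_int8 = np.round(query / scale).astype(np.int32)
    approx = docs_int8.astype(np.int32) @ q_int8
    candidates = np.argpartition(-approx, top_m)[:top_m]
    exact = docs_fp32[candidates] @ query
    return candidates[np.argsort(-exact)[:top_k]]

rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, 128)).astype(np.float32)
scale = np.abs(docs).max() / 127.0
docs_q = np.clip(np.round(docs / scale), -128, 127).astype(np.int8)
print(hierarchical_retrieve(rng.normal(size=128).astype(np.float32), docs, docs_q, scale))
```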
Submitted 30 October, 2025;
originally announced October 2025.
-
DIRC-RAG: Accelerating Edge RAG with Robust High-Density and High-Loading-Bandwidth Digital In-ReRAM Computation
Authors:
Kunming Shao,
Zhipeng Liao,
Jiangnan Yu,
Liang Zhao,
Qiwei Li,
Xijie Huang,
Jingyu He,
Fengshi Tian,
Yi Zou,
Xiaomeng Wang,
Tim Kwang-Ting Cheng,
Chi-Ying Tsui
Abstract:
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge retrieval but faces challenges on edge devices due to high storage, energy, and latency demands. Computing-in-Memory (CIM) offers a promising solution by storing document embeddings in CIM macros and enabling in-situ parallel retrievals but is constrained by either low memory density or limited computational accuracy. To address these challenges, we present DIRC-RAG, a novel edge RAG acceleration architecture leveraging Digital In-ReRAM Computation (DIRC). DIRC integrates a high-density multi-level ReRAM subarray with an SRAM cell, utilizing SRAM and differential sensing for robust ReRAM readout and digital multiply-accumulate (MAC) operations. By storing all document embeddings within the CIM macro, DIRC achieves ultra-low-power, single-cycle data loading, substantially reducing both energy consumption and latency compared to off-chip DRAM. A query-stationary (QS) dataflow is supported for RAG tasks, minimizing on-chip data movement and reducing SRAM buffer requirements. We introduce error optimization for the DIRC ReRAM-SRAM cell by extracting the bit-wise spatial error distribution of the ReRAM subarray and applying targeted bit-wise data remapping. An error detection circuit is also implemented to enhance readout resilience against device- and circuit-level variations. Simulation results demonstrate that DIRC-RAG under a TSMC 40nm process achieves an on-chip non-volatile memory density of 5.18 Mb/mm2 and a throughput of 131 TOPS. It delivers a 4 MB retrieval latency of 5.6 μs/query and an energy consumption of 0.956 μJ/query, while maintaining retrieval precision.
Submitted 29 October, 2025;
originally announced October 2025.
-
OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
Authors:
Qiushi Sun,
Mukai Li,
Zhoumianze Liu,
Zhihui Xie,
Fangzhi Xu,
Zhangyue Yin,
Kanzhi Cheng,
Zehao Li,
Zichen Ding,
Qi Liu,
Zhiyong Wu,
Zhuosheng Zhang,
Ben Kao,
Lingpeng Kong
Abstract:
Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. While these agents hold great promise for advancing digital automation, their potential for unsafe operations, such as system compromise and privacy leakage, is raising significant concerns. Detecting these safety concerns across the vast and complex operational space of mobile environments presents a formidable challenge that remains critically underexplored. To establish a foundation for mobile agent safety research, we introduce MobileRisk-Live, a dynamic sandbox environment accompanied by a safety detection benchmark comprising realistic trajectories with fine-grained annotations. Built upon this, we propose OS-Sentinel, a novel hybrid safety detection framework that synergistically combines a Formal Verifier for detecting explicit system-level violations with a VLM-based Contextual Judge for assessing contextual risks and agent actions. Experiments show that OS-Sentinel achieves 10%-30% improvements over existing approaches across multiple metrics. Further analysis provides critical insights that foster the development of safer and more reliable autonomous mobile agents. Our code and data are available at https://github.com/OS-Copilot/OS-Sentinel.
Submitted 9 December, 2025; v1 submitted 28 October, 2025;
originally announced October 2025.
-
HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
Authors:
Yihao Meng,
Hao Ouyang,
Yue Yu,
Qiuyu Wang,
Wen Wang,
Ka Leong Cheng,
Hanlin Wang,
Yixuan Li,
Cheng Chen,
Yanhong Zeng,
Yujun Shen,
Huamin Qu
Abstract:
State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating the coherent, multi-shot narratives, which are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state-of-the-art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future. Our code is available at: https://holo-cine.github.io/.
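A toy sketch of a sparse inter-shot attention pattern like the one described above: attention is dense among tokens of the same shot and restricted across shots (here, to each shot's first token). The exact cross-shot connectivity is an assumption for illustration, not HoloCine's actual mask.

```python
import torch

def inter_shot_attention_mask(shot_ids: torch.Tensor) -> torch.Tensor:
    """Boolean mask: dense attention within a shot, sparse attention across shots.
    shot_ids[i] is the shot index of token i; True means attention is allowed."""
    same_shot = shot_ids[:, None] == shot_ids[None, :]           # dense within a shot
    is_shot_start = torch.ones_like(shot_ids, dtype=torch.bool)
    is_shot_start[1:] = shot_ids[1:] != shot_ids[:-1]            # first token of each shot
    sparse_cross = is_shot_start[None, :].expand(len(shot_ids), -1)
    return same_shot | sparse_cross

mask = inter_shot_attention_mask(torch.tensor([0, 0, 0, 1, 1, 2, 2, 2]))
```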
Submitted 23 October, 2025;
originally announced October 2025.
-
Coverage-Recon: Coordinated Multi-Drone Image Sampling with Online Map Feedback
Authors:
Muhammad Hanif,
Reiji Terunuma,
Takumi Sumino,
Kelvin Cheng,
Takeshi Hatanaka
Abstract:
This article addresses collaborative 3D map reconstruction using multiple drones. Achieving high-quality reconstruction requires capturing images of keypoints within the target scene from diverse viewing angles, and coverage control offers an effective framework to meet this requirement. Meanwhile, recent advances in real-time 3D reconstruction algorithms make it possible to render an evolving map during flight, enabling immediate feedback to guide drone motion. Building on this, we present Coverage-Recon, a novel coordinated image sampling algorithm that integrates online map feedback to improve reconstruction quality on-the-fly. In Coverage-Recon, the coordinated motion of drones is governed by a Quadratic Programming (QP)-based angle-aware coverage controller, which ensures multi-viewpoint image capture while enforcing safety constraints. The captured images are processed in real time by the NeuralRecon algorithm to generate an evolving 3D mesh. Mesh changes across the scene are interpreted as indicators of reconstruction uncertainty and serve as feedback to update the importance index of the coverage control as the map evolves. The effectiveness of Coverage-Recon is validated through simulation and experiments, demonstrating both qualitatively and quantitatively that incorporating online map feedback yields more complete and accurate 3D reconstructions than conventional methods. Project page: https://htnk-lab.github.io/coverage-recon/
Submitted 21 October, 2025;
originally announced October 2025.
-
Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
Authors:
Qingyan Bai,
Qiuyu Wang,
Hao Ouyang,
Yue Yu,
Hanlin Wang,
Wen Wang,
Ka Leong Cheng,
Shuailei Ma,
Yanhong Zeng,
Zichen Liu,
Yinghao Xu,
Yujun Shen,
Qifeng Chen
Abstract:
Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.
Submitted 16 December, 2025; v1 submitted 17 October, 2025;
originally announced October 2025.
-
DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning
Authors:
Shih-Yang Liu,
Xin Dong,
Ximing Lu,
Shizhe Diao,
Mingjie Liu,
Min-Hung Chen,
Hongxu Yin,
Yu-Chiang Frank Wang,
Kwang-Ting Cheng,
Yejin Choi,
Jan Kautz,
Pavlo Molchanov
Abstract:
Reasoning language models such as OpenAI-o1, DeepSeek-R1, and Qwen achieve strong performance via extended chains of thought but often generate unnecessarily long outputs. Maximizing intelligence per token--accuracy relative to response length--remains an open problem. We revisit reinforcement learning (RL) with the simplest length penalty--truncation--and show that accuracy degradation arises not from the lack of sophisticated penalties but from inadequate RL optimization. We identify three key challenges: (i) large bias in advantage estimation, (ii) entropy collapse, and (iii) sparse reward signal. We address them with Doing Length pEnalty Right (DLER), a training recipe combining batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty. DLER achieves state-of-the-art accuracy--efficiency trade-offs, cutting output length by over 70 percent while surpassing the accuracy of all previous baselines. It also improves test-time scaling: compared to DeepSeek-R1-7B, DLER-7B generates multiple concise responses in parallel with 28 percent higher accuracy and lower latency. We further introduce Difficulty-Aware DLER, which adaptively tightens truncation on easier questions for additional efficiency gains. We also propose an update-selective merging method that preserves baseline accuracy while retaining the concise reasoning ability of the DLER model, which is useful for scenarios where RL training data is scarce.
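A toy sketch of two ingredients named above, the truncation length penalty and batch-wise reward normalization; the exact reward shaping, clipping, and dynamic sampling used in DLER are not reproduced here, and the numbers are made up.

```python
import numpy as np

def dler_style_advantages(rewards, lengths, max_len):
    """Zero out rewards beyond the length budget (truncation penalty), then
    normalize rewards across the batch to form advantage estimates."""
    rewards = np.asarray(rewards, dtype=np.float32)
    lengths = np.asarray(lengths)
    rewards = np.where(lengths > max_len, 0.0, rewards)      # truncation penalty
    mean, std = rewards.mean(), rewards.std() + 1e-6
    return (rewards - mean) / std                             # batch-normalized advantages

print(dler_style_advantages([1.0, 0.0, 1.0, 1.0], [300, 900, 1200, 450], max_len=1024))
```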
Submitted 16 October, 2025;
originally announced October 2025.
-
DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment
Authors:
Zongcai Du,
Guilin Deng,
Xiaofeng Guo,
Xin Gao,
Linke Li,
Kaichang Cheng,
Fubo Han,
Siyu Yang,
Peng Liu,
Pan Zhong,
Qiang Fu
Abstract:
Recent progress in diffusion-based Singing Voice Synthesis (SVS) demonstrates strong expressiveness but remains limited by data scarcity and model scalability. We introduce a two-stage pipeline: a compact seed set of human-sung recordings is constructed by pairing fixed melodies with diverse LLM-generated lyrics, and melody-specific models are trained to synthesize over 500 hours of high-quality Chinese singing data. Building on this corpus, we propose DiTSinger, a Diffusion Transformer with RoPE and qk-norm, systematically scaled in depth, width, and resolution for enhanced fidelity. Furthermore, we design an implicit alignment mechanism that obviates phoneme-level duration labels by constraining phoneme-to-acoustic attention within character-level spans, thereby improving robustness under noisy or uncertain alignments. Extensive experiments validate that our approach enables scalable, alignment-free, and high-fidelity SVS.
Submitted 10 October, 2025;
originally announced October 2025.
-
BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities
Authors:
Yu Qi,
Haibo Zhao,
Ziyu Guo,
Siyuan Ma,
Ziyan Chen,
Yaokun Han,
Renrui Zhang,
Zitiantao Lin,
Shiji Xin,
Yijian Huang,
Kai Cheng,
Peiheng Wang,
Jiazheng Liu,
Jiayi Zhang,
Yizhe Zhu,
Wenqing Wang,
Yiran Qin,
Xupeng Zhu,
Haojie Huang,
Lawson L. S. Wong
Abstract:
Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. While multimodal large language models (MLLMs) show promise as embodied agents, a thorough and systematic evaluation of their embodied capabilities remains underexplored, as existing benchmarks primarily focus on specific domains such as planning or spatial understanding. To bridge this gap, we introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities. BEAR comprises 4,469 interleaved image-video-text entries across 14 domains in 6 categories, including tasks from low-level pointing, trajectory understanding, spatial reasoning, to high-level planning. Extensive evaluation results of 20 representative MLLMs reveal their persistent limitations across all domains of embodied capabilities. To tackle the shortfall, we propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities. It substantially enhances MLLM performance across diverse embodied capabilities on BEAR, yielding a 9.12% absolute gain and a relative improvement of 17.5% on GPT-5. Furthermore, our experiments indicate that improving MLLM embodied capabilities can benefit embodied tasks in simulated environments. Project website: https://bear-official66.github.io/
Submitted 9 October, 2025;
originally announced October 2025.
-
AV-EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Omni-modal LLMS with Audio-visual Cues
Authors:
Krish Patel,
Dingkun Zhou,
Ajay Kankipati,
Akshaj Gupta,
Zeyi Austin Li,
Mohul Shukla,
Vibhor Narang,
Sara Kofman,
Zongli Ye,
Grace Wang,
Xiaoyu Shi,
Tingle Li,
Guan-Ting Lin,
Kan Jen Cheng,
Huang-Cheng Chou,
Jiachen Lian,
Gopala Anumanchipalli
Abstract:
Emotions conveyed through voice and face shape engagement and context in human-AI interaction. Despite rapid progress in omni-modal large language models (LLMs), the holistic evaluation of emotional reasoning with audiovisual cues remains limited. To address this gap, we introduce AV-EMO-Reasoning, a benchmark designed to systematically assess emotional coherence in LLMs. The framework leverages a curated, single- and multi-turn synthetic audiovisual corpus with a real-world set and is assessed under continuous, categorical, and perceptual metrics. Experiments with leading LLMs show that visual cues reliably improve emotional coherence over audio-only baselines. Moreover, LLMs can leverage audio-visual cues to generate more emotion-aware speech. Models exhibit complementary strengths across metric families, indicating that automatic scores capture facets distinct from perceptual judgments. By releasing a systematic evaluation benchmark, AV-EMO-Reasoning offers a reproducible standard for evaluating emotion-aware dialogue and advances toward more natural, adaptive human-AI interaction.
Submitted 8 October, 2025;
originally announced October 2025.
-
Token Merging via Spatiotemporal Information Mining for Surgical Video Understanding
Authors:
Xixi Jiang,
Chen Yang,
Dong Zhang,
Pingcheng Dong,
Xin Yang,
Kwang-Ting Cheng
Abstract:
Vision Transformer models have shown impressive effectiveness in the surgical video understanding tasks through long-range dependency modeling. However, current methods suffer from prohibitive computational costs due to processing massive spatiotemporal tokens across video frames. While prior work on token merging has advanced model efficiency, they fail to adequately consider the inherent spatiotemporal structure of video data and overlook the heterogeneous nature of information distribution, leading to suboptimal performance. In this paper, we propose a spatiotemporal information mining token merging (STIM-TM) method, representing the first dedicated approach for surgical video understanding. STIM-TM introduces a decoupled strategy that reduces token redundancy along temporal and spatial dimensions independently. Specifically, the temporal component merges spatially corresponding tokens from consecutive frames using saliency weighting, preserving critical sequential information and maintaining continuity. Meanwhile, the spatial component prioritizes merging static tokens through temporal stability analysis, protecting dynamic regions containing essential surgical information. Operating in a training-free manner, STIM-TM achieves significant efficiency gains with over $65\%$ GFLOPs reduction while preserving competitive accuracy across comprehensive surgical video tasks. Our method also supports efficient training of long-sequence surgical videos, addressing computational bottlenecks in surgical applications.
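A minimal sketch of the temporal half of the decoupled strategy described above: spatially corresponding tokens from two consecutive frames are fused with saliency weights, halving the temporal token count. How saliency is obtained (e.g., from attention maps) is left abstract, and the softmax weighting is an assumption.

```python
import torch

def temporal_merge(tokens_prev, tokens_curr, saliency_prev, saliency_curr):
    """Fuse spatially corresponding tokens of two consecutive frames.
    tokens_*: [num_patches, dim], saliency_*: [num_patches]."""
    w = torch.stack([saliency_prev, saliency_curr], dim=-1)     # [P, 2]
    w = torch.softmax(w, dim=-1)                                # saliency weights
    merged = w[..., 0:1] * tokens_prev + w[..., 1:2] * tokens_curr
    return merged                                                # [P, dim], half the tokens

merged = temporal_merge(torch.randn(196, 768), torch.randn(196, 768),
                        torch.rand(196), torch.rand(196))
```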
Submitted 28 September, 2025;
originally announced September 2025.
-
LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence
Authors:
Xingxuan Zhang,
Gang Ren,
Han Yu,
Hao Yuan,
Hui Wang,
Jiansheng Li,
Jiayun Wu,
Lang Mo,
Li Mao,
Mingchao Hao,
Ningbo Dai,
Renzhe Xu,
Shuyang Li,
Tianyang Zhang,
Yue He,
Yuanrui Wang,
Yunjia Zhang,
Zijing Xu,
Dongzhe Li,
Fang Gao,
Hao Zou,
Jiandong Liu,
Jiashuo Liu,
Jiawei Xu,
Kaijie Cheng
, et al. (13 additional authors not shown)
Abstract:
We argue that progress toward general intelligence requires complementary foundation models grounded in language, the physical world, and structured data. This report presents LimiX-16M and LimiX-2M, two instantiations of our large structured-data models (LDMs). Both models treat structured data as a joint distribution over variables and missingness, thus capable of addressing a wide range of tabular tasks through query-based conditional prediction via a single model. They are pretrained using masked joint-distribution modeling with an episodic, context-conditional objective, supporting rapid, training-free adaptation at inference. We evaluate LimiX models across 11 large structured-data benchmarks with broad regimes of sample size, feature dimensionality, class number, categorical-to-numerical feature ratio, missingness, and sample-to-feature ratios. LimiX-16M consistently surpasses strong baselines, as shown in Figure 1 and Figure 2. The superiority holds across a wide range of tasks, such as classification, regression, missing value imputation, and data generation, often by substantial margins, while avoiding task-specific architectures or bespoke training per task. Notably, LimiX-2M delivers strong results under tight compute and memory budgets. We also present the first scaling law study for LDMs, revealing how data and model scaling jointly influence downstream performance and offering quantitative guidance for tabular foundation modeling. All LimiX models are publicly accessible under Apache 2.0.
Submitted 7 November, 2025; v1 submitted 3 September, 2025;
originally announced September 2025.
-
Rethinking Purity and Diversity in Multi-Behavior Sequential Recommendation from the Frequency Perspective
Authors:
Yongqiang Han,
Kai Cheng,
Kefan Wang,
Enhong Chen
Abstract:
In recommendation systems, users often exhibit multiple behaviors, such as browsing, clicking, and purchasing. Multi-behavior sequential recommendation (MBSR) aims to consider these different behaviors in an integrated manner to improve the recommendation performance of the target behavior. However, some behavior data also brings inevitable noise to the modeling of user interests. Some research efforts focus on data denoising from the frequency-domain perspective to improve the accuracy of user preference prediction. These studies indicate that low-frequency information tends to be valuable and reliable, while high-frequency information is often associated with noise. In this paper, we argue that high-frequency information is by no means insignificant. Further experimental results highlight that low frequency corresponds to the purity of user interests, while high frequency corresponds to the diversity of user interests. Building upon this finding, we propose our model PDB4Rec, which efficiently extracts information across various frequency bands and their relationships, and introduces a Bootstrapping Balancer mechanism to balance their contributions for improved recommendation performance. Extensive experiments on real-world datasets demonstrate the effectiveness and efficiency of our model.
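A short sketch of the frequency-domain view described above: an interaction-sequence embedding is split into a low-frequency component (interest purity) and a high-frequency component (interest diversity) with an rFFT along the time axis. The cutoff index and the hard split are illustrative assumptions, not the PDB4Rec band design.

```python
import torch

def frequency_split(seq_emb: torch.Tensor, cutoff: int):
    """Split a user's interaction sequence embedding into low- and
    high-frequency components. seq_emb: [seq_len, dim]."""
    spec = torch.fft.rfft(seq_emb, dim=0)
    low_spec, high_spec = spec.clone(), spec.clone()
    low_spec[cutoff:] = 0                      # keep only low-frequency bands
    high_spec[:cutoff] = 0                     # keep only high-frequency bands
    low = torch.fft.irfft(low_spec, n=seq_emb.size(0), dim=0)
    high = torch.fft.irfft(high_spec, n=seq_emb.size(0), dim=0)
    return low, high

low, high = frequency_split(torch.randn(50, 64), cutoff=8)
```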
Submitted 16 October, 2025; v1 submitted 28 August, 2025;
originally announced August 2025.
-
Backprompting: Leveraging Synthetic Production Data for Health Advice Guardrails
Authors:
Kellen Tan Cheng,
Anna Lisa Gentile,
Chad DeLuca,
Guang-Jie Ren
Abstract:
The pervasiveness of large language models (LLMs) in enterprise settings has also brought forth a significant amount of risks associated with their usage. Guardrails technologies aim to mitigate this risk by filtering LLMs' input/output text through various detectors. However, developing and maintaining robust detectors faces many challenges, one of which is the difficulty in acquiring production-…
▽ More
The pervasiveness of large language models (LLMs) in enterprise settings has also brought forth a significant amount of risk associated with their usage. Guardrails technologies aim to mitigate this risk by filtering LLMs' input/output text through various detectors. However, developing and maintaining robust detectors faces many challenges, one of which is the difficulty in acquiring production-quality labeled data on real LLM outputs prior to deployment. In this work, we propose backprompting, a simple yet intuitive solution to generate production-like labeled data for health advice guardrails development. Furthermore, we pair our backprompting method with a sparse human-in-the-loop clustering technique to label the generated data. Our aim is to construct a parallel corpus roughly representative of the original dataset yet resembling real LLM output. We then infuse existing datasets with our synthetic examples to produce robust training data for our detector. We test our technique on one of the most difficult and nuanced guardrails, the identification of health advice in LLM output, and demonstrate improvement over other solutions. Our detector is able to outperform GPT-4o by up to 3.73%, despite having 400x fewer parameters.
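The exact prompt design is not given in the abstract; the sketch below only illustrates the backprompting idea of turning labeled seed snippets into production-like assistant responses. The `generate` callable and the prompt wording are hypothetical stand-ins, not the authors' pipeline.

```python
# Hedged sketch: wrap labeled seed snippets into realistic assistant-style
# responses so the label carries over to a production-like example.
from typing import Callable, Dict, List

def backprompt_examples(seed_snippets: List[Dict[str, str]],
                        generate: Callable[[str], str]) -> List[Dict[str, str]]:
    """seed_snippets: [{'text': ..., 'label': 'advice' | 'no_advice'}, ...]."""
    synthetic = []
    for seed in seed_snippets:
        prompt = (
            "Write a realistic chatbot reply to a user question. "
            f"The reply must naturally include the following content: {seed['text']}"
        )
        synthetic.append({"text": generate(prompt), "label": seed["label"]})
    return synthetic

if __name__ == "__main__":
    seeds = [{"text": "you should take ibuprofen for the headache", "label": "advice"},
             {"text": "headaches have many possible causes", "label": "no_advice"}]
    # Toy generator standing in for an LLM call.
    fake_llm = lambda p: f"Sure! {p.split(': ', 1)[1].capitalize()}."
    for example in backprompt_examples(seeds, fake_llm):
        print(example)
```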
△ Less
Submitted 25 August, 2025;
originally announced August 2025.
-
EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Spoken Dialogue Systems
Authors:
Jingwen Liu,
Kan Jen Cheng,
Jiachen Lian,
Akshay Anand,
Rishi Jain,
Faith Qiao,
Robin Netzorg,
Huang-Cheng Chou,
Tingle Li,
Guan-Ting Lin,
Gopala Anumanchipalli
Abstract:
Speech emotions play a crucial role in human-computer interaction, shaping engagement and context-aware communication. Despite recent advances in spoken dialogue systems, a holistic system for evaluating emotional reasoning is still lacking. To address this, we introduce EMO-Reasoning, a benchmark for assessing emotional coherence in dialogue systems. It leverages a curated dataset generated via t…
▽ More
Speech emotions play a crucial role in human-computer interaction, shaping engagement and context-aware communication. Despite recent advances in spoken dialogue systems, a holistic system for evaluating emotional reasoning is still lacking. To address this, we introduce EMO-Reasoning, a benchmark for assessing emotional coherence in dialogue systems. It leverages a curated dataset generated via text-to-speech to simulate diverse emotional states, overcoming the scarcity of emotional speech data. We further propose the Cross-turn Emotion Reasoning Score to assess the emotion transitions in multi-turn dialogues. Evaluating seven dialogue systems through continuous, categorical, and perceptual metrics, we show that our framework effectively detects emotional inconsistencies, providing insights for improving current dialogue systems. By releasing a systematic evaluation benchmark, we aim to advance emotion-aware spoken dialogue modeling toward more natural and adaptive interactions.
△ Less
Submitted 25 August, 2025; v1 submitted 24 August, 2025;
originally announced August 2025.
-
P/D-Device: Disaggregated Large Language Model between Cloud and Devices
Authors:
Yibo Jin,
Yixu Xu,
Yue Chen,
Chengbin Wang,
Tao Wang,
Jiaqi Huang,
Rongfei Zhang,
Yiming Dong,
Yuting Yan,
Ke Cheng,
Yingjie Zhu,
Shulan Wang,
Qianqian Tang,
Shuaishuai Meng,
Guanxin Cheng,
Ze Wang,
Shuyan Miao,
Ketao Wang,
Wen Liu,
Yifan Yang,
Tong Zhang,
Anran Wang,
Chengzhou Lu,
Tiantian Dong,
Yongsheng Zhang
, et al. (5 additional authors not shown)
Abstract:
Serving disaggregated large language models has been widely adopted in industrial practice for enhanced performance. However, too many tokens generated in decoding phase, i.e., occupying the resources for a long time, essentially hamper the cloud from achieving a higher throughput. Meanwhile, due to limited on-device resources, the time to first token (TTFT), i.e., the latency of prefill phase, in…
▽ More
Serving disaggregated large language models has been widely adopted in industrial practice for enhanced performance. However, the many tokens generated in the decoding phase, which occupy resources for a long time, essentially prevent the cloud from achieving higher throughput. Meanwhile, due to limited on-device resources, the time to first token (TTFT), i.e., the latency of the prefill phase, increases dramatically with prompt length. To cope with this resource bottleneck, i.e., long occupation in the cloud and limited on-device computing capacity, we propose to split the large language model between cloud and devices. That is, the cloud generates a portion of the content for each device, but only in its prefill phase. Specifically, after receiving the first token from the cloud, and decoupled from its own prefill, the device responds to the user immediately for a lower TTFT. Then, the following tokens from the cloud are presented via a speed controller for a smoothed TPOT (time per output token), until the device catches up with the progress. On-device prefill is then amortized using the received tokens while the resource usage in the cloud is controlled. Moreover, during cloud prefill, the prompt can be refined, using intermediate data already generated, to further speed up on-device inference. We implement such a scheme, P/D-Device, and confirm its superiority over alternatives. We further propose an algorithm to decide the best settings. Real-trace experiments show that TTFT decreases by at least 60%, maximum TPOT is about tens of milliseconds, and cloud throughput increases by up to 15x.
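The abstract describes a speed controller that smooths TPOT while the device catches up; the sketch below simulates that handover with placeholder callables (`device_ready`, `decode_next`). It illustrates the pacing logic only and is not the P/D-Device implementation.

```python
# Hedged sketch of token pacing: cloud-produced tokens are released at a
# smoothed per-token interval, then generation hands over to on-device decoding.
import time
from collections import deque
from typing import Callable, Deque, Iterator, Optional

def paced_handover(cloud_tokens: Deque[str],
                   device_ready: Callable[[], bool],
                   decode_next: Callable[[], Optional[str]],
                   target_tpot_s: float = 0.05) -> Iterator[str]:
    """Yield tokens to the user with a smoothed time-per-output-token (TPOT)."""
    while cloud_tokens:
        yield cloud_tokens.popleft()      # tokens already produced by cloud prefill
        if device_ready() and not cloud_tokens:
            break                         # device caught up and cloud buffer drained
        time.sleep(target_tpot_s)         # speed controller: pace the presentation
    while True:                           # hand over to on-device decoding
        tok = decode_next()
        if tok is None:
            break
        yield tok

if __name__ == "__main__":
    cloud = deque(["The", " cloud", " wrote", " this", " prefix", "."])
    local = iter([" Then", " the", " device", " continues", ".", None])
    for t in paced_handover(cloud, device_ready=lambda: True,
                            decode_next=lambda: next(local), target_tpot_s=0.0):
        print(t, end="")
    print()
```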
△ Less
Submitted 12 August, 2025;
originally announced August 2025.
-
It's a Complete Haystack: Understanding Dependency Management Needs in Computer-Aided Design
Authors:
Kathy Cheng,
Alison Olechowski,
Shurui Zhou
Abstract:
In today's landscape, hardware development teams face increasing demands for better quality products, greater innovation, and shorter manufacturing lead times. Despite the need for more efficient and effective processes, hardware designers continue to struggle with a lack of awareness of design changes and other collaborators' actions, a persistent issue in decades of CSCW research. One significan…
▽ More
In today's landscape, hardware development teams face increasing demands for better quality products, greater innovation, and shorter manufacturing lead times. Despite the need for more efficient and effective processes, hardware designers continue to struggle with a lack of awareness of design changes and other collaborators' actions, a persistent issue in decades of CSCW research. One significant and unaddressed challenge is understanding and managing dependencies between 3D CAD (computer-aided design) models, especially when products can contain thousands of interconnected components. In this two-phase formative study, we explore designers' pain points in CAD dependency management through a thematic analysis of 100 online forum discussions and semi-structured interviews with 10 designers. We identify nine key challenges related to the traceability, navigation, and consistency of CAD dependencies that harm the effective coordination of hardware development teams. To address these challenges, we propose design goals and necessary features to enhance hardware designers' awareness and management of dependencies, ultimately with the goal of improving collaborative workflows.
△ Less
Submitted 7 August, 2025;
originally announced August 2025.
-
A Novel Modeling Framework and Data Product for Extended VIIRS-like Artificial Nighttime Light Image Reconstruction (1986-2024)
Authors:
Yihe Tian,
Kwan Man Cheng,
Zhengbo Zhang,
Tao Zhang,
Suju Li,
Dongmei Yan,
Bing Xu
Abstract:
Artificial Night-Time Light (NTL) remote sensing is a vital proxy for quantifying the intensity and spatial distribution of human activities. Although the NPP-VIIRS sensor provides high-quality NTL observations, its temporal coverage, which begins in 2012, restricts long-term time-series studies that extend to earlier periods. Despite the progress in extending VIIRS-like NTL time-series, current m…
▽ More
Artificial Night-Time Light (NTL) remote sensing is a vital proxy for quantifying the intensity and spatial distribution of human activities. Although the NPP-VIIRS sensor provides high-quality NTL observations, its temporal coverage, which begins in 2012, restricts long-term time-series studies that extend to earlier periods. Despite progress in extending VIIRS-like NTL time series, current methods still suffer from two significant shortcomings: underestimation of light intensity and omission of structural detail. To overcome these limitations, we propose a novel reconstruction framework consisting of a two-stage process: construction and refinement. The construction stage features a Hierarchical Fusion Decoder (HFD) designed to enhance the fidelity of the initial reconstruction. The refinement stage employs a Dual Feature Refiner (DFR), which leverages high-resolution impervious surface masks to guide and enhance fine-grained structural details. Based on this framework, we developed the Extended VIIRS-like Artificial Nighttime Light (EVAL) product for China, extending the standard data record backwards by 26 years to begin in 1986. Quantitative evaluation shows that EVAL significantly outperforms existing state-of-the-art products, boosting the $\text{R}^2$ from 0.68 to 0.80 while lowering the RMSE from 1.27 to 0.99. Furthermore, EVAL exhibits excellent temporal consistency and maintains a high correlation with socioeconomic parameters, confirming its reliability for long-term analysis. The resulting EVAL dataset provides a valuable new resource for the research community and is publicly available at https://doi.org/10.11888/HumanNat.tpdc.302930.
△ Less
Submitted 1 August, 2025;
originally announced August 2025.
-
OneShield -- the Next Generation of LLM Guardrails
Authors:
Chad DeLuca,
Anna Lisa Gentile,
Shubhi Asthana,
Bing Zhang,
Pawan Chowdhary,
Kellen Cheng,
Basel Shbita,
Pengyuan Li,
Guang-Jie Ren,
Sandeep Gopisetty
Abstract:
The rise of Large Language Models has created a general excitement about the great potential for a myriad of applications. While LLMs offer many possibilities, questions about safety, privacy, and ethics have emerged, and all the key actors are working to address these issues with protective measures for their own models and standalone solutions. The constantly evolving nature of LLMs makes it ext…
▽ More
The rise of Large Language Models has created general excitement about their great potential for a myriad of applications. While LLMs offer many possibilities, questions about safety, privacy, and ethics have emerged, and all the key actors are working to address these issues with protective measures for their own models and standalone solutions. The constantly evolving nature of LLMs makes it extremely challenging to universally shield users against their potential risks, and one-size-fits-all solutions are infeasible. In this work, we propose OneShield, our stand-alone, model-agnostic, and customizable solution to safeguard LLMs. OneShield aims to provide facilities for defining risk factors, expressing and declaring contextual safety and compliance policies, and mitigating LLM risks, with a focus on each specific customer. We describe the implementation of the framework, discuss scalability considerations, and provide usage statistics of OneShield since its initial deployment.
△ Less
Submitted 31 July, 2025; v1 submitted 25 July, 2025;
originally announced July 2025.
-
Guard-GBDT: Efficient Privacy-Preserving Approximated GBDT Training on Vertical Dataset
Authors:
Anxiao Song,
Shujie Cui,
Jianli Bai,
Ke Cheng,
Yulong Shen,
Giovanni Russello
Abstract:
In light of increasing privacy concerns and stringent legal regulations, using secure multiparty computation (MPC) to enable collaborative GBDT model training among multiple data owners has garnered significant attention. Despite this, existing MPC-based GBDT frameworks face efficiency challenges due to high communication costs and the computation burden of non-linear operations, such as division…
▽ More
In light of increasing privacy concerns and stringent legal regulations, using secure multiparty computation (MPC) to enable collaborative GBDT model training among multiple data owners has garnered significant attention. Despite this, existing MPC-based GBDT frameworks face efficiency challenges due to high communication costs and the computational burden of non-linear operations, such as division and sigmoid calculations. In this work, we introduce Guard-GBDT, an innovative framework tailored for efficient and privacy-preserving GBDT training on vertical datasets. Guard-GBDT bypasses MPC-unfriendly division and sigmoid functions by using more streamlined approximations, and it reduces communication overhead by compressing the messages exchanged during gradient aggregation. We implement a prototype of Guard-GBDT and extensively evaluate its performance and accuracy on various real-world datasets. The results show that Guard-GBDT outperforms state-of-the-art HEP-XGB (CIKM'21) and SiGBDT (ASIA CCS'24) by up to $2.71\times$ and $12.21\times$ on a LAN network and up to $2.7\times$ and $8.2\times$ on a WAN network. Guard-GBDT also achieves accuracy comparable to SiGBDT and plaintext XGBoost (and better than HEP-XGB), with a deviation of only $\pm1\%$ to $\pm2\%$. Our implementation code is provided at https://github.com/XidianNSS/Guard-GBDT.git.
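The abstract does not state which approximations Guard-GBDT actually substitutes for the sigmoid. As an illustration of why such substitutions are MPC-friendly, the sketch below compares the exact sigmoid against a common piecewise-linear stand-in that needs only comparisons, additions, and multiplications by public constants, with no secure division or exponentiation.

```python
# Illustrative comparison of the exact sigmoid and a piecewise-linear
# approximation of the kind MPC protocols prefer (this specific approximation
# is an assumption, not the one used in the paper).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pwl_sigmoid(x):
    """Three-segment approximation: clip(0.25*x + 0.5, 0, 1)."""
    return np.clip(0.25 * x + 0.5, 0.0, 1.0)

if __name__ == "__main__":
    x = np.linspace(-6, 6, 1001)
    err = np.abs(sigmoid(x) - pwl_sigmoid(x))
    print(f"max abs error: {err.max():.3f}, mean abs error: {err.mean():.3f}")
```

The trade-off is a bounded approximation error in exchange for circuits that are cheap to evaluate under secret sharing, which is consistent with the small accuracy deviation the paper reports.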
△ Less
Submitted 21 December, 2025; v1 submitted 28 July, 2025;
originally announced July 2025.
-
Calligrapher: Freestyle Text Image Customization
Authors:
Yue Ma,
Qingyan Bai,
Hao Ouyang,
Ka Leong Cheng,
Qiuyu Wang,
Hongyu Liu,
Zichen Liu,
Haofan Wang,
Jingye Chen,
Yujun Shen,
Qifeng Chen
Abstract:
We introduce Calligrapher, a novel diffusion-based framework that innovatively integrates advanced text customization with artistic typography for digital calligraphy and design applications. Addressing the challenges of precise style control and data dependency in typographic customization, our framework incorporates three key technical contributions. First, we develop a self-distillation mechani…
▽ More
We introduce Calligrapher, a novel diffusion-based framework that innovatively integrates advanced text customization with artistic typography for digital calligraphy and design applications. Addressing the challenges of precise style control and data dependency in typographic customization, our framework incorporates three key technical contributions. First, we develop a self-distillation mechanism that leverages the pre-trained text-to-image generative model itself alongside the large language model to automatically construct a style-centric typography benchmark. Second, we introduce a localized style injection framework via a trainable style encoder, which comprises both Qformer and linear layers, to extract robust style features from reference images. An in-context generation mechanism is also employed to directly embed reference images into the denoising process, further enhancing the refined alignment of target styles. Extensive quantitative and qualitative evaluations across diverse fonts and design contexts confirm Calligrapher's accurate reproduction of intricate stylistic details and precise glyph positioning. By automating high-quality, visually consistent typography, Calligrapher surpasses traditional models, empowering creative practitioners in digital art, branding, and contextual typographic design.
△ Less
Submitted 30 June, 2025;
originally announced June 2025.
-
Low-Cost Infrastructure-Free 3D Relative Localization with Sub-Meter Accuracy in Near Field
Authors:
Qiangsheng Gao,
Ka Ho Cheng,
Li Qiu,
Zijun Gong
Abstract:
Relative localization in the near-field scenario is critically important for unmanned vehicle (UxV) applications. Although related works addressing 2D relative localization problem have been widely studied for unmanned ground vehicles (UGVs), the problem in 3D scenarios for unmanned aerial vehicles (UAVs) involves more uncertainties and remains to be investigated. Inspired by the phenomenon that a…
▽ More
Relative localization in the near-field scenario is critically important for unmanned vehicle (UxV) applications. Although related works addressing 2D relative localization problem have been widely studied for unmanned ground vehicles (UGVs), the problem in 3D scenarios for unmanned aerial vehicles (UAVs) involves more uncertainties and remains to be investigated. Inspired by the phenomenon that animals can achieve swarm behaviors solely based on individual perception of relative information, this study proposes an infrastructure-free 3D relative localization framework that relies exclusively on onboard ultra-wideband (UWB) sensors. Leveraging 2D relative positioning research, we conducted feasibility analysis, system modeling, simulations, performance evaluation, and field tests using UWB sensors. The key contributions of this work include: derivation of the Cramér-Rao lower bound (CRLB) and geometric dilution of precision (GDOP) for near-field scenarios; development of two localization algorithms -- one based on Euclidean distance matrix (EDM) and another employing maximum likelihood estimation (MLE); comprehensive performance comparison and computational complexity analysis against state-of-the-art methods; simulation studies and field experiments; a novel sensor deployment strategy inspired by animal behavior, enabling single-sensor implementation within the proposed framework for UxV applications. The theoretical, simulation, and experimental results demonstrate strong generalizability to other 3D near-field localization tasks, with significant potential for a cost-effective cross-platform UxV collaborative system.
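As background for the EDM-based algorithm mentioned above, the sketch below shows the textbook route from a matrix of pairwise UWB ranges to relative 3D coordinates via classical multidimensional scaling. The authors' estimator may differ, and the recovery is only up to a rigid transform (and reflection).

```python
# Textbook sketch: recover relative 3D positions from a Euclidean distance
# matrix (EDM) of noiseless ranges via classical MDS.
import numpy as np

def classical_mds(D: np.ndarray, dim: int = 3) -> np.ndarray:
    """D: (n, n) pairwise distances. Returns (n, dim) relative coordinates."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]              # top-`dim` eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    true_pts = rng.uniform(-5, 5, size=(6, 3))   # e.g., 6 UAVs in a 10 m cube
    D = np.linalg.norm(true_pts[:, None] - true_pts[None, :], axis=-1)
    rec = classical_mds(D)
    D_rec = np.linalg.norm(rec[:, None] - rec[None, :], axis=-1)
    print("max pairwise-distance error:", np.abs(D - D_rec).max())
```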
△ Less
Submitted 23 June, 2025;
originally announced June 2025.
-
From Multimodal Perception to Strategic Reasoning: A Survey on AI-Generated Game Commentary
Authors:
Qirui Zheng,
Xingbo Wang,
Keyuan Cheng,
Muhammad Asif Ali,
Yunlong Lu,
Wenxin Li
Abstract:
The advent of artificial intelligence has propelled AI-Generated Game Commentary (AI-GGC) into a rapidly expanding field, offering benefits such as unlimited availability and personalized narration. However, current researches in this area remain fragmented, and a comprehensive survey that systematically unifies existing efforts is still missing. To bridge this gap, our survey introduces a unified…
▽ More
The advent of artificial intelligence has propelled AI-Generated Game Commentary (AI-GGC) into a rapidly expanding field, offering benefits such as unlimited availability and personalized narration. However, current researches in this area remain fragmented, and a comprehensive survey that systematically unifies existing efforts is still missing. To bridge this gap, our survey introduces a unified framework that systematically organizes the AI-GGC landscape. We present a novel taxonomy focused on three core commentator capabilities: Live Observation, Strategic Analysis, and Historical Recall. Commentary is further categorized into three functional types: Descriptive, Analytical, and Background. Building on this structure, we provide an in-depth review of state-of-the-art methods, datasets, and evaluation metrics across various game genres. Finally, we highlight key challenges such as real-time reasoning, multimodal integration, and evaluation bottlenecks, and outline promising directions for future research and system development in AI-GGC.
△ Less
Submitted 18 October, 2025; v1 submitted 17 June, 2025;
originally announced June 2025.
-
Loss-Oriented Ranking for Automated Visual Prompting in LVLMs
Authors:
Yuan Zhang,
Chun-Kai Fan,
Tao Huang,
Ming Lu,
Sicheng Yu,
Junwen Pan,
Kuan Cheng,
Qi She,
Shanghang Zhang
Abstract:
Inspired by text prompts in large language models (LLMs), visual prompts have been explored to enhance the reasoning capabilities of large vision-language models (LVLMs). Current methods design heuristic visual prompts, such as overlaying a text-query-guided attention heatmap on the original input image. However, designing effective prompts manually is challenging and time-consuming, and it often…
▽ More
Inspired by text prompts in large language models (LLMs), visual prompts have been explored to enhance the reasoning capabilities of large vision-language models (LVLMs). Current methods design heuristic visual prompts, such as overlaying a text-query-guided attention heatmap on the original input image. However, designing effective prompts manually is challenging and time-consuming, and it often fails to explore the benefits of different visual prompts, leading to sub-optimal performance. To this end, we propose AutoV, which learns to automatically select the optimal visual prompt from various candidates based on given textual queries and the input image. To train AutoV, we develop an automatic data collection and labeling pipeline that evaluates various visual prompts with a pre-trained LVLM. We input a set of visual prompts into the LVLM and rank them according to the prediction losses generated by the model. Using the ranking as a supervision signal, we train AutoV to automatically choose the optimal visual prompt from various visual prompts for LVLMs. Experiments indicate that AutoV enhances the performance of various LVLMs across multiple image understanding tasks. For instance, LLaVA-OV with AutoV achieves a 10.2% accuracy gain on VizWiz, and AutoV boosts Qwen2.5-VL by 3.8% on MMMU, highlighting its potential as an optimal visual prompting method.
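A minimal sketch of the loss-oriented ranking step follows, with `lvlm_answer_loss` as a hypothetical stand-in for scoring a (prompted image, question, reference answer) triple with a frozen LVLM; the actual candidate prompt set and scoring pipeline are not specified here.

```python
# Hedged sketch: score each candidate visual prompt by the loss a frozen LVLM
# assigns to the reference answer, and use the resulting ranking as the
# supervision signal for training a prompt selector.
from typing import Callable, List, Tuple

def rank_visual_prompts(image_variants: List[str],
                        question: str,
                        reference_answer: str,
                        lvlm_answer_loss: Callable[[str, str, str], float]
                        ) -> List[Tuple[str, float]]:
    """Return (variant, loss) pairs sorted from best (lowest loss) to worst."""
    scored = [(v, lvlm_answer_loss(v, question, reference_answer))
              for v in image_variants]
    return sorted(scored, key=lambda pair: pair[1])

if __name__ == "__main__":
    # Toy scorer: pretend the heatmap-overlaid variant helps the model most.
    fake_losses = {"raw": 2.3, "attention_heatmap": 1.1, "cropped_region": 1.7}
    ranking = rank_visual_prompts(list(fake_losses), "What is shown?", "a dog",
                                  lambda v, q, a: fake_losses[v])
    print(ranking)   # best-first ranking used as the selector's training signal
```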
△ Less
Submitted 21 November, 2025; v1 submitted 19 June, 2025;
originally announced June 2025.
-
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
Authors:
Qianhui Wu,
Kanzhi Cheng,
Rui Yang,
Chaoyun Zhang,
Jianwei Yang,
Huiqiang Jiang,
Jian Mu,
Baolin Peng,
Bo Qiao,
Reuben Tan,
Si Qin,
Lars Liden,
Qingwei Lin,
Huan Zhang,
Tong Zhang,
Jianbing Zhang,
Dongmei Zhang,
Jianfeng Gao
Abstract:
One of the principal challenges in building VLM-powered GUI agents is visual grounding, i.e., localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment, inability to hand…
▽ More
One of the principal challenges in building VLM-powered GUI agents is visual grounding, i.e., localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment, inability to handle ambiguous supervision targets, and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models like Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated <ACTOR> token with all relevant visual patch tokens, enabling the model to propose one or more action regions in a single forward pass. In line with this, we further design a grounding verifier to evaluate and select the most plausible action region from the candidates proposed for action execution. Extensive experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts. Notably, GUI-Actor-7B even surpasses UI-TARS-72B (38.1) on ScreenSpot-Pro, achieving scores of 40.7 with Qwen2-VL and 44.6 with Qwen2.5-VL as backbones. Furthermore, by incorporating the verifier, we find that fine-tuning only the newly introduced action head (~100M parameters for 7B model) while keeping the VLM backbone frozen is sufficient to achieve performance comparable to previous state-of-the-art models, highlighting that GUI-Actor can endow the underlying VLM with effective grounding capabilities without compromising its general-purpose strengths.
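The following PyTorch sketch illustrates the shape of an attention-based action head: a dedicated actor query attends over visual patch tokens, and the attention map itself proposes regions rather than generated coordinates. Dimensions and projection layers are assumptions, not the released GUI-Actor code.

```python
# Minimal sketch of an attention-based action head over patch tokens.
import torch
import torch.nn as nn

class ActorAttentionHead(nn.Module):
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.q_proj = nn.Linear(hidden, hidden)
        self.k_proj = nn.Linear(hidden, hidden)

    def forward(self, actor_token: torch.Tensor, patch_tokens: torch.Tensor):
        """actor_token: (B, H); patch_tokens: (B, N, H).
        Returns (B, N) attention over patches, read as candidate action regions."""
        q = self.q_proj(actor_token).unsqueeze(1)            # (B, 1, H)
        k = self.k_proj(patch_tokens)                        # (B, N, H)
        scores = (q @ k.transpose(1, 2)).squeeze(1)          # (B, N)
        return torch.softmax(scores / k.shape[-1] ** 0.5, dim=-1)

if __name__ == "__main__":
    head = ActorAttentionHead()
    attn = head(torch.randn(2, 768), torch.randn(2, 196, 768))  # 14x14 patch grid
    top_patch = attn.argmax(dim=-1)        # most plausible patch per sample
    print(attn.shape, top_patch)
```

A verifier of the kind described above would then re-score the top-scoring regions and pick the one to act on.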
△ Less
Submitted 3 June, 2025;
originally announced June 2025.
-
CODEMENV: Benchmarking Large Language Models on Code Migration
Authors:
Keyuan Cheng,
Xudong Shen,
Yihao Yang,
Tengyue Wang,
Yang Cao,
Muhammad Asif Ali,
Hanbin Wang,
Lijie Hu,
Di Wang
Abstract:
Large language models (LLMs) have shown remarkable capabilities across various software engineering tasks; however, their effectiveness in code migration, adapting code to run in different environments, remains insufficiently studied. In this work, we introduce CODEMENV: Code Migration Across Environment, a new benchmark specifically designed to assess LLMs' abilities in code migration scenarios.…
▽ More
Large language models (LLMs) have shown remarkable capabilities across various software engineering tasks; however, their effectiveness in code migration, adapting code to run in different environments, remains insufficiently studied. In this work, we introduce CODEMENV: Code Migration Across Environment, a new benchmark specifically designed to assess LLMs' abilities in code migration scenarios. CODEMENV consists of 922 examples spanning 19 Python and Java packages, and covers three core tasks: (1) identifying functions incompatible with specific versions, (2) detecting changes in function definitions, and (3) adapting code to target environments. Experimental evaluation with seven LLMs on CODEMENV yields an average pass@1 rate of 26.50%, with GPT-4O achieving the highest score at 43.84%. Key findings include: (i) LLMs tend to be more proficient with newer function versions, which aids in migrating legacy code, and (ii) LLMs sometimes exhibit logical inconsistencies by identifying function changes irrelevant to the intended migration environment. The datasets are available at https://github.com/xdshen-ai/Benchmark-of-Code-Migration.
△ Less
Submitted 1 June, 2025;
originally announced June 2025.
-
COMPKE: Complex Question Answering under Knowledge Editing
Authors:
Keyuan Cheng,
Zijian Kan,
Zhixian He,
Zhuoran Zhang,
Muhammad Asif Ali,
Ke Xu,
Lijie Hu,
Di Wang
Abstract:
Knowledge Editing, which efficiently modifies the knowledge in large language models, has gathered great attention. Current benchmarks primarily use multi-hop question answering to assess and analyze newly injected or updated knowledge. However, we argue that these benchmarks fail to effectively evaluate how well the updated models apply this knowledge in real-life scenarios, particularly when que…
▽ More
Knowledge Editing, which efficiently modifies the knowledge in large language models, has attracted great attention. Current benchmarks primarily use multi-hop question answering to assess and analyze newly injected or updated knowledge. However, we argue that these benchmarks fail to effectively evaluate how well the updated models apply this knowledge in real-life scenarios, particularly when questions require complex reasoning involving one-to-many relationships or multi-step logical intersections. To fill this gap, we introduce a new benchmark, COMPKE: Complex Question Answering under Knowledge Editing, which includes 11,924 complex questions that reflect real-life situations. We conduct an extensive evaluation of four knowledge editing methods on COMPKE, revealing that their effectiveness varies notably across different models. For instance, MeLLo attains an accuracy of 39.47 on GPT-4O-MINI, but this drops sharply to 3.83 on QWEN2.5-3B. We further investigate the underlying causes of these disparities from both methodological and model-specific perspectives. The datasets are available at https://github.com/kzjkzj666/CompKE.
△ Less
Submitted 3 June, 2025; v1 submitted 1 June, 2025;
originally announced June 2025.
-
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
Authors:
Qiushi Sun,
Zhoumianze Liu,
Chang Ma,
Zichen Ding,
Fangzhi Xu,
Zhangyue Yin,
Haiteng Zhao,
Zhenyu Wu,
Kanzhi Cheng,
Zhaoyang Liu,
Jianing Wang,
Qintong Li,
Xiangru Tang,
Tianbao Xie,
Xiachong Feng,
Xiang Li,
Ben Kao,
Wenhai Wang,
Biqing Qi,
Lingpeng Kong,
Zhiyong Wu
Abstract:
Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery progress across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are pavin…
▽ More
Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery progress across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are paving the way to automated scientific problem-solving and addressing routines in researchers' workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights for addressing current agent limitations and more effective design principles, paving the way to build more capable agents for scientific discovery. Our code, environment, and benchmark are at https://qiushisun.github.io/ScienceBoard-Home/.
△ Less
Submitted 27 June, 2025; v1 submitted 26 May, 2025;
originally announced May 2025.
-
DecLock: A Case of Decoupled Locking for Disaggregated Memory
Authors:
Hanze Zhang,
Ke Cheng,
Rong Chen,
Xingda Wei,
Haibo Chen
Abstract:
This paper reveals that locking can significantly degrade the performance of applications on disaggregated memory (DM), sometimes by several orders of magnitude, due to contention on the NICs of memory nodes (MN-NICs). To address this issue, we present DecLock, a locking mechanism for DM that employs decentralized coordination for ownership transfer across compute nodes (CNs) while retaining centr…
▽ More
This paper reveals that locking can significantly degrade the performance of applications on disaggregated memory (DM), sometimes by several orders of magnitude, due to contention on the NICs of memory nodes (MN-NICs). To address this issue, we present DecLock, a locking mechanism for DM that employs decentralized coordination for ownership transfer across compute nodes (CNs) while retaining centralized state maintenance on memory nodes (MNs). DecLock features cooperative queue-notify locking that queues lock waiters on MNs atomically, enabling clients to transfer lock ownership via message-based notifications between CNs. This approach conserves MN-NIC resources for DM applications and ensures fairness. Evaluations show DecLock achieves throughput improvements of up to 43.37$\times$ and 1.81$\times$ over state-of-the-art RDMA-based spinlocks and MCS locks, respectively. Furthermore, DecLock helps two DM applications, including an object store and a real-world database index (Sherman), avoid performance degradation under high contention, improving throughput by up to 35.60$\times$ and 2.31$\times$ and reducing 99th-percentile latency by up to 98.8% and 82.1%.
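Below is a single-process Python sketch of the cooperative queue-notify shape described above: waiters enqueue once at the memory node, and ownership then moves directly between compute nodes via notifications instead of spinning on the memory node's NIC. `threading` primitives stand in for RDMA atomics and messages; this illustrates the protocol shape, not the RDMA implementation.

```python
# Toy model: a deque plays the lock queue kept at the memory node (MN); each
# waiting compute node (CN) blocks on its own Event and is woken directly by
# the previous owner rather than polling the MN.
import threading
from collections import deque

class QueueNotifyLock:
    def __init__(self):
        self._mn_queue = deque()
        self._mn_atomic = threading.Lock()   # stands in for an RDMA atomic op

    def acquire(self):
        me = threading.Event()
        with self._mn_atomic:                # one atomic enqueue at the MN
            was_empty = not self._mn_queue
            self._mn_queue.append(me)
        if was_empty:
            me.set()                         # queue was empty: immediate ownership
        me.wait()                            # else wait for a CN-to-CN notification

    def release(self):
        with self._mn_atomic:                # one atomic dequeue at the MN
            self._mn_queue.popleft()         # remove ourselves (the current owner)
            successor = self._mn_queue[0] if self._mn_queue else None
        if successor is not None:
            successor.set()                  # notify the next waiter directly

if __name__ == "__main__":
    lock, order = QueueNotifyLock(), []
    def worker(i):
        lock.acquire()
        order.append(i)                      # critical section
        lock.release()
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("all 4 critical sections completed, one owner at a time:", order)
```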
△ Less
Submitted 23 May, 2025;
originally announced May 2025.
-
Rethinking Boundary Detection in Deep Learning-Based Medical Image Segmentation
Authors:
Yi Lin,
Dong Zhang,
Xiao Fang,
Yufan Chen,
Kwang-Ting Cheng,
Hao Chen
Abstract:
Medical image segmentation is a pivotal task within the realms of medical image analysis and computer vision. While current methods have shown promise in accurately segmenting major regions of interest, the precise segmentation of boundary areas remains challenging. In this study, we propose a novel network architecture named CTO, which combines Convolutional Neural Networks (CNNs), Vision Transfo…
▽ More
Medical image segmentation is a pivotal task within the realms of medical image analysis and computer vision. While current methods have shown promise in accurately segmenting major regions of interest, the precise segmentation of boundary areas remains challenging. In this study, we propose a novel network architecture named CTO, which combines Convolutional Neural Networks (CNNs), Vision Transformer (ViT) models, and explicit edge detection operators to tackle this challenge. CTO surpasses existing methods in terms of segmentation accuracy and strikes a better balance between accuracy and efficiency, without the need for additional data inputs or label injections. Specifically, CTO adheres to the canonical encoder-decoder network paradigm, with a dual-stream encoder network comprising a mainstream CNN stream for capturing local features and an auxiliary StitchViT stream for integrating long-range dependencies. Furthermore, to enhance the model's ability to learn boundary areas, we introduce a boundary-guided decoder network that employs binary boundary masks generated by dedicated edge detection operators to provide explicit guidance during the decoding process. We validate the performance of CTO through extensive experiments conducted on seven challenging medical image segmentation datasets, namely ISIC 2016, PH2, ISIC 2018, CoNIC, LiTS17, and BTCV. Our experimental results unequivocally demonstrate that CTO achieves state-of-the-art accuracy on these datasets while maintaining competitive model complexity. The codes have been released at: https://github.com/xiaofang007/CTO.
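As an illustration of the explicit boundary guidance mentioned above, the sketch below derives a binary boundary mask from a ground-truth segmentation with a Sobel operator; the released CTO code may generate its boundary masks differently.

```python
# Sketch: Sobel-based binary boundary mask from a binary segmentation mask.
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def conv2d_same(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    k = kernel.shape[0]
    padded = np.pad(img, k // 2, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

def boundary_mask(seg: np.ndarray) -> np.ndarray:
    """seg: (H, W) binary mask -> (H, W) binary boundary mask."""
    seg = seg.astype(float)
    gx, gy = conv2d_same(seg, SOBEL_X), conv2d_same(seg, SOBEL_Y)
    return (np.hypot(gx, gy) > 0).astype(np.uint8)

if __name__ == "__main__":
    seg = np.zeros((9, 9), dtype=np.uint8)
    seg[3:6, 3:6] = 1                       # a 3x3 square "organ"
    print(boundary_mask(seg))               # a thick ring of ones along the border
```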
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
APSQ: Additive Partial Sum Quantization with Algorithm-Hardware Co-Design
Authors:
Yonghao Tan,
Pingcheng Dong,
Yongkun Wu,
Yu Liu,
Xuejiao Liu,
Peng Luo,
Shih-Yang Liu,
Xijie Huang,
Dong Zhang,
Luhong Liang,
Kwang-Ting Cheng
Abstract:
DNN accelerators, significantly advanced by model compression and specialized dataflow techniques, have marked considerable progress. However, the frequent access of high-precision partial sums (PSUMs) leads to excessive memory demands in architectures utilizing input/weight stationary dataflows. Traditional compression strategies have typically overlooked PSUM quantization, which may account for…
▽ More
DNN accelerators, significantly advanced by model compression and specialized dataflow techniques, have marked considerable progress. However, the frequent access of high-precision partial sums (PSUMs) leads to excessive memory demands in architectures utilizing input/weight-stationary dataflows. Traditional compression strategies have typically overlooked PSUM quantization, which may account for 69% of power consumption. This study introduces a novel Additive Partial Sum Quantization (APSQ) method, seamlessly integrating PSUM accumulation into the quantization framework. A grouping strategy that combines APSQ with PSUM quantization, enhanced by a reconfigurable architecture, is further proposed. APSQ is nearly lossless on NLP and CV tasks across BERT, Segformer, and EfficientViT models while compressing PSUMs to INT8, leading to a notable reduction in energy costs of 28-87%. Extended experiments on LLaMA2-7B demonstrate the potential of APSQ for large language models. Code is available at https://github.com/Yonghao-Tan/APSQ.
△ Less
Submitted 10 April, 2025;
originally announced May 2025.
-
Relation-R1: Progressively Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relation Comprehension
Authors:
Lin Li,
Wei Chen,
Jiahui Li,
Kwang-Ting Cheng,
Long Chen
Abstract:
Recent advances in multi-modal large language models (MLLMs) have significantly improved object-level grounding and region captioning. However, they remain limited in visual relation understanding, struggling even with binary relation detection, let alone \textit{N}-ary relations involving multiple semantic roles. The core reason is the lack of modeling for \textit{structural semantic dependencies…
▽ More
Recent advances in multi-modal large language models (MLLMs) have significantly improved object-level grounding and region captioning. However, they remain limited in visual relation understanding, struggling even with binary relation detection, let alone N-ary relations involving multiple semantic roles. The core reason is the lack of modeling for structural semantic dependencies among multiple entities, leading to unreliable outputs, hallucinations, and over-reliance on language priors (e.g., defaulting to "person drinks milk" if a person is merely holding it). To this end, we propose Relation-R1, the first unified relation comprehension framework that explicitly integrates cognitive chain-of-thought (CoT)-guided supervised fine-tuning (SFT) and group relative policy optimization (GRPO) within a reinforcement learning (RL) paradigm. Specifically, we first establish foundational reasoning capabilities via SFT, enforcing structured outputs with thinking processes. Then, GRPO is utilized to refine these outputs via multi-reward optimization, prioritizing visual-semantic grounding over language-induced biases, thereby improving generalization capability. Furthermore, we investigate the impact of various CoT strategies within this framework, demonstrating that a specific-to-general progressive approach in CoT guidance further improves generalization, especially in capturing synonymous N-ary relations. Extensive experiments on the widely-used PSG and SWiG datasets demonstrate that Relation-R1 achieves state-of-the-art performance in both binary and N-ary relation understanding.
△ Less
Submitted 13 December, 2025; v1 submitted 20 April, 2025;
originally announced April 2025.
-
Modular-Cam: Modular Dynamic Camera-view Video Generation with LLM
Authors:
Zirui Pan,
Xin Wang,
Yipeng Zhang,
Hong Chen,
Kwan Man Cheng,
Yaofei Wu,
Wenwu Zhu
Abstract:
Text-to-Video generation, which utilizes the provided text prompt to generate high-quality videos, has drawn increasing attention and achieved great success due to the development of diffusion models recently. Existing methods mainly rely on a pre-trained text encoder to capture the semantic information and perform cross attention with the encoded text prompt to guide the generation of video. Howe…
▽ More
Text-to-Video generation, which utilizes a provided text prompt to generate high-quality videos, has drawn increasing attention and achieved great success due to the recent development of diffusion models. Existing methods mainly rely on a pre-trained text encoder to capture the semantic information and perform cross-attention with the encoded text prompt to guide the generation of the video. However, when it comes to complex prompts that contain dynamic scenes and multiple camera-view transformations, these methods cannot decompose the overall information into separate scenes and fail to smoothly change scenes based on the corresponding camera views. To solve these problems, we propose a novel method, Modular-Cam. Specifically, to better understand a given complex prompt, we utilize a large language model to analyze user instructions and decouple them into multiple scenes together with transition actions. To generate a video containing dynamic scenes that match the given camera views, we incorporate the widely-used temporal transformer into the diffusion model to ensure continuity within a single scene, and we propose CamOperator, a modular network that finely controls the camera movements. Moreover, we propose AdaControlNet, which utilizes ControlNet to ensure consistency across scenes and adaptively adjusts the color tone of the generated video. Extensive qualitative and quantitative experiments demonstrate Modular-Cam's strong capability for generating multi-scene videos, together with its ability to achieve fine-grained control of camera movements. Generated results are available at https://modular-cam.github.io.
△ Less
Submitted 16 April, 2025;
originally announced April 2025.
-
SCFlow2: Plug-and-Play Object Pose Refiner with Shape-Constraint Scene Flow
Authors:
Qingyuan Wang,
Rui Song,
Jiaojiao Li,
Kerui Cheng,
David Ferstl,
Yinlin Hu
Abstract:
We introduce SCFlow2, a plug-and-play refinement framework for 6D object pose estimation. Most recent 6D object pose methods rely on refinement to get accurate results. However, most existing refinement methods either suffer from noises in establishing correspondences, or rely on retraining for novel objects. SCFlow2 is based on the SCFlow model designed for refinement with shape constraint, but f…
▽ More
We introduce SCFlow2, a plug-and-play refinement framework for 6D object pose estimation. Most recent 6D object pose methods rely on refinement to get accurate results. However, most existing refinement methods either suffer from noises in establishing correspondences, or rely on retraining for novel objects. SCFlow2 is based on the SCFlow model designed for refinement with shape constraint, but formulates the additional depth as a regularization in the iteration via 3D scene flow for RGBD frames. The key design of SCFlow2 is an introduction of geometry constraints into the training of recurrent matching network, by combining the rigid-motion embeddings in 3D scene flow and 3D shape prior of the target. We train SCFlow2 on a combination of dataset Objaverse, GSO and ShapeNet, and evaluate on BOP datasets with novel objects. After using our method as a post-processing, most state-of-the-art methods produce significantly better results, without any retraining or fine-tuning. The source code is available at https://scflow2.github.io.
△ Less
Submitted 12 April, 2025;
originally announced April 2025.
-
Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning
Authors:
Fangzhi Xu,
Hang Yan,
Chang Ma,
Haiteng Zhao,
Qiushi Sun,
Kanzhi Cheng,
Junxian He,
Jun Liu,
Zhiyong Wu
Abstract:
Advancing LLM reasoning skills has captivated wide interest. However, current post-training techniques rely heavily on supervisory signals, such as outcome supervision or auxiliary reward models, which face the problem of scalability and high annotation costs. This motivates us to enhance LLM reasoning without the need for external supervision. We introduce a generalizable and purely unsupervised…
▽ More
Advancing LLM reasoning skills has captivated wide interest. However, current post-training techniques rely heavily on supervisory signals, such as outcome supervision or auxiliary reward models, which face problems of scalability and high annotation cost. This motivates us to enhance LLM reasoning without the need for external supervision. We introduce a generalizable and purely unsupervised self-training framework, named Genius. Without external auxiliary signals, Genius seeks the optimal response sequence in a stepwise manner and optimizes the LLM accordingly. To explore the potential steps and exploit the optimal ones, Genius introduces a stepwise foresight re-sampling strategy to sample and estimate the step value by simulating future outcomes. Further, we recognize that the unsupervised setting inevitably induces intrinsic noise and uncertainty. To provide a robust optimization, we propose an advantage-calibrated optimization (ACO) loss function to mitigate estimation inconsistencies. Combining these techniques, Genius provides an advanced initial step toward self-improving LLM reasoning with general queries and without supervision, revolutionizing reasoning scaling laws given the vast availability of general queries. The code will be released at https://github.com/xufangzhi/Genius.
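Below is a hedged sketch of the stepwise foresight re-sampling idea: sample several candidate next steps, estimate each one's value from simulated future rollouts, and keep the best. `propose_step` and `rollout_score` are hypothetical placeholders for LLM sampling and outcome simulation, and the ACO loss is not shown.

```python
# Hedged sketch of foresight re-sampling for selecting the next reasoning step.
import random
from typing import Callable

def foresight_select(context: str,
                     propose_step: Callable[[str], str],
                     rollout_score: Callable[[str], float],
                     num_candidates: int = 4,
                     num_rollouts: int = 2) -> str:
    """Return the sampled next step with the highest estimated future value."""
    best_step, best_value = None, float("-inf")
    for _ in range(num_candidates):
        step = propose_step(context)
        # Estimate the step's value by averaging simulated future outcomes.
        value = sum(rollout_score(context + step)
                    for _ in range(num_rollouts)) / num_rollouts
        if value > best_value:
            best_step, best_value = step, value
    return best_step

if __name__ == "__main__":
    random.seed(0)
    steps = [" try factoring.", " guess randomly.", " apply the quadratic formula."]
    chosen = foresight_select(
        "Solve x^2 - 5x + 6 = 0:",
        propose_step=lambda ctx: random.choice(steps),
        rollout_score=lambda traj: 1.0 if ("formula" in traj or "factoring" in traj) else 0.0,
    )
    print("selected step:", chosen)
```

In the full framework, the selected steps and their estimated advantages would feed the self-training update of the LLM.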
△ Less
Submitted 11 April, 2025;
originally announced April 2025.
-
SR-LIO++: Efficient LiDAR-Inertial Odometry and Quantized Mapping with Sweep Reconstruction
Authors:
Zikang Yuan,
Ruiye Ming,
Chengwei Zhao,
Yonghao Tan,
Pingcheng Dong,
Hongcheng Luo,
Yuzhong Jiao,
Xin Yang,
Kwang-Ting Cheng
Abstract:
Addressing the inherent low acquisition frequency limitation of 3D LiDAR to achieve high-frequency output has become a critical research focus in the LiDAR-Inertial Odometry (LIO) domain. To ensure real-time performance, frequency-enhanced LIO systems must process each sweep within significantly reduced timeframe, which presents substantial challenges for deployment on low-computational-power plat…
▽ More
Addressing the inherent low acquisition frequency of 3D LiDAR to achieve high-frequency output has become a critical research focus in the LiDAR-Inertial Odometry (LIO) domain. To ensure real-time performance, frequency-enhanced LIO systems must process each sweep within a significantly reduced timeframe, which presents substantial challenges for deployment on low-computational-power platforms. To address these limitations, we introduce SR-LIO++, an innovative LIO system capable of doubling the output frequency relative to the input frequency on resource-constrained hardware platforms, including the Raspberry Pi 4B. Our system employs a sweep reconstruction methodology to enhance the LiDAR sweep frequency, generating high-frequency reconstructed sweeps. Building upon this foundation, we propose a caching mechanism for the intermediate results (i.e., surface parameters) of the most recent segments, effectively minimizing redundant processing of common segments in adjacent reconstructed sweeps. This method decouples processing time from the traditionally linear dependence on reconstructed sweep frequency. Furthermore, we present quantized map point management based on index table mapping, significantly reducing memory usage by converting global 3D point storage from 64-bit double precision to an 8-bit char representation. This method also converts the computationally intensive Euclidean distance calculations in nearest neighbor searches from 64-bit double precision to 16-bit short and 32-bit integer formats, significantly reducing both memory and computational cost. Extensive experimental evaluations across three distinct computing platforms and four public datasets demonstrate that SR-LIO++ maintains state-of-the-art accuracy while substantially enhancing efficiency. Notably, our system successfully achieves 20 Hz state output on Raspberry Pi 4B hardware.
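A hedged numpy sketch of the storage idea behind such a quantized map follows: keep a coarse voxel index plus 8-bit offsets inside the voxel instead of three float64 coordinates. The voxel size and layout are assumptions, and a real index table would share the voxel index across the points it contains rather than storing it per point.

```python
# Sketch: quantize map points to (voxel index, 8-bit in-voxel offset) and back.
import numpy as np

VOXEL = 0.4          # metres per voxel (assumed)
LEVELS = 256         # 8-bit offsets; with a 0.4 m voxel this is ~1.6 mm per step

def quantize(points: np.ndarray):
    """points: (N, 3) float64 -> (voxel indices int32, offsets uint8)."""
    vox = np.floor(points / VOXEL).astype(np.int32)
    frac = points / VOXEL - vox                      # position inside the voxel, [0, 1)
    off = np.clip(np.round(frac * (LEVELS - 1)), 0, LEVELS - 1).astype(np.uint8)
    return vox, off

def dequantize(vox: np.ndarray, off: np.ndarray) -> np.ndarray:
    return (vox + off.astype(np.float64) / (LEVELS - 1)) * VOXEL

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    pts = rng.uniform(-50, 50, size=(100000, 3))
    vox, off = quantize(pts)
    err = np.abs(dequantize(vox, off) - pts).max()
    print(f"max reconstruction error: {err * 1000:.2f} mm "
          f"(offset storage: {off.nbytes} bytes vs {pts.nbytes} bytes raw)")
```

Nearest-neighbor distance tests can then be carried out on the small integer offsets, which is the spirit of the 16-bit and 32-bit integer arithmetic described above.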
△ Less
Submitted 8 April, 2025; v1 submitted 28 March, 2025;
originally announced March 2025.