ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks

Authors: Jiayang Xu, Fan Zhuo, Majun Zhang, Changhao Pan, Zehan Wang, Siyu Chen, Xiaoda Yang, Tao Jin, Zhou Zhao

Abstract: Current video editing models often rely on expensive paired video data, which limits their practical scalability. In essence, most video editing tasks can be formulated as a decoupled spatiotemporal process, where the temporal dynamics of the pretrained model are preserved while spatial content is selectively and precisely modified. Based on this insight, we propose ImVideoEdit, an efficient frame… ▽ More Current video editing models often rely on expensive paired video data, which limits their practical scalability. In essence, most video editing tasks can be formulated as a decoupled spatiotemporal process, where the temporal dynamics of the pretrained model are preserved while spatial content is selectively and precisely modified. Based on this insight, we propose ImVideoEdit, an efficient framework that learns video editing capabilities entirely from image pairs. By freezing the pre-trained 3D attention modules and treating images as single-frame videos, we decouple the 2D spatial learning process to help preserve the original temporal dynamics. The core of our approach is a Predict-Update Spatial Difference Attention module that progressively extracts and injects spatial differences. Rather than relying on rigid external masks, we incorporate a Text-Guided Dynamic Semantic Gating mechanism for adaptive and implicit text-driven modifications. Despite training on only 13K image pairs for 5 epochs with exceptionally low computational overhead, ImVideoEdit achieves editing fidelity and temporal consistency comparable to larger models trained on extensive video datasets. △ Less

Submitted 9 April, 2026; originally announced April 2026.

arXiv:2604.06623 [pdf, ps, other]

doi 10.1109/TAI.2025.3633206

WeatherRemover: All-in-one Adverse Weather Removal with Multi-scale Feature Map Compression

Authors: Weikai Qu, Sijun Liang, Cheng Pan, Zikuan Yang, Guanchi Zhou, Xianjun Fu, Bo Liu, Changmiao Wang, Ahmed Elazab

Abstract: Photographs taken in adverse weather conditions often suffer from blurriness, occlusion, and low brightness due to interference from rain, snow, and fog. These issues can significantly hinder the performance of subsequent computer vision tasks, making the removal of weather effects a crucial step in image enhancement. Existing methods primarily target specific weather conditions, with only a few c… ▽ More Photographs taken in adverse weather conditions often suffer from blurriness, occlusion, and low brightness due to interference from rain, snow, and fog. These issues can significantly hinder the performance of subsequent computer vision tasks, making the removal of weather effects a crucial step in image enhancement. Existing methods primarily target specific weather conditions, with only a few capable of handling multiple weather scenarios. However, mainstream approaches often overlook performance considerations, resulting in large parameter sizes, long inference times, and high memory costs. In this study, we introduce the WeatherRemover model, designed to enhance the restoration of images affected by various weather conditions while balancing performance. Our model adopts a UNet-like structure with a gating mechanism and a multi-scale pyramid vision Transformer. It employs channel-wise attention derived from convolutional neural networks to optimize feature extraction, while linear spatial reduction helps curtail the computational demands of attention. The gating mechanisms, strategically placed within the feed-forward and downsampling phases, refine the processing of information by selectively addressing redundancy and mitigating its influence on learning. This approach facilitates the adaptive selection of essential data, ensuring superior restoration and maximizing efficiency. Additionally, our lightweight model achieves an optimal balance between restoration quality, parameter efficiency, computational overhead, and memory usage, distinguishing it from other multi-weather models, thereby meeting practical application demands effectively. The source code is available at https://github.com/RICKand-MORTY/WeatherRemover. △ Less

Submitted 7 April, 2026; originally announced April 2026.

Comments: Accepted by IEEE Transactions on Artificial Intelligence

arXiv:2604.06622 [pdf, ps, other]

doi 10.1109/TRPMS.2026.3674764

Balancing Efficiency and Restoration: Lightweight Mamba-Based Model for CT Metal Artifact Reduction

Authors: Weikai Qu, Sijun Liang, Xianfeng Li, Cheng Pan, An Yan, Ahmed Elazab, Shanzhou Niu, Dong Zeng, Xiang Wan, Changmiao Wang

Abstract: In computed tomography imaging, metal implants frequently generate severe artifacts that compromise image quality and hinder diagnostic accuracy. There are three main challenges in the existing methods: the deterioration of organ and tissue structures, dependence on sinogram data, and an imbalance between resource use and restoration efficiency. Addressing these issues, we introduce MARMamba, whic… ▽ More In computed tomography imaging, metal implants frequently generate severe artifacts that compromise image quality and hinder diagnostic accuracy. There are three main challenges in the existing methods: the deterioration of organ and tissue structures, dependence on sinogram data, and an imbalance between resource use and restoration efficiency. Addressing these issues, we introduce MARMamba, which effectively eliminates artifacts caused by metals of different sizes while maintaining the integrity of the original anatomical structures of the image. Furthermore, this model only focuses on CT images affected by metal artifacts, thus negating the requirement for additional input data. The model is a streamlined UNet architecture, which incorporates multi-scale Mamba (MS-Mamba) as its core module. Within MS-Mamba, a flip mamba block captures comprehensive contextual information by analyzing images from multiple orientations. Subsequently, the average maximum feed-forward network integrates critical features with average features to suppress the artifacts. This combination allows MARMamba to reduce artifacts efficiently. The experimental results demonstrate that our model excels in reducing metal artifacts, offering distinct advantages over other models. It also strikes an optimal balance between computational demands, memory usage, and the number of parameters, highlighting its practical utility in the real world. The code of the presented model is available at: https://github.com/RICKand-MORTY/MARMamba. △ Less

Submitted 7 April, 2026; originally announced April 2026.

Comments: Accepted by IEEE Transactions on Radiation and Plasma Medical Sciences

arXiv:2604.02289 [pdf, ps, other]

Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation

Authors: Chongjie Ye, Cheng Cao, Chuanyu Pan, Yiming Hao, Yihao Zhi, Yuanming Hu, Xiaoguang Han

Abstract: Recent multimodal large language models have achieved strong performance in unified text and image understanding and generation, yet extending such native capability to 3D remains challenging due to limited data. Compared to abundant 2D imagery, high-quality 3D assets are scarce, making 3D synthesis under-constrained. Existing methods often rely on indirect pipelines that edit in 2D and lift resul… ▽ More Recent multimodal large language models have achieved strong performance in unified text and image understanding and generation, yet extending such native capability to 3D remains challenging due to limited data. Compared to abundant 2D imagery, high-quality 3D assets are scarce, making 3D synthesis under-constrained. Existing methods often rely on indirect pipelines that edit in 2D and lift results into 3D via optimization, sacrificing geometric consistency. We present Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation within a single autoregressive framework. Our key insight is that cross-modal consistency between images and 3D can serve as an implicit structural constraint. By representing text, images, and 3D as discrete tokens in a shared sequence space, the model leverages abundant 2D data as a geometric prior to improve 3D representations. We introduce an interleaved X-to-X training paradigm that coordinates diverse cross-modal tasks over heterogeneous paired datasets without requiring fully aligned text-image-3D triplets. By traversing semantic-visual-geometric cycles (e.g., text to image to 3D to image) within autoregressive sequences, the model jointly enforces semantic alignment, appearance fidelity, and multi-view geometric consistency. Experiments show that Omni123 significantly improves text-guided 3D generation and editing, demonstrating a scalable path toward multimodal 3D world models. △ Less

Submitted 2 April, 2026; originally announced April 2026.

arXiv:2603.22866 [pdf, ps, other]

Aerial Agentic AI: Synergizing LLM and SLM for Low-Altitude Wireless Networks

Authors: Li Dong, Feibo Jiang, Kezhi Wang, Cunhua Pan, Dong In Kim, Ekram Hossain

Abstract: Low-Altitude Wireless Networks (LAWNs), composed of Unmanned Aerial Vehicles (UAVs) and mobile terminals, are emerging as a critical extension of 6G. However, applying Large Language Models in LAWNs faces three major challenges: 1) Computational and energy constraints; 2) Communication and bandwidth limitations; 3) Real-time and reliability conflicts. To address these challenges, we propose Aerial… ▽ More Low-Altitude Wireless Networks (LAWNs), composed of Unmanned Aerial Vehicles (UAVs) and mobile terminals, are emerging as a critical extension of 6G. However, applying Large Language Models in LAWNs faces three major challenges: 1) Computational and energy constraints; 2) Communication and bandwidth limitations; 3) Real-time and reliability conflicts. To address these challenges, we propose Aerial Agentic AI, a hierarchical framework integrating UAV-side fast-thinking Small Language Model (SLMs) with BS-side slow-thinking Large Language Model (LLMs). First, we design SLM-based Agents capable of on-board perception, short-term memory enhancement, and real-time decision-making on the UAVs. Second, we implement a LLM-based Agent system that leverages long-term memory, global knowledge, and tool orchestration at the Base Station (BS) to perform deep reasoning, knowledge updates, and strategy optimization. Third, we establish an efficient hierarchical coordination mechanism, enabling UAVs to execute high-frequency tasks locally while synchronizing with the BS only when necessary. Experimental results validate the effectiveness of the proposed Aerial Agentic AI. △ Less

Submitted 24 March, 2026; originally announced March 2026.

arXiv:2603.18273 [pdf, ps, other]

EDM-ARS: A Domain-Specific Multi-Agent System for Automated Educational Data Mining Research

Authors: Chenguang Pan, Zhou Zhang, Weixuan Xiao, Chengyuan Yao

Abstract: In this technical report, we present the Educational Data Mining Automated Research System (EDM-ARS), a domain-specific multi-agent pipeline that automates end-to-end educational data mining (EDM) research. We conceptualize EDM-ARS as a general framework for domain-aware automated research pipelines, where educational expertise is embedded into each stage of the research lifecycle. As a first inst… ▽ More In this technical report, we present the Educational Data Mining Automated Research System (EDM-ARS), a domain-specific multi-agent pipeline that automates end-to-end educational data mining (EDM) research. We conceptualize EDM-ARS as a general framework for domain-aware automated research pipelines, where educational expertise is embedded into each stage of the research lifecycle. As a first instantiation of this framework, we focus on predictive modeling tasks. Within this scope, EDM-ARS orchestrates five specialized LLM-powered agents (ProblemFormulator, DataEngineer, Analyst, Critic, and Writer) through a state-machine coordinator that supports revision loops, checkpoint-based recovery, and sandboxed code execution. Given a research prompt and a dataset, EDM-ARS produces a complete LaTeX manuscript with real Semantic Scholar citations, validated machine learning analyses, and automated methodological peer review. We also provide a detailed description of the system architecture, the three-tier data registry design that encodes educational domain expertise, the specification of each agent, the inter-agent communication protocol, and mechanisms for error-handling and self-correction. Finally, we discuss current limitations, including single-dataset scope and formulaic paper output, and outline a phased roadmap toward causal inference, transfer learning, psychometric, and multi-dataset generalization. EDM-ARS is released as an open-source project to support the educational research community. △ Less

Submitted 18 March, 2026; originally announced March 2026.

arXiv:2603.14889 [pdf, ps, other]

Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness

Authors: Jingyu Lu, Yuhan Wang, Fan Zhuo, Xize Cheng, Changhao Pan, Xueyi Pu, Yifu Chen, Chenyuhao Wen, Tianle Liang, Zhou Zhao

Abstract: The rapid evolution of end-to-end spoken dialogue systems demands transcending mere textual semantics to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two critical gaps: the modality gap, involving prosody and emotion, and the colloquialness gap, distinguishing written scripts from natural speech. To address these challe… ▽ More The rapid evolution of end-to-end spoken dialogue systems demands transcending mere textual semantics to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two critical gaps: the modality gap, involving prosody and emotion, and the colloquialness gap, distinguishing written scripts from natural speech. To address these challenges, we introduce SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset, a novel collection of episode-level preference pairs explicitly targeting these gaps. It operates directly on full multi-turn speech episodes and is optimized with pairwise preference supervision, enabling joint assessment of modality and colloquialness in a single evaluator. We further establish ESDR-Bench, a stratified benchmark for robust episode-level evaluation. Experiments demonstrate that SDiaReward achieves state-of-the-art pairwise preference accuracy, significantly outperforming general-purpose audio LLMs. Further analysis suggests that SDiaReward captures relative conversational expressiveness beyond superficial synthesis cues, improving generalization across domains and recording conditions. Code, data, and demos are available at https://sdiareward.github.io/. △ Less

Submitted 16 March, 2026; originally announced March 2026.

arXiv:2603.09293 [pdf, ps, other]

Tensor Train Decomposition-based Channel Estimation for MIMO-AFDM Systems with Fractional Delay and Doppler

Authors: Ruizhe Wang, Cunhua Pan, Hong Ren, Haisu Wu, Jiangzhou Wang

Abstract: Affine Frequency Division Multiplexing (AFDM) has emerged as a promising chirp-based multicarrier technology for high-speed communication systems. To fully exploit the diversity gain offered by AFDM, accurate channel estimation is essential. However, existing studies have mainly focused on the integer-delay-tap scenario and single-symbol pilot-based estimation. Since delay taps in practice are gen… ▽ More Affine Frequency Division Multiplexing (AFDM) has emerged as a promising chirp-based multicarrier technology for high-speed communication systems. To fully exploit the diversity gain offered by AFDM, accurate channel estimation is essential. However, existing studies have mainly focused on the integer-delay-tap scenario and single-symbol pilot-based estimation. Since delay taps in practice are generally fractional, approximating them as integers not only degrades delay estimation accuracy but also severely affects Doppler frequency estimation. To address this problem, in this paper, we investigate channel estimation for multiple-input multiple-output (MIMO)-AFDM systems. A time-affine frequency (T-AF) domain pilot structure is proposed to exploit time-domain phase variations. By leveraging the rotational invariance property in the spatial and temporal domains, a channel estimation algorithm based on Vandermonde-structured tensor-train (TT) decomposition is developed. The proposed algorithm demonstrates superior computational efficiency compared with state-of-the-art parameter estimation methods. Moreover, diverging from current studies, we derive the global Ziv-Zakai bound (ZZB) as an alternative parameter estimation error lower bound to the Cramér-Rao bound (CRB). Numerical results show that the derived ZZB provides tighter global performance characterization and successfully captures the threshold phenomenon in mean square error (MSE) performance in the low-SNR regime. Furthermore, the proposed algorithm achieves superior communication performance relative to the existing schemes, while offering a computational speedup, reducing the execution time by an order of magnitude compared to the state-of-the-art iterative algorithms. △ Less

Submitted 19 March, 2026; v1 submitted 10 March, 2026; originally announced March 2026.

arXiv:2603.06887 [pdf, ps, other]

VertiAdaptor: Online Kinodynamics Adaptation for Vertically Challenging Terrain

Authors: Tong Xu, Chenhui Pan, Aniket Datar, Xuesu Xiao

Abstract: Autonomous driving in off-road environments presents significant challenges due to the dynamic and unpredictable nature of unstructured terrain. Traditional kinodynamic models often struggle to generalize across diverse geometric and semantic terrain types, underscoring the need for real-time adaptation to ensure safe and reliable navigation. We propose VertiAdaptor (VA), a novel online adaptation… ▽ More Autonomous driving in off-road environments presents significant challenges due to the dynamic and unpredictable nature of unstructured terrain. Traditional kinodynamic models often struggle to generalize across diverse geometric and semantic terrain types, underscoring the need for real-time adaptation to ensure safe and reliable navigation. We propose VertiAdaptor (VA), a novel online adaptation framework that efficiently integrates elevation with semantic embeddings to enable terrain-aware kinodynamic modeling and planning via function encoders. VA learns a kinodynamic space spanned by a set of neural ordinary differential equation basis functions, capturing complex vehicle-terrain interactions across varied environments. After offline training, the proposed approach can rapidly adapt to new, unseen environments by identifying kinodynamics in the learned space through a computationally efficient least-squares calculation. We evaluate VA within the Verti-Bench simulator, built on the Chrono multi-physics engine, and validate its performance both in simulation and on a physical Verti-4-Wheeler platform. Our results demonstrate that VA improves prediction accuracy by up to 23.9% and achieves a 5X faster adaptation time, advancing the robustness and reliability of autonomous robots in complex and evolving off-road environments. △ Less

Submitted 20 March, 2026; v1 submitted 6 March, 2026; originally announced March 2026.

arXiv:2603.06866 [pdf, ps, other]

CAR: Cross-Vehicle Kinodynamics Adaptation via Mobility Representation

Authors: Tong Xu, Chenhui Pan, Xuesu Xiao

Abstract: Developing autonomous off-road mobility typically requires either extensive, platform-specific data collection or relies on simplified abstractions, such as unicycle or bicycle models, that fail to capture the complex kinodynamics of diverse platforms, ranging from wheeled to tracked vehicles. This limitation hinders scalability across evolving heterogeneous autonomous robot fleets. To address thi… ▽ More Developing autonomous off-road mobility typically requires either extensive, platform-specific data collection or relies on simplified abstractions, such as unicycle or bicycle models, that fail to capture the complex kinodynamics of diverse platforms, ranging from wheeled to tracked vehicles. This limitation hinders scalability across evolving heterogeneous autonomous robot fleets. To address this challenge, we propose Cross-vehicle kinodynamics Adaptation via mobility Representation (CAR), a novel framework that enables rapid mobility transfer to new vehicles. CAR employs a Transformer encoder with Adaptive Layer Normalization to embed vehicle trajectory transitions and physical configurations into a shared mobility latent space. By identifying and extracting commonality from nearest neighbors within this latent space, our approach enables rapid kinodynamics adaptation to novel platforms with minimal data collection and computational overhead. We evaluate CAR using the Verti-Bench simulator, built on the Chrono multi-physics engine, and validate its performance on four distinct physical configurations of the Verti-4-Wheeler platform. With only one minute of new trajectory data, CAR achieves up to 67.2% reduction in prediction error compared to direct neighbor transfer across diverse unseen vehicle configurations, demonstrating the effectiveness of cross-vehicle mobility knowledge transfer in both simulated and real-world environments. △ Less

Submitted 20 March, 2026; v1 submitted 6 March, 2026; originally announced March 2026.

arXiv:2603.00482 [pdf, ps, other]

TokenCom: Vision-Language Model for Multimodal and Multitask Token Communications

Authors: Feibo Jiang, Siwei Tu, Li Dong, Xiaolong Li, Kezhi Wang, Cunhua Pan, Zhu Han, Jiangzhou Wang

Abstract: Visual-Language Models (VLMs), with their strong capabilities in image and text understanding, offer a solid foundation for intelligent communications. However, their effectiveness is constrained by limited token granularity, overlong visual token sequences, and inadequate cross-modal alignment. To overcome these challenges, we propose TaiChi, a novel VLM framework designed for token communication… ▽ More Visual-Language Models (VLMs), with their strong capabilities in image and text understanding, offer a solid foundation for intelligent communications. However, their effectiveness is constrained by limited token granularity, overlong visual token sequences, and inadequate cross-modal alignment. To overcome these challenges, we propose TaiChi, a novel VLM framework designed for token communications. TaiChi adopts a dual-visual tokenizer architecture that processes both high- and low-resolution images to collaboratively capture pixel-level details and global conceptual features. A Bilateral Attention Network (BAN) is introduced to intelligently fuse multi-scale visual tokens, thereby enhancing visual understanding and producing compact visual tokens. In addition, a Kolmogorov Arnold Network (KAN)-based modality projector with learnable activation functions is employed to achieve precise nonlinear alignment from visual features to the text semantic space, thus minimizing information loss. Finally, TaiChi is integrated into a multimodal and multitask token communication system equipped with a joint VLM-channel coding scheme. Experimental results validate the superior performance of TaiChi, as well as the feasibility and effectiveness of the TaiChi-driven token communication system. △ Less

Submitted 28 February, 2026; originally announced March 2026.

arXiv:2602.09336 [pdf, ps, other]

FM SO.P: A Progressive Task Mixture Framework with Automatic Evaluation for Cross-Domain SOP Understanding

Authors: Siyuan Huang, Ziyu Wang, Chao Pan, Han Zhao

Abstract: Standard Operating Procedures (SOPs) are critical for enterprise operations, yet existing language models struggle with SOP understanding and cross-domain generalization. Current methods fail because joint training cannot differentiate between reasoning capabilities that SOP requires: terminology precision, sequential ordering, and constraint reasoning. We propose FM SO.P, solving these challenges… ▽ More Standard Operating Procedures (SOPs) are critical for enterprise operations, yet existing language models struggle with SOP understanding and cross-domain generalization. Current methods fail because joint training cannot differentiate between reasoning capabilities that SOP requires: terminology precision, sequential ordering, and constraint reasoning. We propose FM SO.P, solving these challenges through two novelties. First, we introduce progressive task mixtures that build capabilities by stages across three task types with cumulative data: concept disambiguation for terminology precision, action sequence understanding for procedural correctness, and scenario-aware graph reasoning for conditional logic. Second, we propose an automatic multi-agent evaluation system consisting of three agents that adaptively generate rubrics, stratified test sets, and rubric scoring, adapting to domains (e.g., temporal constraints for DMV, regulatory compliance for banking). Evaluated on SOPBench across seven domains (Bank, DMV, Healthcare, Market, University, Library, Hotel), FM SO.P achieves 48.3\% pass rate with our 32B model and 34.3\% with our opensource 7B model, matching Qwen-2.5-72B-Instruct baseline (34.4\%) with 10x fewer parameters. △ Less

Submitted 9 February, 2026; originally announced February 2026.

arXiv:2602.09065 [pdf, ps, other]

Enhanced Graph Transformer with Serialized Graph Tokens

Authors: Ruixiang Wang, Yuyang Hong, Shiming Xiang, Chunhong Pan

Abstract: Transformers have demonstrated success in graph learning, particularly for node-level tasks. However, existing methods encounter an information bottleneck when generating graph-level representations. The prevalent single token paradigm fails to fully leverage the inherent strength of self-attention in encoding token sequences, and degenerates into a weighted sum of node signals. To address this is… ▽ More Transformers have demonstrated success in graph learning, particularly for node-level tasks. However, existing methods encounter an information bottleneck when generating graph-level representations. The prevalent single token paradigm fails to fully leverage the inherent strength of self-attention in encoding token sequences, and degenerates into a weighted sum of node signals. To address this issue, we design a novel serialized token paradigm to encapsulate global signals more effectively. Specifically, a graph serialization method is proposed to aggregate node signals into serialized graph tokens, with positional encoding being automatically involved. Then, stacked self-attention layers are applied to encode this token sequence and capture its internal dependencies. Our method can yield more expressive graph representations by modeling complex interactions among multiple graph tokens. Experimental results show that our method achieves state-of-the-art results on several graph-level benchmarks. Ablation studies verify the effectiveness of the proposed modules. △ Less

Submitted 9 February, 2026; originally announced February 2026.

Comments: ICASSP 2026

arXiv:2602.06485 [pdf, ps, other]

AgentCPM-Explore: Realizing Long-Horizon Deep Exploration for Edge-Scale Agents

Authors: Haotian Chen, Xin Cong, Shengda Fan, Yuyang Fu, Ziqin Gong, Yaxi Lu, Yishan Li, Boye Niu, Chengjun Pan, Zijun Song, Huadong Wang, Yesai Wu, Yueying Wu, Zihao Xie, Yukun Yan, Zhong Zhang, Yankai Lin, Zhiyuan Liu, Maosong Sun

Abstract: While Large Language Model (LLM)-based agents have shown remarkable potential for solving complex tasks, existing systems remain heavily reliant on large-scale models, leaving the capabilities of edge-scale models largely underexplored. In this paper, we present the first systematic study on training agentic models at the 4B-parameter scale. We identify three primary bottlenecks hindering the perf… ▽ More While Large Language Model (LLM)-based agents have shown remarkable potential for solving complex tasks, existing systems remain heavily reliant on large-scale models, leaving the capabilities of edge-scale models largely underexplored. In this paper, we present the first systematic study on training agentic models at the 4B-parameter scale. We identify three primary bottlenecks hindering the performance of edge-scale models: catastrophic forgetting during Supervised Fine-Tuning (SFT), sensitivity to reward signal noise during Reinforcement Learning (RL), and reasoning degradation caused by redundant information in long-context scenarios. To address the issues, we propose AgentCPM-Explore, a compact 4B agent model with high knowledge density and strong exploration capability. We introduce a holistic training framework featuring parameter-space model fusion, reward signal denoising, and contextual information refinement. Through deep exploration, AgentCPM-Explore achieves state-of-the-art (SOTA) performance among 4B-class models, matches or surpasses 8B-class SOTA models on four benchmarks, and even outperforms larger-scale models such as Claude-4.5-Sonnet or DeepSeek-v3.2 in five benchmarks. Notably, AgentCPM-Explore achieves 97.09% accuracy on GAIA text-based tasks under pass@64. These results provide compelling evidence that the bottleneck for edge-scale models is not their inherent capability ceiling, but rather their inference stability. Based on our well-established training framework, AgentCPM-Explore effectively unlocks the significant, yet previously underestimated, potential of edge-scale models. △ Less

Submitted 6 February, 2026; originally announced February 2026.

arXiv:2602.00148 [pdf, ps, other]

Learning Physics-Grounded 4D Dynamics with Neural Gaussian Force Fields

Authors: Shiqian Li, Ruihong Shen, Junfeng Ni, Chang Pan, Chi Zhang, Yixin Zhu

Abstract: Predicting physical dynamics from raw visual data remains a major challenge in AI. While recent video generation models have achieved impressive visual quality, they still cannot consistently generate physically plausible videos due to a lack of modeling of physical laws. Recent approaches combining 3D Gaussian splatting and physics engines can produce physically plausible videos, but are hindered… ▽ More Predicting physical dynamics from raw visual data remains a major challenge in AI. While recent video generation models have achieved impressive visual quality, they still cannot consistently generate physically plausible videos due to a lack of modeling of physical laws. Recent approaches combining 3D Gaussian splatting and physics engines can produce physically plausible videos, but are hindered by high computational costs in both reconstruction and simulation, and often lack robustness in complex real-world scenarios. To address these issues, we introduce Neural Gaussian Force Field (NGFF), an end-to-end neural framework that integrates 3D Gaussian perception with physics-based dynamic modeling to generate interactive, physically realistic 4D videos from multi-view RGB inputs, achieving two orders of magnitude faster than prior Gaussian simulators. To support training, we also present GSCollision, a 4D Gaussian dataset featuring diverse materials, multi-object interactions, and complex scenes, totaling over 640k rendered physical videos (~4 TB). Evaluations on synthetic and real 3D scenarios show NGFF's strong generalization and robustness in physical reasoning, advancing video prediction towards physics-grounded world models. △ Less

Submitted 12 February, 2026; v1 submitted 29 January, 2026; originally announced February 2026.

Comments: 43 pages, ICLR 2026

arXiv:2601.20331 [pdf, ps, other]

GVGS: Gaussian Visibility-Aware Multi-View Geometry for Accurate Surface Reconstruction

Authors: Mai Su, Qihan Yu, Zhongtao Wang, Yilong Li, Chengwei Pan, Yisong Chen, Guoping Wang, Fei Zhu

Abstract: 3D Gaussian Splatting (3DGS) enables efficient rendering, yet accurate surface reconstruction remains challenging due to unreliable geometric supervision. Existing approaches predominantly rely on depth-based reprojection to infer visibility and enforce multi-view consistency, leading to a fundamental circular dependency: visibility estimation requires accurate depth, while depth supervision itsel… ▽ More 3D Gaussian Splatting (3DGS) enables efficient rendering, yet accurate surface reconstruction remains challenging due to unreliable geometric supervision. Existing approaches predominantly rely on depth-based reprojection to infer visibility and enforce multi-view consistency, leading to a fundamental circular dependency: visibility estimation requires accurate depth, while depth supervision itself is conditioned on visibility. In this work, we revisit multi-view geometric supervision from the perspective of visibility modeling. Instead of inferring visibility from pixel-wise depth consistency, we explicitly model visibility at the level of Gaussian primitives. We introduce a Gaussian visibility-aware multi-view geometric consistency (GVMV) formulation, which aggregates cross-view visibility of shared Gaussians to construct reliable supervision over co-visible regions. To further incorporate monocular priors, we propose a progressive quadtree-calibrated depth alignment (QDC) strategy that performs block-wise affine calibration under visibility-aware guidance, effectively mitigating scale ambiguity while preserving local geometric structures. Extensive experiments on DTU and Tanks and Temples demonstrate that our method consistently improves reconstruction accuracy over prior Gaussian-based approaches. Our code is fully open-sourced and available at an anonymous repository: https://github.com/GVGScode/GVGS. △ Less

Submitted 2 April, 2026; v1 submitted 28 January, 2026; originally announced January 2026.

arXiv:2601.16935 [pdf, ps, other]

AERO: Adaptive and Efficient Runtime-Aware OTA Updates for Energy-Harvesting IoT

Authors: Wei Wei, Jingye Xu, Sahidul Islam, Dakai Zhu, Chen Pan, Mimi Xie

Abstract: Energy-harvesting (EH) Internet of Things (IoT) devices operate under intermittent energy availability, which disrupts task execution and makes energy-intensive over-the-air (OTA) updates particularly challenging. Conventional OTA update mechanisms rely on reboots and incur significant overhead, rendering them unsuitable for intermittently powered systems. Recent live OTA update techniques reduce… ▽ More Energy-harvesting (EH) Internet of Things (IoT) devices operate under intermittent energy availability, which disrupts task execution and makes energy-intensive over-the-air (OTA) updates particularly challenging. Conventional OTA update mechanisms rely on reboots and incur significant overhead, rendering them unsuitable for intermittently powered systems. Recent live OTA update techniques reduce reboot overhead but still lack mechanisms to ensure consistency when updates interact with runtime execution. This paper presents AERO, an Adaptive and Efficient Runtime-Aware OTA update mechanism that integrates update tasks into the device's Directed Acyclic Graph (DAG) and schedules them alongside routine tasks under energy and timing constraints. By identifying update-affected execution regions and dynamically adjusting dependencies, AERO ensures consistent up date integration while adapting to intermittent energy availability. Experiments on representative workloads demonstrate improved update reliability and efficiency compared to existing live update approaches. △ Less

Submitted 23 January, 2026; originally announced January 2026.

Comments: Accepted at DATE 2026

arXiv:2601.14127 [pdf, ps, other]

The Side Effects of Being Smart: Safety Risks in MLLMs' Multi-Image Reasoning

Authors: Renmiao Chen, Yida Lu, Shiyao Cui, Xuan Ouyang, Victor Shea-Jay Huang, Shumin Zhang, Chengwei Pan, Han Qiu, Minlie Huang

Abstract: As Multimodal Large Language Models (MLLMs) acquire stronger reasoning capabilities to handle complex, multi-image instructions, this advancement may pose new safety risks. We study this problem by introducing MIR-SafetyBench, the first benchmark focused on multi-image reasoning safety, which consists of 2,676 instances across a taxonomy of 9 multi-image relations. Our extensive evaluations on 19… ▽ More As Multimodal Large Language Models (MLLMs) acquire stronger reasoning capabilities to handle complex, multi-image instructions, this advancement may pose new safety risks. We study this problem by introducing MIR-SafetyBench, the first benchmark focused on multi-image reasoning safety, which consists of 2,676 instances across a taxonomy of 9 multi-image relations. Our extensive evaluations on 19 MLLMs reveal a troubling trend: models with more advanced multi-image reasoning can be more vulnerable on MIR-SafetyBench. Beyond attack success rates, we find that many responses labeled as safe are superficial, often driven by misunderstanding or evasive, non-committal replies. We further observe that unsafe generations exhibit lower attention entropy than safe ones on average. This internal signature suggests a possible risk that models may over-focus on task solving while neglecting safety constraints. Our code and data are available at https://github.com/thu-coai/MIR-SafetyBench. △ Less

Submitted 20 January, 2026; originally announced January 2026.

Comments: *15 pages, 5 figures. Introduces MIR-SafetyBench (2,676 instances; 9 multi-image relations). Equal contribution; †Corresponding author. Code/data: https://github.com/thu-coai/MIR-SafetyBench

arXiv:2601.13910 [pdf, ps, other]

Synthetic Singers: A Review of Deep-Learning-based Singing Voice Synthesis Approaches

Authors: Changhao Pan, Dongyu Yao, Yu Zhang, Wenxiang Guo, Jingyu Lu, Zhiyuan Zhu, Zhou Zhao

Abstract: Recent advances in singing voice synthesis (SVS) have attracted substantial attention from both academia and industry. With the advent of large language models and novel generative paradigms, producing controllable, high-fidelity singing voices has become an attainable goal. Yet the field still lacks a comprehensive survey that systematically analyzes deep-learning-based singing voice synthesis sy… ▽ More Recent advances in singing voice synthesis (SVS) have attracted substantial attention from both academia and industry. With the advent of large language models and novel generative paradigms, producing controllable, high-fidelity singing voices has become an attainable goal. Yet the field still lacks a comprehensive survey that systematically analyzes deep-learning-based singing voice synthesis systems and their enabling technologies. To address the aforementioned issue, this survey first categorizes existing systems by task type and then organizes current architectures into two major paradigms: cascaded and end-to-end approaches. Moreover, we provide an in-depth analysis of core technologies, covering singing modeling and control techniques. Finally, we review relevant datasets, annotation tools, and evaluation benchmarks that support training and assessment. In appendix, we introduce training strategies and further discussion of SVS. This survey provides an up-to-date review of the literature on SVS models, which would be a useful reference for both researchers and engineers. Related materials are available at https://github.com/David-Pigeon/SyntheticSingers. △ Less

Submitted 20 January, 2026; originally announced January 2026.

Comments: Accepetd by IJCNLP-AACL 2025(Oral)

arXiv:2601.09988 [pdf, ps, other]

In-the-Wild Compliant Manipulation with UMI-FT

Authors: Hojung Choi, Yifan Hou, Chuer Pan, Seongheon Hong, Austin Patel, Xiaomeng Xu, Mark R. Cutkosky, Shuran Song

Abstract: Many manipulation tasks require careful force modulation. With insufficient force the task may fail, while excessive force could cause damage. The high cost, bulky size and fragility of commercial force/torque (F/T) sensors have limited large-scale, force-aware policy learning. We introduce UMI-FT, a handheld data-collection platform that mounts compact, six-axis force/torque sensors on each finge… ▽ More Many manipulation tasks require careful force modulation. With insufficient force the task may fail, while excessive force could cause damage. The high cost, bulky size and fragility of commercial force/torque (F/T) sensors have limited large-scale, force-aware policy learning. We introduce UMI-FT, a handheld data-collection platform that mounts compact, six-axis force/torque sensors on each finger, enabling finger-level wrench measurements alongside RGB, depth, and pose. Using the multimodal data collected from this device, we train an adaptive compliance policy that predicts position targets, grasp force, and stiffness for execution on standard compliance controllers. In evaluations on three contact-rich, force-sensitive tasks (whiteboard wiping, skewering zucchini, and lightbulb insertion), UMI-FT enables policies that reliably regulate external contact forces and internal grasp forces, outperforming baselines that lack compliance or force sensing. UMI-FT offers a scalable path to learning compliant manipulation from in-the-wild demonstrations. We open-source the hardware and software to facilitate broader adoption at:https://umi-ft.github.io/. △ Less

Submitted 14 January, 2026; originally announced January 2026.

Comments: submitted to ICRA 2026

arXiv:2601.08363 [pdf, ps, other]

PosIR: Position-Aware Heterogeneous Information Retrieval Benchmark

Authors: Ziyang Zeng, Dun Zhang, Yu Yan, Xu Sun, Cuiqiaoshu Pan, Yudong Zhou, Yuqing Yang

Abstract: In real-world documents, the information relevant to a user query may reside anywhere from the beginning to the end. This makes position bias -- a systematic tendency of retrieval models to favor or neglect content based on its location -- a critical concern. Although recent studies have identified such bias, existing analyses focus predominantly on English, fail to disentangle document length fro… ▽ More In real-world documents, the information relevant to a user query may reside anywhere from the beginning to the end. This makes position bias -- a systematic tendency of retrieval models to favor or neglect content based on its location -- a critical concern. Although recent studies have identified such bias, existing analyses focus predominantly on English, fail to disentangle document length from information position, and lack a standardized framework for systematic diagnosis. To address these limitations, we introduce PosIR (Position-Aware Information Retrieval), the first standardized benchmark designed to systematically diagnose position bias in diverse retrieval scenarios. PosIR comprises 310 datasets spanning 10 languages and 31 domains, with relevance tied to precise reference spans. At its methodological core, PosIR employs a length-controlled bucketing strategy that groups queries by positive document length and analyzes positional effects within each bucket. This design strictly isolates position bias from length-induced performance degradation. Extensive experiments on 10 state-of-the-art embedding-based retrieval models reveal that: (1) retrieval performance on PosIR with documents exceeding 1536 tokens correlates poorly with the MMTEB benchmark, exposing limitations of current short-text evaluations; (2) position bias is pervasive in embedding models and even increases with document length, with most models exhibiting primacy bias while certain models show unexpected recency bias; (3) as an exploratory investigation, gradient-based saliency analysis further uncovers two distinct internal mechanisms that correlate with these positional preferences. We hope that PosIR can serve as a valuable diagnostic framework to advance the development of position-robust retrieval systems. △ Less

Submitted 12 March, 2026; v1 submitted 13 January, 2026; originally announced January 2026.

Comments: Work in progress

arXiv:2601.07280 [pdf, ps, other]

ReasonTabQA: A Comprehensive Benchmark for Table Question Answering from Real World Industrial Scenarios

Authors: Changzai Pan, Jie Zhang, Kaiwen Wei, Chenshuo Pan, Yu Zhao, Jingwang Huang, Jian Yang, Zhenhe Wu, Haoyang Zeng, Xiaoyan Gu, Weichao Sun, Yanbo Zhai, Yujie Mao, Zhuoru Jiang, Jiang Zhong, Shuangyong Song, Yongxiang Li, Zhongjiang He

Abstract: Recent advancements in Large Language Models (LLMs) have significantly catalyzed table-based question answering (TableQA). However, existing TableQA benchmarks often overlook the intricacies of industrial scenarios, which are characterized by multi-table structures, nested headers, and massive scales. These environments demand robust table reasoning through deep structured inference, presenting a… ▽ More Recent advancements in Large Language Models (LLMs) have significantly catalyzed table-based question answering (TableQA). However, existing TableQA benchmarks often overlook the intricacies of industrial scenarios, which are characterized by multi-table structures, nested headers, and massive scales. These environments demand robust table reasoning through deep structured inference, presenting a significant challenge that remains inadequately addressed by current methodologies. To bridge this gap, we present ReasonTabQA, a large-scale bilingual benchmark encompassing 1,932 tables across 30 industry domains such as energy and automotive. ReasonTabQA provides high-quality annotations for both final answers and explicit reasoning chains, supporting both thinking and no-thinking paradigms. Furthermore, we introduce TabCodeRL, a reinforcement learning method that leverages table-aware verifiable rewards to guide the generation of logical reasoning paths. Extensive experiments on ReasonTabQA and 4 TableQA datasets demonstrate that while TabCodeRL yields substantial performance gains on open-source LLMs, the persistent performance gap on ReasonTabQA underscores the inherent complexity of real-world industrial TableQA. △ Less

Submitted 12 January, 2026; originally announced January 2026.

arXiv:2512.22131 [pdf]

An Energy-Efficient RFET-Based Stochastic Computing Neural Network Accelerator

Authors: Sheng Lu, Qianhou Qu, Sungyong Jung, Qilian Liang, Chenyun Pan

Abstract: Stochastic computing (SC) offers significant reductions in hardware complexity for traditional convolutional neural networks(CNNs). However, despite its advantages, stochastic computing neural networks (SCNNs) often suffer from high resource consumption due to components such as stochastic number generators (SNGs) and accumulative parallel counters (APCs), which limit overall performance. This pap… ▽ More Stochastic computing (SC) offers significant reductions in hardware complexity for traditional convolutional neural networks(CNNs). However, despite its advantages, stochastic computing neural networks (SCNNs) often suffer from high resource consumption due to components such as stochastic number generators (SNGs) and accumulative parallel counters (APCs), which limit overall performance. This paper proposes a novel SCNN architecture leveraging reconfigurable field-effect transistors (RFETs). The inherent reconfigurability at the device level enables the design of highly efficient and compact SNGs, APCs, and other related essential components. Furthermore, a dedicated SCNN accelerator architecture is developed to facilitate system-level simulation. Based on accessible open-source standard cell libraries, experimental results demonstrate that the proposed RFET-based SCNN accelerator achieves significant reductions in area, latency, and energy consumption compared to its FinFET-based counterpart at the same technology node. △ Less

Submitted 27 January, 2026; v1 submitted 5 December, 2025; originally announced December 2025.

arXiv:2512.14222 [pdf, ps, other]

History-Enhanced Two-Stage Transformer for Aerial Vision-and-Language Navigation

Authors: Xichen Ding, Jianzhe Gao, Cong Pan, Wenguan Wang, Jie Qin

Abstract: Aerial Vision-and-Language Navigation (AVLN) requires Unmanned Aerial Vehicle (UAV) agents to localize targets in large-scale urban environments based on linguistic instructions. While successful navigation demands both global environmental reasoning and local scene comprehension, existing UAV agents typically adopt mono-granularity frameworks that struggle to balance these two aspects. To address… ▽ More Aerial Vision-and-Language Navigation (AVLN) requires Unmanned Aerial Vehicle (UAV) agents to localize targets in large-scale urban environments based on linguistic instructions. While successful navigation demands both global environmental reasoning and local scene comprehension, existing UAV agents typically adopt mono-granularity frameworks that struggle to balance these two aspects. To address this limitation, this work proposes a History-Enhanced Two-Stage Transformer (HETT) framework, which integrates the two aspects through a coarse-to-fine navigation pipeline. Specifically, HETT first predicts coarse-grained target positions by fusing spatial landmarks and historical context, then refines actions via fine-grained visual analysis. In addition, a historical grid map is designed to dynamically aggregate visual features into a structured spatial memory, enhancing comprehensive scene awareness. Additionally, the CityNav dataset annotations are manually refined to enhance data quality. Experiments on the refined CityNav dataset show that HETT delivers significant performance gains, while extensive ablation studies further verify the effectiveness of each component. △ Less

Submitted 16 December, 2025; v1 submitted 16 December, 2025; originally announced December 2025.

arXiv:2512.09663 [pdf, ps, other]

IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting

Authors: Tao Zhang, Yuyang Hong, Yang Xia, Kun Ding, Zeyu Zhang, Ying Wang, Shiming Xiang, Chunhong Pan

Abstract: Recent advances in multimodal large language models (MLLMs) have led to impressive progress across various benchmarks. However, their capability in understanding infrared images remains unexplored. To address this gap, we introduce IF-Bench, the first high-quality benchmark designed for evaluating multimodal understanding of infrared images. IF-Bench consists of 499 images sourced from 23 infrared… ▽ More Recent advances in multimodal large language models (MLLMs) have led to impressive progress across various benchmarks. However, their capability in understanding infrared images remains unexplored. To address this gap, we introduce IF-Bench, the first high-quality benchmark designed for evaluating multimodal understanding of infrared images. IF-Bench consists of 499 images sourced from 23 infrared datasets and 680 carefully curated visual question-answer pairs, covering 10 essential dimensions of image understanding. Based on this benchmark, we systematically evaluate over 40 open-source and closed-source MLLMs, employing cyclic evaluation, bilingual assessment, and hybrid judgment strategies to enhance the reliability of the results. Our analysis reveals how model scale, architecture, and inference paradigms affect infrared image comprehension, providing valuable insights for this area. Furthermore, we propose a training-free generative visual prompting (GenViP) method, which leverages advanced image editing models to translate infrared images into semantically and spatially aligned RGB counterparts, thereby mitigating domain distribution shifts. Extensive experiments demonstrate that our method consistently yields significant performance improvements across a wide range of MLLMs. The benchmark and code are available at https://github.com/casiatao/IF-Bench. △ Less

Submitted 10 December, 2025; originally announced December 2025.

arXiv:2512.04521 [pdf, ps, other]

WiFi-based Cross-Domain Gesture Recognition Using Attention Mechanism

Authors: Ruijing Liu, Cunhua Pan, Jiaming Zeng, Hong Ren, Kezhi Wang, Lei Kong, Jiangzhou Wang

Abstract: While fulfilling communication tasks, wireless signals can also be used to sense the environment. Among various types of sensing media, WiFi signals offer advantages such as widespread availability, low hardware cost, and strong robustness to environmental conditions like light, temperature, and humidity. By analyzing Wi-Fi signals in the environment, it is possible to capture dynamic changes of t… ▽ More While fulfilling communication tasks, wireless signals can also be used to sense the environment. Among various types of sensing media, WiFi signals offer advantages such as widespread availability, low hardware cost, and strong robustness to environmental conditions like light, temperature, and humidity. By analyzing Wi-Fi signals in the environment, it is possible to capture dynamic changes of the human body and accomplish sensing applications such as gesture recognition. Although many existing gesture sensing solutions perform well in-domain but lack cross-domain capabilities (i.e., recognition performance in untrained environments). To address this, we extract Doppler spectra from the channel state information (CSI) received by all receivers and concatenate each Doppler spectrum along the same time axis to generate fused images with multi-angle information as input features. Furthermore, inspired by the convolutional block attention module (CBAM), we propose a gesture recognition network that integrates a multi-semantic spatial attention mechanism with a self-attention-based channel mechanism. This network constructs attention maps to quantify the spatiotemporal features of gestures in images, enabling the extraction of key domain-independent features. Additionally, ResNet18 is employed as the backbone network to further capture deep-level features. To validate the network performance, we evaluate the proposed network on the public Widar3 dataset, and the results show that it not only maintains high in-domain accuracy of 99.72%, but also achieves high performance in cross-domain recognition of 97.61%, significantly outperforming existing best solutions. △ Less

Submitted 4 December, 2025; originally announced December 2025.

arXiv:2512.02461 [pdf, ps, other]

Artificial Noise Aided Physical Layer Security for Near-Field MIMO with Fluid Antenna Systems

Authors: Peng Zhang, Jian Dang, Miaowen Wen, Ziyang Liu, Chen Zhao, Huaifeng Shi, Chengsheng Pan, Zaichen Zhang

Abstract: With the evolution of wireless systems toward large-scale arrays and high-frequency reconfigurable architectures, fluid antenna systems (FAS) operating in the near-field (NF) regime provide new degrees of freedom (DoF) for physical layer security (PLS). This paper proposes an artificial-noise (AN)-aided PLS scheme for NF fluid-antenna multiple-input multiple-output (FA-MIMO) systems, with joint be… ▽ More With the evolution of wireless systems toward large-scale arrays and high-frequency reconfigurable architectures, fluid antenna systems (FAS) operating in the near-field (NF) regime provide new degrees of freedom (DoF) for physical layer security (PLS). This paper proposes an artificial-noise (AN)-aided PLS scheme for NF fluid-antenna multiple-input multiple-output (FA-MIMO) systems, with joint beamforming (BF) and AN design for both compact and large arrays. An alternating-optimization (AO) framework addresses the sparsity-constrained non-convex design by splitting it into a continuous BF/AN joint-design subproblem and a discrete FAS port-selection subproblem. Closed-form fully digital BF/AN solutions are obtained via a generalized spectral water-filling procedure within a block coordinate descent (BCD) surrogate and realized by a hardware-efficient hybrid beamforming (HBF) architecture that embeds AN in the baseband without extra radio-frequency (RF) chains. For FAS port selection, a row-energy based prune--refit rule, aligned with Karush--Kuhn--Tucker (KKT) conditions of a group-sparsity surrogate, enables efficient active-port determination. Simulation results confirm that the proposed design exploits the geometry and position-domain DoF of FAS and significantly improves secrecy performance, particularly for non-extremely-large arrays where NF beam focusing alone is inadequate. △ Less

Submitted 2 December, 2025; originally announced December 2025.

arXiv:2512.02353 [pdf, ps, other]

A Cyclic Shift Embedded Pilot based Channel Estimation for Multi-User MIMO-OTFS systems with fractional delay and Doppler

Authors: Ruizhe Wang, Hong Ren, Cunhua Pan, Ruisong Weng, Jiangzhou Wang

Abstract: Orthogonal time frequency space (OTFS) modulation has been proposed to meet the demand for reliable communication in high-mobility scenarios for future wireless networks. However, in multi-user OTFS systems, conventional embedded pilot schemes require independent pilot allocation for each user, leading to linearly increasing pilot overhead. To address these issues, in this paper, we investigate th… ▽ More Orthogonal time frequency space (OTFS) modulation has been proposed to meet the demand for reliable communication in high-mobility scenarios for future wireless networks. However, in multi-user OTFS systems, conventional embedded pilot schemes require independent pilot allocation for each user, leading to linearly increasing pilot overhead. To address these issues, in this paper, we investigate the uplink channel estimation and pilot design for multi-user multiple-input multiple-output (MIMO)-OTFS systems. We propose a multi-dimensional decomposition-based channel estimation algorithm. Specifically, the proposed algorithm first estimates the angles of arrivals (AoAs) via subspace decomposition-based method. A spatial projection matrix, constructed from the estimated AOAs, decouples the received signal by propagation path subspace, effectively mitigating inter-path interference. The remaining fractional delay and Doppler can be obtained by a compressed sensing (CS)-based off-grid channel estimation method. Furthermore, to reduce the pilot overhead in multi-user OTFS systems, this paper proposes a novel cyclic shift embedded pilot (CSEP) structure, which can reuse users through cyclic shift-orthogonality of Zadoff-Chu (ZC) sequences. Compared with conventional embedded pilot structures, the CSEP structure can save over 30\% of pilot overhead. Finally, an imporved channel estimation method based on the CSEP structure is proposed. Simulation results demonstrate that it achieves superior performance in channel estimation. Moreover, the proposed CSEP structure and channel estimation algorithm achieve a favorable balance between computational complexity, estimation accuracy, and bit error rate (BER) performance. △ Less

Submitted 1 December, 2025; originally announced December 2025.

arXiv:2512.01809 [pdf, ps, other]

Much Ado About Noising: Dispelling the Myths of Generative Robotic Control

Authors: Chaoyi Pan, Giri Anantharaman, Nai-Chieh Huang, Claire Jin, Daniel Pfrommer, Chenyang Yuan, Frank Permenter, Guannan Qu, Nicholas Boffi, Guanya Shi, Max Simchowitz

Abstract: Generative models, like flows and diffusions, have recently emerged as popular and efficacious policy parameterizations in robotics. There has been much speculation as to the factors underlying their successes, ranging from capturing multi-modal action distribution to expressing more complex behaviors. In this work, we perform a comprehensive evaluation of popular generative control policies (GCPs… ▽ More Generative models, like flows and diffusions, have recently emerged as popular and efficacious policy parameterizations in robotics. There has been much speculation as to the factors underlying their successes, ranging from capturing multi-modal action distribution to expressing more complex behaviors. In this work, we perform a comprehensive evaluation of popular generative control policies (GCPs) on common behavior cloning (BC) benchmarks. We find that GCPs do not owe their success to their ability to capture multi-modality or to express more complex observation-to-action mappings. Instead, we find that their advantage stems from iterative computation, as long as intermediate steps are supervised during training and this supervision is paired with a suitable level of stochasticity. As a validation of our findings, we show that a minimum iterative policy (MIP), a lightweight two-step regression-based policy, essentially matches the performance of flow GCPs, and often outperforms distilled shortcut models. Our results suggest that the distribution-fitting component of GCPs is less salient than commonly believed, and point toward new design spaces focusing solely on control performance. Project page: https://simchowitzlabpublic.github.io/much-ado-about-noising-project/ △ Less

Submitted 23 February, 2026; v1 submitted 1 December, 2025; originally announced December 2025.

arXiv:2512.01128 [pdf, ps, other]

OmniFD: A Unified Model for Versatile Face Forgery Detection

Authors: Haotian Liu, Haoyu Chen, Chenhui Pan, You Hu, Guoying Zhao, Xiaobai Li

Abstract: Face forgery detection encompasses multiple critical tasks, including identifying forged images and videos and localizing manipulated regions and temporal segments. Current approaches typically employ task-specific models with independent architectures, leading to computational redundancy and ignoring potential correlations across related tasks. We introduce OmniFD, a unified framework that jointl… ▽ More Face forgery detection encompasses multiple critical tasks, including identifying forged images and videos and localizing manipulated regions and temporal segments. Current approaches typically employ task-specific models with independent architectures, leading to computational redundancy and ignoring potential correlations across related tasks. We introduce OmniFD, a unified framework that jointly addresses four core face forgery detection tasks within a single model, i.e., image and video classification, spatial localization, and temporal localization. Our architecture consists of three principal components: (1) a shared Swin Transformer encoder that extracts unified 4D spatiotemporal representations from both images and video inputs, (2) a cross-task interaction module with learnable queries that dynamically captures inter-task dependencies through attention-based reasoning, and (3) lightweight decoding heads that transform refined representations into corresponding predictions for all FFD tasks. Extensive experiments demonstrate OmniFD's advantage over task-specific models. Its unified design leverages multi-task learning to capture generalized representations across tasks, especially enabling fine-grained knowledge transfer that facilitates other tasks. For example, video classification accuracy improves by 4.63% when image data are incorporated. Furthermore, by unifying images, videos and the four tasks within one framework, OmniFD achieves superior performance across diverse benchmarks with high efficiency and scalability, e.g., reducing 63% model parameters and 50% training time. It establishes a practical and generalizable solution for comprehensive face forgery detection in real-world applications. The source code is made available at https://github.com/haotianll/OmniFD. △ Less

Submitted 30 November, 2025; originally announced December 2025.

arXiv:2511.21796 [pdf, ps, other]

Sneak Path Current Modeling in Memristor Crossbar Arrays for Analog In-Memory Computing

Authors: Shah Zayed Riam, Zhenlin Pei, Kyle Mooney, Chenyun Pan, Na Gong, Jinhui Wang

Abstract: Memristor crossbar arrays have emerged as a key component for next-generation non-volatile memories, artificial neural networks, and analog in-memory computing (IMC) systems. By minimizing data transfer between the processor and memory, they offer substantial energy savings. However, a major design challenge in memristor crossbar arrays is the presence of sneak path currents, which degrade electri… ▽ More Memristor crossbar arrays have emerged as a key component for next-generation non-volatile memories, artificial neural networks, and analog in-memory computing (IMC) systems. By minimizing data transfer between the processor and memory, they offer substantial energy savings. However, a major design challenge in memristor crossbar arrays is the presence of sneak path currents, which degrade electrical performance, reduce noise margins, and limit reliable operations. This work presents a closed-form analytical framework based on 1.4nm technology for accurately estimating sneak path currents in memristor crossbar arrays. The proposed model captures the interdependence of key design parameters in memristor crossbar arrays, including array size, ON/OFF ratio of memristors, read voltage, and interconnect conditions, through mathematically derived relationships. It supports various practical configurations, such as different data patterns and connection strategies, enabling rapid and comprehensive sneak path current modeling. The sensitivity analysis includes how design parameters influence sneak path current and noise margin loss, underscoring the trade-offs involved in scaling crossbar arrays. Validation through SPICE simulations shows that the model achieves an error of less than 10.9% while being up to 4784 times faster than full circuit simulations. This analytical framework offers a powerful tool for quantitative assessment and pre-design/real-time optimization of memristor-based analog in-memory computing (IMC) architectures. △ Less

Submitted 14 January, 2026; v1 submitted 26 November, 2025; originally announced November 2025.

arXiv:2511.18794 [pdf, ps, other]

ChronoGS: Disentangling Invariants and Changes in Multi-Period Scenes

Authors: Zhongtao Wang, Jiaqi Dai, Qingtian Zhu, Yilong Li, Mai Su, Fei Zhu, Meng Gai, Shaorong Wang, Chengwei Pan, Yisong Chen, Guoping Wang

Abstract: Multi-period image collections are common in real-world applications. Cities are re-scanned for mapping, construction sites are revisited for progress tracking, and natural regions are monitored for environmental change. Such data form multi-period scenes, where geometry and appearance evolve. Reconstructing such scenes is an important yet underexplored problem. Existing pipelines rely on incompat… ▽ More Multi-period image collections are common in real-world applications. Cities are re-scanned for mapping, construction sites are revisited for progress tracking, and natural regions are monitored for environmental change. Such data form multi-period scenes, where geometry and appearance evolve. Reconstructing such scenes is an important yet underexplored problem. Existing pipelines rely on incompatible assumptions: static and in-the-wild methods enforce a single geometry, while dynamic ones assume smooth motion, both failing under long-term, discontinuous changes. To solve this problem, we introduce ChronoGS, a temporally modulated Gaussian representation that reconstructs all periods within a unified anchor scaffold. It's also designed to disentangle stable and evolving components, achieving temporally consistent reconstruction of multi-period scenes. To catalyze relevant research, we release ChronoScene dataset, a benchmark of real and synthetic multi-period scenes, capturing geometric and appearance variation. Experiments demonstrate that ChronoGS consistently outperforms baselines in reconstruction quality and temporal consistency. Our code and the ChronoScene dataset are publicly available at https://github.com/ZhongtaoWang/ChronoGS. △ Less

Submitted 24 November, 2025; originally announced November 2025.

MSC Class: 68U05

arXiv:2511.18538 [pdf, ps, other]

From Code Foundation Models to Agents and Applications: A Comprehensive Survey and Practical Guide to Code Intelligence

Authors: Jian Yang, Xianglong Liu, Weifeng Lv, Ken Deng, Shawn Guo, Lin Jing, Yizhi Li, Shark Liu, Xianzhen Luo, Yuyu Luo, Changzai Pan, Ensheng Shi, Yingshui Tan, Renshuai Tao, Jiajun Wu, Xianjie Wu, Zhenhe Wu, Daoguang Zan, Chenchen Zhang, Wei Zhang, He Zhu, Terry Yue Zhuo, Kerui Cao, Xianfu Cheng, Jun Dong , et al. (46 additional authors not shown)

Abstract: Large language models (LLMs) have fundamentally transformed automated software development by enabling direct translation of natural language descriptions into functional code, driving commercial adoption through tools like Github Copilot (Microsoft), Cursor (Anysphere), Trae (ByteDance), and Claude Code (Anthropic). While the field has evolved dramatically from rule-based systems to Transformer-b… ▽ More Large language models (LLMs) have fundamentally transformed automated software development by enabling direct translation of natural language descriptions into functional code, driving commercial adoption through tools like Github Copilot (Microsoft), Cursor (Anysphere), Trae (ByteDance), and Claude Code (Anthropic). While the field has evolved dramatically from rule-based systems to Transformer-based architectures, achieving performance improvements from single-digit to over 95\% success rates on benchmarks like HumanEval. In this work, we provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) about code LLMs, systematically examining the complete model life cycle from data curation to post-training through advanced prompting paradigms, code pre-training, supervised fine-tuning, reinforcement learning, and autonomous coding agents. We analyze the code capability of the general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder), critically examining the techniques, design decisions, and trade-offs. Further, we articulate the research-practice gap between academic research (e.g., benchmarks and tasks) and real-world deployment (e.g., software-related code tasks), including code correctness, security, contextual awareness of large codebases, and integration with development workflows, and map promising research directions to practical needs. Last, we conduct a series of experiments to provide a comprehensive analysis of code pre-training, supervised fine-tuning, and reinforcement learning, covering scaling law, framework selection, hyperparameter sensitivity, model architectures, and dataset comparisons. △ Less

Submitted 6 December, 2025; v1 submitted 23 November, 2025; originally announced November 2025.

arXiv:2511.09484 [pdf, ps, other]

SPIDER: Scalable Physics-Informed Dexterous Retargeting

Authors: Chaoyi Pan, Changhao Wang, Haozhi Qi, Zixi Liu, Homanga Bharadhwaj, Akash Sharma, Tingfan Wu, Guanya Shi, Jitendra Malik, Francois Hogan

Abstract: Learning dexterous and agile policy for humanoid and dexterous hand control requires large-scale demonstrations, but collecting robot-specific data is prohibitively expensive. In contrast, abundant human motion data is readily available from motion capture, videos, and virtual reality, which could help address the data scarcity problem. However, due to the embodiment gap and missing dynamic inform… ▽ More Learning dexterous and agile policy for humanoid and dexterous hand control requires large-scale demonstrations, but collecting robot-specific data is prohibitively expensive. In contrast, abundant human motion data is readily available from motion capture, videos, and virtual reality, which could help address the data scarcity problem. However, due to the embodiment gap and missing dynamic information like force and torque, these demonstrations cannot be directly executed on robots. To bridge this gap, we propose Scalable Physics-Informed DExterous Retargeting (SPIDER), a physics-based retargeting framework to transform and augment kinematic-only human demonstrations to dynamically feasible robot trajectories at scale. Our key insight is that human demonstrations should provide global task structure and objective, while large-scale physics-based sampling with curriculum-style virtual contact guidance should refine trajectories to ensure dynamical feasibility and correct contact sequences. SPIDER scales across diverse 9 humanoid/dexterous hand embodiments and 6 datasets, improving success rates by 18% compared to standard sampling, while being 10X faster than reinforcement learning (RL) baselines, and enabling the generation of a 2.4M frames dynamic-feasible robot dataset for policy learning. As a universal physics-based retargeting method, SPIDER can work with diverse quality data and generate diverse and high-quality data to enable efficient policy learning with methods like RL. △ Less

Submitted 5 February, 2026; v1 submitted 12 November, 2025; originally announced November 2025.

Comments: Project website: https://jc-bao.github.io/spider-project/

arXiv:2511.04029 [pdf, ps, other]

Faithful Contouring: Near-Lossless 3D Voxel Representation Free from Iso-surface

Authors: Yihao Luo, Xianglong He, Chuanyu Pan, Yiwen Chen, Jiaqi Wu, Yangguang Li, Wanli Ouyang, Yuanming Hu, Guang Yang, ChoonHwai Yap

Abstract: Accurate and efficient voxelized representations of 3D meshes are the foundation of 3D reconstruction and generation. However, existing representations based on iso-surface heavily rely on water-tightening or rendering optimization, which inevitably compromise geometric fidelity. We propose Faithful Contouring, a sparse voxelized representation that supports 2048+ resolutions for arbitrary meshes,… ▽ More Accurate and efficient voxelized representations of 3D meshes are the foundation of 3D reconstruction and generation. However, existing representations based on iso-surface heavily rely on water-tightening or rendering optimization, which inevitably compromise geometric fidelity. We propose Faithful Contouring, a sparse voxelized representation that supports 2048+ resolutions for arbitrary meshes, requiring neither converting meshes to field functions nor extracting the isosurface during remeshing. It achieves near-lossless fidelity by preserving sharpness and internal structures, even for challenging cases with complex geometry and topology. The proposed method also shows flexibility for texturing, manipulation, and editing. Beyond representation, we design a dual-mode autoencoder for Faithful Contouring, enabling scalable and detail-preserving shape reconstruction. Extensive experiments show that Faithful Contouring surpasses existing methods in accuracy and efficiency for both representation and reconstruction. For direct representation, it achieves distance errors at the $10^{-5}$ level; for mesh reconstruction, it yields a 93\% reduction in Chamfer Distance and a 35\% improvement in F-score over strong baselines, confirming superior fidelity as a representation for 3D learning tasks. △ Less

Submitted 12 November, 2025; v1 submitted 5 November, 2025; originally announced November 2025.

arXiv:2510.22339 [pdf, ps, other]

Estimating Continuum Robot Shape under External Loading using Spatiotemporal Neural Networks

Authors: Enyi Wang, Zhen Deng, Chuanchuan Pan, Bingwei He, Jianwei Zhang

Abstract: This paper presents a learning-based approach for accurately estimating the 3D shape of flexible continuum robots subjected to external loads. The proposed method introduces a spatiotemporal neural network architecture that fuses multi-modal inputs, including current and historical tendon displacement data and RGB images, to generate point clouds representing the robot's deformed configuration. Th… ▽ More This paper presents a learning-based approach for accurately estimating the 3D shape of flexible continuum robots subjected to external loads. The proposed method introduces a spatiotemporal neural network architecture that fuses multi-modal inputs, including current and historical tendon displacement data and RGB images, to generate point clouds representing the robot's deformed configuration. The network integrates a recurrent neural module for temporal feature extraction, an encoding module for spatial feature extraction, and a multi-modal fusion module to combine spatial features extracted from visual data with temporal dependencies from historical actuator inputs. Continuous 3D shape reconstruction is achieved by fitting Bézier curves to the predicted point clouds. Experimental validation demonstrates that our approach achieves high precision, with mean shape estimation errors of 0.08 mm (unloaded) and 0.22 mm (loaded), outperforming state-of-the-art methods in shape sensing for TDCRs. The results validate the efficacy of deep learning-based spatiotemporal data fusion for precise shape estimation under loading conditions. △ Less

Submitted 25 October, 2025; originally announced October 2025.

Comments: 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

arXiv:2510.18477 [pdf, ps, other]

LAFA: Agentic LLM-Driven Federated Analytics over Decentralized Data Sources

Authors: Haichao Ji, Zibo Wang, Cheng Pan, Meng Han, Yifei Zhu, Dan Wang, Zhu Han

Abstract: Large Language Models (LLMs) have shown great promise in automating data analytics tasks by interpreting natural language queries and generating multi-operation execution plans. However, existing LLM-agent-based analytics frameworks operate under the assumption of centralized data access, offering little to no privacy protection. In contrast, federated analytics (FA) enables privacy-preserving com… ▽ More Large Language Models (LLMs) have shown great promise in automating data analytics tasks by interpreting natural language queries and generating multi-operation execution plans. However, existing LLM-agent-based analytics frameworks operate under the assumption of centralized data access, offering little to no privacy protection. In contrast, federated analytics (FA) enables privacy-preserving computation across distributed data sources, but lacks support for natural language input and requires structured, machine-readable queries. In this work, we present LAFA, the first system that integrates LLM-agent-based data analytics with FA. LAFA introduces a hierarchical multi-agent architecture that accepts natural language queries and transforms them into optimized, executable FA workflows. A coarse-grained planner first decomposes complex queries into sub-queries, while a fine-grained planner maps each subquery into a Directed Acyclic Graph of FA operations using prior structural knowledge. To improve execution efficiency, an optimizer agent rewrites and merges multiple DAGs, eliminating redundant operations and minimizing computational and communicational overhead. Our experiments demonstrate that LAFA consistently outperforms baseline prompting strategies by achieving higher execution plan success rates and reducing resource-intensive FA operations by a substantial margin. This work establishes a practical foundation for privacy-preserving, LLM-driven analytics that supports natural language input in the FA setting. △ Less

Submitted 30 October, 2025; v1 submitted 21 October, 2025; originally announced October 2025.

Comments: This paper has been accepted by the 16th IEEE International Conference on Cloud Computing Technology and Science (CloudCom 2025)

arXiv:2510.13995 [pdf, ps, other]

Finding Holes: Pathologist Level Performance Using AI for Cribriform Morphology Detection in Prostate Cancer

Authors: Kelvin Szolnoky, Anders Blilie, Nita Mulliqi, Toyonori Tsuzuki, Hemamali Samaratunga, Matteo Titus, Xiaoyi Ji, Sol Erika Boman, Einar Gudlaugsson, Svein Reidar Kjosavik, José Asenjo, Marcello Gambacorta, Paolo Libretti, Marcin Braun, Radisław Kordek, Roman Łowicki, Brett Delahunt, Kenneth A. Iczkowski, Theo van der Kwast, Geert J. L. H. van Leenders, Katia R. M. Leite, Chin-Chen Pan, Emiel Adrianus Maria Janssen, Martin Eklund, Lars Egevad , et al. (1 additional authors not shown)

Abstract: Background: Cribriform morphology in prostate cancer is a histological feature that indicates poor prognosis and contraindicates active surveillance. However, it remains underreported and subject to significant interobserver variability amongst pathologists. We aimed to develop and validate an AI-based system to improve cribriform pattern detection. Methods: We created a deep learning model usin… ▽ More Background: Cribriform morphology in prostate cancer is a histological feature that indicates poor prognosis and contraindicates active surveillance. However, it remains underreported and subject to significant interobserver variability amongst pathologists. We aimed to develop and validate an AI-based system to improve cribriform pattern detection. Methods: We created a deep learning model using an EfficientNetV2-S encoder with multiple instance learning for end-to-end whole-slide classification. The model was trained on 640 digitised prostate core needle biopsies from 430 patients, collected across three cohorts. It was validated internally (261 slides from 171 patients) and externally (266 slides, 104 patients from three independent cohorts). Internal validation cohorts included laboratories or scanners from the development set, while external cohorts used completely independent instruments and laboratories. Annotations were provided by three expert uropathologists with known high concordance. Additionally, we conducted an inter-rater analysis and compared the model's performance against nine expert uropathologists on 88 slides from the internal validation cohort. Results: The model showed strong internal validation performance (AUC: 0.97, 95% CI: 0.95-0.99; Cohen's kappa: 0.81, 95% CI: 0.72-0.89) and robust external validation (AUC: 0.90, 95% CI: 0.86-0.93; Cohen's kappa: 0.55, 95% CI: 0.45-0.64). In our inter-rater analysis, the model achieved the highest average agreement (Cohen's kappa: 0.66, 95% CI: 0.57-0.74), outperforming all nine pathologists whose Cohen's kappas ranged from 0.35 to 0.62. Conclusion: Our AI model demonstrates pathologist-level performance for cribriform morphology detection in prostate cancer. This approach could enhance diagnostic reliability, standardise reporting, and improve treatment decisions for prostate cancer patients. △ Less

Submitted 15 October, 2025; originally announced October 2025.

arXiv:2510.10396 [pdf, ps, other]

MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations

Authors: Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Xintong Hu, Yu Zhang, Li Tang, Rui Yang, Han Wang, Zongbao Zhang, Yuhan Wang, Yixuan Chen, Hankun Xu, Ke Xu, Pengfei Fan, Zhetao Chen, Yanhao Yu, Qiange Huang, Fei Wu, Zhou Zhao

Abstract: Humans rely on multisensory integration to perceive spatial environments, where auditory cues enable sound source localization in three-dimensional space. Despite the critical role of spatial audio in immersive technologies such as VR/AR, most existing multimodal datasets provide only monaural audio, which limits the development of spatial audio generation and understanding. To address these chall… ▽ More Humans rely on multisensory integration to perceive spatial environments, where auditory cues enable sound source localization in three-dimensional space. Despite the critical role of spatial audio in immersive technologies such as VR/AR, most existing multimodal datasets provide only monaural audio, which limits the development of spatial audio generation and understanding. To address these challenges, we introduce MRSAudio, a large-scale multimodal spatial audio dataset designed to advance research in spatial audio understanding and generation. MRSAudio spans four distinct components: MRSLife, MRSSpeech, MRSMusic, and MRSSing, covering diverse real-world scenarios. The dataset includes synchronized binaural and ambisonic audio, exocentric and egocentric video, motion trajectories, and fine-grained annotations such as transcripts, phoneme boundaries, lyrics, scores, and prompts. To demonstrate the utility and versatility of MRSAudio, we establish five foundational tasks: audio spatialization, and spatial text to speech, spatial singing voice synthesis, spatial music generation and sound event localization and detection. Results show that MRSAudio enables high-quality spatial modeling and supports a broad range of spatial audio research. Demos and dataset access are available at https://mrsaudio.github.io. △ Less

Submitted 17 October, 2025; v1 submitted 11 October, 2025; originally announced October 2025.

Comments: 24 pages

arXiv:2510.09266 [pdf, ps, other]

CFVBench: A Comprehensive Video Benchmark for Fine-grained Multimodal Retrieval-Augmented Generation

Authors: Kaiwen Wei, Xiao Liu, Jie Zhang, Zijian Wang, Ruida Liu, Yuming Yang, Xin Xiao, Xiao Sun, Haoyang Zeng, Changzai Pan, Yidan Zhang, Jiang Zhong, Peijin Wang, Yingchao Feng

Abstract: Multimodal Retrieval-Augmented Generation (MRAG) enables Multimodal Large Language Models (MLLMs) to generate responses with external multimodal evidence, and numerous video-based MRAG benchmarks have been proposed to evaluate model capabilities across retrieval and generation stages. However, existing benchmarks remain limited in modality coverage and format diversity, often focusing on single- o… ▽ More Multimodal Retrieval-Augmented Generation (MRAG) enables Multimodal Large Language Models (MLLMs) to generate responses with external multimodal evidence, and numerous video-based MRAG benchmarks have been proposed to evaluate model capabilities across retrieval and generation stages. However, existing benchmarks remain limited in modality coverage and format diversity, often focusing on single- or limited-modality tasks, or coarse-grained scene understanding. To address these gaps, we introduce CFVBench, a large-scale, manually verified benchmark constructed from 599 publicly available videos, yielding 5,360 open-ended QA pairs. CFVBench spans high-density formats and domains such as chart-heavy reports, news broadcasts, and software tutorials, requiring models to retrieve and reason over long temporal video spans while maintaining fine-grained multimodal information. Using CFVBench, we systematically evaluate 7 retrieval methods and 14 widely-used MLLMs, revealing a critical bottleneck: current models (even GPT5 or Gemini) struggle to capture transient yet essential fine-grained multimodal details. To mitigate this, we propose Adaptive Visual Refinement (AVR), a simple yet effective framework that adaptively increases frame sampling density and selectively invokes external tools when necessary. Experiments show that AVR consistently enhances fine-grained multimodal comprehension and improves performance across all evaluated MLLMs △ Less

Submitted 10 October, 2025; originally announced October 2025.

arXiv:2510.02614 [pdf, ps, other]

UMI-on-Air: Embodiment-Aware Guidance for Embodiment-Agnostic Visuomotor Policies

Authors: Harsh Gupta, Xiaofeng Guo, Huy Ha, Chuer Pan, Muqing Cao, Dongjae Lee, Sebastian Scherer, Shuran Song, Guanya Shi

Abstract: We introduce UMI-on-Air, a framework for embodiment-aware deployment of embodiment-agnostic manipulation policies. Our approach leverages diverse, unconstrained human demonstrations collected with a handheld gripper (UMI) to train generalizable visuomotor policies. A central challenge in transferring these policies to constrained robotic embodiments-such as aerial manipulators-is the mismatch in c… ▽ More We introduce UMI-on-Air, a framework for embodiment-aware deployment of embodiment-agnostic manipulation policies. Our approach leverages diverse, unconstrained human demonstrations collected with a handheld gripper (UMI) to train generalizable visuomotor policies. A central challenge in transferring these policies to constrained robotic embodiments-such as aerial manipulators-is the mismatch in control and robot dynamics, which often leads to out-of-distribution behaviors and poor execution. To address this, we propose Embodiment-Aware Diffusion Policy (EADP), which couples a high-level UMI policy with a low-level embodiment-specific controller at inference time. By integrating gradient feedback from the controller's tracking cost into the diffusion sampling process, our method steers trajectory generation towards dynamically feasible modes tailored to the deployment embodiment. This enables plug-and-play, embodiment-aware trajectory adaptation at test time. We validate our approach on multiple long-horizon and high-precision aerial manipulation tasks, showing improved success rates, efficiency, and robustness under disturbances compared to unguided diffusion baselines. Finally, we demonstrate deployment in previously unseen environments, using UMI demonstrations collected in the wild, highlighting a practical pathway for scaling generalizable manipulation skills across diverse-and even highly constrained-embodiments. All code, data, checkpoints, and result videos can be found at umi-on-air.github.io. △ Less

Submitted 13 March, 2026; v1 submitted 2 October, 2025; originally announced October 2025.

Comments: Result videos can be found at umi-on-air.github.io

arXiv:2509.01312 [pdf, ps, other]

TableZoomer: A Collaborative Agent Framework for Large-scale Table Question Answering

Authors: Sishi Xiong, Ziyang He, Zhongjiang He, Yu Zhao, Changzai Pan, Jie Zhang, Zhenhe Wu, Shuangyong Song, Yongxiang Li

Abstract: While large language models (LLMs) have shown promise in the table question answering (TQA) task through prompt engineering, they face challenges in industrial applications, including structural heterogeneity, difficulties in target data localization, and bottlenecks in complex reasoning. To address these limitations, this paper presents TableZoomer, a novel LLM-powered, programming-based agent fr… ▽ More While large language models (LLMs) have shown promise in the table question answering (TQA) task through prompt engineering, they face challenges in industrial applications, including structural heterogeneity, difficulties in target data localization, and bottlenecks in complex reasoning. To address these limitations, this paper presents TableZoomer, a novel LLM-powered, programming-based agent framework. It introduces three key innovations: (1) replacing the original fully verbalized table with structured table schema to bridge the semantic gap and reduce computational complexity; (2) a query-aware table zooming mechanism that dynamically generates sub-table schema through column selection and entity linking, significantly improving target localization efficiency; and (3) a Program-of-Thoughts (PoT) strategy that transforms queries into executable code to mitigate numerical hallucination. Additionally, we integrate the reasoning workflow with the ReAct paradigm to enable iterative reasoning. Extensive experiments demonstrate that our framework maintains the usability advantages while substantially enhancing performance and scalability across tables of varying scales. When implemented with the Qwen3-8B-Instruct LLM, TableZoomer achieves accuracy improvements of 19.34% and 25% over conventional PoT methods on the large-scale DataBench dataset and the small-scale Fact Checking task of TableBench dataset, respectively. △ Less

Submitted 1 September, 2025; originally announced September 2025.

arXiv:2508.19813 [pdf, ps, other]

T2R-bench: A Benchmark for Generating Article-Level Reports from Real World Industrial Tables

Authors: Jie Zhang, Changzai Pan, Kaiwen Wei, Sishi Xiong, Yu Zhao, Xiangyu Li, Jiaxin Peng, Xiaoyan Gu, Jian Yang, Wenhan Chang, Zhenhe Wu, Jiang Zhong, Shuangyong Song, Yongxiang Li, Xuelong Li

Abstract: Extensive research has been conducted to explore the capabilities of large language models (LLMs) in table reasoning. However, the essential task of transforming tables information into reports remains a significant challenge for industrial applications. This task is plagued by two critical issues: 1) the complexity and diversity of tables lead to suboptimal reasoning outcomes; and 2) existing tab… ▽ More Extensive research has been conducted to explore the capabilities of large language models (LLMs) in table reasoning. However, the essential task of transforming tables information into reports remains a significant challenge for industrial applications. This task is plagued by two critical issues: 1) the complexity and diversity of tables lead to suboptimal reasoning outcomes; and 2) existing table benchmarks lack the capacity to adequately assess the practical application of this task. To fill this gap, we propose the table-to-report task and construct a bilingual benchmark named T2R-bench, where the key information flow from the tables to the reports for this task. The benchmark comprises 457 industrial tables, all derived from real-world scenarios and encompassing 19 industry domains as well as 4 types of industrial tables. Furthermore, we propose an evaluation criteria to fairly measure the quality of report generation. The experiments on 25 widely-used LLMs reveal that even state-of-the-art models like Deepseek-R1 only achieves performance with 62.71 overall score, indicating that LLMs still have room for improvement on T2R-bench. △ Less

Submitted 23 September, 2025; v1 submitted 27 August, 2025; originally announced August 2025.

arXiv:2508.16379 [pdf, ps, other]

Agentic AI Empowered Multi-UAV Trajectory Optimization in Low-Altitude Economy Networks

Authors: Feibo Jiang, Li Dong, Xitao Pan, Kezhi Wang, Cunhua Pan

Abstract: This paper proposes a novel Agentic Retrieval-augmented generation with Mamba-Attention Integrated Transformer (ARMAIT) framework for multi-Unmanned Aerial Vehicle (UAV) trajectory optimization. The framework is built upon Large Language Models (LLMs), incorporating Retrieval-Augmented Generation (RAG) empowered by Agentic AI and integrated with a UAV-specific knowledge base. Through the Agentic R… ▽ More This paper proposes a novel Agentic Retrieval-augmented generation with Mamba-Attention Integrated Transformer (ARMAIT) framework for multi-Unmanned Aerial Vehicle (UAV) trajectory optimization. The framework is built upon Large Language Models (LLMs), incorporating Retrieval-Augmented Generation (RAG) empowered by Agentic AI and integrated with a UAV-specific knowledge base. Through the Agentic RAG, the LLM autonomously interprets high-level task requirements and identifies the key components necessary for trajectory optimization, including model inputs and outputs, network architecture, reward functions, and task constraints. To support efficient modeling across different system scales, we introduce the Mamba-Attention Integrated Transformer (MAIT), a hybrid neural architecture that combines the long-range dependency modeling capability of attention mechanisms with the efficient temporal dynamic representation of Mamba. Furthermore, a Trajectory-Group Relative Policy Optimization (T-GRPO) method is proposed to achieve unified policy gradient optimization in both discrete and continuous trajectory spaces for MAIT training. Extensive experimental results validate the feasibility and effectiveness of the proposed ARMAIT framework. △ Less

Submitted 22 August, 2025; originally announced August 2025.

arXiv:2508.10924 [pdf, ps, other]

ASAudio: A Survey of Advanced Spatial Audio Research

Authors: Zhiyuan Zhu, Yu Zhang, Wenxiang Guo, Changhao Pan, Zhou Zhao

Abstract: With the rapid development of spatial audio technologies today, applications in AR, VR, and other scenarios have garnered extensive attention. Unlike traditional mono sound, spatial audio offers a more realistic and immersive auditory experience. Despite notable progress in the field, there remains a lack of comprehensive surveys that systematically organize and analyze these methods and their und… ▽ More With the rapid development of spatial audio technologies today, applications in AR, VR, and other scenarios have garnered extensive attention. Unlike traditional mono sound, spatial audio offers a more realistic and immersive auditory experience. Despite notable progress in the field, there remains a lack of comprehensive surveys that systematically organize and analyze these methods and their underlying technologies. In this paper, we provide a comprehensive overview of spatial audio and systematically review recent literature in the area. To address this, we chronologically outlining existing work related to spatial audio and categorize these studies based on input-output representations, as well as generation and understanding tasks, thereby summarizing various research aspects of spatial audio. In addition, we review related datasets, evaluation metrics, and benchmarks, offering insights from both training and evaluation perspectives. Related materials are available at https://github.com/dieKarotte/ASAudio. △ Less

Submitted 20 August, 2025; v1 submitted 8 August, 2025; originally announced August 2025.

arXiv:2508.08606 [pdf, ps, other]

Distributed optimization: designed for federated learning

Authors: Wenyou Guo, Ting Qu, Chunrong Pan, George Q. Huang

Abstract: Federated learning (FL), as a distributed collaborative machine learning (ML) framework under privacy-preserving constraints, has garnered increasing research attention in cross-organizational data collaboration scenarios. This paper proposes a class of distributed optimization algorithms based on the augmented Lagrangian technique, designed to accommodate diverse communication topologies in both… ▽ More Federated learning (FL), as a distributed collaborative machine learning (ML) framework under privacy-preserving constraints, has garnered increasing research attention in cross-organizational data collaboration scenarios. This paper proposes a class of distributed optimization algorithms based on the augmented Lagrangian technique, designed to accommodate diverse communication topologies in both centralized and decentralized FL settings. Furthermore, we develop multiple termination criteria and parameter update mechanisms to enhance computational efficiency, accompanied by rigorous theoretical guarantees of convergence. By generalizing the augmented Lagrangian relaxation through the incorporation of proximal relaxation and quadratic approximation, our framework systematically recovers a broad of classical unconstrained optimization methods, including proximal algorithm, classic gradient descent, and stochastic gradient descent, among others. Notably, the convergence properties of these methods can be naturally derived within the proposed theoretical framework. Numerical experiments demonstrate that the proposed algorithm exhibits strong performance in large-scale settings with significant statistical heterogeneity across clients. △ Less

Submitted 30 October, 2025; v1 submitted 11 August, 2025; originally announced August 2025.

Comments: 16 pages, 6 figures

arXiv:2508.05087 [pdf, ps, other]

doi 10.1145/3746027.3754561

JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering

Authors: Renmiao Chen, Shiyao Cui, Xuancheng Huang, Chengwei Pan, Victor Shea-Jay Huang, QingLin Zhang, Xuan Ouyang, Zhexin Zhang, Hongning Wang, Minlie Huang

Abstract: Jailbreak attacks against multimodal large language Models (MLLMs) are a significant research focus. Current research predominantly focuses on maximizing attack success rate (ASR), often overlooking whether the generated responses actually fulfill the attacker's malicious intent. This oversight frequently leads to low-quality outputs that bypass safety filters but lack substantial harmful content.… ▽ More Jailbreak attacks against multimodal large language Models (MLLMs) are a significant research focus. Current research predominantly focuses on maximizing attack success rate (ASR), often overlooking whether the generated responses actually fulfill the attacker's malicious intent. This oversight frequently leads to low-quality outputs that bypass safety filters but lack substantial harmful content. To address this gap, we propose JPS, \underline{J}ailbreak MLLMs with collaborative visual \underline{P}erturbation and textual \underline{S}teering, which achieves jailbreaks via corporation of visual image and textually steering prompt. Specifically, JPS utilizes target-guided adversarial image perturbations for effective safety bypass, complemented by "steering prompt" optimized via a multi-agent system to specifically guide LLM responses fulfilling the attackers' intent. These visual and textual components undergo iterative co-optimization for enhanced performance. To evaluate the quality of attack outcomes, we propose the Malicious Intent Fulfillment Rate (MIFR) metric, assessed using a Reasoning-LLM-based evaluator. Our experiments show JPS sets a new state-of-the-art in both ASR and MIFR across various MLLMs and benchmarks, with analyses confirming its efficacy. Codes are available at \href{https://github.com/thu-coai/JPS}{https://github.com/thu-coai/JPS}. \color{warningcolor}{Warning: This paper contains potentially sensitive contents.} △ Less

Submitted 7 August, 2025; originally announced August 2025.

Comments: 10 pages, 3 tables, 2 figures, to appear in the Proceedings of the 33rd ACM International Conference on Multimedia (MM '25)

ACM Class: I.2.7; K.4.1; K.6.5

arXiv:2507.19017 [pdf, ps, other]

MindSpeed RL: Distributed Dataflow for Scalable and Efficient RL Training on Ascend NPU Cluster

Authors: Laingjun Feng, Chenyi Pan, Xinjie Guo, Fei Mei, Benzhe Ning, Jianxiang Zhang, Xinyang Liu, Beirong Zhou, Zeng Shu, Chang Liu, Guang Yang, Zhenyu Han, Jiangben Wang, Bo Wang

Abstract: Reinforcement learning (RL) is a paradigm increasingly used to align large language models. Popular RL algorithms utilize multiple workers and can be modeled as a graph, where each node is the status of a worker and each edge represents dataflow between nodes. Owing to the heavy cross-node dependencies, the RL training system usually suffers from poor cluster scalability and low memory utilization… ▽ More Reinforcement learning (RL) is a paradigm increasingly used to align large language models. Popular RL algorithms utilize multiple workers and can be modeled as a graph, where each node is the status of a worker and each edge represents dataflow between nodes. Owing to the heavy cross-node dependencies, the RL training system usually suffers from poor cluster scalability and low memory utilization. In this article, we introduce MindSpeed RL, an effective and efficient system for large-scale RL training. Unlike existing centralized methods, MindSpeed RL organizes the essential data dependencies in RL training, i.e., sample flow and resharding flow, from a distributed view. On the one hand, a distributed transfer dock strategy, which sets controllers and warehouses on the basis of the conventional replay buffer, is designed to release the dispatch overhead in the sample flow. A practical allgather--swap strategy is presented to eliminate redundant memory usage in resharding flow. In addition, MindSpeed RL further integrates numerous parallelization strategies and acceleration techniques for systematic optimization. Compared with existing state-of-the-art systems, comprehensive experiments on the RL training of popular Qwen2.5-Dense-7B/32B, Qwen3-MoE-30B, and DeepSeek-R1-MoE-671B show that MindSpeed RL increases the throughput by 1.42 ~ 3.97 times. Finally, we open--source MindSpeed RL and perform all the experiments on a super pod of Ascend with 384 neural processing units (NPUs) to demonstrate the powerful performance and reliability of Ascend. △ Less

Submitted 25 July, 2025; originally announced July 2025.

Comments: 9 pages

MSC Class: CS

arXiv:2507.16666 [pdf, ps, other]

Reconfigurable Intelligent Surface-Enabled Green and Secure Offloading for Mobile Edge Computing Networks

Authors: Tong-Xing Zheng, Xinji Wang, Xin Chen, Di Mao, Jia Shi, Cunhua Pan, Chongwen Huang, Haiyang Ding, Zan Li

Abstract: This paper investigates a multi-user uplink mobile edge computing (MEC) network, where the users offload partial tasks securely to an access point under the non-orthogonal multiple access policy with the aid of a reconfigurable intelligent surface (RIS) against a multi-antenna eavesdropper. We formulate a non-convex optimization problem of minimizing the total energy consumption subject to secure… ▽ More This paper investigates a multi-user uplink mobile edge computing (MEC) network, where the users offload partial tasks securely to an access point under the non-orthogonal multiple access policy with the aid of a reconfigurable intelligent surface (RIS) against a multi-antenna eavesdropper. We formulate a non-convex optimization problem of minimizing the total energy consumption subject to secure offloading requirement, and we build an efficient block coordinate descent framework to iteratively optimize the number of local computation bits and transmit power at the users, the RIS phase shifts, and the multi-user detection matrix at the access point. Specifically, we successively adopt successive convex approximation, semi-definite programming, and semidefinite relaxation to solve the problem with perfect eavesdropper's channel state information (CSI), and we then employ S-procedure and penalty convex-concave to achieve robust design for the imperfect CSI case. We provide extensive numerical results to validate the convergence and effectiveness of the proposed algorithms. We demonstrate that RIS plays a significant role in realizing a secure and energy-efficient MEC network, and deploying a well-designed RIS can save energy consumption by up to 60\% compared to that without RIS. We further reveal impacts of various key factors on the secrecy energy efficiency, including RIS element number and deployment position, user number, task scale and duration, and CSI imperfection. △ Less

Submitted 22 July, 2025; originally announced July 2025.

Comments: 15 pages, 9 figures, accepted by IEEE Internet of Things Journal

arXiv:2507.09061 [pdf, ps, other]

Action Chunking and Exploratory Data Collection Yield Exponential Improvements in Behavior Cloning for Continuous Control

Authors: Thomas T. Zhang, Daniel Pfrommer, Chaoyi Pan, Nikolai Matni, Max Simchowitz

Abstract: This paper presents a theoretical analysis of two of the most impactful interventions in modern learning from demonstration in robotics and continuous control: the practice of action-chunking (predicting sequences of actions in open-loop) and exploratory augmentation of expert demonstrations. Though recent results show that learning from demonstration, also known as imitation learning (IL), can su… ▽ More This paper presents a theoretical analysis of two of the most impactful interventions in modern learning from demonstration in robotics and continuous control: the practice of action-chunking (predicting sequences of actions in open-loop) and exploratory augmentation of expert demonstrations. Though recent results show that learning from demonstration, also known as imitation learning (IL), can suffer errors that compound exponentially with task horizon in continuous settings, we demonstrate that action chunking and exploratory data collection circumvent exponential compounding errors in different regimes. Our results identify control-theoretic stability as the key mechanism underlying the benefits of these interventions. On the empirical side, we validate our predictions and the role of control-theoretic stability through experimentation on popular robot learning benchmarks. On the theoretical side, we demonstrate that the control-theoretic lens provides fine-grained insights into how compounding error arises, leading to tighter statistical guarantees on imitation learning error when these interventions are applied than previous techniques based on information-theoretic considerations alone. △ Less

Submitted 26 November, 2025; v1 submitted 11 July, 2025; originally announced July 2025.

Comments: Updated manuscript. New visualization figures and control-theory primer

Showing 1–50 of 372 results for author: Pan, C