-
DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing
Authors:
Ke Li,
Maoliang Li,
Jialiang Chen,
Jiayu Chen,
Zihao Zheng,
Shaoqi Wang,
Xiang Chen
Abstract:
Video mashup creation represents a complex video editing paradigm that recomposes existing footage to craft engaging audio-visual experiences, demanding intricate orchestration across semantic, visual, and auditory dimensions and multiple levels. However, existing automated editing frameworks often overlook the cross-level multimodal orchestration to achieve professional-grade fluidity, resulting in disjointed sequences with abrupt visual transitions and musical misalignment. To address this, we formulate video mashup creation as a Multimodal Coherency Satisfaction Problem (MMCSP) and propose the DIRECT framework. Simulating a professional production pipeline, our hierarchical multi-agent framework decomposes the challenge into three cascade levels: the Screenwriter for source-aware global structural anchoring, the Director for instantiating adaptive editing intent and guidance, and the Editor for intent-guided shot sequence editing with fine-grained optimization. We further introduce Mashup-Bench, a comprehensive benchmark with tailored metrics for visual continuity and auditory alignment. Extensive experiments demonstrate that DIRECT significantly outperforms state-of-the-art baselines in both objective metrics and human subjective evaluation. Project page and code: https://github.com/AK-DREAM/DIRECT
Submitted 6 April, 2026;
originally announced April 2026.
-
Joint Fullband-Subband Modeling for High-Resolution SingFake Detection
Authors:
Xuanjun Chen,
Chia-Yu Hu,
Sung-Feng Huang,
Haibin Wu,
Hung-yi Lee,
Jyh-Shing Roger Jang
Abstract:
Rapid advances in singing voice synthesis have increased unauthorized imitation risks, creating an urgent need for better Singing Voice Deepfake (SingFake) Detection, also known as SVDD. Unlike speech, singing contains complex pitch, wide dynamic range, and timbral variations. Conventional 16 kHz-sampled detectors prove inadequate, as they discard vital high-frequency information. This study presents the first systematic analysis of high-resolution (44.1 kHz sampling rate) audio for SVDD. We propose a joint fullband-subband modeling framework: the fullband captures global context, while subband-specific experts isolate fine-grained synthesis artifacts unevenly distributed across the spectrum. Experiments on the WildSVDD dataset demonstrate that high-frequency subbands provide essential complementary cues. Our framework significantly outperforms 16 kHz-sampled models, proving that high-resolution audio and strategic subband integration are critical for robust in-the-wild detection.
Submitted 6 April, 2026;
originally announced April 2026.
-
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
Authors:
DataFlow Team,
Bohan Zeng,
Daili Hua,
Kaixin Zhu,
Yifan Dai,
Bozhou Li,
Yuran Wang,
Chengzhuo Tong,
Yifan Yang,
Mingkun Chang,
Jianbin Zhao,
Zhou Liu,
Hao Liang,
Xiaochen Ma,
Ruichuan An,
Junbo Niu,
Zimo Meng,
Tianyi Bai,
Meiyi Qiang,
Huanyao Zhang,
Zhiyou Xiao,
Tianyu Guo,
Qinhan Yu,
Runhao Zhao,
Zhengpin Li
, et al. (16 additional authors not shown)
Abstract:
World models have garnered significant attention as a promising research direction in artificial intelligence, yet a clear and unified definition remains lacking. In this paper, we introduce OpenWorldLib, a comprehensive and standardized inference framework for Advanced World Models. Drawing on the evolution of world models, we propose a clear definition: a world model is a model or framework centered on perception, equipped with interaction and long-term memory capabilities, for understanding and predicting the complex world. We further systematically categorize the essential capabilities of world models. Based on this definition, OpenWorldLib integrates models across different tasks within a unified framework, enabling efficient reuse and collaborative inference. Finally, we present additional reflections and analyses on potential future directions for world model research. Code link: https://github.com/OpenDCAI/OpenWorldLib
Submitted 6 April, 2026;
originally announced April 2026.
-
GAIN: Multiplicative Modulation for Domain Adaptation
Authors:
Hengshuai Yao,
Xing Chen,
Ahmed Murtadha,
Guan Wang
Abstract:
Adapting LLMs to new domains causes forgetting because standard methods (full fine-tuning, LoRA) inject new directions into the weight space. We propose GAIN, which re-emphasizes existing features through multiplicative modulation W_new = S * W. The learned diagonal matrix S is applied to the attention output projection and optionally the FFN. The principle mirrors gain modulation in neuroscience, where neurons adapt to context by scaling response strength while preserving selectivity.
We evaluate GAIN on five models from four families (774M to 70B), adapting sequentially across eight domains. GAIN-FFN matches LoRA's in-domain adaptation, but their effects on previously trained domains are opposite: GAIN-FFN improves them by 7-13% (validation PPL), while LoRA degrades them by 18-36%. Downstream accuracy confirms the pattern: for example, after seven sequential adaptations on Qwen2.5, GAIN-FFN degrades BoolQ by only 0.8% while LoRA damages it by 14.9%. GAIN adds 46K-230K parameters per model and can be absorbed into the pretrained weights for zero inference cost.
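The multiplicative update W_new = S * W with a learned diagonal S can be sketched as follows. This is an illustrative NumPy rendering of the formula as stated in the abstract, with made-up shapes and gain values; it is not the paper's implementation.

```python
import numpy as np

def gain_modulate(W, s):
    """Multiplicative modulation W_new = S * W with diagonal S = diag(s).

    Each output row of W is rescaled by a learned gain, re-emphasizing
    existing features rather than injecting new weight-space directions.
    """
    return s[:, None] * W  # equivalent to np.diag(s) @ W, but cheaper

rng = np.random.default_rng(0)
d_out, d_in = 4, 3
W = rng.standard_normal((d_out, d_in))  # pretrained projection weights

# At initialization (s = 1), modulation is a no-op: the model is unchanged.
assert np.allclose(gain_modulate(W, np.ones(d_out)), W)

# After adaptation, the gains can be absorbed into the pretrained weights,
# which is what permits zero inference cost: (diag(s) W) x == s * (W x).
s_learned = np.array([1.1, 0.9, 1.0, 1.2])  # hypothetical learned gains
W_absorbed = gain_modulate(W, s_learned)
x = rng.standard_normal(d_in)
assert np.allclose(W_absorbed @ x, s_learned * (W @ x))
```

Because S only rescales rows, the column space (feature selectivity) of W is preserved, which matches the gain-modulation analogy in the abstract.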
Submitted 6 April, 2026;
originally announced April 2026.
-
MPTF-Net: Multi-view Pyramid Transformer Fusion Network for LiDAR-based Place Recognition
Authors:
Shuyuan Li,
Zihang Wang,
Xieyuanli Chen,
Wenkai Zhu,
Xiaoteng Fang,
Peizhou Ni,
Junhao Yang,
Dong Kong
Abstract:
LiDAR-based place recognition (LPR) is essential for global localization and loop-closure detection in large-scale SLAM systems. Existing methods typically construct global descriptors from Range Images or BEV representations for matching. BEV is widely adopted due to its explicit 2D spatial layout encoding and efficient retrieval. However, conventional BEV representations rely on simple statistical aggregation, which fails to capture fine-grained geometric structures, leading to performance degradation in complex or repetitive environments. To address this, we propose MPTF-Net, a novel multi-view multi-scale pyramid Transformer fusion network. Our core contribution is a multi-channel NDT-based BEV encoding that explicitly models local geometric complexity and intensity distributions via Normal Distribution Transform, providing a noise-resilient structural prior. To effectively integrate these features, we develop a customized pyramid Transformer module that captures cross-view interactive correlations between Range Image Views (RIV) and NDT-BEV at multiple spatial scales. Extensive experiments on the nuScenes, KITTI and NCLT datasets demonstrate that MPTF-Net achieves state-of-the-art performance, specifically attaining a Recall@1 of 96.31% on the nuScenes Boston split while maintaining an inference latency of only 10.02 ms, making it highly suitable for real-time autonomous unmanned systems.
Submitted 6 April, 2026;
originally announced April 2026.
-
Light-Bound Transformers: Hardware-Anchored Robustness for Silicon-Photonic Computer Vision Systems
Authors:
Xuming Chen,
Deniz Najafi,
Chengwei Zhou,
Pietro Mercati,
Arman Roohi,
Mohsen Imani,
Mahdi Nikdast,
Shaahin Angizi,
Gourav Datta
Abstract:
Deploying Vision Transformers (ViTs) on near-sensor analog accelerators demands training pipelines that are explicitly aligned with device-level noise and energy constraints. We introduce a compact framework for silicon-photonic execution of ViTs that integrates measured hardware noise, robust attention training, and an energy-aware processing flow. We first characterize bank-level noise in microring-resonator (MR) arrays, including fabrication variation, thermal drift, and amplitude noise, and convert these measurements into closed-form, activation-dependent variance proxies for attention logits and feed-forward activations. Using these proxies, we develop Chance-Constrained Training (CCT), which enforces variance-normalized logit margins to bound attention rank flips, and a noise-aware LayerNorm that stabilizes feature statistics without changing the optical schedule. These components yield a practical "measure → model → train → run" pipeline that optimizes accuracy under noise while respecting system energy limits. Hardware-in-the-loop experiments with MR photonic banks show that our approach restores near-clean accuracy under realistic noise budgets, with no in-situ learning or additional optical MACs.
Submitted 5 April, 2026;
originally announced April 2026.
-
Combee: Scaling Prompt Learning for Self-Improving Language Model Agents
Authors:
Hanchen Li,
Runyuan He,
Qizheng Zhang,
Changxiu Ji,
Qiuyang Mang,
Xiaokun Chen,
Lakshya A Agrawal,
Wei-Liang Liao,
Eric Yang,
Alvin Cheung,
James Zou,
Kunle Olukotun,
Ion Stoica,
Joseph E. Gonzalez
Abstract:
Recent advances in prompt learning allow large language model agents to acquire task-relevant knowledge from inference-time context without parameter changes. For example, existing methods (like ACE or GEPA) can learn system prompts to improve accuracy based on previous agent runs. However, these methods primarily focus on single-agent or low-parallelism settings. This fundamentally limits their ability to efficiently learn from a large set of collected agentic traces. It would be efficient and beneficial to run prompt learning in parallel to accommodate the growing trend of learning from many agentic traces or parallel agent executions. Yet without a principled strategy for scaling, current methods suffer from quality degradation with high parallelism. To improve both the efficiency and quality of prompt learning, we propose Combee, a novel framework to scale parallel prompt learning for self-improving agents. Combee speeds up learning and enables running many agents in parallel while learning from their aggregate traces without quality degradation. To achieve this, Combee leverages parallel scans and employs an augmented shuffle mechanism; Combee also introduces a dynamic batch size controller to balance quality and delay. Evaluations on AppWorld, Terminal-Bench, Formula, and FiNER demonstrate that Combee achieves up to 17x speedup over previous methods with comparable or better accuracy and equivalent cost.
Submitted 5 April, 2026;
originally announced April 2026.
-
NTIRE 2026 3D Restoration and Reconstruction in Real-world Adverse Conditions: RealX3D Challenge Results
Authors:
Shuhong Liu,
Chenyu Bao,
Ziteng Cui,
Xuangeng Chu,
Bin Ren,
Lin Gu,
Xiang Chen,
Mingrui Li,
Long Ma,
Marcos V. Conde,
Radu Timofte,
Yun Liu,
Ryo Umagami,
Tomohiro Hashimoto,
Zijian Hu,
Yuan Gan,
Tianhan Xu,
Yusuke Kurose,
Tatsuya Harada,
Junwei Yuan,
Gengjia Chang,
Xining Ge,
Mache You,
Qida Cao,
Zeliang Li
, et al. (81 additional authors not shown)
Abstract:
This paper presents a comprehensive review of the NTIRE 2026 3D Restoration and Reconstruction (3DRR) Challenge, detailing the proposed methods and results. The challenge seeks to identify reconstruction pipelines that remain robust under real-world adverse conditions, specifically extreme low-light and smoke-degraded environments, as captured by our RealX3D benchmark. A total of 279 participants registered for the competition, of whom 33 teams submitted valid results. We thoroughly evaluate the submitted approaches against state-of-the-art baselines, revealing significant progress in 3D reconstruction under adverse conditions. Our analysis highlights shared design principles among top-performing methods and provides insights into effective strategies for handling 3D scene degradation.
Submitted 5 April, 2026;
originally announced April 2026.
-
Shorter, but Still Trustworthy? An Empirical Study of Chain-of-Thought Compression
Authors:
Lingjie Zeng,
Xiaofan Chen,
Yanbo Wang,
Xiuying Chen
Abstract:
Long chain-of-thought (Long-CoT) reasoning models have motivated a growing body of work on compressing reasoning traces to reduce inference cost, yet existing evaluations focus almost exclusively on task accuracy and token savings. Trustworthiness properties, whether acquired or reinforced through post-training, are encoded in the same parameter space that compression modifies. This means preserving accuracy does not, a priori, guarantee preserving trustworthiness. We conduct the first systematic empirical study of how CoT compression affects model trustworthiness, evaluating multiple models of different scales along three dimensions: safety, hallucination resistance, and multilingual robustness. Under controlled comparisons, we find that CoT compression frequently introduces trustworthiness regressions and that different methods exhibit markedly different degradation profiles across dimensions. To enable fair comparison across bases, we propose a normalized efficiency score for each dimension that reveals how naïve scalar metrics can obscure trustworthiness trade-offs. As an existence proof, we further introduce an alignment-aware DPO variant that reduces CoT length by 19.3% on reasoning benchmarks with substantially smaller trustworthiness loss. Our findings suggest that CoT compression should be optimized not only for efficiency but also for trustworthiness, treating both as equally important design constraints.
Submitted 5 April, 2026;
originally announced April 2026.
-
UAV Control and Communication Enabled Low-Altitude Economy: Challenges, Resilient Architecture and Co-design Strategies
Authors:
Tianhao Liang,
Nanchi Su,
Yuqi Ping,
Guangyu Lei,
Xinglin Chen,
Longyu Zhou,
Tingting Zhang,
Qinyu Zhang,
Tony Q. S. Quek
Abstract:
The emerging low-altitude economy has catalyzed the large-scale deployment of unmanned aerial vehicles (UAVs), driving a paradigm shift in environment monitoring, logistics, and emergency response. However, operating within these environments presents notable challenges, such as pervasive coverage holes, unpredictable interference, and spectrum scarcity. To this end, this article presents a communication and control co-design framework to enable a resilient architecture for cellular-connected UAVs. Specifically, we first characterize typical service applications and their stringent performance requirements, followed by a comprehensive analysis of the unique challenges. To bridge the gap between volatile wireless links and rigid flight stability, a three-layered architecture is proposed, integrating pre-flight strategic planning, in-flight adaptive action, and system-level resource orchestration. Furthermore, we detail the key enabling technologies for communication and control co-design. Preliminary case studies are presented to validate that the co-design framework significantly improves the resilience of cellular-connected UAV systems, providing a robust foundation for the evolution of intelligent low-altitude networks.
Submitted 5 April, 2026;
originally announced April 2026.
-
SODA: Semi On-Policy Black-Box Distillation for Large Language Models
Authors:
Xiwen Chen,
Jingjing Wang,
Wenhui Zhu,
Peijie Qiu,
Xuanzhao Dong,
Hejian Sang,
Zhipeng Wang,
Alborz Geramifard,
Feng Luo
Abstract:
Black-box knowledge distillation for large language models presents a strict trade-off. Simple off-policy methods (e.g., sequence-level knowledge distillation) struggle to correct the student's inherent errors. Fully on-policy methods (e.g., Generative Adversarial Distillation) solve this via adversarial training but introduce well-known training instability and crippling computational overhead. To address this dilemma, we propose SODA (Semi On-policy Distillation with Alignment), a highly efficient alternative motivated by the inherent capability gap between frontier teachers and much smaller base models. Because a compact student model's natural, zero-shot responses are almost strictly inferior to the powerful teacher's targets, we can construct a highly effective contrastive signal simply by pairing the teacher's optimal response with a one-time static snapshot of the student's outputs. This demonstrates that exposing the small student to its own static inferior behaviors is sufficient for high-quality distribution alignment, eliminating the need for costly dynamic rollouts and fragile adversarial balancing. Extensive evaluations across four compact Qwen2.5 and Llama-3 models validate this semi on-policy paradigm. SODA matches or outperforms the state-of-the-art methods on 15 out of 16 benchmark results. More importantly, it achieves this superior distillation quality while training 10 times faster, consuming 27% less peak GPU memory, and completely eliminating adversarial instability.
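The pairing step described above can be sketched as follows. The field names (`prompt`/`chosen`/`rejected`) follow the common DPO preference-data convention and are an assumption of this sketch, not the paper's format; the prompt and response strings are invented for illustration.

```python
def build_soda_pairs(prompts, teacher_responses, student_snapshot):
    """Semi on-policy contrastive pairs: the frontier teacher's response is
    'chosen' and a one-time static snapshot of the compact student's
    zero-shot outputs is 'rejected'. No dynamic rollouts are needed,
    since the student's responses are collected once, up front.
    """
    assert len(prompts) == len(teacher_responses) == len(student_snapshot)
    return [
        {"prompt": p, "chosen": t, "rejected": s}
        for p, t, s in zip(prompts, teacher_responses, student_snapshot)
    ]

pairs = build_soda_pairs(
    ["What is 2+2?"],                # hypothetical training prompt
    ["2 + 2 = 4."],                  # teacher's target response
    ["I think it might be 5."],      # student's static zero-shot output
)
```

The capability gap argued for in the abstract is what makes this static pairing meaningful: because the snapshot responses are almost strictly inferior to the teacher's, they remain a useful "rejected" signal throughout training without being refreshed.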
Submitted 4 April, 2026;
originally announced April 2026.
-
Your Agent is More Brittle Than You Think: Uncovering Indirect Injection Vulnerabilities in Agentic LLMs
Authors:
Wenhui Zhu,
Xuanzhao Dong,
Xiwen Chen,
Rui Cai,
Peijie Qiu,
Zhipeng Wang,
Oana Frunza,
Shao Tang,
Jindong Gu,
Yalin Wang
Abstract:
The rapid deployment of open-source frameworks has significantly advanced the development of modern multi-agent systems. However, expanded action spaces, including uncontrolled privilege exposure and hidden inter-system interactions, pose severe security challenges. Specifically, Indirect Prompt Injections (IPI), which conceal malicious instructions within third-party content, can trigger unauthorized actions such as data exfiltration during normal operations. While current security evaluations predominantly rely on isolated single-turn benchmarks, the systemic vulnerabilities of these agents within complex dynamic environments remain critically underexplored. To bridge this gap, we systematically evaluate six defense strategies against four sophisticated IPI attack vectors across nine LLM backbones. Crucially, we conduct our evaluation entirely within dynamic multi-step tool-calling environments to capture the true attack surface of modern autonomous agents. Moving beyond binary success rates, our multidimensional analysis reveals a pronounced fragility. Advanced injections successfully bypass nearly all baseline defenses, and some surface-level mitigations even produce counterproductive side effects. Furthermore, while agents execute malicious instructions almost instantaneously, their internal states exhibit abnormally high decision entropy. Motivated by this latent hesitation, we investigate Representation Engineering (RepE) as a robust detection strategy. By extracting hidden states at the tool-input position, we reveal that the RepE-based circuit breaker successfully identifies and intercepts unauthorized actions before the agent commits to them, achieving high detection accuracy across diverse LLM backbones. This study exposes the limitations of current IPI defenses and provides a highly practical paradigm for building resilient multi-agent architectures.
Submitted 4 April, 2026;
originally announced April 2026.
-
Bridging Restoration and Diagnosis: A Comprehensive Benchmark for Retinal Fundus Enhancement
Authors:
Xuanzhao Dong,
Wenhui Zhu,
Xiwen Chen,
Hao Wang,
Xin Li,
Yujian Xiong,
Jiajun Cheng,
Zhipeng Wang,
Shao Tang,
Oana Dumitrascu,
Yalin Wang
Abstract:
Over the past decade, generative models have demonstrated success in enhancing fundus images. However, the evaluation of these models remains a challenge. A benchmark for fundus image enhancement is needed for three main reasons: (1) Conventional denoising metrics such as PSNR and SSIM fail to capture clinically relevant features, such as lesion preservation and vessel morphology consistency, limiting their applicability in real-world settings; (2) There is a lack of unified evaluation protocols that address both paired and unpaired enhancement methods, particularly those guided by clinical expertise; and (3) An evaluation framework should provide actionable insights to guide future advancements in clinically aligned enhancement models. To address these gaps, we introduce EyeBench-V2, a benchmark designed to bridge the gap between enhancement model performance and clinical utility. Our work offers three key contributions: (1) Multi-dimensional clinical-alignment through downstream evaluations: Beyond standard enhancement metrics, we assess performance across clinically meaningful tasks including vessel segmentation, diabetic retinopathy (DR) grading, generalization to unseen noise patterns, and lesion segmentation. (2) Expert-guided evaluation design: We curate a novel dataset enabling fair comparisons between paired and unpaired enhancement methods, accompanied by a structured manual assessment protocol by medical experts, which evaluates clinically critical aspects such as lesion structure alterations, background color shifts, and the introduction of artificial structures. (3) Actionable insights: Our benchmark provides a rigorous, task-oriented analysis of existing generative models, equipping clinical researchers with the evidence needed to make informed decisions, while also identifying limitations in current methods to inform the design of next-generation enhancement models.
Submitted 4 April, 2026;
originally announced April 2026.
-
TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables
Authors:
Xiaoyu Chen,
Lu Dai,
Hanqing Wang,
Zhuoyu Li,
Wenbin Dai,
Yanzong Zheng,
Zhenggang Xia,
Junyong Lin,
Hui Xiong
Abstract:
Structured tables are essential for conveying high-density information in professional domains such as finance, healthcare, and scientific research. Despite the progress in Multimodal Large Language Models (MLLMs), reasoning performance remains limited for complex tables with hierarchical layouts. In this paper, we identify a critical Perception Bottleneck through quantitative analysis. We find that as task complexity scales, the number of involved discrete visual regions increases disproportionately. This processing density leads to an internal "Perceptual Overload," where MLLMs struggle to maintain accurate spatial attention during implicit generation. To address this bottleneck, we introduce TableVision, a large-scale, trajectory-aware benchmark designed for spatially grounded reasoning. TableVision stratifies tabular tasks into three cognitive levels (Perception, Reasoning, and Analysis) across 13 sub-categories. By utilizing a rendering-based deterministic grounding pipeline, the dataset explicitly couples multi-step logical deductions with pixel-perfect spatial ground truths, comprising 6,799 high-fidelity reasoning trajectories. Our empirical results, supported by diagnostic probing, demonstrate that explicit spatial constraints significantly recover the reasoning potential of MLLMs. Furthermore, our two-stage decoupled framework achieves a robust 12.3% overall accuracy improvement on the test set. TableVision provides a rigorous testbed and a fresh perspective on the synergy between perception and logic in document understanding.
Submitted 4 April, 2026;
originally announced April 2026.
-
Love Me, Love My Label: Rethinking the Role of Labels in Prompt Retrieval for Visual In-Context Learning
Authors:
Tianci Luo,
Haohao Pan,
Jinpeng Wang,
Niu Lian,
Xinrui Chen,
Bin Chen,
Shu-Tao Xia,
Chun Yuan
Abstract:
Visual in-context learning (VICL) enables visual foundation models to handle multiple tasks by steering them with demonstrative prompts. The choice of such prompts largely influences VICL performance, standing out as a key challenge. Prior work has made substantial progress on prompt retrieval and reranking strategies, but mainly focuses on prompt images while overlooking labels. We reveal these approaches sometimes get visually similar but label-inconsistent prompts, which potentially degrade VICL performance. On the other hand, higher label consistency between query and prompts preferably indicates stronger VICL results. Motivated by these findings, we develop a framework named LaPR (Label-aware Prompt Retrieval), which highlights the role of labels in prompt selection. Our framework first designs an image-label joint representation for prompts to incorporate label cues explicitly. Besides, to handle unavailable query labels at test time, we introduce a mixture-of-expert mechanism to the dual encoders with query-adaptive routing. Each expert is expected to capture a specific label mode, while the router infers query-adaptive mixture weights and helps to learn label-aware representation. We carefully design alternative optimization for experts and router, with a VICL performance-guided contrastive loss and a label-guided contrastive loss, respectively. Extensive experiments show promising and consistent improvement of LaPR on in-context segmentation, detection, and colorization tasks. Moreover, LaPR generalizes well across feature extractors and cross-fold scenarios, suggesting the importance of label utilization in prompt retrieval for VICL. Code is available at https://github.com/luotc-why/CVPR26-LaPR.
Submitted 4 April, 2026;
originally announced April 2026.
-
Learning Additively Compositional Latent Actions for Embodied AI
Authors:
Hangxing Wei,
Xiaoyu Chen,
Chuheng Zhang,
Tim Pearce,
Jianyu Chen,
Alex Lamb,
Li Zhao,
Jiang Bian
Abstract:
Latent action learning infers pseudo-action labels from visual transitions, providing an approach to leverage internet-scale video for embodied AI. However, most methods learn latent actions without structural priors that encode the additive, compositional structure of physical motion. As a result, latents often entangle irrelevant scene details or information about future observations with true state changes, and miscalibrate motion magnitude. We introduce the Additively Compositional Latent Action Model (AC-LAM), which enforces scene-wise additive composition structure over short horizons on the latent action space. These AC constraints encourage simple algebraic structure in the latent action space (identity, inverse, cycle consistency) and suppress information that does not compose additively. Empirically, AC-LAM learns more structured, motion-specific, and displacement-calibrated latent actions and provides stronger supervision for downstream policy learning, outperforming state-of-the-art LAMs across simulated and real-world tabletop tasks.
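The algebraic constraints named in the abstract (identity, inverse, cycle consistency) can be written down directly; this sketch uses a toy difference encoder as the latent-action model, purely as an assumption for illustration:

```python
# Sketch of the additive-composition constraints AC-LAM places on latent actions:
# identity z(s,s) ~ 0, inverse z(s,s') + z(s',s) ~ 0, and short-horizon composition
# z(s1,s2) + z(s2,s3) ~ z(s1,s3). The latent model here is a toy difference encoder;
# a learned network would take its place.
import numpy as np

def latent_action(s, s_next):
    """Toy latent-action model: encode the transition s -> s_next."""
    return s_next - s  # a difference encoder trivially satisfies the AC constraints

def ac_losses(s1, s2, s3):
    identity = np.linalg.norm(latent_action(s1, s1))
    inverse = np.linalg.norm(latent_action(s1, s2) + latent_action(s2, s1))
    cycle = np.linalg.norm(latent_action(s1, s2) + latent_action(s2, s3)
                           - latent_action(s1, s3))
    return identity, inverse, cycle

rng = np.random.default_rng(0)
s1, s2, s3 = (rng.normal(size=4) for _ in range(3))
print(ac_losses(s1, s2, s3))  # all ~0 for the difference encoder
```

In training, these three norms would serve as penalty terms on a learned encoder rather than holding exactly.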
Submitted 3 April, 2026;
originally announced April 2026.
-
Artificial Intelligence and Systemic Risk: A Unified Model of Performative Prediction, Algorithmic Herding, and Cognitive Dependency in Financial Markets
Authors:
Shuchen Meng,
Xupeng Chen
Abstract:
We develop a unified model in which AI adoption in financial markets generates systemic risk through three mutually reinforcing channels: performative prediction, algorithmic herding, and cognitive dependency. Within an extended rational expectations framework with endogenous adoption, we derive an equilibrium systemic risk coupling $r(φ) = φρβ/λ'(φ)$, where $φ$ is the AI adoption share, $ρ$ the algorithmic signal correlation, $β$ the performative feedback intensity, and $λ'(φ)$ the endogenous effective price impact. Because $λ'(φ)$ is decreasing in $φ$, the coupling is convex in adoption, implying that the systemic risk multiplier $M = (1 - r)^{-1}$ grows superlinearly as AI penetration increases. The model is developed in three layers. First, endogenous fragility: market depth is decreasing and convex in AI adoption. Second, embedding the convex coupling within a supermodular adoption game produces a saddle-node bifurcation into an algorithmic monoculture. Third, cognitive dependency as an endogenous state variable yields an impossibility theorem (hysteresis requires dynamics beyond static frameworks) and a channel necessity theorem (each channel is individually necessary). Empirical validation uses the complete universe of SEC Form 13F filings (99.5 million holdings, 10,957 institutional managers, 2013--2024) with a Bartik shift-share instrument (first-stage $F = 22.7$). The model implies tail-loss amplification of 18--54%, economically significant relative to Basel III countercyclical buffers.
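The coupling $r(φ)$ and multiplier $M$ can be illustrated numerically; the depth function and parameter values below are assumptions for illustration, not the paper's calibration:

```python
# Hedged numerical sketch of the systemic risk coupling r(phi) = phi*rho*beta/lambda'(phi)
# and multiplier M = (1 - r)^{-1}. The depth function below is an assumed illustrative
# form (decreasing and convex in phi), not the paper's estimated lambda'(phi).

def effective_impact(phi, lam0=1.0, k=2.0):
    """Assumed effective price impact lambda'(phi), decreasing in adoption phi."""
    return lam0 / (1.0 + k * phi)

def coupling(phi, rho=0.6, beta=0.5):
    """Systemic risk coupling r(phi) = phi * rho * beta / lambda'(phi)."""
    return phi * rho * beta / effective_impact(phi)

def multiplier(phi):
    """Systemic risk multiplier M = (1 - r)^{-1}; diverges as r -> 1."""
    r = coupling(phi)
    assert r < 1.0, "coupling must stay below 1 for a finite multiplier"
    return 1.0 / (1.0 - r)

# Convexity in adoption: equal increments in phi raise M by growing amounts.
for phi in (0.2, 0.4, 0.6, 0.8):
    print(f"phi={phi:.1f}  r={coupling(phi):.3f}  M={multiplier(phi):.3f}")
```

Because the assumed depth falls in φ, the coupling picks up both a linear factor φ and a growing 1/λ'(φ) factor, which is what makes M grow superlinearly with adoption.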
Submitted 23 March, 2026;
originally announced April 2026.
-
Why Attend to Everything? Focus is the Key
Authors:
Hengshuai Yao,
Xing Chen,
Ahmed Murtadha,
Jin Li,
Shuai Shao,
Yasin Abbasi Yadkori,
Guan Wang,
Mingli Yuan,
William Chen,
Sen Song
Abstract:
We introduce Focus, a method that learns which token pairs matter rather than approximating all of them. Learnable centroids assign tokens to groups; distant attention is restricted to same-group pairs while local attention operates at full resolution. Because all model weights stay frozen, Focus is purely additive: centroid-only training (as few as 148K parameters) improves domain perplexity with zero degradation on downstream benchmarks--from 124M to 70B parameters, across five attention architectures. No existing efficient attention method achieves this in the retrofit setting. At 124M, Focus surpasses full attention (30.3 vs 31.4 PPL); trained from scratch at 7B scale (2B tokens), Focus again beats full attention (13.82 vs 13.89 PPL). At inference, restricting each token to its top-k highest-scoring groups discretizes the soft routing into a hard sparsity pattern, yielding 2x speedup while beating the pretrained baseline (41.3 vs 42.8 PPL); decomposing this pattern into two standard FlashAttention calls reaches 8.6x wall-clock speedup at 1M tokens with no custom kernels. Unlike LoRA, centroid routing preserves alignment: instruction-tuned models retain TruthfulQA scores after adaptation, while LoRA degrades at every learning rate and rank. Sinkhorn normalization enforces balanced groups as a hard constraint, and the resulting groups discover interpretable linguistic categories without supervision.
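The grouping mechanism described above (centroids route tokens; distant attention is restricted to same-group pairs while local attention stays dense) can be sketched as an attention mask; sizes, the toy token placement, and random centroids are illustrative assumptions:

```python
# Sketch of Focus-style sparse attention: centroids group tokens; attention is
# allowed for local pairs (within a window) and for distant same-group pairs.
# Sizes, the toy token placement, and random centroids are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d, n_groups, window = 16, 8, 4, 2

centroids = rng.normal(size=(n_groups, d))          # learnable in the real method
tokens = centroids[np.arange(n_tokens) % n_groups]  # toy tokens placed at centroids

# Assign each token to its nearest centroid (hard routing, as at inference).
dists = np.linalg.norm(tokens[:, None, :] - centroids[None, :, :], axis=-1)
group = dists.argmin(axis=1)

# Allowed-attention mask: local window at full resolution; distant pairs
# only when both tokens fall in the same group.
idx = np.arange(n_tokens)
local = np.abs(idx[:, None] - idx[None, :]) <= window
same_group = group[:, None] == group[None, :]
mask = local | same_group

print(f"attended pairs: {mask.mean():.2%} of full attention")
```

In the method itself the routing is soft during training and discretized to top-k groups at inference; the mask above shows only the resulting sparsity pattern.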
Submitted 12 March, 2026;
originally announced April 2026.
-
An Algebraic Method for Full-Rank Characterization in Binary Linear Coding
Authors:
Mingyang Zhu,
Laigang Guo,
Zhenyu Huang,
Xingbing Chen,
Jue Wang,
Tao Guo,
Xiao-Shan Gao
Abstract:
In this paper, we develop a characteristic set (CS)-based method for deriving full-rank equivalence conditions of symbolic matrices over the binary field. Such full-rank conditions are of fundamental importance for many linear coding problems in communication and information theory. Building on the developed CS-based method, we present an algorithm called Binary Characteristic Set for Full Rank (BCSFR), which efficiently derives the full-rank equivalence conditions as the zeros of a series of characteristic sets. In other words, the BCSFR algorithm can characterize all feasible linear coding schemes for certain linear coding problems (e.g., linear network coding and distributed storage coding), where full-rank constraints are imposed on several symbolic matrices to guarantee decodability or other properties of the codes. The derived equivalence conditions can be used to simplify the optimization of coding schemes, since the intractable full-rank constraints in the optimization problem are explicitly characterized by simple triangular-form equality constraints.
Submitted 3 April, 2026;
originally announced April 2026.
-
Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models
Authors:
Chengyin Hu,
Yuxian Dong,
Yikun Guo,
Xiang Chen,
Junqi Wu,
Jiahuan Long,
Yiwei Wei,
Tingsong Jiang,
Wen Yao
Abstract:
Infrared vision-language models (IR-VLMs) have emerged as a promising paradigm for multimodal perception in low-visibility environments, yet their robustness to adversarial attacks remains largely unexplored. Existing adversarial patch methods are mainly designed for RGB-based models in closed-set settings and are not readily applicable to the open-ended semantic understanding and physical deployment requirements of infrared VLMs. To bridge this gap, we propose Universal Curved-Grid Patch (UCGP), a universal physical adversarial patch framework for IR-VLMs. UCGP integrates Curved-Grid Mesh (CGM) parameterization for continuous, low-frequency, and deployable patch generation with a unified representation-driven objective that promotes subspace departure, topology disruption, and stealth. To improve robustness under real-world deployment and domain shift, we further incorporate Meta Differential Evolution and EOT-augmented TPS deformation modeling. Rather than manipulating labels or prompts, UCGP directly disrupts the visual representation space, weakening cross-modal semantic alignment. Extensive experiments demonstrate that UCGP consistently compromises semantic understanding across diverse IR-VLM architectures while maintaining cross-model transferability, cross-dataset generalization, real-world physical effectiveness, and robustness against defenses. These findings reveal a previously overlooked robustness vulnerability in current infrared multimodal systems.
Submitted 3 April, 2026;
originally announced April 2026.
-
DynFOA: Generating First-Order Ambisonics with Conditional Diffusion for Dynamic and Acoustically Complex 360-Degree Videos
Authors:
Ziyu Luo,
Lin Chen,
Qiang Qu,
Xiaoming Chen,
Yiran Shen
Abstract:
Spatial audio is crucial for immersive 360-degree video experiences, yet most 360-degree videos lack it due to the difficulty of capturing spatial audio during recording. Automatically generating spatial audio such as first-order ambisonics (FOA) from video therefore remains an important but challenging problem. In complex scenes, sound perception depends not only on sound source locations but also on scene geometry, materials, and dynamic interactions with the environment. However, existing approaches only rely on visual cues and fail to model dynamic sources and acoustic effects such as occlusion, reflections, and reverberation. To address these challenges, we propose DynFOA, a generative framework that synthesizes FOA from 360-degree videos by integrating dynamic scene reconstruction with conditional diffusion modeling. DynFOA analyzes the input video to detect and localize dynamic sound sources, estimate depth and semantics, and reconstruct scene geometry and materials using 3D Gaussian Splatting (3DGS). The reconstructed scene representation provides physically grounded features that capture acoustic interactions between sources, environment, and listener viewpoint. Conditioned on these features, a diffusion model generates spatial audio consistent with the scene dynamics and acoustic context. We introduce M2G-360, a dataset of 600 real-world clips divided into MoveSources, Multi-Source, and Geometry subsets for evaluating robustness under diverse conditions. Experiments show that DynFOA consistently outperforms existing methods in spatial accuracy, acoustic fidelity, distribution matching, and perceived immersive experience.
Submitted 3 April, 2026;
originally announced April 2026.
-
VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors
Authors:
Haz Sameen Shahgir,
Xiaofu Chen,
Yu Fu,
Erfan Shayegani,
Nael Abu-Ghazaleh,
Yova Kementchedjhieva,
Yue Dong
Abstract:
Vision Language Models (VLMs) achieve impressive performance across a wide range of multimodal tasks. However, on some tasks that demand fine-grained visual perception, they often fail even when the required information is present in their internal representations. In this work, we demonstrate that this gap arises from their narrow training pipeline which focuses on moving visual information to the textual space. Consequently, VLMs can only reason about visual entities that can be mapped to known concepts in the language space, leaving vision-focused tasks such as visual correspondence and reasoning about novel visual entities poorly supported. As a result, VLMs are severely limited in several important multimodal capabilities because they rely on brittle, hallucinated textual descriptions of visual entities that they cannot map to textual representations. We verify this behavior through visual correspondence tasks, in which VLMs must detect matching entities between two images. Testing across semantic, shape, and face correspondence tasks, we find that VLMs perform much better when the relevant entities are nameable in language than when they are unnameable. Mechanistically, our Logit Lens analyses confirm that VLMs explicitly assign semantic labels to nameable entities and surface more unique corresponding tokens compared to unnameable entities. Furthermore, we show that teaching completely arbitrary names for unknown entities improves performance, yet task-specific finetuning yields even stronger generalization without relying on language priors. Our findings suggest that current VLM failures on visual tasks reflect learned shortcuts from their training, rather than a fundamental limitation of multimodal architectures.
Submitted 2 April, 2026;
originally announced April 2026.
-
XpertBench: Expert-Level Tasks with Rubrics-Based Evaluation
Authors:
Xue Liu,
Xin Ma,
Yuxin Ma,
Yongchang Peng,
Duo Wang,
Zhoufutu Wen,
Ge Zhang,
Kaiyuan Zhang,
Xinyu Chen,
Tianci He,
Jiani Hou,
Liang Hu,
Ziyun Huang,
Yongzhe Hui,
Jianpeng Jiao,
Chennan Ju,
Yingru Kong,
Yiran Li,
Mengyun Liu,
Luyao Ma,
Fei Ni,
Yiqing Ni,
Yueyan Qiu,
Yanle Ren,
Zilin Shi
, et al. (9 additional authors not shown)
Abstract:
As Large Language Models (LLMs) exhibit plateauing performance on conventional benchmarks, a pivotal challenge persists: evaluating their proficiency in complex, open-ended tasks that characterize genuine expert-level cognition. Existing frameworks suffer from narrow domain coverage, reliance on generalist tasks, or self-evaluation biases. To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains. XpertBench consists of 1,346 meticulously curated tasks across 80 categories, spanning finance, healthcare, legal services, education, and dual-track research (STEM and Humanities). These tasks are derived from over 1,000 submissions by domain experts--including researchers from elite institutions and practitioners with extensive clinical or industrial experience--ensuring superior ecological validity. Each task uses detailed rubrics, typically with 15-40 weighted checkpoints, to assess professional rigor. To facilitate scalable yet human-aligned assessment, we introduce ShotJudge, a novel evaluation paradigm that employs LLM judges calibrated with expert few-shot exemplars to mitigate self-rewarding biases. Our empirical evaluation of state-of-the-art LLMs reveals a pronounced performance ceiling: even leading models achieve a peak success rate of only ~66%, with a mean score around 55%. Models also exhibit domain-specific divergence, showing non-overlapping strengths in quantitative reasoning versus linguistic synthesis. These findings underscore a significant "expert gap" in current AI systems and establish XpertBench as a critical instrument for navigating the transition from general-purpose assistants to specialized professional collaborators.
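Rubric-based scoring with weighted checkpoints reduces to a weighted fraction; the checkpoint weights below are invented for illustration, not taken from the benchmark:

```python
# Sketch of rubric-based scoring with weighted checkpoints, as XpertBench's tasks
# use (typically 15-40 checkpoints per task). The weights and judge verdicts below
# are invented for illustration; the benchmark's real rubrics differ.

def rubric_score(checkpoints):
    """Weighted fraction of satisfied checkpoints: sum(w * hit) / sum(w)."""
    total = sum(w for w, _ in checkpoints)
    earned = sum(w for w, hit in checkpoints if hit)
    return earned / total

# (weight, satisfied?) pairs produced by a judge for one response.
example = [(3, True), (2, True), (1, False), (4, True), (2, False)]
print(rubric_score(example))  # 9/12 = 0.75
```

In ShotJudge the per-checkpoint verdicts would come from an LLM judge calibrated with expert few-shot exemplars; the aggregation itself stays this simple.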
Submitted 6 April, 2026; v1 submitted 27 March, 2026;
originally announced April 2026.
-
UI-Oceanus: Scaling GUI Agents with Synthetic Environmental Dynamics
Authors:
Mengzhou Wu,
Yuzhe Guo,
Yuan Cao,
Haochuan Lu,
Songhe Zhu,
Pingzhe Qu,
Xin Chen,
Kang Qin,
Zhongpu Wang,
Xiaode Zhang,
Xinyi Wang,
Wei Dai,
Gang Cao,
Yuetang Deng,
Zhi Gong,
Dezhi Ran,
Linyi Li,
Wei Yang,
Tao Xie
Abstract:
Scaling generalist GUI agents is hindered by the data scalability bottleneck of expensive human demonstrations and the "distillation ceiling" of synthetic teacher supervision. To transcend these limitations, we propose UI-Oceanus, a framework that shifts the learning focus from mimicking high-level trajectories to mastering interaction physics via ground-truth environmental feedback. Through a systematic investigation of self-supervised objectives, we identify that forward dynamics, defined as the generative prediction of future interface states, acts as the primary driver for scalability and significantly outweighs inverse inference. UI-Oceanus leverages this insight by converting low-cost autonomous exploration, which is verified directly by system execution, into high-density generative supervision to construct a robust internal world model. Experimental evaluations across a series of models demonstrate the decisive superiority of our approach: models utilizing Continual Pre-Training (CPT) on synthetic dynamics outperform non-CPT baselines with an average success rate improvement of 7% on offline benchmarks, which amplifies to a 16.8% gain in real-world online navigation. Furthermore, we observe that navigation performance scales with synthetic data volume. These results confirm that grounding agents in forward predictive modeling offers a superior pathway to scalable GUI automation with robust cross-domain adaptability and compositional generalization.
Submitted 11 February, 2026;
originally announced April 2026.
-
Subquadratic Counting via Perfect Marginal Sampling
Authors:
Xiaoyu Chen,
Zongchen Chen,
Kuikui Liu,
Xinyuan Zhang
Abstract:
We study the computational complexity of approximately computing the partition function of a spin system. Techniques based on standard counting-to-sampling reductions yield $\tilde{O}(n^2)$-time algorithms, where $n$ is the size of the input graph. We present new counting algorithms that break the quadratic-time barrier in a wide range of settings. For example, for the hardcore model of $λ$-weighted independent sets in graphs of maximum degree $Δ$, we obtain an $\tilde{O}(n^{2-δ})$-time approximate counting algorithm, for some constant $δ > 0$, when the fugacity $λ < \frac{1}{Δ-1}$, improving over the previous regime of $λ = o(Δ^{-3/2})$ by Anand, Feng, Freifeld, Guo, and Wang (2025). Our results apply broadly to many other spin systems, such as the Ising model, hypergraph independent sets, and vertex colorings.
Interestingly, our work reveals a deep connection between $\textit{subquadratic}$ counting and $\textit{perfect}$ marginal sampling. For two-spin systems such as the hardcore and Ising models, we show that the existence of perfect marginal samplers directly yields subquadratic counting algorithms in a $\textit{black-box}$ fashion. For general spin systems, we show that almost all existing perfect marginal samplers can be adapted to produce a sufficiently low-variance marginal estimator in sublinear time, leading to subquadratic counting algorithms.
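The baseline counting-to-sampling reduction the paper speeds up rests on the self-reducibility identity $\Pr_G[v \text{ unoccupied}] = Z(G-v)/Z(G)$, so $Z(G)$ telescopes into a product of inverse single-vertex marginals. A brute-force check on a toy path graph (our own example, not from the paper):

```python
# Verify the telescoping identity behind counting-to-sampling for the hardcore
# model: Z(G) = prod_i 1 / Pr_{G_i}[v_i unoccupied], peeling off one vertex per
# step. The path graph and fugacity below are illustrative choices.
from itertools import combinations

def partition_function(vertices, edges, lam):
    """Brute-force hardcore partition function: sum of lam^|I| over independent sets I."""
    z = 0.0
    for k in range(len(vertices) + 1):
        for subset in combinations(vertices, k):
            s = set(subset)
            if all(not (u in s and v in s) for u, v in edges):
                z += lam ** k
    return z

def remove_vertex(vertices, edges, v):
    return [u for u in vertices if u != v], [e for e in edges if v not in e]

lam = 0.4
vertices, edges = [0, 1, 2, 3], [(0, 1), (1, 2), (2, 3)]  # path graph P4

# Peel off vertices one at a time, multiplying by 1 / Pr[v unoccupied] each step.
z_telescoped, V, E = 1.0, vertices, edges
for v in list(vertices):
    p_out = partition_function(*remove_vertex(V, E, v), lam) / partition_function(V, E, lam)
    z_telescoped /= p_out
    V, E = remove_vertex(V, E, v)

print(z_telescoped, partition_function(vertices, edges, lam))  # both equal Z(G)
```

The quadratic cost of the standard reduction comes from estimating each of the $n$ marginals to sufficient accuracy; the paper's point is that perfect marginal samplers let this be done faster.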
Submitted 2 April, 2026;
originally announced April 2026.
-
CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects
Authors:
Jingliang Li,
Jindou Jia,
Tuo An,
Chuhao Zhou,
Xiangyu Chen,
Shilin Shan,
Boyu Ma,
Bofan Lyu,
Gen Li,
Jianfei Yang
Abstract:
When told to "cut the apple," a robot must choose the knife over nearby scissors, despite both objects affording the same cutting function. In real-world scenes, multiple objects may share identical affordances, yet only one is appropriate under the given task context. We call such cases confusing pairs. However, existing 3D affordance methods largely sidestep this challenge by evaluating isolated single objects, often with explicit category names provided in the query. We formalize Multi-Object Affordance Grounding under Intent-Driven Instructions, a new 3D affordance setting that requires predicting a per-point affordance mask on the correct object within a cluttered multi-object point cloud, conditioned on implicit natural language intent. To study this problem, we construct CompassAD, the first benchmark centered on implicit intent in confusable multi-object scenes. It comprises 30 confusing object pairs spanning 16 affordance types, 6,422 scenes, and 88K+ query-answer pairs. Furthermore, we propose CompassNet, a framework that incorporates two dedicated modules tailored to this task. Instance-bounded Cross Injection (ICI) constrains language-geometry alignment within object boundaries to prevent cross-object semantic leakage. Bi-level Contrastive Refinement (BCR) enforces discrimination at both geometric-group and point levels, sharpening distinctions between target and confusable surfaces. Extensive experiments demonstrate state-of-the-art results on both seen and unseen queries, and deployment on a robotic manipulator confirms effective transfer to real-world grasping in confusing multi-object scenes.
Submitted 2 April, 2026;
originally announced April 2026.
-
ProCeedRL: Process Critic with Exploratory Demonstration Reinforcement Learning for LLM Agentic Reasoning
Authors:
Jingyue Gao,
Yanjiang Guo,
Xiaoshuai Chen,
Jianyu Chen
Abstract:
Reinforcement Learning (RL) significantly enhances the reasoning abilities of large language models (LLMs), yet applying it to multi-turn agentic tasks remains challenging due to the long-horizon nature of interactions and the stochasticity of environmental feedback. We identify a structural failure mode in agentic exploration: suboptimal actions elicit noisy observations that accumulate into misleading contexts, which further weaken subsequent decision-making and make recovery increasingly difficult. This cumulative feedback loop of errors renders standard exploration strategies ineffective and sensitive to both the model's reasoning and the environment's randomness. To mitigate this issue, we propose ProCeedRL: Process Critic with Exploratory Demonstration RL, shifting exploration from passive selection to active intervention. ProCeedRL employs a process-level critic to monitor interactions in real time, incorporating reflection-based demonstrations to guide agents in stopping the accumulation of errors. We find that this approach significantly exceeds the model's saturated exploration performance, demonstrating substantial exploratory benefits. By learning from exploratory demonstrations and on-policy samples, ProCeedRL significantly improves exploration efficiency and achieves superior performance on complex deep search and embodied tasks.
Submitted 2 April, 2026;
originally announced April 2026.
-
Light-ResKAN: A Parameter-Sharing Lightweight KAN with Gram Polynomials for Efficient SAR Image Recognition
Authors:
Pan Yi,
Weijie Li,
Xiaodong Chen,
Jiehua Zhang,
Li Liu,
Yongxiang Liu
Abstract:
Synthetic Aperture Radar (SAR) image recognition is vital for disaster monitoring, military reconnaissance, and ocean observation. However, large SAR image sizes hinder deep learning deployment on resource-constrained edge devices, and existing lightweight models struggle to balance high-precision feature extraction with low computational requirements. The emerging Kolmogorov-Arnold Network (KAN) enhances fitting by replacing fixed activations with learnable ones, reducing parameters and computation. Inspired by KAN, we propose Light-ResKAN to achieve a better balance between precision and efficiency. First, Light-ResKAN modifies ResNet by replacing convolutions with KAN convolutions, enabling adaptive feature extraction for SAR images. Second, we use Gram Polynomials as activations, which are well-suited for SAR data to capture complex non-linear relationships. Third, we employ a parameter-sharing strategy: each kernel shares parameters per channel, preserving unique features while reducing parameters and FLOPs. Our model achieves 99.09%, 93.01%, and 97.26% accuracy on MSTAR, FUSAR-Ship, and SAR-ACD datasets, respectively. Experiments on MSTAR resized to $1024 \times 1024$ show that compared to VGG16, our model reduces FLOPs by $82.90 \times$ and parameters by $163.78 \times$. This work establishes an efficient solution for edge SAR image recognition.
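A back-of-envelope reading of the parameter-sharing strategy: if each learnable activation carries a polynomial coefficient vector, sharing one coefficient set per channel divides the activation parameters by the kernel area. The accounting below is a hedged sketch, not the paper's exact layer sizes:

```python
# Back-of-envelope parameter accounting for a KAN convolution whose learnable
# activations are degree-d polynomial expansions: (d + 1) coefficients per
# activation. Sharing one coefficient set per channel (rather than one per kernel
# position) divides the count by the kernel area. All numbers are illustrative.

def kan_conv_params(in_ch, out_ch, kernel, degree, share_per_channel):
    coeffs = degree + 1                            # polynomial coefficients per activation
    positions = 1 if share_per_channel else kernel * kernel
    return in_ch * out_ch * positions * coeffs

full = kan_conv_params(64, 64, 3, 5, share_per_channel=False)
shared = kan_conv_params(64, 64, 3, 5, share_per_channel=True)
print(full, shared, full // shared)  # sharing cuts parameters by the kernel area (9x)
```

This toy accounting only shows why per-channel sharing shrinks the model; the paper's measured reductions also reflect the overall Light-ResKAN architecture.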
Submitted 2 April, 2026; v1 submitted 2 April, 2026;
originally announced April 2026.
-
LiteInception: A Lightweight and Interpretable Deep Learning Framework for General Aviation Fault Diagnosis
Authors:
Zhihuan Wei,
Xinhang Chen,
Danyang Han,
Yang Hu,
Jie Liu,
Xuewen Miao,
Guijiang Li
Abstract:
General aviation fault diagnosis and efficient maintenance are critical to flight safety; however, deploying deep learning models on resource-constrained edge devices poses dual challenges in computational capacity and interpretability. This paper proposes LiteInception--a lightweight interpretable fault diagnosis framework designed for edge deployment. The framework adopts a two-stage cascaded architecture aligned with standard maintenance workflows: Stage 1 performs high-recall fault detection, and Stage 2 conducts fine-grained fault classification on anomalous samples, thereby decoupling optimization objectives and enabling on-demand allocation of computational resources. For model compression, a multi-method fusion strategy based on mutual information, gradient analysis, and SE attention weights is proposed to reduce the input sensor channels from 23 to 15, and a 1+1 branch LiteInception architecture is introduced that compresses InceptionTime parameters by 70%, accelerates CPU inference by over 8x, with less than 3% F1 loss. Furthermore, knowledge distillation is introduced as a precision-recall regulation mechanism, enabling the same lightweight model to adapt to different scenarios--such as safety-critical and auxiliary diagnosis--by switching training strategies. Finally, a dual-layer interpretability framework integrating four attribution methods is constructed, providing traceable evidence chains of "which sensor × which time period." Experiments on the NGAFID dataset demonstrate a fault detection accuracy of 81.92% with 83.24% recall, and a fault identification accuracy of 77.00%, validating the framework's favorable balance among efficiency, accuracy, and interpretability.
Submitted 2 April, 2026;
originally announced April 2026.
-
TOL: Textual Localization with OpenStreetMap
Authors:
Youqi Liao,
Shuhao Kang,
Jingyu Xu,
Olaf Wysocki,
Yan Xia,
Jianping Li,
Zhen Dong,
Bisheng Yang,
Xieyuanli Chen
Abstract:
Natural language provides an intuitive way to express spatial intent in geospatial applications. While existing localization methods often rely on dense point cloud maps or high-resolution imagery, OpenStreetMap (OSM) offers a compact and freely available map representation that encodes rich semantic and structural information, making it well suited for large-scale localization. However, text-to-OSM (T2O) localization remains largely unexplored. In this paper, we formulate the T2O global localization task, which aims to estimate accurate 2 degree-of-freedom (DoF) positions in urban environments from textual scene descriptions without relying on geometric observations or GNSS-based initial location. To support the proposed task, we introduce TOL, a large-scale benchmark spanning multiple continents and diverse urban environments. TOL contains approximately 121K textual queries paired with OSM map tiles and covers about 316 km of road trajectories across Boston, Karlsruhe, and Singapore. We further propose TOLoc, a coarse-to-fine localization framework that explicitly models the semantics of surrounding objects and their directional information. In the coarse stage, direction-aware features are extracted from both textual descriptions and OSM tiles to construct global descriptors, which are used to retrieve candidate locations for the query. In the fine stage, the query text and top-1 retrieved tile are jointly processed, where a dedicated alignment module fuses textual descriptor and local map features to regress the 2-DoF pose. Experimental results demonstrate that TOLoc achieves strong localization performance, outperforming the best existing method by 6.53%, 9.93%, and 8.31% at 5m, 10m, and 25m thresholds, respectively, and shows strong generalization to unseen environments. Dataset, code and models will be publicly available at: https://github.com/WHU-USI3DV/TOL.
Submitted 2 April, 2026;
originally announced April 2026.
-
DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72
Authors:
Wanqian Li,
Jintao Peng,
Zongfei Jing,
Tianyu Zhang,
Ze Long,
Xianjie Qiao,
Xiaoming Chen,
Dongxu Yang,
Kefeng Duan,
June Yang
Abstract:
Large language model (LLM) inference increasingly depends on multi-GPU execution, yet existing inference parallelization strategies require layer-wise inter-rank synchronization, making end-to-end performance sensitive to workload imbalance. We present DWDP (Distributed Weight Data Parallelism), an inference parallelization strategy that preserves data-parallel execution while offloading MoE weights across peer GPUs and fetching missing experts on demand. By removing collective inter-rank synchronization, DWDP allows each GPU to progress independently. We further address the practical overheads of this design with two optimizations for split-weight management and asynchronous remote-weight prefetch. Implemented in TensorRT-LLM and evaluated with DeepSeek-R1 on GB200 NVL72, DWDP improves end-to-end output TPS/GPU by 8.8% at comparable TPS/user in the 20-100 TPS/user serving range under 8K input sequence length and 1K output sequence length.
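The fetch-missing-experts-on-demand behavior can be illustrated with a toy cache; `peer_store` is a purely hypothetical stand-in for reading a peer GPU's weight shard, not DWDP's actual mechanism:

```python
def get_expert(layer, expert_id, local_cache, peer_store):
    """On-demand MoE weight fetch sketch: serve the expert from the local shard
    if resident, otherwise pull it from a peer GPU's shard (peer_store is a
    hypothetical stand-in for a remote NVLink read) and cache it locally."""
    key = (layer, expert_id)
    if key not in local_cache:
        local_cache[key] = peer_store[key]   # cache miss: remote fetch
    return local_cache[key]
```

Because each rank resolves its own misses, no layer-wise collective synchronization is needed, which is what lets every GPU progress independently.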
Submitted 2 April, 2026;
originally announced April 2026.
-
Sublinear-query relative-error testing of halfspaces
Authors:
Xi Chen,
Anindya De,
Yizhi Huang,
Shivam Nadimpalli,
Rocco A. Servedio,
Tianqi Yang
Abstract:
The relative-error property testing model was introduced in [CDHLNSY24] to facilitate the study of property testing for "sparse" Boolean-valued functions, i.e. ones for which only a small fraction of all input assignments satisfy the function. In this framework, the distance from the unknown target function $f$ that is being tested to a function $g$ is defined as $\mathrm{Vol}(f \mathop{\triangle} g)/\mathrm{Vol}(f)$, where the numerator is the fraction of inputs on which $f$ and $g$ disagree and the denominator is the fraction of inputs that satisfy $f$.
Recent work [CDHNSY26] has shown that over the Boolean domain $\{0,1\}^n$, any relative-error testing algorithm for the fundamental class of halfspaces (i.e. linear threshold functions) must make $Ω(\log n)$ oracle calls. In this paper we complement the [CDHNSY26] lower bound by showing that halfspaces can be relative-error tested over $\mathbb{R}^n$ under the standard $N(0,I_n)$ Gaussian distribution using a sublinear number of oracle calls -- in particular, substantially fewer than would be required for learning. Our results use a wide range of tools including Hermite analysis, Gaussian isoperimetric inequalities, and geometric results on noise sensitivity and surface area.
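The relative-error distance defined above can be estimated by straightforward Monte Carlo sampling; a minimal sketch over the Boolean domain (not the paper's testing algorithm, which uses far fewer queries):

```python
import random

def rel_error_distance(f, g, n, samples=200_000, seed=0):
    """Monte Carlo estimate of Vol(f triangle g) / Vol(f) over {0,1}^n:
    the fraction of inputs where f and g disagree, normalized by the
    fraction of inputs satisfying f."""
    rng = random.Random(seed)
    disagree = satisfy = 0
    for _ in range(samples):
        x = tuple(rng.randint(0, 1) for _ in range(n))
        fx, gx = f(x), g(x)
        disagree += (fx != gx)
        satisfy += fx
    return disagree / satisfy   # ZeroDivisionError if f is never satisfied in the sample
```

For sparse f the denominator is what makes the relative-error notion more demanding than ordinary Hamming distance.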
Submitted 1 April, 2026;
originally announced April 2026.
-
Robust Autonomous Control of a Magnetic Millirobot in In Vitro Cardiac Flow
Authors:
Anuruddha Bhattacharjee,
Xinhao Chen,
Lamar O. Mair,
Suraj Raval,
Yancy Diaz-Mercado,
Axel Krieger
Abstract:
Untethered magnetic millirobots offer significant potential for minimally invasive cardiac therapies; however, achieving reliable autonomous control in pulsatile cardiac flow remains challenging. This work presents a vision-guided control framework enabling precise autonomous navigation of a magnetic millirobot in an in vitro heart phantom under physiologically relevant flow conditions. The system integrates UNet-based localization, A* path planning, and a sliding mode controller with a disturbance observer (SMC-DOB) designed for multi-coil electromagnetic actuation. Although drag forces are estimated using steady-state CFD simulations, the controller compensates for transient pulsatile disturbances during closed-loop operation. In static fluid, the SMC-DOB achieved sub-millimeter accuracy (root-mean-square error, RMSE = 0.49 mm), outperforming PID and MPC baselines. Under moderate pulsatile flow (7 cm/s peak, 20 cP), it reduced RMSE by 37% and peak error by 2.4$\times$ compared to PID. It further maintained RMSE below 2 mm (0.27 body lengths) under elevated pulsatile flow (10 cm/s peak, 20 cP) and under low-viscosity conditions (4.3 cP, 7 cm/s peak), where baseline controllers exhibited unstable or failed tracking. These results demonstrate robust closed-loop magnetic control under time-varying cardiac flow disturbances and support the feasibility of autonomous millirobot navigation for targeted drug delivery.
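As a rough illustration of the sliding-mode-with-disturbance-observer idea (not the paper's controller; the gains and boundary-layer width below are made-up values):

```python
def smc_dob_step(e, e_dot, d_hat, lam=2.0, k=1.5, phi=0.05):
    """One SMC-DOB control update (illustrative gains): sliding surface
    s = e_dot + lam*e, a boundary-layer saturation to limit chattering,
    and feed-forward of the observer's disturbance estimate d_hat."""
    s = e_dot + lam * e
    sat = max(-1.0, min(1.0, s / phi))   # smooth stand-in for sign(s)
    return -k * sat - d_hat
```

The disturbance-observer term is what lets a controller tuned on steady-state CFD drag estimates compensate for the transient pulsatile disturbances during closed-loop operation.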
Submitted 1 April, 2026;
originally announced April 2026.
-
LLM Agents as Social Scientists: A Human-AI Collaborative Platform for Social Science Automation
Authors:
Lei Wang,
Yuanzi Li,
Jinchao Wu,
Heyang Gao,
Xiaohe Bo,
Xu Chen,
Ji-Rong Wen
Abstract:
Traditional social science research often requires designing complex experiments across vast methodological spaces and depends on real human participants, making it labor-intensive, costly, and difficult to scale. Here we present S-Researcher, an LLM-agent-based platform that assists researchers in conducting social science research more efficiently and at greater scale by "siliconizing" both the research process and the participant pool. To build S-Researcher, we first develop YuLan-OneSim, a large-scale social simulation system designed around three core requirements: generality via auto-programming from natural language to executable scenarios, scalability via a distributed architecture supporting up to 100,000 concurrent agents, and reliability via feedback-driven LLM fine-tuning. Leveraging this system, S-Researcher supports researchers in designing social experiments, simulating human behavior with LLM agents, analyzing results, and generating reports, forming a complete human-AI collaborative research loop in which researchers retain oversight and intervention at every stage. We operationalize LLM simulation research paradigms into three canonical reasoning modes (induction, deduction, and abduction) and validate S-Researcher through systematic case studies: inductive reproduction of cultural dynamics consistent with Axelrod's theory, deductive testing of competing hypotheses on teacher attention validated against survey data, and abductive identification of a cooperation mechanism in public goods games confirmed by human experiments. S-Researcher establishes a new human-AI collaborative paradigm for social science, in which computational simulation augments human researchers to accelerate discovery across the full spectrum of social inquiry.
Submitted 1 April, 2026;
originally announced April 2026.
-
FineLAP: Taming Heterogeneous Supervision for Fine-grained Language-Audio Pretraining
Authors:
Xiquan Li,
Xuenan Xu,
Ziyang Ma,
Wenxi Chen,
Haolin He,
Qiuqiang Kong,
Xie Chen
Abstract:
Contrastively pretrained audio-language models (e.g., CLAP) excel at clip-level understanding but struggle with frame-level tasks. Existing extensions fail to exploit the varying granularity of real-world audio-text data, where massive clip-level textual descriptions coexist with limited frame-level annotations. This paper proposes Fine-grained Language-Audio Pretraining (FineLAP), a novel training paradigm that advances both clip- and frame-level alignment in CLAP with heterogeneous data. FineLAP introduces a dual-stream sigmoid loss with a cluster-based sampling strategy to jointly learn from clip- and frame-level supervision. To capture both global semantics and local details, FineLAP uses a decoupled audio projector on top of a self-supervised encoder. To alleviate the scarcity of temporally annotated data, we present FineLAP-100k, a large-scale synthetic SED dataset constructed through a scalable curation pipeline. Extensive experiments demonstrate that FineLAP achieves SOTA performance across multiple audio understanding tasks, including retrieval, classification, sound event detection, and text-to-audio grounding. Ablation studies further show that coarse- and fine-grained alignment are mutually beneficial, providing insights for building better audio-language models (ALMs).
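The dual-stream sigmoid loss builds on pairwise sigmoid contrastive objectives. A generic SigLIP-style sketch, assuming a batch similarity matrix and ±1 match labels; the temperature `t` and bias `b` are illustrative, and this is not FineLAP's exact loss:

```python
import math

def sigmoid_contrastive_loss(sim, labels, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over a batch similarity matrix:
    labels[i][j] = +1 for matched audio-text pairs, -1 otherwise.
    Each pair contributes -log(sigmoid(label * (t*sim + b)))."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = labels[i][j] * (t * sim[i][j] + b)
            total += math.log1p(math.exp(-z))   # = -log(sigmoid(z))
    return total / n
```

Because every pair is scored independently, such a loss accommodates heterogeneous supervision: clip-level and frame-level pairs can be mixed in one batch with different label matrices.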
Submitted 1 April, 2026;
originally announced April 2026.
-
Toward Personalized Darts Training: A Data-Driven Framework Based on Skeleton-Based Biomechanical Analysis and Motion Modeling
Authors:
Zhantao Chen,
Dongyi He,
Jin Fang,
Xi Chen,
Yishuo Liu,
Xiaozhen Zhong,
Xuejun Hu
Abstract:
As sports training becomes more data-driven, traditional dart coaching based mainly on experience and visual observation is increasingly inadequate for high-precision, goal-oriented movements. Although prior studies have highlighted the importance of release parameters, joint motion, and coordination in dart throwing, most quantitative methods still focus on local variables, single-release metrics, or static template matching. These approaches offer limited support for personalized training and often overlook useful movement variability. This paper presents a data-driven dart training assistance system. The system creates a closed-loop framework spanning motion capture, feature modeling, and personalized feedback. Dart-throwing data were collected in markerless conditions using a Kinect 2.0 depth sensor and an optical camera. Eighteen kinematic features were extracted from four biomechanical dimensions: three-link coordination, release velocity, multi-joint angular configuration, and postural stability. Two modules were developed: a personalized optimal throwing trajectory model that combines historical high-quality samples with the minimum jerk criterion, and a motion deviation diagnosis and recommendation model based on z-scores and hierarchical logic. A total of 2,396 throwing samples from professional and non-professional athletes were collected. Results show that the system generates smooth personalized reference trajectories consistent with natural human movement. Case studies indicate that it can detect poor trunk stability, abnormal elbow displacement, and imbalanced velocity control, then provide targeted recommendations. The framework shifts dart evaluation from deviation from a uniform standard to deviation from an individual's optimal control range, improving personalization and interpretability for darts training and other high-precision target sports.
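The minimum jerk criterion used for the reference trajectory has a classical closed form for point-to-point motion (the Flash-Hogan quintic); a one-dimensional sketch, independent of the paper's full model:

```python
def min_jerk(x0, xf, T, t):
    """Minimum-jerk position at time t for a point-to-point move from x0 to xf
    over duration T: the quintic blend has zero velocity and acceleration at
    both endpoints, matching natural smooth human movement."""
    tau = t / T
    s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5   # smooth 0 -> 1 profile
    return x0 + (xf - x0) * s
```

In the paper this criterion is combined with an athlete's own high-quality historical samples, so the reference is personalized rather than a single universal template.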
Submitted 2 April, 2026; v1 submitted 1 April, 2026;
originally announced April 2026.
-
ACT Now: Preempting LVLM Hallucinations via Adaptive Context Integration
Authors:
Bei Yan,
Yuecong Min,
Jie Zhang,
Shiguang Shan,
Xilin Chen
Abstract:
Large Vision-Language Models (LVLMs) frequently suffer from severe hallucination issues. Existing mitigation strategies predominantly rely on isolated, single-step states to enhance visual focus or suppress strong linguistic priors. However, these static approaches neglect dynamic context changes across the generation process and struggle to correct inherited information loss. To address this limitation, we propose Adaptive Context inTegration (ACT), a training-free inference intervention method that mitigates hallucination through the adaptive integration of contextual information. Specifically, we first propose visual context exploration, which leverages spatio-temporal profiling to adaptively amplify attention heads responsible for visual exploration. To further facilitate vision-language alignment, we propose semantic context aggregation that marginalizes potential semantic queries to effectively aggregate visual evidence, thereby resolving the information loss caused by the discrete nature of token prediction. Extensive experiments across diverse LVLMs demonstrate that ACT significantly reduces hallucinations and achieves competitive results on both discriminative and generative benchmarks, acting as a robust and highly adaptable solution without compromising fundamental generation capabilities.
Submitted 1 April, 2026;
originally announced April 2026.
-
Misconception Acquisition Dynamics in Large Language Models
Authors:
Naiming Liu,
Xinghe Chen,
Richard Baraniuk,
Mrinmaya Sachan,
Shashank Sonkar
Abstract:
Effective educational AI depends on modeling student misconceptions. Such models enable realistic learner simulation and diagnostic, adaptive tutoring. However, instruction-tuning large language models on student responses containing misconception errors can degrade reasoning abilities, creating a tension between faithful misconception modeling and preserving correct reasoning in other contexts. To support both learner simulation and tutoring, we study two misconception-aware models: the Novice Student Misconception Model, trained to acquire a single misconception for simulating an individual student, and the Expert Tutor Misconception Model, trained on multiple misconceptions to capture the error patterns a tutor encounters across students. To study the misconception acquisition dynamics of both models, we develop MalAlgoLib, a library that generates algebra problems with correct solution traces and misconception-specific erroneous traces. Our experiments across three LLMs reveal that the student and the tutor model exhibit fundamentally different misconception acquisition dynamics. For the student model, a single misconception is not learned as a context-specific behavior. Models overapply it across problems, degrading correct-solving accuracy unless training includes correct examples to enforce boundaries. In contrast, the tutor model can learn multiple misconceptions jointly without sacrificing correct-solving accuracy. Critically, intermediate reasoning steps are the bottleneck. With final-answer supervision alone, models cannot learn where error enters the solution, so neither the student model nor the tutor model acquires misconceptions regardless of data size. Together, these results, enabled by MalAlgoLib, provide an interpretable account of misconception acquisition under instruction tuning and guidance for training misconception-aware LLMs while preserving correct reasoning.
Submitted 1 April, 2026;
originally announced April 2026.
-
Birdcast: Interest-aware BEV Multicasting for Infrastructure-assisted Collaborative Perception
Authors:
Yanan Ma,
Zhengru Fang,
Yihang Tao,
Yu Guo,
Yiqin Deng,
Xianhao Chen,
Yuguang Fang
Abstract:
Vehicle-to-infrastructure collaborative perception (V2I-CP) leverages a high-vantage node to transmit supplementary information, i.e., bird's-eye-view (BEV) feature maps, to vehicles, effectively overcoming line-of-sight limitations. However, the downlink V2I transmission introduces a significant communication bottleneck. Moreover, vehicles in V2I-CP require heterogeneous yet overlapping information tailored to their unique occlusions and locations, rendering standard unicast/broadcast protocols inefficient. To address this limitation, we propose Birdcast, a novel multicasting framework for V2I-CP. By accounting for individual maps of interest, we formulate a joint feature selection and multicast grouping problem to maximize network-wide utility under communication constraints. Since this formulation is a mixed-integer nonlinear program and is NP-hard, we develop an accelerated greedy algorithm with a theoretical $(1 - 1/\sqrt{e})$ approximation guarantee. While motivated by CP, Birdcast provides a general framework applicable to a wide range of multicasting systems where users possess heterogeneous interests and varying channel conditions. Extensive simulations on the V2X-Sim dataset demonstrate that Birdcast significantly outperforms state-of-the-art baselines in both system utility and perception quality, achieving up to 27% improvement in total utility and a 3.2% increase in mean average precision (mAP).
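The accelerated greedy algorithm itself is not detailed in the abstract, but the plain greedy template it builds on looks like the following; all names and interfaces are illustrative:

```python
def greedy_select(items, utility, budget, cost):
    """Plain greedy for maximizing a monotone (submodular) utility under a
    budget: at each step pick the feasible item with the largest marginal
    gain. Not Birdcast's actual algorithm, which adds acceleration on top."""
    chosen, spent = [], 0.0
    remaining = list(items)
    while True:
        base = utility(chosen)
        best, best_gain = None, 0.0
        for it in remaining:
            if spent + cost(it) > budget:
                continue                          # item does not fit the budget
            gain = utility(chosen + [it]) - base  # marginal gain of adding it
            if gain > best_gain:
                best, best_gain = it, gain
        if best is None:
            return chosen                         # no feasible item improves utility
        chosen.append(best)
        spent += cost(best)
        remaining.remove(best)
```

For monotone submodular utilities, greedy variants of this form are what make constant-factor guarantees such as the paper's $(1 - 1/\sqrt{e})$ bound attainable.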
Submitted 1 April, 2026;
originally announced April 2026.
-
In the Middle, Not on Top: AI-Mediated Communication for Patient-Provider Care Relationships
Authors:
Ut Gong,
Yibo Meng,
Qihan Zhang,
Xin Chen,
Yan Guan
Abstract:
Relationship-centered care relies on trust and meaningful connection. As AI enters clinical settings, we must ask not just what it can do, but how it should be positioned to support these values. We examine a "middle, not top" approach where AI mediates communication without usurping human judgment. Through studies of CLEAR, an asynchronous messaging system, we show how this configuration addresses real-world constraints like time pressure and uneven health literacy. We find that mediator affordances (e.g., availability, neutrality) redistribute interpretive work and reduce relational friction. Ultimately, we frame AI mediation as relational infrastructure, highlighting critical design tensions around framing power and privacy.
Submitted 1 April, 2026;
originally announced April 2026.
-
Heterogeneous Mean Field Game Framework for LEO Satellite-Assisted V2X Networks
Authors:
Kangkang Sun,
Jianhua Li,
Xiuzhen Chen,
Mingzhe Chen,
Minyi Guo
Abstract:
Coordinating mixed fleets of massive vehicles under stringent delay constraints is a central scalability bottleneck in next-generation mobile computing networks, especially when passenger cars, freight trucks, and autonomous vehicles share the same radio and multi-access edge computing (MEC) infrastructure. Heterogeneous mean field games (HMFG) are a principled framework for this setting, but a fundamental design question remains open: how many agent types should be used for a fleet of size $N$? The difficulty is a two-sided trade-off that existing theory does not resolve: using more types improves heterogeneity representation, but it reduces per-class sample size and weakens the mean-field approximation accuracy. This paper resolves that trade-off through an explicit $\varepsilon$-Nash error decomposition, a closed-form type-selection law, a heterogeneity-aware equilibrium solver, and a robust extension to time-varying LEO backhaul dynamics. For the 1D queue state space, the optimal type count satisfies $K^*(N)=Θ(N^{1/3})$; for the joint queue-channel model ($d=2$), the scaling becomes $K^*(N)=Θ(N^{1/5})$ with logarithmic correction. The unified formula $K^*(N)=Θ(N^{α/(α+β)})$ provides dimension-dependent design guidance, reducing type granularity to a principled, set-once system parameter rather than a per-deployment tuning burden. Experiments validate the 1D scaling law with empirical slope $0.334 \pm 0.004$, achieve $2.3\times$ faster PDHG convergence at $K=5$, and deliver up to $29.5\%$ lower delay and $60\%$ higher throughput than homogeneous baselines. Unlike model-free DRL methods whose training complexity scales with the state-action space, the proposed HMFG solver has per-iteration complexity $O(K^2 N_q N_t)$ independent of fleet size $N$, making it suitable for large-scale mobile edge computing deployment.
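The closed-form type-selection law reduces to evaluating a single expression. A toy helper that reproduces the stated exponents; the constant `c` hidden by the Θ, and the particular α, β values in the test, are illustrative only:

```python
def optimal_type_count(N, alpha, beta, c=1.0):
    """K*(N) = Theta(N^{alpha/(alpha+beta)}): alpha and beta trade off
    heterogeneity representation against per-class sample size. The constant
    c absorbed by the Theta is an illustrative parameter, not a derived value."""
    return max(1, round(c * N ** (alpha / (alpha + beta))))
```

With an exponent of 1/3 (the 1D queue case) a fleet of 1,000 vehicles maps to about 10 types, illustrating why type granularity can be a set-once system parameter rather than a per-deployment tuning knob.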
Submitted 6 April, 2026; v1 submitted 1 April, 2026;
originally announced April 2026.
-
Dynamic Graph Neural Network with Adaptive Features Selection for RGB-D Based Indoor Scene Recognition
Authors:
Qiong Liu,
Ruofei Xiong,
Xingzhen Chen,
Muyao Peng,
You Yang
Abstract:
Multi-modality of color and depth, i.e., RGB-D, is of great importance in recent research on indoor scene recognition. In this data representation, the depth map describes the 3D structure of scenes and the geometric relations among objects. Previous works showed that local features of both modalities are vital for improving recognition accuracy. However, the problem of adaptively selecting and effectively exploiting these key local features remains open in this field. In this paper, a dynamic graph model with an adaptive node selection mechanism is proposed to solve this problem. In this model, a dynamic graph is built to model the relations among objects and the scene, and an adaptive node selection method takes key local features from both the RGB and depth modalities for graph modeling. These nodes are then grouped into three levels, representing near or far relations among objects. Moreover, the graph model is updated dynamically according to attention weights. Finally, the updated and optimized features of the RGB and depth modalities are fused for indoor scene recognition. Experiments are performed on public datasets including SUN RGB-D and NYU Depth v2. Extensive results demonstrate that our method outperforms state-of-the-art methods and that it is able to exploit crucial local features from both the RGB and depth modalities.
Submitted 31 March, 2026;
originally announced April 2026.
-
The Mystery Deepens: On the Query Complexity of Tarski Fixed Points
Authors:
Xi Chen,
Yuhao Li,
Mihalis Yannakakis
Abstract:
We give an $O(\log^2 n)$-query algorithm for finding a Tarski fixed point over the $4$-dimensional lattice $[n]^4$, matching the $Ω(\log^2 n)$ lower bound of [EPRY20]. Additionally, our algorithm yields an ${O(\log^{\lceil (k-1)/3\rceil+1} n)}$-query algorithm for any constant $k$, improving the previous best upper bound ${O(\log^{\lceil (k-1)/2\rceil+1} n)}$ of [CL22].
Our algorithm uses a new framework based on safe partial-information functions. The latter were introduced in [CLY23] to give a reduction from the Tarski problem to its promised version with a unique fixed point. This is the first time they are directly used to design new algorithms for Tarski fixed points.
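For intuition, the one-dimensional case of the Tarski problem is solvable with a standard O(log n) binary search that maintains the invariant f(lo) ≥ lo and f(hi) ≤ hi; a sketch (the paper's contribution concerns the much harder higher-dimensional lattices):

```python
def tarski_1d(f, n):
    """Find x in {1,...,n} with f(x) == x for a monotone f: [n] -> [n].
    Invariant: f(lo) >= lo and f(hi) <= hi, so by monotonicity a fixed
    point always lies in [lo, hi]; each query halves the interval."""
    lo, hi = 1, n
    while True:
        mid = (lo + hi) // 2
        v = f(mid)
        if v == mid:
            return mid
        if v > mid:          # f(mid) >= mid+1, so f(mid+1) >= mid+1
            lo = mid + 1
        else:                # f(mid) <= mid-1, so f(mid-1) <= mid-1
            hi = mid - 1
```

Existence of a fixed point for monotone f is the Knaster-Tarski theorem; the open question the paper advances is how few queries suffice as the dimension k grows.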
Submitted 31 March, 2026;
originally announced April 2026.
-
Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models
Authors:
Longwei Xu,
Feng Feng,
Shaojie Zhang,
Xin Chen,
Hang Li,
Anan Du,
Hailong Yu,
Pei Fu,
Zhenbo Luo,
Jian Luan
Abstract:
Optical Character Recognition (OCR) is increasingly regarded as a foundational capability for modern vision-language models (VLMs), enabling them not only to read text in images but also to support downstream reasoning in real-world visual question answering (VQA). However, practical applications further require reliable text anchors, i.e., accurately grounding queried text to its corresponding spatial region. To systematically evaluate this capability, we introduce TextAnchor-Bench (TABench), a benchmark for fine-grained text-region grounding, which reveals that both general-purpose and OCR-specific VLMs still struggle to establish accurate and stable text anchors. To address this limitation, we propose Q-Mask, a precise OCR framework built upon a causal query-driven mask decoder (CQMD). Inspired by chain-of-thought reasoning, Q-Mask performs causal visual decoding that sequentially generates query-conditioned visual masks before producing the final OCR output. This visual CoT paradigm disentangles where the text is from what the text is, enforcing grounded evidence acquisition prior to recognition and enabling explicit text anchor construction during inference. To train CQMD, we construct TextAnchor-26M, a large-scale dataset of image-text pairs annotated with fine-grained masks corresponding to specific textual elements, encouraging stable text-region correspondences and injecting strong spatial priors into VLM training. Extensive experiments demonstrate that Q-Mask substantially improves text anchoring and understanding across diverse visual scenes.
Submitted 31 March, 2026;
originally announced April 2026.
-
Hierarchical Battery-Aware Game Algorithm for ISL Power Allocation in LEO Mega-Constellations
Authors:
Kangkang Sun,
Jianhua Li,
Xiuzhen Chen,
Minyi Guo
Abstract:
Sustaining high inter-satellite link (ISL) throughput under intermittent solar harvesting is a fundamental challenge for LEO mega-constellations. Existing frameworks impose static power ceilings that ignore real-time battery state and comprehensive onboard power budgets, causing eclipse-period energy crises. Learning-based approaches capture battery dynamics but lack equilibrium guarantees and do not scale beyond small constellations. We propose the Hierarchical Battery-Aware Game (HBAG) algorithm, a unified game-theoretic framework for ISL power allocation that operates identically across finite and mega-constellation regimes. For finite constellations, HBAG converges to a unique variational equilibrium; as constellation size grows, the same distributed update rule converges to the mean field equilibrium without algorithm redesign. Comprehensive experiments on Starlink Shell A (172 satellites) show that HBAG achieves a 100% energy sustainability rate (an 87.4-percentage-point improvement over SATFLOW), eliminates eclipse-period battery depletion, maintains the flow violation ratio below the 10% industry tolerance, and scales linearly to 5,000 satellites with less than 75 ms per-slot runtime.
Submitted 31 March, 2026;
originally announced March 2026.
-
Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism
Authors:
Tao Chen,
Kun Zhang,
Qiong Wu,
Xiao Chen,
Chao Chang,
Xiaoshuai Sun,
Yiyi Zhou,
Rongrong Ji
Abstract:
Long video understanding is a key challenge that hampers the advancement of Multimodal Large Language Models (MLLMs). In this paper, we study this problem from the perspective of a visual memory mechanism and propose a novel, training-free approach termed Flexible Memory (FlexMem). In principle, FlexMem aims to mimic human behavior when watching video, i.e., continually watching video content and recalling the most relevant memory fragments to answer the question. In this way, FlexMem can help MLLMs understand videos of unbounded length, unlike previous methods that process all video information at once and therefore have an input upper limit. Concretely, FlexMem first treats the visual KV caches as the memory sources and realizes effective memory transfer and writing via a dual-pathway compression design. FlexMem then explores different memory reading strategies for diverse video understanding tasks, including the popular streaming setting. To validate FlexMem, we apply it to two popular video-MLLMs and conduct extensive experiments on five long-video tasks and one streaming-video task. The experimental results show that on a single 3090 GPU, FlexMem achieves clear improvements over existing efficient video understanding methods and can process more than 1k frames, which also helps the base MLLMs achieve comparable or even better performance than SOTA MLLMs, e.g., GPT-4o and Gemini-1.5 Pro, on some benchmarks.
Submitted 31 March, 2026;
originally announced March 2026.
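The memory write/read cycle described above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: `FragmentMemory`, mean-pooling as the "compression" step, and cosine-similarity recall are all assumptions standing in for FlexMem's dual-pathway compression and reading strategies.

```python
import numpy as np

# Hypothetical sketch of FlexMem-style recall: compressed visual fragments
# are stored as pooled feature vectors, and reading retrieves the top-k
# fragments most similar to the question embedding.
class FragmentMemory:
    def __init__(self):
        self.fragments = []  # list of (unit vector, payload)

    def write(self, features, payload):
        # stand-in "compression": mean-pool a chunk of frame features
        vec = features.mean(axis=0)
        self.fragments.append((vec / np.linalg.norm(vec), payload))

    def read(self, query, k=2):
        q = query / np.linalg.norm(query)
        scores = [float(v @ q) for v, _ in self.fragments]
        top = np.argsort(scores)[::-1][:k]
        return [self.fragments[i][1] for i in top]
```

Because each fragment is a fixed-size vector, memory grows linearly in the number of fragments rather than in raw frame count, which is what makes unbounded-length video tractable under this scheme.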
-
DF-ACBlurGAN: Structure-Aware Conditional Generation of Internally Repeated Patterns for Biomaterial Microtopography Design
Authors:
Rongjun Dong,
Xin Chen,
Morgan R Alexander,
Karthikeyan Sivakumar,
Reza Omdivar,
David A Winkler,
Grazziela Figueredo
Abstract:
Learning to generate images with internally repeated and periodic structures poses a fundamental challenge for machine learning and computer vision models, which are typically optimised for local texture statistics and semantic realism rather than global structural consistency. This limitation is particularly pronounced in applications requiring strict control over repetition scale, spacing, and boundary coherence, such as microtopographical biomaterial surfaces. In this work, biomaterial design serves as a use case to study conditional generation of repeated patterns under weak supervision and class imbalance. We propose DF-ACBlurGAN, a structure-aware conditional generative adversarial network that explicitly reasons about long-range repetition during training. The approach integrates frequency-domain repetition scale estimation, scale-adaptive Gaussian blurring, and unit-cell reconstruction to balance sharp local features with stable global periodicity. Conditioning on experimentally derived biological response labels, the model synthesises designs aligned with target functional outcomes. Evaluation across multiple biomaterial datasets demonstrates improved repetition consistency and controllable structural variation compared to conventional generative approaches.
Submitted 4 February, 2026;
originally announced March 2026.
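The frequency-domain repetition scale estimation mentioned above can be sketched with a dominant-FFT-peak heuristic. The paper's exact procedure is not given here; this assumed 1D version simply converts the strongest non-DC frequency bin into a spatial period.

```python
import numpy as np

# Hedged sketch: estimate the repetition scale of a periodic pattern by
# locating the dominant non-DC peak of its Fourier spectrum.
def estimate_period(signal_1d):
    n = len(signal_1d)
    spectrum = np.abs(np.fft.rfft(signal_1d - signal_1d.mean()))
    k = int(np.argmax(spectrum[1:])) + 1  # skip the DC bin
    return n / k                          # period in pixels

x = np.arange(128)
pattern = np.sin(2 * np.pi * x / 16)      # repeats every 16 pixels
```

An estimate of this kind could then drive the scale-adaptive Gaussian blurring, with the blur kernel sized relative to the recovered period.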
-
SonoWorld: From One Image to a 3D Audio-Visual Scene
Authors:
Derong Jin,
Xiyi Chen,
Ming C. Lin,
Ruohan Gao
Abstract:
Tremendous progress in visual scene generation now turns a single image into an explorable 3D world, yet immersion remains incomplete without sound. We introduce Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and present SonoWorld, the first framework to tackle this challenge. From one image, our pipeline outpaints a 360° panorama, lifts it into a navigable 3D scene, places language-guided sound anchors, and renders ambisonics for point, areal, and ambient sources, yielding spatial audio aligned with scene geometry and semantics. Quantitative evaluations on a newly curated real-world dataset and a controlled user study confirm the effectiveness of our approach. Beyond free-viewpoint audio-visual rendering, we also demonstrate applications to one-shot acoustic learning and audio-visual spatial source separation. Project website: https://humathe.github.io/sonoworld/
Submitted 30 March, 2026;
originally announced March 2026.
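For intuition on the ambisonic rendering step, a minimal first-order (ACN/SN3D) encoder for a point source looks as follows. This is a textbook sketch, not SonoWorld's renderer, which additionally handles areal and ambient sources.

```python
import numpy as np

# Minimal first-order ambisonic (ACN/SN3D) encoding of a mono source
# toward a direction given by azimuth/elevation in radians.
def encode_foa(mono, azimuth, elevation):
    w = mono                                    # omnidirectional component
    y = mono * np.sin(azimuth) * np.cos(elevation)
    z = mono * np.sin(elevation)
    x = mono * np.cos(azimuth) * np.cos(elevation)
    return np.stack([w, y, z, x])               # ACN channel order: W, Y, Z, X
```

A scene-level renderer would sum one such encoding per language-guided sound anchor, with gains derived from distance to the listener in the reconstructed 3D geometry.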
-
XSPA: Crafting Imperceptible X-Shaped Sparse Adversarial Perturbations for Transferable Attacks on VLMs
Authors:
Chengyin Hu,
Jiaju Han,
Xuemeng Sun,
Qike Zhang,
Yiwei Wei,
Ang Li,
Chunlei Meng,
Xiang Chen,
Jiahuan Long
Abstract:
Vision-language models (VLMs) rely on a shared visual-textual representation space to perform tasks such as zero-shot classification, image captioning, and visual question answering (VQA). While this shared space enables strong cross-task generalization, it may also introduce a common vulnerability: small visual perturbations can propagate through the shared embedding space and cause correlated semantic failures across tasks. This risk is particularly important in interactive and decision-support settings, yet it remains unclear whether VLMs are robust to highly constrained, sparse, and geometrically fixed perturbations. To address this question, we propose X-shaped Sparse Pixel Attack (XSPA), an imperceptible structured attack that restricts perturbations to two intersecting diagonal lines. Compared with dense perturbations or flexible localized patches, XSPA operates under a much stricter attack budget and thus provides a more stringent test of VLM robustness. Within this sparse support, XSPA jointly optimizes a classification objective, cross-task semantic guidance, and regularization on perturbation magnitude and along-line smoothness, inducing transferable misclassification as well as semantic drift in captioning and VQA while preserving visual subtlety. Under the default setting, XSPA modifies only about 1.76% of image pixels. Experiments on the COCO dataset show that XSPA consistently degrades performance across all three tasks. Zero-shot accuracy drops by 52.33 points on OpenAI CLIP ViT-L/14 and 67.00 points on OpenCLIP ViT-B/16, while GPT-4-evaluated caption consistency decreases by up to 58.60 points and VQA correctness by up to 44.38 points. These results suggest that even highly sparse and visually subtle perturbations with fixed geometric priors can substantially disrupt cross-task semantics in VLMs, revealing a notable robustness gap in current multimodal systems.
Submitted 30 March, 2026;
originally announced March 2026.
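The X-shaped sparse support can be sketched as two intersecting diagonal bands on the pixel grid. The exact geometry (line width, anti-aliasing) is an assumption here, but with width-2 bands on a 224x224 image this construction touches roughly 1.78% of pixels, close to the paper's reported 1.76%.

```python
import numpy as np

# Assumed construction of an X-shaped sparse perturbation support:
# the union of a main-diagonal band and an anti-diagonal band.
def x_mask(n=224, width=2):
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    main = (i - j >= 0) & (i - j < width)                # main diagonal band
    anti = (i + j - (n - 1) >= 0) & (i + j - (n - 1) < width)  # anti-diagonal band
    return main | anti
```

An attack under this budget would then optimize perturbation values only at `True` positions of the mask, leaving all other pixels untouched.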
-
ConceptWeaver: Weaving Disentangled Concepts with Flow
Authors:
Jintao Chen,
Aiming Hao,
Xiaoqing Chen,
Chengyu Bai,
Chubin Chen,
Yanxun Li,
Jiahong Wu,
Xiangxiang Chu,
Shanghang Zhang
Abstract:
Pre-trained flow-based models excel at synthesizing complex scenes yet lack a direct mechanism for disentangling and customizing their underlying concepts from one-shot real-world sources. To demystify this process, we first introduce a novel differential probing technique to isolate and analyze the influence of individual concept tokens on the velocity field over time. This investigation yields a critical insight: the generative process is not monolithic but unfolds in three distinct stages. An initial \textbf{Blueprint Stage} establishes low-frequency structure, followed by a pivotal \textbf{Instantiation Stage} where content concepts emerge with peak intensity and become naturally disentangled, creating an optimal window for manipulation. A final concept-insensitive refinement stage then synthesizes fine-grained details. Guided by this discovery, we propose \textbf{ConceptWeaver}, a framework for one-shot concept disentanglement. ConceptWeaver learns concept-specific semantic offsets from a single reference image using a stage-aware optimization strategy that aligns with the three-stage framework. These learned offsets are then deployed during inference via our novel ConceptWeaver Guidance (CWG) mechanism, which strategically injects them at the appropriate generative stage. Extensive experiments validate that ConceptWeaver enables high-fidelity, compositional synthesis and editing, demonstrating that understanding and leveraging the intrinsic, staged nature of flow models is key to unlocking precise, multi-granularity content manipulation.
Submitted 30 March, 2026;
originally announced March 2026.
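The stage-gated injection idea behind ConceptWeaver Guidance can be sketched as an Euler flow-sampling loop that adds a learned offset to the velocity only inside the Instantiation Stage window. Everything here is an assumed simplification: the scalar state, the fixed `window`, and the additive offset stand in for the paper's learned concept-specific semantic offsets.

```python
# Hypothetical sketch of stage-aware guidance: a concept offset perturbs
# the velocity field only during the instantiation window, leaving the
# blueprint and refinement stages untouched.
def sample_with_cwg(x0, velocity, offset, steps=10, window=(0.3, 0.7)):
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        if window[0] <= t < window[1]:   # instantiation stage only
            v = v + offset
        x = x + dt * v                   # Euler step from t=0 toward t=1
    return x
```

Restricting the offset to the middle window reflects the paper's finding that content concepts emerge, and are most cleanly disentangled, during the Instantiation Stage.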