-
SAT: Balancing Reasoning Accuracy and Efficiency with Stepwise Adaptive Thinking
Authors:
Weiyang Huang,
Xuefeng Bai,
Kehai Chen,
Xinyang Chen,
Yibin Chen,
Weili Guan,
Min Zhang
Abstract:
Large Reasoning Models (LRMs) have revolutionized complex problem-solving, yet they exhibit a pervasive "overthinking" problem, generating unnecessarily long reasoning chains. While current solutions improve token efficiency, they often sacrifice fine-grained control or risk disrupting the logical integrity of the reasoning process. To address this, we introduce Stepwise Adaptive Thinking (SAT), a framework that performs step-level, difficulty-aware pruning while preserving the core reasoning structure. SAT formulates reasoning as a Finite-State Machine (FSM) with distinct thinking modes (Slow, Normal, Fast, Skip). It navigates these states dynamically using a lightweight Process Reward Model (PRM), compressing easy steps while preserving depth for hard ones. Experiments across 9 LRMs and 7 benchmarks show that SAT achieves up to 40% reduction in reasoning tokens while generally maintaining or improving accuracy.
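As a rough illustration of the control loop described above, the sketch below shows how a per-step process-reward score might drive transitions among the four thinking modes; the `generate_step` and `prm_score` hooks and the thresholds are hypothetical placeholders rather than the paper's implementation.

```python
# Minimal sketch of a stepwise adaptive-thinking loop, under assumed interfaces.
from enum import Enum

class ThinkingMode(Enum):
    SLOW = "slow"      # full, detailed reasoning for hard steps
    NORMAL = "normal"  # default verbosity
    FAST = "fast"      # compressed reasoning for easy steps
    SKIP = "skip"      # omit the step entirely

def select_mode(reward: float) -> ThinkingMode:
    """Map a step-level PRM score (here: higher = easier) to a thinking mode."""
    if reward > 0.9:
        return ThinkingMode.SKIP
    if reward > 0.7:
        return ThinkingMode.FAST
    if reward > 0.4:
        return ThinkingMode.NORMAL
    return ThinkingMode.SLOW

def reason(problem, generate_step, prm_score, max_steps=32):
    """FSM-style loop: draft a step, score it, then emit it in the chosen mode."""
    trace = []
    for _ in range(max_steps):
        draft = generate_step(problem, trace, ThinkingMode.NORMAL)
        if draft is None:                 # decoder signals the chain is complete
            break
        mode = select_mode(prm_score(problem, trace, draft))
        if mode is ThinkingMode.SKIP:
            continue                      # prune the easy step, keep the chain intact
        trace.append(generate_step(problem, trace, mode))
    return trace
```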
Submitted 9 April, 2026;
originally announced April 2026.
-
MOON3.0: Reasoning-aware Multimodal Representation Learning for E-commerce Product Understanding
Authors:
Junxian Wu,
Chenghan Fu,
Zhanheng Nie,
Daoze Zhang,
Bowen Wan,
Wanxian Guan,
Chuan Yu,
Jian Xu,
Bo Zheng
Abstract:
With the rapid growth of e-commerce, exploring general representations rather than task-specific ones has attracted increasing attention. Although recent multimodal large language models (MLLMs) have driven significant progress in product understanding, they are typically employed as feature extractors that implicitly encode product information into global embeddings, thereby limiting their ability to capture fine-grained attributes. Therefore, we argue that leveraging the reasoning capabilities of MLLMs to explicitly model fine-grained product attributes holds significant potential. Nevertheless, achieving this goal remains non-trivial due to several key challenges: (i) long-context reasoning tends to dilute the model's attention to salient information in the raw input; (ii) supervised fine-tuning (SFT) primarily encourages rigid imitation, limiting the exploration of effective reasoning strategies; and (iii) fine-grained details are progressively attenuated during forward propagation. To address these issues, we propose MOON3.0, the first reasoning-aware MLLM-based model for product representation learning. Our method (1) employs a multi-head modality fusion module to adaptively integrate raw signals; (2) incorporates a joint contrastive and reinforcement learning framework to autonomously explore more effective reasoning strategies; and (3) introduces a fine-grained residual enhancement module to progressively preserve local details throughout the network. Additionally, we release a large-scale multimodal e-commerce benchmark MBE3.0. Experimentally, our model demonstrates state-of-the-art zero-shot performance across various downstream tasks on both our benchmark and public datasets.
Submitted 2 April, 2026; v1 submitted 1 April, 2026;
originally announced April 2026.
-
The 1st Winner for 5th PVUW MeViS-Text Challenge: Strong MLLMs Meet SAM3 for Referring Video Object Segmentation
Authors:
Xusheng He,
Canyang Wu,
Jinrong Zhang,
Weili Guan,
Jianlong Wu,
Liqiang Nie
Abstract:
This report presents our winning solution to the 5th PVUW MeViS-Text Challenge. The track studies referring video object segmentation under motion-centric language expressions, where the model must jointly understand appearance, temporal behavior, and object interactions. To address this problem, we build a fully training-free pipeline that combines strong multimodal large language models with SAM3. Our method contains three stages. First, Gemini-3.1 Pro decomposes each target event into instance-level grounding targets, selects the frame where the target is most clearly visible, and generates a discriminative description. Second, SAM3-agent produces a precise seed mask on the selected frame, and the official SAM3 tracker propagates the mask through the whole video. Third, a refinement stage uses Qwen3.5-Plus and behavior-level verification to correct ambiguous or semantically inconsistent predictions. Without task-specific fine-tuning, our method ranks first on the PVUW 2026 MeViS-Text test set, achieving a Final score of 0.909064 and a J&F score of 0.7897. The code is available at https://github.com/Moujuruo/MeViSv2_Track_Solution_2026.
Submitted 31 March, 2026;
originally announced April 2026.
-
Advancing Complex Video Object Segmentation via Tracking-Enhanced Prompt: The 1st Winner for 5th PVUW MOSE Challenge
Authors:
Jinrong Zhang,
Canyang Wu,
Xusheng He,
Weili Guan,
Jianlong Wu,
Liqiang Nie
Abstract:
In the Complex Video Object Segmentation task, researchers are required to track and segment specific targets within cluttered environments, which rigorously tests a method's capability for target comprehension and environmental adaptability. Although SAM3, the current state-of-the-art solution, exhibits unparalleled segmentation performance and robustness on conventional targets, it underperforms on tiny and semantic-dominated objects. The root cause of this limitation lies in SAM3's insufficient comprehension of these specific target types. To address this issue, we propose TEP: Advancing Complex Video Object Segmentation via Tracking-Enhanced Prompts. As a training-free approach, TEP leverages external tracking models and Multimodal Large Language Models to introduce tracking-enhanced prompts, thereby alleviating the difficulty SAM3 faces in understanding these challenging targets. Our method achieved first place (56.91%) on the test set of the PVUW Challenge 2026: Complex Video Object Segmentation Track.
Submitted 31 March, 2026;
originally announced April 2026.
-
Scalable AI-assisted Workflow Management for Detector Design Optimization Using Distributed Computing
Authors:
Derek Anderson,
Amit Bashyal,
Markus Diefenthaler,
Cristiano Fanelli,
Wen Guan,
Tanja Horn,
Alex Jentsch,
Meifeng Lin,
Tadashi Maeno,
Kei Nagai,
Hemalata Nayak,
Connor Pecar,
Karthik Suresh,
Fang-Ying Tsai,
Anselm Vossen,
Tianle Wang,
Torre Wenaus
Abstract:
The Production and Distributed Analysis (PanDA) system, originally developed for the ATLAS experiment at the CERN Large Hadron Collider (LHC), has evolved into a robust platform for orchestrating large-scale workflows across distributed computing resources. Coupled with its intelligent Distributed Dispatch and Scheduling (iDDS) component, PanDA supports AI/ML-driven workflows through a scalable and flexible workflow engine.
We present an AI-assisted framework for detector design optimization that integrates multi-objective Bayesian optimization with the PanDA--iDDS workflow engine to coordinate iterative simulations across heterogeneous resources. The framework addresses the challenge of exploring high-dimensional parameter spaces inherent in modern detector design.
We demonstrate the framework using benchmark problems and realistic studies of the ePIC and dRICH detectors for the Electron-Ion Collider (EIC). Results show improved automation, scalability, and efficiency in multi-objective optimization. This work establishes a flexible and extensible paradigm for AI-driven detector design and other computationally intensive scientific applications.
Submitted 31 March, 2026;
originally announced March 2026.
-
Toward Generalizable Whole Brain Representations with High-Resolution Light-Sheet Data
Authors:
Minyoung E. Kim,
Dae Hee Yun,
Aditi V. Patel,
Madeline Hon,
Webster Guan,
Taegeon Lee,
Brian Nguyen
Abstract:
Unprecedented visual details of biological structures are being revealed by subcellular-resolution whole-brain 3D microscopy data, enabled by recent advances in intact tissue processing and light-sheet fluorescence microscopy (LSFM). These volumetric data offer rich morphological and spatial cellular information; however, the lack of scalable data processing and analysis methods tailored to these petabyte-scale data poses a substantial challenge for accurate interpretation. Further, existing models for visual tasks such as object detection and classification struggle to generalize to this type of data. To accelerate the development of suitable methods and foundational models, we present CANVAS, a comprehensive set of high-resolution whole mouse brain LSFM benchmark data, encompassing six neuronal and immune cell-type markers, along with cell annotations and a leaderboard. We also demonstrate challenges in generalization of baseline models built on existing architectures, especially due to the heterogeneity in cellular morphology across phenotypes and anatomical locations in the brain. To the best of our knowledge, CANVAS is the first and largest LSFM benchmark that captures intact mouse brain tissue at the subcellular level, and it includes extensive annotations of cells throughout the brain.
Submitted 31 March, 2026;
originally announced March 2026.
-
The Vera C. Rubin Observatory Data Preview 1
Authors:
Vera C Rubin Observatory Team,
Tatiana Acero Cuellar,
Emily Acosta,
Christina L Adair,
Prakruth Adari,
Jennifer K Adelman McCarthy,
Anastasia Alexov,
Russ Allbery,
Robyn Allsman,
Yusra AlSayyad,
Jhonatan Amado,
Nathan Amouroux,
Pierre Antilogus,
Alexis Aracena Alcayaga,
Gonzalo Aravena Rojas,
Claudio H Araya Cortes,
Eric Aubourg,
Tim S Axelrod,
John Banovetz,
Carlos Barria,
Amanda E Bauer,
Brian J Bauman,
Ellen Bechtol,
Keith Bechtol,
Andrew C Becker
, et al. (303 additional authors not shown)
Abstract:
We present Rubin Data Preview 1 (DP1), the first data from the NSF-DOE Vera C. Rubin Observatory, comprising raw and calibrated single-epoch images, coadds, difference images, detection catalogs, and ancillary data products. DP1 is based on 1792 optical and near-infrared exposures acquired over 48 distinct nights by the Rubin Commissioning Camera (LSSTComCam) on the Simonyi Survey Telescope at the Summit Facility on Cerro Pachón, Chile, in late 2024. DP1 covers $\sim$15 deg$^2$ distributed across seven roughly equal-sized, non-contiguous fields, each independently observed in six broad photometric bands, $ugrizy$. The median FWHM of the point spread function across all bands is approximately 1.14 arcseconds, with the sharpest images reaching about 0.58 arcseconds. The 5$σ$ point-source depths for coadded images in the deepest field, the Extended Chandra Deep Field-South, are $u$ = 24.55, $g$ = 26.18, $r$ = 25.96, $i$ = 25.71, $z$ = 25.07, and $y$ = 23.1. Other fields are no more than 2.2 magnitudes shallower in any band where they have nonzero coverage. DP1 contains approximately 2.3 million distinct astrophysical objects, of which 1.6 million are extended in at least one band in the coadds, and 431 solar system objects, of which 93 are new discoveries. DP1 is approximately 3.5 TB in size and is available to Rubin data rights holders via the Rubin Science Platform, a cloud-based environment for the analysis of petascale astronomical data. While small compared to future LSST releases, its high quality and diversity of data support a broad range of early science investigations ahead of full operations in 2026.
Submitted 24 March, 2026;
originally announced March 2026.
-
EnergyAction: Unimanual to Bimanual Composition with Energy-Based Models
Authors:
Mingchen Song,
Xiang Deng,
Jie Wei,
Dongmei Jiang,
Liqiang Nie,
Weili Guan
Abstract:
Recent advances in unimanual manipulation policies have achieved remarkable success across diverse robotic tasks through abundant training data and well-established model architectures. However, extending these capabilities to bimanual manipulation remains challenging due to the lack of bimanual demonstration data and the complexity of coordinating dual-arm actions. Existing approaches either rely on extensive bimanual datasets or fail to effectively leverage pre-trained unimanual policies. To address these limitations, we propose \textbf{EnergyAction}, a novel framework that compositionally transfers unimanual manipulation policies to bimanual tasks through Energy-Based Models (EBMs). Specifically, our method incorporates three key innovations. First, we model individual unimanual policies as EBMs and leverage their compositional properties to compose left and right arm actions, enabling the fusion of unimanual policies into a bimanual policy. Second, we introduce an energy-based temporal-spatial coordination mechanism through energy constraints, ensuring that the generated bimanual actions exhibit both temporal coherence and spatial feasibility. Third, we propose two different energy-aware denoising strategies that dynamically adapt denoising steps based on action quality assessment. These strategies ensure the generation of high-quality actions while maintaining superior computational efficiency compared to fixed-step denoising approaches. Experimental results demonstrate that EnergyAction effectively transfers unimanual knowledge to bimanual tasks, achieving superior performance on both simulated and real-world tasks with minimal bimanual data.
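The energy-composition idea behind the first innovation can be pictured with a toy sketch: two unimanual energies are summed and a coordination penalty is added, then a low-energy bimanual action is sampled. The quadratic energies, the collision-style penalty, and the random-search sampler below are illustrative assumptions and do not reflect the paper's denoising-based formulation.

```python
# Toy composition of two unimanual EBMs into one bimanual energy landscape.
import numpy as np

def left_energy(a_left, obs):
    """Placeholder unimanual EBM: low energy near a hypothetical preferred action."""
    return float(np.sum((a_left - obs["left_target"]) ** 2))

def right_energy(a_right, obs):
    return float(np.sum((a_right - obs["right_target"]) ** 2))

def coordination_energy(a_left, a_right, min_gap=0.1):
    """Soft spatial constraint keeping the two end-effectors apart."""
    gap = np.linalg.norm(a_left[:3] - a_right[:3])
    return max(0.0, min_gap - gap) ** 2

def bimanual_energy(a, obs, lam=10.0):
    """Composed energy: sum of unimanual energies plus a coordination term."""
    a_left, a_right = a[:7], a[7:]
    return (left_energy(a_left, obs) + right_energy(a_right, obs)
            + lam * coordination_energy(a_left, a_right))

def sample_action(obs, steps=500, noise=0.05, seed=0):
    """Crude random-search sampler over the composed energy landscape."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=14)
    for _ in range(steps):
        proposal = a + rng.normal(scale=noise, size=14)
        if bimanual_energy(proposal, obs) < bimanual_energy(a, obs):
            a = proposal
    return a

obs = {"left_target": np.linspace(0.0, 1.0, 7), "right_target": np.linspace(1.0, 0.0, 7)}
print(bimanual_energy(sample_action(obs), obs))
```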
Submitted 9 March, 2026;
originally announced March 2026.
-
Mitigating Overthinking in Large Reasoning Language Models via Reasoning Path Deviation Monitoring
Authors:
Weixin Guan,
Liang Li,
Jiapeng Liu,
Bing Li,
Peng Fu,
Chengyang Fang,
Xiaoshuai Hao,
Can Ma,
Weiping Wang
Abstract:
Large Reasoning Language Models (LRLMs) demonstrate impressive capabilities on complex tasks by utilizing long Chain-of-Thought reasoning. However, they are prone to overthinking, which generates redundant reasoning steps that degrade both performance and efficiency. Recently, early-exit strategies have been proposed to mitigate overthinking by dynamically and adaptively terminating redundant reasoning. However, current early-exit methods either introduce extra training overhead by relying on proxy models or limit inference throughput due to frequent switching between reasoning and generating probing answers. Moreover, most early-exit methods harm LRLM performance due to over-truncation. Our insight stems from an observation: overthinking often causes LRLMs to deviate from the correct reasoning path, a deviation frequently accompanied by high-entropy transition tokens. Given this, we propose an early-exit method deeply coupled with the native reasoning process, which leverages a path deviation index, a dedicated monitoring metric tracking the frequent occurrence of high-entropy transition tokens, to dynamically detect and terminate overthinking trajectories. We conduct experiments across multiple benchmarks using LRLMs of different types and scales, and the results indicate that our method delivers the largest performance improvement over vanilla CoT compared to existing early-exit methods.
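A minimal sketch of the kind of monitoring the abstract describes is given below: per-token entropies are tracked, and decoding stops once high-entropy transition tokens become frequent within a sliding window. The window size, entropy threshold, and exit rule are illustrative assumptions, not the paper's actual path deviation index.

```python
# Entropy-based early-exit monitor sketch over a generic step-wise decoder.
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_exit(entropy_history, window=32, tau=2.5, ratio=0.4):
    """Exit when a large fraction of the last `window` tokens are high-entropy."""
    recent = entropy_history[-window:]
    if len(recent) < window:
        return False
    deviation_index = sum(e > tau for e in recent) / window
    return deviation_index > ratio

def decode_with_monitor(stepwise_decoder, max_tokens=4096):
    """`stepwise_decoder` is assumed to yield (token, next-token-probs) pairs."""
    out, entropies = [], []
    for token, probs in stepwise_decoder:
        out.append(token)
        entropies.append(token_entropy(probs))
        if len(out) >= max_tokens or should_exit(entropies):
            break                       # terminate the suspected overthinking trajectory
    return out
```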
Submitted 15 March, 2026;
originally announced March 2026.
-
Mastering Negation: Boosting Grounding Models via Grouped Opposition-Based Learning
Authors:
Zesheng Yang,
Xi Jiang,
Bingzhang Hu,
Weili Guan,
Runmin Cong,
Guo-Jun Qi,
Feng Zheng
Abstract:
Current vision-language detection and grounding models predominantly focus on prompts with positive semantics and often struggle to accurately interpret and ground complex expressions containing negative semantics. A key reason for this limitation is the lack of high-quality training data that explicitly captures discriminative negative samples and negation-aware language descriptions.
To address this challenge, we introduce D-Negation, a new dataset that provides objects annotated with both positive and negative semantic descriptions. Building upon the observation that negation reasoning frequently appears in natural language, we further propose a grouped opposition-based learning framework that learns negation-aware representations from limited samples. Specifically, our method organizes opposing semantic descriptions from D-Negation into structured groups and formulates two complementary loss functions that encourage the model to reason about negation and semantic qualifiers.
We integrate the proposed dataset and learning strategy into a state-of-the-art language-based grounding model. By fine-tuning fewer than 10 percent of the model parameters, our approach achieves improvements of up to 4.4 mAP and 5.7 mAP on positive and negative semantic evaluations, respectively. These results demonstrate that explicitly modeling negation semantics can substantially enhance the robustness and localization accuracy of vision-language grounding models.
Submitted 12 March, 2026;
originally announced March 2026.
-
HATS: Hardness-Aware Trajectory Synthesis for GUI Agents
Authors:
Rui Shao,
Ruize Gao,
Bin Xie,
Yixing Li,
Kaiwen Zhou,
Shuai Wang,
Weili Guan,
Gongwei Chen
Abstract:
Graphical user interface (GUI) agents powered by large vision-language models (VLMs) have shown remarkable potential in automating digital tasks, highlighting the need for high-quality trajectory data to support effective agent training. Yet existing trajectory synthesis pipelines often yield agents that fail to generalize beyond simple interactions. We identify this limitation as stemming from the neglect of semantically ambiguous actions, whose meanings are context-dependent, sequentially dependent, or visually ambiguous. Such actions are crucial for real-world robustness but are under-represented and poorly processed in current datasets, leading to semantic misalignment between task instructions and execution. To address these issues, we propose HATS, a Hardness-Aware Trajectory Synthesis framework designed to mitigate the impact of semantic ambiguity. We define hardness as the degree of semantic ambiguity associated with an action and develop two complementary modules: (1) hardness-driven exploration, which guides data collection toward ambiguous yet informative interactions, and (2) alignment-guided refinement, which iteratively validates and repairs instruction-execution alignment. The two modules operate in a closed loop: exploration supplies refinement with challenging trajectories, while refinement feedback updates the hardness signal to guide future exploration. Extensive experiments show that agents trained with HATS consistently outperform state-of-the-art baselines across benchmark GUI environments.
Submitted 12 March, 2026;
originally announced March 2026.
-
Stabilization of premixed NH3/H2/air flames via bluff-body flame holders
Authors:
Lukas Gaipl,
Wei Guan,
Ganesh Guggilla,
Alexey Kropman,
Frank Beyrau,
Dominique Thévenin
Abstract:
The stabilization mechanisms of fully premixed NH3/H2/air flames anchored behind a bluff body are investigated using combined experiments and direct numerical simulations. Particular attention is given to the interplay between preferential diffusion, heat release, flow recirculation, and turbulence-flame interaction. Comparison between non-reactive and reactive cases shows that thermal expansion strongly alters the flow field, increasing the recirculation zone length by about 40% and the shear layer width by roughly 50% near the end of the recirculation region. Excellent agreement between measurements and simulations for mean and fluctuating axial velocities validates the numerical approach. Analysis of the flame structure reveals a distinctive stabilization mechanism at the flame root: preferential hydrogen diffusion generates a localized diffusion flame branch that enhances radical production and promotes robust anchoring. Combustion proceeds sequentially, with hydrogen mainly consumed in the shear layer, followed by ammonia cracking and the main heat-release region. Near the bluff body, heat release is concentrated within the recirculation zone, while downstream regions are increasingly influenced by turbulence and velocity fluctuations. The roles of curvature and strain are quantified to assess stretch effects along the flame front. Convex curvature near the flame root enhances hydrogen enrichment and locally increases burning rates, reinforcing stabilization. In contrast, concave curvature and higher stretch near the end of the recirculation zone weaken the flame and mark a transition toward a turbulence-dominated regime. Overall, stabilization results from a coupled feedback between recirculation-driven heat exchange and rapid hydrogen oxidation, sustaining an intermediate ammonia reaction zone and enabling robust anchoring of carbon-free NH3/H2 flames.
Submitted 8 March, 2026;
originally announced March 2026.
-
Constrained variational problems on perturbed lattice graphs
Authors:
Weiqi Guan
Abstract:
In this paper, we solve some constrained variational problems on perturbed lattice graphs $G$. The first problem addresses the existence of ground state normalized solutions to Schrödinger equations
\begin{equation*}
\left\{
\begin{aligned}
&-\Delta_{G} u+\lambda u=\vert u\vert^{p-2}u, \quad x\in G,\\
&\Vert u\Vert_{l^2(G)}^2=a.
\end{aligned}
\right.
\end{equation*}
We prove that if the graph is obtained by deleting finitely many edges from the lattice graph while maintaining connectivity, then there exists a threshold $\alpha_G\in[0,\infty)$ such that no ground state normalized solution exists if $0<a<\alpha_G$, while a ground state normalized solution exists if $a>\alpha_G$. If the graph is obtained by adding finitely many edges $E^{'}$ to the lattice graph, we prove that there exist $E^{'}$ and $a_1$ such that for all $a>a_1$ no ground state normalized solution exists.
The second problem concerns the existence of an extremal function for the Sobolev inequality. If the graph $G$ is obtained by deleting finitely many edges from the lattice graph while maintaining connectivity, we prove that an extremal function exists in the Sobolev super-critical regime; in the Sobolev critical regime, we prove that there exists such a graph $G$ for which the extremal is attained. If the graph is obtained by adding finitely many edges $E^{'}$ to the lattice graph, we prove that there exists $E^{'}$ such that no extremal function exists.
Submitted 14 February, 2026;
originally announced February 2026.
-
Do All Individual Layers Help? An Empirical Study of Task-Interfering Layers in Vision-Language Models
Authors:
Zhiming Liu,
Yujie Wei,
Lei Feng,
Xiu Su,
Xiaobo Xia,
Weili Guan,
Zeke Xie,
Shuo Yang
Abstract:
Current vision-language models (VLMs) have demonstrated capabilities across a wide range of multimodal tasks. Typically, in a pretrained VLM, all layers are engaged by default to make predictions on downstream tasks. We find that intervening on a single layer, such as by zeroing its parameters, can improve the performance on certain tasks, indicating that some layers hinder rather than help downstream tasks. We systematically investigate how individual layers influence different tasks via layer intervention. Specifically, we measure the change in performance relative to the base model after intervening on each layer and observe improvements when bypassing specific layers. This improvement generalizes across models and datasets, indicating the presence of Task-Interfering Layers that harm downstream task performance. We introduce the Task-Layer Interaction Vector, which quantifies the effect of intervening on each layer of a VLM given a task. These task-interfering layers exhibit task-specific sensitivity patterns: tasks requiring similar capabilities show consistent response trends under layer interventions, as evidenced by the high similarity in their task-layer interaction vectors. Inspired by these findings, we propose TaLo (Task-Adaptive Layer Knockout), a training-free, test-time adaptation method that dynamically identifies and bypasses the most interfering layer for a given task. Without parameter updates, TaLo improves performance across various models and datasets, including boosting Qwen-VL's accuracy on the Maps task in ScienceQA by up to 16.6%. Our work reveals an unexpected form of modularity in pretrained VLMs and provides a plug-and-play, training-free mechanism to unlock hidden capabilities at inference time. The source code will be publicly available.
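A simplified sketch of the layer-intervention sweep and the resulting task-layer interaction vector is shown below; `model.layers` and `evaluate(model, task)` stand in for a generic decoder's block list and a task-accuracy routine, and bypassing a block with an identity module is just one possible intervention, not necessarily the paper's.

```python
# Layer-knockout sweep sketch: measure per-layer accuracy deltas, then bypass
# the most interfering layer for the task (all interfaces are assumed).
import torch
import torch.nn as nn

class Bypass(nn.Module):
    """Identity stand-in used to skip one transformer block."""
    def forward(self, x, *args, **kwargs):
        return x

@torch.no_grad()
def task_layer_interaction_vector(model, task, evaluate):
    """Accuracy change per layer when that layer is bypassed (positive = layer interferes)."""
    base = evaluate(model, task)
    deltas = []
    for i in range(len(model.layers)):
        original = model.layers[i]
        model.layers[i] = Bypass()         # intervene on layer i
        deltas.append(evaluate(model, task) - base)
        model.layers[i] = original         # restore the layer
    return torch.tensor(deltas)

def talo_knockout(model, task, evaluate):
    """Bypass the single most interfering layer, if removing any layer helps."""
    vec = task_layer_interaction_vector(model, task, evaluate)
    worst = int(vec.argmax())
    if vec[worst] > 0:
        model.layers[worst] = Bypass()
    return model, vec
```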
Submitted 1 February, 2026;
originally announced February 2026.
-
ConLA: Contrastive Latent Action Learning from Human Videos for Robotic Manipulation
Authors:
Weisheng Dai,
Kai Lan,
Jianyi Zhou,
Bo Zhao,
Xiu Su,
Junwen Tong,
Weili Guan,
Shuo Yang
Abstract:
Vision-Language-Action (VLA) models achieve preliminary generalization through pretraining on large scale robot teleoperation datasets. However, acquiring datasets that comprehensively cover diverse tasks and environments is extremely costly and difficult to scale. In contrast, human demonstration videos offer a rich and scalable source of diverse scenes and manipulation behaviors, yet their lack of explicit action supervision hinders direct utilization. Prior work leverages VQ-VAE based frameworks to learn latent actions from human videos in an unsupervised manner. Nevertheless, since the training objective primarily focuses on reconstructing visual appearances rather than capturing inter-frame dynamics, the learned representations tend to rely on spurious visual cues, leading to shortcut learning and entangled latent representations that hinder transferability. To address this, we propose ConLA, an unsupervised pretraining framework for learning robotic policies from human videos. ConLA introduces a contrastive disentanglement mechanism that leverages action category priors and temporal cues to isolate motion dynamics from visual content, effectively mitigating shortcut learning. Extensive experiments show that ConLA achieves strong performance across diverse benchmarks. Notably, by pretraining solely on human videos, our method for the first time surpasses the performance obtained with real robot trajectory pretraining, highlighting its ability to extract pure and semantically consistent latent action representations for scalable robot learning.
Submitted 31 January, 2026;
originally announced February 2026.
-
CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning
Authors:
Ji Shi,
Peiming Guo,
Meishan Zhang,
Miao Zhang,
Xuebo Liu,
Min Zhang,
Weili Guan
Abstract:
Code verifiers play a critical role in post-verification for LLM-based code generation, yet existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency. While reinforcement learning (RL) offers a promising alternative by optimizing models through execution-driven rewards without labeled supervision, our preliminary results show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples. We first present a theoretical analysis showing that branch coverage, sample difficulty, and syntactic and functional correctness can be jointly modeled as RL rewards, where optimizing these signals can improve the reliability of unit-test-based verification. Guided by this analysis, we design syntax- and functionality-aware rewards and further propose branch- and sample-difficulty-aware RL using exponential reward shaping and static analysis metrics. With this formulation, CVeDRL achieves state-of-the-art performance with only 0.6B parameters, yielding up to 28.97% higher pass rate and 15.08% higher branch coverage than GPT-3.5, while delivering over $20\times$ faster inference than competitive baselines. Code is available at https://github.com/LIGHTCHASER1/CVeDRL.git
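One way to picture the reward shaping described above is the toy function below, which combines syntax, functionality, and an exponentially difficulty-weighted coverage term; the weights, the normalization, and the difficulty inputs are illustrative assumptions rather than the paper's exact reward design.

```python
# Sketch of a difficulty-aware unit-test reward with exponential shaping.
import math

def unit_test_reward(
    syntax_ok: bool,          # does the generated test parse/compile?
    passes: bool,             # does it execute correctly against the reference?
    branch_coverage: float,   # fraction of branches covered, in [0, 1]
    branch_difficulty: float, # static-analysis estimate of branch hardness, in [0, 1]
    sample_difficulty: float, # estimated hardness of the code sample, in [0, 1]
    alpha: float = 2.0,
) -> float:
    if not syntax_ok:
        return -1.0                      # hard penalty for unparsable output
    r = 0.2                              # small base reward for valid syntax
    if passes:
        r += 0.5                         # functionality reward
    # exponential shaping: covering difficult branches on difficult samples
    # earns disproportionately more than covering easy ones
    shaped = branch_coverage * math.exp(alpha * branch_difficulty * sample_difficulty)
    r += 0.3 * shaped / math.exp(alpha)  # keep the shaped term within [0, 0.3]
    return r

print(unit_test_reward(True, True, branch_coverage=0.8,
                       branch_difficulty=0.9, sample_difficulty=0.7))
```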
Submitted 30 January, 2026;
originally announced January 2026.
-
StructAlign: Structured Cross-Modal Alignment for Continual Text-to-Video Retrieval
Authors:
Shaokun Wang,
Weili Guan,
Jizhou Han,
Jianlong Wu,
Yupeng Hu,
Liqiang Nie
Abstract:
Continual Text-to-Video Retrieval (CTVR) is a challenging multimodal continual learning setting, where models must incrementally learn new semantic categories while maintaining accurate text-video alignment for previously learned ones, thus making it particularly prone to catastrophic forgetting. A key challenge in CTVR is feature drift, which manifests in two forms: intra-modal feature drift caused by continual learning within each modality, and non-cooperative feature drift across modalities that leads to modality misalignment. To mitigate these issues, we propose StructAlign, a structured cross-modal alignment method for CTVR. First, StructAlign introduces a simplex Equiangular Tight Frame (ETF) geometry as a unified geometric prior to mitigate modality misalignment. Building upon this geometric prior, we design a cross-modal ETF alignment loss that aligns text and video features with category-level ETF prototypes, encouraging the learned representations to form an approximate simplex ETF geometry. In addition, to suppress intra-modal feature drift, we design a Cross-modal Relation Preserving loss, which leverages complementary modalities to preserve cross-modal similarity relations, providing stable relational supervision for feature updates. By jointly addressing non-cooperative feature drift across modalities and intra-modal feature drift, StructAlign effectively alleviates catastrophic forgetting in CTVR. Extensive experiments on benchmark datasets demonstrate that our method consistently outperforms state-of-the-art continual retrieval approaches.
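The simplex ETF prior itself is a standard construction; the sketch below builds the prototypes and a simple cosine alignment loss toward them. It illustrates only the geometric prior, not the paper's full cross-modal ETF alignment or relation-preserving objectives.

```python
# Simplex ETF prototypes plus a toy prototype-alignment loss.
import numpy as np

def simplex_etf(num_classes: int, dim: int, seed: int = 0) -> np.ndarray:
    """Return a (dim, num_classes) matrix whose columns form a simplex ETF."""
    assert dim >= num_classes
    rng = np.random.default_rng(seed)
    u, _ = np.linalg.qr(rng.normal(size=(dim, num_classes)))   # orthonormal columns
    k = num_classes
    # columns are unit norm with pairwise cosine -1/(K-1), i.e. maximally separated
    return np.sqrt(k / (k - 1)) * u @ (np.eye(k) - np.ones((k, k)) / k)

def etf_alignment_loss(features: np.ndarray, labels: np.ndarray, prototypes: np.ndarray) -> float:
    """Mean (1 - cosine similarity) between each feature and its class prototype."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    cos = np.sum(feats * prototypes[:, labels].T, axis=1)
    return float(np.mean(1.0 - cos))

protos = simplex_etf(num_classes=10, dim=128)
print(np.round(protos.T @ protos, 3)[0, :4])   # 1.0 followed by -0.111 entries
feats = np.random.default_rng(1).normal(size=(5, 128))
print(etf_alignment_loss(feats, np.array([0, 1, 2, 3, 4]), protos))
```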
Submitted 28 January, 2026;
originally announced January 2026.
-
IOTA: Corrective Knowledge-Guided Prompt Learning via Black-White Box Framework
Authors:
Shaokun Wang,
Yifan Yu,
Yuhang He,
Weili Guan,
Yihong Gong
Abstract:
Recently, adapting pre-trained models to downstream tasks has attracted increasing interest. Previous Parameter-Efficient-Tuning (PET) methods regard the pre-trained model as an opaque Black Box model, relying purely on data-driven optimization and underutilizing their inherent prior knowledge. This oversight limits the models' potential for effective downstream task adaptation. To address these issues, we propose a novel black-whIte bOx prompT leArning framework (IOTA), which integrates a data-driven Black Box module with a knowledge-driven White Box module for downstream task adaptation. Specifically, the White Box module derives corrective knowledge by contrasting the wrong predictions with the right cognition. This knowledge is verbalized into interpretable human prompts and leveraged through a corrective knowledge-guided prompt selection strategy to guide the Black Box module toward more accurate predictions. By jointly leveraging knowledge- and data-driven learning signals, IOTA achieves effective downstream task adaptation. Experimental results on 12 image classification benchmarks under few-shot and easy-to-hard adaptation settings demonstrate the effectiveness of corrective knowledge and the superiority of our method over state-of-the-art methods.
Submitted 28 January, 2026;
originally announced January 2026.
-
PersonalAlign: Hierarchical Implicit Intent Alignment for Personalized GUI Agent with Long-Term User-Centric Records
Authors:
Yibo Lyu,
Gongwei Chen,
Rui Shao,
Weili Guan,
Liqiang Nie
Abstract:
While GUI agents have shown strong performance under explicit and complete instructions, real-world deployment requires aligning with users' more complex implicit intents. In this work, we highlight Hierarchical Implicit Intent Alignment for Personalized GUI Agent (PersonalAlign), a new agent task that requires agents to leverage long-term user records as persistent context to resolve omitted preferences in vague instructions and anticipate latent routines by user state for proactive assistance. To facilitate this study, we introduce AndroidIntent, a benchmark designed to evaluate agents' ability in resolving vague instructions and providing proactive suggestions through reasoning over long-term user records. We annotated 775 user-specific preferences and 215 routines from 20k long-term records across different users for evaluation. Furthermore, we introduce the Hierarchical Intent Memory Agent (HIM-Agent), which maintains a continuously updating personal memory and hierarchically organizes user preferences and routines for personalization. Finally, we evaluate a range of GUI agents on AndroidIntent, including GPT-5, Qwen3-VL, and UI-TARS; the results further show that HIM-Agent significantly improves execution and proactive performance by 15.7% and 7.3%, respectively.
Submitted 14 January, 2026;
originally announced January 2026.
-
ReStyle-TTS: Relative and Continuous Style Control for Zero-Shot Speech Synthesis
Authors:
Haitao Li,
Chunxiang Jin,
Chenglin Li,
Wenhao Guan,
Zhengxing Huang,
Xie Chen
Abstract:
Zero-shot text-to-speech models can clone a speaker's timbre from a short reference audio, but they also strongly inherit the speaking style present in the reference. As a result, synthesizing speech with a desired style often requires carefully selecting reference audio, which is impractical when only limited or mismatched references are available. While recent controllable TTS methods attempt to address this issue, they typically rely on absolute style targets and discrete textual prompts, and therefore do not support continuous and reference-relative style control. We propose ReStyle-TTS, a framework that enables continuous and reference-relative style control in zero-shot TTS. Our key insight is that effective style control requires first reducing the model's implicit dependence on reference style before introducing explicit control mechanisms. To this end, we introduce Decoupled Classifier-Free Guidance (DCFG), which independently controls text and reference guidance, reducing reliance on reference style while preserving text fidelity. On top of this, we apply style-specific LoRAs together with Orthogonal LoRA Fusion to enable continuous and disentangled multi-attribute control, and introduce a Timbre Consistency Optimization module to mitigate timbre drift caused by weakened reference guidance. Experiments show that ReStyle-TTS enables user-friendly, continuous, and relative control over pitch, energy, and multiple emotions while maintaining intelligibility and speaker timbre, and performs robustly in challenging mismatched reference-target style scenarios.
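Decoupled guidance of the kind described above can be sketched with the common multi-condition classifier-free guidance recipe, where text guidance and reference guidance get independent scales; the specific decomposition and scales used by ReStyle-TTS may differ, so `denoise_fn` and the weights below are illustrative assumptions only.

```python
# Sketch of decoupled classifier-free guidance for two conditioning signals.
import numpy as np

def decoupled_cfg(denoise_fn, x_t, t, text, ref, w_text=3.0, w_ref=1.0):
    """
    denoise_fn(x_t, t, text, ref) -> model prediction; pass None to drop a condition.
    Independent scales let the reliance on the reference style be reduced
    without hurting text fidelity.
    """
    eps_uncond = denoise_fn(x_t, t, None, None)
    eps_text = denoise_fn(x_t, t, text, None)
    eps_full = denoise_fn(x_t, t, text, ref)
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)   # push toward the text content
            + w_ref * (eps_full - eps_text))     # separately scale the reference style

# toy usage with a dummy model so the sketch runs standalone
dummy = lambda x, t, text, ref: x * (1.0 + 0.1 * (text is not None) + 0.05 * (ref is not None))
print(decoupled_cfg(dummy, np.ones(4), t=10, text="hello", ref=np.zeros(8)))
```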
Submitted 7 January, 2026;
originally announced January 2026.
-
GLOW: Graph-Language Co-Reasoning for Agentic Workflow Performance Prediction
Authors:
Wei Guan,
Jian Cao,
Jinyu Cai,
Qiqi Cai,
Jianqi Gao,
See-Kiong Ng
Abstract:
Agentic Workflows (AWs) have emerged as a promising paradigm for solving complex tasks. However, the scalability of automating their generation is severely constrained by the high cost and latency of execution-based evaluation. Existing AW performance prediction methods act as surrogates but fail to simultaneously capture the intricate topological dependencies and the deep semantic logic embedded in AWs. To address this limitation, we propose GLOW, a unified framework for AW performance prediction that combines the graph-structure modeling capabilities of GNNs with the reasoning power of LLMs. Specifically, we introduce a graph-oriented LLM, instruction-tuned on graph tasks, to extract topologically aware semantic features, which are fused with GNN-encoded structural representations. A contrastive alignment strategy further refines the latent space to distinguish high-quality AWs. Extensive experiments on FLORA-Bench show that GLOW outperforms state-of-the-art baselines in prediction accuracy and ranking utility.
Submitted 11 December, 2025;
originally announced December 2025.
-
On the Stochastic Analysis of Random Linear Streaming Codes in Multi-Hop Relay Networks
Authors:
Kai Huang,
Xinyu Xie,
Chunpeng Chen,
Wenjie Guan,
Xiaoran Wang,
Jinbei Zhang
Abstract:
In this paper, we explore the stochastic performance limit of large-field-size Random Linear Streaming Codes (RLSCs) in multi-hop relay networks. In our model, a source transmits a sequence of streaming messages to a destination through multiple relays subject to a delay constraint. Most previous research focused on deterministic adversarial channels, which introduce only restricted types of erasure patterns, and aimed to design optimal capacity-achieving codes. In this paper, we focus on stochastic channels where each hop is subject to i.i.d. packet erasures, and carry out a stochastic analysis of the error probability of multi-hop RLSCs. Our contributions are three-fold. Firstly, the error event of large-field-size RLSCs is characterized in a two-hop relay network with a novel framework, which quantifies the information flowing through each node in the network. Due to the erasures in different hops, some source symbols can be "detained" at the source or relay while others have arrived at the destination. By iteratively computing the number of detained symbols at each node, this framework extends the concept of "information debt" from point-to-point networks [Pinwen Su et al. 2022] to two-hop relay networks. Secondly, based on the error event, the expression for the average error probability in the two-hop network is derived by carefully analyzing the expectation terms. To handle the expectation over all possible erasure patterns along the two hops of the network, the transition matrices of the detained symbols are constructed in a novel "band" fashion with a nested structure. Thirdly, the results derived for the two-hop network are further generalized to relay networks with an arbitrary number of hops. Furthermore, simulations are conducted to verify the accuracy of our stochastic analysis and to compare with some existing streaming codes designed for adversarial channels.
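The detained-symbol bookkeeping can be pictured with a toy two-hop simulation: a rate-limited source stream, i.i.d. erasures on each hop, and a crude prefix-decoding approximation of the information-debt condition. The rate, erasure model, and decoding rule below are simplifications for illustration, not the paper's exact coding scheme or analysis.

```python
# Toy two-hop streaming simulation tracking detained symbols per node.
import random

def simulate(num_slots=20_000, rate=0.7, p1=0.1, p2=0.1, delay=10, seed=0):
    """One coded packet per slot per hop, i.i.d. erasures with prob p1 (hop 1) and p2 (hop 2)."""
    rng = random.Random(seed)
    gen_times = []               # slot in which each source symbol was generated
    delivered1 = delivered2 = 0  # innovative packets past hop 1 / hop 2
    decoded = misses = 0
    for t in range(num_slots):
        if rng.random() < rate:                       # new source symbol (rate-limited stream)
            gen_times.append(t)
        if delivered1 < len(gen_times) and rng.random() > p1:
            delivered1 += 1                           # innovative packet reaches the relay
        if delivered2 < delivered1 and rng.random() > p2:
            delivered2 += 1                           # relay forwards it to the destination
        # detained symbols: len(gen_times) - delivered1 at the source,
        #                   delivered1 - delivered2 at the relay
        while decoded < delivered2:                   # prefix-decoding approximation
            misses += (t - gen_times[decoded]) > delay
            decoded += 1
    misses += sum(1 for s in gen_times[decoded:] if num_slots - 1 - s > delay)
    return misses / max(1, len(gen_times))

print(f"approximate deadline-violation rate: {simulate():.4f}")
```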
Submitted 16 December, 2025;
originally announced December 2025.
-
Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction
Authors:
Wenfei Guan,
Jilin Mei,
Tong Shen,
Xumin Wu,
Shuo Wang,
Chen Min,
Yu Hu
Abstract:
Deep learning has advanced vectorized road extraction in urban settings, yet off-road environments remain underexplored and challenging. A significant domain gap causes advanced models to fail in wild terrains due to two key issues: lack of large-scale vectorized datasets and structural weakness in prevailing methods. Models such as SAM-Road employ a node-centric paradigm that reasons at sparse endpoints, making them fragile to occlusions and ambiguous junctions in off-road scenes, leading to topological errors.
This work addresses these limitations in two complementary ways. First, we release WildRoad, a global off-road road network dataset constructed efficiently with a dedicated interactive annotation tool tailored for road-network labeling. Second, we introduce MaGRoad (Mask-aware Geodesic Road network extractor), a path-centric framework that aggregates multi-scale visual evidence along candidate paths to infer connectivity robustly.
Extensive experiments show that MaGRoad achieves state-of-the-art performance on our challenging WildRoad benchmark while generalizing well to urban datasets. An efficient vertex extraction strategy also yields roughly 2.5X faster inference, improving practical applicability. Together, the dataset and path-centric paradigm provide a stronger foundation for mapping roads in the wild. We release both the dataset and code at https://github.com/xiaofei-guan/MaGRoad.
Submitted 8 March, 2026; v1 submitted 11 December, 2025;
originally announced December 2025.
-
RoboNeuron: A Middle-Layer Infrastructure for Agent-Driven Orchestration in Embodied AI
Authors:
Weifan Guan,
Qinghao Hu,
Huasen Xi,
Chenxiao Zhang,
Aosheng Li,
Jian Cheng
Abstract:
Vision-language-action (VLA) models and LLM agents have advanced rapidly, yet reliable deployment on physical robots is often hindered by an interface mismatch between agent tool APIs and robot middleware. Current implementations typically rely on ad-hoc wrappers that are difficult to reuse, and changes to the VLA backend or serving stack often necessitate extensive re-integration. We introduce RoboNeuron, a middleware layer that connects the Model Context Protocol (MCP) for LLM agents with robot middleware such as ROS2. RoboNeuron bridges these ecosystems by deriving agent-callable tools directly from ROS schemas, providing a unified execution abstraction that supports both direct commands and modular composition, and localizing backend, runtime, and acceleration-preset changes within a stable inference boundary. We evaluate RoboNeuron in simulation and on hardware through multi-platform base control, arm motion, and VLA-based grasping tasks, demonstrating that it enables modular system orchestration under a unified interface while supporting backend transitions without system rewiring. The full code implementation of this work is available at https://github.com/guanweifan/RoboNeuron
Submitted 1 April, 2026; v1 submitted 11 December, 2025;
originally announced December 2025.
-
Solving Oversmoothing in GNNs via Nonlocal Message Passing: Algebraic Smoothing and Depth Scalability
Authors:
Weiqi Guan,
Junlin He
Abstract:
The relationship between Layer Normalization (LN) placement and the oversmoothing phenomenon remains underexplored. We identify a critical dilemma: Pre-LN architectures avoid oversmoothing but suffer from the curse of depth, while Post-LN architectures bypass the curse of depth but experience oversmoothing.
To resolve this, we propose a new method based on Post-LN that induces algebraic smoothing, preventing oversmoothing without the curse of depth. Empirical results across five benchmarks demonstrate that our approach supports deeper networks (up to 256 layers) and improves performance, requiring no additional parameters.
Key contributions:
Theoretical Characterization: Analysis of LN dynamics and their impact on oversmoothing and the curse of depth.
A Principled Solution: A parameter-efficient method that induces algebraic smoothing and avoids oversmoothing and the curse of depth.
Empirical Validation: Extensive experiments showing the effectiveness of the method in deeper GNNs.
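The Pre-LN versus Post-LN placement discussed above can be made concrete with a generic message-passing layer; the mean-aggregation layer below is not the paper's proposed method, only a minimal way to see where the normalization sits in each variant and to probe feature collapse at depth.

```python
# Generic Pre-LN vs Post-LN message-passing layer over a dense adjacency.
import torch
import torch.nn as nn

class GNNLayer(nn.Module):
    def __init__(self, dim: int, norm_first: bool):
        super().__init__()
        self.norm_first = norm_first       # True: Pre-LN, False: Post-LN
        self.norm = nn.LayerNorm(dim)
        self.lin = nn.Linear(dim, dim)

    def aggregate(self, x, adj):
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        return adj @ x / deg               # mean over neighbors

    def forward(self, x, adj):
        if self.norm_first:                # Pre-LN: normalize before aggregation
            return x + self.lin(self.aggregate(self.norm(x), adj))
        return self.norm(x + self.lin(self.aggregate(x, adj)))  # Post-LN

x = torch.randn(6, 16)
adj = (torch.rand(6, 6) > 0.5).float()
adj = ((adj + adj.T) > 0).float()          # symmetric toy graph
deep = nn.ModuleList(GNNLayer(16, norm_first=False) for _ in range(64))
h = x
for layer in deep:
    h = layer(h, adj)
print(h.std(dim=0).mean())                 # inspect feature spread after depth 64
```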
Submitted 10 December, 2025; v1 submitted 9 December, 2025;
originally announced December 2025.
-
Measuring Over-smoothing beyond Dirichlet energy
Authors:
Weiqi Guan,
Zihao Shi
Abstract:
While Dirichlet energy serves as a prevalent metric for quantifying over-smoothing, it is inherently restricted to capturing first-order feature derivatives. To address this limitation, we propose a generalized family of node similarity measures based on the energy of higher-order feature derivatives. Through a rigorous theoretical analysis of the relationships among these measures, we establish the decay rates of Dirichlet energy under both continuous heat diffusion and discrete aggregation operators. Furthermore, our analysis reveals an intrinsic connection between the over-smoothing decay rate and the spectral gap of the graph Laplacian. Finally, empirical results demonstrate that attention-based Graph Neural Networks (GNNs) suffer from over-smoothing when evaluated under these proposed metrics.
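For reference, the usual Dirichlet energy and one natural higher-order generalization can be computed as tr(X^T L^k X); reading "energy of higher-order feature derivatives" as powers of the graph Laplacian is my interpretation and may differ from the paper's exact definition. The snippet below shows both energies decaying under repeated mean aggregation, i.e. over-smoothing.

```python
# Dirichlet energy and a k-th order generalization on a toy graph.
import numpy as np

def laplacian(adj: np.ndarray) -> np.ndarray:
    return np.diag(adj.sum(axis=1)) - adj

def higher_order_energy(x: np.ndarray, adj: np.ndarray, order: int = 1) -> float:
    """E_k(X) = tr(X^T L^k X); k = 1 recovers the usual Dirichlet energy."""
    lk = np.linalg.matrix_power(laplacian(adj), order)
    return float(np.trace(x.T @ lk @ x))

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
x = rng.normal(size=(4, 8))
p = adj / adj.sum(axis=1, keepdims=True)   # row-normalized aggregation operator
for step in range(0, 31, 10):
    print(step, round(higher_order_energy(x, adj, 1), 4),
          round(higher_order_energy(x, adj, 2), 4))
    for _ in range(10):                    # repeated aggregation drives energies to zero
        x = p @ x
```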
Submitted 7 December, 2025;
originally announced December 2025.
-
SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model
Authors:
Kaidi Wang,
Yi He,
Wenhao Guan,
Weijie Wu,
Hongwu Ding,
Xiong Zhang,
Di Wu,
Meng Meng,
Jian Luan,
Lin Li,
Qingyang Hong
Abstract:
Video dubbing aims to generate high-fidelity speech that is precisely temporally aligned with the visual content. Existing methods still suffer from limitations in speech naturalness and audio-visual synchronization, and are limited to monolingual settings. To address these challenges, we propose SyncVoice, a vision-augmented video dubbing framework built upon a pretrained text-to-speech (TTS) model. By fine-tuning the TTS model on audio-visual data, we achieve strong audiovisual consistency. We propose a Dual Speaker Encoder to effectively mitigate inter-language interference in cross-lingual speech synthesis and explore the application of video dubbing in video translation scenarios. Experimental results show that SyncVoice achieves high-fidelity speech generation with strong synchronization performance, demonstrating its potential in video dubbing tasks.
Submitted 23 November, 2025;
originally announced December 2025.
-
HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval
Authors:
Zhiwei Chen,
Yupeng Hu,
Zixu Li,
Zhiheng Fu,
Haokun Wen,
Weili Guan
Abstract:
Composed Video Retrieval (CVR) is a challenging video retrieval task that utilizes multi-modal queries, consisting of a reference video and modification text, to retrieve the desired target video. The core of this task lies in understanding the multi-modal composed query and achieving accurate composed feature learning. Within multi-modal queries, the video modality typically carries richer semantic content compared to the textual modality. However, previous works have largely overlooked the disparity in information density between these two modalities. This limitation can lead to two critical issues: 1) modification subject referring ambiguity and 2) limited detailed semantic focus, both of which degrade the performance of CVR models. To address the aforementioned issues, we propose a novel CVR framework, namely the Hierarchical Uncertainty-aware Disambiguation network (HUD). HUD is the first framework that leverages the disparity in information density between video and text to enhance multi-modal query understanding. It comprises three key components: (a) Holistic Pronoun Disambiguation, (b) Atomistic Uncertainty Modeling, and (c) Holistic-to-Atomistic Alignment. By exploiting overlapping semantics through holistic cross-modal interaction and fine-grained semantic alignment via atomistic-level cross-modal interaction, HUD enables effective object disambiguation and enhances the focus on detailed semantics, thereby achieving precise composed feature learning. Moreover, our proposed HUD is also applicable to the Composed Image Retrieval (CIR) task and achieves state-of-the-art performance across three benchmark datasets for both CVR and CIR tasks. The codes are available on https://zivchen-ty.github.io/HUD.github.io/.
Submitted 14 December, 2025; v1 submitted 2 December, 2025;
originally announced December 2025.
-
Robotic chip-scale nanofabrication for superior consistency
Authors:
Felix M. Mayor,
Wenyan Guan,
Erik Szakiel,
Amir H. Safavi-Naeini,
Samuel Gyger
Abstract:
Unlike the rigid, high-volume automation found in industry, academic research requires process flexibility that has historically relied on variable manual operations. This hinders the fabrication of advanced, complex devices. We propose to address this gap by automating these low-volume, high-stakes tasks using a robotic arm to improve process control and consistency. As a proof of concept, we deploy this system for the resist development of Josephson junction devices. A statistical comparison of the process repeatability shows the robotic process achieves a resistance spread across chips close to 2%, a significant improvement over the ~7% spread observed from human operators, validating robotics as a solution to eliminate operator-dependent variability and a path towards industrial-level consistency in a research setting.
Submitted 24 November, 2025;
originally announced November 2025.
-
MOON2.0: Dynamic Modality-balanced Multimodal Representation Learning for E-commerce Product Understanding
Authors:
Zhanheng Nie,
Chenghan Fu,
Daoze Zhang,
Junxian Wu,
Wanxian Guan,
Pengjie Wang,
Jian Xu,
Bo Zheng
Abstract:
Recent Multimodal Large Language Models (MLLMs) have significantly advanced e-commerce product understanding. However, they still face three challenges: (i) the modality imbalance induced by modality mixed training; (ii) underutilization of the intrinsic alignment relationships among visual and textual information within a product; and (iii) limited handling of noise in e-commerce multimodal data. To address these, we propose MOON2.0, a dynamic modality-balanced MultimOdal representation learning framework for e-commerce prOduct uNderstanding. It comprises: (1) a Modality-driven Mixture-of-Experts (MoE) that adaptively processes input samples by their modality composition, enabling Multimodal Joint Learning to mitigate the modality imbalance; (2) a Dual-level Alignment method to better leverage semantic alignment properties inside individual products; and (3) an MLLM-based Image-text Co-augmentation strategy that integrates textual enrichment with visual expansion, coupled with Dynamic Sample Filtering to improve training data quality. We further release MBE2.0, a co-augmented Multimodal representation Benchmark for E-commerce representation learning and evaluation at https://huggingface.co/datasets/ZHNie/MBE2.0. Experiments show that MOON2.0 delivers state-of-the-art zero-shot performance on MBE2.0 and multiple public datasets. Furthermore, attention-based heatmap visualization provides qualitative evidence of improved multimodal alignment of MOON2.0.
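To make the modality-driven routing idea concrete, the following minimal Python sketch dispatches a product sample to an image-only, text-only, or joint expert depending on which modalities are present. The expert functions and names are hypothetical placeholders, not MOON2.0's actual MoE modules.

```python
# Minimal, hypothetical sketch of modality-driven expert routing.
# Expert functions and names are illustrative, not MOON2.0's actual modules.
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class ProductSample:
    image: Optional[np.ndarray]  # e.g. patch features, or None if missing
    text: Optional[str]


def image_expert(image: np.ndarray) -> np.ndarray:
    return image.mean(axis=0)  # toy pooling stands in for a real expert


def text_expert(text: str) -> np.ndarray:
    vec = np.zeros(8)
    for i, ch in enumerate(text.encode("utf-8")):
        vec[i % 8] += ch / 255.0  # toy hashing stands in for a text encoder
    return vec


def joint_expert(image: np.ndarray, text: str) -> np.ndarray:
    return 0.5 * image_expert(image) + 0.5 * text_expert(text)


def route(sample: ProductSample) -> np.ndarray:
    """Dispatch by modality composition: image-only, text-only, or joint."""
    if sample.image is not None and sample.text:
        return joint_expert(sample.image, sample.text)
    if sample.image is not None:
        return image_expert(sample.image)
    if sample.text:
        return text_expert(sample.text)
    raise ValueError("sample has no usable modality")


if __name__ == "__main__":
    sample = ProductSample(image=np.random.rand(4, 8), text="red cotton shirt")
    print(route(sample).shape)  # (8,)
```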
Submitted 23 March, 2026; v1 submitted 15 November, 2025;
originally announced November 2025.
-
MOON Embedding: Multimodal Representation Learning for E-commerce Search Advertising
Authors:
Chenghan Fu,
Daoze Zhang,
Yukang Lin,
Zhanheng Nie,
Xiang Zhang,
Jianyu Liu,
Yueran Liu,
Wanxian Guan,
Pengjie Wang,
Jian Xu,
Bo Zheng
Abstract:
We introduce MOON, our comprehensive set of sustainable iterative practices for multimodal representation learning for e-commerce applications. MOON has already been fully deployed across all stages of the Taobao search advertising system, including retrieval, relevance, ranking, and so on. The performance gains are particularly significant on the click-through rate (CTR) prediction task, where MOON achieves an overall +20.00% online CTR improvement. Over the past three years, this project has delivered the largest improvement on the CTR prediction task and undergone five full-scale iterations. Throughout the exploration and iteration of MOON, we have accumulated valuable insights and practical experience that we believe will benefit the research community. MOON contains a three-stage training paradigm of "Pretraining, Post-training, and Application", allowing effective integration of multimodal representations with downstream tasks. Notably, to bridge the misalignment between the objectives of multimodal representation learning and downstream training, we define the exchange rate to quantify how effectively improvements in an intermediate metric can translate into downstream gains. Through this analysis, we identify image-based search recall as a critical intermediate metric guiding the optimization of multimodal models. Over three years and five iterations, MOON has evolved along four critical dimensions: data processing, training strategy, model architecture, and downstream application. The lessons and insights gained through the iterative improvements will also be shared. As part of our exploration into scaling effects in the e-commerce field, we further conduct a systematic study of the scaling laws governing multimodal representation learning, examining multiple factors such as the number of training tokens, negative samples, and the length of user behavior sequences.
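The exchange rate mentioned above is described only as relating intermediate-metric improvements to downstream gains; the exact formulation is not given in the abstract. Below is a minimal sketch, assuming it can be estimated as the slope relating intermediate-metric deltas to downstream-metric deltas across model iterations; the numbers are invented.

```python
# Hypothetical sketch: estimating an "exchange rate" between an intermediate
# metric (e.g. image-based search recall) and a downstream metric (e.g. CTR AUC)
# by regressing downstream deltas on intermediate deltas across model versions.
# The actual definition used in the MOON work may differ.
import numpy as np

# (intermediate_metric, downstream_metric) for successive model iterations
runs = [
    (0.612, 0.7301),
    (0.655, 0.7342),
    (0.690, 0.7366),
    (0.721, 0.7398),
]

inter = np.array([r[0] for r in runs])
down = np.array([r[1] for r in runs])

d_inter = np.diff(inter)
d_down = np.diff(down)

# Least-squares slope through the origin: downstream gain per unit of
# intermediate gain.
exchange_rate = float(d_inter @ d_down / (d_inter @ d_inter))
print(f"estimated exchange rate: {exchange_rate:.3f}")
```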
Submitted 18 November, 2025; v1 submitted 14 November, 2025;
originally announced November 2025.
-
Lived Experience in Dialogue: Co-designing Personalization in Large Language Models to Support Youth Mental Well-being
Authors:
Kathleen W. Guan,
Sarthak Giri,
Mohammed Amara,
Bernard J. Jansen,
Enrico Liscio,
Milena Esherick,
Mohammed Al Owayyed,
Ausrine Ratkute,
Gayane Sedrakyan,
Mark de Reuver,
Joao Fernando Ferreira Goncalves,
Caroline A. Figueroa
Abstract:
Youth increasingly turn to large language models (LLMs) for mental well-being support, yet current personalization in LLMs can overlook the heterogeneous lived experiences shaping their needs. We conducted a participatory study with youth, parents, and youth care workers (N=38), using co-created youth personas as scaffolds, to elicit community perspectives on how LLMs can facilitate more meaningful personalization to support youth mental well-being. Analysis identified three themes: person-centered contextualization responsive to momentary needs, explicit boundaries around scope and offline referral, and dialogic scaffolding for reflection and autonomy. We mapped these themes to persuasive design features for task suggestions, social facilitation, and system trustworthiness, and created corresponding dialogue extracts to guide LLM fine-tuning. Our findings demonstrate how lived experience can be operationalized to inform design features in LLMs, which can enhance the alignment of LLM-based interventions with the realities of youth and their communities, contributing to more effectively personalized digital well-being tools.
Submitted 7 November, 2025;
originally announced November 2025.
-
Falcon: A Comprehensive Chinese Text-to-SQL Benchmark for Enterprise-Grade Evaluation
Authors:
Wenzhen Luo,
Wei Guan,
Yifan Yao,
Yimin Pan,
Feng Wang,
Zhipeng Yu,
Zhe Wen,
Liang Chen,
Yihong Zhuang
Abstract:
We introduce Falcon, a cross-domain Chinese text-to-SQL benchmark grounded in an enterprise-compatible dialect (MaxCompute/Hive). It contains 600 Chinese questions over 28 databases; 77% require multi-table reasoning and over half touch more than four tables. Each example is annotated along SQL-computation features and Chinese semantics. For evaluation, we release a robust execution comparator and an automated evaluation pipeline, under which all current state-of-the-art large-scale models (including Deepseek) achieve accuracies of at most 50%. Major errors originate from two sources: (1) schema linking in large enterprise landscapes - hundreds of tables, denormalized fields, ambiguous column names, implicit foreign-key relations and domain-specific synonyms that make correct join/column selection difficult; and (2) mapping concise, colloquial Chinese into the exact operators and predicates required for analytics - e.g., choosing the correct aggregation and group-by keys, expressing time windows and granularities, applying unit conversions, handling NULLs and data-quality rules, and formulating nested or windowed subqueries. Falcon therefore targets Chinese-specific semantics and enterprise dialects (abbreviations, business jargon, fuzzy entity references) and provides a reproducible middle ground before full production deployment by using realistic enterprise schemas, query templates, an execution comparator, and an automated evaluation pipeline for end-to-end validation.
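The released execution comparator is not specified in the abstract; the sketch below shows one plausible way to compare SQL execution results in an order-insensitive, float-tolerant manner. The function names and tolerance scheme are assumptions, not Falcon's implementation.

```python
# Illustrative execution comparator for SQL result sets: treats results as
# multisets of rows, ignores row order, and rounds floats before comparison.
# Falcon's released comparator is more elaborate; this only sketches the idea.
from collections import Counter
import math


def _normalize_cell(cell, ndigits=4):
    if isinstance(cell, float):
        if math.isnan(cell):
            return "NaN"
        return round(cell, ndigits)
    return cell


def _normalize_rows(rows):
    return Counter(tuple(_normalize_cell(c) for c in row) for row in rows)


def results_match(pred_rows, gold_rows, order_sensitive=False):
    """Return True if predicted and gold query results are equivalent."""
    if order_sensitive:  # e.g. queries with ORDER BY
        return [tuple(_normalize_cell(c) for c in r) for r in pred_rows] == \
               [tuple(_normalize_cell(c) for c in r) for r in gold_rows]
    return _normalize_rows(pred_rows) == _normalize_rows(gold_rows)


if __name__ == "__main__":
    gold = [("shoes", 3, 19.990001), ("hats", 1, 5.0)]
    pred = [("hats", 1, 5.0), ("shoes", 3, 19.99)]
    print(results_match(pred, gold))  # True: order-insensitive, float-tolerant
```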
Submitted 22 October, 2025;
originally announced October 2025.
-
DeepfakeBench-MM: A Comprehensive Benchmark for Multimodal Deepfake Detection
Authors:
Kangran Zhao,
Yupeng Chen,
Xiaoyu Zhang,
Yize Chen,
Weinan Guan,
Baicheng Chen,
Chengzhe Sun,
Soumyya Kanti Datta,
Qingshan Liu,
Siwei Lyu,
Baoyuan Wu
Abstract:
The misuse of advanced generative AI models has resulted in the widespread proliferation of falsified data, particularly forged human-centric audiovisual content, which poses substantial societal risks (e.g., financial fraud and social instability). In response to this growing threat, several works have preliminarily explored countermeasures. However, the lack of sufficient and diverse training data, along with the absence of a standardized benchmark, hinder deeper exploration. To address this challenge, we first build Mega-MMDF, a large-scale, diverse, and high-quality dataset for multimodal deepfake detection. Specifically, we employ 21 forgery pipelines through the combination of 10 audio forgery methods, 12 visual forgery methods, and 6 audio-driven face reenactment methods. Mega-MMDF currently contains 0.1 million real samples and 1.1 million forged samples, making it one of the largest and most diverse multimodal deepfake datasets, with plans for continuous expansion. Building on it, we present DeepfakeBench-MM, the first unified benchmark for multimodal deepfake detection. It establishes standardized protocols across the entire detection pipeline and serves as a versatile platform for evaluating existing methods as well as exploring novel approaches. DeepfakeBench-MM currently supports 5 datasets and 11 multimodal deepfake detectors. Furthermore, our comprehensive evaluations and in-depth analyses uncover several key findings from multiple perspectives (e.g., augmentation, stacked forgery). We believe that DeepfakeBench-MM, together with our large-scale Mega-MMDF, will serve as foundational infrastructures for advancing multimodal deepfake detection.
Submitted 26 October, 2025;
originally announced October 2025.
-
OPTAGENT: Optimizing Multi-Agent LLM Interactions Through Verbal Reinforcement Learning for Enhanced Reasoning
Authors:
Zhenyu Bi,
Meng Lu,
Yang Li,
Swastik Roy,
Weijie Guan,
Morteza Ziyadi,
Xuan Wang
Abstract:
Large Language Models (LLMs) have shown remarkable reasoning capabilities in mathematical and scientific tasks. To enhance complex reasoning, multi-agent systems have been proposed to harness the collective intelligence of LLM agents. However, existing collaboration structures are either predefined or rely on majority voting or round-table debates, which can suppress correct but less dominant agent contributions. Recent approaches model multi-agent systems as graph networks but optimize purely for agent performance, neglecting the quality of interactions. We hypothesize that effective agent communication is crucial for multi-agent reasoning and that debating quality plays a significant role. To address this, we propose OPTAGENT, a multi-agent verbal reinforcement learning algorithm that dynamically constructs and refines multi-agent collaboration structures. Our method defines action spaces and a feedback mechanism that evaluates communication robustness and coherence throughout the debate. The final decision is achieved through a majority vote over all the agents. We assess OPTAGENT on various reasoning tasks, including mathematical reasoning, creative writing, scientific reasoning, and numerical sorting. Results demonstrate that our approach significantly outperforms single-agent prompting methods and state-of-the-art multi-agent frameworks on diverse tasks.
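The final majority-vote step over agent answers can be shown with a tiny sketch; the agent answers below are mocked, and the vote is only the last stage described above, not the verbal-RL optimization of the collaboration structure.

```python
# Toy sketch of the final decision step: a majority vote over the answers
# produced by multiple LLM agents. Agent outputs are mocked; this does not
# reproduce OPTAGENT's optimization of the collaboration graph.
from collections import Counter


def majority_vote(agent_answers):
    """Return the most common answer and its support among the agents."""
    counts = Counter(agent_answers)
    answer, support = counts.most_common(1)[0]
    return answer, support / len(agent_answers)


if __name__ == "__main__":
    answers = ["42", "42", "41", "42", "36"]  # mocked final agent answers
    answer, agreement = majority_vote(answers)
    print(answer, f"agreement={agreement:.0%}")  # 42 agreement=60%
```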
Submitted 20 October, 2025;
originally announced October 2025.
-
Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey
Authors:
Weifan Guan,
Qinghao Hu,
Aosheng Li,
Jian Cheng
Abstract:
Vision-Language-Action (VLA) models extend vision-language models to embodied control by mapping natural-language instructions and visual observations to robot actions. Despite their capabilities, VLA systems face significant challenges due to their massive computational and memory demands, which conflict with the constraints of edge platforms such as on-board mobile manipulators that require real-time performance. Addressing this tension has become a central focus of recent research. In light of the growing efforts toward more efficient and scalable VLA systems, this survey provides a systematic review of approaches for improving VLA efficiency, with an emphasis on reducing latency, memory footprint, and training and inference costs. We categorize existing solutions into four dimensions: model architecture, perception features, action generation, and training/inference strategies, summarizing representative techniques within each category. Finally, we discuss future trends and open challenges, highlighting directions for advancing efficient embodied intelligence.
Submitted 23 October, 2025; v1 submitted 19 October, 2025;
originally announced October 2025.
-
From Mannequin to Human: A Pose-Aware and Identity-Preserving Video Generation Framework for Lifelike Clothing Display
Authors:
Xiangyu Mu,
Dongliang Zhou,
Jie Hou,
Haijun Zhang,
Weili Guan
Abstract:
Mannequin-based clothing displays offer a cost-effective alternative to real-model showcases for online fashion presentation, but lack realism and expressive detail. To overcome this limitation, we introduce a new task called mannequin-to-human (M2H) video generation, which aims to synthesize identity-controllable, photorealistic human videos from footage of mannequins. We propose M2HVideo, a pose-aware and identity-preserving video generation framework that addresses two key challenges: the misalignment between head and body motion, and identity drift caused by temporal modeling. In particular, M2HVideo incorporates a dynamic pose-aware head encoder that fuses facial semantics with body pose to produce consistent identity embeddings across frames. To address the loss of fine facial details due to latent space compression, we introduce a mirror loss applied in pixel space through a denoising diffusion implicit model (DDIM)-based one-step denoising. Additionally, we design a distribution-aware adapter that aligns statistical distributions of identity and clothing features to enhance temporal coherence. Extensive experiments on the UBC fashion dataset, our self-constructed ASOS dataset, and the newly collected MannequinVideos dataset captured on-site demonstrate that M2HVideo achieves superior performance in terms of clothing consistency, identity preservation, and video fidelity in comparison to state-of-the-art methods.
Submitted 19 October, 2025;
originally announced October 2025.
-
UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models
Authors:
Wenhao Guan,
Zhikang Niu,
Ziyue Jiang,
Kaidi Wang,
Peijie Chen,
Qingyang Hong,
Lin Li,
Xie Chen
Abstract:
Large language models (LLMs) have demonstrated promising performance in both automatic speech recognition (ASR) and text-to-speech (TTS) systems, gradually becoming the mainstream approach. However, most current approaches address these tasks separately rather than through a unified framework. This work aims to integrate these two tasks into one unified model. Although discrete speech tokenization enables joint modeling, its inherent information loss limits performance in both recognition and generation. In this work, we present UniVoice, a unified LLM framework through continuous representations that seamlessly integrates speech recognition and synthesis within a single model. Our approach combines the strengths of autoregressive modeling for speech recognition with flow matching for high-quality generation. To mitigate the inherent divergence between autoregressive and flow-matching models, we further design a dual attention mechanism, which switches between a causal mask for recognition and a bidirectional attention mask for synthesis. Furthermore, the proposed text-prefix-conditioned speech infilling method enables high-fidelity zero-shot voice cloning. Experimental results demonstrate that our method can achieve or exceed current single-task modeling methods in both ASR and zero-shot TTS tasks. This work explores new possibilities for end-to-end speech understanding and generation. Code is available at https://github.com/gwh22/UniVoice.
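The dual attention mechanism switches between a causal mask for recognition and a bidirectional mask for synthesis; the sketch below only builds those two masks, under the assumed convention that True means a position may be attended to.

```python
# Illustrative construction of the two attention masks mentioned above:
# a causal mask (ASR, autoregressive) vs. a bidirectional mask (TTS, flow
# matching). Convention assumed here: True = attention allowed.
import numpy as np


def causal_mask(seq_len: int) -> np.ndarray:
    """Lower-triangular mask: position i may attend to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))


def bidirectional_mask(seq_len: int) -> np.ndarray:
    """Full mask: every position may attend to every other position."""
    return np.ones((seq_len, seq_len), dtype=bool)


def dual_mask(seq_len: int, mode: str) -> np.ndarray:
    if mode == "asr":
        return causal_mask(seq_len)
    if mode == "tts":
        return bidirectional_mask(seq_len)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    print(dual_mask(4, "asr").astype(int))
    print(dual_mask(4, "tts").astype(int))
```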
Submitted 19 November, 2025; v1 submitted 6 October, 2025;
originally announced October 2025.
-
iDDS: Intelligent Distributed Dispatch and Scheduling for Workflow Orchestration
Authors:
Wen Guan,
Tadashi Maeno,
Aleksandr Alekseev,
Fernando Harald Barreiro Megino,
Kaushik De,
Edward Karavakis,
Alexei Klimentov,
Tatiana Korchuganova,
FaHui Lin,
Paul Nilsson,
Torre Wenaus,
Zhaoyu Yang,
Xin Zhao
Abstract:
The intelligent Distributed Dispatch and Scheduling (iDDS) service is a versatile workflow orchestration system designed for large-scale, distributed scientific computing. iDDS extends traditional workload and data management by integrating data-aware execution, conditional logic, and programmable workflows, enabling automation of complex and dynamic processing pipelines. Originally developed for the ATLAS experiment at the Large Hadron Collider, iDDS has evolved into an experiment-agnostic platform that supports both template-driven workflows and a Function-as-a-Task model for Python-based orchestration.
This paper presents the architecture and core components of iDDS, highlighting its scalability, modular message-driven design, and integration with systems such as PanDA and Rucio. We demonstrate its versatility through real-world use cases: fine-grained tape resource optimization for ATLAS, orchestration of large Directed Acyclic Graph (DAG) workflows for the Rubin Observatory, distributed hyperparameter optimization for machine learning applications, active learning for physics analyses, and AI-assisted detector design at the Electron-Ion Collider.
By unifying workload scheduling, data movement, and adaptive decision-making, iDDS reduces operational overhead and enables reproducible, high-throughput workflows across heterogeneous infrastructures. We conclude with current challenges and future directions, including interactive, cloud-native, and serverless workflow support.
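To illustrate what a DAG-shaped workflow boils down to, here is a generic, self-contained topological-order executor; it does not use the iDDS or PanDA APIs, and the task names are invented.

```python
# Generic, self-contained sketch of DAG-style workflow execution in
# dependency order. This illustrates the concept only; real iDDS workflows
# are defined through its own templates / Function-as-a-Task interface.
from graphlib import TopologicalSorter


def run_workflow(tasks, deps):
    """tasks: name -> callable; deps: name -> set of prerequisite names."""
    results = {}
    for name in TopologicalSorter(deps).static_order():
        inputs = {d: results[d] for d in deps.get(name, ())}
        results[name] = tasks[name](inputs)
        print(f"ran {name} -> {results[name]}")
    return results


if __name__ == "__main__":
    tasks = {
        "stage_data": lambda _: "dataset-v1",
        "train": lambda inp: f"model({inp['stage_data']})",
        "evaluate": lambda inp: f"metrics({inp['train']})",
    }
    deps = {"stage_data": set(), "train": {"stage_data"}, "evaluate": {"train"}}
    run_workflow(tasks, deps)
```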
Submitted 19 December, 2025; v1 submitted 3 October, 2025;
originally announced October 2025.
-
Beyond the Vision Encoder: Identifying and Mitigating Spatial Bias in Large Vision-Language Models
Authors:
Yingjie Zhu,
Xuefeng Bai,
Kehai Chen,
Yang Xiang,
Youcheng Pan,
Yongshuai Hou,
Weili Guan,
Jun Yu,
Min Zhang
Abstract:
Large Vision-Language Models (LVLMs) have achieved remarkable success across a wide range of multimodal tasks, yet their robustness to spatial variations remains insufficiently understood. In this work, we conduct a systematic study of the spatial bias of LVLMs, examining how models respond when identical key visual information is placed at different locations within an image. Through controlled probing experiments, we observe that current LVLMs often produce inconsistent outputs under such spatial shifts, revealing a clear spatial bias in their semantic understanding. Further analysis indicates that this bias does not stem from the vision encoder, but rather from a mismatch in attention mechanisms between the vision encoder and the large language model, which disrupts the global information flow. Motivated by this insight, we propose Adaptive Global Context Injection (AGCI), a lightweight mechanism that dynamically injects shared global visual context into each image token. AGCI works without architectural modifications, mitigating spatial bias by enhancing the semantic accessibility of image tokens while preserving the model's intrinsic capabilities. Extensive experiments demonstrate that AGCI not only enhances the spatial robustness of LVLMs, but also achieves strong performance on various downstream tasks and hallucination benchmarks.
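A toy sketch of the central idea of injecting a shared global visual context into every image token follows; the mean-pooled summary and the fixed gate used here are assumptions, not the AGCI formulation.

```python
# Toy sketch of global context injection: every image token receives a gated
# copy of a shared global summary of all image tokens. The gate value and the
# mean-pooling summary are placeholders, not the AGCI mechanism itself.
import numpy as np


def inject_global_context(image_tokens: np.ndarray, gate: float = 0.2) -> np.ndarray:
    """image_tokens: (num_tokens, dim). Returns tokens enriched with a
    shared global context vector (here: the mean over all tokens)."""
    global_ctx = image_tokens.mean(axis=0, keepdims=True)  # (1, dim)
    return image_tokens + gate * global_ctx                # broadcast add


if __name__ == "__main__":
    tokens = np.random.rand(196, 64)  # e.g. 14x14 patch tokens
    enriched = inject_global_context(tokens)
    print(enriched.shape)  # (196, 64)
```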
Submitted 3 February, 2026; v1 submitted 26 September, 2025;
originally announced September 2025.
-
A Mutual Learning Method for Salient Object Detection with intertwined Multi-Supervision--Revised
Authors:
Runmin Wu,
Mengyang Feng,
Wenlong Guan,
Dong Wang,
Huchuan Lu,
Errui Ding
Abstract:
Though deep learning techniques have made great progress in salient object detection recently, the predicted saliency maps still suffer from incomplete predictions due to the internal complexity of objects and inaccurate boundaries caused by strides in convolution and pooling operations. To alleviate these issues, we propose to train saliency detection networks by exploiting the supervision from not only salient object detection, but also foreground contour detection and edge detection. First, we leverage salient object detection and foreground contour detection tasks in an intertwined manner to generate saliency maps with uniform highlight. Second, the foreground contour and edge detection tasks guide each other simultaneously, thereby leading to precise foreground contour prediction and reducing the local noises for edge prediction. In addition, we develop a novel mutual learning module (MLM) which serves as the building block of our method. Each MLM consists of multiple network branches trained in a mutual learning manner, which improves the performance by a large margin. Extensive experiments on seven challenging datasets demonstrate that the proposed method has delivered state-of-the-art results in both salient object detection and edge detection.
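The mutual learning module trains multiple branches that learn from one another; the sketch below illustrates the generic idea on a toy classification example, where each branch is regularized toward the other's predictions via a symmetric KL term. The classification setting and the loss weighting are assumptions, not the paper's detection losses.

```python
# Toy illustration of mutual learning between two network branches:
# each branch is pulled toward the other's predicted distribution via a
# symmetric KL term on top of its own supervised loss. Purely illustrative.
import numpy as np


def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def kl(p, q, eps=1e-8):
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1).mean()


def mutual_learning_losses(logits_a, logits_b, targets, lam=0.5):
    """Cross-entropy per branch plus a symmetric KL mimicry term."""
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    ce_a = -np.log(p_a[np.arange(len(targets)), targets] + 1e-8).mean()
    ce_b = -np.log(p_b[np.arange(len(targets)), targets] + 1e-8).mean()
    mimic = 0.5 * (kl(p_a, p_b) + kl(p_b, p_a))
    return ce_a + lam * mimic, ce_b + lam * mimic


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits_a, logits_b = rng.normal(size=(4, 2)), rng.normal(size=(4, 2))
    targets = np.array([0, 1, 1, 0])
    print(mutual_learning_losses(logits_a, logits_b, targets))
```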
Submitted 21 September, 2025;
originally announced September 2025.
-
Phoenix-VAD: Streaming Semantic Endpoint Detection for Full-Duplex Speech Interaction
Authors:
Weijie Wu,
Wenhao Guan,
Kaidi Wang,
Peijie Chen,
Zhuanling Zha,
Junbo Li,
Jun Fang,
Lin Li,
Qingyang Hong
Abstract:
Spoken dialogue models have significantly advanced intelligent human-computer interaction, yet they lack a plug-and-play full-duplex prediction module for semantic endpoint detection, hindering seamless audio interactions. In this paper, we introduce Phoenix-VAD, an LLM-based model that enables streaming semantic endpoint detection. Specifically, Phoenix-VAD leverages the semantic comprehension capability of the LLM and a sliding window training strategy to achieve reliable semantic endpoint detection while supporting streaming inference. Experiments on both semantically complete and incomplete speech scenarios indicate that Phoenix-VAD achieves excellent and competitive performance. Furthermore, this design enables the full-duplex prediction module to be optimized independently of the dialogue model, providing more reliable and flexible support for next-generation human-computer interaction.
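A schematic sketch of sliding-window streaming endpoint detection follows: chunks arrive one at a time and a completeness judgment is made on the most recent window. The punctuation heuristic is only a stand-in for Phoenix-VAD's LLM-based decision.

```python
# Schematic sketch of streaming semantic endpoint detection with a sliding
# window: text (or ASR hypotheses) arrives in chunks, and a classifier is
# asked after each chunk whether the utterance is semantically complete.
# `is_complete` is a trivial stand-in for the LLM-based judgment.
from collections import deque


def is_complete(window_text: str) -> bool:
    # Placeholder heuristic; Phoenix-VAD uses an LLM for this decision.
    return window_text.rstrip().endswith((".", "?", "!"))


def detect_endpoint(chunks, window_size=4):
    window = deque(maxlen=window_size)  # keep only the most recent chunks
    for i, chunk in enumerate(chunks):
        window.append(chunk)
        if is_complete(" ".join(window)):
            return i  # index of the chunk at which the endpoint fires
    return None


if __name__ == "__main__":
    stream = ["could you", "book me a", "flight to", "Shanghai tomorrow?", "thanks"]
    print(detect_endpoint(stream))  # 3
```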
Submitted 4 November, 2025; v1 submitted 24 September, 2025;
originally announced September 2025.
-
XMUspeech Systems for the ASVspoof 5 Challenge
Authors:
Wangjie Li,
Xingjia Xie,
Yishuang Li,
Wenhao Guan,
Kaidi Wang,
Pengyu Ren,
Lin Li,
Qingyang Hong
Abstract:
In this paper, we present the XMUspeech systems submitted to the speech deepfake detection track of the ASVspoof 5 Challenge. Compared to previous challenges, the audio duration in the ASVspoof 5 database has significantly increased, and we observed that merely adjusting the input audio length can substantially improve system performance. To capture artifacts at multiple levels, we explored the performance of AASIST, HM-Conformer, Hubert, and Wav2vec2 with various input features and loss functions. Specifically, in order to obtain artifact-related information, we trained self-supervised models on the dataset containing spoofing utterances as feature extractors, and we applied an adaptive multi-scale feature fusion (AMFF) method to integrate features from multiple Transformer layers with the hand-crafted feature to enhance the detection capability. In addition, we conducted extensive experiments on one-class loss functions and provided optimized configurations to better align with the anti-spoofing task. Our fusion system achieved a minDCF of 0.4783 and an EER of 20.45% in the closed condition, and a minDCF of 0.2245 and an EER of 9.36% in the open condition.
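One plausible reading of multi-layer feature fusion is a softmax-weighted sum over per-layer embeddings concatenated with a hand-crafted feature; the sketch below implements that reading. The weighting scheme and dimensions are assumptions, not the paper's AMFF module.

```python
# Illustrative multi-layer feature fusion: features from several Transformer
# layers are combined with softmax-normalized scalar weights and concatenated
# with a hand-crafted feature. The weighting scheme is an assumption; the
# AMFF module in the paper may differ.
import numpy as np


def fuse_layers(layer_feats, layer_scores, handcrafted_feat):
    """layer_feats: (num_layers, dim); layer_scores: (num_layers,);
    handcrafted_feat: (hc_dim,). Returns a (dim + hc_dim,) vector."""
    w = np.exp(layer_scores - layer_scores.max())
    w = w / w.sum()                              # softmax over layers
    fused = (w[:, None] * layer_feats).sum(axis=0)
    return np.concatenate([fused, handcrafted_feat])


if __name__ == "__main__":
    feats = np.random.rand(12, 256)   # e.g. 12 wav2vec 2.0 layer embeddings
    scores = np.random.rand(12)       # learnable in practice, random here
    lfcc = np.random.rand(60)         # stand-in for a hand-crafted feature
    print(fuse_layers(feats, scores, lfcc).shape)  # (316,)
```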
Submitted 5 September, 2025;
originally announced September 2025.
-
Dual Knowledge-Enhanced Two-Stage Reasoner for Multimodal Dialog Systems
Authors:
Xiaolin Chen,
Xuemeng Song,
Haokun Wen,
Weili Guan,
Xiangyu Zhao,
Liqiang Nie
Abstract:
Textual response generation is pivotal for multimodal task-oriented dialog systems, which aim to generate proper textual responses based on the multimodal context. While existing efforts have demonstrated remarkable progress, there still exist the following limitations: 1) neglect of unstructured review knowledge and 2) underutilization of large language models (LLMs). Inspired by this, we aim to fully utilize dual knowledge (i.e., structured attribute and unstructured review knowledge) with LLMs to promote textual response generation in multimodal task-oriented dialog systems. However, this task is non-trivial due to two key challenges: 1) dynamic knowledge type selection and 2) intention-response decoupling. To address these challenges, we propose a novel dual knowledge-enhanced two-stage reasoner that adapts LLMs for multimodal dialog systems (named DK2R). To be specific, DK2R first extracts both structured attribute and unstructured review knowledge from an external knowledge base given the dialog context. Thereafter, DK2R uses an LLM to evaluate each knowledge type's utility by analyzing LLM-generated provisional probe responses. Moreover, DK2R separately summarizes the intention-oriented key clues via dedicated reasoning, which are further used as auxiliary signals to enhance LLM-based textual response generation. Extensive experiments conducted on a public dataset verify the superiority of DK2R. We have released the code and parameters.
Submitted 9 September, 2025;
originally announced September 2025.
-
FaMA: LLM-Empowered Agentic Assistant for Consumer-to-Consumer Marketplace
Authors:
Yineng Yan,
Xidong Wang,
Jin Seng Cheng,
Ran Hu,
Wentao Guan,
Nahid Farahmand,
Hengte Lin,
Yue Li
Abstract:
The emergence of agentic AI, powered by Large Language Models (LLMs), marks a paradigm shift from reactive generative systems to proactive, goal-oriented autonomous agents capable of sophisticated planning, memory, and tool use. This evolution presents a novel opportunity to address long-standing challenges in complex digital environments. Core tasks on Consumer-to-Consumer (C2C) e-commerce platforms often require users to navigate complex Graphical User Interfaces (GUIs), making the experience time-consuming for both buyers and sellers. This paper introduces a novel approach to simplify these interactions through an LLM-powered agentic assistant. This agent functions as a new, conversational entry point to the marketplace, shifting the primary interaction model from a complex GUI to an intuitive AI agent. By interpreting natural language commands, the agent automates key high-friction workflows. For sellers, this includes simplified updating and renewal of listings, and the ability to send bulk messages. For buyers, the agent facilitates a more efficient product discovery process through conversational search. We present the architecture for Facebook Marketplace Assistant (FaMA), arguing that this agentic, conversational paradigm provides a lightweight and more accessible alternative to traditional app interfaces, allowing users to manage their marketplace activities with greater efficiency. Experiments show FaMA achieves a 98% task success rate on solving complex tasks on the marketplace and enables up to a 2x speedup on interaction time.
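A generic sketch of intent-to-tool dispatch in an agentic marketplace assistant follows; the intents, tool names, and keyword routing are invented for illustration and are not FaMA's interfaces, which rely on an LLM for interpreting the command.

```python
# Generic sketch of intent-to-tool dispatch in an agentic assistant.
# Intent detection is a trivial keyword heuristic standing in for an LLM;
# tool names and signatures are invented for illustration only.
def renew_listing(listing_id: str) -> str:
    return f"listing {listing_id} renewed"


def bulk_message(recipients: list[str], text: str) -> str:
    return f"sent '{text}' to {len(recipients)} buyers"


def search_products(query: str) -> str:
    return f"top results for '{query}'"


TOOLS = {
    "renew": lambda cmd: renew_listing(listing_id="L123"),
    "message": lambda cmd: bulk_message(["a", "b", "c"], text=cmd),
    "search": lambda cmd: search_products(query=cmd),
}


def dispatch(command: str) -> str:
    lowered = command.lower()
    for intent, tool in TOOLS.items():
        if intent in lowered:
            return tool(command)
    return "Sorry, I did not understand the request."


if __name__ == "__main__":
    print(dispatch("please renew my bike listing"))
    print(dispatch("search for a used standing desk"))
```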
Submitted 4 September, 2025;
originally announced September 2025.
-
On the Analysis of Random Linear Streaming Codes in Stochastic Channels
Authors:
Kai Huang,
Wenjie Guan,
Xiaoran Wang,
Jinbei Zhang,
Kechao Cai
Abstract:
Random Linear Streaming Codes (RLSCs) can dramatically reduce the queuing delay of block codes in real-time services. In this paper, we explore the fundamental limits of large-field-size RLSCs in stochastic symbol erasure channels (SECs). Non-systematic RLSCs (NRLSCs) in the i.i.d. SEC were analyzed in [Pinwen Su et al. 2022]. In this work, we first derive a closed-form expression for the exact error probability of NRLSCs in Gilbert-Elliott symbol erasure channels (G-ESEC). Unlike the i.i.d. SEC, the erasure probability of the G-ESEC depends on the channel state, so transitions between states must be accounted for. To handle the stochastic state transitions, we introduce two novel techniques. (i) To capture the impact of state switching on the probability terms, we identify and exploit the recursive structure of the state transition traces. (ii) To obtain the expected number of error timeslots, we derive the stationary initial distribution of the states and formulate iterative equations that characterize the expectation terms. We then analyze Systematic RLSCs (SRLSCs) in a special SEC, the packet erasure channel (PEC). In this setting, SRLSCs can salvage some source symbols that would have exceeded the decoding delay under NRLSCs, and thus can significantly reduce the error probability. Here our contributions are twofold. (i) Through a case study, we uncover the counter-intuitive phenomenon that SRLSCs can cause unexpected error events compared to NRLSCs under some erasure patterns. We then fully characterize the error event of SRLSCs for any erasure pattern. (ii) For the i.i.d. PEC, we derive an analytical expression for the exact error probability of SRLSCs when the memory length approaches infinity and the coding rate equals 1/2. Simulations are conducted to verify the accuracy of our analysis and to compare the performance of NRLSCs, SRLSCs, and existing streaming codes.
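For readers unfamiliar with the channel model: a Gilbert-Elliott erasure channel is a two-state Markov chain with a different erasure probability in each state. The small simulation below illustrates the model with arbitrary parameters; it does not reproduce the paper's closed-form analysis.

```python
# Small simulation of a Gilbert-Elliott symbol erasure channel: a two-state
# ("good"/"bad") Markov chain, each state with its own erasure probability.
# Parameters are arbitrary; this only illustrates the channel model, not the
# paper's closed-form error-probability analysis.
import random


def simulate_gilbert_elliott(n, p_gb=0.05, p_bg=0.4, eps_good=0.01, eps_bad=0.5,
                             seed=0):
    """Return a list of booleans (True = symbol erased) of length n.
    p_gb: P(good -> bad); p_bg: P(bad -> good)."""
    rng = random.Random(seed)
    state = "good"
    erasures = []
    for _ in range(n):
        eps = eps_good if state == "good" else eps_bad
        erasures.append(rng.random() < eps)
        # state transition for the next timeslot
        if state == "good" and rng.random() < p_gb:
            state = "bad"
        elif state == "bad" and rng.random() < p_bg:
            state = "good"
    return erasures


if __name__ == "__main__":
    trace = simulate_gilbert_elliott(10_000)
    print(f"empirical erasure rate: {sum(trace) / len(trace):.3f}")
```

For these parameters the stationary probability of the bad state is p_gb / (p_gb + p_bg) = 0.05 / 0.45 ≈ 0.11, so the long-run erasure rate is roughly 0.89 · 0.01 + 0.11 · 0.5 ≈ 0.064, which the simulated trace should approach.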
Submitted 1 September, 2025;
originally announced September 2025.
-
Boosting Pathology Foundation Models via Few-shot Prompt-tuning for Rare Cancer Subtyping
Authors:
Dexuan He,
Xiao Zhou,
Wenbin Guan,
Liyuan Zhang,
Xiaoman Zhang,
Sinuo Xu,
Ge Wang,
Lifeng Wang,
Xiaojun Yuan,
Xin Sun,
Yanfeng Wang,
Kun Sun,
Ya Zhang,
Weidi Xie
Abstract:
Rare cancers comprise 20-25% of all malignancies but face major diagnostic challenges due to limited expert availability, especially in pediatric oncology, where they represent over 70% of cases. While pathology vision-language (VL) foundation models show promising zero-shot capabilities for common cancer subtyping, their clinical performance for rare cancers remains limited. Existing multi-instance learning (MIL) methods rely only on visual features, overlooking cross-modal knowledge and compromising the interpretability that is critical for rare cancer diagnosis. To address this limitation, we propose PathPT, a novel framework that fully exploits the potential of vision-language pathology foundation models through spatially-aware visual aggregation and task-specific prompt tuning. Unlike conventional MIL, PathPT converts WSI-level supervision into fine-grained tile-level guidance by leveraging the zero-shot capabilities of VL models, thereby preserving localization on cancerous regions and enabling cross-modal reasoning through prompts aligned with histopathological semantics. We benchmark PathPT on eight rare cancer datasets (four adult and four pediatric) spanning 56 subtypes and 2,910 WSIs, as well as three common cancer datasets, evaluating four state-of-the-art VL models and four MIL frameworks under three few-shot settings. Results show that PathPT consistently delivers superior performance, achieving substantial gains in subtyping accuracy and cancerous region grounding ability. This work advances AI-assisted diagnosis for rare cancers, offering a scalable solution for improving subtyping accuracy in settings with limited access to specialized expertise.
Submitted 21 August, 2025;
originally announced August 2025.
-
Creative4U: MLLMs-based Advertising Creative Image Selector with Comparative Reasoning
Authors:
Yukang Lin,
Xiang Zhang,
Shichang Jia,
Bowen Wan,
Chenghan Fu,
Xudong Ren,
Yueran Liu,
Wanxian Guan,
Pengji Wang,
Jian Xu,
Bo Zheng,
Baolin Liu
Abstract:
Creative images in advertising are the heart and soul of an e-commerce platform. An eye-catching creative image can enhance the shopping experience for users, boosting income for advertisers and advertising revenue for platforms. With the advent of AIGC technology, advertisers can produce large quantities of creative images at minimal cost. However, they struggle to assess creative quality in order to select among them. Existing methods primarily focus on creative ranking, which fails to address the need for explainable creative selection.
In this work, we propose the first paradigm for explainable creative assessment and selection. Powered by multimodal large language models (MLLMs), our approach integrates the assessment and selection of creative images into a natural language generation task. To facilitate this research, we construct CreativePair, the first comparative reasoning-induced creative dataset featuring 8k annotated image pairs, with each sample including a label indicating which image is superior. Additionally, we introduce Creative4U (pronounced Creative for You), an MLLM-based creative selector that takes into account users' interests. Through Reason-to-Select RFT, which includes supervised fine-tuning with Chain-of-Thought (CoT-SFT) and Group Relative Policy Optimization (GRPO)-based reinforcement learning, Creative4U is able to evaluate and select creative images accurately. Both offline and online experiments demonstrate the effectiveness of our approach. Our code and dataset will be made public to advance research and industrial applications.
Submitted 18 August, 2025;
originally announced August 2025.
-
MOON: Generative MLLM-based Multimodal Representation Learning for E-commerce Product Understanding
Authors:
Daoze Zhang,
Chenghan Fu,
Zhanheng Nie,
Jianyu Liu,
Wanxian Guan,
Yuan Gao,
Jun Song,
Pengjie Wang,
Jian Xu,
Bo Zheng
Abstract:
With the rapid advancement of e-commerce, exploring general representations rather than task-specific ones has attracted increasing research attention. For product understanding, although existing discriminative dual-flow architectures drive progress in this field, they inherently struggle to model the many-to-one alignment between multiple images and texts of products. Therefore, we argue that generative Multimodal Large Language Models (MLLMs) hold significant potential for improving product representation learning. Nevertheless, achieving this goal still remains non-trivial due to several key challenges: the lack of multimodal and aspect-aware modeling modules in typical LLMs; the common presence of background noise in product images; and the absence of a standard benchmark for evaluation. To address these issues, we propose the first generative MLLM-based model named MOON for product representation learning. Our method (1) employs a guided Mixture-of-Experts (MoE) module for targeted modeling of multimodal and aspect-specific product content; (2) effectively detects core semantic regions in product images to mitigate the distraction and interference caused by background noise; and (3) introduces the specialized negative sampling strategy to increase the difficulty and diversity of negative samples. In addition, we release a large-scale multimodal benchmark MBE for various product understanding tasks. Experimentally, our model demonstrates competitive zero-shot performance on both our benchmark and the public dataset, showcasing strong generalization across various downstream tasks, including cross-modal retrieval, product classification, and attribute prediction. Furthermore, the case study and visualization illustrate the effectiveness of MOON for product understanding. The data of our MBE benchmark is given in https://huggingface.co/datasets/Daoze/MM-Bench-E-Commerce.
Submitted 28 February, 2026; v1 submitted 16 August, 2025;
originally announced August 2025.
-
Cross-Granularity Hypergraph Retrieval-Augmented Generation for Multi-hop Question Answering
Authors:
Changjian Wang,
Weihong Deng,
Weili Guan,
Quan Lu,
Ning Jiang
Abstract:
Multi-hop question answering (MHQA) requires integrating knowledge scattered across multiple passages to derive the correct answer. Traditional retrieval-augmented generation (RAG) methods primarily focus on coarse-grained textual semantic similarity and ignore structural associations among dispersed knowledge, which limits their effectiveness in MHQA tasks. GraphRAG methods address this by leveraging knowledge graphs (KGs) to capture structural associations, but they tend to overly rely on structural information and fine-grained word- or phrase-level retrieval, resulting in an underutilization of textual semantics. In this paper, we propose a novel RAG approach called HGRAG for MHQA that achieves cross-granularity integration of structural and semantic information via hypergraphs. Structurally, we construct an entity hypergraph where fine-grained entities serve as nodes and coarse-grained passages as hyperedges, and establish knowledge association through shared entities. Semantically, we design a hypergraph retrieval method that integrates fine-grained entity similarity and coarse-grained passage similarity via hypergraph diffusion. Finally, we employ a retrieval enhancement module, which further refines the retrieved results both semantically and structurally, to obtain the most relevant passages as context for answer generation with the LLM. Experimental results on benchmark datasets demonstrate that our approach outperforms state-of-the-art methods in QA performance, and achieves a 6× speedup in retrieval efficiency.
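A minimal sketch of the entity hypergraph described above follows, with entities as nodes and passages as hyperedges, plus a one-hop expansion through shared entities as a stand-in for retrieval; the toy passages and the scoring-free expansion are assumptions, far simpler than HGRAG's hypergraph diffusion.

```python
# Minimal sketch of an entity hypergraph for passages: entities are nodes,
# passages are hyperedges (the set of entities they mention). Retrieval here
# is a simple one-hop expansion via shared entities; HGRAG's hypergraph
# diffusion and semantic scoring are richer than this.
from collections import defaultdict

passages = {
    "p1": {"Marie Curie", "Sorbonne"},
    "p2": {"Sorbonne", "Paris"},
    "p3": {"Pierre Curie", "Marie Curie"},
}

# invert the hyperedges: entity -> set of passages mentioning it
entity_to_passages = defaultdict(set)
for pid, entities in passages.items():
    for ent in entities:
        entity_to_passages[ent].add(pid)


def retrieve(query_entities, hops=1):
    """Collect passages reachable from the query entities via shared entities."""
    frontier = set(query_entities)
    hit = set()
    for _ in range(hops + 1):
        hit |= {pid for ent in frontier for pid in entity_to_passages.get(ent, ())}
        frontier = {ent for pid in hit for ent in passages[pid]}
    return sorted(hit)


if __name__ == "__main__":
    # "Where did Marie Curie teach?" needs p1 (Curie-Sorbonne) and p2 (Sorbonne-Paris)
    print(retrieve({"Marie Curie"}))  # ['p1', 'p2', 'p3']
```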
Submitted 15 August, 2025;
originally announced August 2025.