-
Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing
Authors:
Zeyue Tian,
Binxin Yang,
Zhaoyang Liu,
Jiexuan Zhang,
Ruibin Yuan,
Hubery Yin,
Qifeng Chen,
Chen Li,
Jing Lv,
Wei Xue,
Yike Guo
Abstract:
Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically addressed by specialized models, leaving the development of a truly unified framework that can seamlessly integrate all three tasks underexplored. While some pioneering works have explored unifying audio understanding and generation, they often remain confined to specific domains. To address this, we introduce Audio-Omni, the first end-to-end framework to unify generation and editing across general sound, music, and speech domains, with integrated multi-modal understanding capabilities. Our architecture synergizes a frozen Multimodal Large Language Model for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis. To overcome the critical data scarcity in audio editing, we construct AudioEdit, a new large-scale dataset comprising over one million meticulously curated editing pairs. Extensive experiments demonstrate that Audio-Omni achieves state-of-the-art performance across a suite of benchmarks, outperforming prior unified approaches while achieving performance on par with or superior to specialized expert models. Beyond these core tasks, Audio-Omni exhibits remarkable inherited capabilities, including knowledge-augmented reasoning generation, in-context generation, and zero-shot cross-lingual control for audio generation, highlighting a promising direction toward universal generative audio intelligence. The code, model, and dataset will be publicly released at https://zeyuet.github.io/Audio-Omni.
Submitted 12 April, 2026;
originally announced April 2026.
-
CAD 100K: A Comprehensive Multi-Task Dataset for Car Related Visual Anomaly Detection
Authors:
Jiahua Pang,
Ying Li,
Dongpu Cao,
Jingcai Luo,
Yanuo Zheng,
Bao Yunfan,
Yujie Lei,
Rui Yuan,
Yuxi Tian,
Guojin Yuan,
Hongchang Chen,
Zhi Zheng,
Yongchun Liu
Abstract:
Multi-task visual anomaly detection is critical for car-related manufacturing quality assessment. However, existing methods remain task-specific, hindered by the absence of a unified benchmark for multi-task evaluation. To fill this gap, we present the CAD dataset, a large-scale and comprehensive benchmark designed for car-related multi-task visual anomaly detection. The dataset contains over 100K images spanning 7 vehicle domains and 3 tasks, providing models with a comprehensive view for car-related anomaly detection. It is the first car-related anomaly dataset specialized for multi-task learning (MTL), combining synthetic data augmentation for few-shot anomaly images. We implement a multi-task baseline and conduct extensive empirical studies. Results show that MTL promotes task interaction and knowledge transfer, while also exposing challenging conflicts between tasks. The CAD dataset serves as a standardized platform to drive future advances in car-related multi-task visual anomaly detection.
Submitted 10 April, 2026;
originally announced April 2026.
-
Scheduling the Unschedulable: Taming Black-Box LLM Inference at Scale
Authors:
Renzhong Yuan,
Yijun Zeng,
Xiaosong Gao,
Linxi Yu,
Haochun Liao,
Han Wang
Abstract:
When output token counts can be predicted at submission time (Gan et al., 2026), client-side scheduling against a black-box LLM API becomes semi-clairvoyant: decisions condition on coarse token priors even though the provider's internals remain hidden. We decompose this boundary problem into three separable concerns: allocation (inter-class share via adaptive DRR), ordering (intra-class sequencing with feasible-set scoring), and overload control (explicit admit/defer/reject on a cost ladder). An information ladder experiment shows that coarse magnitude priors -- not class labels alone -- are the practical threshold for useful client control; removing magnitude inflates short-request P95 by up to $5.8\times$ and degrades deadline satisfaction. Under balanced / high congestion the full stack achieves 100% completion, 100% deadline satisfaction, and useful goodput of $4.2 \pm 1.6$ SLO-meeting requests/s with short P95 within tens of milliseconds of quota-tiered isolation. A predictor-noise sweep confirms graceful degradation under up to 60% multiplicative error. Heavy-dominated regimes separate policies on completion, tail, and interpretable shedding. We further compare short-priority allocation (biased toward interactive traffic) with Fair Queuing (round-robin across classes): Fair Queuing achieves +32% short-request P90 improvement over FIFO with only +17% long-request overhead, versus Short-Priority's +27% / +116% trade-off -- demonstrating that the allocation layer accommodates different fairness objectives without changing the remaining stack. We contribute the three-layer client-side decomposition, controlled evaluation of joint metrics across regimes, allocation-policy alternatives, and overload-policy evidence linking cost-ladder shedding to the stated service objective.
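The allocation layer is described as adaptive DRR; a minimal static-quantum deficit-round-robin sketch (class names, token costs, and quanta below are hypothetical, and the adaptive quantum update is omitted) illustrates how per-class token budgets gate dispatch:

```python
from collections import deque

def drr_schedule(queues, quanta, rounds):
    """Deficit Round Robin: each class accrues a per-round quantum of
    token budget; a request is dispatched when its predicted output-token
    cost fits within the class's accumulated deficit counter."""
    deficits = {cls: 0 for cls in queues}
    order = []
    for _ in range(rounds):
        for cls, q in queues.items():
            if not q:
                deficits[cls] = 0  # idle classes do not bank credit
                continue
            deficits[cls] += quanta[cls]
            while q and q[0][1] <= deficits[cls]:
                req, cost = q.popleft()
                deficits[cls] -= cost
                order.append(req)
    return order

# Hypothetical semi-clairvoyant workload: short interactive vs. long
# batch requests, each tagged with a predicted output-token cost.
queues = {
    "short": deque([("s1", 40), ("s2", 50), ("s3", 45)]),
    "long":  deque([("l1", 400), ("l2", 380)]),
}
quanta = {"short": 100, "long": 200}  # biases share toward "short"
print(drr_schedule(queues, quanta, rounds=4))  # → ['s1', 's2', 's3', 'l1', 'l2']
```

Swapping the quanta (or replacing them with equal values) changes the inter-class share without touching ordering or overload control, which is the separability the abstract argues for.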
Submitted 8 April, 2026;
originally announced April 2026.
-
HiPath: Hierarchical Vision-Language Alignment for Structured Pathology Report Prediction
Authors:
Ruicheng Yuan,
Zhenxuan Zhang,
Anbang Wang,
Liwei Hu,
Xiangqian Hua,
Yaya Peng,
Jiawei Luo,
Guang Yang
Abstract:
Pathology reports are structured, multi-granular documents encoding diagnostic conclusions, histological grades, and ancillary test results across one or more anatomical sites; yet existing pathology vision-language models (VLMs) reduce this output to a flat label or free-form text. We present HiPath, a lightweight VLM framework built on frozen UNI2 and Qwen3 backbones that treats structured report prediction as its primary training objective. Three trainable modules totalling 15M parameters address complementary aspects of the problem: a Hierarchical Patch Aggregator (HiPA) for multi-image visual encoding, Hierarchical Contrastive Learning (HiCL) for cross-modal alignment via optimal transport, and Slot-based Masked Diagnosis Prediction (Slot-MDP) for structured diagnosis generation. Trained on 749K real-world Chinese pathology cases from three hospitals, HiPath achieves 68.9% strict and 74.7% clinically acceptable accuracy with a 97.3% safety rate, outperforming all baselines under the same frozen backbone. Cross-hospital evaluation confirms generalisation with only a 3.4pp drop in strict accuracy while maintaining 97.1% safety.
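HiCL performs cross-modal alignment via optimal transport; a minimal entropy-regularized Sinkhorn sketch (toy cost matrix and marginals, not HiPath's actual features or solver) shows the core iteration:

```python
import math

def sinkhorn(cost, a, b, eps=0.1, iters=500):
    """Entropy-regularized optimal transport: alternately rescale the
    rows and columns of K = exp(-cost/eps) until the transport plan's
    marginals match the prescribed distributions a and b."""
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy cost between 2 visual tokens and 2 report slots (hypothetical):
# low cost on the diagonal encourages the matched pairing.
cost = [[0.0, 1.0], [1.0, 0.0]]
P = sinkhorn(cost, [0.5, 0.5], [0.5, 0.5])
print([round(sum(row), 2) for row in P])  # → [0.5, 0.5]
```

The resulting plan `P` concentrates mass on the cheap pairings while exactly reproducing both marginals, which is what makes it usable as a soft alignment between patch features and report fields.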
Submitted 20 March, 2026;
originally announced March 2026.
-
Noise and dynamics in acoustoelectric waveguides
Authors:
Ryan O. Behunin,
Andrew Shepherd,
Ruoyu Yuan,
Taylor Ray,
Matthew J. Storey,
Peter T. Rakich,
Nils T. Otterstrom,
Matt Eichenfield
Abstract:
We present a quantum field theoretic formulation of acoustoelectric interactions in waveguide-like systems of arbitrary cross-section. Building on an open quantum systems approach, we derive a unified description of plasmon-phonon coupling that incorporates dissipation, noise, and the influence of drift currents. Our analysis captures both bulk and surface plasmon modes, highlighting how drift currents Doppler-shift plasmonic resonances and reshape the phonon noise spectrum. The resulting Heisenberg-Langevin equations yield closed-form expressions for frequency shifts, gain, and noise power spectra, enabling direct evaluation of performance metrics such as the noise factor in acoustoelectric amplifiers and oscillators. In the appropriate limits, this framework reproduces known results while extending them to complex geometries.
Submitted 16 March, 2026;
originally announced March 2026.
-
Vision-Language Model Based Multi-Expert Fusion for CT Image Classification
Authors:
Jianfa Bai,
Kejin Lu,
Runtian Yuan,
Qingqiu Li,
Jilan Xu,
Junlin Hou,
Yuejie Zhang,
Rui Feng
Abstract:
Robust detection of COVID-19 from chest CT remains challenging in multi-institutional settings due to substantial source shift, source imbalance, and hidden test-source identities. In this work, we propose a three-stage source-aware multi-expert framework for multi-source COVID-19 CT classification. First, we build a lung-aware 3D expert by combining original CT volumes and lung-extracted CT volumes for volumetric classification. Second, we develop two MedSigLIP-based experts: a slice-wise representation and probability learning module, and a Transformer-based inter-slice context modeling module for capturing cross-slice dependency. Third, we train a source classifier to predict the latent source identity of each test scan. By leveraging the predicted source information, we perform model fusion and voting based on different experts. On the validation set covering all four sources, the Stage 1 model achieves the best macro-F1 of 0.9711, ACC of 0.9712, and AUC of 0.9791. Stage 2a and Stage 2b achieve the best AUC scores of 0.9864 and 0.9854, respectively. The Stage 3 source classifier reaches 0.9107 ACC and 0.9114 F1. These results demonstrate that source-aware expert modeling and hierarchical voting provide an effective solution for robust COVID-19 CT classification under heterogeneous multi-source conditions.
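The final source-aware fusion stage (predict each scan's source, then vote among the experts chosen for that source) can be illustrated with a schematic sketch; the expert names, routing table, and stand-in classifiers below are hypothetical, not the paper's trained models:

```python
from collections import Counter

def source_aware_vote(scan, source_clf, experts, per_source_experts):
    """Stage 3 fusion: predict the latent source of a test scan, then
    majority-vote over the expert subset selected for that source."""
    source = source_clf(scan)
    votes = [experts[name](scan) for name in per_source_experts[source]]
    return Counter(votes).most_common(1)[0][0]

# Illustrative stand-ins for the trained stages (real experts are
# 3D and MedSigLIP-based CT classifiers).
experts = {
    "stage1_3d": lambda s: "covid",
    "stage2a":   lambda s: "covid",
    "stage2b":   lambda s: "non-covid",
}
per_source_experts = {
    "hospital_A": ["stage1_3d", "stage2a", "stage2b"],
    "hospital_B": ["stage2a", "stage2b"],
}
source_clf = lambda s: "hospital_A"
print(source_aware_vote({}, source_clf, experts, per_source_experts))  # → covid
```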
Submitted 16 March, 2026;
originally announced March 2026.
-
Clinical Priors Guided Lung Disease Detection in 3D CT Scans
Authors:
Kejin Lu,
Jianfa Bai,
Qingqiu Li,
Runtian Yuan,
Jilan Xu,
Junlin Hou,
Yuejie Zhang,
Rui Feng
Abstract:
Accurate classification of lung diseases from chest CT scans plays an important role in computer-aided diagnosis systems. However, medical imaging datasets often suffer from severe class imbalance, which may significantly degrade the performance of deep learning models, especially for minority disease categories. To address this issue, we propose a gender-aware two-stage lung disease classification framework. The proposed approach explicitly incorporates gender information into the disease recognition pipeline. In the first stage, a gender classifier is trained to predict the patient's gender from CT scans. In the second stage, the input CT image is routed to a corresponding gender-specific disease classifier to perform final disease prediction. This design enables the model to better capture gender-related imaging characteristics and alleviate the influence of imbalanced data distribution. Experimental results demonstrate that the proposed method improves the recognition performance for minority disease categories, particularly squamous cell carcinoma, while maintaining competitive performance on other classes.
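The two-stage routing described above (first predict the patient's gender, then dispatch the same scan to a gender-specific disease classifier) can be sketched as follows; the classifier objects are toy stand-ins for the trained CT models:

```python
def two_stage_predict(ct_volume, gender_clf, disease_clfs):
    """Stage 1: predict patient gender from the CT scan.
    Stage 2: route the same scan to the matching gender-specific
    disease classifier for the final prediction."""
    gender = gender_clf(ct_volume)          # e.g. "male" / "female"
    return disease_clfs[gender](ct_volume)  # gender-specific expert

# Hypothetical stand-ins (assumption: callables returning labels;
# the real models are deep 3D classifiers).
gender_clf = lambda x: "female" if x["mean_hu"] < 0 else "male"
disease_clfs = {
    "male":   lambda x: "squamous_cell_carcinoma",
    "female": lambda x: "adenocarcinoma",
}
print(two_stage_predict({"mean_hu": -50}, gender_clf, disease_clfs))  # → adenocarcinoma
```

Because each disease classifier only ever sees one gender's scans, the routing isolates gender-related imaging characteristics instead of forcing one model to average over them.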
Submitted 17 March, 2026; v1 submitted 16 March, 2026;
originally announced March 2026.
-
Towards Trustworthy Selective Generation: Reliability-Guided Diffusion for Ultra-Low-Field to High-Field MRI Synthesis
Authors:
Zhenxuan Zhang,
Peiyuan Jing,
Ruicheng Yuan,
Liwei Hu,
Anbang Wang,
Fanwen Wang,
Yinzhe Wu,
Kh Tohidul Islam,
Zhaolin Chen,
Zi Wang,
Peter Lally,
Guang Yang
Abstract:
Low-field to high-field MRI synthesis has emerged as a cost-effective strategy to enhance image quality under hardware and acquisition constraints, particularly in scenarios where access to high-field scanners is limited or impractical. Despite recent progress in diffusion models, diffusion-based approaches often struggle to balance fine-detail recovery and structural fidelity. In particular, the uncontrolled generation of high-resolution details in structurally ambiguous regions may introduce anatomically inconsistent patterns, such as spurious edges or artificial texture variations. These artifacts can bias downstream quantitative analysis. For example, they may cause inaccurate tissue boundary delineation or erroneous volumetric estimation, ultimately reducing clinical trust in synthesized images. These limitations highlight the need for generative models that are not only visually accurate but also spatially reliable and anatomically consistent. To address this issue, we propose a reliability-aware diffusion framework (ReDiff) that improves synthesis robustness at both the sampling and post-generation stages. Specifically, we introduce a reliability-guided sampling strategy to suppress unreliable responses during the denoising process. We further develop an uncertainty-aware multi-candidate selection scheme to enhance the reliability of the final prediction. Experiments on multi-center MRI datasets demonstrate improved structural fidelity and reduced artifacts compared with state-of-the-art methods.
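The uncertainty-aware multi-candidate selection idea can be illustrated with a small sketch that scores each sampled diffusion output by its deviation from the ensemble consensus; this is a schematic proxy, not ReDiff's exact selection criterion:

```python
def select_candidate(candidates):
    """Pick the sampled image (flattened pixel list) whose pixels
    deviate least from the ensemble mean, treating deviation from
    consensus as a proxy for unreliable, hallucinated detail."""
    n = len(candidates)
    mean = [sum(px) / n for px in zip(*candidates)]

    def deviation(cand):
        return sum((a - b) ** 2 for a, b in zip(cand, mean))

    return min(range(n), key=lambda i: deviation(candidates[i]))

# Three toy "images" (flattened); the middle one sits closest to the
# consensus of the ensemble, so it is selected.
cands = [[0.1, 0.9, 0.4], [0.2, 0.8, 0.5], [0.6, 0.3, 0.9]]
print(select_candidate(cands))  # → 1
```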
Submitted 11 March, 2026;
originally announced March 2026.
-
Thinking with Spatial Code for Physical-World Video Reasoning
Authors:
Jieneng Chen,
Wenxin Ma,
Ruisheng Yuan,
Yunzhi Zhang,
Jiajun Wu,
Alan Yuille
Abstract:
We introduce Thinking with Spatial Code, a framework that transforms RGB video into explicit, temporally coherent 3D representations for physical-world visual question answering. We highlight the empirical finding that our proposed spatial encoder can parse videos into structured spatial code with explicit 3D oriented bounding boxes and semantic labels, enabling large language models (LLMs) to reason directly over explicit spatial variables. Specifically, we propose a spatial encoder that encodes image and geometric features by unifying 6D object parsing and tracking backbones with geometric prediction, and we further finetune LLMs with reinforcement learning using a spatial rubric reward that encourages perspective-aware, geometrically grounded inference. As a result, our model outperforms proprietary vision-language models on VSI-Bench, setting a new state-of-the-art. Code is available at https://github.com/Beckschen/spatialcode.
Submitted 5 March, 2026;
originally announced March 2026.
-
Rethinking the Efficiency and Effectiveness of Reinforcement Learning for Radiology Report Generation
Authors:
Zilin Lu,
Ruifeng Yuan,
Weiwei Cao,
Wanxing Chang,
Zhongyu Wei,
Sinuo Wang,
Yong Xia,
Ling Zhang,
Jianpeng Zhang
Abstract:
Radiologists highly desire fully automated AI for radiology report generation (R2G), yet existing approaches fall short in clinical utility. Reinforcement learning (RL) holds potential to address these shortcomings, but its adoption in this task remains underexplored. In this paper, we revisit RL in terms of data efficiency and optimization effectiveness for R2G tasks. First, we explore the impact of data quantity and quality on the performance of RL in medical contexts, revealing that data quality plays a more critical role than quantity. To this end, we propose a diagnostic diversity-based data sampling strategy that enables comparable performance with fewer samples. Second, we observe that the majority of tokens in radiology reports are template-like and diagnostically uninformative, whereas the low frequency of clinically critical tokens heightens the risk of being overlooked during optimization. To tackle this, we introduce Diagnostic Token-weighted Policy Optimization (DiTPO), which directly optimizes for clinical accuracy by using a diagnostic F1 score as the reward signal. Unlike standard RL approaches that treat all tokens equally, DiTPO explicitly models the varying importance of different tokens through rule- or gradient-based mechanisms to prioritize clinically relevant content. Extensive experiments on the MIMIC-CXR, IU-Xray, and CheXpert Plus datasets demonstrate that our framework achieves state-of-the-art (SOTA) performance while requiring substantially fewer training samples in RL. Notably, on MIMIC-CXR, our framework attains an F1 score of 0.516 using only 20% of the RL training samples.
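DiTPO's reward signal is a diagnostic F1 score; a minimal sketch of such a reward over extracted finding labels follows (label extraction itself, e.g. with a tool such as CheXbert, is abstracted away, and the label names are hypothetical):

```python
def diagnostic_f1_reward(pred_findings, ref_findings):
    """F1 over the sets of diagnostic labels extracted from a generated
    report and its reference; used as a scalar per-sample RL reward, so
    template-like tokens contribute nothing unless they change findings."""
    pred, ref = set(pred_findings), set(ref_findings)
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels extracted from a generated vs. reference report.
pred = ["cardiomegaly", "pleural_effusion", "edema"]
ref  = ["cardiomegaly", "pleural_effusion", "pneumonia"]
print(round(diagnostic_f1_reward(pred, ref), 3))  # → 0.667
```

Because the reward depends only on the extracted findings, a report that nails the clinically critical tokens scores well even if its boilerplate phrasing differs from the reference.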
Submitted 4 March, 2026;
originally announced March 2026.
-
CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction
Authors:
Yinghao Ma,
Haiwen Xia,
Hewei Gao,
Weixiong Chen,
Yuxin Ye,
Yuchen Yang,
Sungkyun Chang,
Mingshuo Ding,
Yizhi Li,
Ruibin Yuan,
Simon Dixon,
Emmanouil Benetos
Abstract:
While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a unified benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient reward model family capable of processing heterogeneous inputs. We evaluate their correlation with human judgment scores on musicality and alignment on CMI-Pref, along with previous datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments, but also enables effective inference-time scaling via top-k filtering. The necessary training data, benchmarks, and reward models are publicly available.
Submitted 4 March, 2026; v1 submitted 28 February, 2026;
originally announced March 2026.
-
Voices of Civilizations: A Multilingual QA Benchmark for Global Music Understanding
Authors:
Shangda Wu,
Ziya Zhou,
Yongyi Zang,
Yutong Zheng,
Dafang Liang,
Ruibin Yuan,
Qiuqiang Kong
Abstract:
We introduce Voices of Civilizations, the first multilingual QA benchmark for evaluating audio LLMs' cultural comprehension on full-length music recordings. Covering 380 tracks across 38 languages, our automated pipeline yields 1,190 multiple-choice questions through four stages - each followed by manual verification: 1) compiling a representative music list; 2) generating cultural-background documents for each sample in the music list via LLMs; 3) extracting key attributes from those documents; and 4) constructing multiple-choice questions probing language, region associations, mood, and thematic content. We evaluate models under four conditions and report per-language accuracy. Our findings demonstrate that even state-of-the-art audio LLMs struggle to capture subtle cultural nuances without rich textual context and exhibit systematic biases in interpreting music from different cultural traditions. The dataset is publicly available on Hugging Face to foster culturally inclusive music understanding research.
Submitted 28 February, 2026;
originally announced March 2026.
-
Co-Propagation of Quantum Time Synchronization and Optical Frequency Transfer over a 122 km Hollow-Core Fiber
Authors:
Huibo Hong,
Xiao Xiang,
Runai Quan,
Rongduo Lu,
Qian Zhou,
Dawei Ge,
Liuyan Han,
Bo Liu,
Ru Yuan,
Dechao Zhang,
Yuting Liu,
Bingke Shi,
ZhiGuang Xia,
Xinghua Li,
Mingtao Cao,
Tao Liu,
Ruifang Dong,
Shougang Zhang
Abstract:
The co-propagation of quantum and classical signals through shared optical fibers is crucial for scalable quantum networks. However, this coexistence is fundamentally limited by spontaneous Raman scattering (SpRS) from the bright classical light, which generates overwhelming noise that disrupts the single-photon-level quantum signals. Here, we overcome this long-standing challenge by leveraging the inherently ultralow nonlinearity of hollow-core fiber (HCF) to suppress SpRS noise. By operating both the quantum time synchronization (QTS) and classical optical frequency transfer (OFT) signals within the telecom C-band, separated by only ~10 nm, we successfully demonstrate their simultaneous transmission over a 122-km HCF link. With a classical OFT power of 1 mW, the QTS performance shows negligible degradation, maintaining sub-picosecond time stability at 2000 s, while the OFT achieves a fractional frequency instability of 10^-20. Near-sub-picosecond QTS stability is preserved even when the classical power is increased to 3 mW. Furthermore, simulations based on our experimental data indicate that with next-generation low-loss HCF, the platform can tolerate classical powers beyond 10 mW and extend the QTS range to over 500 km. By realizing a unified quantum-classical time-frequency distribution framework, this work establishes HCF as a highly capable and practical platform for future scalable quantum networks.
Submitted 21 February, 2026;
originally announced February 2026.
-
AlignTune: Modular Toolkit for Post-Training Alignment of Large Language Models
Authors:
R E Zera Marveen Lyngkhoi,
Chirag Chawla,
Pratinav Seth,
Utsav Avaiya,
Soham Bhattacharjee,
Mykola Khandoga,
Rui Yuan,
Vinay Kumar Sankarapu
Abstract:
Post-training alignment is central to deploying large language models (LLMs), yet practical workflows remain split across backend-specific tools and ad-hoc glue code, making experiments hard to reproduce. We identify backend interference, reward fragmentation, and irreproducible pipelines as key obstacles in alignment research. We introduce AlignTune, a modular toolkit exposing a unified interface for supervised fine-tuning (SFT) and RLHF-style optimization with interchangeable TRL and Unsloth backends. AlignTune standardizes configuration, provides an extensible reward layer (rule-based and learned), and integrates evaluation over standard benchmarks and custom tasks. By isolating backend-specific logic behind a single factory boundary, AlignTune enables controlled comparisons and reproducible alignment experiments.
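The "single factory boundary" design can be illustrated with a minimal sketch; the names (`make_trainer`, the backend classes, the config fields) are hypothetical and do not reflect AlignTune's actual API:

```python
class TRLBackend:
    """Stand-in for TRL-specific training logic."""
    def train(self, cfg):
        return f"TRL:SFT:{cfg['model']}"

class UnslothBackend:
    """Stand-in for Unsloth-specific training logic."""
    def train(self, cfg):
        return f"Unsloth:SFT:{cfg['model']}"

_BACKENDS = {"trl": TRLBackend, "unsloth": UnslothBackend}

def make_trainer(backend: str):
    """Single factory boundary: every backend-specific decision lives
    behind this one lookup, so an experiment swaps backends by changing
    one config field while the rest of the pipeline stays identical."""
    return _BACKENDS[backend]()

cfg = {"model": "llama-3-8b"}
print(make_trainer("trl").train(cfg))  # → TRL:SFT:llama-3-8b
```

Isolating backend choice to one call site is what makes backend interference visible and controlled comparisons reproducible, since both backends consume the same standardized configuration.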
Submitted 11 February, 2026; v1 submitted 10 February, 2026;
originally announced February 2026.
-
Beyond Uniform Credit: Causal Credit Assignment for Policy Optimization
Authors:
Mykola Khandoga,
Rui Yuan,
Vinay Kumar Sankarapu
Abstract:
Policy gradient methods for language model reasoning, such as GRPO and DAPO, assign uniform credit to all generated tokens - the filler phrase "Let me think" receives the same gradient update as the critical calculation "23 + 45 = 68." We propose counterfactual importance weighting: mask reasoning spans, measure the drop in answer probability, and upweight tokens accordingly during policy gradient updates. Our method requires no auxiliary models or external annotation; instead, importance is estimated directly from the policy model's own probability shifts. Experiments on GSM8K across three models spanning the Qwen and Llama families demonstrate consistent improvements over uniform baselines and faster convergence to equivalent accuracy. Inverting the importance signal hurts performance, confirming we capture genuine causal structure rather than noise. Analysis shows the method correctly prioritizes calculation steps over scaffolding text. We view these findings as establishing counterfactual importance weighting as a foundation for further research rather than a complete solution.
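The masking procedure can be sketched as follows; `answer_logprob` is a hypothetical stand-in for scoring the final answer under the policy model with only a subset of reasoning spans kept, and the toy contributions are illustrative:

```python
def counterfactual_weights(spans, answer_logprob, eps=1e-6):
    """For each reasoning span, mask it out, measure the drop in the
    answer's log-probability, and turn the drops into per-span weights
    (normalized to mean ~1) that upweight causally important content
    in the policy gradient."""
    all_idx = set(range(len(spans)))
    base = answer_logprob(keep=all_idx)
    drops = [max(base - answer_logprob(keep=all_idx - {i}), 0.0)
             for i in range(len(spans))]  # only count genuine drops
    total = sum(drops) + eps
    return [len(spans) * d / total for d in drops]

# Toy stand-in scorer (assumption): each span contributes a fixed
# amount to the answer's log-probability; in the method itself the
# signal comes from the policy model's own probability shifts.
spans = ["Let me think", "23 + 45 = 68", "So the answer is 68"]
contrib = [0.1, 2.0, 0.1]
scorer = lambda keep: sum(contrib[i] for i in keep)
print([round(w, 2) for w in counterfactual_weights(spans, scorer)])  # → [0.14, 2.73, 0.14]
```

Masking the calculation span causes the largest drop, so it receives the largest weight, while the scaffolding spans are downweighted - the behavior the abstract's analysis reports.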
Submitted 9 February, 2026;
originally announced February 2026.
-
Beyond KL Divergence: Policy Optimization with Flexible Bregman Divergences for LLM Reasoning
Authors:
Rui Yuan,
Mykola Khandoga,
Vinay Kumar Sankarapu
Abstract:
Policy optimization methods like Group Relative Policy Optimization (GRPO) and its variants have achieved strong results on mathematical reasoning and code generation tasks. Despite extensive exploration of reward processing strategies and training dynamics, all existing group-based methods exclusively use KL divergence for policy regularization, leaving the choice of divergence function unexplored. We introduce Group-Based Mirror Policy Optimization (GBMPO), a framework that extends group-based policy optimization to flexible Bregman divergences, including hand-designed alternatives (L2 in probability space) and learned neural mirror maps. On GSM8K mathematical reasoning, hand-designed ProbL2-GRPO achieves 86.7% accuracy, improving +5.5 points over the Dr. GRPO baseline. On MBPP code generation, neural mirror maps reach 60.1-60.8% pass@1, with random initialization already capturing most of the benefit. While evolutionary strategies meta-learning provides marginal accuracy improvements, its primary value lies in variance reduction ($\pm$0.2 versus $\pm$0.6) and efficiency gains (15% shorter responses on MBPP), suggesting that random initialization of neural mirror maps is sufficient for most practical applications. These results establish divergence choice as a critical, previously unexplored design dimension in group-based policy optimization for LLM reasoning.
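The divergence family GBMPO draws on is the Bregman divergence $D_ψ(p, q) = ψ(p) - ψ(q) - \langle \nabla ψ(q), p - q \rangle$ for a strictly convex generator $ψ$. The sketch below (toy distributions, not the training setup) checks the two cases the abstract mentions: negative entropy recovers the KL divergence, and the squared norm recovers probability-space L2:

```python
import math

def bregman(psi, grad_psi, p, q):
    """D_psi(p, q) = psi(p) - psi(q) - <grad psi(q), p - q>."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_psi(q), p, q))
    return psi(p) - psi(q) - inner

# Negative entropy generates the KL divergence ...
neg_ent = lambda v: sum(x * math.log(x) for x in v)
neg_ent_grad = lambda v: [math.log(x) + 1 for x in v]
# ... while the (halved) squared L2 norm generates squared Euclidean
# distance in probability space.
sq_norm = lambda v: 0.5 * sum(x * x for x in v)
sq_norm_grad = lambda v: list(v)

p, q = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
assert abs(bregman(neg_ent, neg_ent_grad, p, q) - kl) < 1e-12
l2 = 0.5 * sum((pi - qi) ** 2 for pi, qi in zip(p, q))
assert abs(bregman(sq_norm, sq_norm_grad, p, q) - l2) < 1e-12
print("KL:", round(kl, 4))  # → KL: 0.0851
```

Swapping the generator $ψ$ while keeping the rest of the objective fixed is exactly the design dimension the paper varies, including learned neural mirror maps in place of the hand-designed generators above.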
Submitted 4 February, 2026;
originally announced February 2026.
-
AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation
Authors:
Dongjie Cheng,
Ruifeng Yuan,
Yongqi Li,
Runyang You,
Wenjie Wang,
Liqiang Nie,
Lei Zhang,
Wenjie Li
Abstract:
Real-world perception and interaction are inherently multimodal, encompassing not only language but also vision and speech, which motivates the development of "Omni" MLLMs that support both multimodal inputs and multimodal outputs. While a number of omni MLLMs have emerged, most existing systems still rely on additional expert components to achieve multimodal generation, limiting the simplicity of unified training and inference. Autoregressive (AR) modeling, with a single token stream, a single next-token objective, and a single decoder, is an elegant and scalable foundation in the text domain. Motivated by this, we present AR-Omni, a unified any-to-any model in the autoregressive paradigm without any expert decoders. AR-Omni supports autoregressive text and image generation, as well as streaming speech generation, all under a single Transformer decoder. We further address three practical issues in unified AR modeling: modality imbalance via task-aware loss reweighting, visual fidelity via a lightweight token-level perceptual alignment loss for image tokens, and stability-creativity trade-offs via a finite-state decoding mechanism. Empirically, AR-Omni achieves strong quality across three modalities while remaining real-time, achieving a 0.88 real-time factor for speech generation.
Submitted 25 January, 2026;
originally announced January 2026.
-
Mobile charges in MoS2/high-k oxide transistors: from abnormal instabilities to memory-like dynamics
Authors:
Shaokai Zhou,
Haihui Cai,
Yehao Wu,
Yufeng Min,
Renchen Yuan,
Yezhu Lv,
Jianming Huang,
Yuanyuan Shi,
Yury Yuryevich Illarionov
Abstract:
MoS$_2$ field-effect transistors (FETs) with high-\textit{k} oxides currently lag behind silicon standards in bias and temperature stability due to ubiquitous border oxide traps that cause clockwise (CW) hysteresis in gate transfer characteristics. While suppressing this effect is typically mandatory for logic FETs, here we explore an alternative strategy where the initial CW hysteresis can be dynamically overcome by stronger counterclockwise (CCW) hysteresis towards memory-like dynamics. We systematically compare hysteresis in similar back-gated MoS$_2$/HfO$_2$ and MoS$_2$/Al$_2$O$_3$ FETs up to 275\textdegree C. At room temperature, both devices initially show sizable CW hysteresis. However, at 175\textdegree C MoS$_2$/HfO$_2$ FETs exhibit dominant CCW dynamics coupled with self-doping and negative differential resistance (NDR) effects. Our compact model suggests that this behavior is caused by the drift of mobile oxygen vacancies (\textit{V}\({}_{\mathrm{O}}^{+}\) or \textit{V}\({}_{\mathrm{O}}^{2+}\)) within HfO$_2$, which also causes negative $V_{\mathrm{th}}$ shift under a constant positive bias stress. This alternative mechanism effectively overrides the initial CW hysteresis and enables intrinsic memory functionality that can be enhanced by using narrower gate bias sweep ranges. In contrast, the MoS$_2$/Al$_2$O$_3$ FETs display only minor CCW dynamics even at 275\textdegree C due to higher drift activation energies for the same vacancies, thereby maintaining superior stability. Our results reveal an insulator selection paradigm: Al$_2$O$_3$ layers are better suited to suppress detrimental negative $V_{\mathrm{th}}$ shifts in MoS$_2$ logic FETs at high temperatures, whereas their HfO$_2$ counterparts can serve as active memory layers that would exploit these abnormal instabilities.
Submitted 23 January, 2026;
originally announced January 2026.
-
An efficient treatment of heat-flux boundary conditions in GSIS for rarefied gas flows
Authors:
Yanbing Zhang,
Ruifeng Yuan,
Liyan Luo,
Lei Wu
Abstract:
Heat-flux boundary conditions are challenging to implement efficiently in rarefied gas flow simulations because the wall-reflected gas temperature and density must be determined dynamically during the computation. This paper aims to tackle this problem within the general synthetic iterative scheme (GSIS), where the Boltzmann kinetic equation is solved deterministically in an outer loop and macroscopic synthetic equations are solved in an inner loop. To avoid kinetic-macroscopic boundary-flux mismatch and the resulting convergence bottlenecks, for the macroscopic boundary flux at every inner iteration, the incident increment is estimated using a Maxwellian distribution, and then the reflected contribution is obtained by boundary conditions consistent with those in the kinetic solver. In addition to retaining the fast-converging and asymptotic-preserving properties of GSIS, the proposed method significantly reduces the iterations required to determine the wall-reflected gas parameters. Numerical simulations of rarefied gas flows in and around a 3D nozzle, a 2D adiabatic cylinder, and a 2D annular heat-transfer configuration show good agreement with the direct simulation Monte Carlo method, while achieving substantial efficiency gains over conventional iterative schemes.
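The key closure idea, estimating the incident flux from a Maxwellian and then balancing it against a diffusely reflected Maxwellian at the wall, can be illustrated in the equilibrium limit. This is a textbook sketch of the zero-net-mass-flux condition, not the full GSIS inner-loop scheme; the resting-gas assumption and argon-free values are illustrative.

```python
import math

def maxwellian_wall_flux(n, T, R=287.0):
    """One-sided number flux of a resting Maxwellian gas (density n,
    temperature T, specific gas constant R) onto a wall:
    Phi = n * sqrt(R*T / (2*pi))."""
    return n * math.sqrt(R * T / (2.0 * math.pi))

def reflected_density(incident_flux, T_w, R=287.0):
    """Density of the diffusely reflected Maxwellian at wall temperature T_w,
    chosen so the reflected half-range stream carries the incident mass flux
    back, i.e. zero net mass flux through the wall."""
    return incident_flux / math.sqrt(R * T_w / (2.0 * math.pi))
```

When the gas and wall temperatures coincide, the reflected density recovers the incident density exactly, which is the consistency the synthetic-equation boundary flux must preserve to avoid the kinetic-macroscopic mismatch the paper targets.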
Submitted 20 January, 2026;
originally announced January 2026.
-
CausalSpatial: A Benchmark for Object-Centric Causal Spatial Reasoning
Authors:
Wenxin Ma,
Chenlong Wang,
Ruisheng Yuan,
Hao Chen,
Nanru Dai,
S. Kevin Zhou,
Yijun Yang,
Alan Yuille,
Jieneng Chen
Abstract:
Humans can look at a static scene and instantly predict what happens next -- will moving this object cause a collision? We call this ability Causal Spatial Reasoning. However, current multimodal large language models (MLLMs) cannot do this, as they remain largely restricted to static spatial perception, struggling to answer "what-if" questions in a 3D scene. We introduce CausalSpatial, a diagnostic benchmark evaluating whether models can anticipate consequences of object motions across four tasks: Collision, Compatibility, Occlusion, and Trajectory. Results expose a severe gap: humans score 84% while GPT-5 achieves only 54%. Why do MLLMs fail? Our analysis uncovers a fundamental deficiency: models over-rely on textual chain-of-thought reasoning that drifts from visual evidence, producing fluent but spatially ungrounded hallucinations. To address this, we propose the Causal Object World model (COW), a framework that externalizes the simulation process by generating videos of hypothetical dynamics. With explicit visual cues of causality, COW enables models to ground their reasoning in physical reality rather than linguistic priors. We make the dataset and code publicly available here: https://github.com/CausalSpatial/CausalSpatial
Submitted 19 January, 2026;
originally announced January 2026.
-
SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing
Authors:
Ziyang Ma,
Guanrou Yang,
Wenxi Chen,
Zhifu Gao,
Yexing Du,
Xiquan Li,
Zhisheng Zheng,
Haina Zhu,
Jianheng Zhuo,
Zheshu Song,
Ruiyang Xu,
Tiranrui Wang,
Yifan Yang,
Yanqiao Zhu,
Zhikang Niu,
Liumeng Xue,
Yinghao Ma,
Ruibin Yuan,
Shiliang Zhang,
Kai Yu,
Eng Siong Chng,
Xie Chen
Abstract:
The recent surge in open-source Multimodal Large Language Models (MLLM) frameworks, such as LLaVA, provides a convenient kickoff for artificial intelligence developers and researchers. However, most of the MLLM frameworks take vision as the main input modality, and provide limited in-depth support for the speech, audio, and music modalities. This situation hinders the development of audio-language models, and forces researchers to spend a lot of effort on code writing and hyperparameter tuning. We present SLAM-LLM, an open-source deep learning framework designed to train customized MLLMs, focused on speech, language, audio, and music processing. SLAM-LLM provides a modular configuration of different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins. SLAM-LLM also includes detailed training and inference recipes for mainstream tasks, along with high-performance checkpoints for tasks such as LLM-based Automatic Speech Recognition (ASR), Automated Audio Captioning (AAC), and Music Captioning (MC). Some of these recipes have already reached or are nearing state-of-the-art performance, and some relevant techniques have also been adopted in academic publications. We hope SLAM-LLM will accelerate iteration, development, data engineering, and model training for researchers. We are committed to continually pushing forward audio-based MLLMs through this open-source framework, and call on the community to contribute to LLM-based speech, audio, and music processing.
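The modular encoder/projector/LLM/PEFT design can be pictured as a recipe object. The field names and defaults below are hypothetical, chosen only to mirror the components the abstract lists; they are not SLAM-LLM's real configuration schema.

```python
from dataclasses import dataclass

@dataclass
class SLAMRecipe:
    """Illustrative modular recipe in the spirit of SLAM-LLM's configuration
    (field names and defaults are hypothetical, not the framework's schema)."""
    encoder: str = "whisper-large-v3"  # speech/audio/music encoder
    projector: str = "linear"          # maps encoder features into LLM embedding space
    llm: str = "vicuna-7b"             # backbone LLM
    peft: str = "lora"                 # parameter-efficient fine-tuning plugin
    task: str = "asr"                  # asr / aac (audio captioning) / mc (music captioning)
    freeze_encoder: bool = True        # typical recipe: train only projector + PEFT
    freeze_llm: bool = True
```

Swapping one field (say, `task="aac"` with a different projector) while keeping the rest fixed is exactly the kind of iteration such a modular layout is meant to make cheap.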
Submitted 14 January, 2026;
originally announced January 2026.
-
OMUDA: Omni-level Masking for Unsupervised Domain Adaptation in Semantic Segmentation
Authors:
Yang Ou,
Xiongwei Zhao,
Xinye Yang,
Yihan Wang,
Yicheng Di,
Rong Yuan,
Xieyuanli Chen,
Xu Zhu
Abstract:
Unsupervised domain adaptation (UDA) enables semantic segmentation models to generalize from a labeled source domain to an unlabeled target domain. However, existing UDA methods still struggle to bridge the domain gap due to cross-domain contextual ambiguity, inconsistent feature representations, and class-wise pseudo-label noise. To address these challenges, we propose Omni-level Masking for Unsupervised Domain Adaptation (OMUDA), a unified framework that introduces hierarchical masking strategies across distinct representation levels. Specifically, OMUDA comprises: 1) a Context-Aware Masking (CAM) strategy that adaptively distinguishes foreground from background to balance global context and local details; 2) a Feature Distillation Masking (FDM) strategy that enhances robust and consistent feature learning through knowledge transfer from pre-trained models; and 3) a Class Decoupling Masking (CDM) strategy that mitigates the impact of noisy pseudo-labels by explicitly modeling class-wise uncertainty. This hierarchical masking paradigm effectively reduces the domain shift at the contextual, representational, and categorical levels, providing a unified solution beyond existing approaches. Extensive experiments on multiple challenging cross-domain semantic segmentation benchmarks validate the effectiveness of OMUDA. Notably, on the SYNTHIA->Cityscapes and GTA5->Cityscapes tasks, OMUDA can be seamlessly integrated into existing UDA methods and consistently achieves state-of-the-art results with an average improvement of 7%.
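The core mechanic of a context-aware mask, treating foreground and background patches with different masking rates, can be sketched as follows. The specific ratios and the hard foreground/background split are assumptions for illustration; the paper's CAM adapts this distinction rather than taking it as given.

```python
import numpy as np

def context_aware_mask(is_foreground, fg_ratio=0.7, bg_ratio=0.3, rng=None):
    """Patch mask with different rates for foreground and background
    (ratios here are illustrative, not the paper's values): masking
    foreground more aggressively forces the model to infer objects from
    global context, while lighter background masking preserves scene layout."""
    rng = np.random.default_rng(rng)
    is_foreground = np.asarray(is_foreground, dtype=bool)
    ratios = np.where(is_foreground, fg_ratio, bg_ratio)
    return rng.random(is_foreground.shape) < ratios  # True = patch is masked
```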
Submitted 13 December, 2025;
originally announced December 2025.
-
AutoMV: An Automatic Multi-Agent System for Music Video Generation
Authors:
Xiaoxuan Tang,
Xinping Lei,
Chaoran Zhu,
Shiyun Chen,
Ruibin Yuan,
Yizhi Li,
Changjae Oh,
Ge Zhang,
Wenhao Huang,
Emmanouil Benetos,
Yang Liu,
Jiaheng Liu,
Yinghao Ma
Abstract:
Music-to-Video (M2V) generation for full-length songs faces significant challenges. Existing methods produce short, disjointed clips, failing to align visuals with musical structure, beats, or lyrics, and lack temporal consistency. We propose AutoMV, a multi-agent system that generates full music videos (MVs) directly from a song. AutoMV first applies music processing tools to extract musical attributes, such as structure, vocal tracks, and time-aligned lyrics, and constructs these features as contextual inputs for the following agents. The screenwriter Agent and director Agent then use this information to design a short script, define character profiles in a shared external bank, and specify camera instructions. Subsequently, these agents call the image generator for keyframes and different video generators for "story" or "singer" scenes. A Verifier Agent evaluates their output, enabling multi-agent collaboration to produce a coherent long-form MV. To evaluate M2V generation, we further propose a benchmark with four high-level categories (Music Content, Technical, Post-production, Art) and twelve fine-grained criteria. This benchmark was applied to compare commercial products, AutoMV, and human-directed MVs with expert human raters: AutoMV outperforms current baselines significantly across all four categories, narrowing the gap to professional MVs. Finally, we investigate using large multimodal models as automatic MV judges; while promising, they still lag behind human experts, highlighting room for future work.
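The agent flow described above, analysis tools feeding screenwriter and director agents, generators rendering shots, and a verifier gating each clip, can be written as a control-flow skeleton. Every callable here is a placeholder standing in for the paper's components; the retry-on-rejection loop is an assumption about how the Verifier Agent is used.

```python
def automv_pipeline(song, tools, agents, max_retries=2):
    """Skeleton of the AutoMV flow from the abstract; `tools` and `agents`
    are placeholder objects, not the paper's actual implementations."""
    features = tools.analyze(song)             # structure, vocal track, time-aligned lyrics
    script = agents.screenwriter(features)     # short script + shared character bank
    shots = agents.director(script, features)  # scene list with camera instructions
    clips = []
    for shot in shots:
        clip = None
        for _ in range(max_retries + 1):
            clip = agents.generate(shot)       # keyframe image -> video clip
            if agents.verifier(clip, shot):    # verifier gates each clip
                break
        clips.append(clip)
    return clips
```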
Submitted 13 December, 2025;
originally announced December 2025.
-
Real-Time-Capable Betatron Tune Measurement from Schottky Spectra Using Deep Learning and Uncertainty-Aware Kalman Filtering
Authors:
Peihan Sun,
Manzhou Zhang,
Renxian Yuan,
Deming Li,
Jian Dong,
Ying Shi
Abstract:
Betatron tune measurement is essential for beam control in compact proton-therapy synchrotrons, yet conventional peak-detection techniques are not robust under the low signal-to-noise ratio (SNR) conditions typical of these machines. This work presents a lightweight convolutional neural network that performs real-time tune extraction from Schottky spectra with sub-millisecond inference latency and calibrated uncertainty estimates. The model uses attention-based pooling for reliable peak localization and a dual-branch architecture that jointly predicts the tune and its associated uncertainty. Trained with a Laplace negative log-likelihood loss, it produces uncertainty estimates whose magnitude tracks the instantaneous prediction error, which enables uncertainty-aware Kalman filtering for temporal smoothing. Experiments on a large synthetic dataset spanning SNR levels from 0 to $-20$\,dB demonstrate substantial performance gains over traditional peak-detection baselines, while the Kalman filter further suppresses transient outliers in time-series operation. Preliminary validation on operational beam data confirms stable tune tracking without retraining. With only about $2.0\times 10^{4}$ trainable parameters and real-time inference on commodity GPU hardware, the proposed diagnostic offers a practical solution for rapid and accurate betatron tune monitoring in compact medical synchrotrons and similar accelerators.
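The uncertainty-aware Kalman filtering step is standard enough to sketch: a 1-D random-walk filter whose per-step measurement variance is the network's predicted uncertainty, so low-confidence spectra barely move the estimate. A minimal sketch, with the process noise `q` and the random-walk model as assumptions rather than the paper's tuned values:

```python
def uncertainty_aware_kalman(measurements, sigmas, q=1e-6, x0=None, p0=1.0):
    """1-D Kalman filter for a slowly drifting tune under a random-walk
    model: the measurement variance at each step comes from the network's
    predicted uncertainty, down-weighting low-confidence predictions."""
    x = measurements[0] if x0 is None else x0
    p = p0
    estimates = []
    for z, s in zip(measurements, sigmas):
        p = p + q               # predict: random-walk process noise
        r = s * s               # measurement variance from NN uncertainty
        k = p / (p + r)         # Kalman gain
        x = x + k * (z - x)     # update with the innovation
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates
```

A transient outlier reported with large sigma is effectively ignored, which is exactly the outlier suppression the paper attributes to the filter.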
Submitted 9 December, 2025;
originally announced December 2025.
-
Controllable risk scenario generation from human crash data for autonomous vehicle testing
Authors:
Qiujing Lu,
Xuanhan Wang,
Runze Yuan,
Wei Lu,
Xinyi Gong,
Shuo Feng
Abstract:
Ensuring the safety of autonomous vehicles (AV) requires rigorous testing under both everyday driving and rare, safety-critical conditions. A key challenge lies in simulating environment agents, including background vehicles (BVs) and vulnerable road users (VRUs), that behave realistically in nominal traffic while also exhibiting risk-prone behaviors consistent with real-world accidents. We introduce Controllable Risk Agent Generation (CRAG), a framework designed to unify the modeling of dominant nominal behaviors and rare safety-critical behaviors. CRAG constructs a structured latent space that disentangles normal and risk-related behaviors, enabling efficient use of limited crash data. By combining risk-aware latent representations with optimization-based mode-transition mechanisms, the framework allows agents to shift smoothly and plausibly from safe to risk states over extended horizons, while maintaining high fidelity in both regimes. Extensive experiments show that CRAG improves diversity compared to existing baselines, while also enabling controllable generation of risk scenarios for targeted and efficient evaluation of AV robustness.
Submitted 26 November, 2025;
originally announced December 2025.
-
Surrogate-assisted airfoil optimization in rarefied gas flows
Authors:
Xiaoda Li,
Ruifeng Yuan,
Yanbing Zhang,
Lei Wu
Abstract:
With growing interest in space exploration, optimized airfoil design has become increasingly important. However, airfoil design in rarefied gas flows remains underexplored because solving the Boltzmann equation formulated in a six-dimensional phase space is time-consuming. To address this problem, a solver-in-the-loop Bayesian optimization framework for symmetric, thickness-only airfoils is developed. First, airfoils are parameterized using a class shape transformation that enforces geometric admissibility. Second, a Gaussian process expected improvement surrogate is coupled in batches to a fast-converging, asymptotic-preserving Boltzmann solver for sample-efficient exploration. Drag-minimizing airfoils are identified in a wide range of gas rarefaction. It is found that, at Mach numbers Ma=2 and 4, the streamwise force increases with the gas rarefaction and shifts from pressure-dominated to shear-dominated drag, while optimization reduces drag at all conditions. The benefit of optimization peaks in the weakly rarefied regime, about 30% at Ma=2 and 40 to 50% at Ma=4, and falls to a few percent in transition and free-molecular flow regimes. Drag decomposition shows that these gains come mainly from reduced pressure drag, with viscous drag almost unchanged. The optimal airfoils form a coherent rarefaction-aware family: they retain a smooth, single-peaked thickness profile, are aft-loaded at low gas rarefaction, and exhibit a forward shift of maximum thickness and thickness area toward mid-chord as gas rarefaction increases. These trends provide a physically interpretable map that narrows the design space.
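The acquisition function driving the surrogate, expected improvement under a Gaussian-process posterior, has a well-known closed form for minimization. A minimal sketch (the exploration margin `xi` and its default are assumptions; the paper's batched coupling to the Boltzmann solver is not shown):

```python
from statistics import NormalDist

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """Closed-form EI for minimization: the expected amount by which a
    candidate airfoil, whose drag has GP posterior N(mu, sigma^2), beats
    the best drag observed so far, with exploration margin xi."""
    if sigma <= 0.0:
        return max(f_best - mu - xi, 0.0)   # no posterior uncertainty
    z = (f_best - mu - xi) / sigma
    n = NormalDist()
    return (f_best - mu - xi) * n.cdf(z) + sigma * n.pdf(z)
```

Candidates whose posterior mean drag sits well below the incumbent get large EI; candidates predicted worse than the incumbent still get a small positive EI from their uncertainty, which is what keeps the search exploring.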
Submitted 7 December, 2025;
originally announced December 2025.
-
Multi-Accent Mandarin Dry-Vocal Singing Dataset: Benchmark for Singing Accent Recognition
Authors:
Zihao Wang,
Ruibin Yuan,
Ziqi Geng,
Hengjia Li,
Xingwei Qu,
Xinyi Li,
Songye Chen,
Haoying Fu,
Roger B. Dannenberg,
Kejun Zhang
Abstract:
Singing accent research is underexplored compared to speech accent studies, primarily due to the scarcity of suitable datasets. Existing singing datasets often suffer from detail loss, frequently resulting from the vocal-instrumental separation process. Additionally, they often lack regional accent annotations. To address this, we introduce the Multi-Accent Mandarin Dry-Vocal Singing Dataset (MADVSD). MADVSD comprises over 670 hours of dry vocal recordings from 4,206 native Mandarin speakers across nine distinct Chinese regions. Each participant recorded three popular songs in their native accent, as well as phonetic exercises covering all Mandarin vowels and a full octave range. We validated MADVSD through benchmark experiments in singing accent recognition, demonstrating its utility for evaluating state-of-the-art speech models in singing contexts. Furthermore, we explored dialectal influences on singing accent and analyzed the role of vowels in accentual variations, leveraging MADVSD's unique phonetic exercises.
Submitted 7 December, 2025;
originally announced December 2025.
-
Singing Timbre Popularity Assessment Based on Multimodal Large Foundation Model
Authors:
Zihao Wang,
Ruibin Yuan,
Ziqi Geng,
Hengjia Li,
Xingwei Qu,
Xinyi Li,
Songye Chen,
Haoying Fu,
Roger B. Dannenberg,
Kejun Zhang
Abstract:
Automated singing assessment is crucial for education and entertainment. However, existing systems face two fundamental limitations: reliance on reference tracks, which stifles creative expression, and the simplification of complex performances into non-diagnostic scores based solely on pitch and rhythm. We advocate for a shift from discriminative to descriptive evaluation, creating a complete ecosystem for reference-free, multi-dimensional assessment. First, we introduce Sing-MD, a large-scale dataset annotated by experts across four dimensions: breath control, timbre quality, emotional expression, and vocal technique. Our analysis reveals significant annotation inconsistencies among experts, challenging the validity of traditional accuracy-based metrics. Second, addressing the memory limitations of Multimodal Large Language Models (MLLMs) in analyzing full-length songs, we propose VocalVerse. This efficient hybrid architecture leverages a lightweight acoustic encoder to model global performance features and long-term dependencies. Third, to address automated metric shortcomings, we establish the H-TPR (Human-in-the-loop Tiered Perceptual Ranking) benchmark, which evaluates a model's ability to generate perceptually valid rankings rather than predicting noisy ground-truth scores.
Submitted 7 December, 2025;
originally announced December 2025.
-
The Initialization Determines Whether In-Context Learning Is Gradient Descent
Authors:
Shifeng Xie,
Rui Yuan,
Simone Rossi,
Thomas Hannagan
Abstract:
In-context learning (ICL) in large language models (LLMs) is a striking phenomenon, yet its underlying mechanisms remain only partially understood. Previous work connects linear self-attention (LSA) to gradient descent (GD), but this connection has primarily been established under simplified conditions with zero-mean Gaussian priors and zero initialization for GD. However, subsequent studies have challenged this simplified view by highlighting its overly restrictive assumptions, demonstrating instead that under conditions such as multi-layer or nonlinear attention, self-attention performs optimization-like inference, akin to but distinct from GD. We investigate how multi-head LSA approximates GD under more realistic conditions, specifically when incorporating non-zero Gaussian prior means in linear regression formulations of ICL. We first extend the multi-head LSA embedding matrix by introducing an initial estimation of the query, referred to as the initial guess. We prove an upper bound on the number of heads needed for the ICL linear regression setup. Our experiments confirm this result and further observe that a performance gap between one-step GD and multi-head LSA persists. To address this gap, we introduce yq-LSA, a simple generalization of single-head LSA with a trainable initial guess yq. We theoretically establish the capabilities of yq-LSA and provide experimental validation on linear regression tasks, thereby extending the theory that bridges ICL and GD. Finally, inspired by our findings in the case of linear regression, we consider widespread LLMs augmented with initial guess capabilities, and show that their performance is improved on a semantic similarity task.
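The baseline being approximated, one GD step on the in-context least-squares loss from an initialization w0, is easy to state concretely. A minimal sketch (the function name and step size are illustrative; the paper's initial guess lives in the attention embedding, which this plain-GD view only mirrors):

```python
import numpy as np

def one_step_gd_prediction(X, y, x_q, w0=None, eta=0.01):
    """Prediction at query x_q after one gradient-descent step on the
    in-context loss 0.5*||y - Xw||^2, starting from w0. A non-zero w0
    plays the role of the 'initial guess' for the query: y_q = w0 @ x_q
    before the step is taken."""
    d = X.shape[1]
    w0 = np.zeros(d) if w0 is None else np.asarray(w0, dtype=float)
    grad = -X.T @ (y - X @ w0)   # gradient of the squared loss at w0
    w1 = w0 - eta * grad
    return float(w1 @ x_q)
```

The zero-initialization setting of prior work is the special case `w0=None`; the non-zero-prior-mean setting corresponds to starting GD (and hence the query's initial guess) somewhere else.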
Submitted 3 December, 2025;
originally announced December 2025.
-
Opening the Black Box: An Explainable, Few-shot AI4E Framework Informed by Physics and Expert Knowledge for Materials Engineering
Authors:
Haoxiang Zhang,
Ruihao Yuan,
Lihui Zhang,
Yushi Luo,
Qiang Zhang,
Pan Ding,
Xiaodong Ren,
Weijie Xing,
Niu Gao,
Jishan Chen,
Chubo Zhang
Abstract:
The industrial adoption of Artificial Intelligence for Engineering (AI4E) faces two fundamental bottlenecks: scarce high-quality data and the lack of interpretability in black-box models-particularly critical in safety-sensitive sectors like aerospace. We present an explainable, few-shot AI4E framework that is systematically informed by physics and expert knowledge throughout its architecture. Starting from only 32 experimental samples in an aerial K439B superalloy castings repair welding case, we first augment physically plausible synthetic data through a three-stage protocol: differentiated noise injection calibrated to process variabilities, enforcement of hard physical constraints, and preservation of inter-parameter relationships. We then employ a nested optimization strategy for constitutive model discovery, where symbolic regression explores equation structures while differential evolution optimizes parameters, followed by intensive parameter refinement using hybrid global-local optimization. The resulting interpretable constitutive equation achieves 88% accuracy in predicting hot-cracking tendency. This equation not only provides quantitative predictions but also delivers explicit physical insight, revealing how thermal, geometric, and metallurgical mechanisms couple to drive cracking-thereby advancing engineers' cognitive understanding of the process. Furthermore, the constitutive equation serves as a multi-functional tool for process optimization and high-fidelity virtual data generation, enabling accuracy improvements in other data-driven models. Our approach provides a general blueprint for developing trustworthy AI systems that embed engineering domain knowledge directly into their architecture, enabling reliable adoption in high-stakes industrial applications where data is limited but physical understanding is available.
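The nested "structure search outside, parameter fit inside" loop can be sketched in miniature. This is a toy stand-in: closed-form least squares replaces the paper's differential evolution for the inner fit (exact here because each toy structure is linear in its parameters), and the two candidate structures are hypothetical.

```python
import numpy as np

def fit_structure(basis_fns, X, y):
    """Inner loop: best-fit parameters for one candidate equation structure.
    Closed-form least squares stands in for the paper's differential
    evolution; both minimize the same fitting error."""
    A = np.column_stack([g(X) for g in basis_fns])
    p, *_ = np.linalg.lstsq(A, y, rcond=None)
    err = float(np.mean((A @ p - y) ** 2))
    return p, err

def discover_equation(structures, X, y):
    """Outer loop (the symbolic-regression role): search over candidate
    structures, keep the one whose best fit minimizes the error."""
    scored = {name: fit_structure(basis, X, y) for name, basis in structures.items()}
    best = min(scored, key=lambda name: scored[name][1])
    return best, scored[best][0], scored[best][1]
```

Given data generated by a quadratic law, the outer loop correctly prefers the quadratic structure over a linear one, mirroring how the paper selects an interpretable constitutive form before refining its parameters.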
Submitted 28 November, 2025;
originally announced December 2025.
-
SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling
Authors:
Yang Xiao,
Chunpu Xu,
Ruifeng Yuan,
Jiashuo Wang,
Wenjie Li,
Pengfei Liu
Abstract:
Test-time compute scaling has emerged as a powerful paradigm for enhancing mathematical reasoning in large language models (LLMs) by allocating additional computational resources during inference. However, current methods employ uniform resource distribution across all reasoning sub-problems, creating fundamental bottlenecks where challenging sub-problems receive insufficient attention while routine operations consume disproportionate resources. This uniform allocation creates performance bottlenecks where additional computational resources yield diminishing returns. Inspired by dual-process theory, we propose \textbf{SCALE} (Selective Resource Allocation), a framework that selectively allocates computational resources based on sub-problem difficulty. SCALE operates through four stages: (1) problem decomposition into sequential reasoning sub-problems, (2) difficulty assessment of each sub-problem to distinguish between routine operations and computationally challenging sub-problems, (3) selective processing mode assignment between System 1 for simple sub-problems and System 2 for complex ones, and (4) sequential execution with context propagation. By concentrating resources on challenging sub-problems while processing routine operations efficiently, SCALE achieves substantial performance improvements with superior resource utilization. Extensive experiments demonstrate that SCALE significantly outperforms uniform scaling baselines, achieving accuracy improvements of up to 13.75 percentage points (57.50% to 71.25% on AIME25) while reducing computational costs by 33%-53%, representing a major advance in test-time scaling that addresses fundamental limitations of current approaches.
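The four-stage loop (decompose, assess, route, execute with context propagation) reduces to a small dispatcher. All callables below are placeholders and the threshold is an assumption; the sketch only shows the control flow, not the LLM-based components.

```python
def scale_solve(problem, decompose, assess, system1, system2, threshold=0.5):
    """SCALE-style selective allocation: decompose the problem, score each
    sub-problem's difficulty, route easy ones to cheap System-1 decoding and
    hard ones to expensive System-2 processing (e.g. extended reasoning),
    propagating solved context forward. All callables are placeholders."""
    context, trace = [], []
    for sub in decompose(problem):
        difficulty = assess(sub, context)
        solver = system2 if difficulty >= threshold else system1
        answer = solver(sub, context)
        context.append((sub, answer))                  # context propagation
        trace.append((sub, difficulty, answer))
    return trace
```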
Submitted 29 November, 2025;
originally announced December 2025.
-
Digital Twin-Driven Secure Access Strategy for SAGIN-Enabled IoT Networks
Authors:
Hui Liang,
Zhihui Wu,
Runqi Yuan,
Guobin Zhang,
Yanfeng Zhang,
Jinkai Zheng,
Tom H. Luan
Abstract:
In space-air-ground integrated networks (SAGIN)-enabled IoT networks, secure access has become a significant challenge due to the increasing risks of eavesdropping attacks. To address these threats to data confidentiality, this paper proposes a Digital Twin (DT)-driven secure access strategy. The strategy leverages a virtual replica of the physical SAGIN environment within the DT framework to continuously assess dynamic eavesdropping risks by quantifying secrecy capacity. Operating within this DT framework, an evolutionary game model dynamically balances the DT-updated secrecy capacity against queuing delay, steering IoT devices toward more secure and efficient access decisions. Furthermore, a novel distributed algorithm, integral to the DT operation, is developed to obtain the equilibrium access strategy for each device in a scalable manner. Simulation results demonstrate that the proposed DT-based approach substantially improves the security of SAGIN-enabled IoT networks. Additionally, it effectively balances system load, prevents overload occurrences, and decreases queuing delay compared to benchmark schemes, thereby comprehensively improving overall network performance.
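The quantity the DT continuously assesses, secrecy capacity, has a standard form for a Gaussian wiretap link: the legitimate channel's capacity minus the eavesdropper's, floored at zero. A minimal sketch of that formula (the SAGIN-specific channel models feeding the SNRs are not shown):

```python
import math

def secrecy_capacity(snr_main, snr_eve):
    """Secrecy capacity of a Gaussian wiretap link in bits/s/Hz:
    C_s = max(0, log2(1 + SNR_main) - log2(1 + SNR_eve)).
    Zero whenever the eavesdropper's channel is at least as good."""
    return max(0.0, math.log2(1.0 + snr_main) - math.log2(1.0 + snr_eve))
```

In the proposed strategy this per-link quantity is what the evolutionary game trades off against queuing delay when steering devices among satellite, aerial, and ground access points.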
Submitted 26 November, 2025;
originally announced November 2025.
-
HunyuanVideo 1.5 Technical Report
Authors:
Bing Wu,
Chang Zou,
Changlin Li,
Duojun Huang,
Fang Yang,
Hao Tan,
Jack Peng,
Jianbing Wu,
Jiangfeng Xiong,
Jie Jiang,
Linus,
Patrol,
Peizhen Zhang,
Peng Chen,
Penghao Zhao,
Qi Tian,
Songtao Liu,
Weijie Kong,
Weiyan Wang,
Xiao He,
Xin Li,
Xinchi Deng,
Xuefei Zhe,
Yang Li,
Yanxin Long
, et al. (56 additional authors not shown)
Abstract:
We present HunyuanVideo 1.5, a lightweight yet powerful open-source video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture featuring selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source video generation models. By releasing the code and model weights, we provide the community with a high-performance foundation that lowers the barrier to video creation and research, making advanced video generation accessible to a broader audience. All open-source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.
Submitted 24 November, 2025; v1 submitted 24 November, 2025;
originally announced November 2025.
-
A fast-converging and asymptotic-preserving method for adjoint shape optimization of rarefied gas flows
Authors:
Yanbing Zhang,
Ruifeng Yuan,
Lei Wu
Abstract:
Adjoint-based shape optimization is a powerful technique in fluid-dynamics optimization, capable of identifying an optimal shape within only dozens of design iterations. However, when extended to rarefied gas flows, the computational cost becomes enormous because both the six-dimensional primal and adjoint Boltzmann equations must be solved for each candidate shape. Building on the general synthetic iterative scheme (GSIS) for solving the primal Boltzmann model equation, this paper presents a fast-converging and asymptotic-preserving method for solving the adjoint kinetic equation. The GSIS accelerates the convergence of the adjoint kinetic equation by incorporating solutions of macroscopic synthetic equations, whose constitutive relations include the Newtonian stress law along with higher-order terms capturing rarefaction effects. As a result, the method achieves asymptotic preservation (allowing the use of large spatial cell sizes in the continuum limit) while maintaining accuracy in highly rarefied regimes. Numerical tests demonstrate exceptional performance on drag minimization problems for 3D bodies, achieving drag reductions of 34.5% in the transition regime and 61.1% in the slip-flow regime within roughly ten optimization iterations. For each candidate shape, converged solutions of the primal and adjoint Boltzmann equations are obtained with only a few dozen updates of the velocity distribution function, dramatically reducing computational cost compared with conventional methods.
Submitted 23 November, 2025;
originally announced November 2025.
-
Wireless Power Transfer and Intent-Driven Network Optimization in AAVs-assisted IoT for 6G Sustainable Connectivity
Authors:
Xiaoming He,
Gaofeng Wang,
Huajun Cui,
Rui Yuan,
Haitao Zhao
Abstract:
Autonomous Aerial Vehicle (AAV)-assisted Internet of Things (IoT) represents a collaborative architecture in which AAVs allocate resources over 6G links to jointly enhance user-intent interpretation and overall network performance. Owing to this mutual dependence, improvements in intent inference and policy decisions in one component reinforce the efficiency of the others, making highly reliable intent prediction and low-latency action execution essential. Although numerous approaches can model intent relationships, they encounter severe obstacles when scaling to high-dimensional action sequences and managing intensive on-board computation.
We propose an Intent-Driven Framework for Autonomous Network Optimization comprising prediction and decision modules. First, implicit intent modeling is adopted to mitigate inaccuracies arising from ambiguous user expressions. For prediction, we introduce Hyperdimensional Transformer (HDT), which embeds data into a Hyperdimensional space via Hyperdimensional vector encoding and replaces standard matrix and attention operations with symbolic Hyperdimensional computations. For decision-making, where AAV must respond to user intent while planning trajectories, we design Double Actions based Multi-Agent Proximal Policy Optimization (DA-MAPPO). Building upon MAPPO, it samples actions through two independently parameterized networks and cascades the user-intent network into the trajectory network to maintain action dependencies.
We evaluate our framework on a real IoT action dataset with authentic wireless data. Experimental results demonstrate that HDT and DA-MAPPO achieve superior performance across diverse scenarios.
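The symbolic Hyperdimensional computations that HDT substitutes for standard matrix and attention operations build on the generic bind/bundle primitives of hyperdimensional computing over high-dimensional bipolar vectors. The sketch below illustrates those primitives only; it is not the authors' encoder:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 10_000                                   # hypervector dimensionality

def random_hv():
    """Random bipolar hypervector in {-1, +1}^D; random pairs are
    near-orthogonal, which is what makes symbolic recovery work."""
    return rng.choice([-1, 1], size=D)

def bind(a, b):
    """Binding (elementwise multiply): associates two concepts.
    Self-inverse for bipolar vectors, since each element squares to 1."""
    return a * b

def bundle(*hvs):
    """Bundling (majority sign of the sum): superposes several concepts."""
    return np.sign(np.sum(hvs, axis=0))

def similarity(a, b):
    return float(a @ b) / D                  # normalized dot product

key, value, other = random_hv(), random_hv(), random_hv()
record = bind(key, value)                    # symbolic "key: value" pair
recovered = bind(record, key)                # unbind: key * key = 1, leaves value
bundled = bundle(key, value, other)          # superposition of three concepts
```

Because all operations are elementwise over fixed-width vectors, they map naturally onto hardware-friendly symbolic computation rather than dense matrix products.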
Submitted 28 January, 2026; v1 submitted 23 November, 2025;
originally announced November 2025.
-
A Digital SRAM-Based Compute-In-Memory Macro for Weight-Stationary Dynamic Matrix Multiplication in Transformer Attention Score Computation
Authors:
Jianyi Yu,
Tengxiao Wang,
Yuxuan Wang,
Xiang Fu,
Fei Qiao,
Ying Wang,
Rui Yuan,
Liyuan Liu,
Cong Shi
Abstract:
Compute-in-memory (CIM) techniques are widely employed in energy-efficient artificial intelligence (AI) processors. They alleviate power and latency bottlenecks caused by extensive data movements between compute and storage units. To extend these benefits to Transformers, this brief proposes a digital CIM macro to compute attention scores. To eliminate dynamic matrix multiplication (MM), we reconstruct the computation as static MM using a combined QK-weight matrix, so that inputs can be directly fed to a single CIM macro to obtain the score results. However, this introduces a new challenge of 2-input static MM. The computation is further decomposed into four groups of bit-serial logical and addition operations. This allows the two inputs to directly activate the word line via an AND gate, thus realizing 2-input static MM with minimal overhead. A hierarchical zero-value bit skipping mechanism is introduced to prioritize skipping zero-value bits in the 2-input case. This mechanism effectively exploits the data sparsity of both inputs, significantly reducing redundant operations. Implemented in a 65-nm process, the 0.35 mm² macro delivers 42.27 GOPS at 1.24 mW, yielding 34.1 TOPS/W energy and 120.77 GOPS/mm² area efficiency. Compared to CPUs and GPUs, it achieves ~25x and ~13x higher efficiency, respectively. Against other Transformer CIMs, it demonstrates at least 7x energy and 2x area efficiency gains, highlighting its strong potential for edge intelligence.
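The reconstruction of the attention score as static MM follows from a standard identity: with Q = XW_Q and K = XW_K, the score QK^T equals X(W_Q W_K^T)X^T, so the combined weight W_Q W_K^T can stay stationary in the macro while the runtime input X drives both operands. A quick numerical check of that identity:

```python
import numpy as np

rng = np.random.default_rng(0)
X   = rng.standard_normal((4, 8))    # token activations (runtime input)
W_Q = rng.standard_normal((8, 8))    # static query weights
W_K = rng.standard_normal((8, 8))    # static key weights

# Conventional path: Q and K are dynamic matrices formed at runtime.
scores_dynamic = (X @ W_Q) @ (X @ W_K).T

# CIM-friendly path: precompute the combined static weight once,
# then the score is a 2-input static MM, X @ W @ X.T.
W = W_Q @ W_K.T
scores_static = X @ W @ X.T
```

The remaining hardware challenge the brief addresses is that both operands of the static MM are now runtime inputs, hence the bit-serial AND-gate decomposition.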
Submitted 12 December, 2025; v1 submitted 15 November, 2025;
originally announced November 2025.
-
SuckTac: Camera-based Tactile Sucker for Unstructured Surface Perception and Interaction
Authors:
Ruiyong Yuan,
Jieji Ren,
Zhanxuan Peng,
Feifei Chen,
Guoying Gu
Abstract:
Suckers play a significant role in robotic picking, transferring, manipulation, and locomotion on diverse surfaces. However, most existing suckers lack high-fidelity perceptual and tactile sensing, which prevents them from resolving the fine-grained geometric features and interaction status of the target surface. This limits their robust performance with irregular objects and in complex, unstructured environments. Inspired by the adaptive structure and high-performance sensory capabilities of cephalopod suckers, in this paper we propose a novel, intelligent sucker, named SuckTac, that integrates a camera-based tactile sensor directly within its optimized structure to provide high-density perception and robust suction. Specifically, through joint structural design and optimization, and based on a multi-material integrated casting technique, a camera and light source are embedded into the sucker, enabling in-situ, high-density perception of fine details such as surface shape, texture, and roughness. To further enhance robustness and adaptability, the sucker's mechanical design is also optimized by refining its profile, adding a compliant lip, and incorporating surface microstructure. Extensive experiments, including challenging tasks such as robotic cloth manipulation and soft mobile robot inspection, demonstrate the superior performance and broad applicability of the proposed system.
Submitted 4 November, 2025;
originally announced November 2025.
-
STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence
Authors:
Zihan Liu,
Zhikang Niu,
Qiuyang Xiao,
Zhisheng Zheng,
Ruoqi Yuan,
Yuhang Zang,
Yuhang Cao,
Xiaoyi Dong,
Jianze Liang,
Xie Chen,
Leilei Sun,
Dahua Lin,
Jiaqi Wang
Abstract:
Despite rapid progress in Multi-modal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence, defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Our data curation pipeline uses two methods to ensure high-quality samples. For foundational tasks, we use procedurally synthesized and physics-simulated audio. For holistic data, we follow a four-stage process that includes human annotation and final selection based on human performance. Unlike prior benchmarks where caption-only answering reduces accuracy slightly, STAR-Bench induces far larger drops (-31.5% temporal, -35.2% spatial), evidencing its focus on linguistically hard-to-describe cues. Evaluating 19 models reveals substantial gaps compared with humans and a capability hierarchy: closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning. Our STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.
Submitted 28 November, 2025; v1 submitted 28 October, 2025;
originally announced October 2025.
-
Hollywood Town: Long-Video Generation via Cross-Modal Multi-Agent Orchestration
Authors:
Zheng Wei,
Mingchen Li,
Zeqian Zhang,
Ruibin Yuan,
Pan Hui,
Huamin Qu,
James Evans,
Maneesh Agrawala,
Anyi Rao
Abstract:
Recent advancements in multi-agent systems have demonstrated significant potential for enhancing creative task performance, such as long video generation. This study introduces three innovations to improve multi-agent collaboration. First, we propose OmniAgent, a hierarchical, graph-based multi-agent framework for long video generation that leverages a film-production-inspired architecture to enable modular specialization and scalable inter-agent collaboration. Second, inspired by context engineering, we propose hypergraph nodes that enable temporary group discussions among agents lacking sufficient context, reducing individual memory requirements while ensuring adequate contextual information. Third, we transition from directed acyclic graphs (DAGs) to directed cyclic graphs with limited retries, allowing agents to reflect and refine outputs iteratively, thereby improving earlier stages through feedback from subsequent nodes. These contributions lay the groundwork for developing more robust multi-agent systems in creative tasks.
Submitted 25 October, 2025;
originally announced October 2025.
-
Channel Modeling of Satellite-to-Underwater Laser Communication Links: An Analytical-Monte Carlo Hybrid Approach
Authors:
Zhixing Wang,
Renzhi Yuan,
Haifeng Yao,
Chuang Yang,
Mugen Peng
Abstract:
Channel modeling for satellite-to-underwater laser communication (StULC) links remains challenging due to long distances and the diversity of the channel constituents. The StULC channel is typically segmented into three isolated channels: the atmospheric channel, the air-water interface channel, and the underwater channel. Previous studies of StULC channel modeling either focused on individual channels or neglected the combined effects of particles and turbulence on laser propagation. In this paper, we establish a comprehensive StULC channel model using an analytical-Monte Carlo hybrid approach that accounts for the effects of both particles and turbulence. We first obtain the intensity distribution of the transmitted laser beam after passing through the turbulent atmosphere based on the extended Huygens-Fresnel principle. We then derive a closed-form probability density function of the photon propagation direction after passing through the air-water interface, which greatly simplifies the modeling of StULC links. Finally, we employ a Monte Carlo method to model the underwater links and obtain the power distribution at the receiving plane. Based on the proposed StULC channel model, we analyze the bit error rate and the outage probability under different environmental conditions. Numerical results demonstrate that the influence of underwater particle concentration on communication performance is much more pronounced than that of either atmospheric or underwater turbulence. Notably, increasing the wind speed at the air-water interface does not significantly worsen the communication performance of StULC links.
Submitted 24 September, 2025;
originally announced October 2025.
-
The existence of negatively curved metrics on locally conformally flat manifolds with boundary
Authors:
Rirong Yuan
Abstract:
We use certain Morse functions to construct conformal metrics with negative sectional curvature on locally conformally flat manifolds with boundary. Moreover, without the conformal flatness assumption, we also construct conformal metrics with positive Einstein tensor.
Submitted 20 October, 2025;
originally announced October 2025.
-
Click, Predict, Trust: Clinician-in-the-Loop AI Segmentation for Lung Cancer CT-Based Prognosis within the Knowledge-to-Action Framework
Authors:
Mohammad R. Salmanpour,
Sonya Falahati,
Amir Hossein Pouria,
Amin Mousavi,
Somayeh Sadat Mehrnia,
Morteza Alizadeh,
Arman Gorji,
Zeinab Farsangi,
Alireza Safarian,
Mehdi Maghsudi,
Carlos Uribe,
Arman Rahmim,
Ren Yuan
Abstract:
Lung cancer remains the leading cause of cancer mortality, with CT imaging central to screening, prognosis, and treatment. Manual segmentation is variable and time-intensive, while deep learning (DL) offers automation but faces barriers to clinical adoption. Guided by the Knowledge-to-Action framework, this study develops a clinician-in-the-loop DL pipeline to enhance reproducibility, prognostic accuracy, and clinical trust. Multi-center CT data from 999 patients across 12 public datasets were analyzed using five DL models (3D Attention U-Net, ResUNet, VNet, ReconNet, SAM-Med3D), benchmarked against expert contours on whole and click-point cropped images. Segmentation reproducibility was assessed using 497 PySERA-extracted radiomic features via Spearman correlation, ICC, Wilcoxon tests, and MANOVA, while prognostic modeling compared supervised (SL) and semi-supervised learning (SSL) across 38 dimensionality reduction strategies and 24 classifiers. Six physicians qualitatively evaluated masks across seven domains, including clinical meaningfulness, boundary quality, prognostic value, trust, and workflow integration. VNet achieved the best performance (Dice = 0.83, IoU = 0.71), radiomic stability (mean correlation = 0.76, ICC = 0.65), and predictive accuracy under SSL (accuracy = 0.88, F1 = 0.83). SSL consistently outperformed SL across models. Radiologists favored VNet for peritumoral representation and smoother boundaries, preferring AI-generated initial masks for refinement rather than replacement. These results demonstrate that integrating VNet with SSL yields accurate, reproducible, and clinically trusted CT-based lung cancer prognosis, highlighting a feasible path toward physician-centered AI translation.
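For reference, the reported Dice = 0.83 and IoU = 0.71 are mutually consistent via the identity Dice = 2·IoU/(1 + IoU). A minimal sketch of the two overlap metrics on binary segmentation masks (toy masks, not the study's data):

```python
import numpy as np

def dice_iou(pred, gt):
    """Dice and IoU overlap scores between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dice = 2 * inter / (pred.sum() + gt.sum())
    iou = inter / np.logical_or(pred, gt).sum()
    return dice, iou

pred = np.zeros((4, 4), int); pred[1:3, 1:3] = 1   # 4-pixel square
gt   = np.zeros((4, 4), int); gt[1:3, 1:4] = 1     # 6-pixel rectangle
dice, iou = dice_iou(pred, gt)
# intersection = 4, so Dice = 8/10 = 0.8 and IoU = 4/6 ≈ 0.667,
# which also satisfies Dice = 2*IoU / (1 + IoU).
```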
Submitted 19 October, 2025;
originally announced October 2025.
-
Wetted-Area Minimum and Inlet-Outlet Reciprocity in Optimal Manifolds of Rarefied Gas Flows
Authors:
Ruifeng Yuan,
Lei Wu
Abstract:
While flow optimization has been extensively studied in the continuum regime, its extension to rarefied gas flows remains less explored. Here, based on the Boltzmann model equation, an adjoint topology optimization method is employed to design two-dimensional single-inlet multi-outlet manifolds, aiming to maximize the total mass flow rate while maintaining outflow uniformity. Two key findings are revealed. (1) Analogous to the Knudsen minimum in mass flow rate in the transition regime, a wetted-area minimum is identified, but in the slip-flow regime. This phenomenon arises from the competition between flow-bend loss and surface-friction loss, with the latter being affected by velocity slip at the solid surface. (2) Inlet-outlet reciprocity emerges in the free-molecular flow regime, where the optimal design becomes invariant to inlet-outlet orientation and pressure ratio. Additional insights are gained regarding channel curvature, compressibility effects, and the constraint of outflow uniformity. These findings elucidate the mechanisms governing rarefied gas transport and offer design guidance for manifolds operating in vacuum environments.
Submitted 16 October, 2025;
originally announced October 2025.
-
Auto-scaling Continuous Memory for GUI Agent
Authors:
Wenyi Wu,
Kun Zhou,
Ruoxin Yuan,
Vivian Yu,
Stephen Wang,
Zhiting Hu,
Biwei Huang
Abstract:
We study how to endow GUI agents with scalable memory that helps them generalize across unfamiliar interfaces and long-horizon tasks. Prior GUI agents compress past trajectories into text tokens, which balloons context length and misses decisive visual cues (e.g., exact widget size and position). We propose a continuous memory that encodes each GUI trajectory into a fixed-length sequence of continuous embeddings using the VLM itself as an encoder; these embeddings are plugged directly into the backbone's input layer, sharply reducing context cost while preserving fine-grained visual information. As memory size and retrieval depth increase, performance improves monotonically, unlike text memories that degrade with long prompts. To grow memory at low cost, we introduce an auto-scaling data flywheel that (i) discovers new environments via search, (ii) synthesizes tasks with an open-source VLM, (iii) rolls out trajectories with the agent, and (iv) verifies success with the same VLM. Using this pipeline, we collect 100k+ trajectories for about $4000 and fine-tune only the memory encoder (LoRA on a Q-Former, 1.2% of parameters) with 1,500 samples. On real-world GUI benchmarks, our memory-augmented agent consistently improves success rates under long horizons and distribution shifts. Notably, Qwen-2.5-VL-7B + continuous memory achieves performance comparable to state-of-the-art closed-source models (e.g., GPT-4o, Claude-4).
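The fixed-length property of such a continuous memory can be illustrated with a toy encoder; the mean-pooling compressor below is a hypothetical stand-in for the paper's Q-Former encoder, used only to show how a variable-length trajectory becomes a constant number of embeddings prepended at the input layer:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_mem, n_text = 64, 8, 20

def encode_trajectory(frames, n_mem):
    """Compress a variable-length trajectory (T x d) into exactly n_mem
    embeddings by mean-pooling over T split into n_mem chunks.
    (Illustrative only; the paper uses a learned Q-Former encoder.)"""
    chunks = np.array_split(frames, n_mem)
    return np.stack([c.mean(axis=0) for c in chunks])

trajectory = rng.standard_normal((57, d_model))   # any number of steps T
memory = encode_trajectory(trajectory, n_mem)     # fixed cost regardless of T

# Plugged directly into the backbone's input layer: prepend to text embeddings.
text_embeds = rng.standard_normal((n_text, d_model))
inputs = np.concatenate([memory, text_embeds], axis=0)
```

The context cost of the memory is thus n_mem tokens per trajectory, independent of trajectory length, in contrast to text-token memories whose cost grows with the trajectory.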
Submitted 10 October, 2025;
originally announced October 2025.
-
Topological surface magnon-polariton in an insulating canted antiferromagnet
Authors:
Weixin Li,
Rundong Yuan,
Fenglin Zhong,
Bo Peng,
Jean-Philippe Ansermet,
Haiming Yu
Abstract:
Excitation and control of antiferromagnetic magnon modes lie at the heart of coherent antiferromagnetic spintronics. Here, we propose a topological surface magnon-polariton as a new approach in the prototypical magnonic material hematite. We show that in an insulating canted antiferromagnet, where strongly coupled magnon-photon modes can be achieved using electrical on-chip layouts, a surface magnon-polariton mode exists in the gap of the bulk magnon-photon bands. The emergence of the surface magnon-polariton mode is further attributed to the nontrivial topology of the bulk magnon-photon bands. Magnon-photon coupling enhances the Berry curvature near the anticrossing points, leading to a topological bulk Chern band associated with the surface magnon-polaritons. Our work provides a general principle for the utilization of optomagnetic properties in antiferromagnets, with an illustration of its experimental feasibility and wide generality as manifested in hematite.
Submitted 26 October, 2025; v1 submitted 9 October, 2025;
originally announced October 2025.
-
SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision
Authors:
Chunbo Hao,
Ruibin Yuan,
Jixun Yao,
Qixin Deng,
Xinyi Bai,
Yanbo Wang,
Wei Xue,
Lei Xie
Abstract:
Music structure analysis (MSA) underpins music understanding and controllable generation, yet progress has been limited by small, inconsistent corpora. We present SongFormer, a scalable framework that learns from heterogeneous supervision. SongFormer (i) fuses short- and long-window self-supervised learning representations to capture both fine-grained and long-range dependencies, and (ii) introduces a learned source embedding to enable training with partial, noisy, and schema-mismatched labels. To support scaling and fair evaluation, we release SongFormDB, the largest MSA corpus to date (over 14k songs spanning languages and genres), and SongFormBench, a 300-song expert-verified benchmark. On SongFormBench, SongFormer sets a new state of the art in strict boundary detection (HR.5F) and achieves the highest functional label accuracy, while remaining computationally efficient; it surpasses strong baselines and Gemini 2.5 Pro on these metrics and remains competitive under relaxed tolerance (HR3F). Code, datasets, and model are open-sourced at https://github.com/ASLP-lab/SongFormer.
Submitted 8 April, 2026; v1 submitted 3 October, 2025;
originally announced October 2025.
-
UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice
Authors:
Sitong Cheng,
Weizhen Bian,
Xinsheng Wang,
Ruibin Yuan,
Jianyi Chen,
Shunshun Yin,
Yike Guo,
Wei Xue
Abstract:
The ultimate goal of expressive speech-to-speech translation (S2ST) is to accurately translate spoken content while preserving the speaker identity and emotional style. However, progress in this field is largely hindered by three key challenges: the scarcity of paired speech data that retains expressive styles, the complexity of multi-stage processing pipelines, and the limited transfer of translation capabilities from large language models (LLMs). In this work, we address these challenges by introducing UniSS, a novel single-stage framework for expressive S2ST. Our approach features carefully designed speech semantic and style modeling, enabling seamless integration with existing text-based LLM frameworks to develop a unified text-speech language model. To transfer translation capabilities from text to speech, we propose a cross-modal chain-of-thought prompting process that progressively aligns audio semantics with text and ensures style preservation in the decoded results. Furthermore, we construct and release a large-scale, high-quality expressive S2ST dataset, UniST, comprising 44.8k hours of data. Experimental results show that UniSS significantly outperforms previous methods in translation fidelity and speech quality while preserving voice, emotion, and duration consistency. Our work establishes a simpler and more effective paradigm for building the next generation of expressive S2ST systems. Audio samples are available at https://cmots.github.io/uniss-demo.
Submitted 25 September, 2025;
originally announced September 2025.
-
Multi-Stage CD-Kennedy Receiver for QPSK Modulated CV-QKD in Turbulent Channels
Authors:
Renzhi Yuan,
Zhixing Wang,
Shouye Miao,
Mufei Zhao,
Haifeng Yao,
Bin Cao,
Mugen Peng
Abstract:
Continuous-variable quantum key distribution (CV-QKD) protocols have attracted increasing attention in recent years because they offer a high secret key rate (SKR) and good compatibility with existing optical communication infrastructure. Classical coherent receivers are widely employed in coherent-state-based CV-QKD protocols, whose detection performance is bounded by the standard quantum limit (SQL). Recently, quantum receivers based on displacement operators have been experimentally demonstrated with detection performance surpassing the SQL under various practical conditions. However, potential applications of quantum receivers in CV-QKD protocols over turbulent channels remain largely unexplored, even though practical CV-QKD protocols must survive atmospheric turbulence in satellite-to-ground optical communication links. In this paper, we consider the possibility of using a quantum receiver, the multi-stage CD-Kennedy receiver, to enhance the SKR performance of a quadrature phase shift keying (QPSK) modulated CV-QKD protocol in turbulent channels. We first derive the error probability of the multi-stage CD-Kennedy receiver for detecting QPSK signals in turbulent channels and further propose three variants with different displacement choices, i.e., the Type-I, Type-II, and Type-III receivers. We then derive the SKR of a QPSK-modulated CV-QKD protocol using the multi-stage CD-Kennedy receiver and a post-selection strategy in turbulent channels. Numerical results show that the multi-stage CD-Kennedy receiver can outperform the classical coherent receiver in turbulent channels in terms of both error probability and SKR, and that the Type-II receiver can tolerate worse channel conditions than the Type-I and Type-III receivers in terms of error probability.
Submitted 24 September, 2025;
originally announced September 2025.
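As a simplified illustration of why displacement-based receivers can beat the SQL, the sketch below compares an ideal Kennedy receiver with a homodyne receiver for binary (BPSK) coherent states, then averages over a toy log-normal fading model for the channel transmittance. The BPSK setting, the closed-form error expressions, and the fading model are textbook simplifications and assumptions on my part, not the paper's multi-stage QPSK analysis.

```python
import math
import random

def p_err_kennedy_bpsk(nbar):
    """Ideal Kennedy receiver for BPSK coherent states |±α⟩ (n̄ = |α|²):
    displace by +α so |−α⟩ maps to vacuum, then photon-count.
    Only the '+α sent, no click' branch errs: P = ½·exp(−4·n̄)."""
    return 0.5 * math.exp(-4.0 * nbar)

def p_err_homodyne_bpsk(nbar):
    """Homodyne (classical coherent) receiver, i.e. the standard
    quantum limit for BPSK: P = ½·erfc(√(2·n̄))."""
    return 0.5 * math.erfc(math.sqrt(2.0 * nbar))

def p_err_turbulent(p_err, nbar, sigma=0.3, samples=20000):
    """Average a fixed-channel error probability over log-normal
    transmittance fading (a common toy turbulence model; the paper's
    channel treatment is more detailed)."""
    random.seed(0)
    total = 0.0
    for _ in range(samples):
        eta = math.exp(random.gauss(-sigma**2 / 2.0, sigma))
        total += p_err(nbar * min(eta, 1.0))  # transmittance ≤ 1
    return total / samples
```

At a mean photon number of n̄ = 1, the Kennedy receiver's error probability already falls below the homodyne SQL, and fading only degrades both, which is the regime the paper's post-selection strategy is designed to exploit.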
-
WenetSpeech-Yue: A Large-scale Cantonese Speech Corpus with Multi-dimensional Annotation
Authors:
Longhao Li,
Zhao Guo,
Hongjie Chen,
Yuhang Dai,
Ziyu Zhang,
Hongfei Xue,
Tianlun Zuo,
Chengyou Wang,
Shuiyuan Wang,
Jie Li,
Jian Kang,
Xin Xu,
Hui Bu,
Binbin Zhang,
Ruibin Yuan,
Ziya Zhou,
Wei Xue,
Lei Xie
Abstract:
The development of speech understanding and generation has been significantly accelerated by the availability of large-scale, high-quality speech datasets. Among these tasks, ASR and TTS are regarded as the most established and fundamental. However, for Cantonese (Yue Chinese), spoken by approximately 84.9 million native speakers worldwide, limited annotated resources have hindered progress and resulted in suboptimal ASR and TTS performance. To address this challenge, we propose WenetSpeech-Pipe, an integrated pipeline for building large-scale speech corpora with multi-dimensional annotation tailored for speech understanding and generation. It comprises six modules: Audio Collection, Speaker Attributes Annotation, Speech Quality Annotation, Automatic Speech Recognition, Text Postprocessing, and Recognizer Output Voting, enabling rich and high-quality annotations. Based on this pipeline, we release WenetSpeech-Yue, the first large-scale Cantonese speech corpus with multi-dimensional annotation for ASR and TTS, covering 21,800 hours across 10 domains with annotations including ASR transcription, text confidence, speaker identity, age, gender, and speech quality scores, among others. We also release WSYue-eval, a comprehensive Cantonese benchmark with two components: WSYue-ASR-eval, a manually annotated set for evaluating ASR on short and long utterances, code-switching, and diverse acoustic conditions, and WSYue-TTS-eval, with base and coverage subsets for standard and generalization testing. Experimental results show that models trained on WenetSpeech-Yue achieve competitive results against state-of-the-art (SOTA) Cantonese ASR and TTS systems, including commercial and LLM-based models, highlighting the value of our dataset and pipeline.
Submitted 5 September, 2025; v1 submitted 4 September, 2025;
originally announced September 2025.
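The Recognizer Output Voting module can be pictured as a majority vote over hypotheses from several ASR systems. The sketch below is a minimal position-wise vote over pre-aligned, equal-length token sequences; the function name and the romanized Cantonese tokens are illustrative inventions, and real ROVER-style voting additionally performs edit-distance alignment and confidence weighting, which this toy omits.

```python
from collections import Counter

def vote(hypotheses):
    """Position-wise majority vote over equal-length, pre-aligned token
    sequences produced by several ASR systems. Ties resolve to the token
    seen first, per Counter.most_common's insertion-order tie-breaking."""
    assert len({len(h) for h in hypotheses}) == 1, "hypotheses must be aligned"
    voted = []
    for tokens in zip(*hypotheses):
        word, _count = Counter(tokens).most_common(1)[0]
        voted.append(word)
    return voted

# Three hypothetical recognizer outputs for the same utterance:
hyps = [
    ["nei", "hou", "ma"],
    ["nei", "hou", "maa"],
    ["lei", "hou", "maa"],
]
print(vote(hyps))  # each position keeps the majority token
```

Voting like this is what lets the pipeline attach a per-token text-confidence score: positions where the recognizers disagree get lower confidence than unanimous ones.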
-
Altermagnetic Shastry-Sutherland fullerene networks
Authors:
Jiaqi Wu,
Alaric Sanders,
Rundong Yuan,
Bo Peng
Abstract:
The interplay between quantum magnetism and many-body physics is of fundamental importance in condensed matter physics. Molecular building blocks provide a versatile platform for exploring the exotic quantum phases arising from complex orderings in frustrated lattices. Here we demonstrate a showcase system based on altermagnetic Shastry-Sutherland fullerene networks, which can be constructed from a C$_{40}$ molecular synthon with two effective spin-1/2 sites due to the resonance structures. The charge-neutral, pure-carbon systems exhibit an altermagnetic ground state with fully compensated spins arranged in alternating C$_{40}$ units in a 2D rutile-like lattice, leading to $d$-wave splitting of the spin-polarised electronic band structure and strong chiral-split magnons. We report a rich phase diagram including altermagnetic, quantum spin liquid, plaquette, and dimer phases, which can be accessed via moderate strains. Our findings open a new avenue for exploring quantum many-body physics based on scalable, chemically feasible, molecular quantum materials.
Submitted 11 September, 2025; v1 submitted 28 August, 2025;
originally announced August 2025.
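The dimer and plaquette phases in the abstract are hallmarks of the Shastry-Sutherland model. A standard form of its Hamiltonian, with generic Heisenberg couplings $J$ and $J'$ (placeholder symbols, not values fitted in the paper), is:

```latex
H \;=\; J \sum_{\langle i,j \rangle} \mathbf{S}_i \cdot \mathbf{S}_j
\;+\; J' \sum_{\langle\langle i,j \rangle\rangle_{\mathrm{dimer}}}
        \mathbf{S}_i \cdot \mathbf{S}_j
```

Here the first sum runs over nearest-neighbour bonds of the square lattice and the second over the orthogonal diagonal dimer bonds (conventions for which coupling sits on the diagonals vary in the literature). The orthogonal-dimer geometry admits an exact product of singlet dimers in one coupling regime, and tuning the ratio of the two couplings, here proposed via moderate strain, drives the dimer-to-plaquette-to-ordered sequence of phases the abstract reports.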