-
Scouting By Reward: VLM-to-IRL-Driven Player Selection For Esports
Authors:
Qing Yan,
Wenyu Yang,
Yufei Wang,
Wenhao Ma,
Linchong Hu,
Yifei Jin,
Anton Dahbura
Abstract:
Traditional esports scouting workflows rely heavily on manual video review and aggregate performance metrics, which often fail to capture the nuanced decision-making patterns necessary to determine if a prospect fits a specific tactical archetype. To address this, we reframe style-based player evaluation in esports as an Inverse Reinforcement Learning (IRL) problem. In this paper, we introduce a novel player selection framework that learns professional-specific reward functions from logged gameplay demonstrations, allowing organizations to rank candidates by their stylistic alignment with a target star player. Our proposed architecture utilizes a multimodal, two-branch intake: one branch encodes structured state-action trajectories derived from high-resolution in-game telemetry, while the second encodes temporally aligned tactical pseudo-commentary generated by Vision-Language Models (VLMs) from broadcast footage. These representations are fused and evaluated via a Generative Adversarial Imitation Learning (GAIL) objective, where a discriminator learns to capture the unique mechanical and tactical signatures of elite professionals. By transitioning from generic skill estimation to scouting "by reward," this framework provides a scalable, workflow-aware digital twin system that enables data-driven roster construction and targeted talent discovery across massive candidate pools.
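The GAIL-style objective above can be sketched as a logistic discriminator over trajectory features, with candidates ranked by how often they fool it. The feature construction and all names below (`expert_feats`, `candidate_feats`) are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator_loss(w, expert_feats, candidate_feats):
    """Binary logistic loss: the discriminator tries to score the target
    professional's fused state-action features near 1 and a candidate's
    features near 0 (a linear stand-in for the learned discriminator)."""
    d_expert = sigmoid(expert_feats @ w)
    d_cand = sigmoid(candidate_feats @ w)
    return -np.mean(np.log(d_expert + 1e-8)) - np.mean(np.log(1.0 - d_cand + 1e-8))

def style_alignment_score(w, candidate_feats):
    """Rank candidates by how often the trained discriminator mistakes
    them for the target professional (higher = more stylistically aligned)."""
    return float(np.mean(sigmoid(candidate_feats @ w)))
```

In practice the discriminator would be a neural network over the fused telemetry and VLM-commentary embeddings; the linear scorer here only illustrates the "scouting by reward" ranking step.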
Submitted 15 April, 2026;
originally announced April 2026.
-
Detecting and Enhancing Intellectual Humility in Online Political Discourse
Authors:
Samantha D'Alonzo,
Rachel Chen,
Weidong Zhang,
Melody Yu,
Jasmine Mangat,
Ivory Yang,
Weicheng Ma,
Martin Saveski,
Soroush Vosoughi,
Nabeel Gillani
Abstract:
Intellectual humility (IH), a recognition of one's own intellectual limitations, can reduce polarization and foster more understanding across lines of difference. Yet little work explores how IH can be systematically defined, measured, evaluated, and enhanced in the spaces that often lack it most: online political discussions. In this paper, we seek to bridge these gaps by exploring two questions: 1) how might preexisting levels of IH influence future expressions of IH during online political discourse? and 2) can online interventions enhance IH across different political topics and conversational environments? To pursue these questions, we define a codebook characterizing different dimensions of IH and intellectual arrogance (IA) and have researchers use it to annotate several hundred Reddit posts, which we then use to develop and validate a classifier to support IH analysis at scale. These tools subsequently enable two key contributions: i) an observational data analysis of how IH varies across different political discussions on Reddit, which reveals that environments with more (or less) IH tend to contain future posts of a similar nature, and ii) a randomized controlled trial evaluating strategies for nudging discussion participants to demonstrate more IH in their posts, which reveals the possibility of enhancing IH in online discussions across a range of contentious topics. Our findings highlight the possibility of measuring and increasing IH online without necessarily reducing engagement.
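The classifier-development step above can be sketched as a toy bag-of-words logistic regression trained on annotated posts. The vocabulary, example posts, and labels below are invented stand-ins for the codebook-annotated Reddit data, not the paper's materials:

```python
import numpy as np

# Toy vocabulary of hedging vs. dismissive markers (illustrative only).
VOCAB = ["maybe", "perhaps", "wrong", "obviously", "idiot", "fact"]

def featurize(post):
    tokens = post.lower().split()
    return np.array([tokens.count(w) for w in VOCAB], dtype=float)

def train_logreg(X, y, lr=0.5, steps=500):
    """Plain gradient-descent logistic regression as a stand-in for the
    validated IH classifier described in the abstract."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

# Tiny invented training set: 1 = intellectually humble, 0 = arrogant.
posts = ["maybe I am wrong perhaps", "obviously you are an idiot",
         "perhaps consider this", "that is obviously a fact idiot"]
labels = np.array([1, 0, 1, 0])
X = np.vstack([featurize(p) for p in posts])
w = train_logreg(X, labels)
```

A production classifier for IH analysis at scale would of course use richer text representations; the sketch only illustrates the annotate-train-validate pipeline.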
Submitted 14 April, 2026;
originally announced April 2026.
-
A Mamba-Based Multimodal Network for Multiscale Blast-Induced Rapid Structural Damage Assessment
Authors:
Wanli Ma,
Sivasakthy Selvakumaran,
Dain G. Farrimond,
Adam A. Dennis,
Samuel E. Rigby
Abstract:
Accurate and rapid structural damage assessment (SDA) is crucial for post-disaster management, helping responders prioritise resources, plan rescues, and support recovery. Traditional field inspections, though precise, are limited by accessibility, safety risks, and time constraints, especially after large explosions. Machine learning with remote sensing has emerged as a scalable solution for rapid SDA, with Mamba-based networks achieving state-of-the-art performance. However, these methods often require extensive training and large datasets, limiting real-world applicability. Moreover, they fail to incorporate key physical characteristics of blast loading for SDA. To overcome these challenges, we propose a Mamba-based multimodal network for rapid SDA that integrates multi-scale blast-loading information with optical remote sensing images. Evaluated on the 2020 Beirut explosion, our method significantly improves performance over state-of-the-art approaches. Code is available at: https://github.com/IMPACTSquad/Blast-Mamba
Submitted 13 April, 2026;
originally announced April 2026.
-
Delving Aleatoric Uncertainty in Medical Image Segmentation via Vision Foundation Models
Authors:
Ruiyang Li,
Fang Liu,
Licheng Jiao,
Xinglin Xie,
Jiayao Hao,
Shuo Li,
Xu Liu,
Jingyi Yang,
Lingling Li,
Puhua Chen,
Wenping Ma
Abstract:
Medical image segmentation supports clinical workflows by precisely delineating anatomical structures and lesions. However, medical image datasets suffer from acquisition noise and annotation ambiguity, causing pervasive data uncertainty that substantially undermines model robustness. Existing research focuses primarily on model architectural improvements and predictive reliability estimation, while systematic exploration of the intrinsic data uncertainty remains insufficient. To address this gap, this work proposes leveraging the universal representation capabilities of visual foundation models to estimate inherent data uncertainty. Specifically, we analyze the feature diversity of the model's decoded representations and quantify their singular value energy to define the semantic perception scale for each class, thereby measuring sample difficulty and aleatoric uncertainty. Based on this foundation, we design two uncertainty-driven application strategies: (1) the aleatoric uncertainty-aware data filtering mechanism to eliminate potentially noisy samples and enhance model learning quality; (2) the dynamic uncertainty-aware optimization strategy that adaptively adjusts class-specific loss weights during training based on the semantic perception scale, combined with a label denoising mechanism to improve training stability. Experimental results on five public datasets encompassing CT and MRI modalities and involving multi-organ and tumor segmentation tasks demonstrate that our method achieves significant and robust performance improvements across various mainstream network architectures, revealing the broad application potential of aleatoric uncertainty in medical image understanding and segmentation tasks.
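The singular-value-energy idea above can be sketched as follows; the exact definition of the semantic perception scale and the filtering rule in the paper are not public, so the energy fraction and threshold below are assumptions:

```python
import numpy as np

def singular_value_energy(features, k=None):
    """Fraction of total spectral energy in the top-k singular values of a
    (pixels x channels) feature matrix. A concentrated spectrum indicates
    low feature diversity, here used as a proxy for the per-class
    semantic perception scale (illustrative definition)."""
    s = np.linalg.svd(features, compute_uv=False)
    energy = s ** 2
    k = k or max(1, len(s) // 4)
    return float(energy[:k].sum() / energy.sum())

def aleatoric_filter(per_sample_scales, keep_fraction=0.8):
    """Drop the samples whose scale marks them as most uncertain
    (illustrative threshold rule, not the paper's exact mechanism)."""
    order = np.argsort(per_sample_scales)[::-1]  # high scale = low uncertainty
    n_keep = int(len(order) * keep_fraction)
    return sorted(order[:n_keep].tolist())
```

A rank-deficient feature matrix concentrates nearly all energy in its leading singular values, whereas noisy, diverse features spread it out; that contrast is what the filtering strategy exploits.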
Submitted 12 April, 2026;
originally announced April 2026.
-
VCC-DSA: A Novel Vascular Consistency Constrained DSA Imaging Model for Motion Artifact Suppression
Authors:
Rongjun Ge,
Weilong Mao,
Jian Lu,
Rong Yan,
Yikun Zhang,
Peng Yuan,
Jun Xiang,
Hui Tang,
Guanyu Yang,
Yudong Zhang,
Yang Chen,
Shuo Li
Abstract:
Digital Subtraction Angiography (DSA) is a clinically significant imaging technique and the gold standard for diagnosing cerebrovascular disease. However, artifacts caused by the motion of high-attenuation tissues such as bones, teeth, and catheters severely reduce the visibility of blood vessels. This paper presents a novel Vascular Consistency Constrained DSA Imaging Model (VCC-DSA) for robust motion-artifact suppression and precise vascular imaging, with the following designs: 1) We design a Learning-based Subtraction Mapping Paradigm that resolves the ill-posed problem of existing learning-based methods and enhances the stability of the algorithm. 2) Our model develops Residual Dense Blocks and a details-shortcut to improve performance on complex structures, such as moving bones overlapping with blood vessels, and on small features, such as peripheral vessels. 3) An innovative Vascular Consistency Strategy extracts the intrinsic consistency among the various relative motions in mask-live image pairs, spontaneously distilling the vascular structure as the contrast agent develops while robustly suppressing motion artifacts; it also naturally relaxes the strict matching requirements on the data. 4) We design a Mixup-based Data Self-evolution Strategy for intra-data self-enhancement within the training loop, so that the training data are dynamically optimized to help the model better learn vascular features while excluding irrelevant structures in the live/mask images and even inevitable artifacts or fake structures in the labels. Prospectively, to further evaluate practical value, an animal experiment under general anesthesia is conducted in addition to the assessment on human clinical data. Compared with other methods, our model improves PSNR and SSIM by 73.4% and 8.56%, respectively.
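The motion-artifact problem that VCC-DSA targets is easiest to see in the classical subtraction it replaces: when high-attenuation tissue moves between the mask and live frames, direct subtraction leaves residual edges. A toy illustration (the frames are synthetic):

```python
import numpy as np

def classical_dsa(live, mask):
    """Classical DSA: subtract the pre-contrast mask frame from the live
    frame. Any tissue motion between the two acquisitions survives the
    subtraction as an artifact."""
    return live - mask

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio, the metric reported in the abstract."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)
```

With a perfectly registered mask the vessel is recovered exactly; shift the "bone" by one pixel (simulated patient motion) and the subtraction is contaminated, which is the regime the learned subtraction mapping and vascular consistency constraint are designed to handle.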
Submitted 12 April, 2026;
originally announced April 2026.
-
Probing the Origin of Magnetar X-ray Polarization Diversity: A Multi-wavelength Geometrical Study of 1E 1547.0-5408 and 1E 2259+586
Authors:
Biao-Peng Li,
Zhi-Fu Gao,
Wen-Qi Ma,
Wei-Feng Zhang
Abstract:
The exceptionally high X-ray polarization recently detected in the magnetar 1E 1547.0-5408 is considered a strong candidate signature of quantum electrodynamic vacuum birefringence, an interpretation that hinges critically on the source's viewing geometry. This stark contrast to the typically lower polarization degrees seen in other magnetars prompts a fundamental question: to what extent does viewing geometry, rather than intrinsic physics, drive the observed polarization diversity? To answer this, we perform a systematic, comparative geometrical analysis of two magnetars representing opposite extremes: the high-polarization source 1E 1547.0-5408 and the low-polarization source 1E 2259+586. The data are modelled within a unified Bayesian framework with both the classical rotating vector model (CRVM) and a twisted-magnetosphere extension (MRVM). For 1E 2259+586, both models favour a geometry with moderate magnetic inclination and viewing angles but a small impact angle. By combining the single-epoch phase-resolved fit with three-epoch phase-averaged position angle measurements, we find no significant secular evolution of the twist parameter λ and derive a conservative upper limit of |Δλ|<0.79 at the 95 per cent level over 26 days. For 1E 1547.0-5408, the observed position angle curve is already well reproduced by the CRVM, while the MRVM shows no statistically significant advantage. When radio-informed priors are imposed, the posterior shifts towards a nearly aligned configuration consistent with the radio constraints. Both sources show no evidence for strong, static global twists in the current epoch. The observed polarization dichotomy arises from the confluence of viewing geometry, intrinsic surface emission physics, and magnetospheric propagation effects.
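The CRVM fits above use the standard rotating-vector relation between position angle and rotational phase; a minimal sketch (the phase and sign conventions here are assumptions, and the MRVM twist correction is not included):

```python
import numpy as np

def rvm_position_angle(phi, alpha, zeta, phi0=0.0, psi0=0.0):
    """Classical rotating vector model (CRVM): polarization position angle
    versus rotational phase phi, for magnetic inclination alpha and
    viewing angle zeta (all angles in radians)."""
    num = np.sin(alpha) * np.sin(phi - phi0)
    den = (np.sin(zeta) * np.cos(alpha)
           - np.cos(zeta) * np.sin(alpha) * np.cos(phi - phi0))
    return psi0 + np.arctan2(num, den)
```

The steepest swing of the position angle occurs at phi0, with slope sin(alpha)/sin(beta) for impact angle beta = zeta - alpha, which is why a small impact angle (as inferred for 1E 2259+586) produces a sharp S-shaped position angle curve.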
Submitted 12 April, 2026;
originally announced April 2026.
-
Ultrafast decoupling of the pseudogap from superconductivity in a pressurized cuprate
Authors:
Yanghao Meng,
Wenjin Mao,
Liucheng Chen,
Elbert E. M. Chia,
Yifeng Yang,
Jianlin Luo,
Lin Zhao,
Xingjiang Zhou,
Xiaohui Yu,
Xinbo Wang
Abstract:
The relationship between the pseudogap and superconductivity remains a central puzzle in the physics of cuprates. Hydrostatic pressure provides a clean tuning parameter free from chemical disorder, yet probing the microscopic energy scales of these phases under compression has remained experimentally challenging. Here, we utilize ultrafast optical spectroscopy to construct the high-pressure phase diagram of the underdoped cuprate Bi$_2$Sr$_2$CaCu$_2$O$_{8+δ}$ up to 37 GPa. Our results reveal a striking dichotomy within the pseudogap state: while the onset temperature $T^*$ rises monotonically with pressure, the energy gap $Δ_{\mathrm{PG}}$ is continuously suppressed. In contrast, the critical temperature $T_{\mathrm{c}}$ and the superconducting gap $Δ_{\mathrm{SC}}$ trace a correlated dome-like trajectory, demonstrating that superconductivity evolves independently from the pseudogap. Furthermore, an abrupt collapse of the gap ratio $2Δ_{\mathrm{SC}}/k_{\mathrm{B}}T_{\mathrm{c}}$ near 8 GPa marks a pressure-driven dimensional crossover, quenching two-dimensional phase fluctuations to stabilize global three-dimensional coherence. Upon reaching 37 GPa, the superconducting condensate is completely quenched into an insulating-like state. By resolving the extended phase evolution, our findings disentangle the pseudogap and superconducting orders, establishing a rigorous experimental basis for the pairing mechanism of high-temperature superconductivity.
Submitted 11 April, 2026;
originally announced April 2026.
-
Trust Your Memory: Verifiable Control of Smart Homes through Reinforcement Learning with Multi-dimensional Rewards
Authors:
Kai-Yuan Guo,
Jiang Wang,
Renjie Zhao,
Tianyi Wang,
Wandong Mao,
Yu Gao,
Mou Xiao Feng,
Yi Xu
Abstract:
Large Language Models (LLMs) have become a key foundation for enabling personalized smart home experiences. While existing studies have explored how smart home assistants understand user queries to control devices in real time, their ability to perform memory-driven device control remains challenging from both evaluation and methodological perspectives. In terms of evaluation, existing benchmarks either focus on immediate device control or general open-domain memory retrieval tasks, and therefore cannot effectively evaluate a model's ability to perform memory-driven device control. Methodologically, while memory-driven device control can be approached using Reinforcement Learning, conventional RL methods generally rely on outcome-based supervision (i.e., whether the final task is achieved). This lack of intermediate feedback can lead to sub-optimal performance or local failures in fine-grained memory management tasks (adding, updating, deleting, and utilizing). To address these issues, we first release MemHomeLife, built from anonymized real-world long-term user interaction logs. To enable more fine-grained evaluation of different memory-related subtasks, we further construct MemHome, the first benchmark designed to systematically evaluate memory-driven device control in smart home scenarios.
Submitted 11 April, 2026;
originally announced April 2026.
-
High-temperature superconductivity in Nd$_{0.85}$Sr$_{0.15}$NiO$_2$ membranes under pressure
Authors:
Yonghun Lee,
Mengnan Wang,
Xin Wei,
Yijun Yu,
Wendy L. Mao,
Yu Lin,
Harold Y. Hwang
Abstract:
Lattice compression has emerged as a fundamental tuning parameter for nickelate superconductivity. Pressure acts as a trigger to induce superconductivity in bulk Ruddlesden-Popper nickelates. For infinite-layer nickelate thin films, compressive epitaxial strain and rare-earth ion chemical pressure have been used to substantially enhance the superconducting transition temperature ($T_c$). Efforts to go further have been constrained by the limits of epitaxial stability or the challenges of measuring thin films in high-pressure environments. Here, we overcome this limitation by developing a technique to incorporate freestanding infinite-layer $\mathrm{Nd_{0.85}Sr_{0.15}NiO_2}$ membranes into a diamond anvil cell. Using this platform, we observe a strong increase in $T_c$ up to our highest measurement pressure of $\sim$90 GPa, where a superconducting downturn can be observed near liquid nitrogen temperatures. Strikingly, we find a simple linear enhancement of $T_c$ at a rate of 0.65 K GPa$^{-1}$, with no signs of saturation. This suggests that the pairing strength in infinite-layer nickelates can be raised to a surprisingly high scale, using an approach that can be broadly applied to many two-dimensional materials.
Submitted 10 April, 2026;
originally announced April 2026.
-
Pontryagin's Principle for Leakage-Immune Adiabatic Quantum State Transfer
Authors:
Xiao-Yu Dong,
Xi-Lai Wang,
Wen-Long Ma
Abstract:
The standard stimulated Raman adiabatic passage (STIRAP) protocol enables high-fidelity quantum state transfer in an ideal three-level system via adiabatic following of a dark state evolution. However, in practical systems with more energy levels, control pulses with finite spectral selectivity often couple the three-level subspace to the remaining subspace, introducing leakage that fundamentally limits the transfer performance. Here, we adopt a multilevel chain model for STIRAP that explicitly incorporates this leakage subspace. Using Pontryagin's maximum principle, we formulate a leakage-penalized quantum optimal control problem with the control pulses constrained to experimentally feasible Gaussian pulse families. We derive explicit gradients of the objective functional with respect to the pulse parameters, enabling efficient low-dimensional optimization that suppresses leakage while preserving the counterintuitive STIRAP pulse ordering. Numerical simulations for a superconducting transmon platform demonstrate that the optimized control pulses can significantly enhance the target-state transfer fidelity and provide enhanced robustness to amplitude miscalibration and detuning drifts.
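The Gaussian pulse family and the counterintuitive ordering above can be sketched directly; the parameterization is a standard STIRAP textbook form and only illustrates the low-dimensional search space, not the paper's leakage-penalized objective:

```python
import numpy as np

def gaussian_pulse(t, amp, t_center, width):
    """One member of the experimentally feasible Gaussian pulse family."""
    return amp * np.exp(-((t - t_center) ** 2) / (2.0 * width ** 2))

def stirap_pulses(t, amp_p, amp_s, delay, width):
    """Counterintuitive STIRAP ordering: the Stokes pulse precedes the
    pump pulse (Stokes centered at -delay/2, pump at +delay/2)."""
    pump = gaussian_pulse(t, amp_p, +delay / 2.0, width)
    stokes = gaussian_pulse(t, amp_s, -delay / 2.0, width)
    return pump, stokes

def mixing_angle(pump, stokes):
    """Dark-state mixing angle with tan(theta) = Omega_p / Omega_s;
    adiabatic transfer sweeps theta from 0 to pi/2, carrying the
    population from |1> to |3> without populating the lossy level."""
    return np.arctan2(pump, stokes)
```

In the paper's setting, the pulse parameters (amp_p, amp_s, delay, width) would be the optimization variables, with gradients of the leakage-penalized functional driving the low-dimensional search while this pulse ordering is preserved.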
Submitted 13 April, 2026; v1 submitted 10 April, 2026;
originally announced April 2026.
-
Medical Reasoning with Large Language Models: A Survey and MR-Bench
Authors:
Xiaohan Ren,
Chenxiao Fan,
Wenyin Ma,
Hongliang He,
Chongming Gao,
Xiaoyan Zhao,
Fuli Feng
Abstract:
Large language models (LLMs) have achieved strong performance on medical exam-style tasks, motivating growing interest in their deployment in real-world clinical settings. However, clinical decision-making is inherently safety-critical, context-dependent, and conducted under evolving evidence. In such situations, reliable LLM performance depends not on factual recall alone, but on robust medical reasoning. In this work, we present a comprehensive review of medical reasoning with LLMs. Grounded in cognitive theories of clinical reasoning, we conceptualize medical reasoning as an iterative process of abduction, deduction, and induction, and organize existing methods into seven major technical routes spanning training-based and training-free approaches. We further conduct a unified cross-benchmark evaluation of representative medical reasoning models under a consistent experimental setting, enabling a more systematic and comparable assessment of the empirical impact of existing methods. To better assess clinically grounded reasoning, we introduce MR-Bench, a benchmark derived from real-world hospital data. Evaluations on MR-Bench expose a pronounced gap between exam-level performance and accuracy on authentic clinical decision tasks. Overall, this survey provides a unified view of existing medical reasoning methods, benchmarks, and evaluation practices, and highlights key gaps between current model performance and the requirements of real-world clinical reasoning.
Submitted 17 March, 2026;
originally announced April 2026.
-
ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models
Authors:
Chonghan Qin,
Xiachong Feng,
Weitao Ma,
Xiaocheng Feng,
Lingpeng Kong
Abstract:
Existing memory benchmarks for LLM agents evaluate explicit recall of facts, yet overlook implicit memory where experience becomes automated behavior without conscious retrieval. This gap is critical: effective assistants must automatically apply learned procedures or avoid failed actions without explicit reminders. We introduce ImplicitMemBench, the first systematic benchmark evaluating implicit memory through three cognitively grounded constructs drawn from standard cognitive-science accounts of non-declarative memory: Procedural Memory (one-shot skill acquisition after interference), Priming (theme-driven bias via paired experimental/control instances), and Classical Conditioning (Conditioned Stimulus-Unconditioned Stimulus (CS-US) associations shaping first decisions). Our 300-item suite employs a unified Learning/Priming-Interfere-Test protocol with first-attempt scoring. Evaluation of 17 models reveals severe limitations: no model exceeds 66% overall, with top performers DeepSeek-R1 (65.3%), Qwen3-32B (64.1%), and GPT-5 (63.0%) far below human baselines. Analysis uncovers dramatic asymmetries (inhibition 17.6% vs. preference 75.0%) and universal bottlenecks requiring architectural innovations beyond parameter scaling. ImplicitMemBench reframes evaluation from "what agents recall" to "what they automatically enact".
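The Learning/Priming-Interfere-Test protocol with first-attempt scoring can be sketched as a simple driver loop; the item schema, the toy model, and all field names below are hypothetical, since the actual benchmark format is not given in the abstract:

```python
def run_item(model_step, item):
    """Unified protocol: expose the model to a learning/priming phase,
    then interference turns, then score ONLY its first response at test
    time, with no explicit reminder of the learning phase."""
    for turn in item["learning"]:
        model_step(turn)
    for turn in item["interference"]:
        model_step(turn)
    first_response = model_step(item["test"])
    return first_response == item["expected"]

def benchmark_score(model_step, items):
    """Fraction of items where the first attempt is correct."""
    return sum(run_item(model_step, it) for it in items) / len(items)
```

The point of first-attempt scoring is that it rewards automatic enactment: a model that could recall the fact when explicitly prompted, but does not apply it spontaneously, still fails the item.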
Submitted 15 April, 2026; v1 submitted 9 April, 2026;
originally announced April 2026.
-
RoSHI: A Versatile Robot-oriented Suit for Human Data In-the-Wild
Authors:
Wenjing Margaret Mao,
Jefferson Ng,
Luyang Hu,
Daniel Gehrig,
Antonio Loquercio
Abstract:
Scaling up robot learning will likely require human data containing rich and long-horizon interactions in the wild. Existing approaches for collecting such data trade off portability, robustness to occlusion, and global consistency. We introduce RoSHI, a hybrid wearable that fuses low-cost sparse IMUs with the Project Aria glasses to estimate the full 3D pose and body shape of the wearer in a metric global coordinate frame from egocentric perception. This system is motivated by the complementarity of the two sensors: IMUs provide robustness to occlusions and high-speed motions, while egocentric SLAM anchors long-horizon motion and stabilizes upper body pose. We collect a dataset of agile activities to evaluate RoSHI. On this dataset, we generally outperform other egocentric baselines and perform comparably to a state-of-the-art exocentric baseline (SAM3D). Finally, we demonstrate that the motion data recorded from our system are suitable for real-world humanoid policy learning. For videos, data and more, visit the project webpage: https://roshi-mocap.github.io/
Submitted 8 April, 2026;
originally announced April 2026.
-
Adaptive Distributionally Robust Optimal Control with Bayesian Ambiguity Sets
Authors:
Wentao Ma,
Zhiping Chen,
Huifu Xu,
Enlu Zhou
Abstract:
In stochastic optimal control (SOC), uncertainty may arise from incomplete knowledge of the true probability distribution of the underlying environment, which is known as Knightian or epistemic uncertainty. Distributionally robust optimal control (DROC) models have subsequently been proposed to tackle this source of uncertainty. While such models are effective in some practical applications, most existing DROC models are offline and can be overly conservative when data are scarce. Moreover, they cannot be applied to the case when samples are generated episodically. Motivated by the Bayesian SOC framework recently proposed by Shapiro et al. (2025), we propose an adaptive DROC model in which the ambiguity set is updated via Bayesian learning from new data. Under some moderate conditions, we derive a tractable risk-averse reformulation, establish consistency of the optimal value function and optimal policy for an infinite-horizon SOC, and prove a finite-sample posterior credibility guarantee for the policy value induced by the proposed episodic Bayesian DROC model. We also study the stability and statistical robustness of the proposed model with respect to sample perturbations that often arise in data-driven environments. To solve the episodic Bayesian DROC model, we propose a Bellman-operator cutting-plane (BOCP) algorithm that is computationally efficient and provably convergent. Numerical results on an inventory control problem demonstrate the effectiveness, adaptivity, and robust performance of the proposed model and algorithm.
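The episodic Bayesian update of the ambiguity set can be sketched with a conjugate toy model: a Bernoulli demand parameter whose ambiguity set is a credible interval around the posterior mean that shrinks as data arrive. The Beta-Bernoulli choice and the interval radius rule are illustrative assumptions, not the paper's construction:

```python
import math

class BayesianAmbiguitySet:
    """Adaptive ambiguity set for a Bernoulli demand parameter: a
    credible interval around the Beta posterior mean that contracts as
    episodic observations accumulate."""

    def __init__(self, a=1.0, b=1.0):
        self.a, self.b = a, b  # Beta prior pseudo-counts

    def update(self, observations):
        """Episodic Bayesian learning step on 0/1 demand outcomes."""
        self.a += sum(observations)
        self.b += len(observations) - sum(observations)

    def interval(self, kappa=1.0):
        """Ambiguity set as mean +/- kappa posterior standard deviations,
        clipped to [0, 1] (illustrative radius rule)."""
        n = self.a + self.b
        mean = self.a / n
        var = self.a * self.b / (n * n * (n + 1.0))  # Beta variance
        radius = kappa * math.sqrt(var)
        return max(0.0, mean - radius), min(1.0, mean + radius)
```

This captures the adaptivity claim: with scarce data the set is wide (robust but conservative), and as episodes accrue it contracts toward the truth, reducing conservatism; the DROC policy would then optimize against the worst distribution in the current set.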
Submitted 9 April, 2026; v1 submitted 8 April, 2026;
originally announced April 2026.
-
Visualizing the interplay of dual electronic nematicities in kagome superconductors
Authors:
Yunmei Zhang,
Jun Zhan,
Ping Wu,
Yun-Peng Huang,
Qixiao Yuan,
Hongyu Li,
Zhuying Wang,
Wanru Ma,
Shuikang Yu,
Kunming Zhang,
Wanlin Cheng,
Deshu Chen,
Minrui Chen,
Tao Wu,
Ziji Xiang,
Xianxin Wu,
Zhenyu Wang,
Xianhui Chen
Abstract:
Kagome superconductor AV$_3$Sb$_5$ (A stands for K, Rb, and Cs) hosts a wealth of intertwined electronic orders driven by geometric frustration and electron correlations. Among them, the breaking of rotational and/or time-reversal symmetry, observed within the triple-$Q$ charge density wave (CDW) phase yet exhibiting a more complex temperature dependence, remains a central puzzle. Here, by using scanning tunneling microscopy to study the electronic structures of CsV$_3$Sb$_5$ as a function of temperature and Ti doping, we disentangle the interrelation between two distinct nematic order parameters, one associated with the CDW and the other manifested as $C_2$ distortion of the V-$d_{x^{2}-y^{2}}$ Fermi pockets without breaking translational symmetry. The latter persists to high doping levels and high temperatures where the long-range CDW is fully suppressed. Moreover, its nematic director is oriented in a lattice direction distinct from that of the CDW-induced nematicity at intermediate doping, and eventually aligns with the strong nematic CDW order in the pristine compound where the quasiparticles of vanadium orbitals become coherent below a lower characteristic temperature. These observations, combined with Ginzburg-Landau analysis, reveal a rich interplay between two nematic orders that can be assigned to distinct kagome-lattice orbitals. Our results shed new light on the enigmatic intertwined orders in this family and establish a rare material platform in which dual nematic orders coexist and couple to give rise to unusual correlated phenomena.
Submitted 7 April, 2026;
originally announced April 2026.
-
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
Authors:
Weian Mao,
Xi Lin,
Wei Huang,
Yuxin Xie,
Tianfu Fu,
Bohan Zhuang,
Song Han,
Yukang Chen
Abstract:
Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, queries rotate with position under RoPE, so only a few recent queries are representative, leading to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions -- Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention to estimate key importance by leveraging these centers. Via the trigonometric series, we use the distance preference characterized by these centers to score keys according to their positions, and also leverage Q/K norms as an additional signal for importance estimation. On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines achieve only about half the accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where long context would otherwise cause out-of-memory with Full Attention.
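The center-based scoring described above can be sketched in a few lines. This is a toy illustration only: it assumes a precomputed pre-RoPE query center `q_center`, omits the paper's trigonometric-series distance terms, and `norm_weight` is a hypothetical knob, not a parameter from the paper.

```python
import numpy as np

def key_importance(pre_rope_keys, q_center, norm_weight=0.5):
    """Score cached keys by alignment with the fixed pre-RoPE query
    center, plus the key norm as an auxiliary importance signal.
    (The paper additionally derives position-dependent preferences
    from a trigonometric series, which this toy scoring omits.)"""
    alignment = pre_rope_keys @ q_center               # center alignment
    norms = np.linalg.norm(pre_rope_keys, axis=-1)     # Q/K norm signal
    return alignment + norm_weight * norms

rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 4))       # 8 cached pre-RoPE key vectors
center = rng.normal(size=4)          # estimated fixed query center
scores = key_importance(keys, center)
top_keys = np.argsort(scores)[-4:]   # retain the 4 highest-scoring keys
```

The appeal of scoring against a fixed center, rather than recent rotated queries, is that the score is independent of which query positions happen to be in the recent window.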
Submitted 6 April, 2026;
originally announced April 2026.
-
A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
Authors:
Tommie Kerssies,
Gabriele Berton,
Ju He,
Qihang Yu,
Wufei Ma,
Daan de Geus,
Gijs Dubbelman,
Liang-Chieh Chen
Abstract:
Anticipating diverse future states is a central challenge in video world modeling. Discriminative world models produce a deterministic prediction that implicitly averages over possible futures, while existing generative world models remain computationally expensive. Recent work demonstrates that predicting the future in the feature space of a vision foundation model (VFM), rather than a latent space optimized for pixel reconstruction, requires significantly fewer world model parameters. However, most such approaches remain discriminative. In this work, we introduce DeltaTok, a tokenizer that encodes the VFM feature difference between consecutive frames into a single continuous "delta" token, and DeltaWorld, a generative world model operating on these tokens to efficiently generate diverse plausible futures. Delta tokens reduce video from a three-dimensional spatio-temporal representation to a one-dimensional temporal sequence, for example yielding a 1,024x token reduction with 512x512 frames. This compact representation enables tractable multi-hypothesis training, where many futures are generated in parallel and only the best is supervised. At inference, this leads to diverse predictions in a single forward pass. Experiments on dense forecasting tasks demonstrate that DeltaWorld forecasts futures that more closely align with real-world outcomes, while having over 35x fewer parameters and using 2,000x fewer FLOPs than existing generative world models. Code and weights: https://deltatok.github.io.
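The "1,024x token reduction" arithmetic, and the delta-token idea itself, can be made concrete with a toy stand-in. The 16x16 patch size and the mean-pool summarizer below are illustrative assumptions; the actual DeltaTok tokenizer is learned.

```python
import numpy as np

# Token-count arithmetic behind the 1,024x reduction claim: a 512x512
# frame tokenized with (assumed) 16x16 patches yields 32 * 32 = 1024
# spatial tokens, versus a single delta token per frame.
patch = 16
tokens_per_frame = (512 // patch) ** 2   # 1024 spatial tokens

def delta_token(feat_prev, feat_next):
    """Toy stand-in for DeltaTok: summarize the VFM feature difference
    between consecutive frames as one continuous token. A mean-pool is
    used purely for illustration; the real encoder is learned."""
    return (feat_next - feat_prev).mean(axis=(0, 1))   # -> (C,) token

f0 = np.zeros((32, 32, 768))   # hypothetical per-patch VFM features, frame t
f1 = np.ones((32, 32, 768))    # frame t+1
token = delta_token(f0, f1)    # one 768-d token for the whole transition
```

Collapsing each frame transition to one token turns the video into a 1-D temporal sequence, which is what makes sampling many candidate futures in parallel tractable.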
Submitted 6 April, 2026;
originally announced April 2026.
-
ARM: Advantage Reward Modeling for Long-Horizon Manipulation
Authors:
Yiming Mao,
Zixi Yu,
Weixin Mao,
Yinhao Li,
Qirui Hu,
Zihan Lan,
Minzhao Zhu,
Hua Chen
Abstract:
Long-horizon robotic manipulation remains challenging for reinforcement learning (RL) because sparse rewards provide limited guidance for credit assignment. Practical policy improvement thus relies on richer intermediate supervision, such as dense progress rewards, which are costly to obtain and ill-suited to non-monotonic behaviors such as backtracking and recovery. To address this, we propose Advantage Reward Modeling (ARM), a framework that shifts from hard-to-quantify absolute progress to estimating relative advantage. We introduce a cost-effective tri-state labeling strategy -- Progressive, Regressive, and Stagnant -- that reduces human cognitive overhead while ensuring high cross-annotator consistency. By training on these intuitive signals, ARM enables automated progress annotation for both complete demonstrations and fragmented DAgger-style data. Integrating ARM into an offline RL pipeline allows for adaptive action-reward reweighting, effectively filtering suboptimal samples. Our approach achieves a 99.4% success rate on a challenging long-horizon towel-folding task, demonstrating improved stability and data efficiency over current VLA baselines with near-zero human intervention during policy training.
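The tri-state labels and the resulting reweighting can be sketched as follows. The specific weight values are illustrative assumptions, not the paper's learned advantage estimates.

```python
# Toy sketch of ARM-style reweighting: map tri-state progress labels to
# relative-advantage weights and reweight transitions in an offline RL
# update. The numeric weights are hypothetical.
LABEL_WEIGHT = {"Progressive": 1.0, "Stagnant": 0.25, "Regressive": 0.0}

def reweight(transitions):
    """Attach an advantage-based weight to each (state, action, label)
    transition; downstream losses multiply per-sample terms by it."""
    return [(s, a, LABEL_WEIGHT[label]) for s, a, label in transitions]

batch = [("s0", "a0", "Progressive"),
         ("s1", "a1", "Regressive"),
         ("s2", "a2", "Stagnant")]
weighted = reweight(batch)
# Regressive samples receive zero weight, i.e. they are filtered out.
```

Because annotators only judge the direction of progress rather than an absolute progress value, the labels are cheap to collect and consistent across annotators, which is the point of the tri-state design.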
Submitted 3 April, 2026;
originally announced April 2026.
-
Validating Computational Markers of Depressive Behavior: Cross-Linguistic Speech-Based Depression Detection with Neurophysiological Validation
Authors:
Fuxiang Tao,
Dongwei Li,
Shuning Tang,
Xuri Ge,
Wei Ma,
Anna Esposito,
Alessandro Vinciarelli
Abstract:
Speech-based depression detection has shown promise as an objective diagnostic tool, yet the cross-linguistic robustness of acoustic markers and their neurobiological underpinnings remain underexplored. This study extends the Cross-Data Multilevel Attention (CDMA) framework, initially validated on Italian, to investigate these dimensions using a Chinese Mandarin dataset with Electroencephalography (EEG) recordings. We systematically fuse read speech with spontaneous speech across different emotional valences (positive, neutral, negative) to investigate whether emotional arousal is a more critical factor than valence polarity in enhancing detection performance in speech. Additionally, we establish the first neurophysiological validation for a speech-based depression model by correlating its predictions with neural oscillatory patterns during emotional face processing. Our results demonstrate strong cross-linguistic generalizability of the CDMA framework, achieving state-of-the-art performance (F1-score up to 89.6%) on the Chinese dataset, which is comparable to the previous Italian validation. Critically, emotionally valenced speech (both positive and negative) significantly outperformed neutral speech. This comparable performance between positive and negative tasks supports the emotional arousal hypothesis. Most importantly, EEG analysis revealed significant correlations between the model's speech-derived depression estimates and neural oscillatory patterns (theta and alpha bands), demonstrating alignment with established neural markers of emotional dysregulation in depression. This alignment, combined with the model's cross-linguistic robustness, not only supports that the CDMA framework's approach is a universally applicable and neurobiologically validated strategy but also establishes a novel paradigm for the neurophysiological validation of computational mental health models.
Submitted 5 April, 2026; v1 submitted 1 April, 2026;
originally announced April 2026.
-
HiCT: High-precision 3D CBCT reconstruction from a single X-ray
Authors:
Wen Ma,
Jiaxiang Liu,
Zikai Xiao,
Ziyang Wang,
Feng Yang,
Zuozhu Liu
Abstract:
Accurate 3D dental imaging is vital for diagnosis and treatment planning, yet CBCT's high radiation dose and cost limit its accessibility. Reconstructing 3D volumes from a single low-dose panoramic X-ray is a promising alternative but remains challenging due to geometric inconsistencies and limited accuracy. We propose HiCT, a two-stage framework that first generates geometrically consistent multi-view projections from a single panoramic image using a video diffusion model, and then reconstructs high-fidelity CBCT from the projections using a ray-based dynamic attention network and an X-ray sampling strategy. To support this, we built XCT, a large-scale dataset combining public CBCT data with 500 paired PX-CBCT cases. Extensive experiments show that HiCT achieves state-of-the-art performance, delivering accurate and geometrically consistent reconstructions for clinical use.
Submitted 1 April, 2026;
originally announced April 2026.
-
SysOM-AI: Continuous Cross-Layer Performance Diagnosis for Production AI Training
Authors:
Yusheng Zheng,
Wenan Mao,
Shuyi Cheng,
Fuqiu Feng,
Guangshui Li,
Zhaoyan Liao,
Yongzhuo Huang,
Zhenwei Xiao,
Yuqing Li,
Andi Quinn,
Tao Ma
Abstract:
Performance diagnosis in production-scale AI training is challenging because subtle OS-level issues can trigger cascading GPU delays and network slowdowns, degrading training efficiency across thousands of GPUs. Existing profiling tools are limited to single system layers, incur prohibitive overhead (10--30%), or lack continuous deployment capabilities, resulting in manual analyses spanning days. We argue that continuous, cross-layer observability enabled by OS-level instrumentation and layered differential diagnosis is necessary to address this gap. We introduce SysOM-AI, a production observability system that continuously integrates CPU stack profiling, GPU kernel tracing, and NCCL event instrumentation via adaptive hybrid stack unwinding and eBPF-based tracing, incurring less than 0.4% overhead. Deployed at Alibaba across over 80,000 GPUs for more than one year, SysOM-AI helped diagnose 94 confirmed production issues, reducing median diagnosis time from days to approximately 10 minutes.
Submitted 31 March, 2026;
originally announced March 2026.
-
Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development
Authors:
Zhongying Deng,
Cheng Tang,
Ziyan Huang,
Jiashi Lin,
Ying Chen,
Junzhi Ning,
Chenglong Ma,
Jiyao Liu,
Wei Li,
Yinghao Zhu,
Shujian Gao,
Yanyan Huang,
Sibo Ju,
Yanzhou Su,
Pengcheng Chen,
Wenhao Tang,
Tianbin Li,
Haoyu Wang,
Yuanfeng Ji,
Hui Sun,
Shaobo Min,
Liang Peng,
Feilong Tang,
Haochen Xue,
Rulin Zhou
, et al. (102 additional authors not shown)
Abstract:
Foundation models have demonstrated remarkable success across diverse domains and tasks, primarily due to the availability of large-scale, diverse, and high-quality datasets. However, in the field of medical imaging, the curation and assembly of such datasets are highly challenging due to the reliance on clinical expertise and strict ethical and privacy constraints, resulting in a scarcity of large-scale unified medical datasets and hindering the development of powerful medical foundation models. In this work, we present the largest survey to date of medical image datasets, covering over 1,000 open-access datasets with a systematic catalog of their modalities, tasks, anatomies, annotations, limitations, and potential for integration. Our analysis exposes a landscape that is modest in scale, fragmented across narrowly scoped tasks, and unevenly distributed across organs and modalities, which in turn limits the utility of existing medical image datasets for developing versatile and robust medical foundation models. To turn fragmentation into scale, we propose a metadata-driven fusion paradigm (MDFP) that integrates public datasets with shared modalities or tasks, thereby transforming multiple small data silos into larger, more coherent resources. Building on MDFP, we release an interactive discovery portal that enables end-to-end, automated medical image dataset integration, and compile all surveyed datasets into a unified, structured table that clearly summarizes their key characteristics and provides reference links, offering the community an accessible and comprehensive repository. By charting the current terrain and offering a principled path to dataset consolidation, our survey provides a practical roadmap for scaling medical imaging corpora, supporting faster data discovery, more principled dataset creation, and more capable medical foundation models.
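The core of a metadata-driven fusion step is simple to sketch: group dataset records that share key metadata fields into pooled resources. The field names and catalog entries below are illustrative, not the survey's actual schema.

```python
from collections import defaultdict

def fuse_by_metadata(datasets):
    """Toy sketch of metadata-driven fusion: pool datasets that share
    (modality, task) into larger, more coherent resources."""
    pools = defaultdict(list)
    for d in datasets:
        pools[(d["modality"], d["task"])].append(d["name"])
    return dict(pools)

# Hypothetical catalog entries for illustration.
catalog = [
    {"name": "A", "modality": "CT",  "task": "segmentation"},
    {"name": "B", "modality": "CT",  "task": "segmentation"},
    {"name": "C", "modality": "MRI", "task": "classification"},
]
pools = fuse_by_metadata(catalog)
# Two CT segmentation silos merge into one pool; the MRI set stands alone.
```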
Submitted 28 March, 2026;
originally announced March 2026.
-
Pushing the Limits of Pulse Shape Discrimination in a Large Liquid Xenon Detector
Authors:
D. S. Akerib,
A. K. Al Musalhi,
F. Alder,
B. J. Almquist,
C. S. Amarasinghe,
A. Ames,
T. J. Anderson,
N. Angelides,
H. M. Araújo,
J. E. Armstrong,
M. Arthurs,
A. Baker,
S. Balashov,
J. Bang,
J. W. Bargemann,
E. E. Barillier,
K. Beattie,
A. Bhatti,
T. P. Biesiadzinski,
H. J. Birch,
E. Bishop,
G. M. Blockinger,
C. A. J. Brew,
P. Brás,
S. Burdin
, et al. (186 additional authors not shown)
Abstract:
The LUX-ZEPLIN (LZ) experiment is a direct-detection dark matter experiment, optimized to search for weakly interacting massive particles (WIMPs) through WIMP-nucleon interactions. The main challenge in dark matter detection is differentiating between WIMP signals and background events. In LZ, the ratio of ionization to scintillation signals (charge-to-light) is the primary method for rejecting electronic recoil (ER) background. Pulse shape discrimination (PSD) offers a method for additional ER background rejection in liquid xenon detectors. In this paper, the discrimination power of PSD with the LZ experiment is discussed. To precisely characterize the scintillation pulse shape, an analysis framework is developed to reconstruct the detection time of individual photons. Using LZ calibration data, the photon-timing prompt fraction discriminator is optimized and achieves ER leakage as low as $15\%$. For specific background processes such as $^{124}$Xe double electron capture, the leakage is reduced further to about $5\%$. PSD is combined with charge-to-light to form two-factor discrimination (TFD). The optimized TFD performance is compared with the performance of the charge-to-light method, with the corresponding false positive rate reduced by up to a factor of two for large scintillation pulses. Finally, PSD and TFD are applied to data from LZ's WS2024 run and their performance is summarized.
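A prompt-fraction discriminator of the kind described above can be sketched in a few lines: the fraction of scintillation photons that arrive within an early "prompt" window. The 30 ns window below is an illustrative assumption; LZ optimizes the window against calibration data.

```python
def prompt_fraction(photon_times_ns, prompt_window_ns=30.0):
    """Toy prompt-fraction discriminator: the share of reconstructed
    photon detection times falling within a short window after the
    first photon. (Window length is a hypothetical choice here.)"""
    if not photon_times_ns:
        return 0.0
    t0 = min(photon_times_ns)
    prompt = sum(1 for t in photon_times_ns if t - t0 <= prompt_window_ns)
    return prompt / len(photon_times_ns)

# Pulses with a faster scintillation component concentrate photons early,
# yielding a higher prompt fraction than slower pulses.
fast_pulse = [0, 5, 10, 12, 20, 90]      # mostly prompt photons
slow_pulse = [0, 40, 80, 120, 160, 200]  # mostly delayed photons
```

Cutting on this scalar (alone, or combined with charge-to-light as in TFD) is what separates recoil populations with different pulse shapes.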
Submitted 27 March, 2026;
originally announced March 2026.
-
Aesthetic Assessment of Chinese Handwritings Based on Vision Language Models
Authors:
Chen Zheng,
Yuxuan Lai,
Haoyang Lu,
Wentao Ma,
Jitao Yang,
Jian Wang
Abstract:
The handwriting of Chinese characters is a fundamental aspect of learning the Chinese language. Previous automated assessment methods often framed scoring as a regression problem. However, this score-only feedback lacks actionable guidance, which limits its effectiveness in helping learners improve their handwriting skills. In this paper, we leverage vision-language models (VLMs) to analyze the quality of handwritten Chinese characters and generate multi-level feedback. Specifically, we investigate two feedback generation tasks: simple grade feedback (Task 1) and enriched, descriptive feedback (Task 2). We explore both low-rank adaptation (LoRA)-based fine-tuning strategies and in-context learning methods to integrate aesthetic assessment knowledge into VLMs. Experimental results show that our approach achieves state-of-the-art performance across multiple evaluation tracks in the CCL 2025 workshop on evaluation of handwritten Chinese character quality.
Submitted 24 March, 2026;
originally announced March 2026.
-
Back to Basics: Revisiting ASR in the Age of Voice Agents
Authors:
Geeyang Tay,
Wentao Ma,
Jaewon Lee,
Yuzhi Tang,
Daniel Lee,
Weisu Yin,
Dongming Shen,
Silin Meng,
Yi Zhu,
Mu Li,
Alex Smola
Abstract:
Automatic speech recognition (ASR) systems have achieved near-human accuracy on curated benchmarks, yet still fail in real-world voice agents under conditions that current evaluations do not systematically cover. Without diagnostic tools that isolate specific failure factors, practitioners cannot anticipate which conditions, in which languages, will cause what degree of degradation. We introduce WildASR, a multilingual (four-language) diagnostic benchmark sourced entirely from real human speech that factorizes ASR robustness along three axes: environmental degradation, demographic shift, and linguistic diversity. Evaluating seven widely used ASR systems, we find severe and uneven performance degradation, and model robustness does not transfer across languages or conditions. Critically, models often hallucinate plausible but unspoken content under partial or degraded inputs, creating concrete safety risks for downstream agent behavior. Our results demonstrate that targeted, factor-isolated evaluation is essential for understanding and improving ASR reliability in production systems. Besides the benchmark itself, we also present three analytical tools that practitioners can use to guide deployment decisions.
Submitted 26 March, 2026;
originally announced March 2026.
-
Privacy-Preserving EHR Data Transformation via Geometric Operators: A Human-AI Co-Design Technical Report
Authors:
Maolin Wang,
Beining Bao,
Gan Yuan,
Hongyu Chen,
Bingkun Zhao,
Baoshuo Kan,
Jiming Xu,
Qi Shi,
Yinggong Zhao,
Yao Wang,
Wei Ying Ma,
Jun Yan
Abstract:
Electronic health records (EHRs) and other real-world clinical data are essential for clinical research, medical artificial intelligence, and life science, but their sharing is severely limited by privacy, governance, and interoperability constraints. These barriers create persistent data silos that hinder multi-center studies, large-scale model development, and broader biomedical discovery. Existing privacy-preserving approaches, including multi-party computation and related cryptographic techniques, provide strong protection but often introduce substantial computational overhead, reducing the efficiency of large-scale machine learning and foundation-model training. In addition, many such methods make data usable for restricted computation while leaving them effectively invisible to clinicians and researchers, limiting their value in workflows that still require direct inspection, exploratory analysis, and human interpretation. We propose a real-world-data transformation framework for privacy-preserving sharing of structured clinical records. Instead of converting data into opaque representations, our approach constructs transformed numeric views that preserve medical semantics and major statistical properties while, under a clearly specified threat model, provably breaking direct linkage between those views and protected patient-level attributes. Through collaboration between computer scientists and the AI agent SciencePal, acting as a constrained tool inventor under human guidance, we design three transformation operators that are non-reversible within this threat model, together with an additional mixing strategy for high-risk scenarios, supported by theoretical analysis and empirical evaluation under reconstruction, record linkage, membership inference, and attribute inference attacks.
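A minimal example of a geometric operator in this spirit is a secret random rotation: pairwise distances between records, and thus much of the statistical structure, are preserved exactly, while raw attribute values are no longer directly readable without the secret. This is an illustration of the general idea only; the paper's three operators and its threat model are more involved.

```python
import numpy as np

def orthogonal_view(records, seed=0):
    """Toy geometric operator: apply a secret random rotation to
    numeric records. Distances and inner products are preserved,
    breaking direct attribute readability. (Illustrative sketch,
    not one of the paper's operators.)"""
    rng = np.random.default_rng(seed)
    d = records.shape[1]
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # secret orthogonal matrix
    return records @ q

x = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 2.0]])
y = orthogonal_view(x)
# The distance between the two records is unchanged by the rotation.
d_before = np.linalg.norm(x[0] - x[1])
d_after = np.linalg.norm(y[0] - y[1])
```

A rotation alone is a weak protection (it is linear and invertible given the secret), which is why practical schemes layer additional non-reversible operators and mixing, as the abstract describes.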
Submitted 24 March, 2026;
originally announced March 2026.
-
DeepStock: Reinforcement Learning with Policy Regularizations for Inventory Management
Authors:
Yaqi Xie,
Xinru Hao,
Jiaxi Liu,
Will Ma,
Linwei Xin,
Lei Cao,
Yidong Zhang
Abstract:
Deep Reinforcement Learning (DRL) provides a general-purpose methodology for training inventory policies that can leverage big data and compute. However, off-the-shelf implementations of DRL have seen mixed success, often plagued by high sensitivity to the hyperparameters used during training. In this paper, we show that by imposing policy regularizations, grounded in classical inventory concepts such as "Base Stock", we can significantly accelerate hyperparameter tuning and improve the final performance of several DRL methods. We report details from a 100% deployment of DRL with policy regularizations on Alibaba's e-commerce platform, Tmall. We also include extensive synthetic experiments, which show that policy regularizations reshape the narrative on what is the best DRL method for inventory management.
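The "Base Stock" structure referenced above is the classical order-up-to rule, which can be stated in one line. Regularizing a DRL policy toward this structure, e.g. having the network output a target level S rather than raw order quantities, is a sketch of the idea only, not the paper's exact parameterization.

```python
def base_stock_order(inventory_position, base_stock_level):
    """Classical base-stock ("order-up-to") rule: order the shortfall
    between a target level S and the current inventory position
    (on-hand plus on-order minus backorders); never order when the
    position already exceeds S."""
    return max(0, base_stock_level - inventory_position)

# With target S = 10: position 4 -> order 6; position 12 -> order 0.
orders = [base_stock_order(p, 10) for p in (4, 12)]
```

Constraining the policy class this way shrinks the search space the DRL method must explore, which is one plausible reason such regularizations reduce hyperparameter sensitivity.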
Submitted 19 March, 2026;
originally announced March 2026.
-
Multiscale simulations guided advances for all-optical phase-change waveguides
Authors:
Hanyi Zhang,
Wanting Ma,
Wen Zhou,
Xueqi Xing,
Junying Zhang,
Tiankuo Huang,
Ding Xu,
Xiaozhe Wang,
Riccardo Mazzarello,
En Ma,
Jiang-Jing Wang,
Wei Zhang
Abstract:
Photonic computing using chalcogenide phase-change materials (PCMs) is under active development for energy-efficient artificial intelligence (AI) applications. A key requirement is to enable as many optically programmable levels per device as possible, while maintaining relatively low optical loss. In this work, we carry out multiscale simulations using density functional theory and finite-difference time-domain methods, proposing a "the shorter the better" strategy to optimize the performance of Sb$_2$Te photonic waveguide devices. Our subsequent experimental characterizations of Sb$_2$Te thin films and optical device measurements fully verify our theoretical predictions. In particular, we reveal the unconventional optical properties of metastable crystalline Sb$_2$Te, and utilize these features for device design, yielding a simultaneous improvement in both the programming window and the optical loss. Overall, an optical programming precision exceeding 7-bit is achieved using a single waveguide cell, setting a new record for all-optical phase-change memory devices. Our work serves as a compelling example of computational material design, which demonstrates the predictive power of multiscale simulations in guiding the design of phase-change photonic devices for enhanced performance.
Submitted 9 April, 2026; v1 submitted 19 March, 2026;
originally announced March 2026.
-
Concept-to-Pixel: Prompt-Free Universal Medical Image Segmentation
Authors:
Haoyun Chen,
Fenghe Tang,
Wenxin Ma,
Shaohua Kevin Zhou
Abstract:
Universal medical image segmentation seeks to use a single foundational model to handle diverse tasks across multiple imaging modalities. However, existing approaches often rely heavily on manual visual prompts or retrieved reference images, which limits their automation and robustness. In addition, naive joint training across modalities often fails to address large domain shifts. To address these limitations, we propose Concept-to-Pixel (C2P), a novel prompt-free universal segmentation framework. C2P explicitly separates anatomical knowledge into two components: Geometric and Semantic representations. It leverages Multimodal Large Language Models (MLLMs) to distill abstract, high-level medical concepts into learnable Semantic Tokens and introduces explicitly supervised Geometric Tokens to enforce universal physical and structural constraints. These disentangled tokens interact deeply with image features to generate input-specific dynamic kernels for precise mask prediction. Furthermore, we introduce a Geometry-Aware Inference Consensus mechanism, which utilizes the model's predicted geometric constraints to assess prediction reliability and suppress outliers. Extensive experiments and analysis on a unified benchmark comprising eight diverse datasets across seven modalities demonstrate the significant superiority of our jointly trained approach, compared to universal or single-model approaches. Remarkably, our unified model demonstrates strong generalization, achieving impressive results not only on zero-shot tasks involving unseen cases but also in cross-modal transfers across similar tasks. Code is available at: https://github.com/Yundi218/Concept-to-Pixel
Submitted 18 March, 2026;
originally announced March 2026.
-
Spin crossover in FeO under shock compression
Authors:
Lélia Libon,
Alessandra Ravasio,
Silvia Pandolfi,
Yanyao Zhang,
Xuehui Wei,
Jean-Alexis Hernandez,
Hong Yang,
Amanda J. Chen,
Tommaso Vinci,
Alessandra Benuzzi-Mounaix,
Clemens Prescher,
François Soubiran,
Hae Ja Lee,
Eric Galtier,
Nick Czapla,
Wendy L. Mao,
Arianna E. Gleason,
Sang Heon Shim,
Roberto Alonso-Mori,
Guillaume Morard
Abstract:
FeO (wüstite), which exhibits complex electronic and structural properties with increasing pressure and temperature, is a key mineralogical phase for understanding deep planetary interiors. However, direct measurements of its spin state at high pressure and temperature remain challenging in static compression experiments. Here, we employ laser-driven shock compression to extend the FeO principal Hugoniot up to $\sim$900 GPa and perform in situ X-ray diffraction and X-ray emission spectroscopy up to 250 GPa, probing FeO's crystal structure and spin state. We demonstrate a continuous spin crossover of iron in FeO over a broad pressure range, with the high-spin state persisting beyond Earth's core-mantle boundary (CMB) conditions. These observations provide new experimental constraints on iron spin state at extreme conditions essential for geophysical models of (exo)planetary interiors.
Submitted 17 March, 2026;
originally announced March 2026.
-
CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization
Authors:
Liangbin Huang,
Xiaohua Liao,
Chaoqun Cui,
Shijing Wang,
Zhaolong Huang,
Yanlong Du,
Wenji Mao
Abstract:
Traditional speaker diarization systems have primarily focused on constrained scenarios such as meetings and interviews, where the number of speakers is limited and acoustic conditions are relatively clean. To explore open-world speaker diarization, we extend this task to the visual media domain, encompassing complex audiovisual programs such as films and TV series. This new setting introduces several challenges, including long-form video understanding, a large number of speakers, cross-modal asynchrony between audio and visual cues, and uncontrolled in-the-wild variability. To address these challenges, we propose Cinematic Speaker Registration & Diarization (CineSRD), a unified multimodal framework that leverages visual, acoustic, and linguistic cues from video, speech, and subtitles for speaker annotation. CineSRD first performs visual anchor clustering to register initial speakers and then integrates an audio language model for speaker turn detection, refining annotations and supplementing unregistered off-screen speakers. Furthermore, we construct and release a dedicated speaker diarization benchmark for visual media that includes Chinese and English programs. Experimental results demonstrate that CineSRD achieves superior performance on the proposed benchmark and competitive results on conventional datasets, validating its robustness and generalizability in open-world visual media settings.
Submitted 17 March, 2026;
originally announced March 2026.
-
Directivity Enhancement of Movable Antenna Arrays with Mutual Coupling
Authors:
Wei Xu,
Lipeng Zhu,
Wenyan Ma,
An Liu,
Rui Zhang
Abstract:
In conventional antenna arrays, mutual coupling between antenna elements is often regarded as detrimental. However, under specific conditions, it can be harnessed to enhance the far-field directivity (i.e., beamforming gain). Theoretically, the directivity of an N-antenna superdirective array over the endfire direction can reach N^{2}, significantly exceeding the directivity of a traditional uncoupled array which is N over all directions. This paper investigates the potential of mutual coupling effects in movable antenna (MA) arrays for directivity enhancement. A low-complexity algorithm called Greedy Search and Gradient Descent (GS-GD) is proposed to optimize the antenna positions for maximizing the array directivity over any given direction, where the antenna positions are first selected sequentially from discrete grid points and then continuously refined through gradient descent (GD) optimization. Numerical results demonstrate that the optimized MA array design by exploiting the antenna coupling achieves significant directivity gains compared to the conventional uniform linear array (ULA) without antenna coupling over all directions. Additionally, the proposed GS-GD algorithm is shown to approach the global optimum closely in most directions.
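The GS-GD shape (greedy grid selection followed by gradient refinement) can be sketched on a simplified, coupling-free linear array; without the paper's mutual-coupling model no superdirectivity appears, so this only illustrates the optimization procedure, not the coupling gains.

```python
import numpy as np

def directivity(pos, theta0=0.0, n_grid=721):
    """Directivity toward theta0 of a linear array of isotropic elements
    at positions `pos` (in wavelengths), with conjugate phase steering."""
    theta = np.linspace(0.0, np.pi, n_grid)
    k = 2 * np.pi
    w = np.exp(-1j * k * pos * np.cos(theta0))
    af2 = np.abs(np.exp(1j * k * np.outer(np.cos(theta), pos)) @ w) ** 2
    num = len(pos) ** 2                                   # |AF(theta0)|^2
    den = 0.5 * np.sum(af2 * np.sin(theta)) * (theta[1] - theta[0])
    return num / den                                      # peak over average power

def gs_gd(n_ant=4, iters=100, step=1e-2):
    grid = np.arange(0.0, 2.01, 0.1)
    pos = [0.0]
    for _ in range(n_ant - 1):        # greedy search over grid points
        cands = [g for g in grid if all(abs(g - p) > 1e-6 for p in pos)]
        pos.append(max(cands, key=lambda g: directivity(np.array(pos + [g]))))
    pos = np.array(pos)
    for _ in range(iters):            # safeguarded finite-difference ascent
        eps, grad = 1e-4, np.zeros_like(pos)
        for i in range(len(pos)):
            d = np.zeros_like(pos); d[i] = eps
            grad[i] = (directivity(pos + d) - directivity(pos - d)) / (2 * eps)
        cand = pos + step * grad
        if directivity(cand) > directivity(pos):
            pos = cand
        else:
            step *= 0.5               # backtrack so directivity never decreases
    return pos

pos = gs_gd()
print(pos, directivity(pos))
```

The safeguarded acceptance rule makes the refinement monotone, mirroring how the continuous GD stage can only improve on the greedy grid solution.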
Submitted 17 March, 2026;
originally announced March 2026.
-
From Documents to Spans: Code-Centric Learning for LLM-based ICD Coding
Authors:
Xu Zhang,
Wenxin Ma,
Chenxu Wu,
Rongsheng Wang,
Kun Zhang,
S. Kevin Zhou
Abstract:
ICD coding is a critical yet challenging task in healthcare. Recently, LLM-based methods demonstrate stronger generalization than discriminative methods in ICD coding. However, fine-tuning LLMs for ICD coding faces three major challenges. First, existing public ICD coding datasets provide limited coverage of the ICD code space, restricting a model's ability to generalize to unseen codes. Second, naive fine-tuning diminishes the interpretability of LLMs, as few public datasets contain explicit supporting evidence for assigned codes. Third, ICD coding typically involves long clinical documents, making fine-tuning LLMs computationally expensive. To address these issues, we propose Code-Centric Learning, a training framework that shifts supervision from full clinical documents to scalable, short evidence spans. The key idea of this framework is that span-level learning improves LLMs' ability to perform document-level ICD coding. Our proposed framework consists of a mixed training strategy and code-centric data expansion, which substantially reduces training cost, improves accuracy on unseen ICD codes and preserves interpretability. Under the same LLM backbone, our method substantially outperforms strong baselines. Notably, our method enables small-scale LLMs to achieve performance comparable to much larger proprietary models, demonstrating its effectiveness and potential for fully automated ICD coding.
Submitted 16 March, 2026;
originally announced March 2026.
-
ADV-0: Closed-Loop Min-Max Adversarial Training for Long-Tail Robustness in Autonomous Driving
Authors:
Tong Nie,
Yihong Tang,
Junlin He,
Yuewen Mei,
Jie Sun,
Lijun Sun,
Wei Ma,
Jian Sun
Abstract:
Deploying autonomous driving systems requires robustness against long-tail scenarios that are rare but safety-critical. While adversarial training offers a promising solution, existing methods typically decouple scenario generation from policy optimization and rely on heuristic surrogates. This leads to objective misalignment and fails to capture the shifting failure modes of evolving policies. This paper presents ADV-0, a closed-loop min-max optimization framework that treats the interaction between driving policy (defender) and adversarial agent (attacker) as a zero-sum Markov game. By aligning the attacker's utility directly with the defender's objective, we reveal the optimal adversary distribution. To make this tractable, we cast dynamic adversary evolution as iterative preference learning, efficiently approximating this optimum and offering an algorithm-agnostic solution to the game. Theoretically, ADV-0 converges to a Nash Equilibrium and maximizes a certified lower bound on real-world performance. Experiments indicate that it effectively exposes diverse safety-critical failures and greatly enhances the generalizability of both learned policies and motion planners against unseen long-tail risks.
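The zero-sum min-max structure at the core of ADV-0 can be illustrated, far more simply than the paper's Markov game, by no-regret self-play on matching pennies: both players run multiplicative-weights updates, and their average strategies converge to the Nash equilibrium. This is a toy stand-in for the defender/attacker dynamics, not the paper's preference-learning algorithm.

```python
import numpy as np

# Row player ("defender") maximizes, column player ("attacker") minimizes
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])   # matching-pennies payoff matrix

def mwu_selfplay(T=5000, eta=0.01):
    wr, wc = np.array([1.0, 2.0]), np.ones(2)   # asymmetric start
    avg_r, avg_c = np.zeros(2), np.zeros(2)
    for _ in range(T):
        pr, pc = wr / wr.sum(), wc / wc.sum()
        avg_r += pr; avg_c += pc
        wr = wr * np.exp(eta * (A @ pc))    # defender ascends its payoff
        wc = wc * np.exp(-eta * (pr @ A))   # attacker descends the same payoff
    return avg_r / T, avg_c / T

pr, pc = mwu_selfplay()
print(pr, pc)   # both near the mixed Nash equilibrium (0.5, 0.5)
```

The instantaneous strategies cycle around the equilibrium; it is the time-averaged strategies that converge, the same average-iterate guarantee behind convergence claims for adversarial training.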
Submitted 16 March, 2026;
originally announced March 2026.
-
AI Can Learn Scientific Taste
Authors:
Jingqi Tong,
Mingzhe Li,
Hangcheng Li,
Yongzhuo Yang,
Yurong Mou,
Weijie Ma,
Zhiheng Xi,
Hongji Chen,
Xiaoran Liu,
Qinyuan Cheng,
Ming Zhang,
Qiguang Chen,
Weifeng Ge,
Qipeng Guo,
Tianlei Ying,
Tianxiang Sun,
Yining Zheng,
Xinchi Chen,
Jun Zhao,
Ning Ding,
Xuanjing Huang,
Yugang Jiang,
Xipeng Qiu
Abstract:
Great scientists have strong judgement and foresight, closely tied to what we call scientific taste. Here, we use the term to refer to the capacity to judge and propose research ideas with high potential impact. However, most related research focuses on improving an AI scientist's executive capability, while enhancing an AI's scientific taste remains underexplored. In this work, we propose Reinforcement Learning from Community Feedback (RLCF), a training paradigm that uses large-scale community signals as supervision, and formulate scientific taste learning as a preference modeling and alignment problem. For preference modeling, we train Scientific Judge on 700K field- and time-matched pairs of high- vs. low-citation papers to judge ideas. For preference alignment, using Scientific Judge as a reward model, we train a policy model, Scientific Thinker, to propose research ideas with high potential impact. Experiments show Scientific Judge outperforms SOTA LLMs (e.g., GPT-5.2, Gemini 3 Pro) and generalizes to future-year tests, unseen fields, and peer-review preferences. Furthermore, Scientific Thinker proposes research ideas with higher potential impact than baselines. Our findings show that AI can learn scientific taste, marking a key step toward reaching human-level AI scientists.
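Preference modeling over high- vs. low-citation pairs is, at its core, a Bradley-Terry problem: a scorer s(x) is trained so that P(a preferred over b) = sigma(s(a) - s(b)). Below is a linear toy version on synthetic features; the feature construction, dimensions, and training setup are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
v_true = rng.normal(size=d)                 # latent "impact" direction

def make_pairs(n):
    hi = rng.normal(size=(n, d)); lo = rng.normal(size=(n, d))
    # relabel so `hi` always has the larger latent impact
    swap = (hi @ v_true) < (lo @ v_true)
    hi[swap], lo[swap] = lo[swap].copy(), hi[swap].copy()
    return hi, lo

def train_judge(hi, lo, epochs=200, lr=0.1):
    w = np.zeros(d)
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-(hi - lo) @ w))        # P(hi preferred)
        w += lr * ((1 - p) @ (hi - lo)) / len(hi)   # Bradley-Terry gradient
    return w

hi, lo = make_pairs(2000)
w = train_judge(hi, lo)
hi_t, lo_t = make_pairs(500)
acc = np.mean((hi_t - lo_t) @ w > 0)
print(f"held-out pairwise accuracy: {acc:.2f}")
```

The learned direction w recovers the latent impact direction, which is the mechanism by which a pairwise judge generalizes to ranking unseen items.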
Submitted 15 March, 2026;
originally announced March 2026.
-
Robust and Active Visible-Light Integrated Photonics on Thin-Film Lithium Tantalate for Underwater Optical Wireless Communications
Authors:
Changjian Guo,
Xingjie Li,
Xiaofeng Wu,
Jiajie Deng,
Wenchang Yang,
Weilong Ma,
Ziliang Ruan,
Kaixuan Chen,
Sailing He,
Liu Liu
Abstract:
Visible-light integrated photonics enables compact platforms for sensing, precision metrology, and free-space data links at visible wavelengths. However, many applications remain limited by the lack of high-speed and robust modulators in the blue-green band. Here we report thin-film lithium tantalate waveguides with dB/cm-scale propagation losses and modulators with a flat frequency response up to ~50 GHz, both operating at 532 nm. The modulator remains stable when delivering 5 dBm modulated optical power, which cannot be achieved by thin-film lithium niobate based counterparts under similar conditions and structures. We validate system-level underwater wireless optical communication (UWOC) by transmitting 112 Gb/s signals over a 3-m underwater link. This represents the first integrated external modulator-based UWOC system, overcoming the bandwidth-power-chirp trade-offs of traditional directly modulated laser based systems. We further demonstrate dual-drive modulators for optical single-sideband and electro-optic frequency-comb generation in the green-wavelength band. These results provide a foundation for complex, robust, and active visible-light photonic integrated circuits for underwater optical applications.
Submitted 15 March, 2026;
originally announced March 2026.
-
APEX-Searcher: Augmenting LLMs' Search Capabilities through Agentic Planning and Execution
Authors:
Kun Chen,
Qingchao Kong,
Zhao Feifei,
Wenji Mao
Abstract:
Retrieval-augmented generation (RAG), based on large language models (LLMs), serves as a vital approach to retrieving and leveraging external knowledge in various domain applications. When confronted with complex multi-hop questions, single-round retrieval is often insufficient for accurate reasoning and problem solving. To enhance search capabilities for complex tasks, most existing works integrate multi-round iterative retrieval with reasoning processes via end-to-end training. While these approaches significantly improve problem-solving performance, they still face challenges in task reasoning and model training, especially ambiguous retrieval execution paths and sparse rewards in the end-to-end reinforcement learning (RL) process, leading to inaccurate retrieval results and performance degradation. To address these issues, in this paper we propose APEX-Searcher, a novel Agentic Planning and Execution framework to augment LLM search capabilities. Specifically, we introduce a two-stage agentic framework that decouples the retrieval process into planning and execution: it first employs RL with decomposition-specific rewards to optimize strategic planning; built on the resulting sub-task decomposition, it then applies supervised fine-tuning on high-quality multi-hop trajectories to equip the model with robust iterative sub-task execution capabilities. Extensive experiments demonstrate that our proposed framework achieves significant improvements in both multi-hop RAG and task planning performance across multiple benchmarks.
Submitted 17 March, 2026; v1 submitted 14 March, 2026;
originally announced March 2026.
-
Collaborative Multi-Agent Optimization for Personalized Memory System
Authors:
Wenyu Mao,
Haoyang Liu,
Zhao Liu,
Haosong Tan,
Yaorui Shi,
Jiancan Wu,
An Zhang,
Xiang Wang
Abstract:
Memory systems are crucial to personalized LLMs by mitigating the context window limitation in capturing long-term user-LLM conversations. Typically, such systems leverage multiple agents to handle multi-granular memory construction and personalized memory retrieval tasks. To optimize the system, existing methods focus on specializing agents on their local tasks independently via prompt engineering or fine-tuning. However, they overlook cross-agent collaboration, where independent optimization on local agents hardly guarantees the global system performance. To address this issue, we propose a Collaborative Reinforcement Learning Framework for Multi-Agent Memory Systems (CoMAM), jointly optimizing local agents to facilitate collaboration. Specifically, we regularize agents' execution as a sequential Markov decision process (MDP) to embed inter-agent dependencies into the state transition, yielding both local task rewards (e.g., information coverage for memory construction) and global rewards (i.e., query-answer accuracy). Then, we quantify each agent's contribution via group-level ranking consistency between local and global rewards, treating them as adaptive weights to assign global credit and integrate local-global rewards. Each agent is optimized by these integrated rewards, aligning local improvements with the global performance. Experiments show CoMAM outperforms leading memory systems, validating the efficacy of our proposed collaborative reinforcement learning for joint optimization.
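The credit-assignment idea, weighting each agent's share of the global reward by how consistently its local rewards rank with the global ones, can be sketched with Spearman correlation over a rollout group. The agent names and the exact weighting rule below are illustrative assumptions, not CoMAM's formulation.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation via Pearson correlation of the ranks."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean(); rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

def integrated_rewards(local, global_r, beta=1.0):
    """local: dict agent -> per-rollout local rewards, shape [G];
    global_r: per-rollout global (query-answer) reward, shape [G].
    Agents whose local ranking agrees with the global one get more global credit."""
    wts = {a: max(spearman(r, global_r), 0.0) for a, r in local.items()}
    z = sum(wts.values()) or 1.0
    return {a: local[a] + beta * (wts[a] / z) * global_r for a in local}

# Toy group of 5 rollouts: the retriever's local reward tracks final answer
# accuracy; the summarizer's does not, so it receives no global credit here.
g = np.array([0.9, 0.1, 0.7, 0.3, 0.5])
local = {"retriever":  np.array([0.8, 0.2, 0.6, 0.3, 0.5]),
         "summarizer": np.array([0.1, 0.9, 0.2, 0.8, 0.5])}
R = integrated_rewards(local, g)
```

Each agent would then be optimized against its integrated reward, aligning local improvements with the global objective.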
Submitted 13 March, 2026;
originally announced March 2026.
-
FG-CLTP: Fine-Grained Contrastive Language Tactile Pretraining for Robotic Manipulation
Authors:
Wenxuan Ma,
Chaofan Zhang,
Yinghao Cai,
Guocai Yao,
Shaowei Cui,
Shuo Wang
Abstract:
Recent advancements in integrating tactile sensing into vision-language-action (VLA) models have demonstrated transformative potential for robotic perception. However, existing tactile representations predominantly rely on qualitative descriptors (e.g., texture), neglecting quantitative contact states such as force magnitude, contact geometry, and principal axis orientation, which are indispensable for fine-grained manipulation. To bridge this gap, we propose FG-CLTP, a fine-grained contrastive language tactile pretraining framework. We first introduce a novel dataset comprising over 100k tactile 3D point cloud-language pairs that explicitly capture multidimensional contact states from the sensor's perspective. We then implement a discretized numerical tokenization mechanism to achieve quantitative-semantic alignment, effectively injecting explicit physical metrics into the multimodal feature space. The proposed FG-CLTP model yields a 95.9% classification accuracy and reduces the regression error (MAE) by 52.6% compared to state-of-the-art methods. Furthermore, the integration of 3D point cloud representations establishes a sensor-agnostic foundation with a minimal sim-to-real gap of 3.5%. Building upon this fine-grained representation, we develop a 3D tactile-language-action (3D-TLA) architecture driven by a flow matching policy to enable multimodal reasoning and control. Extensive experiments demonstrate that our framework significantly outperforms strong baselines in contact-rich manipulation tasks, providing a robust and generalizable foundation for tactile-language-action models.
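The discretized numerical tokenization mechanism can be sketched as uniform binning of a physical quantity into special tokens; the force range, bin count, and token format below are assumptions for illustration, not the paper's vocabulary.

```python
import numpy as np

def force_to_token(force_n, f_min=0.0, f_max=20.0, n_bins=64):
    """Map a continuous contact-force magnitude (newtons) to a discrete
    token, so physical quantities enter the language stream as symbols."""
    f = float(np.clip(force_n, f_min, f_max))
    b = min(int((f - f_min) / (f_max - f_min) * n_bins), n_bins - 1)
    return f"<FORCE_{b}>"

def token_to_force(token, f_min=0.0, f_max=20.0, n_bins=64):
    """Invert a force token to the center of its bin."""
    b = int(token.strip("<>").split("_")[1])
    width = (f_max - f_min) / n_bins
    return f_min + (b + 0.5) * width

tok = force_to_token(3.7)
print(tok, token_to_force(tok))   # round-trip error bounded by half a bin
```

Binding such tokens to language embeddings is what allows the contrastive objective to align quantitative contact states with semantic descriptions.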
Submitted 11 March, 2026;
originally announced March 2026.
-
Topological Tunneling Magnetoresistance Driven by Type-II Weyl-Like States in the Room-Temperature Half-Metal Mn2PC Monolayer
Authors:
Wei Ma,
Yu-Ting Wang,
Wen-Bo Sun,
Zhiheng Lv,
Shuai Shi,
Jian-Hong Rong,
Tie-Lei Song,
Zhi-Feng Liu
Abstract:
We predict the tetragonal Mn2PC monolayer to be a room-temperature ferromagnetic half-metal with a Curie temperature of 554 K. The spin-up channel hosts type-II Weyl-like crossings at the Fermi level with highly anisotropic band dispersion, whereas the spin-down channel is a wide-gap semiconductor. Topological edge states obtained from tight-binding calculations confirm the non-trivial bulk topology. Spin-orbit coupling opens a small gap of 11.2 meV at the Weyl-like crossings, generating pronounced Berry curvature and a sizable anomalous Hall conductivity near the Fermi level. Based on these properties, we propose topological tunneling magnetoresistance in a Mn2PC-based magnetic tunnel junction: the parallel configuration conducts through fully spin-polarized Weyl-like carriers, while the antiparallel configuration is suppressed by the half-metallic gap, yielding a giant magnetoresistance ratio. The concurrent anomalous Hall effect in the conducting state provides an experimentally accessible signature of the topological carriers. These results identify the Mn2PC monolayer as a promising platform for room-temperature topological spintronic devices.
Submitted 11 March, 2026;
originally announced March 2026.
-
SUBTA: A Framework for Supported User-Guided Bimanual Teleoperation in Structured Assembly
Authors:
Xiao Liu,
Prakash Baskaran,
Songpo Li,
Simon Manschitz,
Wei Ma,
Dirk Ruiken,
Soshi Iba
Abstract:
In human-robot collaboration, shared autonomy enhances human performance through precise, intuitive support. Effective robotic assistance requires accurately inferring human intentions and understanding task structures to determine optimal support timing and methods. In this paper, we present SUBTA, a supported teleoperation system for bimanual assembly that couples learned intention estimation, scene-graph task planning, and context-dependent motion assists. We validate our approach through a user study (N=12) comparing standard teleoperation, motion-support only, and SUBTA. Linear mixed-effects analysis revealed that SUBTA significantly outperformed standard teleoperation in position accuracy (p<0.001, d=1.18) and orientation accuracy (p<0.001, d=1.75), while reducing mental demand (p=0.002, d=1.34). Post-experiment ratings indicate clearer, more trustworthy visual feedback and predictable interventions in SUBTA. The results demonstrate that SUBTA greatly improves both effectiveness and user experience in teleoperation.
Submitted 11 March, 2026;
originally announced March 2026.
-
3-D Trajectory Optimization for Robust Direction Sensing in Movable Antenna Systems
Authors:
Wenyan Ma,
Lipeng Zhu,
Xiaodan Shao,
Rui Zhang
Abstract:
This paper presents a novel wireless sensing system where a movable antenna (MA) continuously moves and receives sensing signals within a three-dimensional (3-D) region to enhance sensing performance compared with conventional fixed-position antenna (FPA)-based sensing. We show that the performance of direction vector estimation for a target is fundamentally related to the 3-D MA trajectory in terms of the mean square angular error lower-bound (MSAEB), which is adopted as a coordinate-invariant performance metric. In particular, the closed-form expression of the MSAEB is derived as a function of the trajectory covariance matrix. Theoretical analysis shows that two-dimensional (2-D) antenna movement suffers from performance divergence for target direction close to the endfire direction of the 2-D MA plane, whereas 3-D movement can achieve isotropic sensing performance over the entire angular region. To achieve robust sensing performance, we formulate a min-max optimization problem to minimize the maximum (worst-case) MSAEB over a given continuous angular region wherein the target is located. An efficient successive convex approximation (SCA) algorithm is developed to optimize the 3-D MA trajectory and obtain a locally optimal solution. Numerical results demonstrate that the proposed 3-D MA sensing scheme is able to significantly reduce the worst-case mean square angular error (MSAE) compared with conventional arrays with FPAs and MA systems with 2-D movement only, thus achieving more accurate and robust direction estimation over the entire angular region.
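The role of the trajectory covariance matrix can be seen in a small numpy check: a 2-D (planar) trajectory yields a singular covariance, with no position diversity along the plane normal, which is why estimation diverges for directions near that plane's endfire, while a 3-D trajectory keeps all eigenvalues positive. The MSAEB expression itself is derived in the paper; this only illustrates the rank argument with example trajectories.

```python
import numpy as np

def trajectory_cov(P):
    """Covariance matrix of sampled antenna positions P (shape T x 3)."""
    Pc = P - P.mean(axis=0)
    return Pc.T @ Pc / len(P)

T = 200
t = np.linspace(0.0, 4 * np.pi, T)
# A 2-D trajectory confined to the x-y plane vs. a 3-D helix
planar = np.stack([np.cos(t), np.sin(t), np.zeros(T)], axis=1)
helix  = np.stack([np.cos(t), np.sin(t), t / (4 * np.pi)], axis=1)

eig2 = np.linalg.eigvalsh(trajectory_cov(planar))
eig3 = np.linalg.eigvalsh(trajectory_cov(helix))
print(eig2)   # smallest eigenvalue is 0: no sensitivity along the plane normal
print(eig3)   # all eigenvalues positive: 3-D movement can sense isotropically
```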
Submitted 11 March, 2026;
originally announced March 2026.
-
Beyond the Prompt in Large Language Models: Comprehension, In-Context Learning, and Chain-of-Thought
Authors:
Yuling Jiao,
Yanming Lai,
Huazhen Lin,
Wensen Ma,
Houduo Qi,
Defeng Sun
Abstract:
Large Language Models (LLMs) have demonstrated remarkable proficiency across diverse tasks, exhibiting emergent properties such as semantic prompt comprehension, In-Context Learning (ICL), and Chain-of-Thought (CoT) reasoning. Despite their empirical success, the theoretical mechanisms driving these phenomena remain poorly understood. This study dives into the foundations of these observations by addressing three critical questions: (1) How do LLMs accurately decode prompt semantics despite being trained solely on a next-token prediction objective? (2) Through what mechanism does ICL facilitate performance gains without explicit parameter updates? and (3) Why do intermediate reasoning steps in CoT prompting effectively unlock capabilities for complex, multi-step problems?
Our results demonstrate that, through the autoregressive process, LLMs are capable of exactly inferring the transition probabilities between tokens across distinct tasks using provided prompts. We show that ICL enhances performance by reducing prompt ambiguity and facilitating posterior concentration on the intended task. Furthermore, we find that CoT prompting activates the model's capacity for task decomposition, breaking complex problems into a sequence of simpler sub-tasks that the model has mastered during the pretraining phase. By comparing their individual error bounds, we provide novel theoretical insights into the statistical superiority of advanced prompt engineering techniques.
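The posterior-concentration argument for ICL can be reproduced in miniature: with a uniform prior over two candidate Bernoulli "tasks", the posterior on the true task sharpens as in-context examples accumulate. The tasks and likelihoods here are a toy construction, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
biases = {"task_A": 0.7, "task_B": 0.3}   # two candidate "tasks"
true_task = "task_A"

def posterior_true(n_examples):
    """Posterior probability of the true task after observing n in-context
    Bernoulli examples, under a uniform prior over the two tasks."""
    x = rng.random(n_examples) < biases[true_task]
    logp = {t: np.sum(np.where(x, np.log(b), np.log(1 - b)))
            for t, b in biases.items()}
    m = max(logp.values())
    z = sum(np.exp(v - m) for v in logp.values())
    return float(np.exp(logp[true_task] - m) / z)

for n in (0, 2, 8, 32):
    # typically rises from 0.5 toward 1 as examples accumulate
    print(n, round(posterior_true(n), 3))
```

With no examples the prompt is maximally ambiguous (posterior 0.5); more examples concentrate the posterior on the intended task, the mechanism the paper credits for ICL's gains.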
Submitted 12 March, 2026; v1 submitted 16 February, 2026;
originally announced March 2026.
-
Kaluza-Klein mode mixing in braneworlds: constraints on scalar absorption and physical degrees of freedom
Authors:
Wen-Xuan Ma,
Chun-E Fu
Abstract:
We investigate the mixing between Kaluza-Klein (KK) modes for a bulk U(1) gauge field within braneworld models. By demanding orthonormality and completeness for the KK basis functions, we demonstrate that the decoupling of mixed sectors, specifically of the vector-scalar and scalar-scalar types, imposes stringent constraints on the warp factors of codimension-d (d>1) backgrounds. We show that the gauge invariance of the four-dimensional effective action is preserved despite such mixing, manifesting as an intrinsic property of the massive vector KK sector. However, the generic presence of vector-scalar mixing fundamentally alters the absorption mechanism of the scalar modes, dynamically shifting the physical masses of the vector KK modes away from their unperturbed eigenvalues. In (4+2)-dimensional models, the existence of two distinct scalar sectors significantly enriches the mixing dynamics. As the massive vectors absorb only specific linear combinations of these scalars, a residual set of massive scalar KK modes persists as physical degrees of freedom.
Submitted 10 March, 2026;
originally announced March 2026.
-
EvoDriveVLA: Evolving Autonomous Driving Vision-Language-Action Model via Collaborative Perception-Planning Distillation
Authors:
Jiajun Cao,
Xiaoan Zhang,
Xiaobao Wei,
Liyuqiu Huang,
Wang Zijian,
Hanzhen Zhang,
Zhengyu Jia,
Wei Mao,
Hao Wang,
Xianming Liu,
Shuchang Zhou,
Yang Wang,
Shanghang Zhang
Abstract:
Vision-Language-Action models have shown great promise for autonomous driving, yet they suffer from degraded perception after unfreezing the visual encoder and struggle with accumulated instability in long-term planning. To address these challenges, we propose EvoDriveVLA, a novel collaborative perception-planning distillation framework that integrates self-anchored perceptual constraints and oracle-guided trajectory optimization. Specifically, self-anchored visual distillation leverages a self-anchored teacher to deliver visual anchoring constraints, regularizing student representations via trajectory-guided key-region awareness. In parallel, oracle-guided trajectory distillation employs a future-aware oracle teacher with coarse-to-fine trajectory refinement and Monte Carlo dropout sampling to produce high-quality trajectory candidates, thereby selecting the optimal trajectory to guide the student's prediction. EvoDriveVLA achieves SOTA performance in open-loop evaluation and significantly enhances performance in closed-loop evaluation. Our code is available at: https://github.com/hey-cjj/EvoDriveVLA.
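Monte Carlo dropout sampling of trajectory candidates can be sketched with a toy linear planner head: independent dropout masks at inference time yield diverse candidates, and a hypothetical oracle score selects the best one. The planner, features, and scoring rule are all stand-ins, not the paper's teacher model.

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=(32, 20))    # stand-in planner head: features -> 10 (x, y) waypoints
feat = rng.normal(size=32)
goal = np.array([5.0, 0.0])

def mc_dropout_candidates(k=16, p_drop=0.3):
    """Sample k trajectory candidates by applying independent inverted-dropout
    masks to the planner input at inference time."""
    cands = []
    for _ in range(k):
        mask = (rng.random(32) >= p_drop) / (1 - p_drop)
        cands.append(((feat * mask) @ W).reshape(10, 2))
    return np.array(cands)

def oracle_score(traj):
    # hypothetical oracle: prefer trajectories ending near the goal
    return np.linalg.norm(traj[-1] - goal)

cands = mc_dropout_candidates()
best = cands[np.argmin([oracle_score(c) for c in cands])]
```

The selected `best` candidate plays the role of the distillation target that guides the student's prediction.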
Submitted 13 March, 2026; v1 submitted 10 March, 2026;
originally announced March 2026.
-
Integrating Heterogeneous Information in Randomized Experiments: A Unified Calibration Framework
Authors:
Wei Ma,
Zeqi Wu,
Zheng Zhang
Abstract:
In modern randomized experiments, large-scale data collection increasingly yields rich baseline covariates and auxiliary information from multiple sources. Such information offers opportunities for more precise treatment effect estimation, but it also raises the challenge of integrating heterogeneous information coherently without compromising validity. Covariate-adaptive randomization (CAR) is widely used to improve covariate balance at the design stage, but it typically balances only a small set of covariates used to form strata, making covariate adjustment at the analysis stage essential for more efficient estimation of treatment effects. Beyond standard covariate adjustment, it is often desirable to incorporate auxiliary information, including cross-stratum information, predictions from various machine learning models, and external data from historical trials or real-world sources. While this auxiliary information is widely available, existing covariate adjustment methods under CAR primarily exploit within-stratum covariates and do not provide a coherent mechanism for integrating it. We propose a unified calibration framework that integrates such information through an information proxy vector and calibration weights defined by a convex optimization problem. The resulting estimator recovers many recent covariate adjustment procedures as special cases while providing a systematic mechanism for both internal and external information borrowing within a single framework. We establish large-sample validity and a no-harm efficiency guarantee, showing that incorporating additional information sources cannot increase asymptotic variance, and we extend the theory to settings in which both the number of strata and the number of information sources grow with the sample size.
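A minimal instance of calibration weights defined by a convex optimization problem is entropy calibration: weights w_i proportional to exp(lambda . g_i), with lambda chosen so the weighted mean of the information proxy vector g hits a target (say, moments from an external source). The proxy, target, and plain gradient-descent solver below are illustrative, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 3
G = rng.normal(size=(n, d))           # information proxy vector per unit
target = np.array([0.2, -0.1, 0.0])   # e.g., moments from an external source

lam = np.zeros(d)
for _ in range(500):
    z = G @ lam
    w = np.exp(z - z.max())
    w /= w.sum()                      # calibration weights w_i ∝ exp(lam·g_i)
    lam -= 0.5 * (w @ G - target)     # gradient step on the convex dual

print(np.round(w @ G, 4))   # weighted proxy means match the target
```

Because the dual is convex, the stationary point exactly enforces the moment constraint, the same mechanism that lets a calibration weighting absorb heterogeneous auxiliary information without re-deriving the estimator case by case.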
Submitted 7 March, 2026;
originally announced March 2026.
-
Thinking with Spatial Code for Physical-World Video Reasoning
Authors:
Jieneng Chen,
Wenxin Ma,
Ruisheng Yuan,
Yunzhi Zhang,
Jiajun Wu,
Alan Yuille
Abstract:
We introduce Thinking with Spatial Code, a framework that transforms RGB video into explicit, temporally coherent 3D representations for physical-world visual question answering. We highlight the empirical finding that our proposed spatial encoder can parse videos into structured spatial code with explicit 3D oriented bounding boxes and semantic labels, enabling large language models (LLMs) to reason directly over explicit spatial variables. Specifically, we propose a spatial encoder that encodes image and geometric features by unifying 6D object parsing and tracking backbones with geometric prediction, and we further finetune LLMs with reinforcement learning using a spatial rubric reward that encourages perspective-aware, geometrically grounded inference. As a result, our model outperforms proprietary vision-language models on VSI-Bench, setting a new state-of-the-art. Code is available at https://github.com/Beckschen/spatialcode.
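What a "spatial code" might look like as a data structure can be sketched as labeled, oriented 3D boxes serialized into compact text tokens an LLM can reason over. The schema, field names, and formatting below are a plausible serialization of our own devising, not the paper's exact format.

```python
from dataclasses import dataclass

@dataclass
class SpatialToken:
    """One entry of a spatial code: a labeled, tracked, oriented 3D box."""
    label: str
    center: tuple      # (x, y, z) in meters
    size: tuple        # (w, h, d) in meters
    yaw_deg: float
    track_id: int

    def to_code(self):
        x, y, z = self.center; w, h, d = self.size
        return (f"<obj id={self.track_id} cls={self.label} "
                f"c=({x:.2f},{y:.2f},{z:.2f}) s=({w:.2f},{h:.2f},{d:.2f}) "
                f"yaw={self.yaw_deg:.0f}>")

scene = [SpatialToken("chair", (1.2, 0.0, 2.5), (0.5, 0.9, 0.5), 90, 0),
         SpatialToken("table", (1.0, 0.0, 3.1), (1.4, 0.7, 0.8), 0, 1)]
code = "\n".join(t.to_code() for t in scene)
print(code)   # explicit spatial variables rendered as text for an LLM
```

Rendering geometry as discrete text variables is what lets the LLM answer metric questions (distances, relative positions) by symbolic reasoning rather than pixel inspection.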
Submitted 5 March, 2026;
originally announced March 2026.
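The "spatial code" interface described above, explicit 3D boxes with semantic labels that an LLM can read as variables, might look something like the following toy sketch. The field names and the scene are hypothetical, not the paper's actual representation:

```python
import math
from dataclasses import dataclass


@dataclass
class SpatialToken:
    """One unit of 'spatial code': a labelled, oriented 3D box (illustrative schema)."""
    label: str
    center: tuple  # (x, y, z) in metres, camera frame
    size: tuple    # (width, height, depth)
    yaw: float     # orientation about the vertical axis, radians


def distance(a: SpatialToken, b: SpatialToken) -> float:
    # Euclidean distance between box centres
    return math.dist(a.center, b.center)


scene = [
    SpatialToken("chair", (1.0, 0.0, 2.0), (0.5, 0.9, 0.5), 0.0),
    SpatialToken("table", (0.0, 0.0, 2.5), (1.2, 0.7, 0.8), 0.0),
]

# An LLM given this structured code can answer "how far is the chair from
# the table?" by operating on explicit variables rather than raw pixels.
print(round(distance(scene[0], scene[1]), 2))  # → 1.12
```

The point of the representation is that geometric questions reduce to deterministic computations over named variables, which is what makes the reasoning "explicit" rather than implicit in visual features.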
-
The Evolution of Eco-routing under Population Growth: Evidence from Six U.S. Cities
Authors:
Zhiheng Shi,
Xiaohan Xu,
Wei Ma,
Kairui Feng,
Bin He
Abstract:
Rapid urban population growth drives car travel demand, increasing transport carbon emissions and posing a critical challenge to sustainable development. Although existing studies have demonstrated that eco-routing can reduce individual emissions, research gaps remain. On the one hand, such personal reductions have a negligible impact on overall emissions, and cannot be simply aggregated to capture the complex effects of large-scale eco-routing. On the other hand, under population growth, the long-term effectiveness of eco-routing, as well as the evolution of its efficiency and traveler route choice, remain underexplored. To address these limitations, this study proposes Time-Only and Time-Carbon user equilibrium (UE) models, integrates them with a demand forecasting method for simulating future network traffic, and designs multi-dimensional metrics to characterize urban dynamics. Using real-world road networks, commuting origin-destination (OD) demand, and population projections under various shared socioeconomic pathways (SSPs) for six representative U.S. cities as a case study, we conduct a comprehensive analysis of urban dynamics across different routing strategies and population sizes. The results reveal that while eco-routing mitigates total emissions, emissions in most cities scale superlinearly with population, a scaling order that remains invariant regardless of routing and construction strategies. Moreover, under population growth, travelers using eco-routing tend to increasingly select shorter routes, giving rise to carbon bottlenecks. A strategy of targeted capacity expansion on these critical bottlenecks (0.46% of links) significantly reduces both emissions (3%) and travel time (28%) without compromising eco-routing efficiency. This study provides a foundation for formulating low-carbon urban transport planning and emission reduction policies.
Submitted 3 March, 2026;
originally announced March 2026.
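The contrast between the Time-Only and Time-Carbon objectives above can be illustrated with a single-traveler shortest-path sketch; a full user-equilibrium assignment would iterate this over congestion-dependent link costs for all OD pairs. The toy network, costs, and carbon weight below are invented:

```python
import heapq


def shortest_path(graph, src, dst, weight):
    """Dijkstra over graph = {node: [(neighbour, attrs_dict), ...]}."""
    dist = {src: 0.0}
    prev = {}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, attrs in graph.get(u, []):
            nd = d + weight(attrs)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(pq, (nd, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


# Toy network: the route via A is fast but high-emission,
# the route via B is slower but cleaner.
G = {
    "O": [("A", {"time": 5, "co2": 9}), ("B", {"time": 8, "co2": 3})],
    "A": [("D", {"time": 5, "co2": 9})],
    "B": [("D", {"time": 8, "co2": 3})],
}

time_only = shortest_path(G, "O", "D", lambda a: a["time"])
time_carbon = shortest_path(G, "O", "D", lambda a: a["time"] + 1.0 * a["co2"])
print(time_only, time_carbon)  # → ['O', 'A', 'D'] ['O', 'B', 'D']
```

Routing all demand this way, and letting link costs rise with flow, is what produces the system-level effects the study measures, such as carbon bottlenecks emerging on the few links that eco-routed travelers converge on.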
-
APAO: Adaptive Prefix-Aware Optimization for Generative Recommendation
Authors:
Yuanqing Yu,
Yifan Wang,
Weizhi Ma,
Zhiqiang Guo,
Min Zhang
Abstract:
Generative recommendation has recently emerged as a promising paradigm in sequential recommendation. It formulates the task as an autoregressive generation process, predicting discrete tokens of the next item conditioned on user interaction histories. Existing generative recommendation models are typically trained with token-level likelihood objectives, such as cross-entropy loss, while employing multi-step beam search during inference to generate ranked item candidates. However, this leads to a fundamental training-inference inconsistency: standard training assumes ground-truth history is always available, ignoring the fact that beam search prunes low-probability branches during inference. Consequently, the correct item may be prematurely discarded simply because its initial tokens (prefixes) have low scores. To address this issue, we propose the Adaptive Prefix-Aware Optimization (APAO) framework, which introduces prefix-level optimization losses to better align the training objective with the inference setting. Furthermore, we design an adaptive worst-prefix optimization strategy that dynamically focuses on the most vulnerable prefixes during training, thereby enhancing the model's ability to retain correct candidates under beam search constraints. We provide theoretical analyses to demonstrate the effectiveness and efficiency of our framework. Extensive experiments on multiple datasets further show that APAO consistently alleviates the training-inference inconsistency and improves performance across various generative recommendation backbones. Our codes are publicly available at https://github.com/yuyq18/APAO.
Submitted 3 March, 2026;
originally announced March 2026.
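The "adaptive worst-prefix" idea above, locating the prefix step at which the ground-truth item's token is most at risk of being pruned by beam search, can be sketched as follows. The toy vocabulary and logits are invented, and this shows only the prefix selection, not the APAO loss that reweights training around it:

```python
def worst_prefix_step(step_logits, gt_tokens):
    """Return (step, rank): the generation step whose ground-truth token is
    ranked worst among the vocabulary, i.e. the prefix most vulnerable to
    being dropped by a narrow beam."""
    worst_step, worst_rank = -1, -1
    for t, (logits, gt) in enumerate(zip(step_logits, gt_tokens)):
        # rank = number of vocabulary tokens scored above the ground truth
        rank = sum(1 for score in logits if score > logits[gt])
        if rank > worst_rank:
            worst_step, worst_rank = t, rank
    return worst_step, worst_rank


# Toy 4-token vocabulary, 3 generation steps; ground-truth item tokens [2, 0, 1]
logits = [
    [0.1, 0.2, 1.5, 0.0],  # gt token 2 ranked first: safe
    [0.3, 2.0, 1.0, 0.9],  # gt token 0 ranked last: vulnerable prefix
    [0.2, 1.1, 0.5, 0.4],  # gt token 1 ranked first: safe
]
print(worst_prefix_step(logits, [2, 0, 1]))  # → (1, 3)
```

Token-level cross-entropy would treat all three steps equally, which is exactly the training-inference inconsistency the abstract describes: a beam of width less than 4 prunes the correct item at step 1 regardless of how well the later steps are modelled.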
-
From Shallow to Deep: Pinning Semantic Intent via Causal GRPO
Authors:
Shuyi Zhou,
Zeen Song,
Wenwen Qiang,
Jiyan Sun,
Yao Zhou,
Yinlong Liu,
Wei Ma
Abstract:
Large Language Models remain vulnerable to adversarial prefix attacks (e.g., "Sure, here is") despite robust standard safety. We diagnose this vulnerability as Shallow Safety Alignment, stemming from a pathology we term semantic representation decay: as the model generates compliant prefixes, its internal malicious intent signal fades. To address this, we propose Two-Stage Causal-GRPO (TSC-GRPO), a framework designed to achieve intent pinning. First, grounded in causal identifiability theory, we train a causal intent probe to disentangle invariant intent from stylistic perturbations. Second, we internalize this causal awareness into the policy via Group Relative Policy Optimization. By employing a cumulative causal penalty within "fork-in-the-road" training scenarios, we force the model to learn that accumulating harmful tokens monotonically decreases reward, enabling robust late-stage refusals. Experiments show that TSC-GRPO significantly outperforms baselines in defending against jailbreak attacks while preserving general utility.
Submitted 3 March, 2026;
originally announced March 2026.
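The cumulative causal penalty and the group-relative step of GRPO described above can be sketched in a few lines. Here a fixed keyword set stands in for the learned causal intent probe, and the rollouts, base reward, and penalty coefficient are invented:

```python
import statistics

# Stand-in for the causal intent probe: in the paper this is a learned model,
# here it is a hypothetical keyword set for illustration only.
HARMFUL = {"bomb", "weapon"}


def shaped_reward(tokens, base_reward, penalty=0.5):
    # Cumulative causal penalty: each additional harmful token
    # strictly lowers the reward, so harm accumulation is never "free".
    n_harmful = sum(tok in HARMFUL for tok in tokens)
    return base_reward - penalty * n_harmful


def group_relative_advantages(rewards):
    # GRPO: normalise rewards within a sampled group of rollouts,
    # so no separate value network is needed.
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sd for r in rewards]


rollouts = [["sure", "here", "is"], ["sure", "bomb"], ["i", "cannot", "help"]]
rewards = [shaped_reward(r, base_reward=1.0) for r in rollouts]
adv = group_relative_advantages(rewards)  # the harmful rollout gets negative advantage
```

Because the penalty is cumulative, a rollout that has already emitted a compliant prefix still loses reward for every further harmful token, which is the training signal that makes late-stage refusal the reward-maximising behaviour.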