-
MemoSight: Unifying Context Compression and Multi-Token Prediction for Reasoning Acceleration
Authors:
Xinyu Liu,
Xin Liu,
Bo Jin,
Runsong Zhao,
Pengcheng Huang,
Junhao Ruan,
Bei Li,
Chunyang Xiao,
Tong Xiao,
Jingbo Zhu
Abstract:
While chain-of-thought (CoT) reasoning enables LLMs to solve challenging reasoning problems, it faces scaling issues in both speed and memory usage because the KV cache grows linearly with the number of generated tokens. In this work, we propose MemoSight (Memory-Foresight-based reasoning), a unified framework that integrates both context compression and multi-token prediction to mitigate these efficiency issues while maintaining CoT reasoning performance. Our framework adopts the same minimalist design for both context compression and multi-token prediction, realized via special tokens and a position layout tailored to each token type. Comprehensive experiments on four reasoning benchmarks demonstrate that MemoSight reduces the KV cache footprint by up to 66% and accelerates inference by 1.56x, while outperforming existing CoT compression methods.
Submitted 16 April, 2026;
originally announced April 2026.
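The linear KV-cache growth the MemoSight abstract refers to is easy to quantify. The sketch below is a back-of-envelope calculation with made-up model dimensions (layer count, head count, and head size are illustrative assumptions, not MemoSight's), showing how a 66% footprint reduction plays out on a long CoT trace:

```python
# Back-of-envelope KV-cache size for a decoder-only transformer.
# All model dimensions here are illustrative assumptions.

def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    # Keys and values: 2 tensors per layer, each n_tokens x n_kv_heads x head_dim.
    return 2 * n_layers * n_tokens * n_kv_heads * head_dim * bytes_per_elem

full = kv_cache_bytes(32_000)        # a long CoT trace
compressed = int(full * (1 - 0.66))  # the 66% reduction reported above
print(f"{full / 2**30:.2f} GiB -> {compressed / 2**30:.2f} GiB")
```

Because the cache is linear in the token count, any fixed compression ratio translates directly into the same ratio of memory saved, whatever the trace length.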
-
DR$^{3}$-Eval: Towards Realistic and Reproducible Deep Research Evaluation
Authors:
Qianqian Xie,
Qingheng Xiong,
He Zhu,
Tiantian Xia,
Xueming Han,
Fanyu Meng,
Jiakai Wang,
Zhiqi Bai,
Chengkang Jiang,
Zhaohui Wang,
Yubin Guo,
Yuqing Wen,
Jiayang Mao,
Zijie Zhang,
Shihao Li,
Yanghai Wang,
Yuxiang Ren,
Junlan Feng,
Jiaheng Liu
Abstract:
Deep Research Agents (DRAs) aim to solve complex, long-horizon research tasks involving planning, retrieval, multimodal understanding, and report generation, yet their evaluation remains challenging due to dynamic web environments and ambiguous task definitions. We propose DR$^{3}$-Eval, a realistic and reproducible benchmark for evaluating deep research agents on multimodal, multi-file report generation. DR$^{3}$-Eval is constructed from authentic user-provided materials and paired with a per-task static research sandbox corpus that simulates open-web complexity while remaining fully verifiable, containing supportive documents, distractors, and noise. Moreover, we introduce a multi-dimensional evaluation framework measuring Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality, and validate its alignment with human judgments. Experiments with our developed multi-agent system DR$^{3}$-Agent based on multiple state-of-the-art language models demonstrate that DR$^{3}$-Eval is highly challenging and reveals critical failure modes in retrieval robustness and hallucination control. Our code and data are publicly available.
Submitted 16 April, 2026;
originally announced April 2026.
-
ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding
Authors:
Tianze Xia,
Zijian Ning,
Zonglin Zhao,
Mingjia Wang
Abstract:
Subject-driven image generation has shown great success in creating personalized content, but its capabilities are largely confined to single subjects in common poses. Current approaches face a fundamental conflict when handling multiple subjects with complex, distinct actions: preserving individual identities while enforcing precise pose structures. This challenge often leads to identity fusion and pose distortion, as appearance and structure signals become entangled within the model's architecture. To resolve this conflict, we introduce ASTRA (Adaptive Synthesis through Targeted Retrieval Augmentation), a novel framework that architecturally disentangles subject appearance from pose structure within a unified Diffusion Transformer. ASTRA achieves this through a dual-pronged strategy. It first employs a Retrieval-Augmented Pose (RAG-Pose) pipeline to provide a clean, explicit structural prior from a curated database. Then, its core generative model learns to process these dual visual conditions using our Enhanced Universal Rotary Position Embedding (EURoPE), an asymmetric encoding mechanism that decouples identity tokens from spatial locations while binding pose tokens to the canvas. Concurrently, a Disentangled Semantic Modulation (DSM) adapter offloads the identity preservation task into the text conditioning stream. Extensive experiments demonstrate that our integrated approach achieves superior disentanglement. On our designed COCO-based complex pose benchmark, ASTRA achieves a new state-of-the-art in pose adherence, while maintaining high identity fidelity and text alignment in DreamBench.
Submitted 15 April, 2026;
originally announced April 2026.
-
PepBenchmark: A Standardized Benchmark for Peptide Machine Learning
Authors:
Jiahui Zhang,
Rouyi Wang,
Kuangqi Zhou,
Tianshu Xiao,
Lingyan Zhu,
Yaosen Min,
Yang Wang
Abstract:
Peptide therapeutics are widely regarded as the "third generation" of drugs, yet progress in peptide Machine Learning (ML) is hindered by the absence of standardized benchmarks. Here we present PepBenchmark, which unifies datasets, preprocessing, and evaluation protocols for peptide drug discovery. PepBenchmark comprises three components: (1) PepBenchData, a well-curated collection of 29 canonical-peptide and 6 non-canonical-peptide datasets across 7 groups, systematically covering key aspects of peptide drug development, representing, to the best of our knowledge, the most comprehensive AI-ready dataset resource to date; (2) PepBenchPipeline, a standardized preprocessing pipeline that ensures consistent dataset cleaning, construction, splitting, and feature transformation, mitigating quality issues common in ad hoc pipelines; and (3) PepBenchLeaderboard, a unified evaluation protocol and leaderboard with strong baselines across 4 major methodological families: Fingerprint-based, GNN-based, PLM-based, and SMILES-based models. Together, PepBenchmark provides the first standardized and comparable foundation for peptide drug discovery, facilitating methodological advances and translation into real-world applications. The data and code are publicly available at https://github.com/ZGCI-AI4S-Pep/PepBenchmark/.
Submitted 12 April, 2026;
originally announced April 2026.
-
CodeQuant: Unified Clustering and Quantization for Enhanced Outlier Smoothing in Low-Precision Mixture-of-Experts
Authors:
Xiangyang Yin,
Xingyu Liu,
Tianhua Xia,
Bo Bao,
Vithursan Thangarasa,
Valavan Manohararajah,
Eric Sather,
Sai Qian Zhang
Abstract:
Outliers have emerged as a fundamental bottleneck in preserving accuracy for low-precision large models, particularly within Mixture-of-Experts (MoE) architectures that are increasingly central to large-scale language modeling. Under post-training quantization (PTQ), these outliers induce substantial quantization errors, leading to severe accuracy degradation. While recent rotation-based smoothing techniques alleviate the problem by redistributing outlier magnitudes, residual errors remain and continue to impede reliable low-precision deployment.
In this work, we tackle this challenge by introducing \textit{CodeQuant}, a unified quantization-and-clustering scheme that smooths activation outliers via learnable rotations and absorbs weight outliers into fine-tuned cluster centroids for MoE models. This design reduces the influence of extreme values by fitting them within cluster centroids, thereby lowering quantization error while maintaining expressive capacity. Coupled with a dedicated kernel design for GPU and CPU, CodeQuant achieves up to $4.15\times$ speedup while delivering significantly higher accuracy than state-of-the-art quantization approaches across diverse MoE models. Our results highlight CodeQuant as a promising direction for efficient and accurate deployment of MoE-based large language models under low-precision constraints. Our code is available at https://github.com/SAI-Lab-NYU/CodeQuant.
Submitted 12 April, 2026;
originally announced April 2026.
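The clustering half of the CodeQuant abstract can be illustrated with plain codebook quantization: every weight is snapped to its nearest cluster centroid, so a handful of centroids represent the whole tensor and extreme values are absorbed into the cluster they fall in. The sketch below uses 1-D k-means as a stand-in; CodeQuant's learnable rotations and centroid fine-tuning are omitted, so read this as the general idea only:

```python
import numpy as np

# Clustering-based weight quantization sketch: each weight is replaced by its
# nearest codebook centroid, fitted with plain 1-D k-means. Illustrative only;
# CodeQuant's learnable rotations and centroid fine-tuning are not modeled.

def kmeans_quantize(w, n_clusters=16, n_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    flat = w.ravel()
    centroids = rng.choice(flat, size=n_clusters, replace=False)
    for _ in range(n_iters):
        assign = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(n_clusters):
            members = flat[assign == k]
            if members.size:                  # guard against empty clusters
                centroids[k] = members.mean()
    assign = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids[assign].reshape(w.shape)

w = np.random.default_rng(1).normal(size=(64, 64))
wq = kmeans_quantize(w)
print("distinct values:", np.unique(wq).size)
```

With 16 centroids, each weight index fits in 4 bits, which is where the low-precision memory savings come from.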
-
AdaQE-CG: Adaptive Query Expansion for Web-Scale Generative AI Model and Data Card Generation
Authors:
Haoxuan Zhang,
Ruochi Li,
Zhenni Liang,
Mehri Sattari,
Phat Vo,
Collin Qu,
Ting Xiao,
Junhua Ding,
Yang Zhang,
Haihua Chen
Abstract:
Transparent and standardized documentation is essential for building trustworthy generative AI (GAI) systems. However, existing automated methods for generating model and data cards still face three major challenges: (i) static templates, as most systems rely on fixed query templates that cannot adapt to diverse paper structures or evolving documentation requirements; (ii) information scarcity, since web-scale repositories such as Hugging Face often contain incomplete or inconsistent metadata, leading to missing or noisy information; and (iii) lack of benchmarks, as the absence of standardized datasets and evaluation protocols hinders fair and reproducible assessment of documentation quality. To address these limitations, we propose AdaQE-CG, an Adaptive Query Expansion for Card Generation framework that combines dynamic information extraction with cross-card knowledge transfer. Its Intra-Paper Extraction via Context-Aware Query Expansion (IPE-QE) module iteratively refines extraction queries to recover richer and more complete information from scientific papers and repositories, while its Inter-Card Completion using the MetaGAI Pool (ICC-MP) module fills missing fields by transferring semantically relevant content from similar cards in a curated dataset. In addition, we introduce MetaGAI-Bench, the first large-scale, expert-annotated benchmark for evaluating GAI documentation. Comprehensive experiments across five quality dimensions show that AdaQE-CG substantially outperforms existing approaches, exceeds human-authored data cards, and approaches human-level quality for model cards. Code, prompts, and data are publicly available at: https://github.com/haoxuan-unt2024/AdaQE-CG.
Submitted 16 March, 2026;
originally announced April 2026.
-
Active noise cancellation on open-ear smart glasses
Authors:
Kuang Yuan,
Freddy Yifei Liu,
Tong Xiao,
Yiwen Song,
Chengyi Shen,
Saksham Bhutani,
Justin Chan,
Swarun Kumar
Abstract:
Smart glasses are becoming an increasingly prevalent wearable platform, with audio as a key interaction modality. However, hearing in noisy environments remains challenging because smart glasses are equipped with open-ear speakers that do not seal the ear canal. Furthermore, the open-ear design is incompatible with conventional active noise cancellation (ANC) techniques, which rely on an error microphone inside or at the entrance of the ear canal to measure the residual sound heard after cancellation. Here we present the first real-time ANC system for open-ear smart glasses that suppresses environmental noise using only microphones and miniaturized open-ear speakers embedded in the glasses frame. Our low-latency computational pipeline estimates the noise at the ear from an array of eight microphones distributed around the glasses frame and generates an anti-noise signal in real-time to cancel environmental noise. We develop a custom glasses prototype and evaluate it in a user study across 8 environments under mobility in the 100--1000 Hz frequency range, where environmental noise is concentrated. We achieve a mean noise reduction of 9.6 dB without any calibration, and 11.2 dB with a brief user-specific calibration.
Submitted 7 April, 2026;
originally announced April 2026.
-
SesQ: A Surface Electrostatic Simulator for Precise Energy Participation Ratio Simulation in Superconducting Qubits
Authors:
Ziang Wang,
Shuyuan Guan,
Feng Wu,
Xiaohang Zhang,
Qiong Li,
Jianxin Chen,
Xin Wan,
Tian Xia,
Hui-Hai Zhao
Abstract:
An accurate and efficient numerical electromagnetic model for superconducting qubits is essential for characterizing and minimizing design-dependent dielectric losses. The energy participation ratio (EPR) is the commonly adopted metric used to evaluate these losses, but its calculation presents a severe multiscale computational challenge. The conventional finite element method (FEM) requires 3D volumetric meshing, leading to prohibitive computational costs and memory requirements when attempting to capture singular electric fields at nanometer-thin material interfaces. To address this bottleneck, we propose SesQ, a surface integral equation simulator tailored for the precise simulation of the EPR. By applying discretization on 2D surfaces, deriving a semi-analytical multilayer Green's function, and employing a dedicated non-conformal boundary mesh refinement scheme, SesQ accurately resolves singular edge fields without an explosive growth in the number of unknowns. Validations with analytically solvable models demonstrate that SesQ accelerates capacitance extraction by roughly two orders of magnitude compared to commercial FEM tools. While achieving comparable accuracy for capacitance extraction, SesQ delivers superior precision for EPR calculation. Simulations of practical transmon qubits further reveal that FEM approaches tend to significantly underestimate the EPR. Finally, the high efficiency of SesQ enables rapid iteration in layout optimization, as demonstrated by minimizing the EPR of the qubit pattern, establishing the simulator as a powerful tool for the automated design of low-loss superconducting quantum circuits.
Submitted 30 March, 2026;
originally announced March 2026.
-
Rethinking Structure Preservation in Text-Guided Image Editing with Visual Autoregressive Models
Authors:
Tao Xia,
Jiawei Liu,
Yukun Zhang,
Ting Liu,
Wei Wang,
Lei Zhang
Abstract:
Visual autoregressive (VAR) models have recently emerged as a promising family of generative models, enabling a wide range of downstream vision tasks such as text-guided image editing. By shifting the editing paradigm from noise manipulation in diffusion-based methods to token-level operations, VAR-based approaches achieve better background preservation and significantly faster inference. However, existing VAR-based editing methods still face two key challenges: accurately localizing editable tokens and maintaining structural consistency in the edited results. In this work, we propose a novel text-guided image editing framework rooted in an analysis of intermediate feature distributions within VAR models. First, we introduce a coarse-to-fine token localization strategy that can refine editable regions, balancing editing fidelity and background preservation. Second, we analyze the intermediate representations of VAR models and identify structure-related features, by which we design a simple yet effective feature injection mechanism to enhance structural consistency between the edited and source images. Third, we develop a reinforcement learning-based adaptive feature injection scheme that automatically learns scale- and layer-specific injection ratios to jointly optimize editing fidelity and structure preservation. Extensive experiments demonstrate that our method achieves superior structural consistency and editing quality compared with state-of-the-art approaches, across both local and global editing scenarios.
Submitted 30 March, 2026;
originally announced March 2026.
-
Scaling of Long-Range Loop-Erased Random Walks
Authors:
Tianning Xiao,
Xianzhi Pan,
Zhijie Fan,
Youjin Deng
Abstract:
We study the scaling properties of long-range loop-erased random walks (LR-LERW), where the underlying random walker performs Lévy-flight-like jumps with a power-law step-length distribution $P(\mathbf{r})\sim |\mathbf{r}|^{-(d+\sigma)}$. Using extensive Monte Carlo simulations, we measure the scaling relation $N \sim R^{d_N}$ between the loop-erased step number $N$ and the spatial extent $R$, and determine the geometric exponent $d_N$ for various values of $\sigma$ in spatial dimensions $d = 1, 2,$ and $3$, as well as at the marginal point $\sigma = 2$ in $d=4$ and $5$. We observe a continuous crossover from long-range (LR) to short-range (SR) behavior as $\sigma$ increases. Below the upper critical dimension $d<d_c=4$, for $\sigma < d/2$, loop erasure is asymptotically irrelevant and $d_N=\sigma$, consistent with Lévy-flight scaling. For $d/2 < \sigma < 2$, loop erasure becomes relevant and $d_N$ varies continuously toward the SR-LERW value. At the marginal points with $\sigma=d/2$ or $\sigma=2$, clear logarithmic corrections are observed. At and above the upper critical dimension, $d \geq 4$, the scaling at $\sigma=2$ is found to be $N \sim R^2/\ln R$, consistent with that of the corresponding Lévy flight. Our results provide a systematic numerical determination of $d_N(\sigma)$ for the LR-LERW across dimensions, and are consistent with $\sigma_* = 2$ as the boundary between LR and SR critical behaviors recently established in a broad variety of statistical models.
Submitted 29 March, 2026;
originally announced March 2026.
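The loop-erasure operation at the heart of a LERW is simple to state: walk through the trajectory in order, and whenever a site is revisited, erase the loop just closed. The sketch below implements that chronological erasure on the Z^2 lattice; the pareto-based jump sampler is a crude stand-in for the paper's $P(\mathbf{r})\sim |\mathbf{r}|^{-(d+\sigma)}$ step distribution, so only the erasure mechanics should be read literally:

```python
import numpy as np

# Chronological loop erasure for a Levy-flight-like walk on Z^2.

def loop_erase(path):
    """On revisiting a site, erase the loop just closed and continue."""
    out, index = [], {}
    for site in path:
        if site in index:                     # loop closed at `site`
            del out[index[site] + 1:]
            index = {s: i for i, s in enumerate(out)}
        else:
            index[site] = len(out)
            out.append(site)
    return out

def levy_walk(n_steps, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    pos, path = (0, 0), [(0, 0)]
    for _ in range(n_steps):
        r = int(rng.pareto(sigma)) + 1        # heavy-tailed jump length
        dx, dy = [(r, 0), (-r, 0), (0, r), (0, -r)][rng.integers(4)]
        pos = (pos[0] + dx, pos[1] + dy)
        path.append(pos)
    return path

path = levy_walk(2000)
lerw = loop_erase(path)
print(f"{len(path)} visited sites -> {len(lerw)} after loop erasure")
```

Measuring how the erased length $N$ scales with the spatial extent $R$ of such paths is exactly the $N \sim R^{d_N}$ relation the study estimates.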
-
MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning
Authors:
Chenglong Wang,
Yifu Huo,
Yang Gan,
Qiaozhi He,
Qi Meng,
Bei Li,
Yan Wang,
Junfu Liu,
Tianhua Zhou,
Jingbo Zhu,
Tong Xiao
Abstract:
Recent advances in multimodal reward modeling have been largely driven by a paradigm shift from discriminative to generative approaches. Building on this progress, recent studies have further employed reinforcement learning from verifiable rewards (RLVR) to enhance multimodal reward models (MRMs). Despite their success, RLVR-based training typically relies on labeled multimodal preference data, which are costly and labor-intensive to obtain, making it difficult to scale MRM training. To overcome this limitation, we propose a Multi-Stage Reinforcement Learning (MSRL) approach, which can achieve scalable RL for MRMs with limited multimodal data. MSRL replaces the conventional RLVR-based training paradigm by first learning a generalizable reward reasoning capability from large-scale textual preference data, and then progressively transferring this capability to multimodal tasks through caption-based and fully multimodal reinforcement-learning stages. Furthermore, we introduce a cross-modal knowledge distillation approach to improve preference generalization within MSRL. Extensive experiments demonstrate that MSRL effectively scales the RLVR-based training of generative MRMs and substantially improves their performance across both visual understanding and visual generation tasks (e.g., from 66.6% to 75.9% on VL-RewardBench and from 70.2% to 75.7% on GenAI-Bench), without requiring additional multimodal preference annotations. Our code is available at: https://github.com/wangclnlp/MSRL.
Submitted 26 March, 2026;
originally announced March 2026.
-
Data-Free Layer-Adaptive Merging via Fisher Information for Long-to-Short Reasoning LLMs
Authors:
Tian Xia
Abstract:
Model merging has emerged as a practical approach to combine capabilities of specialized large language models (LLMs) without additional training. In the Long-to-Short (L2S) scenario, merging a base model with a long-chain-of-thought reasoning model aims to preserve reasoning accuracy while reducing output length. Existing methods rely on Task Arithmetic and its variants, which implicitly assume that model outputs vary linearly with the merging coefficient -- an assumption we show is systematically violated in L2S settings. We provide the first theoretical justification for layer-adaptive merging: we prove that merging error is bounded by a term proportional to the per-layer Hessian norm (Proposition~1), and establish that the Fisher Information Matrix (FIM) is a principled, computable proxy for this bound via the Fisher-Hessian equivalence at local optima. Building on this theory, we propose \textbf{FIM-Merging}, which computes diagonal FIM using only random token inputs (no domain-specific calibration data required) and uses it to assign per-layer merging coefficients. On the 7B L2S benchmark, FIM-TIES achieves state-of-the-art performance on five out of six evaluation benchmarks, including a \textbf{+6.2} point gain on MATH500 over ACM-TIES (90.2 vs.\ 84.0), while requiring no calibration data. On the 1.5B benchmark, FIM-TIES achieves an average accuracy of \textbf{47.3}, surpassing the previous best ACM-TIES (43.3) by \textbf{+3.9} points, while reducing average response length by \textbf{91.9\%} relative to the long-CoT model. Our framework also provides a unified theoretical explanation for why existing layer-adaptive methods such as ACM empirically outperform uniform merging.
Submitted 23 March, 2026;
originally announced March 2026.
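The mechanics of layer-adaptive merging can be sketched in a few lines: each layer gets its own interpolation coefficient, scaled down where a diagonal-Fisher-style sensitivity score is high. The inverse-Fisher scaling rule and the toy "models" below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

# Toy layer-adaptive merging: high-sensitivity layers (large Fisher score)
# receive a smaller task-arithmetic step. Scaling rule is an assumption.

def layer_coeffs(fisher_per_layer, base=0.5):
    f = np.asarray(fisher_per_layer, dtype=float)
    return base * f.min() / f                # sensitive layers move less

def merge(base_model, donor_model, coeffs):
    return [wb + c * (wd - wb)               # per-layer task-arithmetic step
            for wb, wd, c in zip(base_model, donor_model, coeffs)]

rng = np.random.default_rng(0)
base = [rng.normal(size=(4, 4)) for _ in range(3)]
donor = [w + rng.normal(scale=0.1, size=w.shape) for w in base]
fisher = [1.0, 4.0, 2.0]                     # pretend per-layer Fisher traces
merged = merge(base, donor, layer_coeffs(fisher))
print("per-layer coefficients:", layer_coeffs(fisher))
```

Uniform merging is the special case where every layer gets the same coefficient; the bound in the abstract's Proposition 1 is the argument for why departing from that is justified.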
-
IMMSched: Interruptible Multi-DNN Scheduling via Parallel Multi-Particle Optimizing Subgraph Isomorphism
Authors:
Boran Zhao,
Hetian Liu,
Zihang Yuan,
Yanbin Hu,
Wenzhe Zhao,
Tian Xia,
Pengju Ren
Abstract:
The growing demand for multi-DNN workloads with unpredictable task arrival times has highlighted the need for interruptible scheduling on edge accelerators. However, existing preemptive frameworks typically assume known task arrival times and rely on CPU-based offline scheduling, which incurs heavy runtime overhead and struggles to handle unpredictable task arrivals. Even worse, prior studies have shown that multi-DNN scheduling requires solving an NP-hard subgraph isomorphism problem on large directed acyclic graphs within limited time, which is extremely challenging. To tackle this, we propose IMMSched, a parallel subgraph isomorphism method that combines Multi-Particle Optimization with the Ullmann algorithm based on a probabilistic continuous-relaxation scheme, eliminating the serial data dependencies of previous works. Finally, a quantized scheduling scheme and a global controller in the hardware architecture further combine multi-particle results for consensus-guided exploration. Evaluations demonstrate that IMMSched achieves orders-of-magnitude reductions in scheduling latency and energy consumption, enabling real-time execution of unpredictable DNN tasks on edge accelerators.
Submitted 23 March, 2026;
originally announced March 2026.
-
Positional Segmentor-Guided Counterfactual Fine-Tuning for Spatially Localized Image Synthesis
Authors:
Tian Xia,
Matthew Sinclair,
Andreas Schuh,
Fabio De Sousa Ribeiro,
Raghav Mehta,
Rajat Rasal,
Esther Puyol-Antón,
Samuel Gerber,
Kersten Petersen,
Michiel Schaap,
Ben Glocker
Abstract:
Counterfactual image generation enables controlled data augmentation, bias mitigation, and disease modeling. However, existing methods guided by external classifiers or regressors are limited to subject-level factors (e.g., age) and fail to produce localized structural changes, often resulting in global artifacts. Pixel-level guidance using segmentation masks has been explored, but requires user-defined counterfactual masks, which are tedious and impractical. Segmentor-guided Counterfactual Fine-Tuning (Seg-CFT) addressed this by using segmentation-derived measurements to supervise structure-specific variables, yet it remains restricted to global interventions. We propose Positional Seg-CFT, which subdivides each structure into regional segments and derives independent measurements per region, enabling spatially localized and anatomically coherent counterfactuals. Experiments on coronary CT angiography show that Pos-Seg-CFT generates realistic, region-specific modifications, providing finer spatial control for modeling disease progression.
Submitted 22 March, 2026;
originally announced March 2026.
-
An experimental study of KV cache reuse strategies in chunk-level caching systems
Authors:
Samuel Cestola,
Tianxiang Xia,
Zheng Weiyan,
Zheng Pengfei,
Diego Didona
Abstract:
Retrieval-augmented generation improves large language models' accuracy by adding relevant retrieved text to the prompt. Chunk-level caching (CLC) accelerates inference by precomputing KV caches for these retrieved chunks and reusing them. However, these caches miss cross-attention dependencies between chunks, which can reduce output quality. Several methods try to improve CLC accuracy using different techniques. We make two main contributions. First, we show that existing CLC approaches have fundamental limitations that constrain their accuracy or their applicability. We back this conclusion with an extensive experimental evaluation of CLC systems. Second, we observe that existing CLC techniques are complementary. We leverage this insight to propose a new CLC design that carefully combines them and achieves better accuracy.
Submitted 3 March, 2026;
originally announced March 2026.
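The reuse mechanism behind chunk-level caching reduces to memoizing the prefill result per chunk. The minimal sketch below shows just that bookkeeping; `compute_kv` is a hypothetical stand-in for a real prefill call, and real CLC systems must additionally handle the cross-chunk attention dependencies the abstract discusses:

```python
import hashlib

# Minimal chunk-level KV cache: each retrieved chunk's KV entry is computed
# once and reused across prompts. compute_kv is a hypothetical prefill stub.

class ChunkKVCache:
    def __init__(self, compute_kv):
        self.compute_kv = compute_kv
        self.store = {}
        self.hits = 0

    def get(self, chunk_text):
        key = hashlib.sha256(chunk_text.encode()).hexdigest()
        if key in self.store:
            self.hits += 1                   # reuse: skip prefill entirely
        else:
            self.store[key] = self.compute_kv(chunk_text)
        return self.store[key]

cache = ChunkKVCache(compute_kv=lambda t: f"kv({len(t)} chars)")
for chunk in ["doc A", "doc B", "doc A"]:
    cache.get(chunk)
print("hits:", cache.hits, "entries:", len(cache.store))
```

The speed/quality tension in the paper comes precisely from what this sketch ignores: a cached chunk's KV entries were computed without attending to the other chunks in the final prompt.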
-
PoC: Performance-oriented Context Compression for Large Language Models via Performance Prediction
Authors:
Runsong Zhao,
Shilei Liu,
Jiwei Tang,
Langming Liu,
Haibin Chen,
Weidong Zhang,
Yujin Yuan,
Tong Xiao,
Jingbo Zhu,
Wenbo Su,
Bo Zheng
Abstract:
While context compression can mitigate the growing inference costs of Large Language Models (LLMs) by shortening contexts, existing methods that specify a target compression ratio or length suffer from unpredictable performance degradation, hindering their reliable deployment. We introduce a paradigm shift to Performance-oriented Context Compression (PoC), where developers specify an acceptable performance floor instead of a compression ratio. PoC employs a lightweight performance predictor to automatically find the most aggressive compression ratio that satisfies this constraint before steering an off-the-shelf compressor. We design and compare two predictor variants: a simple context-agnostic predictor and a more sophisticated context-aware one that considers the input's inherent compressibility. On both question-answering and summarization benchmarks, the context-aware predictor consistently achieves lower performance prediction error than the context-agnostic predictor, while the resulting context-aware PoC attains a superior overall performance. Our work paves the way for a more reliable, efficient, and performance-aware deployment of context compression for LLMs.
Submitted 20 March, 2026;
originally announced March 2026.
-
Wearable Foundation Models Should Go Beyond Static Encoders
Authors:
Yu Yvonne Wu,
Yuwei Zhang,
Hyungjun Yoon,
Ting Dang,
Dimitris Spathis,
Tong Xia,
Qiang Yang,
Jing Han,
Dong Ma,
Sung-Ju Lee,
Cecilia Mascolo
Abstract:
Wearable foundation models (WFMs), trained on large volumes of data collected by affordable, always-on devices, have demonstrated strong performance on short-term, well-defined health monitoring tasks, including activity recognition, fitness tracking, and cardiovascular signal assessment. However, most existing WFMs primarily map short temporal windows to predefined labels via static encoders, emphasizing retrospective prediction rather than reasoning over evolving personal history, context, and future risk trajectories. As a result, they are poorly suited for modeling chronic, progressive, or episodic health conditions that unfold over weeks, months or years. Hence, we argue that WFMs must move beyond static encoders and be explicitly designed for longitudinal, anticipatory health reasoning. We identify three foundational shifts required to enable this transition: (1) Structurally rich data, which goes beyond isolated datasets or outcome-conditioned collection to integrated multimodal, long-term personal trajectories, and contextual metadata, ideally supported by open and interoperable data ecosystems; (2) Longitudinal-aware multimodal modeling, which prioritizes long-context inference, temporal abstraction, and personalization over cross-sectional or population-level prediction; and (3) Agentic inference systems, which move beyond static prediction to support planning, decision-making, and clinically grounded intervention under uncertainty. Together, these shifts reframe wearable health monitoring from retrospective signal interpretation toward continuous, anticipatory, and human-aligned health support.
Submitted 19 March, 2026;
originally announced March 2026.
-
DaPT: A Dual-Path Framework for Multilingual Multi-hop Question Answering
Authors:
Yilin Wang,
Yuchun Fan,
Jiaoyang Li,
Ziming Zhu,
Yongyu Mu,
Qiaozhi He,
Tong Xiao,
Jingbo Zhu
Abstract:
Retrieval-augmented generation (RAG) systems have made significant progress in solving complex multi-hop question answering (QA) tasks in the English scenario. However, RAG systems inevitably face the application scenario of retrieving across multilingual corpora and queries, leaving several open challenges. The first one involves the absence of benchmarks that assess RAG systems' capabilities under the multilingual multi-hop (MM-hop) QA setting. The second centers on the overreliance on LLMs' strong semantic understanding in English, which diminishes effectiveness in multilingual scenarios. To address these challenges, we first construct multilingual multi-hop QA benchmarks by translating English-only benchmarks into five languages, and then we propose DaPT, a novel multilingual RAG framework. DaPT generates sub-question graphs in parallel for both the source-language query and its English translation counterpart, then merges them before employing a bilingual retrieval-and-answer strategy to sequentially solve sub-questions. Our experimental results demonstrate that advanced RAG systems suffer from a significant performance imbalance in multilingual scenarios. Furthermore, our proposed method consistently yields more accurate and concise answers compared to the baselines, significantly enhancing RAG performance on this task. For instance, on the most challenging MuSiQue benchmark, DaPT achieves a relative improvement of 18.3% in average EM score over the strongest baseline.
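The dual-path flow can be sketched as follows; `translate`, `decompose`, and `retrieve_answer` are hypothetical stand-ins for the framework's translator, sub-question generator, and bilingual retrieval-and-answer components:

```python
def dual_path_answer(query, translate, decompose, retrieve_answer):
    """Sketch of a dual-path multilingual multi-hop flow (all four
    callables are invented stand-ins, not DaPT's actual components)."""
    en_query = translate(query)
    # Decompose both the source-language query and its English
    # translation, then merge the two sub-question sets (dedup, keep order).
    subs = list(dict.fromkeys(decompose(query) + decompose(en_query)))
    answers = {}
    for sq in subs:            # solve sequentially, reusing earlier answers
        answers[sq] = retrieve_answer(sq, answers)
    return answers

# Toy components for a two-hop question in German.
toy = dual_path_answer(
    "wer schrieb das Buch, das X inspirierte?",
    translate=lambda q: "who wrote the book that inspired X?",
    decompose=lambda q: [f"hop1({q[:3]})", "hop2(shared)"],
    retrieve_answer=lambda sq, ctx: f"ans:{sq}",
)
```

The merge step is the key idea: sub-questions that only one language path surfaces are still solved, while shared sub-questions are answered once.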
Submitted 19 March, 2026;
originally announced March 2026.
-
Characterization of Deconvolution-Based PMT Waveform Reconstruction Under Large Charge Dynamic Range and Varying Scintillation Time Profiles
Authors:
Xingyi Lin,
Jinghuan Xu,
Yongbo Huang,
Jingzhe Tang,
Tianying Xiao,
Yingke Li
Abstract:
Photomultiplier tubes (PMTs) are widely used as photon sensors for neutrino and dark matter detection. Accurate charge and time information extracted from PMT waveforms is crucial for event reconstruction. An algorithm based on deconvolution technology was proposed and applied to the reconstruction of PMT waveforms. This study further investigated the reliability of the deconvolution algorithm when handling a large charge dynamic range (0-200 photoelectrons), varying scintillation time profiles, and muon-induced large signals. Monte Carlo data confirmed that the deconvolution algorithm exhibits relatively stable reconstruction performance: the residual non-linearity of charge reconstruction is controlled to approximately 1% over the range of 0 to 200 photoelectrons for various configurations of undershoots and scintillation time profiles, and the algorithm is capable of handling muon-induced large signals.
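As a rough illustration of the underlying operation (not the paper's actual reconstruction code), frequency-domain deconvolution of a waveform by an assumed single-photoelectron response looks like:

```python
import numpy as np

def deconvolve(waveform, spe_template, noise_floor=1e-3):
    """Frequency-domain deconvolution of a PMT waveform by the
    single-photoelectron (SPE) response (illustrative sketch).
    A small noise floor regularizes division near template zeros."""
    n = len(waveform)
    W = np.fft.rfft(waveform, n)
    S = np.fft.rfft(spe_template, n)
    S_reg = np.where(np.abs(S) < noise_floor, noise_floor, S)
    return np.fft.irfft(W / S_reg, n)

# Toy example: two photoelectrons at samples 20 and 50, each producing
# an exponential SPE pulse; deconvolution recovers sharp peaks whose
# amplitudes carry the charge information.
t = np.arange(200)
spe = np.exp(-t / 5.0)                      # assumed SPE pulse shape
truth = np.zeros(200); truth[20] = 1.0; truth[50] = 2.0
waveform = np.convolve(truth, spe)[:200]
recovered = deconvolve(waveform, spe)
```

In this noiseless toy the inversion is essentially exact; with real electronics noise and undershoots, the regularization and template modeling dominate the residual non-linearity the abstract quantifies.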
Submitted 20 March, 2026; v1 submitted 18 March, 2026;
originally announced March 2026.
-
On the Emotion Understanding of Synthesized Speech
Authors:
Yuan Ge,
Haishu Zhao,
Aokai Hao,
Junxiang Zhang,
Bei Li,
Xiaoqian Liu,
Chenglong Wang,
Jianjin Wang,
Bingsen Zhou,
Bingyu Liu,
Jingbo Zhu,
Zhengtao Yu,
Tong Xiao
Abstract:
Emotion is a core paralinguistic feature in voice interaction. It is widely believed that emotion understanding models learn fundamental representations that transfer to synthesized speech, making emotion understanding results a plausible reward or evaluation metric for assessing emotional expressiveness in speech synthesis. In this work, we critically examine this assumption by systematically evaluating Speech Emotion Recognition (SER) on synthesized speech across datasets, discriminative and generative SER models, and diverse synthesis models. We find that current SER models cannot generalize to synthesized speech, largely because speech token prediction during synthesis induces a representation mismatch between synthesized and human speech. Moreover, generative Speech Language Models (SLMs) tend to infer emotion from textual semantics while ignoring paralinguistic cues. Overall, our findings suggest that existing SER models often exploit non-robust shortcuts rather than capturing fundamental features, and paralinguistic understanding in SLMs remains challenging.
Submitted 17 March, 2026;
originally announced March 2026.
-
Offline Exploration-Aware Fine-Tuning for Long-Chain Mathematical Reasoning
Authors:
Yongyu Mu,
Jiali Zeng,
Fandong Meng,
JingBo Zhu,
Tong Xiao
Abstract:
Through encouraging self-exploration, reinforcement learning from verifiable rewards (RLVR) has significantly advanced the mathematical reasoning capabilities of large language models. As the starting point for RLVR, the capacity of supervised fine-tuning (SFT) to memorize new chain-of-thought trajectories provides a crucial initialization that shapes the subsequent exploration landscape. However, existing research primarily focuses on facilitating exploration during RLVR training, leaving exploration-aware SFT under-explored. To bridge this gap, we propose Offline eXploration-Aware (OXA) fine-tuning. Specifically, OXA optimizes two objectives: promoting low-confidence verified teacher-distillation data to internalize previously uncaptured reasoning patterns, and suppressing high-confidence incorrect self-distillation data to redistribute probability mass of incorrect patterns toward potentially correct candidates. Experimental results across 6 benchmarks show that OXA consistently improves mathematical reasoning performance, especially achieving an average gain of $+6$ Pass@1 and $+5$ Pass@$k$ points compared to conventional SFT on the Qwen2.5-1.5B-Math. Crucially, OXA elevates initial policy entropy, and performance gains persist throughout extensive RLVR training, demonstrating the long-term value of OXA.
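A schematic of the two objectives might look like the following; the thresholds, sample format, and sequence-level treatment are simplifying assumptions, not the paper's exact token-level loss:

```python
import math

def oxa_style_loss(samples, conf_low=0.3, conf_high=0.7):
    """Illustrative sketch of an exploration-aware SFT objective
    (names and thresholds are assumptions, not the paper's exact loss).
    - Promote verified teacher samples the model is *not* yet confident
      on (likelihood term), internalizing uncaptured reasoning patterns.
    - Suppress incorrect self-samples the model *is* confident on
      (unlikelihood term), freeing probability mass for alternatives.
    Each sample: (prob_under_model, is_correct, source)."""
    loss = 0.0
    for p, correct, source in samples:
        if source == "teacher" and correct and p < conf_low:
            loss += -math.log(p)             # standard NLL: pull up
        elif source == "self" and not correct and p > conf_high:
            loss += -math.log(1.0 - p)       # unlikelihood: push down
    return loss

samples = [
    (0.1, True,  "teacher"),   # low-confidence verified: promoted
    (0.9, False, "self"),      # high-confidence wrong: suppressed
    (0.8, True,  "teacher"),   # already well-learned: contributes nothing
]
loss = oxa_style_loss(samples)
```

Samples outside the two targeted regimes contribute nothing, which is how this sketch avoids over-fitting patterns the model already captures.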
Submitted 17 March, 2026;
originally announced March 2026.
-
Giant anomalous Hall conductivity in frustrated magnet EuCo2Al9
Authors:
Sheng Xu,
Jian-Feng Zhang,
Shu-Xiang Li,
Junfa Lin,
Xiaobai Ma,
Wenyun Yang,
Jun-Jian Mi,
Zheng Li,
Tian-Hao Li,
Yue-Yang Wu,
Jiang Ma,
Qian Tao,
Wen-He Jiao,
Xiaofeng Xu,
Zengwei Zhu,
Yuanfeng Xu,
Hanjie Guo,
Tian-Long Xia,
Zhu-An Xu
Abstract:
The interaction between conduction electrons and localized magnetic moments profoundly influences the electrical and magnetic properties of materials, giving rise to a variety of fascinating physical phenomena and quantum effects. Here, we discover a giant anomalous Hall effect (AHE) in a frustrated Eu-based magnet, exhibiting a giant anomalous Hall conductivity (AHC) of 31000 Ω$^{-1}$ cm$^{-1}$ and a remarkable anomalous Hall angle (AHA, $\tanθ_H$) of 12%, surpassing conventional mechanisms (either intrinsic or extrinsic) by two orders of magnitude. Combining magnetotransport, quantum oscillations, neutron diffraction and ab initio calculations, we establish that the giant AHC originates from fluctuating spin chirality skew scattering, generated by indirect Ruderman-Kittel-Kasuya-Yosida (RKKY) interactions of Eu-4f moments. Simultaneously, Hund's coupling of itinerant electrons and localized Eu-4f spins triggers giant exchange splitting, evidenced by temperature-dependent Fermi surface reconstruction. This work establishes a frustrated magnetic platform for engineering the AHE and elucidates the governing role of exchange interactions and spin textures in quantum transport, while also providing a framework for designing unconventional spintronic systems that harness emergent spin-texture dynamics.
Submitted 15 March, 2026;
originally announced March 2026.
-
Blazar Constraints on Axions through New Spectral Modulation Searches in 1ES 1959+650 & B2 1811+31
Authors:
Andrea Giovanni De Marchi,
Orion Ning,
Tianzhuo Xiao
Abstract:
Blazars are unique astrophysical environments whose high-energy $γ$-ray spectra are susceptible to modulations in the presence of ultralight axions. We search for these modulations, induced by axion-photon mixing, in Fermi-LAT spectral data of previously unexplored blazar targets, focusing in particular on blazars 1ES 1959+650 and B2 1811+31, whose flare states provide a clean testbed for axion activity. In both cases, we find no evidence for axions, and set exclusion regions on the axion-photon coupling for masses between $10^{-9}$ eV $\lesssim$ $m_a$ $\lesssim$ $10^{-8}$ eV, with sensitivities typically reaching $g_{a γγ} \sim 10^{-11} - 10^{-10}$ GeV$^{-1}$ depending on the assumed blazar modeling choices. We examine the broad impact of modeling uncertainties, finding that the resulting constraints can vary substantially across plausible configurations. We discuss the implications of these systematic effects and their relevance for similar blazar-like searches in the future.
Submitted 13 March, 2026;
originally announced March 2026.
-
Meta-Reinforcement Learning with Self-Reflection for Agentic Search
Authors:
Teng Xiao,
Yige Yuan,
Hamish Ivison,
Huaisheng Zhu,
Faeze Brahman,
Nathan Lambert,
Pradeep Dasigi,
Noah A. Smith,
Hannaneh Hajishirzi
Abstract:
This paper introduces MR-Search, an in-context meta reinforcement learning (RL) formulation for agentic search with self-reflection. Instead of optimizing a policy within a single independent episode with sparse rewards, MR-Search trains a policy that conditions on past episodes and adapts its search strategy across episodes. MR-Search learns to learn a search strategy with self-reflection, allowing search agents to improve in-context exploration at test-time. Specifically, MR-Search performs cross-episode exploration by generating explicit self-reflections after each episode and leveraging them as additional context to guide subsequent attempts, thereby promoting more effective exploration during test-time. We further introduce a multi-turn RL algorithm that estimates a dense relative advantage at the turn level, enabling fine-grained credit assignment on each episode. Empirical results across various benchmarks demonstrate the advantages of MR-Search over RL-based baselines, showing strong generalization and relative improvements of 9.2% to 19.3% across eight benchmarks. Our code and data are available at https://github.com/tengxiao1/MR-Search.
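The cross-episode loop can be sketched as follows, with a hypothetical `agent` interface standing in for the trained policy:

```python
def run_meta_episodes(agent, task, num_episodes=3):
    """Sketch of cross-episode self-reflection (hypothetical interface):
    each new episode conditions on reflections from earlier attempts,
    so exploration improves in-context at test time."""
    reflections = []
    history = []
    for _ in range(num_episodes):
        answer, reward = agent.attempt(task, context=reflections)
        history.append((answer, reward))
        if reward < 1.0:                  # failed: reflect, then retry
            reflections.append(agent.reflect(task, answer, reward))
        else:
            break                         # solved: stop early
    return history

# Toy agent that succeeds once it has accumulated two reflections.
class ToyAgent:
    def attempt(self, task, context):
        ok = len(context) >= 2
        return ("good" if ok else "bad", 1.0 if ok else 0.0)
    def reflect(self, task, answer, reward):
        return f"attempt '{answer}' failed; broaden the search query"

history = run_meta_episodes(ToyAgent(), "who wrote X?")
```

The loop body is where the paper's turn-level advantage estimation would attach: each episode's turns are credited individually rather than through a single sparse terminal reward.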
Submitted 18 March, 2026; v1 submitted 11 March, 2026;
originally announced March 2026.
-
Beyond Test-Time Training: Learning to Reason via Hardware-Efficient Optimal Control
Authors:
Peihao Wang,
Shan Yang,
Xijun Wang,
Tesi Xiao,
Xin Liu,
Changlong Yu,
Yu Lou,
Pan Li,
Zhangyang Wang,
Ming Lin,
René Vidal
Abstract:
Associative memory has long underpinned the design of sequential models. Beyond recall, humans reason by projecting future states and selecting goal-directed actions, a capability that modern language models increasingly require but do not natively encode. While prior work uses reinforcement learning or test-time training, planning remains external to the model architecture. We formulate reasoning as optimal control and introduce the Test-Time Control (TTC) layer, which performs finite-horizon LQR planning over latent states at inference time, represents a value function within neural architectures, and leverages it as the nested objective to enable planning before prediction. To ensure scalability, we derive a hardware-efficient LQR solver based on a symplectic formulation and implement it as a fused CUDA kernel, enabling parallel execution with minimal overhead. Integrated as an adapter into pretrained LLMs, TTC layers improve mathematical reasoning performance by up to +27.8% on MATH-500 and 2-3x Pass@8 improvements on AMC and AIME, demonstrating that embedding optimal control as an architectural component provides an effective and scalable mechanism for reasoning beyond test-time training.
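The planning primitive inside such a layer, finite-horizon LQR, reduces to a backward Riccati recursion plus a forward rollout. A plain NumPy sketch of that primitive (the paper's solver instead uses a symplectic formulation in a fused CUDA kernel):

```python
import numpy as np

def lqr_plan(A, B, Q, R, x0, horizon):
    """Finite-horizon discrete-time LQR: backward Riccati recursion to
    get time-varying gains, then a forward rollout from x0 (sketch)."""
    P = Q.copy()
    gains = []
    for _ in range(horizon):                        # backward pass
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    gains.reverse()                                 # gains[0] is for t = 0
    xs, us, x = [x0], [], x0
    for K in gains:                                 # forward rollout
        u = -K @ x
        x = A @ x + B @ u
        us.append(u); xs.append(x)
    return np.array(xs), np.array(us)

# Double-integrator toy: drive position and velocity toward the origin.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2); R = np.eye(1)
xs, us = lqr_plan(A, B, Q, R, np.array([5.0, 0.0]), horizon=20)
```

In the TTC setting the "state" would be a latent representation and the quadratic cost a learned value function; the solver structure is the same.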
Submitted 10 March, 2026;
originally announced March 2026.
-
StyleBench: Evaluating Speech Language Models on Conversational Speaking Style Control
Authors:
Haishu Zhao,
Aokai Hao,
Yuan Ge,
Zhenqiang Hong,
Tong Xiao,
Jingbo Zhu
Abstract:
Speech language models (SLMs) have significantly extended the interactive capability of text-based Large Language Models (LLMs) by incorporating paralinguistic information. For a more realistic interactive experience with customized styles, current SLMs have managed to interpret and control speaking style intensity from user prompts during the dialogue process. However, there remains a lack of systematic benchmarks that quantify and evaluate the style intensity control ability in conversations. In this paper, we propose StyleBench, a multi-turn dialogue benchmark for comprehensively evaluating the style intensity control ability across four dimensions: emotion, speed, volume, and pitch. Our results reveal the performance gaps between leading SLMs and omni language models (OLMs), suggesting the underlying reasons and promising approaches for future exploration.
Submitted 8 March, 2026;
originally announced March 2026.
-
RAMoEA-QA: Hierarchical Specialization for Robust Respiratory Audio Question Answering
Authors:
Gaia A. Bertolino,
Yuwei Zhang,
Tong Xia,
Domenico Talia,
Cecilia Mascolo
Abstract:
Conversational generative AI is rapidly entering healthcare, where general-purpose models must integrate heterogeneous patient signals and support diverse interaction styles while producing clinically meaningful outputs. In respiratory care, non-invasive audio, such as recordings captured via mobile microphones, enables scalable screening and longitudinal monitoring, but the heterogeneity challenge is particularly acute: recordings vary widely across devices, environments, and acquisition protocols, and questions span multiple intents and question formats. Existing biomedical audio-language QA systems are typically monolithic, without any specialization mechanisms for tackling diverse respiratory corpora and query intents. They are also only validated in limited settings, leaving it unclear how reliably they handle the shifts encountered in real-world settings.
To address these limitations, we introduce RAMoEA-QA, a hierarchically routed generative model for respiratory audio question answering that unifies multiple question types and supports both discrete and continuous targets within a single multimodal system. RAMoEA-QA applies two-stage conditional specialization: an Audio Mixture-of-Experts routes each recording to a suitable pre-trained audio encoder, and a Language Mixture-of-Adapters selects a LoRA adapter on a shared frozen LLM to match the query intent and answer format. By specializing both acoustic representations and generation behaviour per example, RAMoEA-QA consistently outperforms strong baselines and routing ablations with minimal parameter overhead, improving in-domain test accuracy to 0.72 (vs. 0.61 and 0.67 for state-of-the-art baselines) and exhibiting the strongest generalization for diagnosis under domain, modality, and task shifts.
Submitted 6 March, 2026;
originally announced March 2026.
-
Giant Magnetocrystalline Anisotropy in Honeycomb Iridate NiIrO3 with Large Coercive Field Exceeding 17 T
Authors:
Chuanhui Zhu,
Pengfei Tan,
Xiao-Sheng Ni,
Jingchun Gao,
Yuting Chang,
Mei-Huan Zhao,
Zheng Deng,
Shuang Zhao,
Tao Xia,
Jinjin Yang,
Changqing Jin,
Junfeng Wang,
Chengliang Lu,
Yisheng Chai,
Dao-Xin Yao,
Man-Rong Li
Abstract:
The realization of unconventional quantum phases in frustrated and spin-orbit coupled materials remains at the forefront of quantum materials research. Here we report the synthesis and discovery of NiIrO3, the first honeycomb iridate with coupled 3d-5d magnetic sublattices, through a soft topotactic reaction. Structural analysis reveals an ilmenite-type stacking of edge-sharing NiO6 and IrO6 octahedral honeycomb sublattices in a Kitaev geometry. Comprehensive magnetic and electrical transport measurements unveil its long-range ferrimagnetic order below 213 K, which is in sharp contrast to the predominantly antiferromagnetic order in the known honeycomb iridates. Notably, the title compound displays an exceptionally large magnetocrystalline anisotropy energy of 32.2 meV/f.u. and a giant coercivity with coercive field exceeding 17.3 T below 4.2 K, both ranking among the highest observed in iridates to date. Combined experimental and theoretical investigations indicate that the exceptional anisotropy and coercivity originate from the synergistic effect between strong lattice frustration in the coupled 3d-5d honeycomb lattice network and the robust spin-orbit coupling of the Ir$^{4+}$ ($J_{\mathrm{eff}}$ = 1/2) state. This work positions NiIrO3 as a promising platform to investigate low-dimensional and frustrated quantum spin systems, and highlights its potential for spintronic applications through the targeted engineering of 3d-5d interactions.
Submitted 4 March, 2026;
originally announced March 2026.
-
A simple scheme to realize the Rice-Mele model in acoustic system
Authors:
Tianzhi Xia,
Xiying Fan,
Qi Chen,
Yuanlei Zhang,
Zhe Li
Abstract:
The Rice-Mele (RM) model, as a paradigmatic extension of the Su-Schrieffer-Heeger (SSH) chain, plays a pivotal role in understanding topological phases and quantized adiabatic transport in one-dimensional systems. Its realization in acoustic systems, however, has been hindered by the need for simultaneous precise modulation of on-site potentials and couplings. In this work, we demonstrate a method to linearly tune on-site potentials and couplings, thus realizing an acoustic Rice-Mele model. During parameter evolution, the system exhibits a Thouless pump, with the acoustic field distribution adiabatically shifting from the left edge through the bulk to the right edge, fully consistent with tight-binding model predictions. Moreover, the strategy of leveraging geometric parameters to linearly and precisely control on-site potentials and couplings is highly effective and universal for designing acoustic metamaterials, and it can be extended to other classical wave systems.
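For reference, the standard two-band Bloch Hamiltonian of the Rice-Mele chain, with staggered on-site potential ±Δ and alternating hoppings v and w, can be diagonalized in a few lines; setting Δ = 0 recovers the SSH limit, and a Thouless pump corresponds to cycling (Δ, v − w) around the origin:

```python
import numpy as np

def rice_mele_bands(delta, v, w, num_k=201):
    """Bulk bands of the Rice-Mele chain in the standard tight-binding
    form: H(k) = [[ delta, v + w e^{-ik}], [v + w e^{ik}, -delta]]."""
    ks = np.linspace(-np.pi, np.pi, num_k)
    bands = []
    for k in ks:
        h = np.array([[delta, v + w * np.exp(-1j * k)],
                      [v + w * np.exp(1j * k), -delta]])
        bands.append(np.linalg.eigvalsh(h))   # ascending eigenvalues
    return ks, np.array(bands)

ks, bands = rice_mele_bands(delta=0.5, v=1.0, w=0.7)
# Direct gap, minimized at k = pi: 2 * sqrt(delta^2 + (v - w)^2).
gap = bands[:, 1].min() - bands[:, 0].max()
```

The acoustic realization described above amounts to engineering Δ, v, and w from geometric parameters so that this cycle can be traversed adiabatically.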
Submitted 4 March, 2026; v1 submitted 3 March, 2026;
originally announced March 2026.
-
When Scaling Fails: Mitigating Audio Perception Decay of LALMs via Multi-Step Perception-Aware Reasoning
Authors:
Ruixiang Mao,
Xiangnan Ma,
Dan Chen,
Ziming Zhu,
Yuan Ge,
Aokai Hao,
Haishu Zhao,
Yifu Huo,
Qing Yang,
Kaiyan Chang,
Xiaoqian Liu,
Chenglong Wang,
Qiaozhi He,
Tong Xiao,
Jingbo Zhu
Abstract:
Test-Time Scaling has shown notable efficacy in addressing complex problems through scaling inference compute. However, within Large Audio-Language Models (LALMs), an unintuitive phenomenon exists: post-training models for structured reasoning trajectories results in marginal or even negative gains compared to post-training for direct answering. To investigate this phenomenon, we introduce CAFE, an evaluation framework designed to precisely quantify audio reasoning errors. Evaluation results reveal LALMs struggle with perception during reasoning and encounter a critical bottleneck: reasoning performance suffers from audio perception decay as reasoning length extends. To address this bottleneck, we propose MPAR$^2$, a paradigm that encourages dynamic perceptual reasoning and decomposes complex questions into perception-rich sub-problems. Leveraging reinforcement learning, MPAR$^2$ improves perception performance on CAFE from 31.74% to 63.51% and effectively mitigates perception decay, concurrently enhancing reasoning capabilities to achieve a significant 74.59% accuracy on the MMAU benchmark. Further analysis demonstrates that MPAR$^2$ reinforces LALMs to attend to audio input and dynamically adapts reasoning budget to match task complexity.
Submitted 28 February, 2026;
originally announced March 2026.
-
EfficientPosterGen: Semantic-aware Efficient Poster Generation via Token Compression and Accurate Violation Detection
Authors:
Wenxin Tang,
Jingyu Xiao,
Yanpei Gong,
Fengyuan Ran,
Tongchuan Xia,
Junliang Liu,
Man Ho Lam,
Wenxuan Wang,
Michael R. Lyu
Abstract:
Automated academic poster generation aims to distill lengthy research papers into concise, visually coherent presentations. Existing Multimodal Large Language Models (MLLMs) based approaches, however, suffer from three critical limitations: low information density in full-paper inputs, excessive token consumption, and unreliable layout verification. We present EfficientPosterGen, an end-to-end framework that addresses these challenges through semantic-aware retrieval and token-efficient multimodal generation. EfficientPosterGen introduces three core innovations: (1) Semantic-aware Key Information Retrieval (SKIR), which constructs a semantic contribution graph to model inter-segment relationships and selectively preserves important content; (2) Visual-based Context Compression (VCC), which renders selected text segments into images to shift textual information into the visual modality, significantly reducing token usage while generating poster-ready bullet points; and (3) Agentless Layout Violation Detection (ALVD), a deterministic color-gradient-based algorithm that reliably detects content overflow and spatial sparsity without auxiliary MLLMs. Extensive experiments demonstrate that EfficientPosterGen achieves substantial improvements in token efficiency and layout reliability while maintaining high poster quality, offering a scalable solution for automated academic poster generation. Our code is available at https://github.com/vinsontang1/EfficientPosterGen-Code.
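The spirit of a deterministic, MLLM-free violation check can be conveyed with a pixel-level sketch; the thresholds and box convention below are invented for illustration and are not ALVD's exact algorithm:

```python
import numpy as np

def detect_violations(canvas, box, margin=2, sparse_thresh=0.05):
    """Sketch of a deterministic layout-violation check (assumed logic):
    flag a box whose rendered 'ink' spills past its bounds (overflow)
    and a box with too little content inside (spatial sparsity).
    canvas: 2D array of ink intensity; box: (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    inside = canvas[y0:y1, x0:x1]
    outer = canvas[max(y0 - margin, 0):y1 + margin,
                   max(x0 - margin, 0):x1 + margin]
    ring = outer.sum() - inside.sum()       # ink just outside the box
    overflow = ring > 0
    sparse = inside.mean() < sparse_thresh
    return overflow, sparse

canvas = np.zeros((100, 100))
canvas[10:30, 10:50] = 1.0                  # rendered text block
overflow_ok, sparse_ok = detect_violations(canvas, box=(10, 10, 50, 30))
overflow_bad, _ = detect_violations(canvas, box=(10, 10, 40, 30))  # too narrow
_, sparse_empty = detect_violations(canvas, box=(60, 60, 90, 90))  # empty box
```

Because the check is pure array arithmetic, it is cheap and fully reproducible, which is the advantage the abstract claims over auxiliary-MLLM verification.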
Submitted 25 February, 2026;
originally announced March 2026.
-
PaperRepro: Automated Computational Reproducibility Assessment for Social Science Papers
Authors:
Linhao Zhang,
Tong Xia,
Jinghua Piao,
Lizhen Cui,
Yong Li
Abstract:
Computational reproducibility is essential for the credibility of scientific findings, particularly in the social sciences, where findings often inform real-world decisions. Manual reproducibility assessment is costly and time-consuming, as it is nontrivial to reproduce the reported findings using the authors' released code and data. Recent advances in large models (LMs) have inspired agent-based approaches for automated reproducibility assessment. However, existing approaches often struggle due to limited context capacity, inadequate task-specific tooling, and insufficient result capture. To address these, we propose PaperRepro, a novel two-stage, multi-agent approach that separates execution from evaluation. In the execution stage, agents execute the reproduction package and edit the code to capture reproduced results as explicit artifacts. In the evaluation stage, agents evaluate reproducibility using explicit evidence. PaperRepro assigns distinct responsibilities to agents and equips them with task-specific tools and expert prompts, mitigating context and tooling limitations. It further maximizes the LM's coding capability to enable more complete result capture for evaluation. On REPRO-Bench, a social science reproducibility assessment benchmark, PaperRepro achieves the best overall performance, with a 21.9% relative improvement in score-agreement accuracy over the strongest prior baseline. We further refine the benchmark and introduce REPRO-Bench-S, a benchmark stratified by execution difficulty for more diagnostic evaluation of automated reproducibility assessment systems. Our code and data are publicly available.
Submitted 10 February, 2026;
originally announced March 2026.
-
Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA
Authors:
Wenwei Li,
Ming Xu,
Tianle Xia,
Lingxiang Hu,
Yiding Sun,
Linfang Shang,
Liqun Liu,
Peng Shu,
Huan Yu,
Jie Jiang
Abstract:
Industrial advertising question answering (QA) is a high-stakes task in which hallucinated content, particularly fabricated URLs, can lead to financial loss, compliance violations, and legal risk. Although Retrieval-Augmented Generation (RAG) is widely adopted, deploying it in production remains challenging because industrial knowledge is inherently relational, frequently updated, and insufficiently aligned with generation objectives. We propose a reinforced co-adaptation framework that jointly optimizes retrieval and generation through two components: (1) Graph-aware Retrieval (GraphRAG), which models entity-relation structure over a high-citation knowledge subgraph for multi-hop, domain-specific evidence selection; and (2) evidence-constrained reinforcement learning via Group Relative Policy Optimization (GRPO) with multi-dimensional rewards covering faithfulness, style compliance, safety, and URL validity. Experiments on an internal advertising QA dataset show consistent gains across expert-judged dimensions including accuracy, completeness, and safety, while reducing the hallucination rate by 72%. A two-week online A/B test demonstrates a 28.6% increase in like rate, a 46.2% decrease in dislike rate, and a 92.7% reduction in URL hallucination. The system has been running in production for over half a year and has served millions of QA interactions.
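The multi-dimensional, URL-aware reward described in this abstract can be illustrated with a small sketch. The dimension names, weights, and the hard URL gate below are illustrative assumptions for exposition, not the paper's actual reward design:

```python
# Sketch of a multi-dimensional reward with URL validity enforced as a
# hard gate: any URL not grounded in retrieved evidence zeroes the reward.
# Weights and dimension names are hypothetical, not the paper's values.
import re

def url_valid(answer: str, allowed_urls: set) -> bool:
    """Every URL in the answer must come from the retrieved evidence."""
    urls = re.findall(r"https?://\S+", answer)
    return all(u.rstrip('.,)') in allowed_urls for u in urls)

def combined_reward(scores: dict, answer: str, allowed_urls: set) -> float:
    """Weighted sum of per-dimension scores, gated on URL validity."""
    weights = {"faithfulness": 0.5, "style": 0.2, "safety": 0.3}
    if not url_valid(answer, allowed_urls):
        return 0.0  # hard gate against URL hallucination
    return sum(weights[k] * scores[k] for k in weights)

perfect = {"faithfulness": 1.0, "style": 1.0, "safety": 1.0}
ok = combined_reward(
    perfect,
    "See https://ads.example.com/help for details.",
    {"https://ads.example.com/help"},
)
bad = combined_reward(
    perfect,
    "See https://made-up.example.net/page.",
    {"https://ads.example.com/help"},
)
```

Treating URL validity as a gate rather than one more weighted term means a single fabricated link cannot be offset by high scores elsewhere, which matches the high-stakes framing of the task.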
Submitted 25 February, 2026;
originally announced February 2026.
-
Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Authors:
Tianle Xia,
Ming Xu,
Lingxiang Hu,
Yiding Sun,
Wenwei Li,
Linfang Shang,
Liqun Liu,
Peng Shu,
Huan Yu,
Jie Jiang
Abstract:
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, yet traditional single-round retrieval struggles with complex multi-step reasoning. Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed samples contribute nothing. We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training, comprising two key components: (1) Path-Centric Reward, which evaluates the structural quality of reasoning trajectories through order-agnostic step coverage and soft scoring that extracts learning signals even from failed samples, and (2) Dual-Track Path Scoring with offline-generated reference planners that assesses paths from both self-consistency and reference-alignment perspectives. Experiments on multiple QA benchmarks demonstrate that Search-P1 achieves significant improvements over Search-R1 and other strong baselines, with an average accuracy gain of 7.7 points.
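The order-agnostic step coverage with soft scoring mentioned above can be sketched as follows. Step matching here is exact string equality and the blending weight is arbitrary; the paper's soft scoring and dual-track reference planners are more elaborate:

```python
# Sketch of a path-centric reward: blend the sparse outcome reward with
# order-agnostic coverage of reference reasoning steps, so that failed
# rollouts still earn partial credit. Matching by exact string equality
# is a simplification of the paper's soft scoring.
def step_coverage(trajectory, reference_steps):
    """Fraction of distinct reference steps present anywhere in the trajectory."""
    refs = set(reference_steps)
    if not refs:
        return 0.0
    covered = sum(1 for s in refs if s in set(trajectory))
    return covered / len(refs)

def path_reward(answer_correct, trajectory, reference_steps, alpha=0.5):
    """alpha weights outcome vs. dense path-coverage signal (illustrative)."""
    outcome = 1.0 if answer_correct else 0.0
    return alpha * outcome + (1 - alpha) * step_coverage(trajectory, reference_steps)

ref = ["find_birthplace", "find_country", "lookup_capital"]
# A failed rollout that still covered 2 of the 3 reference steps
# receives a nonzero learning signal instead of a flat zero:
r = path_reward(False, ["find_birthplace", "lookup_capital"], ref)
```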
Submitted 25 February, 2026;
originally announced February 2026.
-
IRIS-SLAM: Unified Geo-Instance Representations for Robust Semantic Localization and Mapping
Authors:
Tingyang Xiao,
Liu Liu,
Wei Feng,
Zhengyu Zou,
Xiaolin Zhou,
Wei Sui,
Hao Li,
Dingwen Zhang,
Zhizhong Su
Abstract:
Geometry foundation models have significantly advanced dense geometric SLAM, yet existing systems often lack deep semantic understanding and robust loop closure capabilities. Meanwhile, contemporary semantic mapping approaches are frequently hindered by decoupled architectures and fragile data association. We propose IRIS-SLAM, a novel RGB semantic SLAM system that leverages unified geometric-instance representations derived from an instance-extended foundation model. By extending a geometry foundation model to concurrently predict dense geometry and cross-view consistent instance embeddings, we enable a semantic-synergized association mechanism and instance-guided loop closure detection. Our approach effectively utilizes viewpoint-agnostic semantic anchors to bridge the gap between geometric reconstruction and open-vocabulary mapping. Experimental results demonstrate that IRIS-SLAM significantly outperforms state-of-the-art methods, particularly in map consistency and wide-baseline loop closure reliability.
Submitted 27 March, 2026; v1 submitted 20 February, 2026;
originally announced February 2026.
-
RA-QA: A Benchmarking System for Respiratory Audio Question Answering Under Real-World Heterogeneity
Authors:
Gaia A. Bertolino,
Yuwei Zhang,
Tong Xia,
Domenico Talia,
Cecilia Mascolo
Abstract:
As conversational multimodal AI tools are increasingly adopted to process patient data for health assessment, robust benchmarks are needed to measure progress and expose failure modes under realistic conditions. Despite the importance of respiratory audio for mobile health screening, respiratory audio question answering remains underexplored, with existing studies evaluated narrowly and lacking real-world heterogeneity across modalities, devices, and question types. We hence introduce the Respiratory-Audio Question-Answering (RA-QA) benchmark, including a standardized data generation pipeline, a comprehensive multimodal QA collection, and a unified evaluation protocol. RA-QA harmonizes public RA datasets into a collection of 9 million format-diverse QA pairs covering diagnostic and contextual attributes. We benchmark classical ML baselines alongside multimodal audio-language models, establishing reproducible reference points and showing how current approaches fail under heterogeneity.
Submitted 5 March, 2026; v1 submitted 4 February, 2026;
originally announced February 2026.
-
EarthSpatialBench: Benchmarking Spatial Reasoning Capabilities of Multimodal LLMs on Earth Imagery
Authors:
Zelin Xu,
Yupu Zhang,
Saugat Adhikari,
Saiful Islam,
Tingsong Xiao,
Zibo Liu,
Shigang Chen,
Da Yan,
Zhe Jiang
Abstract:
Benchmarking spatial reasoning in multimodal large language models (MLLMs) has attracted growing interest in computer vision due to its importance for embodied AI and other agentic systems that require precise interaction with the physical world. However, spatial reasoning on Earth imagery has lagged behind, as it uniquely involves grounding objects in georeferenced images and quantitatively reasoning about distances, directions, and topological relations using both visual cues and vector geometry coordinates (e.g., 2D bounding boxes, polylines, and polygons). Existing benchmarks for Earth imagery primarily focus on 2D spatial grounding, image captioning, and coarse spatial relations (e.g., simple directional or proximity cues). They lack support for quantitative direction and distance reasoning, systematic topological relations, and complex object geometries beyond bounding boxes. To fill this gap, we propose EarthSpatialBench, a comprehensive benchmark for evaluating spatial reasoning in MLLMs on Earth imagery. The benchmark contains over 325K question-answer pairs spanning: (1) qualitative and quantitative reasoning about spatial distance and direction; (2) systematic topological relations; (3) single-object queries, object-pair queries, and compositional aggregate group queries; and (4) object references expressed via textual descriptions, visual overlays, and explicit geometry coordinates, including 2D bounding boxes, polylines, and polygons. We conducted extensive experiments on both open-source and proprietary models to identify limitations in the spatial reasoning of MLLMs.
Submitted 17 February, 2026;
originally announced February 2026.
-
AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
Authors:
Lingxiang Hu,
Yiding Sun,
Tianle Xia,
Wenwei Li,
Ming Xu,
Liqun Liu,
Peng Shu,
Huan Yu,
Jie Jiang
Abstract:
While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem. Current benchmarks, however, are largely restricted to idealized simulations, failing to address the practical demands of specialized domains like advertising and marketing analytics. In these fields, tasks are inherently more complex, often requiring multi-round interaction with professional marketing tools. To address this gap, we propose AD-Bench, a benchmark designed based on real-world business requirements of advertising and marketing platforms. AD-Bench is constructed from real user marketing analysis requests, with domain experts providing verifiable reference answers and corresponding reference tool-call trajectories. The benchmark categorizes requests into three difficulty levels (L1-L3) to evaluate agents' capabilities under multi-round, multi-tool collaboration. Experiments show that on AD-Bench, Gemini-3-Pro achieves Pass@1 = 68.0% and Pass@3 = 83.0%, but performance drops significantly on L3 to Pass@1 = 49.4% and Pass@3 = 62.1%, with a trajectory coverage of 70.1%, indicating that even state-of-the-art models still exhibit substantial capability gaps in complex advertising and marketing analysis scenarios. AD-Bench provides a realistic benchmark for evaluating and improving advertising and marketing agents; the leaderboard and code can be found at https://github.com/Emanual20/adbench-leaderboard.
Submitted 15 February, 2026;
originally announced February 2026.
-
Small Reward Models via Backward Inference
Authors:
Yike Wang,
Faeze Brahman,
Shangbin Feng,
Teng Xiao,
Hannaneh Hajishirzi,
Yulia Tsvetkov
Abstract:
Reward models (RMs) play a central role throughout the language model (LM) pipeline, particularly in non-verifiable domains. However, the dominant LLM-as-a-Judge paradigm relies on the strong reasoning capabilities of large models, while alternative approaches require reference responses or explicit rubrics, limiting flexibility and broader accessibility. In this work, we propose FLIP (FLipped Inference for Prompt reconstruction), a reference-free and rubric-free reward modeling approach that reformulates reward modeling through backward inference: inferring the instruction that would most plausibly produce a given response. The similarity between the inferred and the original instructions is then used as the reward signal. Evaluations across four domains using 13 small language models show that FLIP outperforms LLM-as-a-Judge baselines by an average of 79.6%. Moreover, FLIP substantially improves downstream performance in extrinsic evaluations under test-time scaling via parallel sampling and GRPO training. We further find that FLIP is particularly effective for longer outputs and robust to common forms of reward hacking. By explicitly exploiting the validation-generation gap, FLIP enables reliable reward modeling in downscaled regimes where judgment methods fail. Code available at https://github.com/yikee/FLIP.
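The backward-inference scoring idea can be sketched in a few lines. The inverse model and the similarity metric below are hypothetical stand-ins (a toy keyword-based reconstructor and token-level F1), not the paper's actual components:

```python
# Minimal sketch of FLIP-style backward inference: reconstruct the
# instruction from a response, then use similarity between the inferred
# and original instructions as the reward. `infer_prompt` is a
# hypothetical stand-in for a small LM; token F1 stands in for the
# paper's similarity measure.
from collections import Counter

def token_f1(a: str, b: str) -> float:
    """Token-level F1 between two strings (stand-in similarity)."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    overlap = sum((Counter(ta) & Counter(tb)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(tb), overlap / len(ta)
    return 2 * p * r / (p + r)

def rank_responses(prompt, responses, infer_prompt):
    """Score each candidate by how well the inverse model recovers the
    original prompt from it; return the best (reward, response) pair."""
    scored = [(token_f1(prompt, infer_prompt(resp)), resp) for resp in responses]
    return max(scored)

# Toy inverse model keyed on a single cue word (illustrative only).
def toy_inverse(resp):
    return "summarize the article" if "summary" in resp else "write a poem"

reward, best = rank_responses(
    "summarize the article",
    ["Here is a short summary of the text.", "Roses are red..."],
    toy_inverse,
)
```

The point of the reformulation is that no reference answer or rubric is needed: a response is rewarded simply for being the kind of output its instruction would plausibly produce.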
Submitted 25 February, 2026; v1 submitted 13 February, 2026;
originally announced February 2026.
-
Epitaxial Growth and Anomalous Hall Effect in High-Quality Altermagnetic $α$-MnTe Thin Films
Authors:
Tian-Hao Shao,
Xingze Dai,
Wenyu Hu,
Ming-Yuan Zhu,
Yuanqiang He,
Lin-He Yang,
Jingjing Liu,
Meng Yang,
Xiang-Rui Liu,
Jing-Jing Shi,
Tian-Yi Xiao,
Yu-Jie Hao,
Xiao-Ming Ma,
Yue Dai,
Meng Zeng,
Qinwu Gao,
Gan Wang,
Junxue Li,
Chao Wang,
Chang Liu
Abstract:
The recent identification of $α$-MnTe as a candidate altermagnet has attracted considerable interest, particularly for its potential application in magnetic random-access memory. However, the development of high-quality thin films - essential for practical implementation - has remained limited. Here, we report the epitaxial growth of centimeter-scale $α$-MnTe thin films on InP(111) substrates via molecular beam epitaxy (MBE). Through X-ray diffraction (XRD) analysis, we construct a MnTe phase diagram that provides clear guidance for stabilizing the pure $α$-MnTe phase, revealing that it is favored under high Te/Mn flux ratios and elevated growth temperatures. Cross-sectional electron microscopy confirms an atomically sharp film-substrate interface, consistent with a layer-by-layer epitaxial growth mode. Remarkably, these high-quality $α$-MnTe films exhibit a pronounced anomalous Hall effect (AHE) originating from Berry curvature, despite a net magnetic moment approaching zero - a signature of robust altermagnetic character. Our work establishes a viable route for synthesizing wafer-scale $α$-MnTe thin films and highlights their promise for altermagnet-based spintronics and magnetic sensing.
Submitted 12 February, 2026;
originally announced February 2026.
-
Cross-Project Flakiness: A Case Study of the OpenStack Ecosystem
Authors:
Tao Xiao,
Dong Wang,
Shane McIntosh,
Hideaki Hata,
Yasutaka Kamei
Abstract:
Automated regression testing is a cornerstone of modern software development, often contributing directly to code review and Continuous Integration (CI). Yet some tests suffer from flakiness, where their outcomes vary non-deterministically. Flakiness erodes developer trust in test results, wastes computational resources, and undermines CI reliability. While prior research has examined test flakiness within individual projects, its broader ecosystem-wide impact remains largely unexplored. In this paper, we present an empirical study of test flakiness in the OpenStack ecosystem, which focuses on (1) cross-project flakiness, where flaky tests impact multiple projects, and (2) inconsistent flakiness, where a test exhibits flakiness in some projects but remains stable in others. By analyzing 649 OpenStack projects, we identify 1,535 cross-project flaky tests and 1,105 inconsistently flaky tests. We find that cross-project flakiness affects 55% of OpenStack projects and significantly increases both review time and computational costs. Surprisingly, 70% of unit tests exhibit cross-project flakiness, challenging the assumption that unit tests are inherently insulated from issues that span modules like integration and system-level tests. Through qualitative analysis, we observe that race conditions in CI, inconsistent build configurations, and dependency mismatches are the primary causes of inconsistent flakiness. These findings underline the need for better coordination across complex ecosystems, standardized CI configurations, and improved test isolation strategies.
Submitted 9 February, 2026;
originally announced February 2026.
-
UI-Venus-1.5 Technical Report
Authors:
Venus Team,
Changlong Gao,
Zhangxuan Gu,
Yulin Liu,
Xinyu Qiu,
Shuheng Shen,
Yue Wen,
Tianyu Xia,
Zhenyu Xu,
Zhengwen Zeng,
Beitong Zhou,
Xingran Zhou,
Weizhi Chen,
Sunhao Dai,
Jingya Dou,
Yichen Gong,
Yuan Guo,
Zhenlin Guo,
Feng Li,
Qian Li,
Jinzhen Lin,
Yuqi Zhou,
Linchao Zhu,
Liang Chen,
Zhenyu Guo
, et al. (2 additional authors not shown)
Abstract:
GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains challenging. In this report, we present UI-Venus-1.5, a unified, end-to-end GUI Agent designed for robust real-world applications. The proposed model family comprises two dense variants (2B and 8B) and one mixture-of-experts variant (30B-A3B) to meet various downstream application scenarios. Compared to our previous version, UI-Venus-1.5 introduces three key technical advances: (1) a comprehensive Mid-Training stage leveraging 10 billion tokens across 30+ datasets to establish foundational GUI semantics; (2) Online Reinforcement Learning with full-trajectory rollouts, aligning training objectives with long-horizon, dynamic navigation in large-scale environments; and (3) a single unified GUI Agent constructed via Model Merging, which synthesizes domain-specific models (grounding, web, and mobile) into one cohesive checkpoint. Extensive evaluations demonstrate that UI-Venus-1.5 establishes new state-of-the-art performance on benchmarks such as ScreenSpot-Pro (69.6%), VenusBench-GD (75.0%), and AndroidWorld (77.6%), significantly outperforming previous strong baselines. In addition, UI-Venus-1.5 demonstrates robust navigation capabilities across a variety of Chinese mobile apps, effectively executing user instructions in real-world scenarios. Code: https://github.com/inclusionAI/UI-Venus; Model: https://huggingface.co/collections/inclusionAI/ui-venus
Submitted 24 February, 2026; v1 submitted 9 February, 2026;
originally announced February 2026.
-
QARM V2: Quantitative Alignment Multi-Modal Recommendation for Reasoning User Sequence Modeling
Authors:
Tian Xia,
Jiaqi Zhang,
Yueyang Liu,
Hongjian Dou,
Tingya Yin,
Jiangxia Cao,
Xulei Liang,
Tianlu Xie,
Lihao Liu,
Xiang Chen,
Shen Wang,
Changxin Lao,
Haixiang Gan,
Jinkai Yu,
Keting Cen,
Lu Hao,
Xu Zhang,
Qiqiang Zhong,
Zhongbo Sun,
Yiyu Wang,
Shuang Yang,
Mingxin Wen,
Xiangyu Wu,
Shaoguo Liu,
Tingting Gao
, et al. (3 additional authors not shown)
Abstract:
With the evolution of large language models (LLMs), there is growing interest in leveraging their rich semantic understanding to enhance industrial recommendation systems (RecSys). Traditional RecSys relies on ID-based embeddings for user sequence modeling in the General Search Unit (GSU) and Exact Search Unit (ESU) paradigm, which suffers from low information density, knowledge isolation, and weak generalization ability. While LLMs offer complementary strengths with dense semantic representations and strong generalization, directly applying LLM embeddings to RecSys faces critical challenges: representations mismatched with business objectives, and representations that cannot be learned end-to-end with downstream tasks. In this paper, we present QARM V2, a unified framework that bridges LLM semantic understanding with RecSys business requirements for user sequence modeling.
Submitted 9 February, 2026;
originally announced February 2026.
-
SciClaimEval: Cross-modal Claim Verification in Scientific Papers
Authors:
Xanh Ho,
Yun-Ang Wu,
Sunisth Kumar,
Tian Cheng Xia,
Florian Boudin,
Andre Greiner-Petter,
Akiko Aizawa
Abstract:
We present SciClaimEval, a new scientific dataset for the claim verification task. Unlike existing resources, SciClaimEval features authentic claims, including refuted ones, directly extracted from published papers. To create refuted claims, we introduce a novel approach that modifies the supporting evidence (figures and tables), rather than altering the claims or relying on large language models (LLMs) to fabricate contradictions. The dataset provides cross-modal evidence with diverse representations: figures are available as images, while tables are provided in multiple formats, including images, LaTeX source, HTML, and JSON. SciClaimEval contains 1,664 annotated samples from 180 papers across three domains (machine learning, natural language processing, and medicine), validated through expert annotation. We benchmark 11 multimodal foundation models, both open-source and proprietary, across the dataset. Results show that figure-based verification is particularly challenging for all models, as a substantial performance gap remains between the best system and the human baseline.
Submitted 13 February, 2026; v1 submitted 7 February, 2026;
originally announced February 2026.
-
Incorruptible Neural Networks: Training Models that can Generalize to Large Internal Perturbations
Authors:
Philip Jacobson,
Ben Feinberg,
Suhas Kumar,
Sapan Agarwal,
T. Patrick Xiao,
Christopher Bennett
Abstract:
Flat regions of the neural network loss landscape have long been hypothesized to correlate with better generalization properties. A closely related but distinct problem is training models that are robust to internal perturbations to their weights, which may be an important need for future low-power hardware platforms. In this paper, we explore the usage of two methods, sharpness-aware minimization (SAM) and random-weight perturbation (RWP), to find minima robust to a variety of random corruptions to weights. We consider the problem from two angles: generalization (how do we reduce the noise-robust generalization gap) and optimization (how do we maximize performance from optimizers when subject to strong perturbations). First, we establish, both theoretically and empirically, that an over-regularized RWP training objective is optimal for noise-robust generalization. For small-magnitude noise, we find that SAM's adversarial objective further improves performance over any RWP configuration, but performs poorly for large-magnitude noise. We link the cause of this to a vanishing-gradient effect, caused by unevenness in the loss landscape, affecting both SAM and RWP. Lastly, we demonstrate that dynamically adjusting the perturbation strength to match the evolution of the loss landscape improves optimizing for these perturbed objectives.
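The random-weight-perturbation objective discussed above can be illustrated on a toy one-dimensional loss. The quadratic loss, noise scale, and sample count below are illustrative choices, not the paper's experimental setup:

```python
# Toy illustration of an RWP objective: evaluate the loss at randomly
# perturbed copies of the weights and average, so that minimizing it
# favors minima that survive weight noise. 1-D quadratic loss and
# Gaussian noise are stand-ins for a real network and hardware noise.
import random

def loss(w: float) -> float:
    return (w - 3.0) ** 2  # toy loss with its minimum at w = 3

def rwp_loss(w: float, sigma: float, n_samples: int = 256, seed: int = 0) -> float:
    """Monte Carlo estimate of the expected loss under Gaussian
    perturbations of the weight (the RWP training objective)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += loss(w + rng.gauss(0.0, sigma))
    return total / n_samples

clean = loss(3.0)                 # exactly 0 at the noiseless minimum
noisy = rwp_loss(3.0, sigma=0.5)  # strictly positive: curvature costs under noise
```

For this quadratic, the expected perturbed loss at the minimum is sigma squared, which makes concrete why sharper (higher-curvature) minima are penalized more heavily as the perturbation strength grows.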
Submitted 6 February, 2026;
originally announced February 2026.
-
Experimental Quantification of Spin-Phonon Coupling in Molecular Qubits using Inelastic Neutron Scattering
Authors:
Stefan H. Lohaus,
Kay T. Xia,
Yongqiang Cheng,
Ryan G. Hadt
Abstract:
Electronic spin superposition states enable nanoscale sensing through their sensitivity to the local environment, yet their sensitivity to vibrational motion also limits their coherence times. In molecular spin systems, chemical tunability and atomic-scale resolution are accompanied by a dense, thermally accessible phonon spectrum that introduces efficient spin relaxation pathways. Despite extensive theoretical work, there is little experimental consensus on which vibrational energies dominate spin relaxation or how molecular structure controls spin-phonon coupling (SPC). We present a fully experimental method to quantify SPC coefficients by combining temperature-dependent vibrational spectra from inelastic neutron scattering with spin relaxation rates measured by electron paramagnetic resonance. We apply this framework to two model S = 1/2 systems, copper(II) phthalocyanine (CuPc) and copper(II) octaethylporphyrin (CuOEP). Two distinct relaxation regimes emerge: below 40 K, weakly coupled lattice modes below $50~\mathrm{cm}^{-1}$ dominate, whereas above 40 K, optical phonons above $\sim 185~\mathrm{cm}^{-1}$ become thermally populated and drive relaxation with SPC coefficients nearly three orders of magnitude larger. Structural distortions in CuOEP that break planar symmetry soften the crystal lattice and enhance anharmonic scattering, but also raise the energy of stretching modes at the molecular core where the spins reside. This redistributes vibrational energy toward the molecular periphery and out of plane, ultimately reducing SPC relative to CuPc and enabling room-temperature spin coherence in CuOEP. Although our method does not provide mode-specific SPC coefficients, it quantifies contributions from distinct spectral regions and establishes a broadly applicable, fully experimental link between crystal structure, lattice dynamics, and spin relaxation.
Submitted 2 February, 2026;
originally announced February 2026.
-
CoMeT: Collaborative Memory Transformer for Efficient Long Context Modeling
Authors:
Runsong Zhao,
Shilei Liu,
Jiwei Tang,
Langming Liu,
Haibin Chen,
Weidong Zhang,
Yujin Yuan,
Tong Xiao,
Jingbo Zhu,
Wenbo Su,
Bo Zheng
Abstract:
The quadratic complexity and indefinitely growing key-value (KV) cache of standard Transformers pose a major barrier to long-context processing. To overcome this, we introduce the Collaborative Memory Transformer (CoMeT), a novel architecture that enables LLMs to handle arbitrarily long sequences with constant memory usage and linear time complexity. Designed as an efficient, plug-in module, CoMeT can be integrated into pre-trained models with only minimal fine-tuning. It operates on sequential data chunks, using a dual-memory system to manage context: a temporary memory on a FIFO queue for recent events, and a global memory with a gated update rule for long-range dependencies. These memories then act as a dynamic soft prompt for the next chunk. To enable efficient fine-tuning on extremely long contexts, we introduce a novel layer-level pipeline parallelism strategy. The effectiveness of our approach is remarkable: a model equipped with CoMeT and fine-tuned on 32k contexts can accurately retrieve a passkey from any position within a 1M token sequence. On the SCROLLS benchmark, CoMeT surpasses other efficient methods and achieves performance comparable to a full-attention baseline on summarization tasks. Its practical effectiveness is further validated on real-world agent and user behavior QA tasks. The code is available at: https://anonymous.4open.science/r/comet-B00B/
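The dual-memory bookkeeping described above can be sketched as follows. Real CoMeT maintains learned memory vectors and a learned gate, and feeds both memories back as a soft prompt; the fixed scalar gate and plain-list vectors below are simplifications for illustration:

```python
# Sketch of CoMeT-style dual memory: a FIFO queue of recent chunk
# representations plus a gated global memory. Memory size is constant
# regardless of how many chunks stream through, which is the property
# that gives constant memory usage. The scalar gate is a stand-in for
# the paper's learned gated update rule.
from collections import deque

class DualMemory:
    def __init__(self, dim: int, fifo_size: int, gate: float = 0.1):
        self.temp = deque(maxlen=fifo_size)  # temporary memory (FIFO)
        self.globl = [0.0] * dim             # global memory
        self.gate = gate                     # illustrative fixed gate

    def update(self, chunk_repr):
        self.temp.append(chunk_repr)         # oldest chunk evicted automatically
        # gated update: blend the new chunk into the long-range memory
        self.globl = [(1 - self.gate) * g + self.gate * c
                      for g, c in zip(self.globl, chunk_repr)]

    def soft_prompt(self):
        """Both memories concatenated, acting as the prompt for the next chunk."""
        out = list(self.globl)
        for chunk in self.temp:
            out.extend(chunk)
        return out

mem = DualMemory(dim=4, fifo_size=2)
for i in range(5):                           # stream five chunks
    mem.update([float(i)] * 4)
prompt = mem.soft_prompt()                   # size fixed at dim * (1 + fifo_size)
```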
Submitted 2 February, 2026;
originally announced February 2026.
-
AutoHealth: An Uncertainty-Aware Multi-Agent System for Autonomous Health Data Modeling
Authors:
Tong Xia,
Weibin Li,
Gang Liu,
Yong Li
Abstract:
LLM-based agents have demonstrated strong potential for autonomous machine learning, yet their applicability to health data remains limited. Existing systems often struggle to generalize across heterogeneous health data modalities, rely heavily on predefined solution templates with insufficient adaptation to task-specific objectives, and largely overlook uncertainty estimation, which is essential for reliable decision-making in healthcare. To address these challenges, we propose AutoHealth, a novel uncertainty-aware multi-agent system that autonomously models health data and assesses model reliability. AutoHealth employs closed-loop coordination among five specialized agents to perform data exploration, task-conditioned model construction, training, and optimization, while jointly prioritizing predictive performance and uncertainty quantification. Beyond producing ready-to-use models, the system generates comprehensive reports to support trustworthy interpretation and risk-aware decision-making. To rigorously evaluate its effectiveness, we curate a challenging real-world benchmark comprising 17 tasks across diverse data modalities and learning settings. AutoHealth completes all tasks and outperforms state-of-the-art baselines by 29.2% in prediction performance and 50.2% in uncertainty estimation.
Submitted 1 February, 2026;
originally announced February 2026.
-
APR: Penalizing Structural Redundancy in Large Reasoning Models via Anchor-based Process Rewards
Authors:
Kaiyan Chang,
Chenwei Zhu,
Yingfeng Luo,
Yifu Huo,
Chenglong Wang,
Xiaoqian Liu,
Qiaozhi He,
Tong Xiao,
Zhengtao Yu,
Jingbo Zhu
Abstract:
Test-Time Scaling (TTS) has significantly enhanced the capabilities of Large Reasoning Models (LRMs) but introduces a critical side-effect known as Overthinking. We conduct a preliminary study to rethink this phenomenon from a fine-grained perspective. We observe that LRMs frequently conduct repetitive self-verification without revision even after obtaining the final answer during the reasoning process. We formally define the position where the answer first stabilizes as the Reasoning Anchor. By analyzing pre- and post-anchor reasoning behaviors, we uncover a structural redundancy entrenched in LRMs: meaningless repetitive verification after the first complete answer is derived, which we term the Answer-Stable Tail (AST). Motivated by this observation, we propose Anchor-based Process Reward (APR), a structure-aware reward-shaping method that localizes the reasoning anchor and penalizes only the post-anchor AST. Leveraging a policy optimization algorithm suited to length penalties, our APR models achieve the performance-efficiency Pareto frontier at the 1.5B and 7B scales, averaged across five mathematical reasoning datasets, while requiring substantially fewer computational resources for RL training.
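The anchor-localization and tail-penalty idea can be sketched in a few lines. This is a toy reading of the abstract, assuming intermediate answers have already been extracted per reasoning step; the function names, the per-step penalty form, and the coefficient `lam` are illustrative assumptions, not the paper's reward definition.

```python
def reasoning_anchor(step_answers):
    """Return the earliest step index from which the extracted answer never
    changes again (a toy Reasoning Anchor). `step_answers` is a hypothetical
    list of intermediate answers, one per reasoning step."""
    final = step_answers[-1]
    anchor = len(step_answers) - 1
    # Walk backward while the answer already matched the final one.
    while anchor > 0 and step_answers[anchor - 1] == final:
        anchor -= 1
    return anchor

def anchor_penalized_reward(step_answers, correct, base=1.0, lam=0.1):
    """Shape the outcome reward by charging a per-step penalty for the
    Answer-Stable Tail (steps after the anchor). Illustrative only."""
    anchor = reasoning_anchor(step_answers)
    tail = len(step_answers) - 1 - anchor     # length of the post-anchor AST
    return (base if correct else 0.0) - lam * tail
```

Because only post-anchor steps are penalized, exploration and revision before the answer stabilizes remain unpunished, which is the structural point of APR.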
Submitted 9 February, 2026; v1 submitted 31 January, 2026;
originally announced February 2026.
-
SpanNorm: Reconciling Training Stability and Performance in Deep Transformers
Authors:
Chao Wang,
Bei Li,
Jiaqi Zhang,
Xinyu Liu,
Yuchun Fan,
Linkun Lyu,
Xin Chen,
Jingang Wang,
Tong Xiao,
Peng Pei,
Xunliang Cai
Abstract:
The success of Large Language Models (LLMs) hinges on the stable training of deep Transformer architectures. A critical design choice is the placement of normalization layers, leading to a fundamental trade-off: the "PreNorm" architecture ensures training stability at the cost of potential performance degradation in deep models, while the "PostNorm" architecture offers strong performance but suffers from severe training instability. In this work, we propose SpanNorm, a novel technique designed to resolve this dilemma by integrating the strengths of both paradigms. Structurally, SpanNorm establishes a clean residual connection that spans the entire Transformer block to stabilize signal propagation, while employing a PostNorm-style computation that normalizes the aggregated output to enhance model performance. We provide a theoretical analysis demonstrating that SpanNorm, combined with a principled scaling strategy, maintains bounded signal variance throughout the network, preventing the gradient issues that plague PostNorm models, and also alleviating the representation collapse of PreNorm. Empirically, SpanNorm consistently outperforms standard normalization schemes in both dense and Mixture-of-Experts (MoE) scenarios, paving the way for more powerful and stable Transformer architectures.
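The three normalization placements can be contrasted in a minimal sketch. The PreNorm and PostNorm forms below are the standard ones; the SpanNorm form is only one plausible reading of the abstract (a block-spanning clean residual plus PostNorm-style normalization of the aggregated output), and the paper's exact formulation, gains, and scaling strategy may differ.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last (feature) dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_norm_block(x, f):
    # PreNorm: the residual carries x unnormalized; stable training, but
    # deep stacks can suffer representation collapse.
    return x + f(layer_norm(x))

def post_norm_block(x, f):
    # PostNorm: normalization sits on the residual path; strong performance,
    # but gradients through the norm can destabilize deep training.
    return layer_norm(x + f(x))

def span_norm_block(x, f):
    # Hypothetical SpanNorm reading: a clean residual spans the whole block,
    # while the block's aggregated output is normalized PostNorm-style
    # before being added back.
    return x + layer_norm(x + f(x))
```

The structural difference is where the identity path survives: in the PreNorm and sketched SpanNorm forms, `x` reaches the output untouched by any norm, which is what keeps signal propagation stable in deep stacks.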
Submitted 30 January, 2026;
originally announced January 2026.