Investigation on the Robustness of Acoustic Foundation Models on Post Exercise Speech

Authors: Xiangyuan Xue, Yuyu Wang, Ruijie Yao, Xiaoyue Ni, Xiaofan Jiang, Jingping Nie

Abstract: Automatic speech recognition (ASR) has been extensively studied on neutral and stationary speech, yet its robustness under post-exercise physiological shift remains underexplored. Compared with resting speech, post-exercise speech often contains micro-breaths, non-semantic pauses, unstable phonation, and repetitions caused by reduced breath support, making transcription more difficult. In this wor… ▽ More Automatic speech recognition (ASR) has been extensively studied on neutral and stationary speech, yet its robustness under post-exercise physiological shift remains underexplored. Compared with resting speech, post-exercise speech often contains micro-breaths, non-semantic pauses, unstable phonation, and repetitions caused by reduced breath support, making transcription more difficult. In this work, we benchmark acoustic foundation models on post-exercise speech under a unified evaluation protocol. We compare sequence-to-sequence models (Whisper and FunASR/Paraformer) and self-supervised encoders with CTC decoding (Wav2Vec2, HuBERT, and WavLM), under both off-the-shelf inference and post-exercise in-domain fine-tuning. Across the Static/Post-All benchmark, most models degrade on post-exercise speech, while FunASR shows the strongest baseline robustness at 14.57% WER and 8.21% CER on Post-All. Fine-tuning substantially improves several CTC-based models, whereas Whisper shows unstable adaptation. As an exploratory case study, we further stratify results by fluent and non-fluent speakers; although the non-fluent subset is small, it is consistently more challenging than the fluent subset. Overall, our findings show that post-exercise ASR robustness is strongly model-dependent, that in-domain adaptation can be highly effective but not uniformly stable, and that future post-exercise ASR studies should explicitly separate fluency-related effects from exercise-induced speech variation. △ Less

Submitted 29 March, 2026; originally announced March 2026.

arXiv:2603.26341 [pdf, ps, other]

HINT: Composed Image Retrieval with Dual-path Compositional Contextualized Network

Authors: Mingyu Zhang, Zixu Li, Zhiwei Chen, Zhiheng Fu, Xiaowei Zhu, Jiajia Nie, Yinwei Wei, Yupeng Hu

Abstract: Composed Image Retrieval (CIR) is a challenging image retrieval paradigm. It aims to retrieve target images from large-scale image databases that are consistent with the modification semantics, based on a multimodal query composed of a reference image and modification text. Although existing methods have made significant progress in cross-modal alignment and feature fusion, a key flaw remains: the… ▽ More Composed Image Retrieval (CIR) is a challenging image retrieval paradigm. It aims to retrieve target images from large-scale image databases that are consistent with the modification semantics, based on a multimodal query composed of a reference image and modification text. Although existing methods have made significant progress in cross-modal alignment and feature fusion, a key flaw remains: the neglect of contextual information in discriminating matching samples. However, addressing this limitation is not an easy task due to two challenges: 1) implicit dependencies and 2) the lack of a differential amplification mechanism. To address these challenges, we propose a dual-patH composItional coNtextualized neTwork (HINT), which can perform contextualized encoding and amplify the similarity differences between matching and non-matching samples, thus improving the upper performance of CIR models in complex scenarios. Our HINT model achieves optimal performance on all metrics across two CIR benchmark datasets, demonstrating the superiority of our HINT model. Codes are available at https://github.com/zh-mingyu/HINT. △ Less

Submitted 27 March, 2026; originally announced March 2026.

Comments: Accepted by ICASSP 2026

arXiv:2603.25754 [pdf, ps, other]

DUGC-VRNet: Joint VR Recognition and Channel Estimation for Spatially Non-Stationary XL-MIMO

Authors: Jinhao Nie, Guangchi Zhang, Miao Cui, Hao Fu, Xiaoli Chu

Abstract: In this letter, we address spatially non-stationary near-field channel estimation for extremely large-scale multiple-input multiple-output (XL-MIMO) systems with a hybrid combining architecture. One key challenge in the considered problem lies in that conventional channel estimation algorithms typically struggle to effectively identify and adapt to the partial antenna visibility caused by varying… ▽ More In this letter, we address spatially non-stationary near-field channel estimation for extremely large-scale multiple-input multiple-output (XL-MIMO) systems with a hybrid combining architecture. One key challenge in the considered problem lies in that conventional channel estimation algorithms typically struggle to effectively identify and adapt to the partial antenna visibility caused by varying visibility regions (VRs), thereby compromising estimation accuracy. To perform joint VR recognition and channel estimation, we integrate a deep unfolding network (DUN) with a graph convolution network (GCN), leading to a Deep Unfolding and Graph Convolution coupled, Visibility Region Aware Network (DUGC-VRNet). By leveraging the channel's graph structure, the GCN infers and feeds back VR information to dynamically guide the DUN's updates, thereby enhancing reliable channel estimation under spatial non-stationarity. To reduce DUGC-VRNet's complexity, we apply weight pruning to obtain a lightweight network. Simulation results demonstrate that the DUGC-VRNet and its pruned variant achieve superior channel estimation and more accurate VR recognition under spatially non-stationary conditions. △ Less

Submitted 24 March, 2026; originally announced March 2026.

arXiv:2603.23638 [pdf, ps, other]

Can LLM Agents Be CFOs? A Benchmark for Resource Allocation in Dynamic Enterprise Environments

Authors: Yi Han, Lingfei Qian, Yan Wang, Yueru He, Xueqing Peng, Dongji Feng, Yankai Chen, Haohang Li, Yupeng Cao, Jimin Huang, Xue Liu, Jian-Yun Nie, Sophia Ananiadou

Abstract: Large language models (LLMs) have enabled agentic systems that can reason, plan, and act across complex tasks, but it remains unclear whether they can allocate resources effectively under uncertainty. Unlike short-horizon reactive decisions, allocation requires committing scarce resources over time while balancing competing objectives and preserving flexibility for future needs. We introduce Enter… ▽ More Large language models (LLMs) have enabled agentic systems that can reason, plan, and act across complex tasks, but it remains unclear whether they can allocate resources effectively under uncertainty. Unlike short-horizon reactive decisions, allocation requires committing scarce resources over time while balancing competing objectives and preserving flexibility for future needs. We introduce EnterpriseArena, the first benchmark for evaluating agents on long-horizon enterprise resource allocation. It instantiates CFO-style decision-making in a 132-month enterprise simulator combining firm-level financial data, anonymized business documents, macroeconomic and industry signals, and expert-validated operating rules. The environment is partially observable and reveals the state only through budgeted organizational tools, forcing agents to trade off information acquisition against conserving scarce resources. Experiments on eleven advanced LLMs show that this setting remains highly challenging: only 16% of runs survive the full horizon, and larger models do not reliably outperform smaller ones. These results identify long-horizon resource allocation under uncertainty as a distinct capability gap for current LLM agents. △ Less

Submitted 24 March, 2026; originally announced March 2026.

arXiv:2603.22323 [pdf, ps, other]

A Multi-Task Targeted Learning Framework for Lithium-Ion Battery State-of-Health and Remaining Useful Life

Authors: Chenhan Wang, Zhengyi Bao, Huipin Lin, Jiahao Nie, Chunxiang Zhu

Abstract: Accurately predicting the state-of-health (SOH) and remaining useful life (RUL) of lithium-ion batteries is crucial for ensuring the safe and efficient operation of electric vehicles while minimizing associated risks. However, current deep learning methods are limited in their ability to selectively extract features and model time dependencies for these two parameters. Moreover, most existing meth… ▽ More Accurately predicting the state-of-health (SOH) and remaining useful life (RUL) of lithium-ion batteries is crucial for ensuring the safe and efficient operation of electric vehicles while minimizing associated risks. However, current deep learning methods are limited in their ability to selectively extract features and model time dependencies for these two parameters. Moreover, most existing methods rely on traditional recurrent neural networks, which have inherent shortcomings in long-term time-series modeling. To address these issues, this paper proposes a multi-task targeted learning framework for SOH and RUL prediction, which integrates multiple neural networks, including a multi-scale feature extraction module, an improved extended LSTM, and a dual-stream attention module. First, a feature extraction module with multi-scale CNNs is designed to capture detailed local battery decline patterns. Secondly, an improved extended LSTM network is employed to enhance the model's ability to retain long-term temporal information, thus improving temporal relationship modeling. Building on this, the dual-stream attention module-comprising polarized attention and sparse attention to selectively focus on key information relevant to SOH and RUL, respectively, by assigning higher weights to important features. Finally, a many-to-two mapping is achieved through the dual-task layer. To optimize the model's performance and reduce the need for manual hyperparameter tuning, the Hyperopt optimization algorithm is used. Extensive comparative experiments on battery aging datasets demonstrate that the proposed method reduces the average RMSE for SOH and RUL predictions by 111.3\% and 33.0\%, respectively, compared to traditional and state-of-the-art methods. △ Less

Submitted 20 March, 2026; originally announced March 2026.

Comments: https://github.com/wch1121/Joint-prediction-of-SOH-and-RUL

arXiv:2603.18397 [pdf, ps, other]

FlowMS: Flow Matching for De Novo Structure Elucidation from Mass Spectra

Authors: Jianan Nie, Peng Gao

Abstract: Mass spectrometry (MS) stands as a cornerstone analytical technique for molecular identification, yet de novo structure elucidation from spectra remains challenging due to the combinatorial complexity of chemical space and the inherent ambiguity of spectral fragmentation patterns. Recent deep learning approaches, including autoregressive sequence models, scaffold-based methods, and graph diffusion… ▽ More Mass spectrometry (MS) stands as a cornerstone analytical technique for molecular identification, yet de novo structure elucidation from spectra remains challenging due to the combinatorial complexity of chemical space and the inherent ambiguity of spectral fragmentation patterns. Recent deep learning approaches, including autoregressive sequence models, scaffold-based methods, and graph diffusion models, have made progress. However, diffusion-based generation for this task remains computationally demanding. Meanwhile, discrete flow matching, which has shown strong performance for graph generation, has not yet been explored for spectrum-conditioned structure elucidation. In this work, we introduce FlowMS, the first discrete flow matching framework for spectrum-conditioned de novo molecular generation. FlowMS generates molecular graphs through iterative refinement in probability space, enforcing chemical formula constraints while conditioning on spectral embeddings from a pretrained formula transformer encoder. Notably, it achieves state-of-the-art performance on 5 out of 6 metrics on the NPLIB1 benchmark: 9.15% top-1 accuracy (9.7% relative improvement over DiffMS) and 7.96 top-10 MCES (4.2% improvement over MS-BART). We also visualize the generated molecules, which further demonstrate that FlowMS produces structurally plausible candidates closely resembling ground truth structures. These results establish discrete flow matching as a promising paradigm for mass spectrometry-based structure elucidation in metabolomics and natural product discovery. △ Less

Submitted 18 March, 2026; originally announced March 2026.

arXiv:2603.16292 [pdf, ps, other]

Attention-guided Evidence Grounding for Spoken Question Answering

Authors: Ke Yang, Bolin Chen, Yuejie Li, Yueying Hua, Jianhao Nie, Yueping He, Bowen Li, Chengjun Mao

Abstract: Spoken Question Answering (Spoken QA) presents a challenging cross-modal problem: effectively aligning acoustic queries with textual knowledge while avoiding the latency and error propagation inherent in cascaded ASR-based systems. In this paper, we introduce Attention-guided Evidence Grounding (AEG), a novel end-to-end framework that leverages the internal cross-modal attention of Speech Large La… ▽ More Spoken Question Answering (Spoken QA) presents a challenging cross-modal problem: effectively aligning acoustic queries with textual knowledge while avoiding the latency and error propagation inherent in cascaded ASR-based systems. In this paper, we introduce Attention-guided Evidence Grounding (AEG), a novel end-to-end framework that leverages the internal cross-modal attention of Speech Large Language Models (SpeechLLMs) to explicitly locate and ground key evidence in the model's latent space. To address the diffuse attention distribution in pre-trained models, we propose Learning to Focus on Evidence (LFE), a supervised fine-tuning paradigm that calibrates the model's attention mechanism to distinguish query-relevant segments from irrelevant context. Experiments on SQuAD, HotpotQA, and MuSiQue demonstrate that AEG reduces hallucinations and achieves strong efficiency gains, outperforming large-scale cascaded baselines (Whisper-Large-v3 + Reranker) while reducing inference latency by approximately 62%. △ Less

Submitted 17 March, 2026; v1 submitted 17 March, 2026; originally announced March 2026.

Comments: Accepted to ICME 2026

arXiv:2603.08721 [pdf, ps, other]

KernelCraft: Benchmarking for Agentic Close-to-Metal Kernel Generation on Emerging Hardware

Authors: Jiayi Nie, Haoran Wu, Yao Lai, Zeyu Cao, Cheng Zhang, Binglei Lou, Erwei Wang, Jianyi Cheng, Timothy M. Jones, Robert Mullins, Rika Antonova, Yiren Zhao

Abstract: New AI accelerators with novel instruction set architectures (ISAs) often require developers to manually craft low-level kernels -- a time-consuming, laborious, and error-prone process that cannot scale across diverse hardware targets. This prevents emerging hardware platforms from reaching the market efficiently. While prior LLM-based code generation has shown promise in mature GPU ecosystems, it… ▽ More New AI accelerators with novel instruction set architectures (ISAs) often require developers to manually craft low-level kernels -- a time-consuming, laborious, and error-prone process that cannot scale across diverse hardware targets. This prevents emerging hardware platforms from reaching the market efficiently. While prior LLM-based code generation has shown promise in mature GPU ecosystems, it remains unclear whether agentic LLM systems can quickly produce valid and efficient kernels for emerging hardware with new ISAs. We present KernelCraft: the first benchmark to evaluate an LLM agent's ability to generate and optimize low-level kernels for customized accelerators via a function-calling, feedback-driven workflow. Within KernelCraft, the agent refines kernels under ISA and hardware constraints using automated feedback derived from compilation checks, simulation, and correctness validation against ground truth. In our experiments, we assess agent performance across three emerging accelerator platforms on more than 20 ML tasks, each with 5 diverse task configurations, with special evaluation of task configuration complexity. Across four leading reasoning models, top agents produce functionally valid kernels for previously unseen ISAs within a few refinement steps, with optimized kernels that match or outperform template-based compiler baselines. With that, we demonstrate the potential for reducing the cost of kernel development for accelerator designers and kernel developers. △ Less

Submitted 10 February, 2026; originally announced March 2026.

arXiv:2603.07918 [pdf, ps, other]

Enhancing Unregistered Hyperspectral Image Super-Resolution via Unmixing-based Abundance Fusion Learning

Authors: Yingkai Zhang, Tao Zhang, Jing Nie, Ying Fu

Abstract: Unregistered hyperspectral image (HSI) super-resolution (SR) typically aims to enhance a low-resolution HSI using an unregistered high-resolution reference image. In this paper, we propose an unmixing-based fusion framework that decouples spatial-spectral information to simultaneously mitigate the impact of unregistered fusion and enhance the learnability of SR models. Specifically, we first utili… ▽ More Unregistered hyperspectral image (HSI) super-resolution (SR) typically aims to enhance a low-resolution HSI using an unregistered high-resolution reference image. In this paper, we propose an unmixing-based fusion framework that decouples spatial-spectral information to simultaneously mitigate the impact of unregistered fusion and enhance the learnability of SR models. Specifically, we first utilize singular value decomposition for initial spectral unmixing, preserving the original endmembers while dedicating the subsequent network to enhancing the initial abundance map. To leverage the spatial texture of the unregistered reference, we introduce a coarse-to-fine deformable aggregation module, which first estimates a pixel-level flow and a similarity map using a coarse pyramid predictor. It further performs fine sub-pixel refinement to achieve deformable aggregation of the reference features. The aggregative features are then refined via a series of spatial-channel abundance cross-attention blocks. Furthermore, a spatial-channel modulated fusion module is presented to merge encoder-decoder features using dynamic gating weights, yielding a high-quality, high-resolution HSI. Experimental results on simulated and real datasets confirm that our proposed method achieves state-of-the-art super-resolution performance. The code will be available at https://github.com/yingkai-zhang/UAFL. △ Less

Submitted 8 March, 2026; originally announced March 2026.

arXiv:2602.24133 [pdf, ps, other]

FocusTrack: One-Stage Focus-and-Suppress Framework for 3D Point Cloud Object Tracking

Authors: Sifan Zhou, Jiahao Nie, Ziyu Zhao, Yichao Cao, Xiaobo Lu

Abstract: In 3D point cloud object tracking, the motion-centric methods have emerged as a promising avenue due to its superior performance in modeling inter-frame motion. However, existing two-stage motion-based approaches suffer from fundamental limitations: (1) error accumulation due to decoupled optimization caused by explicit foreground segmentation prior to motion estimation, and (2) computational bott… ▽ More In 3D point cloud object tracking, the motion-centric methods have emerged as a promising avenue due to its superior performance in modeling inter-frame motion. However, existing two-stage motion-based approaches suffer from fundamental limitations: (1) error accumulation due to decoupled optimization caused by explicit foreground segmentation prior to motion estimation, and (2) computational bottlenecks from sequential processing. To address these challenges, we propose FocusTrack, a novel one-stage paradigms tracking framework that unifies motion-semantics co-modeling through two core innovations: Inter-frame Motion Modeling (IMM) and Focus-and-Suppress Attention. The IMM module employs a temp-oral-difference siamese encoder to capture global motion patterns between adjacent frames. The Focus-and-Suppress attention that enhance the foreground semantics via motion-salient feature gating and suppress the background noise based on the temporal-aware motion context from IMM without explicit segmentation. Based on above two designs, FocusTrack enables end-to-end training with compact one-stage pipeline. Extensive experiments on prominent 3D tracking benchmarks, such as KITTI, nuScenes, and Waymo, demonstrate that the FocusTrack achieves new SOTA performance while running at a high speed with 105 FPS. △ Less

Submitted 15 March, 2026; v1 submitted 27 February, 2026; originally announced February 2026.

Comments: Acceptted in ACM MM 2025

arXiv:2602.22547 [pdf, ps, other]

Towards Dynamic Dense Retrieval with Routing Strategy

Authors: Zhan Su, Fengran Mo, Jinghan Zhang, Yuchen Hui, Jia Ao Sun, Bingbing Wen, Jian-Yun Nie

Abstract: The \textit{de facto} paradigm for applying dense retrieval (DR) to new tasks involves fine-tuning a pre-trained model for a specific task. However, this paradigm has two significant limitations: (1) It is difficult adapt the DR to a new domain if the training dataset is limited. (2) Old DR models are simply replaced by newer models that are trained from scratch when the former are no longer up… ▽ More The \textit{de facto} paradigm for applying dense retrieval (DR) to new tasks involves fine-tuning a pre-trained model for a specific task. However, this paradigm has two significant limitations: (1) It is difficult adapt the DR to a new domain if the training dataset is limited. (2) Old DR models are simply replaced by newer models that are trained from scratch when the former are no longer up to date. Especially for scenarios where the model needs to be updated frequently, this paradigm is prohibitively expensive. To address these challenges, we propose a novel dense retrieval approach, termed \textit{dynamic dense retrieval} (DDR). DDR uses \textit{prefix tuning} as a \textit{module} specialized for a specific domain. These modules can then be compositional combined with a dynamic routing strategy, enabling highly flexible domain adaptation in the retrieval part. Extensive evaluation on six zero-shot downstream tasks demonstrates that this approach can surpass DR while utilizing only 2\% of the training parameters, paving the way to achieve more flexible dense retrieval in IR. We see it as a promising future direction for applying dense retrieval to various tasks. △ Less

Submitted 25 February, 2026; originally announced February 2026.

arXiv:2602.20019 [pdf, ps, other]

Learning Discriminative and Generalizable Anomaly Detector for Dynamic Graph with Limited Supervision

Authors: Yuxing Tian, Yiyan Qi, Fengran Mo, Weixu Zhang, Jian Guo, Jian-Yun Nie

Abstract: Dynamic graph anomaly detection (DGAD) is critical for many real-world applications but remains challenging due to the scarcity of labeled anomalies. Existing methods are either unsupervised or semi-supervised: unsupervised methods avoid the need for labeled anomalies but often produce ambiguous boundary, whereas semi-supervised methods can overfit to the limited labeled anomalies and generalize p… ▽ More Dynamic graph anomaly detection (DGAD) is critical for many real-world applications but remains challenging due to the scarcity of labeled anomalies. Existing methods are either unsupervised or semi-supervised: unsupervised methods avoid the need for labeled anomalies but often produce ambiguous boundary, whereas semi-supervised methods can overfit to the limited labeled anomalies and generalize poorly to unseen anomalies. To address this gap, we consider a largely underexplored problem in DGAD: learning a discriminative boundary from normal/unlabeled data, while leveraging limited labeled anomalies \textbf{when available} without sacrificing generalization to unseen anomalies. To this end, we propose an effective, generalizable, and model-agnostic framework with three main components: (i) residual representation encoding that capture deviations between current interactions and their historical context, providing anomaly-relevant signals; (ii) a restriction loss that constrain the normal representations within an interval bounded by two co-centered hyperspheres, ensuring consistent scales while keeping anomalies separable; (iii) a bi-boundary optimization strategy that learns a discriminative and robust boundary using the normal log-likelihood distribution modeled by a normalizing flow. Extensive experiments demonstrate the superiority of our framework across diverse evaluation settings. △ Less

Submitted 23 February, 2026; originally announced February 2026.

Comments: 21 pages, 7 figures

arXiv:2602.19969 [pdf, ps, other]

ReAttn: Improving Attention-based Re-ranking via Attention Re-weighting

Authors: Yuxing Tian, Fengran Mo, Weixu Zhang, Yiyan Qi, Jian-Yun Nie

Abstract: The strong capabilities of recent Large Language Models (LLMs) have made them highly effective for zero-shot re-ranking task. Attention-based re-ranking methods, which derive relevance scores directly from attention weights, offer an efficient and interpretable alternative to generation-based re-ranking methods. However, they still face two major limitations. First, attention signals are highly co… ▽ More The strong capabilities of recent Large Language Models (LLMs) have made them highly effective for zero-shot re-ranking task. Attention-based re-ranking methods, which derive relevance scores directly from attention weights, offer an efficient and interpretable alternative to generation-based re-ranking methods. However, they still face two major limitations. First, attention signals are highly concentrated a small subset of tokens within a few documents, making others indistinguishable. Second, attention often overemphasizes phrases lexically similar to the query, yielding biased rankings that irrelevant documents with mere lexical resemblance are regarded as relevant. In this paper, we propose \textbf{ReAttn}, a post-hoc re-weighting strategy for attention-based re-ranking methods. It first compute the cross-document IDF weighting to down-weight attention on query-overlapping tokens that frequently appear across the candidate documents, reducing lexical bias and emphasizing distinctive terms. It then employs entropy-based regularization to mitigate over-concentrated attention, encouraging a more balanced distribution across informative tokens. Both adjustments operate directly on existing attention weights without additional training or supervision. Extensive experiments demonstrate the effectiveness of our method. △ Less

Submitted 23 February, 2026; originally announced February 2026.

Comments: Accepted by EACL2026

arXiv:2602.16990 [pdf, ps, other]

Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation

Authors: Yan Wang, Yi Han, Lingfei Qian, Yueru He, Xueqing Peng, Dongji Feng, Zhuohan Xie, Vincent Jim Zhang, Rosie Guo, Fengran Mo, Jimin Huang, Yankai Chen, Xue Liu, Jian-Yun Nie

Abstract: Most recommendation benchmarks evaluate how well a model imitates user behavior. In financial advisory, however, observed actions can be noisy or short-sighted under market volatility and may conflict with a user's long-term goals. Treating what users chose as the sole ground truth, therefore, conflates behavioral imitation with decision quality. We introduce Conv-FinRe, a conversational and longi… ▽ More Most recommendation benchmarks evaluate how well a model imitates user behavior. In financial advisory, however, observed actions can be noisy or short-sighted under market volatility and may conflict with a user's long-term goals. Treating what users chose as the sole ground truth, therefore, conflates behavioral imitation with decision quality. We introduce Conv-FinRe, a conversational and longitudinal benchmark for stock recommendation that evaluates LLMs beyond behavior matching. Given an onboarding interview, step-wise market context, and advisory dialogues, models must generate rankings over a fixed investment horizon. Crucially, Conv-FinRe provides multi-view references that distinguish descriptive behavior from normative utility grounded in investor-specific risk preferences, enabling diagnosis of whether an LLM follows rational analysis, mimics user noise, or is driven by market momentum. We build the benchmark from real market data and human decision trajectories, instantiate controlled advisory conversations, and evaluate a suite of state-of-the-art LLMs. Results reveal a persistent tension between rational decision quality and behavioral alignment: models that perform well on utility-based ranking often fail to match user choices, whereas behaviorally aligned models can overfit short-term noise. The dataset is publicly released on Hugging Face, and the codebase is available on GitHub. △ Less

Submitted 18 February, 2026; originally announced February 2026.

arXiv:2602.12783 [pdf, ps, other]

SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise

Authors: Yuejie Li, Ke Yang, Yueying Hua, Berlin Chen, Jianhao Nie, Yueping He, Caixin Kang

Abstract: Spoken query retrieval is an important interaction mode in modern information retrieval. However, existing evaluation datasets are often limited to simple queries under constrained noise conditions, making them inadequate for assessing the robustness of spoken query retrieval systems under complex acoustic perturbations. To address this limitation, we present SQuTR, a robustness benchmark for spok… ▽ More Spoken query retrieval is an important interaction mode in modern information retrieval. However, existing evaluation datasets are often limited to simple queries under constrained noise conditions, making them inadequate for assessing the robustness of spoken query retrieval systems under complex acoustic perturbations. To address this limitation, we present SQuTR, a robustness benchmark for spoken query retrieval that includes a large-scale dataset and a unified evaluation protocol. SQuTR aggregates 37,317 unique queries from six commonly used English and Chinese text retrieval datasets, spanning multiple domains and diverse query types. We synthesize speech using voice profiles from 200 real speakers and mix 17 categories of real-world environmental noise under controlled SNR levels, enabling reproducible robustness evaluation from quiet to highly noisy conditions. Under the unified protocol, we conduct large-scale evaluations on representative cascaded and end-to-end retrieval systems. Experimental results show that retrieval performance decreases as noise increases, with substantially different drops across systems. Even large-scale retrieval models struggle under extreme noise, indicating that robustness remains a critical bottleneck. Overall, SQuTR provides a reproducible testbed for benchmarking and diagnostic analysis, and facilitates future research on robustness in spoken query to text retrieval. △ Less

Submitted 13 February, 2026; originally announced February 2026.

arXiv:2602.10016 [pdf, ps, other]

Kunlun: Establishing Scaling Laws for Massive-Scale Recommendation Systems through Unified Architecture Design

Authors: Bojian Hou, Xiaolong Liu, Xiaoyi Liu, Jiaqi Xu, Yasmine Badr, Mengyue Hang, Sudhanshu Chanpuriya, Junqing Zhou, Yuhang Yang, Han Xu, Qiuling Suo, Laming Chen, Yuxi Hu, Jiasheng Zhang, Huaqing Xiong, Yuzhen Huang, Chao Chen, Yue Dong, Yi Yang, Shuo Chang, Xiaorui Gan, Wenlin Chen, Santanu Kolay, Darren Liu, Jade Nie , et al. (4 additional authors not shown)

Abstract: Deriving predictable scaling laws that govern the relationship between model performance and computational investment is crucial for designing and allocating resources in massive-scale recommendation systems. While such laws are established for large language models, they remain challenging for recommendation systems, especially those processing both user history and context features. We identify… ▽ More Deriving predictable scaling laws that govern the relationship between model performance and computational investment is crucial for designing and allocating resources in massive-scale recommendation systems. While such laws are established for large language models, they remain challenging for recommendation systems, especially those processing both user history and context features. We identify poor scaling efficiency as the main barrier to predictable power-law scaling, stemming from inefficient modules with low Model FLOPs Utilization (MFU) and suboptimal resource allocation. We introduce Kunlun, a scalable architecture that systematically improves model efficiency and resource allocation. Our low-level optimizations include Generalized Dot-Product Attention (GDPA), Hierarchical Seed Pooling (HSP), and Sliding Window Attention. Our high-level innovations feature Computation Skip (CompSkip) and Event-level Personalization. These advances increase MFU from 17% to 37% on NVIDIA B200 GPUs and double scaling efficiency over state-of-the-art methods. Kunlun is now deployed in major Meta Ads models, delivering significant production impact. △ Less

Submitted 13 February, 2026; v1 submitted 10 February, 2026; originally announced February 2026.

Comments: 10 pages, 4 figures

arXiv:2602.05218 [pdf, ps, other]

Boosting SAM for Cross-Domain Few-Shot Segmentation via Conditional Point Sparsification

Authors: Jiahao Nie, Yun Xing, Wenbin An, Qingsong Zhao, Jiawei Shao, Yap-Peng Tan, Alex C. Kot, Shijian Lu, Xuelong Li

Abstract: Motivated by the success of the Segment Anything Model (SAM) in promptable segmentation, recent studies leverage SAM to develop training-free solutions for few-shot segmentation, which aims to predict object masks in the target image based on a few reference exemplars. These SAM-based methods typically rely on point matching between reference and target images and use the matched dense points as p… ▽ More Motivated by the success of the Segment Anything Model (SAM) in promptable segmentation, recent studies leverage SAM to develop training-free solutions for few-shot segmentation, which aims to predict object masks in the target image based on a few reference exemplars. These SAM-based methods typically rely on point matching between reference and target images and use the matched dense points as prompts for mask prediction. However, we observe that dense points perform poorly in Cross-Domain Few-Shot Segmentation (CD-FSS), where target images are from medical or satellite domains. We attribute this issue to large domain shifts that disrupt the point-image interactions learned by SAM, and find that point density plays a crucial role under such conditions. To address this challenge, we propose Conditional Point Sparsification (CPS), a training-free approach that adaptively guides SAM interactions for cross-domain images based on reference exemplars. Leveraging ground-truth masks, the reference images provide reliable guidance for adaptively sparsifying dense matched points, enabling more accurate segmentation results. Extensive experiments demonstrate that CPS outperforms existing training-free SAM-based methods across diverse CD-FSS datasets. △ Less

Submitted 4 February, 2026; originally announced February 2026.

arXiv:2602.05217 [pdf, ps, other]

Cross-Domain Few-Shot Segmentation via Multi-view Progressive Adaptation

Authors: Jiahao Nie, Guanqiao Fu, Wenbin An, Yap-Peng Tan, Alex C. Kot, Shijian Lu

Abstract: Cross-Domain Few-Shot Segmentation aims to segment categories in data-scarce domains conditioned on a few exemplars. Typical methods first establish few-shot capability in a large-scale source domain and then adapt it to target domains. However, due to the limited quantity and diversity of target samples, existing methods still exhibit constrained performance. Moreover, the source-trained model's… ▽ More Cross-Domain Few-Shot Segmentation aims to segment categories in data-scarce domains conditioned on a few exemplars. Typical methods first establish few-shot capability in a large-scale source domain and then adapt it to target domains. However, due to the limited quantity and diversity of target samples, existing methods still exhibit constrained performance. Moreover, the source-trained model's initially weak few-shot capability in target domains, coupled with substantial domain gaps, severely hinders the effective utilization of target samples and further impedes adaptation. To this end, we propose Multi-view Progressive Adaptation, which progressively adapts few-shot capability to target domains from both data and strategy perspectives. (i) From the data perspective, we introduce Hybrid Progressive Augmentation, which progressively generates more diverse and complex views through cumulative strong augmentations, thereby creating increasingly challenging learning scenarios. (ii) From the strategy perspective, we design Dual-chain Multi-view Prediction, which fully leverages these progressively complex views through sequential and parallel learning paths under extensive supervision. By jointly enforcing prediction consistency across diverse and complex views, MPA achieves both robust and accurate adaptation to target domains. Extensive experiments demonstrate that MPA effectively adapts few-shot capability to target domains, outperforming state-of-the-art methods by a large margin (+7.0%). △ Less

Submitted 4 February, 2026; originally announced February 2026.

arXiv:2602.05215 [pdf, ps, other]

E.M.Ground: A Temporal Grounding Vid-LLM with Holistic Event Perception and Matching

Authors: Jiahao Nie, Wenbin An, Gongjie Zhang, Yicheng Xu, Yap-Peng Tan, Alex C. Kot, Shijian Lu

Abstract: Despite recent advances in Video Large Language Models (Vid-LLMs), Temporal Video Grounding (TVG), which aims to precisely localize time segments corresponding to query events, remains a significant challenge. Existing methods often match start and end frames by comparing frame features with two separate tokens, relying heavily on exact timestamps. However, this approach fails to capture the event… ▽ More Despite recent advances in Video Large Language Models (Vid-LLMs), Temporal Video Grounding (TVG), which aims to precisely localize time segments corresponding to query events, remains a significant challenge. Existing methods often match start and end frames by comparing frame features with two separate tokens, relying heavily on exact timestamps. However, this approach fails to capture the event's semantic continuity and integrity, leading to ambiguities. To address this, we propose E.M.Ground, a novel Vid-LLM for TVG that focuses on holistic and coherent event perception. E.M.Ground introduces three key innovations: (i) a special <event> token that aggregates information from all frames of a query event, preserving semantic continuity for accurate event matching; (ii) Savitzky-Golay smoothing to reduce noise in token-to-frame similarities across timestamps, improving prediction accuracy; (iii) multi-grained frame feature aggregation to enhance matching reliability and temporal understanding, compensating for compression-induced information loss. Extensive experiments on benchmark datasets show that E.M.Ground consistently outperforms state-of-the-art Vid-LLMs by significant margins. △ Less

Submitted 4 February, 2026; originally announced February 2026.

arXiv:2601.20706 [pdf, ps, other]

Beyond GEMM-Centric NPUs: Enabling Efficient Diffusion LLM Sampling

Authors: Binglei Lou, Haoran Wu, Yao Lai, Jiayi Nie, Can Xiao, Xuan Guo, Rika Antonova, Robert Mullins, Aaron Zhao

Abstract: Diffusion Large Language Models (dLLMs) introduce iterative denoising to enable parallel token generation, but their sampling phase displays fundamentally different characteristics compared to GEMM-centric transformer layers. Profiling on modern GPUs reveals that sampling can account for up to 70% of total model inference latency-primarily due to substantial memory loads and writes from vocabulary… ▽ More Diffusion Large Language Models (dLLMs) introduce iterative denoising to enable parallel token generation, but their sampling phase displays fundamentally different characteristics compared to GEMM-centric transformer layers. Profiling on modern GPUs reveals that sampling can account for up to 70% of total model inference latency-primarily due to substantial memory loads and writes from vocabulary-wide logits, reduction-based token selection, and iterative masked updates. These processes demand large on-chip SRAM and involve irregular memory accesses that conventional NPUs struggle to handle efficiently. To address this, we identify a set of critical instructions that an NPU architecture must specifically optimize for dLLM sampling. Our design employs lightweight non-GEMM vector primitives, in-place memory reuse strategies, and a decoupled mixed-precision memory hierarchy. Together, these optimizations deliver up to a 2.53x speedup over the NVIDIA RTX A6000 GPU under an equivalent nm technology node. We also open-source our cycle-accurate simulation and post-synthesis RTL verification code, confirming functional equivalence with current dLLM PyTorch implementations. △ Less

Submitted 28 January, 2026; originally announced January 2026.

arXiv:2601.20318 [pdf, ps, other]

CPiRi: Channel Permutation-Invariant Relational Interaction for Multivariate Time Series Forecasting

Authors: Jiyuan Xu, Wenyu Zhang, Xin Jing, Shuai Chen, Shuai Zhang, Jiahao Nie

Abstract: Current methods for multivariate time series forecasting can be classified into channel-dependent and channel-independent models. Channel-dependent models learn cross-channel features but often overfit the channel ordering, which hampers adaptation when channels are added or reordered. Channel-independent models treat each channel in isolation to increase flexibility, yet this neglects inter-chann… ▽ More Current methods for multivariate time series forecasting can be classified into channel-dependent and channel-independent models. Channel-dependent models learn cross-channel features but often overfit the channel ordering, which hampers adaptation when channels are added or reordered. Channel-independent models treat each channel in isolation to increase flexibility, yet this neglects inter-channel dependencies and limits performance. To address these limitations, we propose \textbf{CPiRi}, a \textbf{channel permutation invariant (CPI)} framework that infers cross-channel structure from data rather than memorizing a fixed ordering, enabling deployment in settings with structural and distributional co-drift without retraining. CPiRi couples \textbf{spatio-temporal decoupling architecture} with \textbf{permutation-invariant regularization training strategy}: a frozen pretrained temporal encoder extracts high-quality temporal features, a lightweight spatial module learns content-driven inter-channel relations, while a channel shuffling strategy enforces CPI during training. We further \textbf{ground CPiRi in theory} by analyzing permutation equivariance in multivariate time series forecasting. Experiments on multiple benchmarks show state-of-the-art results. CPiRi remains stable when channel orders are shuffled and exhibits strong \textbf{inductive generalization} to unseen channels even when trained on \textbf{only half} of the channels, while maintaining \textbf{practical efficiency} on large-scale datasets. The source code is released at https://github.com/JasonStraka/CPiRi. △ Less

Submitted 27 February, 2026; v1 submitted 28 January, 2026; originally announced January 2026.

Comments: 22 pages, 10 figures, ICLR 2026

arXiv:2601.14896 [pdf, ps, other]

Language-Coupled Reinforcement Learning for Multilingual Retrieval-Augmented Generation

Authors: Rui Qi, Fengran Mo, Yufeng Chen, Xue Zhang, Shuo Wang, Hongliang Li, Jinan Xu, Meng Jiang, Jian-Yun Nie, Kaiyu Huang

Abstract: Multilingual retrieval-augmented generation (MRAG) requires models to effectively acquire and integrate beneficial external knowledge from multilingual collections. However, most existing studies employ a unitive process where queries of equivalent semantics across different languages are processed through a single-turn retrieval and subsequent optimization. Such a ``one-size-fits-all'' strategy i… ▽ More Multilingual retrieval-augmented generation (MRAG) requires models to effectively acquire and integrate beneficial external knowledge from multilingual collections. However, most existing studies employ a unitive process where queries of equivalent semantics across different languages are processed through a single-turn retrieval and subsequent optimization. Such a ``one-size-fits-all'' strategy is often suboptimal in multilingual settings, as the models occur to knowledge bias and conflict during the interaction with the search engine. To alleviate the issues, we propose LcRL, a multilingual search-augmented reinforcement learning framework that integrates a language-coupled Group Relative Policy Optimization into the policy and reward models. We adopt the language-coupled group sampling in the rollout module to reduce knowledge bias, and regularize an auxiliary anti-consistency penalty in the reward models to mitigate the knowledge conflict. Experimental results demonstrate that LcRL not only achieves competitive performance but is also appropriate for various practical scenarios such as constrained training data and retrieval over collections encompassing a large number of languages. Our code is available at https://github.com/Cherry-qwq/LcRL-Open. △ Less

Submitted 21 January, 2026; originally announced January 2026.

arXiv:2601.14716 [pdf, ps, other]

PCL-Reasoner-V1.5: Advancing Math Reasoning with Offline Reinforcement Learning

Authors: Yao Lu, Dengdong Fan, Jianzheng Nie, Fan Xu, Jie Chen, Bin Zhou, Yonghong Tian

Abstract: We present PCL-Reasoner-V1.5, a 32-billion-parameter large language model (LLM) for mathematical reasoning. The model is built upon Qwen2.5-32B and refined via supervised fine-tuning (SFT) followed by reinforcement learning (RL). A central innovation is our proposed offline RL method, which provides superior training stability and efficiency over standard online RL methods such as GRPO. Our model… ▽ More We present PCL-Reasoner-V1.5, a 32-billion-parameter large language model (LLM) for mathematical reasoning. The model is built upon Qwen2.5-32B and refined via supervised fine-tuning (SFT) followed by reinforcement learning (RL). A central innovation is our proposed offline RL method, which provides superior training stability and efficiency over standard online RL methods such as GRPO. Our model achieves state-of-the-art performance among models post-trained on Qwen2.5-32B, attaining average accuracies of 90.9% on AIME 2024 and 85.6% on AIME 2025. Our work demonstrates offline RL as a stable and efficient paradigm for advancing reasoning in LLMs. All experiments were conducted on Huawei Ascend 910C NPUs. △ Less

Submitted 21 January, 2026; originally announced January 2026.

arXiv:2601.09028 [pdf, ps, other]

OpenDecoder: Open Large Language Model Decoding to Incorporate Document Quality in RAG

Authors: Fengran Mo, Zhan Su, Yuchen Hui, Jinghan Zhang, Jia Ao Sun, Zheyuan Liu, Chao Zhang, Tetsuya Sakai, Jian-Yun Nie

Abstract: The development of large language models (LLMs) has achieved superior performance in a range of downstream tasks, including LLM-based retrieval-augmented generation (RAG). The quality of generated content heavily relies on the usefulness of the retrieved information and the capacity of LLMs' internal information processing mechanism to incorporate it in answer generation. It is generally assumed t… ▽ More The development of large language models (LLMs) has achieved superior performance in a range of downstream tasks, including LLM-based retrieval-augmented generation (RAG). The quality of generated content heavily relies on the usefulness of the retrieved information and the capacity of LLMs' internal information processing mechanism to incorporate it in answer generation. It is generally assumed that the retrieved information is relevant to the question. However, the retrieved information may have a variable degree of relevance and usefulness, depending on the question and the document collection. It is important to take into account the relevance of the retrieved information in answer generation. In this paper, we propose OpenDecoder, a new approach that leverages explicit evaluation of the retrieved information as quality indicator features for generation. We aim to build a RAG model that is more robust to varying levels of noisy context. Three types of explicit evaluation information are considered: relevance score, ranking score, and QPP (query performance prediction) score. The experimental results on five benchmark datasets demonstrate the effectiveness and better robustness of OpenDecoder by outperforming various baseline methods. Importantly, this paradigm is flexible to be integrated with the post-training of LLMs for any purposes and incorporated with any type of external indicators. △ Less

Submitted 23 January, 2026; v1 submitted 13 January, 2026; originally announced January 2026.

Comments: Accepted by ACM WWW 2026

arXiv:2512.20556 [pdf, ps, other]

Multi-Grained Text-Guided Image Fusion for Multi-Exposure and Multi-Focus Scenarios

Authors: Mingwei Tang, Jiahao Nie, Guang Yang, Ziqing Cui, Jie Li

Abstract: Image fusion aims to synthesize a single high-quality image from a pair of inputs captured under challenging conditions, such as differing exposure levels or focal depths. A core challenge lies in effectively handling disparities in dynamic range and focus depth between the inputs. With the advent of vision-language models, recent methods incorporate textual descriptions as auxiliary guidance to e… ▽ More Image fusion aims to synthesize a single high-quality image from a pair of inputs captured under challenging conditions, such as differing exposure levels or focal depths. A core challenge lies in effectively handling disparities in dynamic range and focus depth between the inputs. With the advent of vision-language models, recent methods incorporate textual descriptions as auxiliary guidance to enhance fusion quality. However, simply incorporating coarse-grained descriptions hampers the understanding of fine-grained details and poses challenges for precise cross-modal alignment. To address these limitations, we propose Multi-grained Text-guided Image Fusion (MTIF), a novel fusion paradigm with three key designs. First, it introduces multi-grained textual descriptions that separately capture fine details, structural cues, and semantic content, guiding image fusion through a hierarchical cross-modal modulation module. Second, it involves supervision signals at each granularity to facilitate alignment between visual and textual features and enhance the utility of auxiliary text. Third, it adopts a saliency-driven enrichment module to augment training data with dense semantic content, further strengthening the cross-modal modulation and alignment. Extensive experiments show that MTIF consistently outperforms previous methods on both multi-exposure and multi-focus image fusion tasks. △ Less

Submitted 23 December, 2025; originally announced December 2025.

Comments: Accepted to WACV 2026

arXiv:2512.09200 [pdf, ps, other]

Meta Lattice: Model Space Redesign for Cost-Effective Industry-Scale Ads Recommendations

Authors: Liang Luo, Yuxin Chen, Zhengyu Zhang, Mengyue Hang, Andrew Gu, Buyun Zhang, Boyang Liu, Chen Chen, Chengze Fan, Dong Liang, Fan Yang, Feifan Gu, Huayu Li, Jade Nie, Jiayi Xu, Jiyan Yang, Jongsoo Park, Laming Chen, Longhao Jin, Qianru Li, Qin Huang, Shali Jiang, Shiwen Shen, Shuaiwen Wang, Sihan Zeng , et al. (17 additional authors not shown)

Abstract: The rapidly evolving landscape of products, surfaces, policies, and regulations poses significant challenges for deploying state-of-the-art recommendation models at industry scale, primarily due to data fragmentation across domains and escalating infrastructure costs that hinder sustained quality improvements. To address this challenge, we propose Lattice, a recommendation framework centered aro… ▽ More The rapidly evolving landscape of products, surfaces, policies, and regulations poses significant challenges for deploying state-of-the-art recommendation models at industry scale, primarily due to data fragmentation across domains and escalating infrastructure costs that hinder sustained quality improvements. To address this challenge, we propose Lattice, a recommendation framework centered around model space redesign that extends Multi-Domain, Multi-Objective (MDMO) learning beyond models and learning objectives. Lattice addresses these challenges through a comprehensive model space redesign that combines cross-domain knowledge sharing, data consolidation, model unification, distillation, and system optimizations to achieve significant improvements in both quality and cost-efficiency. Our deployment of Lattice at Meta has resulted in 10% revenue-driving top-line metrics gain, 11.5% user satisfaction improvement, 6% boost in conversion rate, with 20% capacity saving. △ Less

Submitted 14 December, 2025; v1 submitted 9 December, 2025; originally announced December 2025.

Comments: Accepted to KDD 2026

arXiv:2511.18171 [pdf, ps, other]

BPMN to PDDL: Translating Business Workflows for AI Planning

Authors: Jasper Nie, Christian Muise, Victoria Armstrong

Abstract: Business Process Model and Notation (BPMN) is a widely used standard for modelling business processes. While automated planning has been proposed as a method for simulating and reasoning about BPMN workflows, most implementations remain incomplete or limited in scope. This project builds upon prior theoretical work to develop a functional pipeline that translates BPMN 2.0 diagrams into PDDL repres… ▽ More Business Process Model and Notation (BPMN) is a widely used standard for modelling business processes. While automated planning has been proposed as a method for simulating and reasoning about BPMN workflows, most implementations remain incomplete or limited in scope. This project builds upon prior theoretical work to develop a functional pipeline that translates BPMN 2.0 diagrams into PDDL representations suitable for planning. The system supports core BPMN constructs, including tasks, events, sequence flows, and gateways, with initial support for parallel and inclusive gateway behaviour. Using a non-deterministic planner, we demonstrate how to generate and evaluate valid execution traces. Our implementation aims to bridge the gap between theory and practical tooling, providing a foundation for further exploration of translating business processes into well-defined plans. △ Less

Submitted 22 November, 2025; originally announced November 2025.

Comments: 8 pages, 3 figures. Code and generated PDDL outputs available at https://github.com/QuMuLab/bpmn-to-pddl-translation

ACM Class: I.2.8; D.2.11

arXiv:2511.17946 [pdf, ps, other]

Measuring the Impact of Lexical Training Data Coverage on Hallucination Detection in Large Language Models

Authors: Shuo Zhang, Fabrizio Gotti, Fengran Mo, Jian-Yun Nie

Abstract: Hallucination in large language models (LLMs) is a fundamental challenge, particularly in open-domain question answering. Prior work attempts to detect hallucination with model-internal signals such as token-level entropy or generation consistency, while the connection between pretraining data exposure and hallucination is underexplored. Existing studies show that LLMs underperform on long-tail kn… ▽ More Hallucination in large language models (LLMs) is a fundamental challenge, particularly in open-domain question answering. Prior work attempts to detect hallucination with model-internal signals such as token-level entropy or generation consistency, while the connection between pretraining data exposure and hallucination is underexplored. Existing studies show that LLMs underperform on long-tail knowledge, i.e., the accuracy of the generated answer drops for the ground-truth entities that are rare in pretraining. However, examining whether data coverage itself can serve as a detection signal is overlooked. We propose a complementary question: Does lexical training-data coverage of the question and/or generated answer provide additional signal for hallucination detection? To investigate this, we construct scalable suffix arrays over RedPajama's 1.3-trillion-token pretraining corpus to retrieve $n$-gram statistics for both prompts and model generations. We evaluate their effectiveness for hallucination detection across three QA benchmarks. Our observations show that while occurrence-based features are weak predictors when used alone, they yield modest gains when combined with log-probabilities, particularly on datasets with higher intrinsic model uncertainty. These findings suggest that lexical coverage features provide a complementary signal for hallucination detection. All code and suffix-array infrastructure are provided at https://github.com/WWWonderer/ostd. △ Less

Submitted 22 November, 2025; originally announced November 2025.

arXiv:2511.17196 [pdf, ps, other]

Real Noise Decoupling for Hyperspectral Image Denoising

Authors: Yingkai Zhang, Tao Zhang, Jing Nie, Ying Fu

Abstract: Hyperspectral image (HSI) denoising is a crucial step in enhancing the quality of HSIs. Noise modeling methods can fit noise distributions to generate synthetic HSIs to train denoising networks. However, the noise in captured HSIs is usually complex and difficult to model accurately, which significantly limits the effectiveness of these approaches. In this paper, we propose a multi-stage noise-dec… ▽ More Hyperspectral image (HSI) denoising is a crucial step in enhancing the quality of HSIs. Noise modeling methods can fit noise distributions to generate synthetic HSIs to train denoising networks. However, the noise in captured HSIs is usually complex and difficult to model accurately, which significantly limits the effectiveness of these approaches. In this paper, we propose a multi-stage noise-decoupling framework that decomposes complex noise into explicitly modeled and implicitly modeled components. This decoupling reduces the complexity of noise and enhances the learnability of HSI denoising methods when applied to real paired data. Specifically, for explicitly modeled noise, we utilize an existing noise model to generate paired data for pre-training a denoising network, equipping it with prior knowledge to handle the explicitly modeled noise effectively. For implicitly modeled noise, we introduce a high-frequency wavelet guided network. Leveraging the prior knowledge from the pre-trained module, this network adaptively extracts high-frequency features to target and remove the implicitly modeled noise from real paired HSIs. Furthermore, to effectively eliminate all noise components and mitigate error accumulation across stages, a multi-stage learning strategy, comprising separate pre-training and joint fine-tuning, is employed to optimize the entire framework. Extensive experiments on public and our captured datasets demonstrate that our proposed framework outperforms state-of-the-art methods, effectively handling complex real-world noise and significantly enhancing HSI quality. △ Less

Submitted 21 November, 2025; originally announced November 2025.

arXiv:2511.17044 [pdf, ps, other]

Parametric Retrieval-Augmented Generation using Latent Routing of LoRA Adapters

Authors: Zhan Su, Fengran Mo, Jinghan Zhang, Yuchen Hui, Jiaao Sun, Jian-yun Nie

Abstract: Parametric Retrieval-Augmented Generation (PRAG) is a RAG approach that integrates external knowledge directly into model parameters using a LoRA adapter, aiming at reducing the inference cost compared to traditional RAG. However, current PRAG approaches adopt a \textit{one-to-one} document encoding scheme, using a dedicated LoRA adapter for each individual document. This scheme introduces two maj… ▽ More Parametric Retrieval-Augmented Generation (PRAG) is a RAG approach that integrates external knowledge directly into model parameters using a LoRA adapter, aiming at reducing the inference cost compared to traditional RAG. However, current PRAG approaches adopt a \textit{one-to-one} document encoding scheme, using a dedicated LoRA adapter for each individual document. This scheme introduces two major limitations: 1) As the number of documents increases, there will be a prohibitive cost for training and storage. 2) The LoRA adapters may largely overlap due to the shared knowledge across documents, making the approach highly inefficient. To overcome these challenges, we propose the Poly-PRAG approach, which uses a small set of LoRA adapters that are able to encode more general knowledge. Each document can be encoded using a combination of them through a latent routing function. By jointly training the LoRA adapters and the latent routing function, each LoRA adapter is able to encode a shared part of the knowledge across documents, and the routing function can select the best combination of adapters for a document. Experimental results on four benchmarks demonstrate the effectiveness of the Poly-PRAG compared to other strong PRAG baselines. In addition, this approach reduces the storage requirement by avoiding the need to store a large number of LoRA adapters and offers a more efficient way to encode external knowledge into LLMs. △ Less

Submitted 23 January, 2026; v1 submitted 21 November, 2025; originally announced November 2025.

arXiv:2511.15580 [pdf, ps, other]

CompTrack: Information Bottleneck-Guided Low-Rank Dynamic Token Compression for Point Cloud Tracking

Authors: Sifan Zhou, Yichao Cao, Jiahao Nie, Yuqian Fu, Ziyu Zhao, Xiaobo Lu, Shuo Wang

Abstract: 3D single object tracking (SOT) in LiDAR point clouds is a critical task in computer vision and autonomous driving. Despite great success having been achieved, the inherent sparsity of point clouds introduces a dual-redundancy challenge that limits existing trackers: (1) vast spatial redundancy from background noise impairs accuracy, and (2) informational redundancy within the foreground hinders e… ▽ More 3D single object tracking (SOT) in LiDAR point clouds is a critical task in computer vision and autonomous driving. Despite great success having been achieved, the inherent sparsity of point clouds introduces a dual-redundancy challenge that limits existing trackers: (1) vast spatial redundancy from background noise impairs accuracy, and (2) informational redundancy within the foreground hinders efficiency. To tackle these issues, we propose CompTrack, a novel end-to-end framework that systematically eliminates both forms of redundancy in point clouds. First, CompTrack incorporates a Spatial Foreground Predictor (SFP) module to filter out irrelevant background noise based on information entropy, addressing spatial redundancy. Subsequently, its core is an Information Bottleneck-guided Dynamic Token Compression (IB-DTC) module that eliminates the informational redundancy within the foreground. Theoretically grounded in low-rank approximation, this module leverages an online SVD analysis to adaptively compress the redundant foreground into a compact and highly informative set of proxy tokens. Extensive experiments on KITTI, nuScenes and Waymo datasets demonstrate that CompTrack achieves top-performing tracking performance with superior efficiency, running at a real-time 90 FPS on a single RTX 3090 GPU. △ Less

Submitted 22 November, 2025; v1 submitted 19 November, 2025; originally announced November 2025.

Comments: Accepted by AAAI 2026 (Oral)

arXiv:2511.07803 [pdf, ps, other]

Judging by the Rules: Compliance-Aligned Framework for Modern Slavery Statement Monitoring

Authors: Wenhao Xu, Akshatha Arodi, Jian-Yun Nie, Arsene Fansi Tchango

Abstract: Modern slavery affects millions of people worldwide, and regulatory frameworks such as Modern Slavery Acts now require companies to publish detailed disclosures. However, these statements are often vague and inconsistent, making manual review time-consuming and difficult to scale. While NLP offers a promising path forward, high-stakes compliance tasks require more than accurate classification: the… ▽ More Modern slavery affects millions of people worldwide, and regulatory frameworks such as Modern Slavery Acts now require companies to publish detailed disclosures. However, these statements are often vague and inconsistent, making manual review time-consuming and difficult to scale. While NLP offers a promising path forward, high-stakes compliance tasks require more than accurate classification: they demand transparent, rule-aligned outputs that legal experts can verify. Existing applications of large language models (LLMs) often reduce complex regulatory assessments to binary decisions, lacking the necessary structure for robust legal scrutiny. We argue that compliance verification is fundamentally a rule-matching problem: it requires evaluating whether textual statements adhere to well-defined regulatory rules. To this end, we propose a novel framework that harnesses AI for rule-level compliance verification while preserving expert oversight. At its core is the Compliance Alignment Judge (CA-Judge), which evaluates model-generated justifications based on their fidelity to statutory requirements. Using this feedback, we train the Compliance Alignment LLM (CALLM), a model that produces rule-consistent, human-verifiable outputs. CALLM improves predictive performance and generates outputs that are both transparent and legally grounded, offering a more verifiable and actionable solution for real-world compliance analysis. △ Less

Submitted 10 November, 2025; originally announced November 2025.

Comments: To appear at AAAI-26 (Social Impact Track)

arXiv:2511.01293 [pdf, ps, other]

Detecting Generated Images by Fitting Natural Image Distributions

Authors: Yonggang Zhang, Jun Nie, Xinmei Tian, Mingming Gong, Kun Zhang, Bo Han

Abstract: The increasing realism of generated images has raised significant concerns about their potential misuse, necessitating robust detection methods. Current approaches mainly rely on training binary classifiers, which depend heavily on the quantity and quality of available generated images. In this work, we propose a novel framework that exploits geometric differences between the data manifolds of nat… ▽ More The increasing realism of generated images has raised significant concerns about their potential misuse, necessitating robust detection methods. Current approaches mainly rely on training binary classifiers, which depend heavily on the quantity and quality of available generated images. In this work, we propose a novel framework that exploits geometric differences between the data manifolds of natural and generated images. To exploit this difference, we employ a pair of functions engineered to yield consistent outputs for natural images but divergent outputs for generated ones, leveraging the property that their gradients reside in mutually orthogonal subspaces. This design enables a simple yet effective detection method: an image is identified as generated if a transformation along its data manifold induces a significant change in the loss value of a self-supervised model pre-trained on natural images. Further more, to address diminishing manifold disparities in advanced generative models, we leverage normalizing flows to amplify detectable differences by extruding generated images away from the natural image manifold. Extensive experiments demonstrate the efficacy of this method. Code is available at https://github.com/tmlr-group/ConV. △ Less

Submitted 3 November, 2025; originally announced November 2025.

Comments: 25 pages, 9 figures, NeurIPS 2025 spotlight

arXiv:2510.11695 [pdf, ps, other]

When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents

Authors: Lingfei Qian, Xueqing Peng, Yan Wang, Vincent Jim Zhang, Huan He, Hanley Smith, Yi Han, Yueru He, Haohang Li, Yupeng Cao, Yangyang Yu, Alejandro Lopez-Lira, Peng Lu, Jian-Yun Nie, Guojun Xiong, Jimin Huang, Sophia Ananiadou

Abstract: Although Large Language Model (LLM)-based agents are increasingly used in financial trading, it remains unclear whether they can reason and adapt in live markets, as most studies test models instead of agents, cover limited periods and assets, and rely on unverified data. To address these gaps, we introduce Agent Market Arena (AMA), the first lifelong, real-time benchmark for evaluating LLM-based… ▽ More Although Large Language Model (LLM)-based agents are increasingly used in financial trading, it remains unclear whether they can reason and adapt in live markets, as most studies test models instead of agents, cover limited periods and assets, and rely on unverified data. To address these gaps, we introduce Agent Market Arena (AMA), the first lifelong, real-time benchmark for evaluating LLM-based trading agents across multiple markets. AMA integrates verified trading data, expert-checked news, and diverse agent architectures within a unified trading framework, enabling fair and continuous comparison under real conditions. It implements four agents, including InvestorAgent as a single-agent baseline, TradeAgent and HedgeFundAgent with different risk styles, and DeepFundAgent with memory-based reasoning, and evaluates them across GPT-4o, GPT-4.1, Claude-3.5-haiku, Claude-sonnet-4, and Gemini-2.0-flash. Live experiments on both cryptocurrency and stock markets demonstrate that agent frameworks display markedly distinct behavioral patterns, spanning from aggressive risk-taking to conservative decision-making, whereas model backbones contribute less to outcome variation. AMA thus establishes a foundation for rigorous, reproducible, and continuously evolving evaluation of financial reasoning and trading intelligence in LLM-based agents. △ Less

Submitted 29 October, 2025; v1 submitted 13 October, 2025; originally announced October 2025.

arXiv:2510.08886 [pdf, ps, other]

FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs

Authors: Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao, Fengran Mo, Xueqing Peng, Lingfei Qian, Jimin Huang, Guojun Xiong, Yankai Chen, Víctor Gutiérrez-Basulto, Xiao-Yang Liu, Xue Liu, Jian-Yun Nie

Abstract: Going beyond simple text processing, financial auditing requires detecting semantic, structural, and numerical inconsistencies across large-scale disclosures. As financial reports are filed in XBRL, a structured XML format governed by accounting standards, auditing becomes a structured information extraction and reasoning problem involving concept alignment, taxonomy-defined relations, and cross-d… ▽ More Going beyond simple text processing, financial auditing requires detecting semantic, structural, and numerical inconsistencies across large-scale disclosures. As financial reports are filed in XBRL, a structured XML format governed by accounting standards, auditing becomes a structured information extraction and reasoning problem involving concept alignment, taxonomy-defined relations, and cross-document consistency. Although large language models (LLMs) show promise on isolated financial tasks, their capability in professional-grade auditing remains unclear. We introduce FinAuditing, a taxonomy-aligned, structure-aware benchmark built from real XBRL filings. It contains 1,102 annotated instances averaging over 33k tokens and defines three tasks: Financial Semantic Matching (FinSM), Financial Relationship Extraction (FinRE), and Financial Mathematical Reasoning (FinMR). Evaluations of 13 state-of-the-art LLMs reveal substantial gaps in concept retrieval, taxonomy-aware relation modeling, and consistent cross-document reasoning. These findings highlight the need for realistic, structure-aware benchmarks. We release the evaluation code at https://github.com/The-FinAI/FinAuditing and the dataset at https://huggingface.co/collections/TheFinAI/finauditing. The task currently serves as the official benchmark of an ongoing public evaluation contest at https://open-finance-lab.github.io/SecureFinAI_Contest_2026/. △ Less

Submitted 18 February, 2026; v1 submitted 9 October, 2025; originally announced October 2025.

arXiv:2510.08825 [pdf, ps, other]

Search-on-Graph: Iterative Informed Navigation for Large Language Model Reasoning on Knowledge Graphs

Authors: Jia Ao Sun, Hao Yu, Fabrizio Gotti, Fengran Mo, Yihong Wu, Yuchen Hui, Jian-Yun Nie

Abstract: Large language models (LLMs) have demonstrated impressive reasoning abilities yet remain unreliable on knowledge-intensive, multi-hop questions -- they miss long-tail facts, hallucinate when uncertain, and their internal knowledge lags behind real-world change. Knowledge graphs (KGs) offer a structured source of relational evidence, but existing KGQA methods face fundamental trade-offs: compiling… ▽ More Large language models (LLMs) have demonstrated impressive reasoning abilities yet remain unreliable on knowledge-intensive, multi-hop questions -- they miss long-tail facts, hallucinate when uncertain, and their internal knowledge lags behind real-world change. Knowledge graphs (KGs) offer a structured source of relational evidence, but existing KGQA methods face fundamental trade-offs: compiling complete SPARQL queries without knowing available relations proves brittle, retrieving large subgraphs introduces noise, and complex agent frameworks with parallel exploration exponentially expand search spaces. To address these limitations, we propose Search-on-Graph (SoG), a simple yet effective framework that enables LLMs to perform iterative informed graph navigation using a single, carefully designed \textsc{Search} function. Rather than pre-planning paths or retrieving large subgraphs, SoG follows an ``observe-then-navigate'' principle: at each step, the LLM examines actual available relations from the current entity before deciding on the next hop. This approach further adapts seamlessly to different KG schemas and handles high-degree nodes through adaptive filtering. Across six KGQA benchmarks spanning Freebase and Wikidata, SoG achieves state-of-the-art performance without fine-tuning. We demonstrate particularly strong gains on Wikidata benchmarks (+16\% improvement over previous best methods) alongside consistent improvements on Freebase benchmarks. △ Less

Submitted 9 October, 2025; originally announced October 2025.

arXiv:2510.00977 [pdf, ps, other]

It Takes Two: Your GRPO Is Secretly DPO

Authors: Yihong Wu, Liheng Ma, Lei Ding, Muzhi Li, Xinyu Wang, Kejia Chen, Zhan Su, Zhanguang Zhang, Chenyang Huang, Yingxue Zhang, Mark Coates, Jian-Yun Nie

Abstract: Group Relative Policy Optimization (GRPO) has emerged as a prominent reinforcement learning algorithm for post-training Large Language Models. Different from critic-based methods such as PPO, GRPO estimates the advantage function using group-level statistics to reduce the variance of policy gradient estimators. While the prevailing view attributes GRPO's effectiveness to large group sizes for accu… ▽ More Group Relative Policy Optimization (GRPO) has emerged as a prominent reinforcement learning algorithm for post-training Large Language Models. Different from critic-based methods such as PPO, GRPO estimates the advantage function using group-level statistics to reduce the variance of policy gradient estimators. While the prevailing view attributes GRPO's effectiveness to large group sizes for accurate advantage estimation, we propose a different perspective. We demonstrate that the efficacy of GRPO stems from its implicit contrastive objective in the optimization, which helps reduce variance via the control variate method. This perspective establishes a fundamental connection between GRPO and DPO, wherein group size influences only the Monte Carlo estimators of the contrastive objective. To validate this, we investigate the minimal two-rollout case (2-GRPO), a configuration permissible under the contrastive framework but typically considered insufficient for reward normalization. We provide a rigorous theoretical analysis of 2-GRPO and empirically validate its effectiveness: 2-GRPO retains 98.1% of the performance of 16-GRPO, while requiring only 12.5% of the rollouts and 21% of the training time. This study offers a new perspective for future algorithm design in LLM post-training. △ Less

Submitted 29 January, 2026; v1 submitted 1 October, 2025; originally announced October 2025.

arXiv:2509.24214 [pdf, ps, other]

Scalable Audio-Visual Masked Autoencoders for Efficient Affective Video Facial Analysis

Authors: Xuecheng Wu, Junxiao Xue, Xinyi Yin, Yunyun Shi, Liangyu Fu, Danlei Huang, Yifan Wang, Jia Zhang, Jiayu Nie, Jun Wang

Abstract: Affective video facial analysis (AVFA) has emerged as a key research field for building emotion-aware intelligent systems, yet this field continues to suffer from limited data availability. In recent years, the self-supervised learning (SSL) technique of Masked Autoencoders (MAE) has gained momentum, with growing adaptations in its audio-visual contexts. While scaling has proven essential for brea… ▽ More Affective video facial analysis (AVFA) has emerged as a key research field for building emotion-aware intelligent systems, yet this field continues to suffer from limited data availability. In recent years, the self-supervised learning (SSL) technique of Masked Autoencoders (MAE) has gained momentum, with growing adaptations in its audio-visual contexts. While scaling has proven essential for breakthroughs in general multi-modal learning domains, its specific impact on AVFA remains largely unexplored. Another core challenge in this field is capturing both intra- and inter-modal correlations through scalable audio-visual representations. To tackle these issues, we propose AVF-MAE++, a family of audio-visual MAE models designed to efficiently investigate the scaling properties in AVFA while enhancing cross-modal correlation modeling. Our framework introduces a novel dual masking strategy across audio and visual modalities and strengthens modality encoders with a more holistic design to better support scalable pre-training. Additionally, we present the Iterative Audio-Visual Correlation Learning Module, which improves correlation learning within the SSL paradigm, bridging the limitations of previous methods. To support smooth adaptation and reduce overfitting risks, we further introduce a progressive semantic injection strategy, organizing the model training into three structured stages. Extensive experiments conducted on 17 datasets, covering three major AVFA tasks, demonstrate that AVF-MAE++ achieves consistent state-of-the-art performance across multiple benchmarks. Comprehensive ablation studies further highlight the importance of each proposed component and provide deeper insights into the design choices driving these improvements. Our code and models have been publicly released at Github. △ Less

Submitted 28 September, 2025; originally announced September 2025.

arXiv:2509.22951 [pdf, ps, other]

Tiny-QMoE

Authors: Jack Cashman, Jiaqi Nie

Abstract: The QMoE model provides a practical approach for compression of massive Mixture-of-Experts (MoE) models. QMoE offers a solution geared towards memory limitations that often reach terabyte scales, and it has the advantage of working with high sparsity models which implicitly lend themselves to compression techniques. QMoE also has the advantage of only taking MoE models into account and does not ev… ▽ More The QMoE model provides a practical approach for compression of massive Mixture-of-Experts (MoE) models. QMoE offers a solution geared towards memory limitations that often reach terabyte scales, and it has the advantage of working with high sparsity models which implicitly lend themselves to compression techniques. QMoE also has the advantage of only taking MoE models into account and does not evaluate its use with non mixture of expert systems. Although this prior attempt focuses on the limitations of large servers with the latest NVIDIA hardware which in the case of the H100 and V100 which have 80 GB of HBM (High Bandwidth Memory), what is not being considered is a significantly more constrained environment, such as in the case of mobile devices which may have in the case of the iPhone anywhere from 4 to 8 GB of unified memory which also needs to be shared with the operating system and additional processes. Although edge devices such as phones and laptops are becoming increasingly more computationally powerful, they are still not close to the level of advanced server machines such as NVIDIA. An additional constraint that we must consider is that of latency. The communication time of sending a request to an LLM server and then getting it back is an additional waiting time that can be removed. We may also want to use LLM technology in environments where there is no reliable network connection. △ Less

Submitted 26 September, 2025; originally announced September 2025.

arXiv:2509.18137 [pdf, ps, other]

LoRALib: A Standardized Benchmark for Evaluating LoRA-MoE Methods

Authors: Shaoheng Wang, Yao Lu, Yuqi Li, Yaxin Gao, Jiaqi Nie, Shanqing Yu, Yingli Tian, Qi Xuan

Abstract: As a parameter efficient fine-tuning (PEFT) method, low-rank adaptation (LoRA) can save significant costs in storage and computing, but its strong adaptability to a single task is often accompanied by insufficient cross-task generalization capabilities. To improve this, existing work combines LoRA with mixture-of-experts (MoE) to enhance the model's adaptability through expert modules and routing… ▽ More As a parameter efficient fine-tuning (PEFT) method, low-rank adaptation (LoRA) can save significant costs in storage and computing, but its strong adaptability to a single task is often accompanied by insufficient cross-task generalization capabilities. To improve this, existing work combines LoRA with mixture-of-experts (MoE) to enhance the model's adaptability through expert modules and routing mechanisms. However, existing LoRA-MoE methods lack unified standards in models, datasets, hyperparameters, and evaluation methods, making it difficult to conduct fair comparisons between different methods. To this end, we proposed a unified benchmark named LoRALib. Specifically, we standardized datasets from $40$ downstream tasks into a unified format, fine-tuned them using the same hyperparameters and obtained $680$ LoRA modules across $17$ model architectures. Based on this LoRA library, we conduct large-scale experiments on $3$ representative LoRA-MoE methods and different LoRA selection mechanisms using the open-sourced testing tool OpenCompass. Extensive experiments show that LoRAMoE performs best, and that prioritizing LoRAs relevant to the target task can further improve the performance of MoE. We hope these findings will inspire future work. Our datasets and LoRA library are available at https://huggingface.co/datasets/YaoLuzjut/LoRAOcean_dataset and https://huggingface.co/YaoLuzjut/models. △ Less

Submitted 14 September, 2025; originally announced September 2025.

arXiv:2509.15473 [pdf, ps, other]

doi 10.1145/3737901.3768369

Breathing and Semantic Pause Detection and Exertion-Level Classification in Post-Exercise Speech

Authors: Yuyu Wang, Wuyue Xia, Huaxiu Yao, Jingping Nie

Abstract: Post-exercise speech contains rich physiological and linguistic cues, often marked by semantic pauses, breathing pauses, and combined breathing-semantic pauses. Detecting these events enables assessment of recovery rate, lung function, and exertion-related abnormalities. However, existing works on identifying and distinguishing different types of pauses in this context are limited. In this work, b… ▽ More Post-exercise speech contains rich physiological and linguistic cues, often marked by semantic pauses, breathing pauses, and combined breathing-semantic pauses. Detecting these events enables assessment of recovery rate, lung function, and exertion-related abnormalities. However, existing works on identifying and distinguishing different types of pauses in this context are limited. In this work, building on a recently released dataset with synchronized audio and respiration signals, we provide systematic annotations of pause types. Using these annotations, we systematically conduct exploratory breathing and semantic pause detection and exertion-level classification across deep learning models (GRU, 1D CNN-LSTM, AlexNet, VGG16), acoustic features (MFCC, MFB), and layer-stratified Wav2Vec2 representations. We evaluate three setups-single feature, feature fusion, and a two-stage detection-classification cascade-under both classification and regression formulations. Results show per-type detection accuracy up to 89$\%$ for semantic, 55$\%$ for breathing, 86$\%$ for combined pauses, and 73$\%$overall, while exertion-level classification achieves 90.5$\%$ accuracy, outperformin prior work. △ Less

Submitted 18 September, 2025; originally announced September 2025.

Comments: 6 pages, 3rd ACM International Workshop on Intelligent Acoustic Systems and Applications (IASA 25)

arXiv:2509.13723 [pdf, ps, other]

DSPC: Dual-Stage Progressive Compression Framework for Efficient Long-Context Reasoning

Authors: Yaxin Gao, Yao Lu, Zongfei Zhang, Jiaqi Nie, Shanqing Yu, Qi Xuan

Abstract: Large language models (LLMs) have achieved remarkable success in many natural language processing (NLP) tasks. To achieve more accurate output, the prompts used to drive LLMs have become increasingly longer, which incurs higher computational costs. To address this prompt inflation problem, prompt compression has been proposed. However, most existing methods require training a small auxiliary model… ▽ More Large language models (LLMs) have achieved remarkable success in many natural language processing (NLP) tasks. To achieve more accurate output, the prompts used to drive LLMs have become increasingly longer, which incurs higher computational costs. To address this prompt inflation problem, prompt compression has been proposed. However, most existing methods require training a small auxiliary model for compression, incurring a significant amount of additional computation. To avoid this, we propose a two-stage, training-free approach, called Dual-Stage Progressive Compression (DSPC). In the coarse-grained stage, semantic-related sentence filtering removes sentences with low semantic value based on TF-IDF. In the fine-grained stage, token importance is assessed using attention contribution, cross-model loss difference, and positional importance, enabling the pruning of low-utility tokens while preserving semantics. We validate DSPC on LLaMA-3.1-8B-Instruct and GPT-3.5-Turbo under a constrained token budget and observe consistent improvements. For instance, in the FewShot task of the Longbench dataset, DSPC achieves a performance of 49.17 by using only 3x fewer tokens, outperforming the best state-of-the-art baseline LongLLMLingua by 7.76. △ Less

Submitted 18 September, 2025; v1 submitted 17 September, 2025; originally announced September 2025.

arXiv:2509.10070 [pdf, ps, other]

Toward Minimum Graphic Parity Networks

Authors: Yixin Cao, Yiren Lu, Junhong Nie, Xiaoming Sun, Guojing Tian

Abstract: Quantum circuits composed of CNOT and $R_z$ are fundamental building blocks of many quantum algorithms, so optimizing the synthesis of such quantum circuits is crucial. We address this problem from a theoretical perspective by studying the graphic parity network synthesis problem. A graphic parity network for a graph $G$ is a quantum circuit composed solely of CNOT gates where each edge of $G$ is… ▽ More Quantum circuits composed of CNOT and $R_z$ are fundamental building blocks of many quantum algorithms, so optimizing the synthesis of such quantum circuits is crucial. We address this problem from a theoretical perspective by studying the graphic parity network synthesis problem. A graphic parity network for a graph $G$ is a quantum circuit composed solely of CNOT gates where each edge of $G$ is represented in the circuit, and the final state of the wires matches the original input. We aim to synthesize graphic parity networks with the minimum number of gates, specifically for quantum algorithms addressing combinatorial optimization problems with Ising formulations. We demonstrate that a graphic parity network for a connected graph with $n$ vertices and $m$ edges requires at least $m+n-1$ gates. This lower bound can be improved to $m+Ω(m) = m+Ω(n^{1.5})$ when the shortest cycle in the graph has a length of at least five. We complement this result with a simple randomized algorithm that synthesizes a graphic parity network with expected $m + O(n^{1.5}\sqrt{\log n})$ gates. Additionally, we begin exploring connected graphs that allow for graphic parity networks with exactly $m+n-1$ gates. We conjecture that all such graphs belong to a newly defined graph class. Furthermore, we present a linear-time algorithm for synthesizing minimum graphic parity networks for graphs within this class. However, this graph class is not closed under taking induced subgraphs, and we show that recognizing it is $\textsf{NP}$-complete, which is complemented with a fixed-parameter tractable algorithm parameterized by the treewidth. △ Less

Submitted 12 September, 2025; originally announced September 2025.

arXiv:2509.09505 [pdf, ps, other]

Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference

Authors: Haoran Wu, Can Xiao, Jiayi Nie, Xuan Guo, Binglei Lou, Jeffrey T. H. Wong, Zhiwen Mo, Cheng Zhang, Przemyslaw Forys, Wayne Luk, Hongxiang Fan, Jianyi Cheng, Timothy M. Jones, Rika Antonova, Robert Mullins, Aaron Zhao

Abstract: LLMs now form the backbone of AI agents for a diverse array of applications, including tool use, command-line agents, and web or computer use agents. These agentic LLM inference tasks are fundamentally different from chatbot-focused inference -- they often have much larger context lengths to capture complex, prolonged inputs, such as entire webpage DOMs or complicated tool call trajectories. This,… ▽ More LLMs now form the backbone of AI agents for a diverse array of applications, including tool use, command-line agents, and web or computer use agents. These agentic LLM inference tasks are fundamentally different from chatbot-focused inference -- they often have much larger context lengths to capture complex, prolonged inputs, such as entire webpage DOMs or complicated tool call trajectories. This, in turn, generates significant off-chip memory traffic for the underlying hardware at the inference stage and causes the workload to be constrained by two memory walls, namely the bandwidth and capacity memory walls, preventing the on-chip compute units from achieving high utilization. In this paper, we introduce PLENA, a hardware-software co-designed system that applies three core optimization pathways to tackle these challenges. PLENA includes an efficient hardware implementation of compute and memory units supporting an asymmetric quantization scheme. PLENA also features a novel flattened systolic array architecture that has native support for FlashAttention to tackle these memory walls in the scenario of inference serving for long-context LLMs. Additionally, PLENA is developed with a complete stack, including a custom ISA, a compiler, a cycle-emulated simulator, and an automated design space exploration flow. The simulated results show that PLENA achieves up to 8.5x higher utilization than existing accelerators, and delivers 2.24x higher throughput than the A100 GPU and 3.85x higher throughput than the TPU v6e, under the same multiplier count and memory settings. The full PLENA system will also be open-sourced. △ Less

Submitted 24 September, 2025; v1 submitted 11 September, 2025; originally announced September 2025.

arXiv:2508.12271 [pdf, ps, other]

SNNSIR: A Simple Spiking Neural Network for Stereo Image Restoration

Authors: Ronghua Xu, Jin Xie, Jing Nie, Jiale Cao, Yanwei Pang

Abstract: Spiking Neural Networks (SNNs), characterized by discrete binary activations, offer high computational efficiency and low energy consumption, making them well-suited for computation-intensive tasks such as stereo image restoration. In this work, we propose SNNSIR, a simple yet effective Spiking Neural Network for Stereo Image Restoration, specifically designed under the spike-driven paradigm where… ▽ More Spiking Neural Networks (SNNs), characterized by discrete binary activations, offer high computational efficiency and low energy consumption, making them well-suited for computation-intensive tasks such as stereo image restoration. In this work, we propose SNNSIR, a simple yet effective Spiking Neural Network for Stereo Image Restoration, specifically designed under the spike-driven paradigm where neurons transmit information through sparse, event-based binary spikes. In contrast to existing hybrid SNN-ANN models that still rely on operations such as floating-point matrix division or exponentiation, which are incompatible with the binary and event-driven nature of SNNs, our proposed SNNSIR adopts a fully spike-driven architecture to achieve low-power and hardware-friendly computation. To address the expressiveness limitations of binary spiking neurons, we first introduce a lightweight Spike Residual Basic Block (SRBB) to enhance information flow via spike-compatible residual learning. Building on this, the Spike Stereo Convolutional Modulation (SSCM) module introduces simplified nonlinearity through element-wise multiplication and highlights noise-sensitive regions via cross-view-aware modulation. Complementing this, the Spike Stereo Cross-Attention (SSCA) module further improves stereo correspondence by enabling efficient bidirectional feature interaction across views within a spike-compatible framework. Extensive experiments on diverse stereo image restoration tasks, including rain streak removal, raindrop removal, low-light enhancement, and super-resolution demonstrate that our model achieves competitive restoration performance while significantly reducing computational overhead. These results highlight the potential for real-time, low-power stereo vision applications. The code will be available after the article is accepted. △ Less

Submitted 17 August, 2025; originally announced August 2025.

Comments: 11 pages

arXiv:2508.10955 [pdf, ps, other]

Empowering Multimodal LLMs with External Tools: A Comprehensive Survey

Authors: Wenbin An, Jiahao Nie, Yaqiang Wu, Feng Tian, Shijian Lu, Qinghua Zheng

Abstract: By integrating the perception capabilities of multimodal encoders with the generative power of Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), exemplified by GPT-4V, have achieved great success in various multimodal tasks, pointing toward a promising pathway to artificial general intelligence. Despite this progress, the limited quality of multimodal data, poor performance o… ▽ More By integrating the perception capabilities of multimodal encoders with the generative power of Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), exemplified by GPT-4V, have achieved great success in various multimodal tasks, pointing toward a promising pathway to artificial general intelligence. Despite this progress, the limited quality of multimodal data, poor performance on many complex downstream tasks, and inadequate evaluation protocols continue to hinder the reliability and broader applicability of MLLMs across diverse domains. Inspired by the human ability to leverage external tools for enhanced reasoning and problem-solving, augmenting MLLMs with external tools (e.g., APIs, expert models, and knowledge bases) offers a promising strategy to overcome these challenges. In this paper, we present a comprehensive survey on leveraging external tools to enhance MLLM performance. Our discussion is structured along four key dimensions about external tools: (1) how they can facilitate the acquisition and annotation of high-quality multimodal data; (2) how they can assist in improving MLLM performance on challenging downstream tasks; (3) how they enable comprehensive and accurate evaluation of MLLMs; (4) the current limitations and future directions of tool-augmented MLLMs. Through this survey, we aim to underscore the transformative potential of external tools in advancing MLLM capabilities, offering a forward-looking perspective on their development and applications. The project page of this paper is publicly available athttps://github.com/Lackel/Awesome-Tools-for-MLLMs. △ Less

Submitted 14 August, 2025; originally announced August 2025.

Comments: 21 pages, 361 references

arXiv:2508.08634 [pdf, ps, other]

Adaptive Personalized Conversational Information Retrieval

Authors: Fengran Mo, Yuchen Hui, Yuxing Tian, Zhaoxuan Tan, Chuan Meng, Zhan Su, Kaiyu Huang, Jian-Yun Nie

Abstract: Personalized conversational information retrieval (CIR) systems aim to satisfy users' complex information needs through multi-turn interactions by considering user profiles. However, not all search queries require personalization. The challenge lies in appropriately incorporating personalization elements into search when needed. Most existing studies implicitly incorporate users' personal informat… ▽ More Personalized conversational information retrieval (CIR) systems aim to satisfy users' complex information needs through multi-turn interactions by considering user profiles. However, not all search queries require personalization. The challenge lies in appropriately incorporating personalization elements into search when needed. Most existing studies implicitly incorporate users' personal information and conversational context using large language models without distinguishing the specific requirements for each query turn. Such a ``one-size-fits-all'' personalization strategy might lead to sub-optimal results. In this paper, we propose an adaptive personalization method, in which we first identify the required personalization level for a query and integrate personalized queries with other query reformulations to produce various enhanced queries. Then, we design a personalization-aware ranking fusion approach to assign fusion weights dynamically to different reformulated queries, depending on the required personalization level. The proposed adaptive personalized conversational information retrieval framework APCIR is evaluated on two TREC iKAT datasets. The results confirm the effectiveness of adaptive personalization of APCIR by outperforming state-of-the-art methods. △ Less

Submitted 12 August, 2025; originally announced August 2025.

Comments: Accepted by CIKM 2025

arXiv:2508.06902 [pdf, ps, other]

eMotions: A Large-Scale Dataset and Audio-Visual Fusion Network for Emotion Analysis in Short-form Videos

Authors: Xuecheng Wu, Dingkang Yang, Danlei Huang, Xinyi Yin, Yifan Wang, Jia Zhang, Jiayu Nie, Liangyu Fu, Yang Liu, Junxiao Xue, Hadi Amirpour, Wei Zhou

Abstract: Short-form videos (SVs) have become a vital part of our online routine for acquiring and sharing information. Their multimodal complexity poses new challenges for video analysis, highlighting the need for video emotion analysis (VEA) within the community. Given the limited availability of SVs emotion data, we introduce eMotions, a large-scale dataset consisting of 27,996 videos with full-scale ann… ▽ More Short-form videos (SVs) have become a vital part of our online routine for acquiring and sharing information. Their multimodal complexity poses new challenges for video analysis, highlighting the need for video emotion analysis (VEA) within the community. Given the limited availability of SVs emotion data, we introduce eMotions, a large-scale dataset consisting of 27,996 videos with full-scale annotations. To ensure quality and reduce subjective bias, we emphasize better personnel allocation and propose a multi-stage annotation procedure. Additionally, we provide the category-balanced and test-oriented variants through targeted sampling to meet diverse needs. While there have been significant studies on videos with clear emotional cues (e.g., facial expressions), analyzing emotions in SVs remains a challenging task. The challenge arises from the broader content diversity, which introduces more distinct semantic gaps and complicates the representations learning of emotion-related features. Furthermore, the prevalence of audio-visual co-expressions in SVs leads to the local biases and collective information gaps caused by the inconsistencies in emotional expressions. To tackle this, we propose AV-CANet, an end-to-end audio-visual fusion network that leverages video transformer to capture semantically relevant representations. We further introduce the Local-Global Fusion Module designed to progressively capture the correlations of audio-visual features. Besides, EP-CE Loss is constructed to globally steer optimizations with tripolar penalties. Extensive experiments across three eMotions-related datasets and four public VEA datasets demonstrate the effectiveness of our proposed AV-CANet, while providing broad insights for future research. Moreover, we conduct ablation studies to examine the critical components of our method. Dataset and code will be made available at Github. △ Less

Submitted 9 August, 2025; originally announced August 2025.

arXiv:2508.04001 [pdf, ps, other]

ConvMix: A Mixed-Criteria Data Augmentation Framework for Conversational Dense Retrieval

Authors: Fengran Mo, Jinghan Zhang, Yuchen Hui, Jia Ao Sun, Zhichao Xu, Zhan Su, Jian-Yun Nie

Abstract: Conversational search aims to satisfy users' complex information needs via multiple-turn interactions. The key challenge lies in revealing real users' search intent from the context-dependent queries. Previous studies achieve conversational search by fine-tuning a conversational dense retriever with relevance judgments between pairs of context-dependent queries and documents. However, this trainin… ▽ More Conversational search aims to satisfy users' complex information needs via multiple-turn interactions. The key challenge lies in revealing real users' search intent from the context-dependent queries. Previous studies achieve conversational search by fine-tuning a conversational dense retriever with relevance judgments between pairs of context-dependent queries and documents. However, this training paradigm encounters data scarcity issues. To this end, we propose ConvMix, a mixed-criteria framework to augment conversational dense retrieval, which covers more aspects than existing data augmentation frameworks. We design a two-sided relevance judgment augmentation schema in a scalable manner via the aid of large language models. Besides, we integrate the framework with quality control mechanisms to obtain semantically diverse samples and near-distribution supervisions to combine various annotated data. Experimental results on five widely used benchmarks show that the conversational dense retriever trained by our ConvMix framework outperforms previous baseline methods, which demonstrates our superior effectiveness. △ Less

Submitted 12 November, 2025; v1 submitted 5 August, 2025; originally announced August 2025.

Comments: Accepted by AAAI 2026

arXiv:2508.03999 [pdf, ps, other]

Tensorized Clustered LoRA Merging for Multi-Task Interference

Authors: Zhan Su, Fengran Mo, Guojun Liang, Jinghan Zhang, Bingbing Wen, Prayag Tiwari, Jian-Yun Nie

Abstract: Despite the success of the monolithic dense paradigm of large language models (LLMs), the LoRA adapters offer an efficient solution by fine-tuning small task-specific modules and merging them with the base model. However, in multi-task settings, merging LoRA adapters trained on heterogeneous sources frequently causes \textit{task interference}, degrading downstream performance. To address this, we… ▽ More Despite the success of the monolithic dense paradigm of large language models (LLMs), the LoRA adapters offer an efficient solution by fine-tuning small task-specific modules and merging them with the base model. However, in multi-task settings, merging LoRA adapters trained on heterogeneous sources frequently causes \textit{task interference}, degrading downstream performance. To address this, we propose a tensorized clustered LoRA (TC-LoRA) library targeting to address the task interference at the \textit{text-level} and \textit{parameter-level}. At the \textit{text-level}, we cluster the training samples in the embedding space to capture input-format similarities, then train a specialized LoRA adapter for each cluster. At the \textit{parameter-level}, we introduce a joint Canonical Polyadic (CP) decomposition that disentangles task-specific and shared factors across LoRA adapters. This joint factorization preserves essential knowledge while reducing cross-task interference. Extensive experiments on out-of-domain zero-shot and skill-composition tasks-including reasoning, question answering, and coding. Compared to strong SVD-based baselines, TC-LoRA achieves +1.4\% accuracy on Phi-3 and +2.3\% on Mistral-7B (+2.3\%), demonstrating the effectiveness of TC-LoRA in LLM adaptation. △ Less

Submitted 5 August, 2025; originally announced August 2025.

Showing 1–50 of 245 results for author: Nie, J