-
SpanKey: Dynamic Key Space Conditioning for Neural Network Access Control
Authors:
WenBin Yan
Abstract:
SpanKey is a lightweight way to gate inference without encrypting weights or chasing leaderboard accuracy. The idea is to condition activations on secret keys. A basis matrix $B$ defines a low-dimensional key subspace $Span(B)$; during training we sample coefficients $α$ and form keys $k=α^\top B$, then inject them into intermediate activations with additive or multiplicative maps and strength $γ$. Valid keys lie in $Span(B)$; invalid keys are sampled outside that subspace. We make three points. (i) Mechanism: subspace key injection and a multi-layer design space. (ii) Failure mode: key absorption, together with two analytical results (a Beta-energy split and margin-tail diagnostics), explains weak baseline separation in energy and margin terms -- these are not a security theorem. (iii) Deny losses and experiments: Modes A--C and extensions, with CIFAR-10 ResNet-18 runs and MNIST ablations for Mode B. We summarize setup and first-order analysis, injectors, absorption, deny losses and ablations, a threat discussion that does not promise cryptography, and closing remarks on scale. Code: \texttt{https://github.com/mindmemory-ai/dksc}
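For readers who want the mechanism concretely, a minimal sketch of subspace key injection as described above follows. It is illustrative only: the dimensions, the sampling of invalid keys by projecting out the $Span(B)$ component, and the injection function are our assumptions, not the released code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_act, r = 64, 4                      # activation width and key-subspace rank (illustrative)
B = rng.standard_normal((r, d_act))   # secret basis; its row space is Span(B)

def sample_valid_key():
    """Valid key: k = alpha^T B lies in Span(B)."""
    alpha = rng.standard_normal(r)
    return alpha @ B

def sample_invalid_key():
    """Invalid key: a random vector with its Span(B) component removed, so it lies outside the subspace."""
    v = rng.standard_normal(d_act)
    proj = B.T @ np.linalg.solve(B @ B.T, B @ v)   # orthogonal projection onto Span(B)
    return v - proj

def inject(h, k, gamma=0.1, mode="additive"):
    """Condition an intermediate activation h on a key k with strength gamma."""
    if mode == "additive":
        return h + gamma * k
    return h * (1.0 + gamma * k)                   # multiplicative map

h = rng.standard_normal(d_act)
h_authorized = inject(h, sample_valid_key())       # inference conditioned on a valid key
h_denied = inject(h, sample_invalid_key())         # inference conditioned on an out-of-subspace key
```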
Submitted 14 April, 2026;
originally announced April 2026.
-
NTIRE 2026 The Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results
Authors:
Xin Li,
Yeying Jin,
Suhang Yao,
Beibei Lin,
Zhaoxin Fan,
Wending Yan,
Xin Jin,
Zongwei Wu,
Bingchen Li,
Peishu Shi,
Yufei Yang,
Yu Li,
Zhibo Chen,
Bihan Wen,
Robby T. Tan,
Radu Timofte,
Runzhe Li,
Kui Jiang,
Zhaocheng Yu,
Yiang Chen,
Junjun Jiang,
Xianming Liu,
Hongde Gu,
Zeliang Li,
Mache You
, et al. (73 additional authors not shown)
Abstract:
This paper presents an overview of the NTIRE 2026 Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images. Building upon the success of the first edition, this challenge attracted a wide range of impressive solutions, all developed and evaluated on our real-world Raindrop Clarity dataset~\cite{jin2024raindrop}. For this edition, we adjust the dataset to 14,139 images for training, 407 images for validation, and 593 images for testing. The primary goal of this challenge is to establish a strong and practical benchmark for the removal of raindrops under various illumination and focus conditions. In total, 168 teams registered for the competition, and 17 teams submitted valid final solutions and fact sheets for the testing phase. The submitted methods achieved strong performance on the Raindrop Clarity dataset, demonstrating the growing progress in this challenging task.
Submitted 12 April, 2026;
originally announced April 2026.
-
COREY: A Prototype Study of Entropy-Guided Operator Fusion with Hadamard Reparameterization for Selective State Space Models
Authors:
Bo Ma,
Jinsong Wu,
Hongjiang Wei,
Weiqi Yan
Abstract:
State Space Models (SSMs), represented by the Mamba family, provide linear-time sequence modeling and are attractive for long-context inference. Yet practical deployments remain memory-bandwidth limited because selective state updates are often decomposed into fragmented kernels with repeated intermediate tensor materialization. We present COREY, a prototype framework that combines memory-aware operator fusion with Hadamard-based feature reparameterization. Activation entropy, estimated with fixed-width histograms, is used as a runtime scheduling statistic to place fusion boundaries and choose tile sizes. To regularize heavy-tailed activations, we absorb normalized Hadamard transforms into linear projections, preserving functional equivalence while reducing peak-coordinate concentration. In a controlled prototype study over heavy-tailed SSM activations, COREY consistently reduces proxy latency, improves throughput, and lowers DRAM traffic relative to unfused and fixed-depth baselines. Low-bit results are reported only through a hand-crafted stability proxy and are intended as diagnostic evidence rather than checkpoint-level quality claims. Code repository: https://github.com/mabo1215/COREY_Transformer.git.
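The Hadamard reparameterization step can be illustrated in a few lines: because a normalized Hadamard matrix is orthonormal, rotating the activation and folding the inverse rotation into the projection weights leaves the layer output unchanged while spreading energy away from peak coordinates. The sketch below (numpy, Sylvester construction, dimensions chosen for illustration) is our reading of that equivalence, not the COREY kernels.

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

d = 64
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))        # original linear projection
x = rng.standard_normal(d)             # stands in for a heavy-tailed activation

H = hadamard(d) / np.sqrt(d)           # normalized (orthonormal) Hadamard transform
W_fused = W @ H.T                      # absorb the inverse rotation into the weights offline

x_rot = H @ x                          # rotated activation: peak-coordinate energy is spread out
y_ref = W @ x
y_fused = W_fused @ x_rot              # W H^T H x = W x, so the layer output is unchanged

assert np.allclose(y_ref, y_fused, atol=1e-6)
```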
Submitted 12 April, 2026;
originally announced April 2026.
-
RIRF: Reasoning Image Restoration Framework
Authors:
Wending Yan,
Rongkai Zhang,
Kaihua Tang,
Yu Cheng,
Qiankun Liu
Abstract:
Universal image restoration (UIR) aims to recover clean images from diverse and unknown degradations using a unified model. Existing UIR methods primarily focus on pixel reconstruction and often lack explicit diagnostic reasoning over degradation composition, severity, and scene semantics prior to restoration. We propose Reason and Restore (R\&R), a novel framework that integrates structured Chain-of-Thought (CoT) reasoning into the image restoration pipeline. R\&R introduces an explicit reasoner, implemented by fine-tuning Qwen3-VL, to diagnose degradation types, quantify degradation severity, infer key degradation-related factors, and describe relevant scene and object semantics. The resulting structured reasoning provides interpretable and fine-grained diagnostic priors for the restorer. To further improve restoration quality, the quantified degradation severity produced by the reasoner is leveraged as reinforcement learning (RL) signals to guide and strengthen the restorer. Unlike existing multimodal LLM-based agentic systems that decouple reasoning from low-level vision tasks, R\&R tightly couples semantic diagnostic reasoning with pixel-level restoration in a unified framework. Extensive experiments across diverse UIR benchmarks demonstrate that R\&R achieves state-of-the-art performance while offering unique interpretability into the restoration process.
Submitted 10 April, 2026;
originally announced April 2026.
-
Rhythm-consistent semi-Markov simulation of tourist mobility rhythms with probabilistic event-to-POI assignment: Hakone, Japan
Authors:
Jianhao Shi,
Tomio Miwa,
Wanglin Yan
Abstract:
Understanding the timing and sequencing of activity participation in tourist mobility is central to travel behavior research, yet GPS trajectories are noisy, irregularly sampled, and only weakly linked to activity locations, which limits interpretation and scenario analysis. We address this by mapping each stay event to candidate points of interest (POIs) probabilistically, using explicit prior-likelihood weighting that yields a normalized compatibility distribution rather than hard matching. Using one month of high-density tourist trajectories in Hakone, Japan (November 2021), we construct semantic stay-event sequences based on observed place-category labels (MID10) and describe mobility rhythms through hour-by-category profiles, category transitions, and expected dwell patterns. Building on these rhythm signatures, we develop a rhythm-consistent semi-Markov simulator that generates synthetic stay-event sequences with time-conditioned transitions and category-dependent dwell behavior. In the observed data, hour-by-category summaries are computed by probability-weighted aggregation over soft labels; in simulation, each event is generated with a discrete category and a sampled dwell duration, enabling like-for-like comparison after aggregation. We further conduct counterfactual POI-inventory scenarios to quantify how hypothetical POI configuration changes shift stay intensity across time, categories, and space, particularly around hubs and main corridors. Observed-simulated comparisons show close agreement in temporal profiles and category distributions, indicating that probabilistic labeling and rhythm-consistent simulation preserve key mobility structure while providing an interpretable basis for transport-geography scenario evaluation.
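As a concrete picture of the simulator's structure, the toy sketch below generates stay-event sequences with hour-conditioned category transitions and category-dependent dwell durations. All category names, transition weights, and dwell parameters are invented for illustration; they are not the fitted Hakone rhythm signatures.

```python
import numpy as np

rng = np.random.default_rng(0)
categories = ["hotel", "restaurant", "attraction", "transit"]   # illustrative stand-ins for MID10 labels

def transition_probs(current, hour):
    """Hour-conditioned transition distribution over categories (toy weights, not fitted values)."""
    base = np.ones(len(categories))
    if 11 <= hour <= 14:
        base[categories.index("restaurant")] += 2.0              # lunchtime pull
    base[categories.index(current)] *= 0.2                       # discourage self-transitions
    return base / base.sum()

def sample_dwell_minutes(category):
    """Category-dependent dwell duration (semi-Markov holding time), here a log-normal toy model."""
    mean_minutes = {"hotel": 480, "restaurant": 60, "attraction": 90, "transit": 15}[category]
    return float(rng.lognormal(mean=np.log(mean_minutes), sigma=0.4))

def simulate_day(start="hotel", start_hour=8.0, end_hour=22.0):
    events, cat, t = [], start, start_hour
    while t < end_hour:
        dwell = sample_dwell_minutes(cat) / 60.0
        events.append((cat, round(t, 2), round(min(t + dwell, end_hour), 2)))
        t += dwell
        cat = rng.choice(categories, p=transition_probs(cat, int(t) % 24))
    return events

print(simulate_day())   # list of (category, start_hour, end_hour) stay events
```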
Submitted 8 April, 2026;
originally announced April 2026.
-
BodhiPromptShield: Pre-Inference Prompt Mediation for Suppressing Privacy Propagation in LLM/VLM Agents
Authors:
Bo Ma,
Jinsong Wu,
Weiqi Yan
Abstract:
In LLM/VLM agents, prompt privacy risk propagates beyond a single model call because raw user content can flow into retrieval queries, memory writes, tool calls, and logs. Existing de-identification pipelines address document boundaries but not this cross-stage propagation. We propose BodhiPromptShield, a policy-aware framework that detects sensitive spans, routes them via typed placeholders, semantic abstraction, or secure symbolic mapping, and delays restoration to authorized boundaries. Relative to enterprise redaction, this adds explicit propagation-aware mediation and restoration timing as a security variable. Under controlled evaluation on the Controlled Prompt-Privacy Benchmark (CPPB), stage-wise propagation is suppressed from 10.7\% to 7.1\% across retrieval, memory, and tool stages; PER reaches 9.3\% with 0.94 AC and 0.92 TSR, outperforming generic de-identification. These are controlled systems results on CPPB rather than formal privacy guarantees or public-benchmark transfer claims.
The project repository is available at https://github.com/mabo1215/BodhiPromptShield.git.
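A minimal sketch of the placeholder-based mediation idea is shown below: sensitive spans are swapped for typed placeholders before the prompt reaches retrieval, memory, tools, or logs, and raw values are restored only at an authorized boundary. The regex detectors and boundary flag are stand-ins we invented; the framework's actual detection, semantic abstraction, and symbolic mapping are richer than this.

```python
import re

# Typed placeholders for detected sensitive spans; the patterns are illustrative, not the paper's detectors.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def mediate(prompt):
    """Replace sensitive spans with typed placeholders before the prompt flows to downstream stages."""
    mapping, mediated = {}, prompt
    for kind, pattern in PATTERNS.items():
        for i, span in enumerate(pattern.findall(mediated)):
            token = f"<{kind}_{i}>"
            mapping[token] = span
            mediated = mediated.replace(span, token)
    return mediated, mapping

def restore(text, mapping, boundary_authorized):
    """Delay restoration: re-insert raw values only at an authorized boundary (e.g., the final user-facing reply)."""
    if not boundary_authorized:
        return text
    for token, span in mapping.items():
        text = text.replace(token, span)
    return text

mediated, mapping = mediate("Email alice@example.com about the meeting, call +1 555 123 4567 if late.")
print(mediated)                                    # placeholders flow to retrieval, memory writes, tool calls, logs
print(restore(mediated, mapping, boundary_authorized=True))
```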
Submitted 7 April, 2026;
originally announced April 2026.
-
ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces
Authors:
Xiangyi Li,
Kyoung Whan Choe,
Yimin Liu,
Xiaokun Chen,
Chujun Tao,
Bingran You,
Wenbo Chen,
Zonglin Di,
Jiankai Sun,
Shenghan Zheng,
Jiajun Bao,
Yuanli Wang,
Weixiang Yan,
Yiyuan Li,
Han-chung Lee
Abstract:
Large language model (LLM) agents are increasingly deployed to automate productivity tasks (e.g., email, scheduling, document management), but evaluating them on live services is risky due to potentially irreversible changes. Existing benchmarks rely on simplified environments and fail to capture realistic, stateful, multi-service workflows. We introduce ClawsBench, a benchmark for evaluating and improving LLM agents in realistic productivity settings. It includes five high-fidelity mock services (Gmail, Slack, Google Calendar, Google Docs, Google Drive) with full state management and deterministic snapshot/restore, along with 44 structured tasks covering single-service, cross-service, and safety-critical scenarios. We decompose agent scaffolding into two independent levers (domain skills that inject API knowledge via progressive disclosure, and a meta prompt that coordinates behavior across services) and vary both to measure their separate and combined effects. Experiments across 6 models, 4 agent harnesses, and 33 conditions show that with full scaffolding, agents achieve task success rates of 39-64% but exhibit unsafe action rates of 7-33%. On OpenClaw, the top five models fall within a 10 percentage-point band on task success (53-63%), with unsafe action rates from 7% to 23% and no consistent ordering between the two metrics. We identify eight recurring patterns of unsafe behavior, including multi-step sandbox escalation and silent contract modification. We release the trajectories and future dataset at https://clawsbench.com.
Submitted 8 April, 2026; v1 submitted 6 April, 2026;
originally announced April 2026.
-
Maximizing T2-Only Prostate Cancer Localization from Expected Diffusion Weighted Imaging
Authors:
Weixi Yi,
Yipei Wang,
Wen Yan,
Hanyuan Zhang,
Natasha Thorley,
Alexander Ng,
Shonit Punwani,
Fernando Bianco,
Mark Emberton,
Veeru Kasivisvanathan,
Dean C. Barratt,
Shaheer U. Saeed,
Yipeng Hu
Abstract:
Multiparametric MRI is increasingly recommended as a first-line noninvasive approach to detect and localize prostate cancer, requiring at minimum diffusion-weighted (DWI) and T2-weighted (T2w) MR sequences. Early machine learning attempts using only T2w images have shown promising diagnostic performance in segmenting radiologist-annotated lesions. Such uni-modal T2-only approaches deliver substantial clinical benefits by reducing costs and expertise required to acquire other sequences. This work investigates an arguably more challenging application using only T2w at inference, but to localize individual cancers based on independent histopathology labels. We formulate DWI images as a latent modality (readily available during training) to classify cancer presence at local Barzell zones, given only T2w images as input. In the resulting expectation-maximization algorithm, a latent modality generator (implemented using a flow matching-based generative model) approximates the latent DWI image posterior distribution in the E-steps, while in M-steps a cancer localizer is simultaneously optimized with the generative model to maximize the expected likelihood of cancer presence. The proposed approach provides a novel theoretical framework for learning from a privileged DWI modality, yielding superior cancer localization performance compared to approaches that lack training DWI images or existing frameworks for privileged learning and incomplete modalities. The proposed T2-only methods perform competitively or better than baseline methods using multiple input sequences (e.g., improving the patient-level F1 score by 14.4\% and zone-level QWK by 5.3\% over the T2w+DWI baseline). We present quantitative evaluations using internal and external datasets from 4,133 prostate cancer patients with histopathology-verified labels.
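In our paraphrase of the stated EM structure (notation ours, not the authors'): with T2w input $x_{\mathrm{T2w}}$, latent DWI $z$, zone-level cancer labels $y$, generator parameters $\phi$ and localizer parameters $\theta$, each iteration alternates roughly as follows.

```latex
% E-step: approximate the latent DWI posterior with the flow matching-based generator
q_\phi(z \mid x_{\mathrm{T2w}}) \;\approx\; p(z \mid x_{\mathrm{T2w}})
% M-step: jointly update generator and localizer to maximize the expected likelihood of cancer presence
\theta, \phi \;\leftarrow\; \arg\max_{\theta,\phi}\;
  \mathbb{E}_{z \sim q_\phi(z \mid x_{\mathrm{T2w}})}\big[\log p_\theta(y \mid x_{\mathrm{T2w}}, z)\big]
```

At inference only the T2w image is needed; DWI influences the localizer solely through this training-time expectation.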
Submitted 1 April, 2026;
originally announced April 2026.
-
CT-to-X-ray Distillation Under Tiny Paired Cohorts: An Evidence-Bounded Reproducible Pilot Study
Authors:
Bo Ma,
Jinsong Wu,
Weiqi Yan,
Hongjiang Wei
Abstract:
Chest X-ray and computed tomography (CT) provide complementary views of thoracic disease, yet most computer-aided diagnosis models are trained and deployed within a single imaging modality. The concrete question studied here is narrower and deployment-oriented: on a patient-level paired chest cohort, can CT act as training-only supervision for a binary disease versus non-disease X-ray classifier without requiring CT at inference time? We study this setting as a cross-modality teacher--student distillation problem and use JDCNet as an executable pilot scaffold rather than as a validated superior architecture. On the original patient-level paired split from a public paired chest imaging cohort, a stripped-down plain cross-modal logit-KD control attains the highest mean result on the four-image validation subset (0.875 accuracy and 0.714 macro-F1), whereas the full module-augmented JDCNet variant remains at 0.750 accuracy and 0.429 macro-F1. To test whether that ranking is a split artifact, we additionally run eight patient-level Monte Carlo resamples with same-case comparisons, stronger mechanism controls based on attention transfer and feature hints, and imbalance-sensitive analyses. Under this resampled protocol, late fusion attains the highest mean accuracy (0.885), same-modality distillation attains the highest mean macro-F1 (0.554) and balanced accuracy (0.660), the plain cross-modal control drops to 0.500 mean balanced accuracy, and neither attention transfer nor feature hints recover a robust cross-modality advantage. The contribution of this study is therefore not a validated CT-to-X-ray architecture, but a reproducible and evidence-bounded pilot protocol that makes the exact task definition, failure modes, ranking instability, and the minimum requirements for future credible CT-to-X-ray transfer claims explicit.
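For context, below is a minimal sketch of standard cross-modal logit distillation, the recipe the "plain cross-modal logit-KD control" is described as a stripped-down instance of: a temperature-scaled KL term toward the CT-trained teacher's logits plus the usual cross-entropy on the X-ray student. The temperature, loss weighting, and toy tensor shapes are our assumptions, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def logit_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Plain logit distillation: CE on ground truth plus temperature-scaled KL toward the teacher."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1.0 - alpha) * kd

# Toy shapes: a batch of 4 X-ray student predictions distilled from paired CT teacher logits, binary task.
student_logits = torch.randn(4, 2, requires_grad=True)
teacher_logits = torch.randn(4, 2)      # produced by a CT-trained teacher at training time only
labels = torch.tensor([0, 1, 1, 0])
loss = logit_kd_loss(student_logits, teacher_logits, labels)
loss.backward()
```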
Submitted 30 March, 2026;
originally announced March 2026.
-
Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning
Authors:
Ziqi Miao,
Haonan Jia,
Lijun Li,
Chen Qian,
Yuan Xiong,
Wenting Yan,
Jing Shao
Abstract:
Reinforcement learning with verifiable rewards (RLVR) has substantially enhanced the reasoning capabilities of multimodal large language models (MLLMs). However, existing RLVR approaches typically rely on outcome-driven optimization that updates both perception and reasoning using a shared reward based solely on the final answer. This shared reward blurs credit assignment, frequently improving reasoning patterns while failing to reliably enhance the accuracy of upstream visual evidence extraction. To address this perception bottleneck, we introduce PRCO (Perception-Reasoning Coevolution), a dual-role RLVR framework with a shared policy. PRCO consists of two cooperative roles: an Observer that generates an evidence caption tailored to the question and a Solver that predicts the final answer based on this caption. Crucially, PRCO employs role-specific reward signals: the Solver is optimized using verifiable outcome rewards on the final answer, while the Observer receives a utility reward derived from the Solver's downstream success. Extensive experiments across eight challenging multimodal reasoning benchmarks demonstrate that PRCO yields consistent improvements across model scales by over 7 points on average accuracy compared to the base model, outperforming prior open-source RL-tuned baselines.
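The role-specific reward split can be made concrete with a small sketch: the Solver is scored by a verifiable outcome check, and the Observer's caption is scored by how often the Solver succeeds when conditioned on it. The rollout count, the exact-match check, and the toy solver are our illustrative assumptions.

```python
def solver_reward(predicted_answer, gold_answer):
    """Verifiable outcome reward for the Solver role."""
    return 1.0 if predicted_answer.strip() == gold_answer.strip() else 0.0

def observer_reward(caption, solve_fn, question, gold_answer, n_rollouts=4):
    """Utility reward for the Observer: how often the Solver succeeds given this evidence caption."""
    successes = sum(
        solver_reward(solve_fn(question, caption), gold_answer) for _ in range(n_rollouts)
    )
    return successes / n_rollouts

# Toy stand-in for the Solver; in PRCO both roles are played by a single shared policy.
def toy_solver(question, caption):
    return "blue" if "blue" in caption else "unknown"

r_obs = observer_reward("the mug on the left is blue", toy_solver, "What colour is the mug?", "blue")
r_sol = solver_reward(toy_solver("What colour is the mug?", "the mug on the left is blue"), "blue")
print(r_obs, r_sol)
```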
Submitted 9 April, 2026; v1 submitted 30 March, 2026;
originally announced March 2026.
-
From Passersby to Placemaking: Designing Autonomous Vehicle-Pedestrian Encounters for an Urban Shared Space
Authors:
Yiyuan Wang,
Martin Tomitsch,
Marius Hoggenmüller,
Senuri Wijenayake,
Wai Yan,
Luke Hespanhol
Abstract:
Autonomous vehicles (AVs) tend to disrupt the atmosphere and pedestrian experience in urban shared spaces, undermining the focus of these spaces on people and placemaking. We investigate how external human-machine interfaces (eHMIs) supporting AV-pedestrian interaction can be extended to consider the characteristics of an urban shared space. Inspired by urban HCI, we devised three place-based eHMI designs that (i) enhance a conventional intent eHMI and (ii) exhibit content and physical integration with the space. In an evaluation study, 25 participants experienced the eHMIs in an immersive simulation of the space via virtual reality and shared their impressions through think-aloud, interviews, and questionnaires. Results showed that the place-based eHMIs had a notable effect on influencing the perception of AV interaction, including aspects like visual aesthetics and sense of reassurance, and on fostering a sense of place, such as social interactivity and the intentionality to coexist. In measuring qualities of pedestrian experience, we found that perceived safety significantly correlated with user experience and affect, including the attractiveness of eHMIs and feelings of pleasantness. The paper opens the avenue for exploring how eHMIs may contribute to the placemaking goals of pedestrian-centric spaces and improve the experience of people encountering AVs within these environments.
Submitted 29 March, 2026;
originally announced March 2026.
-
Quadrature Oscillation System for Coordinated Motion in Crawling Origami Robot
Authors:
Sean Liu,
Ankur Mehta,
Wenzhong Yan
Abstract:
Origami-inspired robots offer rapid, accessible design and manufacture with diverse functionalities. In particular, origami robots without conventional electronics have the unique advantage of functioning in extreme environments such as ones with high radiation or large magnetic fields. However, the absence of sophisticated control systems limits these robots to simple autonomous behaviors. In our previous studies, we developed a printable, electronics-free, and self-sustained oscillator that generates simple complementary square-wave signals. This study presents a quadrature oscillation system capable of generating four square-wave signals a quarter-cycle out of phase, enabling four distinct states. Such control signals are important in various engineering and robotics applications, such as orchestrating limb movements in bio-inspired robots. We demonstrate the practicality and value of this oscillation system by designing and constructing an origami crawling robot that utilizes the quadrature oscillator to achieve coordinated locomotion. Together, the oscillator and robot illustrate the potential for more complex control and functions in origami robotics, paving the way for more electronics-free, rapid-design origami robots with advanced autonomous behaviors.
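The control signals themselves are easy to picture in software, even though the paper realizes them mechanically without electronics: four 50%-duty square waves, each delayed by a quarter cycle, jointly cycle through four distinct states per period. A short numpy illustration of the waveform only (not of the origami oscillator):

```python
import numpy as np

def quadrature_square_waves(t, period=1.0):
    """Four square-wave signals, each delayed by a quarter cycle relative to the previous one."""
    phases = [0.0, 0.25, 0.5, 0.75]
    return np.array([((t / period - p) % 1.0) < 0.5 for p in phases]).astype(int)

t = np.linspace(0.0, 1.0, 8, endpoint=False)
# Each column is one time sample; the four-bit pattern steps through four distinct coordination states per period.
print(quadrature_square_waves(t))
```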
Submitted 24 March, 2026;
originally announced March 2026.
-
DynaRAG: Bridging Static and Dynamic Knowledge in Retrieval-Augmented Generation
Authors:
Penghao Liang,
Mengwei Yuan,
Jianan Liu,
Jing Yang,
Xianyou Li,
Weiran Yan,
Yichao Wu
Abstract:
We present DynaRAG, a retrieval-augmented generation (RAG) framework designed to handle both static and time-sensitive information needs through dynamic knowledge integration. Unlike traditional RAG pipelines that rely solely on static corpora, DynaRAG selectively invokes external APIs when retrieved documents are insufficient for answering a query. The system employs an LLM-based reranker to assess document relevance, a sufficiency classifier to determine when fallback is necessary, and Gorilla v2 -- a state-of-the-art API calling model -- for accurate tool invocation. We further enhance robustness by incorporating schema filtering via FAISS to guide API selection. Evaluations on the CRAG benchmark demonstrate that DynaRAG significantly improves accuracy on dynamic questions, while also reducing hallucinations. Our results highlight the importance of dynamic-aware routing and selective tool use in building reliable, real-world question-answering systems.
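A minimal sketch of the routing logic described above is given below, with the retriever, LLM reranker, sufficiency classifier, API caller, and generator reduced to placeholder callables; the function and argument names are ours, not the DynaRAG code.

```python
def dynarag_answer(query, retriever, reranker, sufficiency_clf, api_caller, generator, k=5):
    """Route a query: answer from reranked documents when sufficient, else fall back to an external API call."""
    docs = retriever(query)
    ranked = sorted(docs, key=lambda d: reranker(query, d), reverse=True)[:k]
    if sufficiency_clf(query, ranked):                 # LLM-judged: do these documents answer the query?
        return generator(query, context=ranked)
    api_result = api_caller(query)                     # e.g., a Gorilla-v2-style tool-calling model
    return generator(query, context=ranked + [api_result])

# Toy usage with stub callables standing in for the real components.
answer = dynarag_answer(
    "Who won the match yesterday?",
    retriever=lambda q: ["stale article from last season"],
    reranker=lambda q, d: 0.1,
    sufficiency_clf=lambda q, docs: False,             # static corpus judged insufficient -> dynamic fallback
    api_caller=lambda q: "live sports API result",
    generator=lambda q, context: f"answer grounded in: {context}",
)
print(answer)
```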
Submitted 23 February, 2026;
originally announced March 2026.
-
On the Degrees of Freedom of Gridded Control Points in Learning-Based Medical Image Registration
Authors:
Wen Yan,
Qianye Yang,
Yipei Wang,
Shonit Punwani,
Mark Emberton,
Vasilis Stavrinides,
Yipeng Hu,
Dean Barratt
Abstract:
Many registration problems are ill-posed in homogeneous or noisy regions, and dense voxel-wise decoders can be unnecessarily high-dimensional. A sparse control-point parameterisation provides a compact, smooth deformation representation while reducing memory and improving stability. This work investigates the number of control points required for learning-based registration networks. We present GridReg, a learning-based registration framework that replaces dense voxel-wise decoding with displacement predictions at a sparse grid of control points. This design substantially cuts the parameter count and memory while retaining registration accuracy. Multiscale 3D encoder feature maps are flattened into a 1D token sequence with positional encoding to retain spatial context. The model then predicts a sparse gridded deformation field using a cross-attention module. We further introduce grid-adaptive training, enabling an adaptive model to operate at multiple grid sizes at inference without retraining. This work quantitatively demonstrates the benefits of using sparse grids. Using three datasets for registering the prostate gland, pelvic organs and neurological structures, the results show a significant improvement from the grid-controlled displacement field. Moreover, the proposed approach achieved superior registration performance, at similar or lower computational cost, compared with existing algorithms that predict dense displacement fields (DDFs) or displacements sampled at scattered key points.
Submitted 15 March, 2026;
originally announced March 2026.
-
Deep EM with Hierarchical Latent Label Modelling for Multi-Site Prostate Lesion Segmentation
Authors:
Wen Yan,
Yipei Wang,
Shiqi Huang,
Natasha Thorley,
Mark Emberton,
Vasilis Stavrinides,
Yipeng Hu,
Dean Barratt
Abstract:
Label variability is a major challenge for prostate lesion segmentation. In multi-site datasets, annotations often reflect centre-specific contouring protocols, causing segmentation networks to overfit to local styles and generalise poorly to unseen sites at inference. We treat each observed annotation as a noisy observation of an underlying latent 'clean' lesion mask, and propose a hierarchical expectation-maximisation (HierEM) framework that alternates between: (1) inferring a voxel-wise posterior distribution over the latent mask, and (2) training a CNN using this posterior as a soft target while estimating site-specific sensitivity and specificity under a hierarchical prior. This hierarchical prior decomposes label quality into a global mean with site- and case-level deviations, reducing site-specific bias by penalising the likelihood term contributed only by site deviations. Experiments on three cohorts demonstrate that the proposed hierarchical EM framework enhances cross-site generalisation compared to state-of-the-art methods. For pooled-dataset evaluation, the per-site mean DSC ranges from 29.50% to 39.69%; for leave-one-site-out generalisation, it ranges from 27.91% to 32.67%, yielding statistically significant improvements over comparison methods (p<0.039). The method also produces interpretable per-site latent label-quality estimates (sensitivity alpha ranges from 31.5% to 47.3% at specificity beta of approximately 0.99), supporting post-hoc analyses of cross-site annotation variability. These results indicate that explicitly modelling site-dependent annotation can improve cross-site generalisation.
Submitted 15 March, 2026;
originally announced March 2026.
-
Bodhi VLM: Privacy-Alignment Modeling for Hierarchical Visual Representations in Vision Backbones and VLM Encoders via Bottom-Up and Top-Down Feature Search
Authors:
Bo Ma,
Wei Qi Yan,
Jinsong Wu
Abstract:
Learning systems that preserve privacy often inject noise into hierarchical visual representations; a central challenge is to \emph{model} how such perturbations align with a declared privacy budget in a way that is interpretable and applicable across vision backbones and vision--language models (VLMs). We propose \emph{Bodhi VLM}, a \emph{privacy-alignment modeling} framework for \emph{hierarchical neural representations}: it (1) links sensitive concepts to layer-wise grouping via NCP and MDAV-based clustering; (2) locates sensitive feature regions using bottom-up (BUA) and top-down (TDA) strategies over multi-scale representations (e.g., feature pyramids or vision-encoder layers); and (3) uses an Expectation-Maximization Privacy Assessment (EMPA) module to produce an interpretable \emph{budget-alignment signal} by comparing the fitted sensitive-feature distribution to an evaluator-specified reference (e.g., Laplace or Gaussian with scale $c/ε$). The output is reference-relative and is \emph{not} a formal differential-privacy estimator. We formalize BUA/TDA over hierarchical feature structures and validate the framework on object detectors (YOLO, PPDPTS, DETR) and on the \emph{visual encoders} of VLMs (CLIP, LLaVA, BLIP). BUA and TDA yield comparable deviation trends; EMPA provides a stable alignment signal under the reported setups. We compare with generic discrepancy baselines (Chi-square, K-L, MMD) and with task-relevant baselines (MomentReg, NoiseMLE, Wass-1). Results are reported as mean$\pm$std over multiple seeds with confidence intervals in the supplementary materials. This work contributes a learnable, interpretable modeling perspective for privacy-aligned hierarchical representations rather than a post hoc audit only. Source code: \href{https://github.com/mabo1215/bodhi-vlm.git}{Bodhi-VLM GitHub repository}
Submitted 18 March, 2026; v1 submitted 13 March, 2026;
originally announced March 2026.
-
REAEDP: Entropy-Calibrated Differentially Private Data Release with Formal Guarantees and Attack-Based Evaluation
Authors:
Bo Ma,
Jinsong Wu,
Wei Qi Yan
Abstract:
Sensitive data release is vulnerable to output-side privacy threats such as membership inference, attribute inference, and record linkage. This creates a practical need for release mechanisms that provide formal privacy guarantees while preserving utility in measurable ways. We propose REAEDP, a differential privacy framework that combines entropy-calibrated histogram release, a synthetic-data release mechanism, and attack-based evaluation. On the theory side, we derive an explicit sensitivity bound for Shannon entropy, together with an extension to Rényi entropy, for adjacent histogram datasets, enabling calibrated differentially private release of histogram statistics. We further study a synthetic-data mechanism $\mathcal{F}$ with a privacy-test structure and show that it satisfies a formal differential privacy guarantee under the stated parameter conditions. On multiple public tabular datasets, the empirical entropy change remains below the theoretical bound in the tested regime, standard Laplace and Gaussian baselines exhibit comparable trends, and both membership-inference and linkage-style attack performance move toward random-guess behavior as the privacy parameter decreases. These results support REAEDP as a practically usable privacy-preserving release pipeline in the tested settings. Source code: https://github.com/mabo1215/REAEDP.git
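As background for the entropy-calibrated release, the sketch below shows the standard Laplace mechanism on histogram counts (assuming add/remove adjacency, i.e., L1 sensitivity 1 per record) and compares the Shannon entropy of the raw and released histograms. The paper's contributions, the explicit entropy sensitivity bound and the synthetic-data mechanism built on top, are not reproduced here; counts and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_histogram(counts, epsilon, sensitivity=1.0):
    """Laplace mechanism on histogram counts; add/remove adjacency gives L1 sensitivity 1 per record."""
    noisy = counts + rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=len(counts))
    return np.clip(noisy, 0.0, None)          # clamp negative counts for downstream use

def shannon_entropy(counts):
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

counts = np.array([120.0, 30.0, 45.0, 5.0])   # toy histogram over four bins
released = dp_histogram(counts, epsilon=1.0)
# Compare the empirical entropy shift against a theoretical bound such as the one derived in the paper.
print(shannon_entropy(counts), shannon_entropy(released))
```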
Submitted 13 March, 2026;
originally announced March 2026.
-
TSDCRF: Balancing Privacy and Multi-Object Tracking via Time-Series CRF and Normalized Control Penalty
Authors:
Bo Ma,
Jinsong Wu,
Weiqi Yan
Abstract:
Multi-object tracking in video often requires appearance or location cues that can reveal sensitive identity information, while adding privacy-preserving noise typically disrupts cross-frame association and causes ID switches or target loss. We propose TSDCRF, a plug-in refinement framework that balances privacy and tracking by combining three components: (i) $(\varepsilon,\delta)$-differential privacy via calibrated Gaussian noise on sensitive regions under a configurable privacy budget; (ii) a Normalized Control Penalty (NCP) that down-weights unstable or conflicting class predictions before noise injection to stabilize association; and (iii) a time-series dynamic conditional random field (DCRF) that enforces temporal consistency and corrects trajectory deviation after noise, mitigating ID switches and improving resilience to trajectory hijacking. The pipeline is agnostic to the choice of detector and tracker (e.g., YOLOv4 and DeepSORT). We evaluate on MOT16, MOT17, Cityscapes, and KITTI. Results show that TSDCRF achieves a better privacy--utility trade-off than white noise and prior methods (NTPD, PPDTSA): lower KL-divergence shift, lower tracking RMSE, and improved robustness under trajectory hijacking while preserving privacy. Source code: https://github.com/mabo1215/TSDCRF.git
Submitted 13 March, 2026;
originally announced March 2026.
-
IGASA: Integrated Geometry-Aware and Skip-Attention Modules for Enhanced Point Cloud Registration
Authors:
Dongxu Zhang,
Jihua Zhu,
Shiqi Li,
Wenbiao Yan,
Haoran Xu,
Peilin Fan,
Huimin Lu
Abstract:
Point cloud registration (PCR) is a fundamental task in 3D vision and provides essential support for applications such as autonomous driving, robotics, and environmental modeling. Despite its widespread use, existing methods often fail when facing real-world challenges like heavy noise, significant occlusions, and large-scale transformations. These limitations frequently result in compromised registration accuracy and insufficient robustness in complex environments. In this paper, we propose IGASA as a novel registration framework constructed upon a Hierarchical Pyramid Architecture (HPA) designed for robust multi-scale feature extraction and fusion. The framework integrates two pivotal components consisting of the Hierarchical Cross-Layer Attention (HCLA) module and the Iterative Geometry-Aware Refinement (IGAR) module. The HCLA module utilizes skip attention mechanisms to align multi-resolution features and enhance local geometric consistency. Simultaneously, the IGAR module is designed for the fine matching phase by leveraging reliable correspondences established during coarse matching. This synergistic integration within the architecture allows IGASA to adapt effectively to diverse point cloud structures and intricate transformations. We evaluate the performance of IGASA on four widely recognized benchmark datasets including 3D(Lo)Match, KITTI, and nuScenes. Our extensive experiments consistently demonstrate that IGASA significantly surpasses state-of-the-art methods and achieves notable improvements in registration accuracy. This work provides a robust foundation for advancing point cloud registration techniques while offering valuable insights for practical 3D vision applications. The code for IGASA is available in \href{https://github.com/DongXu-Zhang/IGASA}{https://github.com/DongXu-Zhang/IGASA}.
Submitted 13 March, 2026;
originally announced March 2026.
-
ProvAgent: Threat Detection Based on Identity-Behavior Binding and Multi-Agent Collaborative Attack Investigation
Authors:
Wenhao Yan,
Ning An,
Linxu Li,
Bingsheng Bi,
Bo Jiang,
Zhigang Lu,
Baoxu Liu,
Junrong Liu,
Cong Dong
Abstract:
Advanced Persistent Threats (APTs) pose critical challenges to modern cybersecurity due to their multi-stage and stealthy nature. While provenance-based detection approaches show promise in capturing causal attack semantics, current threat provenance practices face two paradoxical issues: (1) expert skepticism, where human analysts doubt the capability of traditional detection models to identify complex attacks; and (2) expert dependence, as analysts cannot manually process large-scale raw logs to detect threats without these models. Consequently, collaboration between humans and traditional models remains the prevailing paradigm. However, this renders investigation quality contingent upon human expertise and frequently results in alert fatigue. To address these challenges, we present ProvAgent, a framework that evolves the threat provenance paradigm from human-model collaboration to a novel collaboration between multi-agent systems and traditional models. ProvAgent leverages the speed and cost-efficiency of traditional models for initial anomaly screening over large-scale logs. By enforcing fine-grained identity-behavior consistency via graph contrastive learning, it profiles entities based on specific attributes to generate high-fidelity alerts. With these alerts serving as investigation entry points, ProvAgent achieves in-depth autonomous investigation through a hypothesis-verification multi-agent framework. Evaluations with real-world datasets demonstrate that ProvAgent outperforms six state-of-the-art (SOTA) baselines in anomaly detection. Through automated investigation, ProvAgent reconstructs near-complete attack processes at a minimum cost of \$0.06 per day.
Submitted 10 March, 2026;
originally announced March 2026.
-
TA-Mem: Tool-Augmented Autonomous Memory Retrieval for LLM in Long-Term Conversational QA
Authors:
Mengwei Yuan,
Jianan Liu,
Jing Yang,
Xianyou Li,
Weiran Yan,
Yichao Wu,
Penghao Liang
Abstract:
Large Language Models (LLMs) have exhibited strong reasoning ability in text-based contexts across various domains, yet the limited context window poses challenges for these models on long-range inference tasks and necessitates a memory storage system. While many current storage approaches have been proposed with episodic notes and graph representations of memory, retrieval methods still primarily rely on predefined workflows or static similarity top-k over embeddings. To address this inflexibility, we introduce a novel tool-augmented autonomous memory retrieval framework (TA-Mem), which contains: (1) a memory extraction LLM agent which is prompted to adaptively chunk an input into sub-contexts based on semantic correlation, and extract information into structured notes, (2) a multi-indexed memory database designed for different types of query methods including both key-based lookup and similarity-based retrieval, (3) a tool-augmented memory retrieval agent which explores the memory autonomously by selecting appropriate tools provided by the database based on the user input, and decides whether to proceed to the next iteration or finalize the response after reasoning on the fetched memories. TA-Mem is evaluated on the LoCoMo dataset, achieving significant performance improvements over existing baseline approaches. In addition, an analysis of tool use across different question types also demonstrates the adaptivity of the proposed method.
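A toy sketch of the multi-indexed memory database follows, exposing the two query methods mentioned above (key-based lookup and similarity-based retrieval) as tools an agent could call; the bag-of-words similarity and the class layout are our invented stand-ins for the paper's embeddings and schema.

```python
from collections import defaultdict

class MultiIndexMemory:
    """Memory notes indexed two ways: exact keyword lookup and similarity search (toy bag-of-words overlap)."""

    def __init__(self):
        self.notes = []
        self.keyword_index = defaultdict(list)

    def write(self, note):
        idx = len(self.notes)
        self.notes.append(note)
        for word in set(note.lower().split()):
            self.keyword_index[word].append(idx)

    def lookup(self, keyword):
        """Key-based lookup tool."""
        return [self.notes[i] for i in self.keyword_index.get(keyword.lower(), [])]

    def similar(self, query, top_k=2):
        """Similarity-based retrieval tool (word-overlap score as a toy embedding substitute)."""
        q = set(query.lower().split())
        scored = sorted(self.notes, key=lambda n: len(q & set(n.lower().split())), reverse=True)
        return scored[:top_k]

mem = MultiIndexMemory()
mem.write("Alice adopted a cat named Miso in March")
mem.write("Bob moved to Kyoto for a new job")
print(mem.lookup("kyoto"))
print(mem.similar("what pet does Alice have"))
```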
Submitted 10 March, 2026;
originally announced March 2026.
-
Feature Resemblance: Towards a Theoretical Understanding of Analogical Reasoning in Transformers
Authors:
Ruichen Xu,
Wenjing Yan,
Ying-Jun Angela Zhang
Abstract:
Understanding reasoning in large language models is complicated by evaluations that conflate multiple reasoning types. We isolate analogical reasoning (inferring shared properties between entities based on known similarities) and analyze its emergence in transformers. We theoretically prove three key results: (1) Joint training on similarity and attribution premises enables analogical reasoning through aligned representations; (2) Sequential training succeeds only when similarity structure is learned before specific attributes, revealing a necessary curriculum; (3) Two-hop reasoning ($a \to b, b \to c \implies a \to c$) reduces to analogical reasoning with identity bridges ($b = b$), which must appear explicitly in training data. These results reveal a unified mechanism: transformers encode entities with similar properties into similar representations, enabling property transfer through feature alignment. Experiments with architectures up to 1.5B parameters validate our theory and demonstrate how representational geometry shapes inductive reasoning capabilities.
Submitted 22 March, 2026; v1 submitted 5 March, 2026;
originally announced March 2026.
-
ProFound: A moderate-sized vision foundation model for multi-task prostate imaging
Authors:
Yipei Wang,
Yinsong Xu,
Weixi Yi,
Shaheer Ullah Saeed,
Natasha Thorley,
Alexander Ng,
Yukun Zhou,
Wen Yan,
Dean Barratt,
Shonit Punwani,
Veeru Kasivisvanathan,
Mark Emberton,
Daniel C. Alexander,
Yipeng Hu
Abstract:
Many diagnostic and therapeutic clinical tasks for prostate cancer increasingly rely on multi-parametric MRI. Automating these tasks is challenging because they necessitate expert interpretations, which are difficult to scale to capitalise on modern deep learning. Although modern automated systems achieve expert-level performance in isolated tasks, their general clinical utility remains limited by the requirement of large task-specific labelled datasets. In this paper, we present ProFound, a domain-specialised vision foundation model for volumetric prostate mpMRI. ProFound is pre-trained using several variants of self-supervised approaches on a diverse, multi-institutional collection of 5,000 patients, with a total of over 22,000 unique 3D MRI volumes (over 1,800,000 2D image slices). We conducted a systematic evaluation of ProFound across a broad spectrum of $11$ downstream clinical tasks on over 3,000 independent patients, including prostate cancer detection, Gleason grading, lesion localisation, gland volume estimation, zonal and surrounding structure segmentation. Experimental results demonstrate that finetuned ProFound consistently outperforms or remains competitive with state-of-the-art specialised models and existing medical vision foundation models trained/finetuned on the same data.
Submitted 4 March, 2026;
originally announced March 2026.
-
Proact-VL: A Proactive VideoLLM for Real-Time AI Companions
Authors:
Weicai Yan,
Yuhong Dai,
Qi Ran,
Haodong Li,
Wang Lin,
Hao Liao,
Xing Xie,
Tao Jin,
Jianxun Lian
Abstract:
Proactive and real-time interactive experiences are essential for human-like AI companions, yet face three key challenges: (1) achieving low-latency inference under continuous streaming inputs, (2) autonomously deciding when to respond, and (3) controlling both quality and quantity of generated content to meet real-time constraints. In this work, we instantiate AI companions through two gaming scenarios, commentator and guide, selected for their suitability for automatic evaluation. We introduce the Live Gaming Benchmark, a large-scale dataset with three representative scenarios: solo commentary, co-commentary, and user guidance, and present Proact-VL, a general framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction. Extensive experiments show Proact-VL achieves superior response latency and quality while maintaining strong video understanding capabilities, demonstrating its practicality for real-time interactive applications.
Submitted 22 March, 2026; v1 submitted 3 March, 2026;
originally announced March 2026.
-
PPEDCRF: Privacy-Preserving Enhanced Dynamic CRF for Location-Privacy Protection for Sequence Videos with Minimal Detection Degradation
Authors:
Bo Ma,
Jinsong Wu,
Weiqi Yan,
Catherine Shi,
Minh Nguyen
Abstract:
Dashcam videos collected by autonomous or assisted-driving systems are increasingly shared for safety auditing and model improvement. Even when explicit GPS metadata are removed, an attacker can still infer the recording location by matching background visual cues (e.g., buildings and road layouts) against large-scale street-view imagery. This paper studies location-privacy leakage under a background-based retrieval attacker, and proposes PPEDCRF, a privacy-preserving enhanced dynamic conditional random field framework that injects calibrated perturbations only into inferred location-sensitive background regions while preserving foreground detection utility. PPEDCRF consists of three components: (i) a dynamic CRF that enforces temporal consistency to discover and track location sensitive regions across frames, (ii) a normalized control penalty (NCP) that allocates perturbation strength according to a hierarchical sensitivity model, and (iii) a utility-preserving noise injection module that minimizes interference to object detection and segmentation. Experiments on public driving datasets demonstrate that PPEDCRF significantly reduces location-retrieval attack success (e.g., Top-k retrieval accuracy) while maintaining competitive detection performance (e.g., mAP and segmentation metrics) compared with common baselines such as global noise, white-noise masking, and feature-based anonymization. The source code is in https://github.com/mabo1215/PPEDCRF.git
Submitted 2 April, 2026; v1 submitted 2 March, 2026;
originally announced March 2026.
-
Flow Matching-enabled Test-Time Refinement for Unsupervised Cardiac MR Registration
Authors:
Yunguan Fu,
Wenjia Bai,
Wen Yan,
Matthew J Clarkson,
Rhodri Huw Davies,
Yipeng Hu
Abstract:
Diffusion-based unsupervised image registration has been explored for cardiac cine MR, but expensive multi-step inference limits practical use. We propose FlowReg, a flow-matching framework in displacement field space that achieves strong registration in as few as two steps and supports further refinement with more steps. FlowReg uses warmup-reflow training: a single-step network first acts as a teacher, then a student learns to refine from arbitrary intermediate states, removing the need for a pre-trained model as in existing methods. An Initial Guess strategy feeds back the model prediction as the next starting point, improving refinement from step two onward. On ACDC and MM2 across six tasks (including cross-dataset generalization), FlowReg outperforms the state of the art on five tasks (+0.6% mean Dice score on average), with the largest gain in the left ventricle (+1.09%), and reduces LVEF estimation error on all six tasks (-2.58 percentage points), using only 0.7% extra parameters and no segmentation labels. Code is available at https://github.com/mathpluscode/FlowReg.
Submitted 3 March, 2026; v1 submitted 1 March, 2026;
originally announced March 2026.
-
KVSlimmer: Theoretical Insights and Practical Optimizations for Asymmetric KV Merging
Authors:
Lianjun Liu,
Hongli An,
Weiqi Yan,
Xin Du,
Shengchuan Zhang,
Huazhong Liu,
Yunshan Zhong
Abstract:
The growing computational and memory demands of the Key-Value (KV) cache significantly limit the ability of Large Language Models (LLMs) to handle long contexts. While KV merging has emerged as a promising solution, existing methods that rely on empirical observations of KV asymmetry and gradient-based Hessian approximations lack a theoretical foundation and incur suboptimal compression and inference overhead. To bridge these gaps, we establish a theoretical framework that characterizes this asymmetry through the spectral energy distribution of projection weights, demonstrating that concentrated spectra in Query/Key weights induce feature homogeneity, whereas dispersed spectra in Value weights preserve heterogeneity. Then, we introduce KVSlimmer, an efficient algorithm that captures exact Hessian information through a mathematically exact formulation, and derives a closed-form solution utilizing only forward-pass variables, resulting in a gradient-free approach that is both memory- and time-efficient. Extensive experiments across various models and benchmarks demonstrate that KVSlimmer consistently outperforms SOTA methods. For instance, on Llama3.1-8B-Instruct, it improves the LongBench average score by 0.92 while reducing memory costs and latency by 29% and 28%, respectively. Code is available at https://github.com/lianjunl13-sudo/KVSlimmer.
Submitted 8 March, 2026; v1 submitted 28 February, 2026;
originally announced March 2026.
-
Tiny-Critic RAG: Empowering Agentic Fallback with Parameter-Efficient Small Language Models
Authors:
Yichao Wu,
Penghao Liang,
Yafei Xiang,
Mengwei Yuan,
Jianan Liu,
Jing Yang,
Xianyou Li,
Weiran Yan
Abstract:
Retrieval-Augmented Generation (RAG) grounds Large Language Models (LLMs) to mitigate factual hallucinations. Recent paradigms shift from static pipelines to Modular and Agentic RAG frameworks, granting models autonomy for multi-hop reasoning or self-correction. However, current reflective RAG heavily relies on massive LLMs as universal evaluators. In high-throughput systems, executing complete forward passes for billion-parameter models merely for binary routing introduces severe computational redundancy. Furthermore, in autonomous agent scenarios, inaccurate retrieval causes models to expend excessive tokens on spurious reasoning and redundant tool calls, inflating Time-to-First-Token (TTFT) and costs. We propose Tiny-Critic RAG, decoupling evaluation by deploying a parameter-efficient Small Language Model (SLM) via Low-Rank Adaptation (LoRA). Acting as a deterministic gatekeeper, Tiny-Critic employs constrained decoding and non-thinking inference modes for ultra-low latency binary routing. Evaluations on noise-injected datasets demonstrate Tiny-Critic RAG achieves routing accuracy comparable to GPT-4o-mini while reducing latency by an order of magnitude, establishing a highly cost-effective paradigm for agent deployment.
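A hedged sketch of the "deterministic gatekeeper" idea: a small LoRA-tuned language model scores only two candidate continuations, so routing is a single forward pass with constrained decoding. The model path and prompt format below are placeholders, not the paper's.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/tiny-critic-slm")       # hypothetical checkpoint
model = AutoModelForCausalLM.from_pretrained("path/to/tiny-critic-slm")

def route(query: str, passage: str) -> bool:
    prompt = (f"Question: {query}\nPassage: {passage}\n"
              "Is the passage sufficient to answer? Answer yes or no:")
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]        # next-token distribution only
    yes_id = tok.encode(" yes", add_special_tokens=False)[0]
    no_id = tok.encode(" no", add_special_tokens=False)[0]
    # True -> answer directly; False -> trigger the agentic fallback / re-retrieval
    return bool(logits[yes_id] > logits[no_id])
```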
△ Less
Submitted 28 February, 2026;
originally announced March 2026.
-
PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning
Authors:
Dongxu Zhang,
Yiding Sun,
Pengcheng Li,
Yumou Liu,
Hongqiang Lin,
Haoran Xu,
Xiaoxuan Mu,
Liang Lin,
Wenbiao Yan,
Ning Yang,
Chaowei Fang,
Juanjuan Zhao,
Jihua Zhu,
Conghui He,
Cheng Tan
Abstract:
While Multimodal Large Language Models (MLLMs) demonstrate proficiency in 2D scenes, extending their perceptual intelligence to 3D point cloud understanding remains a significant challenge. Current approaches focus primarily on aligning 3D features with pre-trained models. However, they typically treat geometric reasoning as an implicit mapping process. These methods bypass intermediate logical st…
▽ More
While Multimodal Large Language Models (MLLMs) demonstrate proficiency in 2D scenes, extending their perceptual intelligence to 3D point cloud understanding remains a significant challenge. Current approaches focus primarily on aligning 3D features with pre-trained models. However, they typically treat geometric reasoning as an implicit mapping process. These methods bypass intermediate logical steps and consequently suffer from geometric hallucinations: they confidently generate plausible responses that are not grounded in precise structural details. To bridge this gap, we present PointCoT, a novel framework that empowers MLLMs with explicit Chain-of-Thought (CoT) reasoning for 3D data. We advocate for a \textit{Look, Think, then Answer} paradigm. In this approach, the model is supervised to generate geometry-grounded rationales before predicting final answers. To facilitate this, we construct Point-Reason-Instruct, a large-scale benchmark comprising $\sim$86k instruction-tuning samples with hierarchical CoT annotations. By leveraging a dual-stream multi-modal architecture, our method synergizes semantic appearance with geometric truth. Extensive experiments demonstrate that PointCoT achieves state-of-the-art performance on complex reasoning tasks.
△ Less
Submitted 27 February, 2026;
originally announced February 2026.
-
LaS-Comp: Zero-shot 3D Completion with Latent-Spatial Consistency
Authors:
Weilong Yan,
Haipeng Li,
Hao Xu,
Nianjin Ye,
Yihao Ai,
Shuaicheng Liu,
Jingyu Hu
Abstract:
This paper introduces LaS-Comp, a zero-shot and category-agnostic approach that leverages the rich geometric priors of 3D foundation models to enable 3D shape completion across diverse types of partial observations. Our contributions are threefold: First, \ourname{} harnesses these powerful generative priors for completion through a complementary two-stage design: (i) an explicit replacement stage…
▽ More
This paper introduces LaS-Comp, a zero-shot and category-agnostic approach that leverages the rich geometric priors of 3D foundation models to enable 3D shape completion across diverse types of partial observations. Our contributions are threefold: First, LaS-Comp harnesses these powerful generative priors for completion through a complementary two-stage design: (i) an explicit replacement stage that preserves the partial observation geometry to ensure faithful completion; and (ii) an implicit refinement stage that ensures seamless boundaries between the observed and synthesized regions. Second, our framework is training-free and compatible with different 3D foundation models. Third, we introduce Omni-Comp, a comprehensive benchmark combining real-world and synthetic data with diverse and challenging partial patterns, enabling a more thorough and realistic evaluation. Both quantitative and qualitative experiments demonstrate that our approach outperforms previous state-of-the-art approaches. Our code and data will be available at \href{https://github.com/DavidYan2001/LaS-Comp}{LaS-Comp}.
△ Less
Submitted 18 March, 2026; v1 submitted 21 February, 2026;
originally announced February 2026.
-
Temporal Consistency-Aware Text-to-Motion Generation
Authors:
Hongsong Wang,
Wenjing Yan,
Qiuxia Lai,
Xin Geng
Abstract:
Text-to-Motion (T2M) generation aims to synthesize realistic human motion sequences from natural language descriptions. While two-stage frameworks leveraging discrete motion representations have advanced T2M research, they often neglect cross-sequence temporal consistency, i.e., the shared temporal structures present across different instances of the same action. This leads to semantic misalignmen…
▽ More
Text-to-Motion (T2M) generation aims to synthesize realistic human motion sequences from natural language descriptions. While two-stage frameworks leveraging discrete motion representations have advanced T2M research, they often neglect cross-sequence temporal consistency, i.e., the shared temporal structures present across different instances of the same action. This leads to semantic misalignments and physically implausible motions. To address this limitation, we propose TCA-T2M, a framework for temporal consistency-aware T2M generation. Our approach introduces a temporal consistency-aware spatial VQ-VAE (TCaS-VQ-VAE) for cross-sequence temporal alignment, coupled with a masked motion transformer for text-conditioned motion generation. Additionally, a kinematic constraint block mitigates discretization artifacts to ensure physical plausibility. Experiments on HumanML3D and KIT-ML benchmarks demonstrate that TCA-T2M achieves state-of-the-art performance, highlighting the importance of temporal consistency in robust and coherent T2M generation.
△ Less
Submitted 20 February, 2026;
originally announced February 2026.
-
CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use
Authors:
Zhen Zhang,
Kaiqiang Song,
Xun Wang,
Yebowen Hu,
Weixiang Yan,
Chenyang Zhao,
Henry Peng Zou,
Haoyun Deng,
Sathish Reddy Indurthi,
Shujian Liu,
Simin Ma,
Xiaoyang Wang,
Xin Eric Wang,
Song Wang
Abstract:
AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforcement learning to such settings remains difficult: realistic objectives often lack verifiable rewards and instead emphasize open-ended behaviors; moreover, RL for multi-turn, multi-step agentic tool use is still underexplored; and building…
▽ More
AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforcement learning to such settings remains difficult: realistic objectives often lack verifiable rewards and instead emphasize open-ended behaviors; moreover, RL for multi-turn, multi-step agentic tool use is still underexplored; and building and maintaining executable tool environments is costly, limiting scale and coverage. We propose CM2, an RL framework that replaces verifiable outcome rewards with checklist rewards. CM2 decomposes each turn's intended behavior into fine-grained binary criteria with explicit evidence grounding and structured metadata, turning open-ended judging into more stable classification-style decisions. To balance stability and informativeness, our method adopts a strategy of sparse reward assignment but dense evaluation criteria. Training is performed in a scalable LLM-simulated tool environment, avoiding heavy engineering for large tool sets. Experiments show that CM2 consistently improves over supervised fine-tuning. Starting from an 8B Base model and training on an 8k-example RL dataset, CM2 improves over the SFT counterpart by 8 points on τ-Bench, by 10 points on BFCL-V4, and by 12 points on ToolSandbox. The results match or even outperform similarly sized open-source baselines, including the judging model. CM2 thus provides a scalable recipe for optimizing multi-turn, multi-step tool-using agents without relying on verifiable rewards. Code provided by the open-source community: https://github.com/namezhenzhang/CM2-RLCR-Tool-Agent.
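The checklist-reward idea can be illustrated with a few lines (this is a generic sketch of "dense criteria, sparse reward assignment", not the authors' implementation; the field names and judge are assumptions):

```python
# Each turn's intended behavior is decomposed into binary criteria judged by an
# LLM with evidence grounding; the turn reward is the fraction satisfied.
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str   # e.g. "calls lookup_order before promising a refund"
    satisfied: bool    # binary decision from the judge model

def checklist_reward(criteria: list[Criterion]) -> float:
    if not criteria:
        return 0.0
    return sum(c.satisfied for c in criteria) / len(criteria)

turn = [Criterion("asks for the order id", True),
        Criterion("calls the lookup tool with that id", True),
        Criterion("does not fabricate a refund amount", False)]
print(checklist_reward(turn))   # 0.666...
```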
△ Less
Submitted 20 February, 2026; v1 submitted 12 February, 2026;
originally announced February 2026.
-
TabSieve: Explicit In-Table Evidence Selection for Tabular Prediction
Authors:
Yongyao Wang,
Ziqi Miao,
Lu Yang,
Haonan Jia,
Wenting Yan,
Chen Qian,
Lijun Li
Abstract:
Tabular prediction can benefit from in-table rows as few-shot evidence, yet existing tabular models typically perform instance-wise inference and LLM-based prompting is often brittle. Models do not consistently leverage relevant rows, and noisy context can degrade performance. To address this challenge, we propose TabSieve, a select-then-predict framework that makes evidence usage explicit and aud…
▽ More
Tabular prediction can benefit from in-table rows as few-shot evidence, yet existing tabular models typically perform instance-wise inference and LLM-based prompting is often brittle. Models do not consistently leverage relevant rows, and noisy context can degrade performance. To address this challenge, we propose TabSieve, a select-then-predict framework that makes evidence usage explicit and auditable. Given a table and a query row, TabSieve first selects a small set of informative rows as evidence and then predicts the missing target conditioned on the selected evidence. To enable this capability, we construct TabSieve-SFT-40K by synthesizing high-quality reasoning trajectories from 331 real tables using a strong teacher model with strict filtering. Furthermore, we introduce TAB-GRPO, a reinforcement learning recipe that jointly optimizes evidence selection and prediction correctness with separate rewards, and stabilizes mixed regression and classification training via dynamic task-advantage balancing. Experiments on a held-out benchmark of 75 classification and 52 regression tables show that TabSieve consistently improves performance across shot budgets, with average gains of 2.92% on classification and 4.45% on regression over the second-best baseline. Further analysis indicates that TabSieve concentrates more attention on the selected evidence, which improves robustness to noisy context.
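A minimal sketch of the select-then-predict flow under our own assumptions (the prompts and the `llm` callable are placeholders, not the authors' pipeline): the model first emits indices of evidence rows, then predicts the target conditioned only on those rows, which is what makes evidence usage explicit and auditable.

```python
def tab_sieve_predict(llm, table_rows, query_row, k=5):
    select_prompt = (
        "Table rows:\n" + "\n".join(f"[{i}] {r}" for i, r in enumerate(table_rows))
        + f"\nQuery row: {query_row}\n"
        f"List up to {k} row indices that are most informative, comma-separated:"
    )
    idx = [int(i) for i in llm(select_prompt).split(",") if i.strip().isdigit()]
    evidence = [table_rows[i] for i in idx if i < len(table_rows)]
    predict_prompt = (
        "Evidence rows:\n" + "\n".join(map(str, evidence))
        + f"\nQuery row with missing target: {query_row}\nPredicted target:"
    )
    return llm(predict_prompt), idx   # prediction plus the auditable evidence set
```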
△ Less
Submitted 12 February, 2026;
originally announced February 2026.
-
Paint by Odor: An Exploration of Odor Visualization through Large Language Model and Generative AI
Authors:
Gang Yu,
Yuchi Sun,
Weining Yan,
Xinyu Wang,
Qi Lu
Abstract:
Odor visualization translates odor information and perception into visual outcomes and arouses the corresponding olfactory synesthesia, surpassing the spatial limitation that odors can only be perceived where they are present. Traditional odor visualization has typically relied on unidimensional mappings, such as odor-to-color associations, and has required extensive manual design efforts. However…
▽ More
Odor visualization translates odor information and perception into visual outcomes and arouses the corresponding olfactory synesthesia, surpassing the spatial limitation that odors can only be perceived where they are present. Traditional odor visualization has typically relied on unidimensional mappings, such as odor-to-color associations, and has required extensive manual design effort. However, the advent of generative AI (Gen AI) and large language models (LLMs) presents a new opportunity for automatic odor visualization. Nonetheless, gaps remain in bridging olfactory perception with generative tools to produce odor images. To address these gaps, this paper introduces Paint by Odor, a pipeline that leverages Gen AI and LLMs to transform olfactory perceptions into rich, aesthetically engaging visual representations. Two experiments were conducted: 30 participants smelled real-world odors and provided descriptive data, and 28 participants evaluated 560 generated odor images produced from seven systematically designed prompts. Our findings examine the capability of LLMs to reproduce olfactory perception by comparing model outputs with human responses, and reveal the underlying mechanisms and effects of language-based descriptions and several abstraction styles on odor visualization. Our work further discusses the possibility of automatic odor visualization without human participation. These explorations and results bridge the research gap in odor visualization using LLMs and Gen AI, offering valuable design insights and various possibilities for future applications.
△ Less
Submitted 3 February, 2026;
originally announced February 2026.
-
4DPC$^2$hat: Towards Dynamic Point Cloud Understanding with Failure-Aware Bootstrapping
Authors:
Xindan Zhang,
Weilong Yan,
Yufei Shi,
Xuerui Qiu,
Tao He,
Ying Li,
Ming Li,
Hehe Fan
Abstract:
Point clouds provide a compact and expressive representation of 3D objects, and have recently been integrated into multimodal large language models (MLLMs). However, existing methods primarily focus on static objects, while understanding dynamic point cloud sequences remains largely unexplored. This limitation is mainly caused by the lack of large-scale cross-modal datasets and the difficulty of m…
▽ More
Point clouds provide a compact and expressive representation of 3D objects, and have recently been integrated into multimodal large language models (MLLMs). However, existing methods primarily focus on static objects, while understanding dynamic point cloud sequences remains largely unexplored. This limitation is mainly caused by the lack of large-scale cross-modal datasets and the difficulty of modeling motions in spatio-temporal contexts. To bridge this gap, we present 4DPC$^2$hat, the first MLLM tailored for dynamic point cloud understanding. To this end, we construct a large-scale cross-modal dataset 4DPC$^2$hat-200K via a meticulous two-stage pipeline consisting of topology-consistent 4D point construction and two-level captioning. The dataset contains over 44K dynamic object sequences, 700K point cloud frames, and 200K curated question-answer (QA) pairs, supporting inquiries about counting, temporal relationship, action, spatial relationship, and appearance. At the core of the framework, we introduce a Mamba-enhanced temporal reasoning MLLM to capture long-range dependencies and dynamic patterns among a point cloud sequence. Furthermore, we propose a failure-aware bootstrapping learning strategy that iteratively identifies model deficiencies and generates targeted QA supervision to continuously strengthen corresponding reasoning capabilities. Extensive experiments demonstrate that our 4DPC$^2$hat significantly improves action understanding and temporal reasoning compared with existing models, establishing a strong foundation for 4D dynamic point cloud understanding.
△ Less
Submitted 3 February, 2026;
originally announced February 2026.
-
FISMO: Fisher-Structured Momentum-Orthogonalized Optimizer
Authors:
Chenrui Xu,
Wenjing Yan,
Ying-Jun Angela Zhang
Abstract:
Training large-scale neural networks requires solving nonconvex optimization where the choice of optimizer fundamentally determines both convergence behavior and computational efficiency. While adaptive methods like Adam have long dominated practice, the recently proposed Muon optimizer achieves superior performance through orthogonalized momentum updates that enforce isotropic geometry with unifo…
▽ More
Training large-scale neural networks requires solving nonconvex optimization where the choice of optimizer fundamentally determines both convergence behavior and computational efficiency. While adaptive methods like Adam have long dominated practice, the recently proposed Muon optimizer achieves superior performance through orthogonalized momentum updates that enforce isotropic geometry with uniform singular values. However, this strict isotropy discards potentially valuable curvature information encoded in gradient spectra, motivating optimization methods that balance geometric structure with adaptivity. We introduce FISMO (Fisher-Structured Momentum-Orthogonalized) optimizer, which generalizes isotropic updates to incorporate anisotropic curvature information through Fisher information geometry. By reformulating the optimizer update as a trust-region problem constrained by a Kronecker-factored Fisher metric, FISMO achieves structured preconditioning that adapts to local loss landscape geometry while maintaining computational tractability. We establish convergence guarantees for FISMO in stochastic nonconvex settings, proving an $\mathcal{O}(1/\sqrt{T})$ rate for the expected squared gradient norm with explicit characterization of variance reduction through mini-batching. Empirical evaluation on image classification and language modeling benchmarks demonstrates that FISMO achieves superior training efficiency and final performance compared to established baselines.
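As a toy sketch of the direction described above (our own simplification, not the paper's exact closed form): precondition the momentum by Kronecker-factored curvature factors and normalize the result, so that with identity factors the update degenerates to plain orthogonalized-style momentum. The factors `L` and `R` and the Frobenius normalization stand in for the Fisher metric and the trust-region constraint.

```python
import numpy as np

def fisher_structured_update(momentum, L, R, lr=0.02, eps=1e-8):
    """momentum: (m, n) matrix; L: (m, m) and R: (n, n) Kronecker curvature factors."""
    pre = np.linalg.solve(L + eps * np.eye(L.shape[0]), momentum)       # L^{-1} M
    pre = np.linalg.solve((R + eps * np.eye(R.shape[0])).T, pre.T).T    # ... R^{-1}
    # Normalize as a stand-in for solving the trust-region subproblem exactly.
    return -lr * pre / (np.linalg.norm(pre) + eps)
```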
△ Less
Submitted 29 January, 2026;
originally announced January 2026.
-
CLEAR-Mamba:Towards Accurate, Adaptive and Trustworthy Multi-Sequence Ophthalmic Angiography Classification
Authors:
Zhuonan Wang,
Wenjie Yan,
Wenqiao Zhang,
Xiaohui Song,
Jian Ma,
Ke Yao,
Yibo Yu,
Beng Chin Ooi
Abstract:
Medical image classification is a core task in computer-aided diagnosis (CAD), playing a pivotal role in early disease detection, treatment planning, and patient prognosis assessment. In ophthalmic practice, fluorescein fundus angiography (FFA) and indocyanine green angiography (ICGA) provide hemodynamic and lesion-structural information that conventional fundus photography cannot capture. However…
▽ More
Medical image classification is a core task in computer-aided diagnosis (CAD), playing a pivotal role in early disease detection, treatment planning, and patient prognosis assessment. In ophthalmic practice, fluorescein fundus angiography (FFA) and indocyanine green angiography (ICGA) provide hemodynamic and lesion-structural information that conventional fundus photography cannot capture. However, due to the single-modality nature, subtle lesion patterns, and significant inter-device variability, existing methods still face limitations in generalization and high-confidence prediction. To address these challenges, we propose CLEAR-Mamba, an enhanced framework built upon MedMamba with optimizations in both architecture and training strategy. Architecturally, we introduce HaC, a hypernetwork-based adaptive conditioning layer that dynamically generates parameters according to input feature distributions, thereby improving cross-domain adaptability. From a training perspective, we develop RaP, a reliability-aware prediction scheme built upon evidential uncertainty learning, which encourages the model to emphasize low-confidence samples and improves overall stability and reliability. We further construct a large-scale ophthalmic angiography dataset covering both FFA and ICGA modalities, comprising multiple retinal disease categories for model training and evaluation. Experimental results demonstrate that CLEAR-Mamba consistently outperforms multiple baseline models, including the original MedMamba, across various metrics-showing particular advantages in multi-disease classification and reliability-aware prediction. This study provides an effective solution that balances generalizability and reliability for modality-specific medical image classification tasks. Our project can be accessed at https://github.com/ZJU4HealthCare/CLEAR-Mamba.
△ Less
Submitted 10 March, 2026; v1 submitted 28 January, 2026;
originally announced January 2026.
-
Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring
Authors:
Dongxu Zhang,
Yiding Sun,
Cheng Tan,
Wenbiao Yan,
Ning Yang,
Jihua Zhu,
Haijun Zhang
Abstract:
While Chain-of-Thought (CoT) reasoning significantly enhances the performance of Multimodal Large Language Models (MLLMs), its autoregressive nature incurs prohibitive latency constraints. Current efforts to mitigate this via token compression often fail by blindly applying text-centric metrics to multimodal contexts. We identify a critical failure mode termed Visual Amnesia, where linguistically…
▽ More
While Chain-of-Thought (CoT) reasoning significantly enhances the performance of Multimodal Large Language Models (MLLMs), its autoregressive nature incurs prohibitive latency. Current efforts to mitigate this via token compression often fail by blindly applying text-centric metrics to multimodal contexts. We identify a critical failure mode termed Visual Amnesia, where linguistically redundant tokens are erroneously pruned, leading to hallucinations. To address this, we introduce V-Skip, which reformulates token pruning as a Visual-Anchored Information Bottleneck (VA-IB) optimization problem. V-Skip employs a dual-path gating mechanism that weighs token importance through both linguistic surprisal and cross-modal attention flow, effectively rescuing visually salient anchors. Extensive experiments on the Qwen2-VL and Llama-3.2 families demonstrate that V-Skip achieves a $2.9\times$ speedup with negligible accuracy loss. Specifically, it preserves fine-grained visual details, outperforming other baselines by over 30\% on DocVQA.
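A hedged sketch of the dual-path gating intuition (fusion weight, normalization, and keep ratio are illustrative assumptions, not the paper's values): a reasoning token is kept if it is either linguistically surprising or receives substantial attention from visual tokens, which is how visually salient anchors avoid being pruned.

```python
import numpy as np

def vskip_keep_mask(token_logprobs, cross_attn_to_image, alpha=0.5, keep_ratio=0.35):
    surprisal = -token_logprobs                  # (T,) per-token surprisal
    visual = cross_attn_to_image                 # (T,) attention mass from image tokens
    norm = lambda x: (x - x.min()) / (np.ptp(x) + 1e-8)
    score = alpha * norm(surprisal) + (1 - alpha) * norm(visual)
    k = max(1, int(keep_ratio * len(score)))
    keep = np.zeros(len(score), dtype=bool)
    keep[np.argsort(score)[-k:]] = True          # prune the remaining CoT tokens
    return keep
```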
△ Less
Submitted 11 March, 2026; v1 submitted 20 January, 2026;
originally announced January 2026.
-
WaveFormer: Frequency-Time Decoupled Vision Modeling with Wave Equation
Authors:
Zishan Shu,
Juntong Wu,
Wei Yan,
Xudong Liu,
Hongyu Zhang,
Chang Liu,
Youdong Mao,
Jie Chen
Abstract:
Vision modeling has advanced rapidly with Transformers, whose attention mechanisms capture visual dependencies but lack a principled account of how semantic information propagates spatially. We revisit this problem from a wave-based perspective: feature maps are treated as spatial signals whose evolution over an internal propagation time (aligned with network depth) is governed by an underdamped w…
▽ More
Vision modeling has advanced rapidly with Transformers, whose attention mechanisms capture visual dependencies but lack a principled account of how semantic information propagates spatially. We revisit this problem from a wave-based perspective: feature maps are treated as spatial signals whose evolution over an internal propagation time (aligned with network depth) is governed by an underdamped wave equation. In this formulation, spatial frequency, from low-frequency global layout to high-frequency edges and textures, is modeled explicitly, and its interaction with propagation time is controlled rather than implicitly fixed. We derive a closed-form, frequency-time decoupled solution and implement it as the Wave Propagation Operator (WPO), a lightweight module that models global interactions in O(N log N) time, far lower than attention. Building on WPO, we propose a family of WaveFormer models as drop-in replacements for standard ViTs and CNNs, achieving competitive accuracy across image classification, object detection, and semantic segmentation, while delivering up to 1.6x higher throughput and 30% fewer FLOPs than attention-based alternatives. Furthermore, our results demonstrate that wave propagation introduces a complementary modeling bias to heat-based methods, effectively capturing both global coherence and high-frequency details essential for rich visual semantics. Codes are available at: https://github.com/ZishanShu/WaveFormer.
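The closed-form, frequency-time decoupled evolution can be illustrated with a plain FFT (this is an illustration of solving an underdamped wave equation per spatial frequency, not the released WPO module; the damping and speed values are arbitrary): each Fourier mode of the feature map is multiplied by its analytic time response, so global mixing costs O(N log N).

```python
import numpy as np

def wave_propagate(x, t=1.0, c=1.0, gamma=0.1):
    """Evolve a 2D map under u_tt + 2*gamma*u_t = c^2 * Laplacian(u), with u(0)=x, u_t(0)=0."""
    H, W = x.shape
    ky = np.fft.fftfreq(H) * 2 * np.pi
    kx = np.fft.fftfreq(W) * 2 * np.pi
    k2 = ky[:, None] ** 2 + kx[None, :] ** 2
    omega = np.sqrt(np.maximum(c ** 2 * k2 - gamma ** 2, 1e-12))   # underdamped branch
    xf = np.fft.fft2(x)
    response = np.exp(-gamma * t) * (np.cos(omega * t) + (gamma / omega) * np.sin(omega * t))
    return np.real(np.fft.ifft2(xf * response))
```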
△ Less
Submitted 13 January, 2026;
originally announced January 2026.
-
Surrogate compliance modeling enables reinforcement learned locomotion gaits for soft robots
Authors:
Jue Wang,
Mingsong Jiang,
Luis A. Ramirez,
Bilige Yang,
Mujun Zhang,
Esteban Figueroa,
Wenzhong Yan,
Rebecca Kramer-Bottiglio
Abstract:
Adaptive morphogenetic robots adapt their morphology and control policies to meet changing tasks and environmental conditions. Many such systems leverage soft components, which enable shape morphing but also introduce simulation and control challenges. Soft-body simulators remain limited in accuracy and computational tractability, while rigid-body simulators cannot capture soft-material dynamics.…
▽ More
Adaptive morphogenetic robots adapt their morphology and control policies to meet changing tasks and environmental conditions. Many such systems leverage soft components, which enable shape morphing but also introduce simulation and control challenges. Soft-body simulators remain limited in accuracy and computational tractability, while rigid-body simulators cannot capture soft-material dynamics. Here, we present a surrogate compliance modeling approach: rather than explicitly modeling soft-body physics, we introduce indirect variables representing soft-material deformation within a rigid-body simulator. We validate this approach using our amphibious robotic turtle, a quadruped with soft morphing limbs designed for multi-environment locomotion. By capturing deformation effects as changes in effective limb length and limb center of mass, and by applying reinforcement learning with extensive randomization of these indirect variables, we achieve reliable policy learning entirely in a rigid-body simulation. The resulting gaits transfer directly to hardware, demonstrating high-fidelity sim-to-real performance on hard, flat substrates and robust, though lower-fidelity, transfer on rheologically complex terrains. The learned closed-loop gaits exhibit unprecedented terrestrial maneuverability and achieve an order-of-magnitude reduction in cost of transport compared to open-loop baselines. Field experiments with the robot further demonstrate stable, multi-gait locomotion across diverse natural terrains, including gravel, grass, and mud.
△ Less
Submitted 7 December, 2025;
originally announced December 2025.
-
SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations
Authors:
Wenhao Yan,
Sheng Ye,
Zhuoyi Yang,
Jiayan Teng,
ZhenHui Dong,
Kairui Wen,
Xiaotao Gu,
Yong-Jin Liu,
Jie Tang
Abstract:
Achieving controllable character animation that meets studio-grade standards remains challenging despite recent progress. Existing approaches can transfer motion from a driving video to a reference image, but often fail to preserve structural fidelity and temporal consistency in wild scenarios involving complex motion and cross-identity animations. In this work, we present \textbf{SCAIL} (a framew…
▽ More
Achieving controllable character animation that meets studio-grade standards remains challenging despite recent progress. Existing approaches can transfer motion from a driving video to a reference image, but often fail to preserve structural fidelity and temporal consistency in wild scenarios involving complex motion and cross-identity animations. In this work, we present \textbf{SCAIL} (a framework toward \textbf{S}tudio-grade \textbf{C}haracter \textbf{A}nimation via \textbf{I}n-context \textbf{L}earning), which is designed to address these challenges from two key innovations. First, we propose a novel 3D pose representation, providing a robust and flexible motion signal. Second, we introduce a full-context pose injection mechanism within a diffusion-transformer, enabling effective spatio-temporal reasoning over full motion sequences. To align with studio-grade requirements, we develop a curated data pipeline ensuring both diversity and quality, and establish a comprehensive benchmark for systematic evaluation. Experiments show that \textbf{SCAIL} achieves state-of-the-art performance and advances character animation toward studio-grade controlling. Code and model are available at \href{https://github.com/zai-org/SCAIL}{zai-org/SCAIL}.
△ Less
Submitted 23 March, 2026; v1 submitted 5 December, 2025;
originally announced December 2025.
-
SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System
Authors:
Zhiyu Xu,
Weilong Yan,
Yufei Shi,
Xin Meng,
Tao He,
Huiping Zhuang,
Ming Li,
Hehe Fan
Abstract:
Recent advancements in multimodal large language models (MLLMs) and video agent systems have significantly improved general video understanding. However, when applied to scientific video understanding and educating, a domain that demands external professional knowledge integration and rigorous step-wise reasoning, existing approaches often struggle. To bridge this gap, we propose SciEducator, the…
▽ More
Recent advancements in multimodal large language models (MLLMs) and video agent systems have significantly improved general video understanding. However, when applied to scientific video understanding and educating, a domain that demands external professional knowledge integration and rigorous step-wise reasoning, existing approaches often struggle. To bridge this gap, we propose SciEducator, the first iterative self-evolving multi-agent system for scientific video comprehension and education. Rooted in the classical Deming Cycle from management science, our design reformulates its Plan-Do-Study-Act philosophy into a self-evolving reasoning and feedback mechanism, which facilitates the interpretation of intricate scientific activities in videos. Moreover, SciEducator can produce multimodal educational content tailored to specific scientific processes, including textual instructions, visual guides, audio narrations, and interactive references. To support evaluation, we construct SciVBench, a benchmark consisting of 500 expert-verified and literature-grounded science QA pairs across five categories, covering physical, chemical, and everyday phenomena. Extensive experiments demonstrate that SciEducator substantially outperforms leading closed-source MLLMs (e.g., Gemini, GPT-4o) and state-of-the-art video agents on the benchmark, establishing a new paradigm for the community.
△ Less
Submitted 22 November, 2025;
originally announced November 2025.
-
THD-BAR: Topology Hierarchical Derived Brain Autoregressive Modeling for EEG Generic Representations
Authors:
Wenchao Yang,
Weidong Yan,
Wenkang Liu,
Yulan Ma,
Yang Li
Abstract:
Large-scale pre-trained models hold significant potential for learning universal EEG representations. However, most existing methods, particularly autoregressive (AR) frameworks, primarily rely on straightforward temporal sequencing of multi-channel EEG data, which fails to capture the rich physiological characteristics inherent to EEG signals. Moreover, their time-centered modeling approach also…
▽ More
Large-scale pre-trained models hold significant potential for learning universal EEG representations. However, most existing methods, particularly autoregressive (AR) frameworks, primarily rely on straightforward temporal sequencing of multi-channel EEG data, which fails to capture the rich physiological characteristics inherent to EEG signals. Moreover, their time-centered modeling approach also limits the effective representation of the dynamic spatial topology of brain activity. To address these challenges and fully exploit the potential of large-scale EEG models, we propose a novel Topology Hierarchical Derived Brain Autoregressive Modeling (THD-BAR) for EEG generic representations. The core innovation of THD-BAR lies in the introduction of the Brain Topology Hierarchy (BTH), which establishes a multi-scale spatial order for EEG channels. This hierarchical structure enables a redefinition of autoregressive learning as a "next-scale-time prediction" problem, effectively capturing both spatial and temporal dynamics. Based on BTH, we design a Topology-Hierarchical Vector Quantized-Variational Autoencoder (THVQ-VAE) for multi-scale tokenization and develop an enhanced Brain Autoregressive (BAR) module with specialized masking strategies for prediction. Through extensive large-scale pre-training on 17 datasets, followed by rigorous validation on 10 downstream datasets spanning 5 distinct tasks, THD-BAR consistently outperforms existing methods. These results highlight the superior generalization and modeling capabilities of our proposed approach.
△ Less
Submitted 5 November, 2025;
originally announced November 2025.
-
Learning Implicit Neural Degradation Representation for Unpaired Image Dehazing
Authors:
Shuaibin Fan,
Senming Zhong,
Wenchao Yan,
Minglong Xue
Abstract:
Image dehazing is an important task in the field of computer vision, aiming at restoring clear and detail-rich visual content from haze-affected images. However, when dealing with complex scenes, existing methods often struggle to strike a balance between fine-grained feature representation of inhomogeneous haze distribution and global consistency modeling. Furthermore, to better learn the common…
▽ More
Image dehazing is an important task in computer vision, aiming to restore clear and detail-rich visual content from haze-affected images. However, in complex scenes, existing methods often struggle to balance fine-grained representation of inhomogeneous haze distributions with global consistency modeling. To better learn the common degradation representation of spatially varying haze, we propose an unsupervised dehazing method based on implicit neural degradation representation. First, inspired by the Kolmogorov-Arnold representation theorem, we propose a mechanism that combines channel-independent and channel-dependent processing, which efficiently enhances the ability to learn nonlinear dependencies and in turn achieves good visual perception in complex scenes. Moreover, we design an implicit neural representation that models haze degradation as a continuous function, removing the dependence on explicit feature extraction and physical models. To further learn the implicit representation of haze features, we also design a dense residual enhancement module that eliminates redundant information, enabling high-quality image restoration. Experimental results show that our method achieves competitive dehazing performance on various public and real-world datasets. The project code will be available at https://github.com/Fan-pixel/NeDR-Dehaze.
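A minimal sketch of the implicit-representation ingredient only (not the released NeDR-Dehaze code; the architecture and the transmission-like output are assumptions): a coordinate MLP maps pixel coordinates to a per-pixel degradation estimate, so haze is treated as a continuous function over the image plane rather than an explicit feature map.

```python
import torch
import torch.nn as nn

class DegradationINR(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),   # e.g. a transmission-like map in (0, 1)
        )

    def forward(self, h, w):
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing="ij")
        coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2)
        return self.net(coords).reshape(1, 1, h, w)
```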
△ Less
Submitted 17 November, 2025;
originally announced November 2025.
-
Revisiting the Data Sampling in Multimodal Post-training from a Difficulty-Distinguish View
Authors:
Jianyu Qi,
Ding Zou,
Wenrui Yan,
Rui Ma,
Jiaxu Li,
Zhijie Zheng,
Zhiguo Yang,
Rongchang Zhao
Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have spurred significant progress in Chain-of-Thought (CoT) reasoning. Building on the success of Deepseek-R1, researchers extended multimodal reasoning to post-training paradigms based on reinforcement learning (RL), focusing predominantly on mathematical datasets. However, existing post-training paradigms tend to neglect two critical as…
▽ More
Recent advances in Multimodal Large Language Models (MLLMs) have spurred significant progress in Chain-of-Thought (CoT) reasoning. Building on the success of Deepseek-R1, researchers extended multimodal reasoning to post-training paradigms based on reinforcement learning (RL), focusing predominantly on mathematical datasets. However, existing post-training paradigms tend to neglect two critical aspects: (1) The lack of quantifiable difficulty metrics capable of strategically screening samples for post-training optimization. (2) Suboptimal post-training paradigms that fail to jointly optimize perception and reasoning capabilities. To address this gap, we propose two novel difficulty-aware sampling strategies: Progressive Image Semantic Masking (PISM) quantifies sample hardness through systematic image degradation, while Cross-Modality Attention Balance (CMAB) assesses cross-modal interaction complexity via attention distribution analysis. Leveraging these metrics, we design a hierarchical training framework that incorporates both GRPO-only and SFT+GRPO hybrid training paradigms, and evaluate them across six benchmark datasets. Experiments demonstrate consistent superiority of GRPO applied to difficulty-stratified samples compared to conventional SFT+GRPO pipelines, indicating that strategic data sampling can obviate the need for supervised fine-tuning while improving model accuracy. Our code will be released at https://github.com/qijianyu277/DifficultySampling.
△ Less
Submitted 10 November, 2025;
originally announced November 2025.
-
Test-Time Iterative Error Correction for Efficient Diffusion Models
Authors:
Yunshan Zhong,
Weiqi Yan,
Yuxin Zhang
Abstract:
With the growing demand for high-quality image generation on resource-constrained devices, efficient diffusion models have received increasing attention. However, such models suffer from approximation errors introduced by efficiency techniques, which significantly degrade generation quality. Once deployed, these errors are difficult to correct, as modifying the model is typically infeasible in dep…
▽ More
With the growing demand for high-quality image generation on resource-constrained devices, efficient diffusion models have received increasing attention. However, such models suffer from approximation errors introduced by efficiency techniques, which significantly degrade generation quality. Once deployed, these errors are difficult to correct, as modifying the model is typically infeasible in deployment environments. Through an analysis of error propagation across diffusion timesteps, we reveal that these approximation errors can accumulate exponentially, severely impairing output quality. Motivated by this insight, we propose Iterative Error Correction (IEC), a novel test-time method that mitigates inference-time errors by iteratively refining the model's output. IEC is theoretically proven to reduce error propagation from exponential to linear growth, without requiring any retraining or architectural changes. IEC can seamlessly integrate into the inference process of existing diffusion models, enabling a flexible trade-off between performance and efficiency. Extensive experiments show that IEC consistently improves generation quality across various datasets, efficiency techniques, and model architectures, establishing it as a practical and generalizable solution for test-time enhancement of efficient diffusion models. The code is available in https://github.com/zysxmu/IEC.
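One plausible reading of the test-time loop, kept deliberately abstract (the `denoise_step` and `correct` operators are placeholders, not the paper's exact update): each sampling step's output is fed back and refined a few times before moving to the next timestep, which is what turns per-step approximation error into a controlled, non-compounding quantity.

```python
def sample_with_iec(x_T, timesteps, denoise_step, correct, n_corrections=2):
    x = x_T
    for t in timesteps:                       # standard reverse-diffusion loop
        x_next = denoise_step(x, t)           # efficient (approximate) update
        for _ in range(n_corrections):        # IEC-style refinement of this step's output
            x_next = correct(x, x_next, t)
        x = x_next
    return x
```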
△ Less
Submitted 9 February, 2026; v1 submitted 9 November, 2025;
originally announced November 2025.
-
Robustness study of the bio-inspired musculoskeletal arm robot based on the data-driven iterative learning algorithm
Authors:
Jianbo Yuan,
Jing Dai,
Yerui Fan,
Yaxiong Wu,
Yunpeng Liang,
Weixin Yan
Abstract:
The human arm exhibits remarkable capabilities, including both explosive power and precision, which demonstrate dexterity, compliance, and robustness in unstructured environments. Developing robotic systems that emulate human-like operational characteristics through musculoskeletal structures has long been a research focus. In this study, we designed a novel lightweight tendon-driven musculoskelet…
▽ More
The human arm exhibits remarkable capabilities, including both explosive power and precision, which demonstrate dexterity, compliance, and robustness in unstructured environments. Developing robotic systems that emulate human-like operational characteristics through musculoskeletal structures has long been a research focus. In this study, we designed a novel lightweight tendon-driven musculoskeletal arm (LTDM-Arm), featuring a seven degree-of-freedom (DOF) skeletal joint system and a modularized artificial muscular system (MAMS) with 15 actuators. Additionally, we employed a Hill-type muscle model and data-driven iterative learning control (DDILC) to learn and refine activation signals for repetitive tasks within a finite time frame. We validated the anti-interference capabilities of the musculoskeletal system through both simulations and experiments. The results show that the LTDM-Arm system can effectively achieve desired trajectory tracking tasks, even under load disturbances of 20% in simulation and 15% in experiments. This research lays the foundation for developing advanced robotic systems with human-like operational performance.
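The core iterative-learning-control idea can be sketched in a few lines (a generic ILC update under our own assumptions, not the DDILC algorithm itself): activation signals for a repetitive trajectory are corrected trial by trial from the previous trial's tracking error.

```python
import numpy as np

def ilc_update(activations, tracking_error, learning_gain=0.3):
    """activations, tracking_error: arrays of shape (T, n_actuators); activations stay in [0, 1]."""
    return np.clip(activations + learning_gain * tracking_error, 0.0, 1.0)

# Typical loop: run the arm (or its simulation) with `activations`, record the
# trajectory error over the trial, call ilc_update, and repeat for a fixed number of trials.
```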
△ Less
Submitted 8 November, 2025;
originally announced November 2025.
-
Large-scale automatic carbon ion treatment planning for head and neck cancers via parallel multi-agent reinforcement learning
Authors:
Jueye Zhang,
Chao Yang,
Youfang Lai,
Kai-Wen Li,
Wenting Yan,
Yunzhou Xia,
Haimei Zhang,
Jingjing Zhou,
Gen Yang,
Chen Lin,
Tian Li,
Yibao Zhang
Abstract:
Head-and-neck cancer (HNC) planning is difficult because multiple critical organs-at-risk (OARs) are close to complex targets. Intensity-modulated carbon-ion therapy (IMCT) offers superior dose conformity and OAR sparing but remains slow due to relative biological effectiveness (RBE) modeling, leading to laborious, experience-based, and often suboptimal tuning of many treatment-planning parameters…
▽ More
Head-and-neck cancer (HNC) planning is difficult because multiple critical organs-at-risk (OARs) are close to complex targets. Intensity-modulated carbon-ion therapy (IMCT) offers superior dose conformity and OAR sparing but remains slow due to relative biological effectiveness (RBE) modeling, leading to laborious, experience-based, and often suboptimal tuning of many treatment-planning parameters (TPPs). Recent deep learning (DL) methods are limited by data bias and plan feasibility, while reinforcement learning (RL) struggles to efficiently explore the exponentially large TPP search space. We propose a scalable multi-agent RL (MARL) framework for parallel tuning of 45 TPPs in IMCT. It uses a centralized-training decentralized-execution (CTDE) QMIX backbone with Double DQN, Dueling DQN, and recurrent encoding (DRQN) for stable learning in a high-dimensional, non-stationary environment. To enhance efficiency, we (1) use compact historical DVH vectors as state inputs, (2) apply a linear action-to-value transform mapping small discrete actions to uniform parameter adjustments, and (3) design an absolute, clinically informed piecewise reward aligned with plan scores. A synchronous multi-process worker system interfaces with the PHOENIX TPS for parallel optimization and accelerated data collection. On a head-and-neck dataset (10 training, 10 testing), the method tuned 45 parameters simultaneously and produced plans comparable to or better than expert manual ones (relative plan score: RL $85.93\pm7.85\%$ vs Manual $85.02\pm6.92\%$), with significant (p-value $<$ 0.05) improvements for five OARs. The framework efficiently explores high-dimensional TPP spaces and generates clinically competitive IMCT plans through direct TPS interaction, notably improving OAR sparing.
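The linear action-to-value transform in item (2) can be illustrated as follows (the action set and step size are assumptions for illustration, not the paper's values): each of the 45 agents picks a small discrete action that is mapped linearly to a uniform relative adjustment of its treatment-planning parameter.

```python
import numpy as np

ACTIONS = np.array([-2, -1, 0, 1, 2])     # discrete actions available to each agent
STEP = 0.05                               # 5% adjustment per action unit (illustrative)

def apply_actions(tpp_values, action_indices):
    """tpp_values, action_indices: arrays of length 45 (one per treatment-planning parameter)."""
    return tpp_values * (1.0 + STEP * ACTIONS[action_indices])
```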
△ Less
Submitted 4 November, 2025;
originally announced November 2025.
-
Stiff Circuit System Modeling via Transformer
Authors:
Weiman Yan,
Yi-Chia Chang,
Wanyu Zhao
Abstract:
Accurate and efficient circuit behavior modeling is a cornerstone of modern electronic design automation. Among different types of circuits, stiff circuits are challenging to model using previous frameworks. In this work, we propose a new approach using Crossformer, which is a current state-of-the-art Transformer model for time-series prediction tasks, combined with Kolmogorov-Arnold Networks (KAN…
▽ More
Accurate and efficient circuit behavior modeling is a cornerstone of modern electronic design automation. Among different types of circuits, stiff circuits are challenging to model using previous frameworks. In this work, we propose a new approach using Crossformer, which is a current state-of-the-art Transformer model for time-series prediction tasks, combined with Kolmogorov-Arnold Networks (KANs), to model stiff circuit transient behavior. By leveraging the Crossformer's temporal representation capabilities and the enhanced feature extraction of KANs, our method achieves improved fidelity in predicting circuit responses to a wide range of input conditions. Experimental evaluations on datasets generated through SPICE simulations of analog-to-digital converter (ADC) circuits demonstrate the effectiveness of our approach, with significant reductions in training time and error rates.
△ Less
Submitted 23 March, 2026; v1 submitted 5 October, 2025;
originally announced October 2025.
-
Problem-Parameter-Free Decentralized Bilevel Optimization
Authors:
Zhiwei Zhai,
Wenjing Yan,
Ying-Jun Angela Zhang
Abstract:
Decentralized bilevel optimization has garnered significant attention due to its critical role in solving large-scale machine learning problems. However, existing methods often rely on prior knowledge of problem parameters-such as smoothness, convexity, or communication network topologies-to determine appropriate stepsizes. In practice, these problem parameters are typically unavailable, leading t…
▽ More
Decentralized bilevel optimization has garnered significant attention due to its critical role in solving large-scale machine learning problems. However, existing methods often rely on prior knowledge of problem parameters, such as smoothness, convexity, or communication network topologies, to determine appropriate stepsizes. In practice, these problem parameters are typically unavailable, leading to substantial manual effort for hyperparameter tuning. In this paper, we propose AdaSDBO, a fully problem-parameter-free algorithm for decentralized bilevel optimization with a single-loop structure. AdaSDBO leverages adaptive stepsizes based on cumulative gradient norms to update all variables simultaneously, dynamically adjusting its progress and eliminating the need for problem-specific hyperparameter tuning. Through rigorous theoretical analysis, we establish that AdaSDBO achieves a convergence rate of $\widetilde{\mathcal{O}}\left(\frac{1}{T}\right)$, matching the performance of well-tuned state-of-the-art methods up to polylogarithmic factors. Extensive numerical experiments demonstrate that AdaSDBO delivers competitive performance compared to existing decentralized bilevel optimization methods while exhibiting remarkable robustness across diverse stepsize configurations.
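The problem-parameter-free ingredient alone is easy to illustrate (this sketch shows only an AdaGrad-norm style stepsize built from cumulative gradient norms; the bilevel and decentralized structure of AdaSDBO is omitted): no smoothness or network constants are needed to set the learning rate.

```python
import numpy as np

class AdaGradNormStep:
    def __init__(self, eta0=1.0, eps=1e-8):
        self.eta0, self.eps, self.cum = eta0, eps, 0.0

    def step(self, x, grad):
        self.cum += float(np.linalg.norm(grad) ** 2)   # cumulative squared gradient norm
        lr = self.eta0 / np.sqrt(self.cum + self.eps)  # stepsize shrinks automatically
        return x - lr * grad
```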
△ Less
Submitted 28 October, 2025;
originally announced October 2025.