-
Enhancing Reinforcement Learning for Radiology Report Generation with Evidence-aware Rewards and Self-correcting Preference Learning
Authors:
Qin Zhou,
Guoyan Liang,
Qianyi Yang,
Jingyuan Chen,
Sai Wu,
Chang Yao,
Zhe Wang
Abstract:
Recent reinforcement learning (RL) approaches have advanced radiology report generation (RRG), yet two core limitations persist: (1) report-level rewards offer limited evidence-grounded guidance for clinical faithfulness; and (2) current methods lack an explicit self-improving mechanism to align with clinical preferences. We introduce clinically aligned Evidence-aware Self-Correcting Reinforcement Learning (ESC-RL), comprising two key components. First, a Group-wise Evidence-aware Alignment Reward (GEAR) delivers group-wise, evidence-aware feedback. GEAR reinforces consistent grounding for true positives, recovers missed findings for false negatives, and suppresses unsupported content for false positives. Second, a Self-correcting Preference Learning (SPL) strategy automatically constructs a reliable, disease-aware preference dataset from multiple noisy observations and leverages an LLM to synthesize refined reports without human supervision. ESC-RL promotes clinically faithful, disease-aligned rewards and supports continual self-improvement during training. Extensive experiments on two public chest X-ray datasets demonstrate consistent gains and state-of-the-art performance.
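At the level of extracted findings, the TP/FN/FP structure of such a reward can be sketched as follows; the set-based label comparison, the weights, and the function name are illustrative assumptions, and GEAR additionally computes this feedback group-wise over sampled reports:

```python
def evidence_reward(pred_findings, ref_findings, w_tp=1.0, w_fn=1.0, w_fp=1.0):
    """Toy evidence-aware report reward (a sketch, not GEAR itself).

    Rewards findings the generated report correctly grounds (true positives),
    penalizes missed reference findings (false negatives), and penalizes
    unsupported findings (false positives).
    """
    pred, ref = set(pred_findings), set(ref_findings)
    tp = len(pred & ref)   # consistent grounding to reinforce
    fn = len(ref - pred)   # missed findings to recover
    fp = len(pred - ref)   # unsupported content to suppress
    return w_tp * tp - w_fn * fn - w_fp * fp

# tp=1 (cardiomegaly), fn=1 (effusion), fp=1 (edema) -> 1 - 1 - 1 = -1
r = evidence_reward({"edema", "cardiomegaly"}, {"cardiomegaly", "effusion"})
```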
Submitted 15 April, 2026;
originally announced April 2026.
-
Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation
Authors:
Daiwei Chen,
Zhoutong Fu,
Chengming Jiang,
Haichao Zhang,
Ran Zhou,
Tan Wang,
Chunnan Yao,
Guoyao Li,
Rui Cai,
Yihan Cao,
Ruijie Jiang,
Fedor Borisyuk,
Jianqiang Shen,
Jingwei Wu,
Ramya Korlakai Vinayak
Abstract:
Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that \emph{token initialization} is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the \emph{Grounded Token Initialization Hypothesis}: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
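The collapse under mean initialization versus the spread under grounding can be illustrated with a toy diagnostic; the grounding rule below (averaging the embeddings of a few pretrained tokens that describe each new token) is an assumption about what "paired linguistic supervision" might look like, not GTI's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_emb = rng.normal(size=(1000, 64))   # toy pretrained embedding table

# Mean initialization: every new token starts at the same point, so the
# new-token block has no inter-token structure at all.
mean_init = np.tile(vocab_emb.mean(axis=0), (8, 1))

# Grounded initialization (sketch): place each new token at the average
# embedding of pretrained tokens describing it (hypothetical supervision).
descriptions = [rng.choice(1000, size=5, replace=False) for _ in range(8)]
grounded_init = np.stack([vocab_emb[ids].mean(axis=0) for ids in descriptions])

def effective_rank(E):
    """Rank of the centered block: a crude spectral-collapse diagnostic."""
    s = np.linalg.svd(E - E.mean(axis=0), compute_uv=False)
    return int((s > 1e-8).sum())

print(effective_rank(mean_init))      # 0: fully collapsed subspace
print(effective_rank(grounded_init))  # positive: distinct directions survive
```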
Submitted 2 April, 2026;
originally announced April 2026.
-
Semantic Compensation via Adversarial Removal for Robust Zero-Shot ECG Diagnosis
Authors:
Hongjun Liu,
Rujun Han,
Leyu Zhou,
Chao Yao
Abstract:
Recent ECG--language pretraining methods enable zero-shot diagnosis by aligning cardiac signals with clinical text, but they do not explicitly model robustness to partial observation and are typically studied under fully observed ECG settings. In practice, diagnostically critical leads or temporal segments may be missing due to electrode detachment, motion artifacts, or signal corruption, causing severe degradation of cross-modal semantic alignment. In this paper, we propose \textbf{SCAR}, a robust ECG--language pretraining framework for \textbf{S}emantic \textbf{C}ompensation via \textbf{A}dversarial \textbf{R}emoval. SCAR improves robustness by explicitly training the model to remain semantically aligned under semantically critical missingness and to recover diagnostic meaning from the remaining visible evidence. Specifically, we introduce a differentiable adversarial masker to remove the most alignment-critical spatio-temporal ECG tokens during training, forcing the ECG encoder to learn representations that remain semantically aligned with clinical text even when primary diagnostic evidence is missing. Under such adversarial corruption, we equip the ECG encoder with a semantically supervised adaptive selector that learns to reweight the remaining visible tokens and compensate with secondary yet diagnostically informative morphological cues. To evaluate robustness beyond classification accuracy, we further introduce the Counterfactual Missingness Resolution Score (CMRS), which quantifies how well features preserve diagnostic semantics under missingness. Experiments on $6$ datasets show that SCAR consistently improves semantic robustness under joint lead and temporal missingness, with particularly clear advantages in harder cases where primary diagnostic evidence is unavailable, while also yielding stronger linear-probing transferability.
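A crude stand-in for the masking step can be written with a similarity heuristic; the scoring rule, shapes, and function name below are assumptions (SCAR's masker is learned and differentiable), shown only to make the remove-then-compensate idea concrete:

```python
import numpy as np

def adversarial_mask(ecg_tokens, text_emb, k):
    """Sketch: zero out the k most alignment-critical ECG tokens.

    ecg_tokens: (N, D) spatio-temporal token embeddings; text_emb: (D,)
    clinical-text embedding. Tokens are scored by cosine similarity to the
    text and the top-k are removed, so the encoder must recover the
    diagnosis from secondary morphological cues.
    """
    norm = np.linalg.norm
    score = ecg_tokens @ text_emb / (norm(ecg_tokens, axis=-1) * norm(text_emb) + 1e-8)
    drop = np.argsort(score)[-k:]       # most alignment-critical tokens
    masked = ecg_tokens.copy()
    masked[drop] = 0.0                  # primary diagnostic evidence removed
    return masked

tokens = np.random.randn(20, 16)
masked = adversarial_mask(tokens, np.random.randn(16), k=4)
```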
Submitted 1 April, 2026;
originally announced April 2026.
-
TALENT: Target-aware Efficient Tuning for Referring Image Segmentation
Authors:
Shuo Jin,
Siyue Yu,
Bingfeng Zhang,
Chao Yao,
Meiqin Liu,
Jimin Xiao
Abstract:
Referring image segmentation (RIS) aims to segment specific targets based on a natural text expression. Recently, parameter-efficient tuning (PET) has emerged as a promising paradigm. However, existing PET-based methods often suffer from the fact that visual features cannot emphasize the text-referred target instance but instead activate co-category yet unrelated objects. We analyze and quantify this problem, terming it the `non-target activation' (NTA) issue. To address this, we propose a novel framework, TALENT, which utilizes target-aware efficient tuning for PET-based RIS. Specifically, we first propose a Rectified Cost Aggregator (RCA) to efficiently aggregate text-referred features. Then, to calibrate `NTA' into accurate target activation, we adopt a Target-aware Learning Mechanism (TLM), including contextual pairwise consistency learning and target-centric contrastive learning. The former uses the sentence-level text feature to achieve a holistic understanding of the referent and constructs a text-referred affinity map to optimize the semantic association of visual features. The latter further enhances target localization to discover the distinct instance while suppressing associations with other unrelated ones. The two objectives work in concert and address `NTA' effectively. Extensive evaluations show that TALENT outperforms existing methods across various metrics (e.g., 2.5\% mIoU gains on G-Ref val set). Our codes will be released at: https://github.com/Kimsure/TALENT.
Submitted 1 April, 2026;
originally announced April 2026.
-
Probing the Lack of Stable Internal Beliefs in LLMs
Authors:
Yifan Luo,
Kangping Xu,
Yanzhen Lu,
Yang Yuan,
Andrew Chi-Chih Yao
Abstract:
Persona-driven large language models (LLMs) require consistent behavioral tendencies across interactions to simulate human-like personality traits, such as persistence or reliability. However, current LLMs often lack stable internal representations that anchor their responses over extended dialogues. This work explores whether LLMs can maintain "implicit consistency", defined as persistent adherence to an unstated goal in multi-turn interactions. We designed a 20-question-style riddle game paradigm where an LLM is tasked with secretly selecting a target and responding to users' guesses with "yes/no" answers. Through evaluations, we find that LLMs struggle to preserve latent consistency: their implicit "goals" shift across turns unless explicitly provided their selected target in context. These findings highlight critical limitations in the building of persona-driven LLMs and underscore the need for mechanisms that anchor implicit goals over time, which is a key to realistic personality modeling in interactive applications such as dialogue systems.
Submitted 26 March, 2026;
originally announced March 2026.
-
EDM-ARS: A Domain-Specific Multi-Agent System for Automated Educational Data Mining Research
Authors:
Chenguang Pan,
Zhou Zhang,
Weixuan Xiao,
Chengyuan Yao
Abstract:
In this technical report, we present the Educational Data Mining Automated Research System (EDM-ARS), a domain-specific multi-agent pipeline that automates end-to-end educational data mining (EDM) research. We conceptualize EDM-ARS as a general framework for domain-aware automated research pipelines, where educational expertise is embedded into each stage of the research lifecycle. As a first instantiation of this framework, we focus on predictive modeling tasks. Within this scope, EDM-ARS orchestrates five specialized LLM-powered agents (ProblemFormulator, DataEngineer, Analyst, Critic, and Writer) through a state-machine coordinator that supports revision loops, checkpoint-based recovery, and sandboxed code execution. Given a research prompt and a dataset, EDM-ARS produces a complete LaTeX manuscript with real Semantic Scholar citations, validated machine learning analyses, and automated methodological peer review. We also provide a detailed description of the system architecture, the three-tier data registry design that encodes educational domain expertise, the specification of each agent, the inter-agent communication protocol, and mechanisms for error-handling and self-correction. Finally, we discuss current limitations, including single-dataset scope and formulaic paper output, and outline a phased roadmap toward causal inference, transfer learning, psychometric modeling, and multi-dataset generalization. EDM-ARS is released as an open-source project to support the educational research community.
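The coordinator described above (sequential agents with a Critic-driven revision loop) can be sketched as a small state machine; the agent internals, revision predicate, and revision cap are placeholders, with only the agent names taken from the abstract:

```python
AGENTS = ["ProblemFormulator", "DataEngineer", "Analyst", "Critic", "Writer"]

def run_pipeline(run_agent, needs_revision, max_revisions=2):
    """Drive the agents in order; the Critic may loop back to the Analyst.

    run_agent(name, state) -> output stands in for an LLM-powered agent;
    needs_revision(state) -> bool stands in for the Critic's verdict.
    """
    state, i, revisions = {}, 0, 0
    while i < len(AGENTS):
        state[AGENTS[i]] = run_agent(AGENTS[i], state)
        if AGENTS[i] == "Critic" and needs_revision(state) and revisions < max_revisions:
            revisions += 1
            i = AGENTS.index("Analyst")   # revision loop: rerun the analysis
        else:
            i += 1                        # otherwise advance to the next agent
    return state

# Toy run demanding exactly one revision before the Critic accepts.
log = []
state = run_pipeline(
    lambda name, s: log.append(name) or f"draft by {name}",
    lambda s: log.count("Critic") < 2,
)
# log ends ... Analyst, Critic, Analyst, Critic, Writer
```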
Submitted 18 March, 2026;
originally announced March 2026.
-
GT-PCQA: Geometry-Texture Decoupled Point Cloud Quality Assessment with MLLM
Authors:
Guohua Zhang,
Jian Jin,
Meiqin Liu,
Chao Yao,
Weisi Lin,
Yao Zhao
Abstract:
With the rapid advancement of Multi-modal Large Language Models (MLLMs), MLLM-based Image Quality Assessment (IQA) methods have shown promising generalization. However, directly extending these MLLM-based IQA methods to PCQA remains challenging. On the one hand, existing PCQA datasets are limited in scale, which hinders stable and effective instruction tuning of MLLMs. On the other hand, due to large-scale image-text pretraining, MLLMs tend to rely on texture-dominant reasoning and are insufficiently sensitive to geometric structural degradations that are critical for PCQA. To address these gaps, we propose a novel MLLM-based no-reference PCQA framework, termed GT-PCQA, which is built upon two key strategies. First, to enable stable and effective instruction tuning under scarce PCQA supervision, a 2D-3D joint training strategy is proposed. This strategy formulates PCQA as a relative quality comparison problem to unify large-scale IQA datasets with limited PCQA datasets. It incorporates a parameter-efficient Low-Rank Adaptation (LoRA) scheme to support instruction tuning. Second, a geometry-texture decoupling strategy is presented, which integrates a dual-prompt mechanism with an alternating optimization scheme to mitigate the inherent texture-dominant bias of pre-trained MLLMs, while enhancing sensitivity to geometric structural degradations. Extensive experiments demonstrate that GT-PCQA achieves competitive performance and exhibits strong generalization.
Submitted 16 March, 2026;
originally announced March 2026.
-
A Deep-Learning-Boosted Framework for Quantum Sensing with Nitrogen-Vacancy Centers in Diamond
Authors:
Changyu Yao,
Haochen Shen,
Zhongyuan Liu,
Ruotian Gong,
Md Shakil Bin Kashem,
Stella Varnum,
Liangyu Li,
Hangyue Li,
Yue Yu,
Yizhou Wang,
Xiaoshui Lin,
Jonathan Brestoff,
Chenyang Lu,
Shankar Mukherji,
Chuanwei Zhang,
Chong Zu
Abstract:
Nitrogen-vacancy (NV) centers in diamond are a versatile quantum sensing platform for high-sensitivity measurements of magnetic fields, temperature, and strain with nanoscale spatial resolution. A common bottleneck is the analysis of optically detected magnetic resonance (ODMR) spectra, where target quantities are encoded in resonance features. Conventional nonlinear fitting is often computationally expensive, sensitive to initialization, and prone to failure at low signal-to-noise ratio (SNR). Here we introduce a robust, efficient machine learning (ML) framework for real-time ODMR analysis based on a one-dimensional convolutional neural network (1D-CNN). The model performs direct parameter inference without initial guesses or iterative optimization, and is naturally parallelizable on graphics processing units (GPU) for high-throughput processing. We validate the approach on both synthetic and experimental datasets, showing better throughput, accuracy, and robustness than standard nonlinear fitting, with the largest gains in the low-SNR regime. We further validate our methods in two representative sensing applications: diagnosing intracellular temperature changes using nanodiamond probes and widefield magnetic imaging of superconducting vortices in a high-temperature superconductor. This deep-learning inference framework enables fast and reliable extraction of physical parameters from complex ODMR data and provides a scalable route to real-time quantum sensing and imaging.
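A minimal sketch of such a regression network, with illustrative layer sizes and parameter choices rather than the paper's actual architecture:

```python
import torch
import torch.nn as nn

class ODMRNet(nn.Module):
    """Toy 1D-CNN regressing resonance parameters (e.g., center frequency,
    linewidth, contrast) directly from an ODMR spectrum, with no initial
    guess or iterative fit."""

    def __init__(self, n_points=256, n_params=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(8),
        )
        self.head = nn.Linear(32 * 8, n_params)

    def forward(self, x):                 # x: (batch, n_points)
        h = self.features(x.unsqueeze(1))  # add a channel dimension
        return self.head(h.flatten(1))

# Batched inference replaces thousands of per-pixel nonlinear fits.
spectra = torch.randn(512, 256)           # e.g., 512 pixels of a widefield map
params = ODMRNet()(spectra)               # (512, 3) parameter estimates
```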
Submitted 15 March, 2026;
originally announced March 2026.
-
KoopmanFlow: Spectrally Decoupled Generative Control Policy via Koopman Structural Bias
Authors:
Chengsi Yao,
Ge Wang,
Kai Kang,
Shenhao Yan,
Jiahao Yang,
Fan Feng,
Honghao Cai,
Xianxian Zeng,
Rongjun Chen,
Yiming Zhao,
Yatong Han,
Xi Li
Abstract:
Generative Control Policies (GCPs) show immense promise in robotic manipulation but struggle to simultaneously model stable global motions and high-frequency local corrections. While modern architectures extract multi-scale spatial features, their underlying Probability Flow ODEs apply a uniform temporal integration schedule. Compressed to a single step for real-time Receding Horizon Control (RHC), uniform ODE solvers mathematically smooth over sparse, high-frequency transients entangled within low-frequency steady states. To decouple these dynamics without accumulating pipelined errors, we introduce KoopmanFlow, a parameter-efficient generative policy guided by a Koopman-inspired structural inductive bias. Operating in a unified multimodal latent space with visual context, KoopmanFlow bifurcates generation at the terminal stage. Because visual conditioning occurs before spectral decomposition, both branches are visually guided yet temporally specialized. A macroscopic branch anchors slow-varying trajectories via single-step Consistency Training, while a transient branch uses Flow Matching to isolate high-frequency residuals stimulated by sudden visual cues (e.g., contacts or occlusions). Guided by an explicit spectral prior and optimized via a novel asymmetric consistency objective, KoopmanFlow establishes a fused co-training mechanism. This allows the variant branch to absorb localized dynamics without multi-stage error accumulation. Extensive experiments show KoopmanFlow significantly outperforms state-of-the-art baselines in contact-rich tasks requiring agile disturbance rejection. By trading a surplus latency buffer for a richer structural prior, KoopmanFlow achieves superior control fidelity and parameter efficiency within real-time deployment limits.
Submitted 14 March, 2026;
originally announced March 2026.
-
SVLL: Staged Vision-Language Learning for Physically Grounded Embodied Task Planning
Authors:
Yuyuan Yang,
Junkun Hong,
Hongrong Wang,
Honghao Cai,
Xunpeng Ren,
Ge Wang,
Mingcong Lei,
Shenhao Yan,
Jiahao Yang,
Chengsi Yao,
Xi Li,
Yiming Zhao,
Yatong Han,
Jinke Ren
Abstract:
Embodied task planning demands vision-language models to generate action sequences that are both visually grounded and causally coherent over time. However, existing training paradigms face a critical trade-off: joint end-to-end training often leads to premature temporal binding, while standard reinforcement learning methods suffer from optimization instability. To bridge this gap, we present Staged Vision-Language Learning (SVLL), a unified three-stage framework for robust, physically grounded embodied planning. In the first two stages, SVLL decouples spatial grounding from temporal reasoning, establishing robust visual dependency before introducing sequential action history. In the final stage, we identify a key limitation of standard Direct Preference Optimization (DPO): its purely relative nature, which optimizes only the preference gap between winning and losing trajectories while neglecting absolute likelihood constraints on the optimal path, often yields unsafe or hallucinated behaviors. To address this, we further introduce Bias-DPO, a novel alignment objective that injects an inductive bias toward expert trajectories by explicitly maximizing likelihood on ground-truth actions while penalizing overconfident hallucinations. By anchoring the policy to the expert manifold and mitigating causal misalignment, SVLL, powered by Bias-DPO, ensures strict adherence to environmental affordances and effectively suppresses physically impossible shortcuts. Finally, extensive experiments on the interactive AI2-THOR benchmark and real-world robotic deployments demonstrate that SVLL outperforms both state-of-the-art open-source (e.g., Qwen2.5-VL-7B) and closed-source models (e.g., GPT-4o, Gemini-2.0-flash) in task success rate, while significantly reducing physical constraint violations.
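The objective can be sketched as standard DPO plus an absolute-likelihood anchor on the winning (expert) trajectory; the weighting `lam` and the exact anchor form are assumptions, not the paper's formulation of Bias-DPO:

```python
import torch
import torch.nn.functional as F

def bias_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, lam=1.0):
    """Sketch of a DPO objective with an absolute-likelihood anchor.

    Standard DPO optimizes only the relative margin between winning (w) and
    losing (l) trajectory log-probabilities against a reference policy; the
    added term also maximizes the absolute likelihood of the expert
    (winning) trajectory, biasing the policy toward the expert manifold.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    dpo = -F.logsigmoid(margin)   # relative preference term
    anchor = -logp_w              # absolute NLL on expert actions
    return (dpo + lam * anchor).mean()

loss = bias_dpo_loss(torch.tensor([-1.0]), torch.tensor([-3.0]),
                     torch.tensor([-2.0]), torch.tensor([-2.0]))
```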
Submitted 12 March, 2026;
originally announced March 2026.
-
QD-PCQA: Quality-Aware Domain Adaptation for Point Cloud Quality Assessment
Authors:
Guohua Zhang,
Jian Jin,
Meiqin Liu,
Chao Yao,
Weisi Lin
Abstract:
No-Reference Point Cloud Quality Assessment (NR-PCQA) still struggles with generalization, primarily due to the scarcity of annotated point cloud datasets. Since the Human Visual System (HVS) drives perceptual quality assessment independently of media types, prior knowledge on quality learned from images can be repurposed for point clouds. This insight motivates adopting Unsupervised Domain Adaptation (UDA) to transfer quality-relevant priors from labeled images to unlabeled point clouds. However, existing UDA-based PCQA methods often overlook key characteristics of perceptual quality, such as sensitivity to quality ranking and quality-aware feature alignment, thereby limiting their effectiveness. To address these issues, we propose a novel Quality-aware Domain adaptation framework for PCQA, termed QD-PCQA. The framework comprises two main components: i) a Rank-weighted Conditional Alignment (RCA) strategy that aligns features under consistent quality levels and adaptively emphasizes misranked samples to reinforce perceptual quality ranking awareness; and ii) a Quality-guided Feature Augmentation (QFA) strategy, which includes quality-guided style mixup, multi-layer extension, and dual-domain augmentation modules to augment perceptual feature alignment. Extensive cross-domain experiments demonstrate that QD-PCQA significantly improves generalization in NR-PCQA tasks.
Submitted 16 March, 2026; v1 submitted 3 March, 2026;
originally announced March 2026.
-
LLM-assisted Semantic Option Discovery for Facilitating Adaptive Deep Reinforcement Learning
Authors:
Chang Yao,
Jinghui Qin,
Kebing Jin,
Hankz Hankui Zhuo
Abstract:
Despite achieving remarkable success in complex tasks, Deep Reinforcement Learning (DRL) still suffers from critical issues in practical applications, such as low data efficiency, lack of interpretability, and limited cross-environment transferability. Moreover, learned policies that generate actions directly from states are sensitive to environmental changes and struggle to guarantee behavioral safety and compliance. Recent research shows that integrating Large Language Models (LLMs) with symbolic planning is promising for addressing these challenges. Inspired by this, we introduce a novel LLM-driven closed-loop framework, which enables semantic-driven skill reuse and real-time constraint monitoring by mapping natural language instructions into executable rules and semantically annotating automatically created options. The proposed approach utilizes the general knowledge of LLMs to improve exploration efficiency and to adapt transferable options to similar environments, and provides inherent interpretability through semantic annotations. To validate the effectiveness of this framework, we conduct experiments on two domains, Office World and Montezuma's Revenge. The results demonstrate superior performance in data efficiency, constraint compliance, and cross-task transferability.
Submitted 7 March, 2026; v1 submitted 2 March, 2026;
originally announced March 2026.
-
BRepMAE: Self-Supervised Masked BRep Autoencoders for Machining Feature Recognition
Authors:
Can Yao,
Kang Wu,
Zuheng Zheng,
Siyuan Xing,
Xiao-Ming Fu
Abstract:
We propose a masked self-supervised learning framework, called BRepMAE, for automatically extracting a valuable representation of the input computer-aided design (CAD) model to recognize its machining features. Representation learning is conducted on a large-scale, unlabeled CAD model dataset using the geometric Attributed Adjacency Graph (gAAG) representation, derived from the boundary representation (BRep). The self-supervised network is a masked graph autoencoder (MAE) that focuses on reconstructing geometries and attributes of BRep facets, rather than graph structures. After pre-training, we fine-tune a network that contains both the encoder and a task-specific classification network for machining feature recognition (MFR). In the experiments, our fine-tuned network achieves high recognition rates with only a small amount of data (e.g., 0.1% of the training data), significantly enhancing its practicality in real-world (or private) scenarios where only limited data is available. Compared with other MFR methods, our fine-tuned network achieves a significant improvement in recognition rate with the same amount of training data, especially when the number of training samples is limited.
Submitted 26 February, 2026;
originally announced February 2026.
-
Human Video Generation from a Single Image with 3D Pose and View Control
Authors:
Tiantian Wang,
Chun-Han Yao,
Tao Hu,
Mallikarjun Byrasandra Ramalinga Reddy,
Ming-Hsuan Yang,
Varun Jampani
Abstract:
Recent diffusion methods have made significant progress in generating videos from single images due to their powerful visual generation capabilities. However, challenges persist in image-to-video synthesis, particularly in human video generation, where inferring view-consistent, motion-dependent clothing wrinkles from a single image remains a formidable problem. In this paper, we present Human Video Generation in 4D (HVG), a latent video diffusion model capable of generating high-quality, multi-view, spatiotemporally coherent human videos from a single image with 3D pose and view control. HVG achieves this through three key designs: (i) Articulated Pose Modulation, which captures the anatomical relationships of 3D joints via a novel dual-dimensional bone map and resolves self-occlusions across views by introducing 3D information; (ii) View and Temporal Alignment, which ensures multi-view consistency and alignment between a reference image and pose sequences for frame-to-frame stability; and (iii) Progressive Spatio-Temporal Sampling with temporal alignment to maintain smooth transitions in long multi-view animations. Extensive experiments on image-to-video tasks demonstrate that HVG outperforms existing methods in generating high-quality 4D human videos from diverse human images and pose inputs.
Submitted 24 February, 2026;
originally announced February 2026.
-
G-LoG Bi-filtration for Medical Image Classification
Authors:
Qingsong Wang,
Jiaxing He,
Bingzhe Hou,
Tieru Wu,
Yang Cao,
Cailing Yao
Abstract:
Building practical filtrations on objects to detect topological and geometric features is an important task in the field of Topological Data Analysis (TDA). In this paper, leveraging the ability of the Laplacian of Gaussian operator to enhance the boundaries of medical images, we define the G-LoG (Gaussian-Laplacian of Gaussian) bi-filtration to generate features better suited to multi-parameter persistence modules. Modeling volumetric images as bounded functions, we then prove that the interleaving distance on the persistence modules obtained from our bi-filtrations is stable with respect to the maximum norm of the bounded functions. Finally, we conduct experiments on the MedMNIST dataset, comparing our bi-filtration against single-parameter filtration and established deep learning baselines, including Google AutoML Vision, ResNet, AutoKeras, and auto-sklearn. Experimental results demonstrate that our bi-filtration significantly outperforms single-parameter filtration. Notably, a simple Multi-Layer Perceptron (MLP) trained on the topological features generated by our bi-filtration achieves performance comparable to complex deep learning models trained on the original dataset.
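The stability claim has the familiar shape of persistence stability theorems; a plausible form (a reconstruction from the abstract, not the paper's exact statement) is

```latex
d_{I}\bigl(M_{f},\, M_{g}\bigr) \;\le\; \lVert f - g \rVert_{\infty},
```

where $f, g$ are the bounded functions modeling two volumetric images, $M_{f}, M_{g}$ are the persistence modules of their G-LoG bi-filtrations, and $d_{I}$ is the interleaving distance.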
Submitted 20 February, 2026;
originally announced February 2026.
-
Optimal error estimate of an isoparametric upwind discontinuous Galerkin method for radiation transport equation on curved domains
Authors:
Changhui Yao,
Yunpan Ma,
Lingxiao Li
Abstract:
This work investigates the isoparametric upwind discontinuous Galerkin method for solving the radiation transport equation defined on a bounded domain $D$ with a piecewise $C^{k+1}$ smooth curved boundary. An auxiliary mapping is constructed to approximate the original curved domain. The analysis delineates a high-order optimal convergence rate under the DG norm, which comprehensively balances the errors stemming from the numerical discretization and the geometric approximation. Two- and three-dimensional numerical experiments validate the theoretical results.
Submitted 19 February, 2026;
originally announced February 2026.
-
CAFE: Channel-Autoregressive Factorized Encoding for Robust Biosignal Spatial Super-Resolution
Authors:
Hongjun Liu,
Leyu Zhou,
Zijianghao Yang,
Rujun Han,
Shitong Duan,
Kuanjian Tang,
Chao Yao
Abstract:
High-density biosignal recordings are critical for neural decoding and clinical monitoring, yet real-world deployments often rely on low-density (LD) montages due to hardware and operational constraints. This motivates spatial super-resolution from LD observations, but heterogeneous dependencies under sparse and noisy measurements often lead to artifact propagation and false non-local correlations. To address this, we propose CAFE, a plug-and-play rollout generation scheme that reconstructs the full montage in geometry-aligned stages. Starting from the LD channels, CAFE first recovers nearby channels and then progressively expands to more distal regions, exploiting reliable local structure before introducing non-local interactions. During training, step-wise supervision is applied over channel groups, and teacher forcing with epoch-level scheduled sampling along the group dimension reduces exposure bias while enabling parallel computation across steps. At test time, CAFE performs an autoregressive rollout across groups, while remaining plug-and-play by reusing any temporal backbone as the shared predictor. Evaluated on 4 modalities and 6 datasets, CAFE demonstrates plug-and-play generality across 3 backbones (MLP, Conv, Transformer) and achieves consistently better reconstruction than 5 representative baselines.
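The geometry-aligned rollout can be sketched as below. The per-group predictors here are toy stand-ins for the shared temporal backbone, and the near-to-far grouping is purely illustrative.

```python
import numpy as np

def cafe_rollout(ld_signal, groups, predictors):
    """ld_signal: (n_ld, T) observed low-density channels.
    groups: channel counts to reconstruct, ordered near -> far.
    predictors: one callable per group mapping all known channels to the
    next group's channels (stand-ins for the shared backbone)."""
    known = ld_signal
    for n_out, f in zip(groups, predictors):
        new = f(known)                       # predict the next geometric ring
        assert new.shape == (n_out, known.shape[1])
        known = np.vstack([known, new])      # condition later rings on it
    return known

# Toy predictor: each new channel is the mean of the known ones (assumption).
mean_pred = lambda k: lambda known: np.tile(known.mean(0), (k, 1))
T = 100
ld = np.random.randn(4, T)
full = cafe_rollout(ld, groups=[8, 16], predictors=[mean_pred(8), mean_pred(16)])
print(full.shape)  # (28, 100)
```

At test time the loop is sequential across groups; during training, teacher forcing would let all steps run in parallel against ground-truth groups.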
Submitted 18 February, 2026;
originally announced February 2026.
-
HyMem: Hybrid Memory Architecture with Dynamic Retrieval Scheduling
Authors:
Xiaochen Zhao,
Kaikai Wang,
Xiaowen Zhang,
Chen Yao,
Aili Wang
Abstract:
Large language model (LLM) agents demonstrate strong performance in short-text contexts but often underperform in extended dialogues due to inefficient memory management. Existing approaches face a fundamental trade-off between efficiency and effectiveness: memory compression risks losing critical details required for complex reasoning, while retaining raw text introduces unnecessary computational overhead for simple queries. The crux lies in the limitations of monolithic memory representations and static retrieval mechanisms, which fail to emulate the flexible and proactive memory scheduling observed in humans and thus struggle to adapt to diverse problem scenarios. Inspired by the principle of cognitive economy, we propose HyMem, a hybrid memory architecture that enables dynamic on-demand scheduling through multi-granular memory representations. HyMem adopts a dual-granular storage scheme paired with a dynamic two-tier retrieval system: a lightweight module constructs summary-level context for efficient response generation, while an LLM-based deep module is selectively activated only for complex queries, augmented by a reflection mechanism for iterative reasoning refinement. Experiments show that HyMem achieves strong performance on both the LOCOMO and LongMemEval benchmarks, outperforming the full-context baseline while reducing computational cost by 92.6%, establishing a state-of-the-art balance between efficiency and performance in long-term memory management.
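The two-tier scheduling idea can be sketched as a simple dispatch. All callables below are illustrative stand-ins, not HyMem's actual interfaces: a complexity check routes simple queries to cheap summary-level context and reserves the deep module, with one reflection pass, for the rest.

```python
def answer(query, summaries, raw_log, is_complex, light_llm, deep_llm):
    """Two-tier dynamic retrieval scheduling (illustrative sketch).
    Cheap summary context serves simple queries; the expensive deep
    module plus one reflection pass runs only for complex ones."""
    context = "\n".join(summaries)                 # lightweight tier
    if not is_complex(query):
        return light_llm(query, context)
    # deep tier: revisit raw dialogue turns, then reflect once
    draft = deep_llm(query, context + "\n" + "\n".join(raw_log))
    return deep_llm("Refine: " + draft, context)

# Toy stand-ins so the routing is visible:
light = lambda q, c: "light:" + q
deep = lambda q, c: "deep:" + q
is_complex = lambda q: len(q) > 20
print(answer("when?", ["day 1: met Bob"], ["raw turn"], is_complex, light, deep))
# -> light:when?
```

The point of the sketch is the control flow: the cost of re-reading raw turns is only paid when the complexity gate fires.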
Submitted 14 February, 2026;
originally announced February 2026.
-
PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering
Authors:
Xiangfeng Wang,
Hangyu Guo,
Yanlin Lai,
Mitt Huang,
Liang Zhao,
Chengyuan Yao,
Yinmin Zhang,
Qi Han,
Xiaoxiao Ren,
Chun Yuan,
Tong Xu,
Zheng Ge,
Xiangyu Zhang,
Daxin Jiang
Abstract:
While model-based verifiers are essential for scaling Reinforcement Learning with Verifiable Rewards (RLVR), current outcome-centric verification paradigms primarily focus on the consistency between the final result and the ground truth, often neglecting potential errors in the derivation process. This leads to assigning positive rewards to correct answers produced from incorrect derivations. To bridge this gap, we introduce PRIME, a benchmark for evaluating verifiers on Process-Outcome Alignment verification in Mathematics and Engineering. Curated from a comprehensive collection of college-level STEM problems, PRIME comprises 2,530 high-difficulty samples through a consistency-based filtering pipeline. Through extensive evaluation, we find that current verifiers frequently fail to detect derivation flaws. Furthermore, we propose a process-aware RLVR training paradigm utilizing verifiers selected via PRIME. This approach substantially outperforms the outcome-only verification baseline, achieving absolute performance gains of 8.29%, 9.12%, and 7.31% on AIME24, AIME25, and Beyond-AIME, respectively, for the Qwen3-14B-Base model. Finally, we demonstrate a strong linear correlation ($R^2 > 0.92$) between verifier accuracy on PRIME and RLVR training effectiveness, validating PRIME as a reliable predictor for verifier selection.
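The reported correlation is an ordinary coefficient of determination for a least-squares line. The sketch below computes it on made-up (verifier accuracy, RLVR gain) pairs, purely to show the statistic; the numbers are not from the paper.

```python
import numpy as np

def r_squared(x, y):
    """Coefficient of determination of the least-squares fit y ~ a*x + b,
    as used to relate verifier accuracy on the benchmark to downstream
    RLVR training effectiveness."""
    a, b = np.polyfit(x, y, 1)
    resid = y - (a * x + b)
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

acc = np.array([0.60, 0.70, 0.80, 0.90])   # hypothetical verifier accuracy
gain = np.array([2.0, 4.1, 5.9, 8.2])      # hypothetical RLVR gain (%)
print(round(r_squared(acc, gain), 3))      # -> 0.998
```

A value above 0.92, as in the paper, would justify selecting verifiers by benchmark accuracy alone.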
Submitted 11 February, 2026;
originally announced February 2026.
-
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Authors:
Ailin Huang,
Ang Li,
Aobo Kong,
Bin Wang,
Binxing Jiao,
Bo Dong,
Bojun Wang,
Boyu Chen,
Brian Li,
Buyun Ma,
Chang Su,
Changxin Miao,
Changyi Wan,
Chao Lou,
Chen Hu,
Chen Xu,
Chenfeng Yu,
Chengting Feng,
Chengyuan Yao,
Chunrui Han,
Dan Ma,
Dapeng Shi,
Daxin Jiang,
Dehua Ma,
Deshan Sun
, et al. (191 additional authors not shown)
Abstract:
We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency and cost of multi-round agentic interactions. To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use. Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro. By redefining the efficiency frontier, Step 3.5 Flash provides a high-density foundation for deploying sophisticated agents in real-world industrial environments.
Submitted 23 February, 2026; v1 submitted 11 February, 2026;
originally announced February 2026.
-
R-Align: Enhancing Generative Reward Models through Rationale-Centric Meta-Judging
Authors:
Yanlin Lai,
Mitt Huang,
Hangyu Guo,
Xiangfeng Wang,
Haodong Li,
Shaoxiong Zhan,
Liang Zhao,
Chengyuan Yao,
Yinmin Zhang,
Qi Han,
Chun Yuan,
Zheng Ge,
Xiangyu Zhang,
Daxin Jiang
Abstract:
Reinforcement Learning from Human Feedback (RLHF) remains indispensable for aligning large language models (LLMs) in subjective domains. To enhance robustness, recent work shifts toward Generative Reward Models (GenRMs) that generate rationales before predicting preferences. Yet GenRM training and evaluation remain outcome-label-only in practice, leaving reasoning quality unchecked. We show that reasoning fidelity, the consistency between a GenRM's preference decision and reference decision rationales, is highly predictive of downstream RLHF outcomes, beyond standard label accuracy. Specifically, we repurpose existing reward-model benchmarks to compute Spurious Correctness (S-Corr), the fraction of label-correct decisions whose rationales are misaligned with golden judgments. Our empirical evaluation reveals substantial S-Corr even for competitive GenRMs, and higher S-Corr is associated with policy degeneration under optimization. To improve fidelity, we propose Rationale-Centric Alignment (R-Align), which augments training with gold judgments and explicitly supervises rationale alignment. R-Align reduces S-Corr on RM benchmarks and yields consistent gains in actor performance across STEM, coding, instruction following, and general tasks.
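Once each decision carries two bits, label correctness and rationale alignment, S-Corr is a simple ratio. The sketch below assumes the alignment bit has already been produced by an LLM meta-judge comparing the rationale against the golden judgment.

```python
def spurious_correctness(records):
    """S-Corr: among decisions whose preference label is correct, the
    fraction whose rationale disagrees with the golden judgment.
    Each record is (label_correct: bool, rationale_aligned: bool)."""
    correct = [r for r in records if r[0]]
    if not correct:
        return 0.0
    return sum(1 for _, aligned in correct if not aligned) / len(correct)

recs = [(True, True), (True, False), (True, False), (False, True)]
print(spurious_correctness(recs))  # 2 of 3 label-correct decisions are spurious
```

Plain label accuracy on the same records would be 3/4, illustrating how S-Corr surfaces a failure mode that accuracy hides.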
Submitted 6 February, 2026;
originally announced February 2026.
-
Urban Neural Surface Reconstruction from Constrained Sparse Aerial Imagery with 3D SAR Fusion
Authors:
Da Li,
Chen Yao,
Tong Mao,
Jiacheng Bao,
Houjun Sun
Abstract:
Neural surface reconstruction (NSR) has recently shown strong potential for urban 3D reconstruction from multi-view aerial imagery. However, existing NSR methods often suffer from geometric ambiguity and instability, particularly under sparse-view conditions. This issue is critical in large-scale urban remote sensing, where aerial image acquisition is limited by flight paths, terrain, and cost. To address this challenge, we present the first urban NSR framework that fuses 3D synthetic aperture radar (SAR) point clouds with aerial imagery for high-fidelity reconstruction under constrained, sparse-view settings. 3D SAR can efficiently capture large-scale geometry even from a single side-looking flight path, providing robust priors that complement photometric cues from images. Our framework integrates radar-derived spatial constraints into an SDF-based NSR backbone, guiding structure-aware ray selection and adaptive sampling for stable and efficient optimization. We also construct the first benchmark dataset with co-registered 3D SAR point clouds and aerial imagery, facilitating systematic evaluation of cross-modal 3D reconstruction. Extensive experiments show that incorporating 3D SAR markedly enhances reconstruction accuracy, completeness, and robustness compared with single-modality baselines under highly sparse and oblique-view conditions, highlighting a viable route toward scalable high-fidelity urban reconstruction with advanced airborne and spaceborne optical-SAR sensing.
Submitted 29 January, 2026;
originally announced January 2026.
-
Electrostatic Screening Modulation of Graphene's Electronic Structure and the Helical Wavefunction Dominated Topological Properties
Authors:
Yaorui Tan,
Xiang Chen,
Yunhu Zhu,
Xiaowu Yang,
Zhongkai Huang,
Chuang Yao,
Maolin Bo
Abstract:
This study examines electrostatic screening effects in graphene using tight-binding calculations based on the Binding energy and Bond Charge (BBC) model and a modified version of it. The results indicate that the modified BBC potential decays exponentially with distance, which suppresses electron-electron interactions. The hopping integrals exhibit a pronounced decrease over distance and shift with parameter variation. A band gap opens once the parameter exceeds a certain threshold. The density of states shows a prominent peak near the Fermi level, whereas the low-energy region remains largely unchanged. The low-energy helical wave functions in graphene display topological characteristics, including pseudospin-momentum locking and a π Berry phase, resulting in distinctive transport properties. By avoiding the Coulomb singularity, the model offers valuable insights for the engineering of screening in two-dimensional systems and the design of topological devices.
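A minimal nearest-neighbour tight-binding sketch of graphene's bands, with an exponentially screened hopping amplitude standing in for the modified-BBC suppression of interactions. The screening form and all numbers are assumptions for illustration: in this simplified picture the Dirac bands still touch at K, while screening uniformly rescales the bandwidth.

```python
import numpy as np

def graphene_bands(kx, ky, t0=2.7, lam=None, a=1.42):
    """Nearest-neighbour tight-binding bands E(k) = +/- t |f(k)| of the
    honeycomb lattice, with an illustrative screened hopping
    t = t0 * exp(-a/lam) (screening form is an assumption)."""
    t = t0 if lam is None else t0 * np.exp(-a / lam)
    # the three nearest-neighbour vectors of the honeycomb lattice
    d = a * np.array([[1, 0], [-0.5, np.sqrt(3) / 2], [-0.5, -np.sqrt(3) / 2]])
    f = np.exp(1j * (d @ np.array([kx, ky]))).sum()
    return np.array([-t * abs(f), t * abs(f)])

# At the Dirac point K, f(K) = 1 + 2*cos(2*pi/3) = 0: the bands touch.
K = np.array([2 * np.pi / (3 * 1.42), 2 * np.pi / (3 * np.sqrt(3) * 1.42)])
print(np.round(graphene_bands(*K), 6))            # bands touch at K
print(graphene_bands(0.0, 0.0, lam=2.0))          # screening shrinks the bandwidth
```

Opening a gap at K, as the abstract describes, would require an additional symmetry-breaking term beyond this uniform rescaling.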
Submitted 11 February, 2026; v1 submitted 26 January, 2026;
originally announced January 2026.
-
Closed $\mathrm{G}_2$-structures with $\mathbb{T}^3$-symmetry and hypersymplectic structures
Authors:
Chengjian Yao,
Ziyi Zhou
Abstract:
Closed $\mathrm{G}_2$-structures $\varphi$ with an effective $\mathbb{T}^3$-symmetry on connected manifolds are roughly classified into three types according to the evaluation of $\varphi$ on the principal orbits. Type 1: if there is neither an associative nor an isotropic orbit, then the action is free and $\varphi$ reduces to a hypersymplectic structure on the quotient manifold admitting three linearly independent closed 1-forms; in particular, the quotient is diffeomorphic to $\mathbb{T}^4$ if the manifold is compact. Type 2: if some orbit is associative, then the action is almost-free and $\varphi$ reduces to a good hypersymplectic orbifold with cyclic isotropy groups. Type 3: if some orbit is isotropic, then the action is locally multi-Hamiltonian for $\varphi$. Moreover, the open and dense subset of principal orbits is foliated by $\mathbb{T}^3$-invariant hypersymplectic manifolds. If $\varphi$ is torsion-free and complete, then for Type 1 the hypersymplectic manifold is flat and $\varphi$ is flat; for Type 2 the good hypersymplectic orbifold is a good hyperkähler orbifold; for Type 3 $\varphi$ is locally toric. As shown, hypersymplectic structures have an intimate link with closed $\mathrm{G}_2$-structures with effective $\mathbb{T}^3$-symmetry.
Submitted 20 January, 2026;
originally announced January 2026.
-
STEP3-VL-10B Technical Report
Authors:
Ailin Huang,
Chengyuan Yao,
Chunrui Han,
Fanqi Wan,
Hangyu Guo,
Haoran Lv,
Hongyu Zhou,
Jia Wang,
Jian Zhou,
Jianjian Sun,
Jingcheng Hu,
Kangheng Lin,
Liang Zhao,
Mitt Huang,
Song Yuan,
Wenwen Qu,
Xiangfeng Wang,
Yanlin Lai,
Yingxiu Zhao,
Yinmin Zhang,
Yukang Shi,
Yuyang Chen,
Zejia Weng,
Ziyang Meng,
Ang Li
, et al. (68 additional authors not shown)
Abstract:
We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10$\times$-20$\times$ larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.
Submitted 15 January, 2026; v1 submitted 14 January, 2026;
originally announced January 2026.
-
Learner-Tailored Program Repair: A Solution Generator with Iterative Edit-Driven Retrieval Enhancement
Authors:
Zhenlong Dai,
Zhuoluo Zhao,
Hengning Wang,
Xiu Tang,
Sai Wu,
Chang Yao,
Zhipeng Gao,
Jingyuan Chen
Abstract:
With the development of large language models (LLMs) in the field of programming, intelligent programming coaching systems have gained widespread attention. However, most research focuses on repairing the buggy code of programming learners without providing the underlying causes of the bugs. To address this gap, we introduce a novel task, namely LRP (Learner-Tailored Program Repair). We then propose a novel and effective framework, LSGEN (Learner-Tailored Solution Generator), to enhance program repair while offering bug descriptions for the buggy code. In the first stage, we utilize a repair solution retrieval framework to construct a solution retrieval database and then employ an edit-driven code retrieval approach to retrieve valuable solutions, guiding LLMs in identifying and fixing the bugs in buggy code. In the second stage, we propose a solution-guided program repair method, which fixes the code and provides explanations under the guidance of the retrieved solutions. Moreover, we propose an Iterative Retrieval Enhancement method that utilizes evaluation results of the generated code to iteratively optimize the retrieval direction and explore more suitable repair strategies, improving performance in practical programming coaching scenarios. The experimental results show that our approach outperforms a set of baselines by a large margin, validating the effectiveness of our framework for the newly proposed LRP task.
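The retrieval stage can be sketched in a few lines. Here `difflib`'s similarity ratio between buggy snippets stands in for the paper's edit-driven retriever, and the tiny database records are invented for illustration.

```python
import difflib

def retrieve_solutions(buggy, db, k=2):
    """Rank stored repair solutions by the similarity between the learner's
    buggy code and each database record's buggy snippet (difflib ratio
    stands in for the paper's edit-driven retriever)."""
    return sorted(
        db,
        key=lambda rec: difflib.SequenceMatcher(None, buggy, rec["buggy"]).ratio(),
        reverse=True,
    )[:k]

db = [
    {"buggy": "for i in range(len(xs)): s += xs[i]", "fix": "s = sum(xs)",
     "why": "accumulator never initialised"},
    {"buggy": "if x = 1: print(x)", "fix": "if x == 1: print(x)",
     "why": "assignment used instead of comparison"},
]
hits = retrieve_solutions("if y = 2: print(y)", db, k=1)
print(hits[0]["why"])  # -> assignment used instead of comparison
```

The retrieved record's fix and explanation would then be placed in the LLM prompt, so the model repairs the code and explains the bug cause together.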
Submitted 18 January, 2026; v1 submitted 13 January, 2026;
originally announced January 2026.
-
PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning
Authors:
Jingcheng Hu,
Yinmin Zhang,
Shijie Shang,
Xiaobo Yang,
Yue Peng,
Zhewei Huang,
Hebin Zhou,
Xin Wu,
Jie Cheng,
Fanqi Wan,
Xiangwen Kong,
Chengyuan Yao,
Kaiwen Yan,
Ailin Huang,
Hongyu Zhou,
Qi Han,
Zheng Ge,
Daxin Jiang,
Xiangyu Zhang,
Heung-Yeung Shum
Abstract:
We introduce Parallel Coordinated Reasoning (PaCoRe), a training-and-inference framework designed to overcome a central limitation of contemporary language models: their inability to scale test-time compute (TTC) far beyond sequential reasoning under a fixed context window. PaCoRe departs from the traditional sequential paradigm by driving TTC through massive parallel exploration coordinated via a message-passing architecture in multiple rounds. Each round launches many parallel reasoning trajectories, compacts their findings into context-bounded messages, and synthesizes these messages to guide the next round and ultimately produce the final answer. Trained end-to-end with large-scale, outcome-based reinforcement learning, the model masters the synthesis abilities required by PaCoRe and scales to multi-million-token effective TTC without exceeding context limits. The approach yields strong improvements across diverse domains, and notably pushes reasoning beyond frontier systems in mathematics: an 8B model reaches 94.5% on HMMT 2025, surpassing GPT-5's 93.2% by scaling effective TTC to roughly two million tokens. We open-source model checkpoints, training data, and the full inference pipeline to accelerate follow-up work.
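The round structure can be sketched as a small loop. All callables are toy stand-ins for model calls, not the released pipeline; the point is that each round conditions on compacted messages rather than on full trajectories, which is what keeps the context bounded.

```python
def pacore(question, explore, compact, synthesize, n_parallel=4, rounds=2):
    """Parallel coordinated reasoning loop (illustrative sketch).
    Each round launches parallel trajectories, compacts each one into a
    context-bounded message, and conditions the next round on those."""
    messages = []
    for _ in range(rounds):
        trajs = [explore(question, messages) for _ in range(n_parallel)]
        messages = [compact(t) for t in trajs]   # keep total context bounded
    return synthesize(question, messages)

# Toy instantiation: each round improves on the best message so far by 1.
explore = lambda q, msgs: (max(msgs) if msgs else 0) + 1
compact = lambda t: t
synthesize = lambda q, msgs: max(msgs)
print(pacore("toy question", explore, compact, synthesize, rounds=3))  # -> 3
```

With real models the `explore` calls run concurrently, so effective test-time compute scales with `n_parallel * rounds` while the per-call context stays fixed.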
Submitted 9 January, 2026;
originally announced January 2026.
-
Monadic Context Engineering
Authors:
Yifan Zhang,
Yang Yuan,
Mengdi Wang,
Andrew Chi-Chih Yao
Abstract:
The proliferation of Large Language Models (LLMs) has catalyzed a shift towards autonomous agents capable of complex reasoning and tool use. However, current agent architectures are frequently constructed using imperative, ad hoc patterns. This results in brittle systems plagued by difficulties in state management, error handling, and concurrency. This paper introduces Monadic Context Engineering (MCE), a novel architectural paradigm leveraging the algebraic structures of Functors, Applicative Functors, and Monads to provide a formal foundation for agent design. MCE treats agent workflows as computational contexts where cross-cutting concerns, such as state propagation, short-circuiting error handling, and asynchronous execution, are managed intrinsically by the algebraic properties of the abstraction. We demonstrate how Monads enable robust sequential composition, how Applicatives provide a principled structure for parallel execution, and crucially, how Monad Transformers allow for the systematic composition of these capabilities. This layered approach enables developers to construct complex, resilient, and efficient AI agents from simple, independently verifiable components. We further extend this framework to describe Meta-Agents, which leverage MCE for generative orchestration, dynamically creating and managing sub-agent workflows through metaprogramming.
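The short-circuiting error handling the paper attributes to Monads can be shown with a minimal Result type; the agent "tools" below are invented placeholders, and this is a sketch of the pattern rather than the paper's framework.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar, Union

T = TypeVar("T")

@dataclass
class Ok(Generic[T]):
    value: T
    def bind(self, f):
        return f(self.value)          # pass the value to the next step

@dataclass
class Err:
    reason: str
    def bind(self, f):
        return self                   # short-circuit: later steps are skipped

Result = Union[Ok[T], Err]

# An agent workflow as a monadic pipeline (tool names are illustrative):
def parse(q):    return Ok(q.strip()) if q.strip() else Err("empty query")
def plan(q):     return Ok(["search", "summarise"]) if len(q) < 50 else Err("too long")
def execute(ps): return Ok(f"ran {len(ps)} steps")

print(parse("  weather?  ").bind(plan).bind(execute))  # Ok(value='ran 2 steps')
print(parse("   ").bind(plan).bind(execute))           # Err(reason='empty query')
```

Error handling is managed by the structure itself: no step after `parse` needs an explicit failure check, which is the cross-cutting concern MCE pushes into the abstraction.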
Submitted 21 January, 2026; v1 submitted 26 December, 2025;
originally announced December 2025.
-
Spectral entropy prior-guided deep feature fusion architecture for magnetic core loss
Authors:
Cong Yao,
Chunye Gong,
Jin Zhang
Abstract:
Accurate core loss modeling is critical for the design of high-efficiency power electronic systems. Traditional core loss modeling methods have limitations in prediction accuracy. To advance this field, the IEEE Power Electronics Society launched the MagNet Challenge in 2023, the first international competition focused on data-driven power electronics design methods, aiming to uncover complex loss patterns in magnetic components through a data-driven paradigm. Although purely data-driven models demonstrate strong fitting performance, their interpretability and cross-distribution generalization capabilities remain limited. To address these issues, this paper proposes a hybrid model, SEPI-TFPNet, which integrates empirical models with deep learning. The physical-prior submodule employs a spectral entropy discrimination mechanism to select the most suitable empirical model under different excitation waveforms. The data-driven submodule incorporates convolutional neural networks, multi-head attention mechanisms, and bidirectional long short-term memory networks to extract flux-density time-series features. An adaptive feature fusion module is introduced to improve multimodal feature interaction and integration. Using the MagNet dataset containing various magnetic materials, this paper evaluates the proposed method and compares it with 21 representative models from the 2023 challenge and three advanced methods from 2024-2025. The results show that the proposed method achieves improved modeling accuracy and robustness.
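A spectral entropy discriminator of the kind the prior submodule could use is easy to sketch: a near-sinusoidal excitation concentrates its power spectrum in one bin (low entropy), while a harmonic-rich waveform spreads it out. The normalisation below is an assumed formula for illustration, not necessarily the paper's.

```python
import numpy as np

def spectral_entropy(x):
    """Normalised Shannon entropy of the power spectrum: near 0 for a pure
    tone, larger for harmonic-rich excitations (assumed formula)."""
    p = np.abs(np.fft.rfft(x - x.mean())) ** 2
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(len(x)))

t = np.linspace(0, 1, 1024, endpoint=False)
sine = np.sin(2 * np.pi * 50 * t)             # concentrated spectrum
tri = 2 * np.abs((50 * t) % 1 - 0.5)          # harmonic-rich triangular wave
print(spectral_entropy(sine) < spectral_entropy(tri))  # -> True
```

Thresholding such a statistic would let the physical-prior submodule route sinusoidal excitations to, e.g., a Steinmetz-style empirical model and non-sinusoidal ones to a waveform-aware variant.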
Submitted 12 December, 2025;
originally announced December 2025.
-
CLASH: Collaborative Large-Small Hierarchical Framework for Continuous Vision-and-Language Navigation
Authors:
Liuyi Wang,
Zongtao He,
Jinlong Li,
Ruihao Xia,
Mengxian Hu,
Chenpeng Yao,
Chengju Liu,
Yang Tang,
Qijun Chen
Abstract:
Vision-and-Language Navigation (VLN) requires robots to follow natural language instructions and navigate complex environments without prior maps. While recent vision-language large models demonstrate strong reasoning abilities, they often underperform task-specific panoramic small models in VLN tasks. To address this, we propose CLASH (Collaborative Large-Small Hierarchy), a VLN-CE framework that integrates a reactive small-model planner (RSMP) with a reflective large-model reasoner (RLMR). RSMP adopts a causal-learning-based dual-branch architecture to enhance generalization, while RLMR leverages panoramic visual prompting with chain-of-thought reasoning to support interpretable spatial understanding and navigation. We further introduce an uncertainty-aware collaboration mechanism (UCM) that adaptively fuses decisions from both models. For obstacle avoidance, in simulation we replace the rule-based controller with a fully learnable point-goal policy, and in real-world deployment we design a LiDAR-based clustering module for generating navigable waypoints and pair it with an online SLAM-based local controller. CLASH achieves state-of-the-art (SoTA) results (ranking 1st) on the VLN-CE leaderboard, significantly improving SR and SPL on the test-unseen set over the previous SoTA methods. Real-world experiments demonstrate CLASH's strong robustness, validating its effectiveness in both simulation and deployment scenarios.
Submitted 23 January, 2026; v1 submitted 11 December, 2025;
originally announced December 2025.
-
Group Representational Position Encoding
Authors:
Yifan Zhang,
Zixiang Chen,
Yifeng Liu,
Zhen Qin,
Huizhuo Yuan,
Kangping Xu,
Yang Yuan,
Quanquan Gu,
Andrew Chi-Chih Yao
Abstract:
We present GRAPE (Group Representational Position Encoding), a unified framework for positional encoding based on group actions. GRAPE unifies two families of mechanisms: (i) multiplicative rotations (Multiplicative GRAPE) in $\operatorname{SO}(d)$ and (ii) additive logit biases (Additive GRAPE) arising from unipotent actions in the general linear group $\mathrm{GL}$. In Multiplicative GRAPE, a position $n \in \mathbb{Z}$ (or $t \in \mathbb{R}$) acts as $\mathbf{G}(n) = \exp(n \, ω\, \mathbf{L})$ with a rank-2 skew-symmetric generator $\mathbf{L} \in \mathbb{R}^{d \times d}$, yielding a relative, compositional, norm-preserving map with a closed-form matrix exponential. RoPE is recovered exactly when the $d/2$ planes correspond to canonical coordinate pairs with a log-uniform spectrum. Learned commuting subspaces and compact non-commuting mixtures strictly extend this geometry to capture cross-subspace feature coupling at $O(d)$ and $O(r d)$ cost per head, respectively. In Additive GRAPE, additive logits arise from rank-1 (or low-rank) unipotent actions, recovering ALiBi and the Forgetting Transformer (FoX) as exact special cases while preserving an exact relative law and streaming cacheability. Overall, GRAPE provides a principled design space for positional geometry in long-context models, subsuming RoPE and ALiBi as special cases. Project page: https://github.com/model-architectures/GRAPE.
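A minimal instance of Multiplicative GRAPE can be checked numerically: build a rank-2 skew-symmetric generator $\mathbf{L} = uv^\top - vu^\top$ from two orthonormal vectors, exponentiate, and verify that the map is norm-preserving and compositional, $\mathbf{G}(n)\mathbf{G}(m) = \mathbf{G}(n+m)$. The dimension and frequency below are arbitrary illustrative choices.

```python
import numpy as np
from scipy.linalg import expm

def grape_rotation(n, omega, u, v):
    """Multiplicative GRAPE position map G(n) = expm(n * omega * L) with
    rank-2 skew generator L = u v^T - v u^T (u, v orthonormal)."""
    L = np.outer(u, v) - np.outer(v, u)
    return expm(n * omega * L)

d = 4
u, v = np.eye(d)[0], np.eye(d)[1]   # canonical plane -> a RoPE-like rotation
G = lambda n: grape_rotation(n, 0.1, u, v)

q = np.random.default_rng(0).standard_normal(d)
print(np.allclose(np.linalg.norm(G(3) @ q), np.linalg.norm(q)))  # norm-preserving
print(np.allclose(G(2) @ G(5), G(7)))                            # relative law
```

Choosing `u`, `v` as canonical coordinate pairs in each of the $d/2$ planes, with a log-uniform spectrum of `omega`, recovers RoPE; learned non-canonical planes are the strict extension the abstract describes.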
Submitted 1 April, 2026; v1 submitted 8 December, 2025;
originally announced December 2025.
-
Aging-driven in situ polymerization of FEC additive boosts the calendar-life of silicon anodes via surface passivation enhancement
Authors:
Sattajit Barua,
Rownak J. Mou,
Koffi P. C. Yao
Abstract:
The role of additives such as FEC in extending the calendar life of silicon anodes beyond the cycling benefits is still not fully understood. Herein, the calendar life of high-loading Si (80 wt%) using a baseline 1.2 M LiPF6 in EC-EMC electrolyte versus adding 10 wt% FEC is investigated over months. Over 8 days of aging, FEC leads to a 13-fold reduction in irreversible capacity loss in Si-LiFePO4 full cells. Cells without FEC are projected to fall below 80% of their initial capacity within approx. 22 days versus approx. 279 days with FEC. Symmetric Si-Si cells from harvested electrodes show a greater increase in interphase resistance without FEC, whereby an increase of 10.81 Ohms is measured for 0 wt% FEC vs. only 3.37 Ohms for 10 wt% FEC over 2 months. Power-law modeling of this long-term interphase resistance finds mixed transport-reaction growth behavior in FEC-free cells, suggesting significant dissolution, whereas cells with 10 wt% FEC added display diffusion-controlled impedance growth, suggesting a robust surface passivation film. Post-mortem FTIR and XPS confirm polycarbonate enrichment of the SEI, which was discovered to emerge predominantly from FEC self-polymerization during idle aging. When the Si electrodes aged with and without FEC are harvested and reassembled into full cells with the same electrolytes used during aging, the first-cycle coulombic efficiency is 71% for 0 wt% FEC versus 97% for 10 wt% FEC. Subsequent cycling maintains over 99.7% CE with 10 wt% FEC, surpassing the pre-aging CE of 98.8%. This elevated CE indicates better passivation by the polymer fragments formed during aging compared to the electrochemically formed SEI, where no strong polymer FTIR signal is found. Self-polymerization during idle aging with additives such as FEC is therefore an opportune in situ mechanism that can be further engineered to extend the life of Si-based batteries.
Submitted 30 November, 2025;
originally announced December 2025.
-
Think How Your Teammates Think: Active Inference Can Benefit Decentralized Execution
Authors:
Hao Wu,
Shoucheng Song,
Chang Yao,
Sheng Han,
Huaiyu Wan,
Youfang Lin,
Kai Lv
Abstract:
In multi-agent systems, explicit cognition of teammates' decision logic serves as a critical factor in facilitating coordination. Communication (i.e., ``\textit{Tell}'') can assist the cognitive development process through information dissemination, yet it is inevitably subject to real-world constraints such as noise, latency, and attacks. Therefore, building an understanding of teammates' decisions without communication remains challenging. To address this, we propose a novel non-communication MARL framework that realizes the construction of cognition through local observation-based modeling (i.e., \textit{``Think''}). Our framework enables agents to model teammates' \textbf{active inference} process. First, the proposed method produces three teammate portraits: perception, belief, and action. Specifically, we model the teammate's decision process as follows: 1) Perception: observing environments; 2) Belief: forming beliefs; 3) Action: making decisions. Then, we selectively integrate the belief portrait into the decision process based on the accuracy and relevance of the perception portrait. This enables the selection of cooperative teammates and facilitates effective collaboration. Extensive experiments on the SMAC, SMACv2, MPE, and GRF benchmarks demonstrate the superior performance of our method.
Submitted 23 November, 2025;
originally announced November 2025.
-
Step-Audio-R1 Technical Report
Authors:
Fei Tian,
Xiangyu Tony Zhang,
Yuxin Zhang,
Haoyang Zhang,
Yuxin Li,
Daijiao Liu,
Yayue Deng,
Donghang Wu,
Jun Chen,
Liang Zhao,
Chengyuan Yao,
Hexin Liu,
Eng Siong Chng,
Xuerui Yang,
Xiangyu Zhang,
Daxin Jiang,
Gang Yu
Abstract:
Recent advances in reasoning models have demonstrated remarkable success in text and vision domains through extended chain-of-thought deliberation. However, a perplexing phenomenon persists in audio language models: they consistently perform better with minimal or no reasoning, raising a fundamental question - can audio intelligence truly benefit from deliberate thinking? We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Through our proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that genuinely ground themselves in acoustic features rather than hallucinating disconnected deliberations. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro across comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. These results demonstrate that reasoning is a transferable capability across modalities when appropriately anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence. By establishing the first successful audio reasoning model, Step-Audio-R1 opens new pathways toward building truly multimodal reasoning systems that think deeply across all sensory modalities.
Submitted 26 November, 2025; v1 submitted 19 November, 2025;
originally announced November 2025.
-
Heterogeneous Attributed Graph Learning via Neighborhood-Aware Star Kernels
Authors:
Hong Huang,
Chengyu Yao,
Haiming Chen,
Hang Gao
Abstract:
Attributed graphs, typically characterized by irregular topologies and a mix of numerical and categorical attributes, are ubiquitous in diverse domains such as social networks, bioinformatics, and cheminformatics. While graph kernels provide a principled framework for measuring graph similarity, existing kernel methods often struggle to simultaneously capture heterogeneous attribute semantics and neighborhood information in attributed graphs. In this work, we propose the Neighborhood-Aware Star Kernel (NASK), a novel graph kernel designed for attributed graph learning. NASK leverages an exponential transformation of the Gower similarity coefficient to jointly model numerical and categorical features efficiently, and employs star substructures enhanced by Weisfeiler-Lehman iterations to integrate multi-scale neighborhood structural information. We theoretically prove that NASK is positive definite, ensuring compatibility with kernel-based learning frameworks such as SVMs. Extensive experiments are conducted on eleven attributed and four large-scale real-world graph benchmarks. The results demonstrate that NASK consistently achieves superior performance over sixteen state-of-the-art baselines, including nine graph kernels and seven Graph Neural Networks.
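The Gower-based node similarity at the heart of NASK can be sketched as follows. The exact functional form and parameter names here are assumptions for illustration (the paper only states that NASK applies an exponential transformation of the Gower coefficient); the Gower coefficient itself is standard: range-normalized distance for numerical attributes, exact match for categorical ones.

```python
import math

def gower_similarity(a, b, numeric_ranges):
    """Gower coefficient for mixed attributes: numerical parts use a
    range-normalized distance, categorical parts an exact match."""
    parts = []
    for key in a:
        if key in numeric_ranges:  # numerical attribute
            parts.append(1.0 - abs(a[key] - b[key]) / numeric_ranges[key])
        else:                      # categorical attribute
            parts.append(1.0 if a[key] == b[key] else 0.0)
    return sum(parts) / len(parts)

def node_kernel(a, b, numeric_ranges, gamma=1.0):
    # hypothetical exponential transformation of the Gower coefficient:
    # exp(-gamma * (1 - s)) lies in (0, 1] and equals 1 for identical nodes
    return math.exp(-gamma * (1.0 - gower_similarity(a, b, numeric_ranges)))
```

In the full kernel, such node-level similarities would be aggregated over star substructures whose labels are refined by Weisfeiler-Lehman iterations.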
Submitted 14 November, 2025;
originally announced November 2025.
-
Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos
Authors:
Qixiu Li,
Yu Deng,
Yaobo Liang,
Lin Luo,
Lei Zhou,
Chengtang Yao,
Lingqi Zeng,
Zhiyuan Feng,
Huizhi Liang,
Sicheng Xu,
Yizhong Zhang,
Xi Chen,
Hao Chen,
Lily Sun,
Dong Chen,
Jiaolong Yang,
Baining Guo
Abstract:
This paper presents a novel approach for pretraining robotic manipulation Vision-Language-Action (VLA) models using a large corpus of unscripted real-life video recordings of human hand activities. Treating the human hand as a dexterous robot end-effector, we show that "in-the-wild" egocentric human videos without any annotations can be transformed into data formats fully aligned with existing robotic V-L-A training data in terms of task granularity and labels. This is achieved by the development of a fully-automated holistic human activity analysis approach for arbitrary human hand videos. This approach can generate atomic-level hand activity segments and their language descriptions, each accompanied by framewise 3D hand motion and camera motion. We process a large volume of egocentric videos and create a hand-VLA training dataset containing 1M episodes and 26M frames. This training data covers a wide range of objects and concepts, dexterous manipulation tasks, and environment variations in real life, vastly exceeding the coverage of existing robot data. We design a dexterous hand VLA model architecture and pretrain the model on this dataset. The model exhibits strong zero-shot capabilities on completely unseen real-world observations. Additionally, fine-tuning it on a small amount of real robot action data significantly improves task success rates and generalization to novel objects in real robotic experiments. We also demonstrate the appealing scaling behavior of the model's task performance with respect to pretraining data scale. We believe this work lays a solid foundation for scalable VLA pretraining, advancing robots toward truly generalizable embodied intelligence.
Submitted 24 October, 2025;
originally announced October 2025.
-
Diffeomorphic solutions of Ahlfors-Hopf equations
Authors:
Gaven Martin,
Cong Yao
Abstract:
Here we advance the study of the boundary value problem for extremal functions of mean distortion and the associated Teichmüller spaces interpolating between the classical examples of extremal quasiconformal mappings and the more recent approach through harmonic mappings (of extreme Dirichlet energy). In this paper we focus on the Ahlfors-Hopf differential \[ Φ=\mathcal{A}(\mathbb{K}(w,h))h_w\,\overline{h_{\overline{w}}}\, η(h), \] where $h=f^{-1}$ is the pseudo-inverse of an extremal mapping $f$ for the problem \[ \inf_{f:\mathbb{D}\to\mathbb{D}}\int_\mathbb{D} \mathcal{A}(\mathbb{K}(z,f)) \; dz, \quad\quad \mathbb{K}(z,f) = \frac{|f_z|^2+|f_{\overline{z}}|^2}{|f_z|^2-|f_{\overline{z}}|^2}, \] where the infimum is taken over those homeomorphisms of finite distortion $f:\overline{\mathbb{D}}\to\overline{\mathbb{D}}$ with $f|\mathbb{S}=f_0$, typically a quasisymmetric barrier function. The inner-variational equations, an analogue of the Euler-Lagrange equations, show $Φ$ is holomorphic at an extremal. Exploiting this Ahlfors-Hopf differential, we prove that an extreme point $f$ is a local diffeomorphism in $\mathbb{D}$, resolving some conjectures in [16].
Submitted 8 January, 2026; v1 submitted 22 October, 2025;
originally announced October 2025.
-
Step-Aware Residual-Guided Diffusion for EEG Spatial Super-Resolution
Authors:
Hongjun Liu,
Leyu Zhou,
Zijianghao Yang,
Chao Yao
Abstract:
For real-world BCI applications, lightweight Electroencephalography (EEG) systems offer the best cost-deployment balance. However, such spatial sparsity of EEG limits spatial fidelity, hurting learning and introducing bias. EEG spatial super-resolution methods aim to recover high-density EEG signals from sparse measurements, yet are often hindered by distribution shift and signal distortion, reducing fidelity and usability for EEG analysis and visualization. To overcome these challenges, we introduce SRGDiff, a step-aware residual-guided diffusion model that formulates EEG spatial super-resolution as dynamic conditional generation. Our key idea is to learn a dynamic residual condition from the low-density input that predicts the step-wise temporal and spatial details to add, and to use the evolving cue to steer the denoising process toward high-density reconstructions. At each denoising step, the proposed residual condition is additively fused with the previous denoiser feature maps, then a step-dependent affine modulation scales and shifts the activation to produce the current features. This iterative procedure dynamically extracts step-wise temporal rhythms and spatial-topographic cues to steer high-density recovery and maintain a fidelity-consistency balance. We adopt a comprehensive evaluation protocol spanning signal-, feature-, and downstream-level metrics across SEED, SEED-IV, and Localize-MI and multiple upsampling scales. SRGDiff achieves consistent gains of up to 40% over strong baselines, proving its superiority in the task of EEG spatial super-resolution. Moreover, topographic visualization comparisons and substantial EEG-FID gains jointly indicate that our SR EEG mitigates the spatial-spectral shift between low- and high-density recordings. Our code is available at https://github.com/DhrLhj/ICLR2026SRGDiff.
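The per-step conditioning described in the abstract (additive fusion of the residual condition, then a step-dependent affine modulation) can be sketched as a one-line feature transform. This is a schematic on flat feature vectors, not the SRGDiff implementation; the function and parameter names are illustrative.

```python
def modulate_features(prev_features, residual_condition, scale_t, shift_t):
    """One denoising step of the conditioning scheme (schematic): the residual
    condition is additively fused with the previous denoiser feature maps, then
    a step-dependent affine modulation scales and shifts the activation."""
    fused = [f + r for f, r in zip(prev_features, residual_condition)]
    return [scale_t * x + shift_t for x in fused]
```

In the actual model, `scale_t` and `shift_t` would be predicted per denoising step, so the residual cue's influence evolves over the trajectory.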
Submitted 22 February, 2026; v1 submitted 21 October, 2025;
originally announced October 2025.
-
Symmetric Entropy-Constrained Video Coding for Machines
Authors:
Yuxiao Sun,
Meiqin Liu,
Chao Yao,
Qi Tang,
Jian Jin,
Weisi Lin,
Frederic Dufaux,
Yao Zhao
Abstract:
As video transmission increasingly serves machine vision systems (MVS) instead of human vision systems (HVS), video coding for machines (VCM) has become a critical research topic. Existing VCM methods often bind codecs to specific downstream models, requiring retraining or supervised data, thus limiting generalization in multi-task scenarios. Recently, unified VCM frameworks have employed visual backbones (VB) and visual foundation models (VFM) to support multiple video understanding tasks with a single codec. They mainly utilize VB/VFM to maintain semantic consistency or suppress non-semantic information, but seldom explore how to directly link video coding with understanding under VB/VFM guidance. Hence, we propose a Symmetric Entropy-Constrained Video Coding framework for Machines (SEC-VCM). It establishes a symmetric alignment between the video codec and VB, allowing the codec to leverage VB's representation capabilities to preserve semantics and discard MVS-irrelevant information. Specifically, a bi-directional entropy-constraint (BiEC) mechanism ensures symmetry between the process of video decoding and VB encoding by suppressing conditional entropy. This helps the codec to explicitly handle semantic information beneficial to MVS while squeezing useless information. Furthermore, a semantic-pixel dual-path fusion (SPDF) module injects pixel-level priors into the final reconstruction. Through semantic-pixel fusion, it suppresses artifacts harmful to MVS and improves machine-oriented reconstruction quality. Experimental results show our framework achieves state-of-the-art~(SOTA) in rate-task performance, with significant bitrate savings over VTM on video instance segmentation (37.4%), video object segmentation (29.8%), object detection (46.2%), and multiple object tracking (44.9%). We will release our code soon.
Submitted 31 October, 2025; v1 submitted 17 October, 2025;
originally announced October 2025.
-
OS-HGAdapter: Open Semantic Hypergraph Adapter for Large Language Models Assisted Entropy-Enhanced Image-Text Alignment
Authors:
Rongjun Chen,
Chengsi Yao,
Jinchang Ren,
Xianxian Zeng,
Peixian Wang,
Jun Yuan,
Jiawen Li,
Huimin Zhao,
Xu Lu
Abstract:
Text-image alignment constitutes a foundational challenge in multimedia content understanding, where effective modeling of cross-modal semantic correspondences critically enhances retrieval system performance through joint embedding space optimization. Given the inherent difference in information entropy between texts and images, conventional approaches often show an imbalance in the mutual retrieval of these two modalities. To address this particular challenge, we propose to use the open semantic knowledge of Large Language Models (LLMs) to fill the entropy gap and reproduce the alignment ability of humans in these tasks. Our entropy-enhancing alignment is achieved through a two-step process: 1) a new prompt template that does not rely on explicit knowledge in the task domain is designed to use the LLM to enrich the polysemous description of the text modality, thereby increasing the information entropy of the text modality relative to the visual modality; 2) a hypergraph adapter is used to construct multilateral connections between the text and image modalities, which can correct positive and negative matching errors for synonymous semantics in the same fixed embedding space, whilst reducing the noise caused by open semantic entropy by mapping the reduced dimensions back to the original dimensions. Comprehensive evaluations on the Flickr30K and MS-COCO benchmarks validate the superiority of our Open Semantic Hypergraph Adapter (OS-HGAdapter), showcasing 16.8\% (text-to-image) and 40.1\% (image-to-text) cross-modal retrieval gains over existing methods while establishing new state-of-the-art performance in semantic alignment tasks.
Submitted 15 October, 2025;
originally announced October 2025.
-
SViM3D: Stable Video Material Diffusion for Single Image 3D Generation
Authors:
Andreas Engelhardt,
Mark Boss,
Vikram Voleti,
Chun-Han Yao,
Hendrik P. A. Lensch,
Varun Jampani
Abstract:
We present Stable Video Materials 3D (SViM3D), a framework to predict multi-view consistent physically based rendering (PBR) materials, given a single image. Recently, video diffusion models have been successfully used to reconstruct 3D objects from a single image efficiently. However, reflectance is still represented by simple material models or needs to be estimated in additional steps to enable relighting and controlled appearance edits. We extend a latent video diffusion model to output spatially varying PBR parameters and surface normals jointly with each generated view based on explicit camera control. This unique setup allows for relighting and generating a 3D asset using our model as neural prior. We introduce various mechanisms to this pipeline that improve quality in this ill-posed setting. We show state-of-the-art relighting and novel view synthesis performance on multiple object-centric datasets. Our method generalizes to diverse inputs, enabling the generation of relightable 3D assets useful in AR/VR, movies, games and other visual media.
Submitted 1 November, 2025; v1 submitted 9 October, 2025;
originally announced October 2025.
-
Lattice Translation Modulated Symmetries and TFTs
Authors:
Ching-Yu Yao
Abstract:
Modulated symmetries are internal symmetries that are not invariant under spacetime symmetry actions. We propose a general way to describe the lattice translation modulated symmetries in 1+1D, including the non-invertible ones, via the tensor network language. We demonstrate that the modulations can be described by some autoequivalences of the categories. Although the topological behaviors are broken because of the presence of modulations, we can still construct the modulated version of the symmetry TFT bulks by inserting a series of domain walls described by invertible bimodule categories. This structure not only recovers some known results on invertible modulated symmetries but also provides a general framework to tackle modulated symmetries in a more general setting.
Submitted 8 December, 2025; v1 submitted 4 October, 2025;
originally announced October 2025.
-
U-SWIFT: A Unified Surface Wave Inversion Framework with Transformer via Normalization of Dispersion Curves
Authors:
Tianjian Cheng,
Hongrui Xu,
Jiayu Feng,
Xiongyu Hu,
Chaofan Yao
Abstract:
Deep learning is an increasingly popular approach for inverting surface wave dispersion curves to obtain Vs profiles. However, its generalizability is constrained by the depth and velocity scales of training data. We propose a unified deep learning framework that overcomes this limitation via normalization of dispersion curves. By leveraging the scaling properties of dispersion curves, our approach enables a single, pre-trained model to predict Vs profiles across diverse scales, from shallow subsurface (e.g., < 10 m depth) to crustal levels. The framework incorporates a novel transformer-based model to handle variable-length dispersion curves and removes tedious manual parameterization. Results from synthetic and field data demonstrate that it delivers rapid and robust inversions with uncertainty estimates. This work provides an efficient inversion approach applicable to a wide spectrum of applications, from near-surface engineering to crustal imaging. The framework establishes a paradigm for developing scale-invariant deep learning models in geophysical inversion.
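The normalization idea above can be sketched concretely: scale the dispersion curve by reference values so the network only ever sees dimensionless inputs, then map its dimensionless Vs prediction back to physical units. The choice of reference values here (the curve's own maxima) is an assumption for illustration; the paper's exact scheme may differ.

```python
def normalize_dispersion_curve(wavelengths, phase_velocities):
    """Reduce a dispersion curve to a dimensionless form (assumed scheme:
    divide by the curve's own maxima), so a single pre-trained model can
    serve profiles from the shallow subsurface to crustal scales."""
    l_ref = max(wavelengths)
    v_ref = max(phase_velocities)
    return ([l / l_ref for l in wavelengths],
            [v / v_ref for v in phase_velocities],
            (l_ref, v_ref))

def denormalize_vs_profile(depths_norm, vs_norm, l_ref, v_ref):
    # map the network's dimensionless Vs profile back to physical units
    return ([z * l_ref for z in depths_norm],
            [v * v_ref for v in vs_norm])
```

This exploits the scaling property of surface-wave dispersion: jointly rescaling depths and velocities rescales the dispersion curve correspondingly, so curves at different scales collapse onto the same normalized family.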
Submitted 29 September, 2025;
originally announced September 2025.
-
Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends
Authors:
Chaorui Yao,
Yanxi Chen,
Yuchang Sun,
Yushuo Chen,
Wenhao Zhang,
Xuchen Pan,
Yaliang Li,
Bolin Ding
Abstract:
Off-policy reinforcement learning (RL) for large language models (LLMs) is attracting growing interest, driven by practical constraints in real-world applications, the complexity of LLM-RL infrastructure, and the need for further innovations of RL methodologies. While classic REINFORCE and its modern variants like Group Relative Policy Optimization (GRPO) are typically regarded as on-policy algorithms with limited tolerance of off-policyness, we present in this work a first-principles derivation for group-relative REINFORCE -- a REINFORCE variant that uses the within-group mean reward as the baseline for advantage calculation -- without assuming a specific training data distribution, showing that it admits a native off-policy interpretation. This perspective yields two general principles for adapting REINFORCE to truly off-policy settings: regularizing policy updates, and actively shaping the data distribution. Our analysis demystifies some myths about the roles of importance sampling and clipping in GRPO, unifies and reinterprets two recent algorithms -- Online Policy Mirror Descent and Asymmetric REINFORCE -- as regularized forms of the REINFORCE loss, and offers theoretical justification for seemingly heuristic data-weighting strategies. Our findings lead to actionable insights that are validated with extensive empirical studies, and open up new opportunities for principled algorithm design in off-policy RL for LLMs. Source code for this work is available at https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/rec_gsm8k.
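The core object analyzed above, group-relative REINFORCE with the within-group mean reward as the baseline, is easy to state in code. This is a minimal sketch of that baseline and the plain REINFORCE surrogate loss, without the importance-sampling and clipping terms the paper examines; function names are illustrative.

```python
def group_relative_advantages(group_rewards):
    """Group-relative REINFORCE: the within-group mean reward serves as the
    baseline, so advantages are rewards centered within each group."""
    baseline = sum(group_rewards) / len(group_rewards)
    return [r - baseline for r in group_rewards]

def reinforce_loss(logprobs, advantages):
    # surrogate whose gradient is the policy gradient: -sum_i A_i * log pi(y_i | x)
    return -sum(a * lp for a, lp in zip(advantages, logprobs))
```

Note that the advantages in each group sum to zero by construction, which is one reason the within-group baseline reduces gradient variance without introducing bias.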
Submitted 1 March, 2026; v1 submitted 28 September, 2025;
originally announced September 2025.
-
Cognitive-Level Adaptive Generation via Capability-Aware Retrieval and Style Adaptation
Authors:
Qingsong Wang,
Tao Wu,
Wang Lin,
Yueying Feng,
Gongsheng Yuan,
Chang Yao,
Jingyuan Chen
Abstract:
Large Language Models (LLMs) have demonstrated strong performance in open-ended generation tasks. However, they often struggle to adapt content to users with differing cognitive capacities, leading to a phenomenon we term cognitive misalignment. This issue arises in two forms: knowledge-level misalignment, where content is too complex or too simplistic relative to user understanding, and presentation-style misalignment, where the structure or tone hinders effective comprehension. To address these challenges, we propose the Cognitive-Level Alignment Framework (CLAF), a general-purpose generation framework that aligns both knowledge complexity and presentation style with user cognition. CLAF integrates a capability-aware retrieval module based on a hierarchical knowledge graph and a style optimization module guided by Bloom's taxonomy and preference learning. Additionally, a knowledge-controllable generation component ensures consistency and relevance throughout the output. To support training and evaluation, we construct SCALE, a cognitively annotated dataset containing responses at multiple comprehension levels per query. Empirical results show that CLAF enhances the adaptability and informativeness of LLM outputs across a range of user profiles, offering a robust solution to cognitive-level alignment in real-world applications.
Submitted 15 September, 2025;
originally announced September 2025.
-
Reward Evolution with Graph-of-Thoughts: A Bi-Level Language Model Framework for Reinforcement Learning
Authors:
Changwei Yao,
Xinzi Liu,
Chen Li,
Marios Savvides
Abstract:
Designing effective reward functions remains a major challenge in reinforcement learning (RL), often requiring considerable human expertise and iterative refinement. Recent advances leverage Large Language Models (LLMs) for automated reward design, but these approaches are limited by hallucinations, reliance on human feedback, and challenges with handling complex, multi-step tasks. In this work, we introduce Reward Evolution with Graph-of-Thoughts (RE-GoT), a novel bi-level framework that enhances LLMs with structured graph-based reasoning and integrates Visual Language Models (VLMs) for automated rollout evaluation. RE-GoT first decomposes tasks into text-attributed graphs, enabling comprehensive analysis and reward function generation, and then iteratively refines rewards using visual feedback from VLMs without human intervention. Extensive experiments on 10 RoboGen and 4 ManiSkill2 tasks demonstrate that RE-GoT consistently outperforms existing LLM-based baselines. On RoboGen, our method improves average task success rates by 32.25%, with notable gains on complex multi-step tasks. On ManiSkill2, RE-GoT achieves an average success rate of 93.73% across four diverse manipulation tasks, significantly surpassing prior LLM-based approaches and even exceeding expert-designed rewards. Our results indicate that combining LLMs and VLMs with graph-of-thoughts reasoning provides a scalable and effective solution for autonomous reward evolution in RL.
Submitted 24 March, 2026; v1 submitted 19 September, 2025;
originally announced September 2025.
-
Stable Part Diffusion 4D: Multi-View RGB and Kinematic Parts Video Generation
Authors:
Hao Zhang,
Chun-Han Yao,
Simon Donné,
Narendra Ahuja,
Varun Jampani
Abstract:
We present Stable Part Diffusion 4D (SP4D), a framework for generating paired RGB and kinematic part videos from monocular inputs. Unlike conventional part segmentation methods that rely on appearance-based semantic cues, SP4D learns to produce kinematic parts - structural components aligned with object articulation and consistent across views and time. SP4D adopts a dual-branch diffusion model that jointly synthesizes RGB frames and corresponding part segmentation maps. To simplify the architecture and flexibly enable different part counts, we introduce a spatial color encoding scheme that maps part masks to continuous RGB-like images. This encoding allows the segmentation branch to share the latent VAE from the RGB branch, while enabling part segmentation to be recovered via straightforward post-processing. A Bidirectional Diffusion Fusion (BiDiFuse) module enhances cross-branch consistency, supported by a contrastive part consistency loss to promote spatial and temporal alignment of part predictions. We demonstrate that the generated 2D part maps can be lifted to 3D to derive skeletal structures and harmonic skinning weights with few manual adjustments. To train and evaluate SP4D, we construct KinematicParts20K, a curated dataset of over 20K rigged objects selected and processed from Objaverse XL (Deitke et al., 2023), each paired with multi-view RGB and part video sequences. Experiments show that SP4D generalizes strongly to diverse scenarios, including real-world videos, novel generated objects, and rare articulated poses, producing kinematic-aware outputs suitable for downstream animation and motion-related tasks.
Submitted 4 November, 2025; v1 submitted 12 September, 2025;
originally announced September 2025.
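The spatial color encoding idea (part masks mapped to continuous RGB-like images so the segmentation branch can share the RGB branch's VAE, then recovered by post-processing) can be illustrated with a toy round-trip. The palette and nearest-color decoding below are illustrative assumptions, not SP4D's actual scheme.

```python
# Toy sketch: encode integer part ids as RGB colors, decode by nearest
# palette color. The 4-entry palette is purely illustrative.

PALETTE = {0: (0, 0, 0), 1: (255, 0, 0), 2: (0, 255, 0), 3: (0, 0, 255)}

def encode(mask):
    """Part-id mask (2D list of ints) -> RGB-like image (2D list of tuples)."""
    return [[PALETTE[p] for p in row] for row in mask]

def decode(rgb):
    """RGB-like image -> part-id mask via nearest palette color (squared L2)."""
    def nearest(color):
        return min(PALETTE, key=lambda k: sum((a - b) ** 2
                                              for a, b in zip(PALETTE[k], color)))
    return [[nearest(c) for c in row] for row in rgb]

mask = [[0, 1], [2, 3]]
assert decode(encode(mask)) == mask          # exact round-trip
assert decode([[(250, 4, 4)]]) == [[1]]      # robust to small color noise
```

Because decoding is a nearest-color lookup, the diffusion branch can emit slightly off-palette continuous values and post-processing still recovers discrete part labels, which is what lets the part branch reuse the RGB latent VAE.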
-
Powering Job Search at Scale: LLM-Enhanced Query Understanding in Job Matching Systems
Authors:
Ping Liu,
Jianqiang Shen,
Qianqi Shen,
Chunnan Yao,
Kevin Kao,
Dan Xu,
Rajat Arora,
Baofen Zheng,
Caleb Johnson,
Liangjie Hong,
Jingwei Wu,
Wenjing Zhang
Abstract:
Query understanding is essential in modern relevance systems, where user queries are often short, ambiguous, and highly context-dependent. Traditional approaches often rely on multiple task-specific Named Entity Recognition (NER) models to extract structured facets, as seen in job search applications. However, this fragmented architecture is brittle, expensive to maintain, and slow to adapt to evolving taxonomies and language patterns. In this paper, we introduce a unified query understanding framework powered by a Large Language Model (LLM), designed to address these limitations. Our approach jointly models the user query and contextual signals such as profile attributes to generate structured interpretations that drive more accurate and personalized recommendations. The framework improves relevance quality in online A/B testing while significantly reducing system complexity and operational overhead. The results demonstrate that our solution provides a scalable and adaptable foundation for query understanding in dynamic web applications.
Submitted 19 August, 2025;
originally announced September 2025.
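The replacement of multiple task-specific NER models with one structured LLM call can be sketched as a single prompt-and-parse step. The facet names, prompt wording, and JSON shape below are illustrative assumptions, not the production system's schema.

```python
import json

def build_prompt(query: str, profile: dict) -> str:
    """Combine the raw query with contextual profile signals into one prompt
    requesting a structured interpretation (facets as JSON). Facet keys are
    hypothetical examples."""
    return (
        "Extract job-search facets as JSON with keys "
        '["title", "location", "seniority"].\n'
        f"User profile: {json.dumps(profile)}\n"
        f"Query: {query}\n"
    )

def parse_interpretation(llm_output: str) -> dict:
    # One structured parse replaces several task-specific NER extractors.
    return json.loads(llm_output)

prompt = build_prompt("senior ml engineer bay area", {"industry": "software"})
facets = parse_interpretation(
    '{"title": "ml engineer", "location": "bay area", "seniority": "senior"}'
)
```

A design consequence worth noting: adapting to a new facet or taxonomy change becomes a prompt/schema edit rather than training and deploying another extraction model.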
-
Instructional Prompt Optimization for Few-Shot LLM-Based Recommendations on Cold-Start Users
Authors:
Haowei Yang,
Yushang Zhao,
Sitao Min,
Bo Su,
Chao Yao,
Wei Xu
Abstract:
The cold-start user problem undermines the effectiveness of recommender systems by limiting access to historical behavioral information. Optimizing instructional prompts for a few-shot large language model (LLM) is an effective pipeline for such recommendation tasks. We introduce a context-conditioned prompt formulation method P(u, D_s) → R̂, where u is a cold-start user profile, D_s is a curated support set, and R̂ is the predicted ranked list of items. Based on systematic experimentation with transformer-based autoregressive LLMs (BioGPT, LLaMA-2, GPT-4), we provide empirical evidence that optimal exemplar injection and instruction structuring can significantly improve the precision@k and NDCG scores of such models in low-data settings. The pipeline uses token-level alignment and embedding-space regularization to achieve greater semantic fidelity. Our findings show that prompt composition is not merely syntactic but functional: it directly shapes attention allocation and decoder behavior during inference. This paper shows that prompt-based adaptation is a viable way to address cold-start recommendation issues in LLM-based pipelines.
Submitted 10 September, 2025;
originally announced September 2025.
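The mapping P(u, D_s) → R̂ amounts to assembling an instructional few-shot prompt from the support set and reading off a ranked list, evaluated with metrics such as precision@k. A minimal sketch, with illustrative prompt wording (the paper's actual templates are not shown in the abstract):

```python
def format_prompt(user_profile: str,
                  support_set: list[tuple[str, str]],
                  instruction: str) -> str:
    """P(u, D_s): turn a cold-start profile u and curated support set D_s of
    (profile, recommendation) exemplars into a few-shot prompt; the LLM's
    completion plays the role of the ranked list R̂."""
    shots = "\n".join(f"Profile: {p}\nRecommended: {r}" for p, r in support_set)
    return f"{instruction}\n{shots}\nProfile: {user_profile}\nRecommended:"

def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k ranked items that are relevant."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k
```

For example, `precision_at_k(["a", "b", "c"], {"a", "c"}, 2)` gives 0.5: one of the top two items is relevant. Exemplar selection for D_s is exactly the "optimal exemplar injection" knob the abstract reports as driving the low-data gains.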
-
Silicon-Compatible Ionic Control over Multi-State Magnetoelectric Phase Transformations in Correlated Oxide System
Authors:
Xuanchi Zhou,
Jiahui Ji,
Wentian Lu,
Huihui Ji,
Chunwei Yao,
Xiaohui Yao,
Xiaomei Qiao,
Guowei Zhou,
Xiaohong Xu
Abstract:
Realizing room-temperature ferromagnetic insulators, critical enablers for low-power spintronics, is fundamentally challenged by the long-standing trade-off between ferromagnetic ordering and indirect exchange interactions in insulators. Ionic evolution offers tempting opportunities for accessing exotic magnetoelectric states and physical functionality beyond the conventional doping paradigm by tailoring charge-lattice-orbital-spin interactions. Here, we demonstrate precise magneto-ionic control over the magnetoelectric states of the LSMO system, delivering a silicon-compatible weakly ferromagnetic insulating state above room temperature. Of particular note is the decoupling of the ion-charge-spin interplay in the correlated LSMO system, a primary obstacle to clarifying the underlying physical origin; this process concurrently gives rise to an emergent intermediate state characterized by weakly ferromagnetic half-metallicity. Benefiting from an SrTiO3 buffer layer that serves as an epitaxial template promoting interfacial heterogeneous nucleation, hydrogenation enables diverse magnetoelectric states in LSMO integrated on silicon, fully compatible with conventional semiconductor processing. Supported by theoretical calculations and spectroscopic measurements, we show that the hydrogen-induced magnetoelectric transitions in LSMO are driven by band-filling control and suppression of the double exchange interaction. Our work not only defines a novel design paradigm for exploring exotic quantum states in correlated systems, with transformative potential for spintronics, but also unveils the physical origin of this ionic evolution by disentangling the ion-charge-spin coupling.
Submitted 8 September, 2025;
originally announced September 2025.