-
Reinforcement-Guided Synthetic Data Generation for Privacy-Sensitive Identity Recognition
Authors:
Xuemei Jia,
Jiawei Du,
Hui Wei,
Jun Chen,
Joey Tianyi Zhou,
Zheng Wang
Abstract:
High-fidelity generative models are increasingly needed in privacy-sensitive scenarios, where access to data is severely restricted due to regulatory and copyright constraints. This scarcity hampers model development--ironically, in settings where generative models are most needed to compensate for the lack of data. This creates a self-reinforcing challenge: limited data leads to poor generative models, which in turn fail to mitigate data scarcity. To break this cycle, we propose a reinforcement-guided synthetic data generation framework that adapts general-domain generative priors to privacy-sensitive identity recognition tasks. We first perform a cold-start adaptation to align a pretrained generator with the target domain, establishing semantic relevance and initial fidelity. Building on this foundation, we introduce a multi-objective reward that jointly optimizes semantic consistency, coverage diversity, and expression richness, guiding the generator to produce both realistic and task-effective samples. During downstream training, a dynamic sample selection mechanism further prioritizes high-utility synthetic samples, enabling adaptive data scaling and improved domain alignment. Extensive experiments on benchmark datasets demonstrate that our framework significantly improves both generation fidelity and classification accuracy, while also exhibiting strong generalization to novel categories in small-data regimes.
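The abstract does not give the multi-objective reward in closed form; a minimal sketch of the scalarization it describes, with hypothetical weights and the assumption that each term is pre-normalized to [0, 1], might look like:

```python
import numpy as np

def multi_objective_reward(sem_consistency, coverage_diversity, expr_richness,
                           weights=(0.5, 0.3, 0.2)):
    """Combine the three reward terms (each assumed pre-normalized to [0, 1])
    into one scalar via a weighted sum. The weights are hypothetical, not
    taken from the paper."""
    w = np.asarray(weights, dtype=float)
    r = np.asarray([sem_consistency, coverage_diversity, expr_richness], dtype=float)
    return float(w @ r / w.sum())
```

In an RL fine-tuning loop this scalar would score each generated sample; the actual term definitions and weighting scheme are design choices of the paper that the abstract does not specify.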
Submitted 9 April, 2026;
originally announced April 2026.
-
FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios
Authors:
Xiangru Jian,
Hao Xu,
Wei Pang,
Xinjian Zhao,
Chengyu Tao,
Qixin Zhang,
Xikun Zhang,
Chao Zhang,
Guanzhi Deng,
Alex Xue,
Juan Du,
Tianshu Yu,
Garth Tarr,
Linqi Song,
Qiuzhuang Sun,
Dacheng Tao
Abstract:
The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to transition from simple perception to autonomous execution, yet current evaluations fail to reflect the rigorous demands of real-world manufacturing environments. Progress is hindered by data scarcity and a lack of fine-grained domain semantics in existing datasets. To bridge this gap, we introduce FORGE. We first construct a high-quality multimodal dataset that combines real-world 2D images and 3D point clouds, annotated with fine-grained domain semantics (e.g., exact model numbers). We then evaluate 18 state-of-the-art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps. Counter to conventional understanding, the bottleneck analysis shows that visual grounding is not the primary limiting factor. Instead, insufficient domain-specific knowledge is the key bottleneck, setting a clear direction for future research. Beyond evaluation, we show that our structured annotations can serve as an actionable training resource: supervised fine-tuning of a compact 3B-parameter model on our data yields up to 90.8% relative improvement in accuracy on held-out manufacturing scenarios, providing preliminary evidence for a practical pathway toward domain-adapted manufacturing MLLMs. The code and datasets are available at https://ai4manufacturing.github.io/forge-web.
Submitted 8 April, 2026;
originally announced April 2026.
-
TurboAgent: An LLM-Driven Autonomous Multi-Agent Framework for Turbomachinery Aerodynamic Design
Authors:
Juan Du,
Yueteng Wu,
Pan Zhao,
Yuze Liu,
Min Zhang,
Xiaobin Xu,
Xinglong Zhang
Abstract:
The aerodynamic design of turbomachinery is a complex and tightly coupled multi-stage process involving geometry generation, performance prediction, optimization, and high-fidelity physical validation. Existing intelligent design approaches typically focus on individual stages or rely on loosely coupled pipelines, making fully autonomous end-to-end design challenging. To address this issue, this study proposes TurboAgent, a large language model (LLM)-driven autonomous multi-agent framework for turbomachinery aerodynamic design and optimization. The LLM serves as the core for task planning and coordination, while specialized agents handle generative design, rapid performance prediction, multi-objective optimization, and physics-based validation. The framework transforms traditional trial-and-error design into a data-driven collaborative workflow, with high-fidelity simulations retained for final verification. A transonic single-rotor compressor is used for validation. The results show strong agreement between target performance, generated designs, and CFD simulations. The coefficients of determination for mass flow rate, total pressure ratio, and isentropic efficiency all exceed 0.91, with normalized RMSE values below 8%. The optimization agent further improves isentropic efficiency by 1.61% and total pressure ratio by 3.02%. The complete workflow can be executed within approximately 30 minutes under parallel computing. These results demonstrate that TurboAgent enables an autonomous closed-loop design process from natural language requirements to final design generation, providing an efficient and scalable paradigm for turbomachinery aerodynamic design.
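For reference, the agreement metrics quoted (coefficient of determination above 0.91 and normalized RMSE below 8%) are standard; assuming range normalization for the NRMSE (the abstract does not state the convention), they read:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{N}\bigl(y_i - \hat{y}_i\bigr)^2}{\sum_{i=1}^{N}\bigl(y_i - \bar{y}\bigr)^2},
\qquad
\mathrm{NRMSE} = \frac{1}{y_{\max} - y_{\min}}
\sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - \hat{y}_i\bigr)^2},
```

where $y_i$ are the CFD-validated values, $\hat{y}_i$ the predicted ones, and $\bar{y}$ their mean.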
Submitted 8 April, 2026; v1 submitted 8 April, 2026;
originally announced April 2026.
-
SGANet: Semantic and Geometric Alignment for Multimodal Multi-view Anomaly Detection
Authors:
Letian Bai,
Chengyu Tao,
Juan Du
Abstract:
Multi-view anomaly detection aims to identify surface defects on complex objects using observations captured from multiple viewpoints. However, existing unsupervised methods often suffer from feature inconsistency arising from viewpoint variations and modality discrepancies. To address these challenges, we propose a Semantic and Geometric Alignment Network (SGANet), a unified framework for multimodal multi-view anomaly detection that effectively combines semantic and geometric alignment to learn physically coherent feature representations across viewpoints and modalities. SGANet consists of three key components. The Selective Cross-view Feature Refinement Module (SCFRM) selectively aggregates informative patch features from adjacent views to enhance cross-view feature interaction. The Semantic-Structural Patch Alignment (SSPA) enforces semantic alignment across modalities while maintaining structural consistency under viewpoint transformations. The Multi-View Geometric Alignment (MVGA) further aligns geometrically corresponding patches across viewpoints. By jointly modeling feature interaction, semantic and structural consistency, and global geometric correspondence, SGANet effectively enhances anomaly detection performance in multimodal multi-view settings. Extensive experiments on the SiM3D and Eyecandies datasets demonstrate that SGANet achieves state-of-the-art performance in both anomaly detection and localization, validating its effectiveness in realistic industrial scenarios.
Submitted 7 April, 2026;
originally announced April 2026.
-
Beyond Few-Step Inference: Accelerating Video Diffusion Transformer Model Serving with Inter-Request Caching Reuse
Authors:
Hao Liu,
Ye Huang,
Chenghuan Huang,
Zhenyi Zheng,
Jiangsu Du,
Ziyang Ma,
Jing Lyu,
Yutong Lu
Abstract:
Video Diffusion Transformer (DiT) models are a dominant approach for high-quality video generation but suffer from high inference cost due to iterative denoising. Existing caching approaches primarily exploit similarity within the diffusion process of a single request to skip redundant denoising steps. In this paper, we introduce Chorus, a caching approach that leverages similarity across requests to accelerate video diffusion model serving. Chorus achieves up to 45% speedup on industrial 4-step distilled models, where prior intra-request caching approaches are ineffective. Specifically, Chorus employs a three-stage caching strategy along the denoising process. Stage 1 performs full reuse of latent features from similar requests. Stage 2 exploits inter-request caching in specific latent regions during intermediate denoising steps. This stage is combined with Token-Guided Attention Amplification to improve semantic alignment between the generated video and the conditional prompts, thereby extending the applicability of full reuse to later denoising steps.
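The exact matching criterion is not given in the abstract; a toy sketch of Stage-1 full reuse, assuming requests are keyed by prompt embeddings and matched by cosine similarity against a hypothetical threshold, could look like:

```python
import numpy as np

class InterRequestCache:
    """Illustrative inter-request latent cache: store denoised latents keyed by
    prompt embeddings, and fully reuse a cached latent when a new request's
    prompt embedding is similar enough. Threshold and keying are assumptions."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.keys, self.latents = [], []

    def lookup(self, emb):
        # Return the first cached latent whose key passes the cosine-similarity test.
        emb = np.asarray(emb, dtype=float)
        for key, latent in zip(self.keys, self.latents):
            sim = float(key @ emb / (np.linalg.norm(key) * np.linalg.norm(emb)))
            if sim >= self.threshold:
                return latent
        return None  # cache miss: run the full denoising pipeline

    def store(self, emb, latent):
        self.keys.append(np.asarray(emb, dtype=float))
        self.latents.append(latent)
```

A real system would also need the Stage-2 region-level reuse and attention amplification the abstract describes, which this sketch does not attempt.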
Submitted 6 April, 2026;
originally announced April 2026.
-
User-Aware Conditional Generative Total Correlation Learning for Multi-Modal Recommendation
Authors:
Jing Du,
Zesheng Ye,
Congbo Ma,
Feng Liu,
Flora D. Salim
Abstract:
Multi-modal recommendation (MMR) enriches item representations by introducing item content, e.g., visual and textual descriptions, to improve upon interaction-only recommenders. The success of MMR hinges on aligning these content modalities with user preferences derived from interaction data, yet dominant practices based on disentangling modality-invariant preference-driving signals from modality-specific preference-irrelevant noises are flawed. First, they assume a one-size-fits-all relevance of item content to user preferences for all users, which contradicts the user-conditional fact of preferences. Second, they optimize pairwise contrastive losses separately toward cross-modal alignment, systematically ignoring higher-order dependencies inherent when multiple content modalities jointly influence user choices. In this paper, we introduce GTC, a conditional Generative Total Correlation learning framework. We employ an interaction-guided diffusion model to perform user-aware content feature filtering, preserving only personalized features relevant to each individual user. Furthermore, to capture complete cross-modal dependencies, we optimize a tractable lower bound of the total correlation of item representations across all modalities. Experiments on standard MMR benchmarks show GTC consistently outperforms state-of-the-art, with gains of up to 28.30% in NDCG@5. Ablation studies validate both conditional preference-driven feature filtering and total correlation optimization, confirming the ability of GTC to model user-conditional relationships in MMR tasks. The code is available at: https://github.com/jingdu-cs/GTC.
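The total correlation that GTC lower-bounds is the standard multivariate generalization of mutual information; for item representations $z^{(1)}, \dots, z^{(M)}$ across $M$ modalities,

```latex
\mathrm{TC}\bigl(z^{(1)},\dots,z^{(M)}\bigr)
= \sum_{m=1}^{M} H\bigl(z^{(m)}\bigr) - H\bigl(z^{(1)},\dots,z^{(M)}\bigr)
= D_{\mathrm{KL}}\!\left(p\bigl(z^{(1)},\dots,z^{(M)}\bigr)\,\middle\|\,\prod_{m=1}^{M} p\bigl(z^{(m)}\bigr)\right).
```

Maximizing a tractable lower bound of this quantity encourages joint, higher-order dependence among the modality representations rather than only pairwise alignment; the specific bound GTC optimizes is not stated in the abstract.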
Submitted 3 April, 2026;
originally announced April 2026.
-
Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models
Authors:
Jiawei Chen,
Simin Huang,
Jiawei Du,
Shuaihang Chen,
Yu Tian,
Mingjie Wei,
Chao Yu,
Zhaoxia Yin
Abstract:
Vision-language-action (VLA) models have shown strong performance in robotic manipulation, yet their robustness to physically realizable adversarial attacks remains underexplored. Existing studies reveal vulnerabilities through language perturbations and 2D visual attacks, but these attack surfaces are either less representative of real deployment or limited in physical realism. In contrast, adversarial 3D textures pose a more physically plausible and damaging threat, as they are naturally attached to manipulated objects and are easier to deploy in physical environments. Bringing adversarial 3D textures to VLA systems is nevertheless nontrivial. A central obstacle is that standard 3D simulators do not provide a differentiable optimization path from the VLA objective function back to object appearance, making it difficult to optimize in an end-to-end manner. To address this, we introduce Foreground-Background Decoupling (FBD), which enables differentiable texture optimization through dual-renderer alignment while preserving the original simulation environment. To further ensure that the attack remains effective across long-horizon and diverse viewpoints in the physical world, we propose Trajectory-Aware Adversarial Optimization (TAAO), which prioritizes behaviorally critical frames and stabilizes optimization with a vertex-based parameterization. Built on these designs, we present Tex3D, the first framework for end-to-end optimization of 3D adversarial textures directly within the VLA simulation environment. Experiments in both simulation and real-robot settings show that Tex3D significantly degrades VLA performance across multiple manipulation tasks, achieving task failure rates of up to 96.7%. Our empirical results expose critical vulnerabilities of VLA systems to physically grounded 3D adversarial attacks and highlight the need for robustness-aware training.
Submitted 2 April, 2026;
originally announced April 2026.
-
A deterministic multiple-shift lattice algorithm for function approximation in Korobov and half-period Cosine spaces
Authors:
Jiarui Du,
Josef Dick
Abstract:
Approximating multivariate periodic functions in weighted Korobov spaces via rank-1 lattices is fundamentally limited by frequency aliasing. Existing optimal-rate methods rely on randomized constructions or large pre-computations. We propose a fully deterministic multiple-shift lattice algorithm without pre-computation. First, we develop a simplified multiple shift framework for aliased frequency fibers that reduces sampling costs. Second, leveraging the Chinese Remainder Theorem and the Weil bound, we introduce an adaptive hybrid construction that algebraically guarantees the full rank and bounded condition number of the reconstruction matrix. We rigorously prove that this deterministic method maintains the optimal convergence rate in the worst-case setting.
Furthermore, we extend this framework to non-periodic, half-period cosine spaces via the tent transformation. By establishing a strict projection equivalence, we prove that the algorithm attains optimal $L_2$ and $L_\infty$ approximation orders in the half-period cosine space, successfully resolving an open theoretical problem posed by Suryanarayana et al. (2016). This mathematically also validates the proposed algorithm as a generic meshless spectral solver for high-dimensional boundary value problems, such as the Poisson equation with Neumann conditions. Numerical experiments corroborate the theoretical bounds, demonstrating an order-of-magnitude reduction in sampling complexity over probabilistic baselines while ensuring absolute deterministic stability.
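For context, the standard definitions behind the method (not the paper's specific construction): a rank-1 lattice with $n$ points and generating vector $\mathbf{z} \in \mathbb{Z}^d$ samples

```latex
\mathbf{x}_k = \left\{ \frac{k\,\mathbf{z}}{n} \right\}, \quad k = 0, \dots, n-1,
\qquad
\mathbf{h} \sim \mathbf{h}' \iff (\mathbf{h} - \mathbf{h}') \cdot \mathbf{z} \equiv 0 \pmod{n},
```

where $\{\cdot\}$ denotes the componentwise fractional part. Frequencies $\mathbf{h}, \mathbf{h}'$ in the same aliasing class are indistinguishable on the lattice; this is the obstruction that the multiple-shift construction is designed to resolve.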
Submitted 3 April, 2026; v1 submitted 1 April, 2026;
originally announced April 2026.
-
SuperGrasp: Single-View Object Grasping via Superquadric Similarity Matching, Evaluation, and Refinement
Authors:
Lijingze Xiao,
Jinhong Du,
Yang Cong,
Supeng Diao,
Yu Ren
Abstract:
Robotic grasping from single-view observations remains a critical challenge in manipulation. Existing methods still struggle to generate stable and valid grasp poses when confronted with incomplete geometric information. To address these limitations, we propose SuperGrasp, a novel two-stage framework for single-view grasping with parallel-jaw grippers that decomposes the grasping process into initial grasp pose generation and subsequent grasp evaluation and refinement. In the first stage, we introduce a Similarity Matching Module that efficiently retrieves grasp candidates by matching the input single-view point cloud with a pre-computed primitive dataset based on superquadric coefficients. In the second stage, we propose E-RNet, an end-to-end network that expands the grasp-aware region and takes the initial grasp closure region as a local anchor region, enabling more accurate and reliable evaluation and refinement of grasp candidates. To enhance generalization, we construct a primitive dataset containing 1.5k primitives for similarity matching and collect a large-scale point cloud dataset with 100k stable grasp labels from 124 objects for network training. Extensive experiments in both simulation and real-world environments demonstrate that our method achieves stable grasping performance and strong generalization across varying scenes and novel objects.
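The superquadric coefficients used for matching presumably parameterize the standard inside-outside function (the abstract does not state the exact form used):

```latex
F(x, y, z) =
\left( \left(\frac{x}{a_1}\right)^{2/\varepsilon_2}
     + \left(\frac{y}{a_2}\right)^{2/\varepsilon_2} \right)^{\varepsilon_2/\varepsilon_1}
+ \left(\frac{z}{a_3}\right)^{2/\varepsilon_1},
```

with $F = 1$ on the surface and $F < 1$ inside. Under this assumption, similarity matching amounts to nearest-neighbor search in the space of size parameters $a_1, a_2, a_3$ and shape exponents $\varepsilon_1, \varepsilon_2$.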
Submitted 31 March, 2026;
originally announced March 2026.
-
KAT-Coder-V2 Technical Report
Authors:
Fengxiang Li,
Han Zhang,
Haoyang Huang,
Jinghui Wang,
Jinhua Hao,
Kun Yuan,
Mengtong Li,
Minglei Zhang,
Pengcheng Xu,
Wenhao Zhuang,
Yizhen Shao,
Zongxian Feng,
Can Tang,
Chao Wang,
Chengxiao Tong,
Fan Yang,
Gang Xiong,
Haixuan Gao,
Han Gao,
Hao Wang,
Haochen Liu,
Hongliang Sun,
Jiabao Li,
Jingwen Chang,
Jun Du
, et al. (21 additional authors not shown)
Abstract:
We present KAT-Coder-V2, an agentic coding model developed by the KwaiKAT team at Kuaishou. KAT-Coder-V2 adopts a "Specialize-then-Unify" paradigm that decomposes agentic coding into five expert domains - SWE, WebCoding, Terminal, WebSearch, and General - each undergoing independent supervised fine-tuning and reinforcement learning, before being consolidated into a single model via on-policy distillation. We develop KwaiEnv, a modular infrastructure sustaining tens of thousands of concurrent sandbox instances, and scale RL training along task complexity, intent alignment, and scaffold generalization. We further propose MCLA for stabilizing MoE RL training and Tree Training for eliminating redundant computation over tree-structured trajectories with up to 6.2x speedup. KAT-Coder-V2 achieves 79.6% on SWE-bench Verified (vs. Claude Opus 4.6 at 80.8%), 88.7 on PinchBench (surpassing GLM-5 and MiniMax M2.7), ranks first across all three frontend aesthetics scenarios, and maintains strong generalist scores on Terminal-Bench Hard (46.8) and tau^2-Bench (93.9). Our model is publicly available at https://streamlake.com/product/kat-coder.
Submitted 29 March, 2026;
originally announced March 2026.
-
Giant Magnetostriction by Design: A First-Principles Screening of Co-based Heusler Alloys
Authors:
Pengju Wu,
Jie Du,
Liang Yao,
Hang Li,
Xiaodong Zhou,
Tao Zhu,
Wenhong Wang
Abstract:
The pursuit of high-performance, rare-earth-free magnetostrictive materials is crucial for advancing technologies in sensing, actuation, and microelectromechanical systems. Heusler alloys represent a promising, yet underexplored, class of materials for this purpose. In this work, we perform a systematic first-principles investigation of the magnetostrictive properties of 25 Co-based full Heusler alloys, Co$_2$YZ (Y = V, Cr, Mn, Fe, Co; Z = Al, Ga, Si, Ge, Sn). Our screening identifies 10 compounds with large predicted magnetostriction ($|\lambda_{001}| > 100$ ppm), highlighted by Co$_3$Si with a giant value of -966 ppm. Furthermore, we demonstrate two effective strategies for engineering magnetostriction: (i) tuning the Fermi level, which enhances the magnetostriction of Co$_3$Sn to -905 ppm via Sb doping, and (ii) amplifying the spin-orbit coupling, which boosts the magnetostriction of Co$_2$CrGa to a colossal -1008 ppm through Re substitution. Our analysis reveals a general predictive rule, uncovering a linear relationship between the magnetostriction and the choice of the Y-site transition metal. This work not only identifies novel candidates for magnetostrictive applications but also establishes clear, physically-grounded design principles to accelerate the discovery of new functional magnetic materials.
Submitted 27 March, 2026;
originally announced March 2026.
-
WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing
Authors:
Fanheng Kong,
Jingyuan Zhang,
Yang Yue,
Chenxi Sun,
Yang Tian,
Shi Feng,
Xiaocui Yang,
Daling Wang,
Yu Tian,
Jun Du,
Wenchong Zeng,
Han Li,
Kun Gai
Abstract:
The emergence of Large Language Models (LLMs) has catalyzed a paradigm shift in programming, giving rise to "vibe coding", where users can build complete projects and even control computers using natural language instructions. This paradigm has driven automated webpage development, but it introduces a new requirement: automatically verifying whether web functionalities are reliably implemented. Existing works struggle to adapt, relying on static visual similarity or predefined checklists that constrain their utility in open-ended environments. Furthermore, they overlook a vital aspect of software quality, namely latent logical constraints. To address these gaps, we introduce WebTestBench, a benchmark for evaluating end-to-end automated web testing. WebTestBench encompasses comprehensive dimensions across diverse web application categories. We decompose the testing process into two cascaded sub-tasks, checklist generation and defect detection, and propose WebTester, a baseline framework for this task. Evaluating popular LLMs with WebTester reveals severe challenges, including insufficient test completeness, detection bottlenecks, and long-horizon interaction unreliability. These findings expose a substantial gap between current computer-use agent capabilities and industrial-grade deployment demands. We hope that WebTestBench provides valuable insights and guidance for advancing end-to-end automated web testing. Our dataset and code are available at https://github.com/friedrichor/WebTestBench.
Submitted 26 March, 2026;
originally announced March 2026.
-
Tentative Detection of the Glycine Isomer Glycolamide in Hot Molecular Core
Authors:
Chunguo Duan,
Fengwei Xu,
Qian Gou,
Xuefang Xu,
Donghui Quan,
Laurent Pagani,
Xi Chen,
Jun Kang,
Jiaxin Du
Abstract:
Understanding whether prebiotic molecules can endure and reform through the energetic stages of star formation is essential for tracing the continuity of interstellar chemistry toward life. Glycolamide, an isomer of glycine, was recently detected in the molecular cloud G+0.693-0.027. However, establishing its presence in warm, high-density environments is crucial to evaluate the chemical continuity of amides. Here we report the tentative detection of glycolamide in a hot molecular core, G358.93-0.03 MM1, using ALMA 1 mm observations. Seven unblended or only mildly blended emission lines were identified, yielding an abundance of (1.7$\pm$0.2)$\times 10^{-10}$ relative to H$_{2}$. The comparable formamide/glycolamide and acetamide/glycolamide abundance ratios in both sources suggest a chemically connected amide network across different environments. These results demonstrate that amides can persist and chemically evolve during massive star formation, tracing the chemical continuity from interstellar to protostellar environments.
Submitted 6 April, 2026; v1 submitted 24 March, 2026;
originally announced March 2026.
-
TDATR: Improving End-to-End Table Recognition via Table Detail-Aware Learning and Cell-Level Visual Alignment
Authors:
Chunxia Qin,
Chenyu Liu,
Pengcheng Xia,
Jun Du,
Baocai Yin,
Bing Yin,
Cong Liu
Abstract:
Tables are pervasive in diverse documents, making table recognition (TR) a fundamental task in document analysis. Existing modular TR pipelines separately model table structure and content, leading to suboptimal integration and complex workflows. End-to-end approaches rely heavily on large-scale TR data and struggle in data-constrained scenarios. To address these issues, we propose TDATR (Table Detail-Aware Table Recognition), which improves end-to-end TR through table detail-aware learning and cell-level visual alignment. TDATR adopts a "perceive-then-fuse" strategy. The model first performs table detail-aware learning to jointly perceive table structure and content through multiple structure understanding and content recognition tasks designed under a language modeling paradigm. These tasks can naturally leverage document data from diverse scenarios to enhance model robustness. The model then integrates implicit table details to generate structured HTML outputs, enabling more efficient TR modeling when trained with limited data. Furthermore, we design a structure-guided cell localization module integrated into the end-to-end TR framework, which efficiently locates cells and strengthens vision-language alignment. It enhances the interpretability and accuracy of TR. We achieve state-of-the-art or highly competitive performance on seven benchmarks without dataset-specific fine-tuning.
Submitted 24 March, 2026;
originally announced March 2026.
-
Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention
Authors:
Junhao Du,
Jialong Xue,
Anqi Li,
Jincheng Dai,
Guo Lu
Abstract:
Video large language models (Video-LLMs) face high computational costs due to large volumes of visual tokens. Existing token compression methods typically adopt a two-stage spatiotemporal compression strategy, relying on stage-specific metrics and an implicit assumption of spatiotemporal separability. Under extremely low retention ratios, however, such approaches often result in unbalanced allocation and loss of visual evidence essential for question answering. We reformulate token compression as a spatiotemporal allocation task within a global token retention pool. We propose a unified selection mechanism that integrates attention weights and semantic similarity to globally select tokens with high contribution and low redundancy. Unselected tokens are merged via clustering and refilled, preserving information integrity. Inside the LLM, we further introduce text-aware merging to perform secondary compression based on query relevance. Without requiring retraining, our method serves as a plug-and-play module compatible with existing Video-LLMs. Experiments show that retaining only about 2% of visual tokens preserves 90.1% of baseline performance across multiple benchmarks, while reducing FLOPs to roughly 2.6%. These benefits generalize across diverse backbones, decreasing end-to-end inference latency and memory consumption. Our unified spatiotemporal token compression strategy establishes the state-of-the-art in video understanding under ultra-low token retention.
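The abstract describes the selection rule only at a high level; a simplified sketch of a global top-$k$ selection that trades off attention contribution against redundancy (the $\alpha$ weighting and the greedy scheme are assumptions, not the paper's exact mechanism) might look like:

```python
import numpy as np

def select_tokens(attn, feats, keep_ratio=0.02, alpha=0.5):
    """Greedy global token selection: score = alpha * attention contribution
    minus (1 - alpha) * redundancy, where redundancy is the max cosine
    similarity to tokens already kept. Returns sorted indices of kept tokens."""
    attn = np.asarray(attn, dtype=float)
    feats = np.asarray(feats, dtype=float)
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    k = max(1, int(len(attn) * keep_ratio))
    selected = [int(np.argmax(attn))]            # seed with the most-attended token
    while len(selected) < k:
        redundancy = (f @ f[selected].T).max(axis=1)
        score = alpha * attn - (1 - alpha) * redundancy
        score[selected] = -np.inf                # never re-select a kept token
        selected.append(int(np.argmax(score)))
    return sorted(selected)
```

The paper's full pipeline additionally merges the unselected tokens via clustering and applies text-aware secondary compression inside the LLM, which this sketch omits.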
Submitted 23 March, 2026;
originally announced March 2026.
-
GAPG: Geometry Aware Push-Grasping Synergy for Goal-Oriented Manipulation in Clutter
Authors:
Lijingze Xiao,
Jinhong Du,
Yang Cong,
Supeng Diao,
Yu Ren
Abstract:
Grasping target objects is a fundamental skill for robotic manipulation, but in cluttered environments with stacked or occluded objects, a single-step grasp is often insufficient. To address this, previous work has introduced pushing as an auxiliary action to create graspable space. However, these methods often struggle with both stability and efficiency because they neglect the scene's geometric information, which is essential for evaluating grasp robustness and ensuring that pushing actions are safe and effective. To this end, we propose a geometry-aware push-grasp synergy framework that leverages point cloud data to integrate grasp and push evaluation. Specifically, the grasp evaluation module analyzes the geometric relationship between the gripper's point cloud and the points enclosed within its closing region to determine grasp feasibility and stability. Guided by this, the push evaluation module predicts how pushing actions influence future graspable space, enabling the robot to select actions that reliably transform non-graspable states into graspable ones. By jointly reasoning about geometry in both grasping and pushing, our framework achieves safer, more efficient, and more reliable manipulation in cluttered settings. Our method is extensively tested in simulation and real-world environments in various scenarios. Experimental results demonstrate that our model generalizes well to real-world scenes and unseen objects.
Submitted 22 March, 2026;
originally announced March 2026.
-
VSD-MOT: End-to-End Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Distillation
Authors:
Jun Du
Abstract:
Existing multi-object tracking algorithms typically fail to adequately address the issues in low-quality videos, resulting in a significant decline in tracking performance when image quality deteriorates in real-world scenarios. This performance degradation is primarily due to the algorithms' inability to effectively tackle the problems caused by information loss in low-quality images. To address the challenges of low-quality video scenarios, inspired by vision-language models, we propose a multi-object tracking framework guided by visual semantic distillation (VSD-MOT). Specifically, we introduce the CLIP Image Encoder to extract global visual semantic information from images to compensate for the information loss in low-quality images. However, direct integration can substantially impact the efficiency of the multi-object tracking algorithm. Therefore, this paper proposes to extract visual semantic information from images through knowledge distillation. This method adopts a teacher-student learning framework, with the CLIP Image Encoder serving as the teacher model. To enable the student model to acquire from the teacher model the capability of extracting visual semantic information suitable for multi-object tracking tasks, we design the Dual-Constraint Semantic Distillation (DCSD) method. Furthermore, to address the dynamic variation of frame quality in low-quality videos, we propose the Dynamic Semantic Weight Regulation (DSWR) module, which adaptively allocates fusion weights based on real-time frame quality assessment. Extensive experiments demonstrate the effectiveness and superiority of the proposed method in real-world low-quality video scenarios. Meanwhile, our method maintains good performance in conventional scenarios.
Submitted 21 March, 2026;
originally announced March 2026.
-
EARTalking: End-to-end GPT-style Autoregressive Talking Head Synthesis with Frame-wise Control
Authors:
Yuzhe Weng,
Haotian Wang,
Yuanhong Yu,
Jun Du,
Shan He,
Xiaoyan Wu,
Haoran Xu
Abstract:
Audio-driven talking head generation aims to create vivid and realistic videos from a static portrait and speech. Existing AR-based methods rely on intermediate facial representations, which limit their expressiveness and realism. Meanwhile, diffusion-based methods generate clip-by-clip, lacking fine-grained control and incurring inherent latency due to overall denoising across the window. To address these limitations, we propose EARTalking, a novel end-to-end, GPT-style autoregressive model for interactive audio-driven talking head generation. Our method introduces a novel frame-by-frame, in-context, audio-driven streaming generation paradigm. To inherently support variable-length video generation with identity consistency, we propose the Sink Frame Window Attention (SFA) mechanism. Furthermore, to avoid the complex, separate networks that prior works required for diverse control signals, we propose a streaming Frame Condition In-Context (FCIC) scheme. This scheme efficiently injects diverse control signals in a streaming, in-context manner, enabling interactive control at every frame and at arbitrary moments. Experiments demonstrate that EARTalking outperforms existing autoregressive methods and achieves performance comparable to diffusion-based methods. Our work demonstrates the feasibility of in-context streaming autoregressive control, unlocking a scalable direction for flexible, efficient generation. The code will be released for reproducibility.
Submitted 19 March, 2026;
originally announced March 2026.
-
Joint Trajectory, RIS, and Computation Offloading Optimization via Decentralized Model-Based PPO in Urban Multi-UAV Mobile Edge Computing
Authors:
Liangshun Wu,
Jianbo Du,
Junsuo Qu
Abstract:
Efficient computation offloading in multi-UAV edge networks becomes particularly challenging in dense urban areas, where line-of-sight (LoS) links are frequently blocked and user demand varies rapidly. Reconfigurable intelligent surfaces (RISs) can mitigate blockage by creating controllable reflected links, but realizing their potential requires tightly coupled decisions on UAV trajectories, offloading schedules, and RIS phase configurations. This joint optimization is hard to solve in practice because multiple UAVs must coordinate under limited information exchange, and purely model-free multi-agent reinforcement learning (MARL) often learns too slowly in highly dynamic environments. To address these challenges, we propose a decentralized model-based MARL framework. Each UAV optimizes mobility and offloading using observations from neighbors within several hops, and submits an RIS phase proposal that is aggregated by a lightweight RIS controller. To boost sample efficiency and stability, agents learn local dynamics models and perform short-horizon branched rollouts for proximal policy optimization (PPO) updates. Simulations show near-centralized performance with improved throughput and energy efficiency at scale.
Submitted 9 March, 2026;
originally announced March 2026.
-
AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science
Authors:
An Luo,
Jin Du,
Xun Xian,
Robert Specht,
Fangqiao Tian,
Ganghua Wang,
Xuan Bi,
Charles Fleming,
Ashish Kundu,
Jayanth Srinivasa,
Mingyi Hong,
Rui Zhang,
Tianxi Li,
Galin Jones,
Jie Ding
Abstract:
Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have significantly automated data science workflows. However, it remains unclear to what extent AI agents can match the performance of human experts on domain-specific data science tasks, and in which aspects human expertise continues to provide advantages. We introduce AgentDS, a benchmark and competition designed to evaluate both AI agents and human-AI collaboration performance in domain-specific data science. AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking. We conducted an open competition involving 29 teams and 80 participants, enabling systematic comparison between human-AI collaborative approaches and AI-only baselines. Our results show that current AI agents struggle with domain-specific reasoning. AI-only baselines perform near or below the median of competition participants, while the strongest solutions arise from human-AI collaboration. These findings challenge the narrative of complete automation by AI and underscore the enduring importance of human expertise in data science, while illuminating directions for the next generation of AI. Visit the AgentDS website here: https://agentds.org/ and open source datasets here: https://huggingface.co/datasets/lainmn/AgentDS .
Submitted 19 March, 2026;
originally announced March 2026.
-
The properties of plasma sheath containing the primary electrons with a Cairns-distribution
Authors:
Yida Zhang,
Jiulin Du
Abstract:
We study the properties of a plasma sheath containing cold positive ions, secondary electrons, and primary electrons with a Cairns distribution (a non-thermal velocity distribution). We derive the generalized Bohm criterion and Bohm speed, the new floating potential at the wall, and the new critical secondary electron emission coefficient. We show that these properties of the plasma sheath depend significantly on the α-parameter of the non-thermal Cairns distribution, and so they are generally different from those of the plasma sheath obtained when the primary electrons are assumed to follow a Maxwellian distribution.
Submitted 18 March, 2026; v1 submitted 16 March, 2026;
originally announced March 2026.
-
RieMind: Geometry-Grounded Spatial Agent for Scene Understanding
Authors:
Fernando Ropero,
Erkin Turkoz,
Daniel Matos,
Junqing Du,
Antonio Ruiz,
Yanfeng Zhang,
Lu Liu,
Mingwei Sun,
Yongliang Wang
Abstract:
Visual Language Models (VLMs) have increasingly become the main paradigm for understanding indoor scenes, but they still struggle with metric and spatial reasoning. Current approaches rely on end-to-end video understanding or large-scale spatial question answering fine-tuning, inherently coupling perception and reasoning. In this paper, we investigate whether decoupling perception and reasoning leads to improved spatial reasoning. We propose an agentic framework for static 3D indoor scene reasoning that grounds an LLM in an explicit 3D scene graph (3DSG). Rather than ingesting videos directly, each scene is represented as a persistent 3DSG constructed by a dedicated perception module. To isolate reasoning performance, we instantiate the 3DSG from ground-truth annotations. The agent interacts with the scene exclusively through structured geometric tools that expose fundamental properties such as object dimensions, distances, poses, and spatial relationships. The results we obtain on the static split of VSI-Bench provide an upper bound under ideal perceptual conditions on the spatial reasoning performance, and we find that it is significantly higher than previous works, by up to 16%, without task-specific fine-tuning. Compared to base VLMs, our agentic variant achieves significantly better performance, with average improvements between 33% and 50%. These findings indicate that explicit geometric grounding substantially improves spatial reasoning performance, and suggest that structured representations offer a compelling alternative to purely end-to-end visual reasoning.
Submitted 16 March, 2026;
originally announced March 2026.
-
ForceVLA2: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation
Authors:
Yang Li,
Zhaxizhuoma,
Hongru Jiang,
Junjie Xia,
Hongquan Zhang,
Jinda Du,
Yunsong Zhou,
Jia Zeng,
Ce Hao,
Jieji Ren,
Qiaojun Yu,
Cewu Lu,
Yu Qiao,
Jiangmiao Pang
Abstract:
Embodied intelligence for contact-rich manipulation has predominantly relied on position control, while explicit awareness and regulation of interaction forces remain under-explored, limiting stability, precision, and robustness in real-world tasks. We propose ForceVLA2, an end-to-end vision-language-action framework that equips robots with hybrid force-position control and explicit force awareness. ForceVLA2 introduces force-based prompts into the VLM expert to construct force-aware task concepts across stages, and employs a Cross-Scale Mixture-of-Experts (MoE) in the action expert to adaptively fuse these concepts with real-time interaction forces for closed-loop hybrid force-position regulation. To support learning and evaluation, we construct ForceVLA2-Dataset, containing 1,000 trajectories over 5 contact-rich tasks, including wiping, pressing, and assembling, with multi-view images, task prompts, proprioceptive state, and force signals. Extensive experiments show that ForceVLA2 substantially improves success rates and reliability in contact-rich manipulation, outperforming pi0 and pi0.5 by 48.0% and 35.0%, respectively, across the 5 tasks, and mitigating common failure modes such as arm overload and unstable contact, thereby actively advancing force-aware interactive physical intelligence in VLAs. The project page is available at https://sites.google.com/view/force-vla2/home.
Submitted 16 March, 2026;
originally announced March 2026.
-
The Chandrasekhar's Conditions as Equilibrium and Stability of Stars in a Universal Three-Parameter Non-Maxwell Distribution
Authors:
Wei Hu,
Jiulin Du
Abstract:
The idea of Chandrasekhar's conditions for the equilibrium and stability of stars is revisited with a new universal three-parameter non-Maxwellian distribution. We derive the maximum radiation pressures in the non-Maxwellian distribution for a gas star and a centrally-condensed star, respectively, and thus we generalize Chandrasekhar's conditions from the Maxwellian case. By numerical analyses, we find that the non-Maxwellian distribution usually reduces the maximum radiation pressures in both a gas star and a centrally-condensed star compared with the cases in which the gas is assumed to follow a Maxwellian distribution.
Submitted 16 March, 2026; v1 submitted 16 March, 2026;
originally announced March 2026.
-
Deconfounded Lifelong Learning for Autonomous Driving via Dynamic Knowledge Spaces
Authors:
Jiayuan Du,
Yuebing Song,
Yiming Zhao,
Xianghui Pan,
Jiawei Lian,
Yuchu Lu,
Liuyi Wang,
Chengju Liu,
Qijun Chen
Abstract:
End-to-End autonomous driving (E2E-AD) systems face challenges in lifelong learning, including catastrophic forgetting, difficulty in knowledge transfer across diverse scenarios, and spurious correlations between unobservable confounders and true driving intents. To address these issues, we propose DeLL, a Deconfounded Lifelong Learning framework that integrates a Dirichlet process mixture model (DPMM) with the front-door adjustment mechanism from causal inference. The DPMM is employed to construct two dynamic knowledge spaces: a trajectory knowledge space for clustering explicit driving behaviors and an implicit feature knowledge space for discovering latent driving abilities. Leveraging the non-parametric Bayesian nature of DPMM, our framework enables adaptive expansion and incremental updating of knowledge without predefining the number of clusters, thereby mitigating catastrophic forgetting. Meanwhile, the front-door adjustment mechanism utilizes the DPMM-derived knowledge as valid mediators to deconfound spurious correlations, such as those induced by sensor noise or environmental changes, and enhances the causal expressiveness of the learned representations. Additionally, we introduce an evolutionary trajectory decoder that enables non-autoregressive planning. To evaluate the lifelong learning performance of E2E-AD, we propose new evaluation protocols and metrics based on Bench2Drive. Extensive evaluations in the closed-loop CARLA simulator demonstrate that our framework significantly improves adaptability to new driving scenarios and overall driving performance, while effectively retaining previously acquired knowledge.
Submitted 30 March, 2026; v1 submitted 15 March, 2026;
originally announced March 2026.
-
AI/ML for mobile networks: Current status in Rel. 19 and challenges ahead
Authors:
Yuan Gao,
Xinyi Wu,
Jun Jiang,
Bintao Hu,
Jianbo Du,
Qiang Ye,
Shunqing Zhang,
F. Richard Yu,
Shugong Xu
Abstract:
The transformative power of artificial intelligence (AI) and machine learning (ML) is recognized as a key enabler for sixth generation (6G) mobile networks by both academia and industry. Research on AI/ML in mobile networks has been ongoing for years, and the 3rd generation partnership project (3GPP) launched standardization efforts to integrate AI into mobile networks. However, a comprehensive review of the current status and challenges of the standardization of AI/ML for mobile networks is still missing. To this end, we provide a comprehensive review of the standardization efforts by 3GPP on AI/ML for mobile networks. This includes an overview of the general AI/ML framework, representative use cases (i.e., CSI feedback, beam management, and positioning), and the corresponding evaluation metrics. We emphasize the key research challenges in dataset preparation, generalization evaluation, and baseline AI/ML model selection. Using CSI feedback as a case study, given the test dataset 2, we demonstrate that the pre-training-fine-tuning paradigm (i.e., pre-training using dataset 1 and fine-tuning using dataset 2) outperforms training on dataset 2 alone. Moreover, we observe the highest performance enhancements in Transformer-based models through fine-tuning, showing their great generalization potential at large floating-point operations (FLOPs). Finally, we outline future research directions for the application of AI/ML in mobile networks.
Submitted 15 March, 2026;
originally announced March 2026.
-
MedPriv-Bench: Benchmarking the Privacy-Utility Trade-off of Large Language Models in Medical Open-End Question Answering
Authors:
Shaowei Guan,
Yu Zhai,
Hin Chi Kwok,
Jiawei Du,
Xinyu Feng,
Jing Li,
Harry Qin,
Vivian Hui
Abstract:
Recent advances in Retrieval-Augmented Generation (RAG) have enabled large language models (LLMs) to ground outputs in clinical evidence. However, connecting LLMs with external databases introduces the risk of contextual leakage: a subtle privacy threat where unique combinations of medical details enable patient re-identification even without explicit identifiers. Current benchmarks in healthcare heavily focus on accuracy, ignoring such privacy issues, despite strict regulations like Health Insurance Portability and Accountability Act (HIPAA) and General Data Protection Regulation (GDPR). To fill this gap, we present MedPriv-Bench, the first benchmark specifically designed to jointly evaluate privacy preservation and clinical utility in medical open-ended question answering. Our framework utilizes a multi-agent, human-in-the-loop pipeline to synthesize sensitive medical contexts and clinically relevant queries that create realistic privacy pressure. We establish a standardized evaluation protocol leveraging a pre-trained RoBERTa-Natural Language Inference (NLI) model as an automated judge to quantify data leakage, achieving an average of 85.9% alignment with human experts. Through an extensive evaluation of 9 representative LLMs, we demonstrate a pervasive privacy-utility trade-off. Our findings underscore the necessity of domain-specific benchmarks to validate the safety and efficacy of medical AI systems in privacy-sensitive environments.
Submitted 15 March, 2026;
originally announced March 2026.
-
NetSpatial: Spatially Conditional Traffic Generation for Cellular Planning and Operations
Authors:
Shiyuan Zhang,
Jiale Du,
Yuanwei Liu,
Kaibin Huang,
Hongyang Du
Abstract:
Base station (BS) deployment and operation are fundamental to network performance, yet they require accurate demand understanding, which remains difficult for operators. Cellular traffic in dense urban regions is well measured but highly dynamic, which undermines prediction-based management, whereas the scarcity of traffic measurements in emerging regions limits informed deployment decisions. Existing approaches therefore either depend on manual planning heuristics or use autoregressive predictors that fail to capture stochastic traffic variation. We present NetSpatial, a unified system for cellular planning and operation through spatially conditional traffic generation. NetSpatial exploits multimodal urban context, including satellite imagery and point of interest (POI) distributions, to learn how physical environment and functional semantics shape BS demand. It uses a multi-level flow-matching architecture that separates periodic structure from residual dynamics, enabling direct generation of long-horizon traffic sequences. NetSpatial supports two complementary decision scenarios, i.e., what-if analysis for deployment planning, which ranks candidate sites using generated traffic profiles, and what-to-do support for network operation, which uses generated traffic forecasts to guide BS sleep scheduling and load balancing. Experiments on real-world cellular traffic data show that NetSpatial reduces Jensen-Shannon Divergence (JSD) by 29.44% over the strongest baseline, generalizes across cities in zero-shot experiments, and enables up to 16.8% energy savings while maintaining over 80% quality of experience.
Submitted 14 March, 2026;
originally announced March 2026.
-
A Stable, High-Order Time-Stepping Scheme for the Drift-Diffusion Model in Modern Solar Cell Simulation
Authors:
Jun Du,
Jun Yan
Abstract:
This paper presents a one-dimensional transient drift--diffusion simulator for advanced solar cells, integrating a structure-preserving finite-volume spatial discretization with Scharfetter--Gummel--type fluxes and a high-order, L-stable implicit Runge--Kutta (Radau IIA) temporal integrator. The scheme ensures local charge conservation, handles sharp material interfaces, and achieves second-order spatial and fifth-order temporal convergence. Its accuracy is verified against the classical depletion approximation in a $p$--$n$ junction and validated through excellent agreement with an established simulator for an organic photovoltaic device. The framework's extensibility is demonstrated by incorporating exciton kinetics in organic solar cells, capturing multi-timescale dynamics, and by modeling mobile ions in perovskite solar cells, reproducing characteristic $J$--$V$ hysteresis without empirical parameters. This work provides a robust, high-order numerical foundation for simulating coupled charge, exciton, and ion transport in next-generation photovoltaic devices.
Submitted 9 March, 2026;
originally announced March 2026.
-
The orthogonal connectedness of polyhedral surfaces
Authors:
Julia Q. Du,
Xuemei He,
Xiaotian Song,
Daniela Stiller,
Liping Yuan,
Tudor Zamfirescu
Abstract:
Using the orthogonal connectedness, we introduce the notion of orthogonal decomposability of convex polytopes and study it in the case of Platonic and Archimedean solids. While doing so, we also encounter polytopes which are not orthogonally decomposable.
Submitted 8 March, 2026;
originally announced March 2026.
-
Extending gPET for Multi-Layer PET Simulation
Authors:
Satzhan Sitmukhambetov,
Junwei Du,
Mingwu Jin,
Yujie Chi
Abstract:
Depth-of-interaction (DOI) encoding is an effective strategy for reducing parallax error and preserving spatial resolution in positron emission tomography (PET), particularly in compact small-animal scanners. To enable efficient simulation-driven design of DOI-capable systems, we extend the GPU-accelerated Monte Carlo toolkit gPET to support flexible multi-layer detector geometries. The original three-level hierarchical detector model in gPET (panel-module-crystal) was expanded by introducing an intermediate "layer" level, enabling parameterized modeling of stacked scintillator architectures. The photon transport algorithm was correspondingly updated to sample interactions across multiple layers and detector panels while preserving GPU-efficient memory usage. The framework was validated using three scanner configurations: a conventional single-layer ring (H2RSPET-1CL), an aligned split-layer design (H2RSPET-1CL-split), and an offset dual-layer design (H2RSPET-2CL). System performance was evaluated following NEMA NU4-2008 protocols using sensitivity, spatial resolution, and Derenzo phantom simulations with CASToR-based maximum likelihood expectation maximization reconstruction. The H2RSPET-1CL and H2RSPET-1CL-split configurations produced statistically identical hit distributions, while H2RSPET-2CL exhibited the expected offset interaction patterns. Sensitivity of H2RSPET-2CL remained comparable to H2RSPET-1CL, generally within about 2-5 percent, while radial spatial resolution improved substantially (0.8-1.6 mm vs. 1.0-4.2 mm from the center to a 50 mm radial offset). Runtime performance remained essentially unchanged between configurations. The extended gPET framework therefore enables fast and flexible simulation of multi-layer PET detectors and supports efficient optimization of DOI-enabled PET system designs.
Submitted 7 March, 2026;
originally announced March 2026.
-
Deep Expert Injection for Anchoring Retinal VLMs with Domain-Specific Knowledge
Authors:
Shuai Lu,
Meng Wang,
Jia Guo,
Jiawei Du,
Bo Liu,
Shengzhu Yang,
Weihang Zhang,
Huazhu Fu,
Huiqi Li
Abstract:
Large Vision Language Models (LVLMs) show immense potential for automated ophthalmic diagnosis. However, their clinical deployment is severely hindered by a lack of domain-specific knowledge. In this work, we identify two structural deficiencies hindering reliable medical reasoning: 1) the Perception Gap, where general-purpose visual encoders fail to resolve fine-grained pathological cues (e.g., microaneurysms); and 2) the Reasoning Gap, where sparse visual evidence is progressively overridden by massive language priors in deeper transformer layers, leading to ungrounded hallucinations. To bridge these gaps, we propose EyExIn, a data-efficient framework designed to anchor retinal VLMs with expert knowledge via a Deep Expert Injection mechanism. Our architecture employs an Expert-Aware Dual-Stream encoding strategy that decouples visual representation into a general stream for anatomical context and a specialized expert stream for pathological semantics. To ensure high-fidelity integration, we design a Semantic-Adaptive Gated Fusion module, which dynamically amplifies subtle lesion signals while filtering irrelevant background noise. Furthermore, we introduce Adaptive Deep Expert Injection to embed persistent "Vision Anchors" by integrating fused visual features as residual biases directly into intermediate LLM layers. This mechanism creates a visual shortcut that forces the reasoning stack to remain strictly grounded in visual evidence. Extensive experiments across four benchmarks demonstrate that our model consistently outperforms massive proprietary systems. EyExIn significantly enhances domain-specific knowledge embedding and achieves state-of-the-art precision in ophthalmic visual question answering, advancing the development of trustworthy ophthalmic AI.
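A minimal sketch of the gated-fusion idea described above, with assumed shapes and a plain per-dimension sigmoid gate standing in for the paper's actual Semantic-Adaptive Gated Fusion module:

```python
import numpy as np

# Hedged toy: fuse a general stream with an expert stream through a learned
# gate. Dimensions, the gating form, and the weight matrix are assumptions.
rng = np.random.default_rng(0)
d = 8
general = rng.normal(size=d)   # anatomical-context features
expert = rng.normal(size=d)    # pathology-specific features
W = rng.normal(scale=0.1, size=(d, 2 * d))  # gate parameters (illustrative)

# Per-dimension gate in (0, 1), conditioned on both streams.
gate = 1.0 / (1.0 + np.exp(-W @ np.concatenate([general, expert])))
# Expert signal is amplified or suppressed dimension-wise before fusion.
fused = general + gate * expert
```

The gate lets subtle expert-stream signals pass through where they are informative while attenuating them elsewhere, which is the qualitative behavior the abstract attributes to the fusion module.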
Submitted 19 March, 2026; v1 submitted 7 March, 2026;
originally announced March 2026.
-
Proof-of-Guardrail in AI Agents and What (Not) to Trust from It
Authors:
Xisen Jin,
Michael Duan,
Qin Lin,
Aaron Chan,
Zhenglun Chen,
Junyi Du,
Xiang Ren
Abstract:
As AI agents become widely deployed as online services, users often rely on an agent developer's claim about how safety is enforced, which introduces a threat where safety measures are falsely advertised. To address the threat, we propose proof-of-guardrail, a system that enables developers to provide cryptographic proof that a response is generated after a specific open-source guardrail. To generate proof, the developer runs the agent and guardrail inside a Trusted Execution Environment (TEE), which produces a TEE-signed attestation of guardrail code execution verifiable by any user offline. We implement proof-of-guardrail for OpenClaw agents and evaluate latency overhead and deployment cost. Proof-of-guardrail ensures integrity of guardrail execution while keeping the developer's agent private, but we also highlight a risk of deception about safety, for example, when malicious developers actively jailbreak the guardrail. Code and demo video: https://github.com/SaharaLabsAI/Verifiable-ClawGuard
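The shape of the offline check can be sketched as follows. Note the heavy caveat: real TEEs (e.g., SGX or TDX) sign attestations with hardware-bound asymmetric keys verified against vendor certificates; the shared-secret HMAC below is only a stand-in to show how a code measurement and a response get cryptographically bound together:

```python
import hashlib
import hmac

# Toy attestation sketch. TEE_KEY stands in for the enclave's signing key;
# attest/verify are illustrative names, not the system's actual API.
TEE_KEY = b"hardware-bound-key"

def attest(guardrail_code: bytes, response: bytes) -> bytes:
    """Bind a hash (measurement) of the guardrail code to the response."""
    measurement = hashlib.sha256(guardrail_code).digest()
    return hmac.new(TEE_KEY, measurement + response, hashlib.sha256).digest()

def verify(guardrail_code: bytes, response: bytes, proof: bytes) -> bool:
    """Offline check: recompute the binding and compare in constant time."""
    return hmac.compare_digest(attest(guardrail_code, response), proof)
```

Any change to either the guardrail code or the response invalidates the proof, which is the property the system relies on.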
Submitted 5 March, 2026;
originally announced March 2026.
-
TEGA: A Tactile-Enhanced Grasping Assistant for Assistive Robotics via Sensor Fusion and Closed-Loop Haptic Feedback
Authors:
Hengxu You,
Tianyu Zhou,
Fang Xu,
Kaleb Smith,
Eric Jing Du
Abstract:
Recent advances in teleoperation have enabled sophisticated manipulation of dexterous robotic hands, with most systems concentrating on guiding finger positions to achieve desired grasp configurations. However, while accurate finger positioning is essential, this focus often overlooks the equally critical task of grasp force modulation, which is vital for handling objects of diverse hardness, texture, and shape. This limitation poses a significant challenge for users, especially individuals with upper-limb disabilities who lack natural tactile feedback and rely on indirect cues to infer appropriate force levels. To address this gap, we present the Tactile-Enhanced Grasping Assistant (TEGA), a closed-loop assistive teleoperation framework that fuses EMG-based intent2force inference with visuotactile sensing, mapping contact signals into real-time vibrotactile feedback through a wearable haptic vest so that users can intuitively and proportionally refine grasp force during manipulation. User studies confirm that the system substantially improves grasp stability and task success, underscoring its potential for assistive robotic applications.
Submitted 4 March, 2026;
originally announced March 2026.
-
The USTC-NERCSLIP Systems for the CHiME-9 MCoRec Challenge
Authors:
Ya Jiang,
Ruoyu Wang,
Jingxuan Zhang,
Jun Du,
Yi Han,
Zihao Quan,
Hang Chen,
Yeran Yang,
Kongzhi Zheng,
Zhuo Chen,
Yanhui Tu,
Shutong Niu,
Changfeng Xi,
Mengzhi Wang,
Zhongbin Wu,
Jieru Chen,
Henghui Zhi,
Weiyi Shi,
Shuhang Wu,
Genshun Wan,
Jia Pan,
Jianqing Gao
Abstract:
This report details our submission to the CHiME-9 MCoRec Challenge on recognizing and clustering multiple concurrent natural conversations within indoor social settings. Unlike conventional meetings centered on a single shared topic, this scenario contains multiple parallel dialogues--up to eight speakers across up to four simultaneous conversations--with a speech overlap rate exceeding 90%. To tackle this, we propose a multimodal cascaded system that leverages per-speaker visual streams extracted from synchronized 360 degree video together with single-channel audio. Our system improves three components of the pipeline by leveraging enhanced audio-visual pretrained models: Active Speaker Detection (ASD), Audio-Visual Target Speech Extraction (AVTSE), and Audio-Visual Speech Recognition (AVSR). The AVSR module further incorporates Whisper and LLM techniques to boost transcription accuracy. Our best single cascaded system achieves a Speaker Word Error Rate (WER) of 32.44% on the development set. By further applying ROVER to fuse outputs from diverse front-end and back-end variants, we reduce Speaker WER to 31.40%. Notably, our LLM-based zero-shot conversational clustering achieves a speaker clustering F1 score of 1.0, yielding a final Joint ASR-Clustering Error Rate (JACER) of 15.70%.
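The Speaker WER figures above are edit-distance-based word error rates. A generic WER computation (an illustration of the metric, not the challenge's official scorer) looks like:

```python
# Word error rate via Levenshtein distance over words:
# (substitutions + deletions + insertions) / reference length.
def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between the first i ref words and first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(h) + 1):
        dp[0][j] = j  # insert all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)
```

Speaker-attributed variants additionally require assigning each hypothesis word to the correct speaker before scoring.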
Submitted 1 March, 2026;
originally announced March 2026.
-
AIoT-based Continuous, Contextualized, and Explainable Driving Assessment for Older Adults
Authors:
Yimeng Liu,
Fangwei Zhang,
Maolin Gan,
Jialuo Du,
Jingkai Lin,
Yawen Wang,
Fei Sun,
Honglei Chen,
Linda Hill,
Ruofeng Liu,
Tianxing Li,
Zhichao Cao
Abstract:
The world is undergoing a major demographic shift as older adults become a rapidly growing share of the population, creating new challenges for driving safety. In car-dependent regions such as the United States, driving remains essential for independence, access to services, and social participation. At the same time, aging can introduce gradual changes in vision, attention, reaction time, and driving control that quietly reduce safety. Today's assessment methods rely largely on infrequent clinic visits or simple screening tools, offering only a brief snapshot and failing to reflect how an older adult actually drives on the road. Our work starts from the observation that everyday driving provides a continuous record of functional ability and captures how a driver responds to traffic, navigates complex roads, and manages routine behavior. Leveraging this insight, we propose AURA, an Artificial Intelligence of Things (AIoT) framework for continuous, real-world assessment of driving safety among older adults. AURA integrates richer in-vehicle sensing, multi-scale behavioral modeling, and context-aware analysis to extract detailed indicators of driving performance from routine trips. It organizes fine-grained actions into longer behavioral trajectories and separates age-related performance changes from situational factors such as traffic, road design, or weather. By integrating sensing, modeling, and interpretation within a privacy-preserving edge architecture, AURA provides a foundation for proactive, individualized support that helps older adults drive safely. This paper outlines the design principles, challenges, and research opportunities needed to build reliable, real-world monitoring systems that promote safer aging behind the wheel.
Submitted 28 February, 2026;
originally announced March 2026.
-
TEFL: Prediction-Residual-Guided Rolling Forecasting for Multi-Horizon Time Series
Authors:
Xiannan Huang,
Shen Fang,
Shuhan Qiu,
Chengcheng Yu,
Jiayuan Du,
Chao Yang
Abstract:
Time series forecasting plays a critical role in domains such as transportation, energy, and meteorology. Despite their success, modern deep forecasting models are typically trained to minimize point-wise prediction loss without leveraging the rich information contained in past prediction residuals from rolling forecasts - residuals that reflect persistent biases, unmodeled patterns, or evolving dynamics. We propose TEFL (Temporal Error Feedback Learning), a unified learning framework that explicitly incorporates these historical residuals into the forecasting pipeline during both training and evaluation. To make this practical in deep multi-step settings, we address three key challenges: (1) selecting observable multi-step residuals under the partial observability of rolling forecasts, (2) integrating them through a lightweight low-rank adapter to preserve efficiency and prevent overfitting, and (3) designing a two-stage training procedure that jointly optimizes the base forecaster and error module. Extensive experiments across 10 real-world datasets and 5 backbone architectures show that TEFL consistently improves accuracy, reducing MAE by 5-10% on average. Moreover, it demonstrates strong robustness under abrupt changes and distribution shifts, with error reductions exceeding 10% (up to 19.5%) in challenging scenarios. By embedding residual-based feedback directly into the learning process, TEFL offers a simple, general, and effective enhancement to modern deep forecasting systems.
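The core mechanism (feeding observable past residuals back into a rolling forecast) can be sketched with a toy example. The naive base forecaster, the fixed-weight bias correction, and the series below are illustrative assumptions, not TEFL's actual architecture:

```python
import numpy as np

def base_forecast(history, horizon):
    """Naive base forecaster: repeat the last observed value."""
    return np.full(horizon, history[-1], dtype=float)

def corrected_forecast(history, horizon, past_residuals, alpha=1.0):
    """Add a damped average of observable past residuals to the base forecast."""
    base = base_forecast(history, horizon)
    if len(past_residuals) == 0:
        return base
    bias = np.mean(past_residuals)  # persistent-bias estimate from rolling residuals
    return base + alpha * bias

# Rolling evaluation on a trending series: the naive forecaster is persistently
# biased low, and residual feedback corrects the bias after one step.
series = np.arange(20, dtype=float)
residuals = []
err_base, err_fb = 0.0, 0.0
for t in range(10, 19):
    hist = series[:t]
    b = base_forecast(hist, 1)[0]
    f = corrected_forecast(hist, 1, residuals)[0]
    y = series[t]
    err_base += abs(y - b)
    err_fb += abs(y - f)
    residuals.append(y - b)  # base-model residual, observable after the fact
```

In this toy run the feedback-corrected forecaster accumulates far less absolute error than the base forecaster, which is the qualitative effect the framework exploits in deep multi-step settings.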
Submitted 25 February, 2026;
originally announced February 2026.
-
Tacmap: Bridging the Tactile Sim-to-Real Gap via Geometry-Consistent Penetration Depth Map
Authors:
Lei Su,
Zhijie Peng,
Renyuan Ren,
Shengping Mao,
Juan Du,
Kaifeng Zhang,
Xuezhou Zhu
Abstract:
Vision-Based Tactile Sensors (VBTS) are essential for achieving dexterous robotic manipulation, yet the tactile sim-to-real gap remains a fundamental bottleneck. Current tactile simulations suffer from a persistent dilemma: simplified geometric projections lack physical authenticity, while high-fidelity Finite Element Methods (FEM) are computationally prohibitive for large-scale reinforcement learning. In this work, we present Tacmap, a high-fidelity, computationally efficient tactile simulation framework anchored in volumetric penetration depth. Our key insight is to bridge the tactile sim-to-real gap by unifying both domains through a shared deform map representation. Specifically, we compute 3D intersection volumes as depth maps in simulation, while in the real world, we employ an automated data-collection rig to learn a robust mapping from raw tactile images to ground-truth depth maps. By aligning simulation and the real world in this unified geometric space, Tacmap minimizes domain shift while maintaining physical consistency. Quantitative evaluations across diverse contact scenarios demonstrate that Tacmap's deform maps closely mirror real-world measurements. Moreover, we validate the utility of Tacmap through an in-hand rotation task, where a policy trained exclusively in simulation achieves zero-shot transfer to a physical robot.
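A geometric toy of a penetration depth map, reduced to height-field subtraction (Tacmap computes full 3D intersection volumes; this only illustrates the "positive only where interpenetrating" idea):

```python
import numpy as np

# Undeformed sensor surface at z = 0, and a sphere-like indenter pressed into it.
# Grid size and the indenter profile are made-up illustration values.
gel = np.zeros((8, 8))
yy, xx = np.mgrid[0:8, 0:8]
indenter = 2.0 - 0.2 * ((xx - 4) ** 2 + (yy - 4) ** 2)  # peak height 2.0 at (4, 4)

# Penetration depth: indenter height minus surface, clipped at zero so the map
# is positive only where the indenter actually interpenetrates the gel.
depth_map = np.clip(indenter - gel, 0.0, None)
```

Both simulation and the learned real-world mapping target this kind of depth-map representation, so the two domains meet in one geometric space.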
Submitted 25 February, 2026;
originally announced February 2026.
-
Spiking Graph Predictive Coding for Reliable OOD Generalization
Authors:
Jing Ren,
Jiapeng Du,
Bowen Li,
Ziqi Xu,
Xin Zheng,
Hong Jia,
Suyu Ma,
Xiwei Xu,
Feng Xia
Abstract:
Graphs provide a powerful basis for modeling Web-based relational data, with expressive GNNs supporting effective learning in dynamic web environments. However, real-world deployment is hindered by pervasive out-of-distribution (OOD) shifts, where evolving user activity and changing content semantics alter feature distributions and labeling criteria. These shifts often lead to unstable or overconfident predictions, undermining the trustworthiness required for Web4Good applications. Achieving reliable OOD generalization demands principled and interpretable uncertainty estimation; however, existing methods are largely post-hoc, insensitive to distribution shifts, and unable to explain where uncertainty arises, especially in high-stakes settings. To address these limitations, we introduce SpIking GrapH predicTive coding (SIGHT), an uncertainty-aware plug-in graph learning module for reliable OOD generalization. SIGHT performs iterative, error-driven correction over spiking graph states, enabling models to expose internal mismatch signals that reveal where predictions become unreliable. Across multiple graph benchmarks and diverse OOD scenarios, SIGHT consistently enhances predictive accuracy, uncertainty estimation, and interpretability when integrated with GNNs.
Submitted 22 February, 2026;
originally announced February 2026.
-
AutoNumerics: An Autonomous, PDE-Agnostic Multi-Agent Pipeline for Scientific Computing
Authors:
Jianda Du,
Youran Sun,
Haizhao Yang
Abstract:
PDEs are central to scientific and engineering modeling, yet designing accurate numerical solvers typically requires substantial mathematical expertise and manual tuning. Recent neural network-based approaches improve flexibility but often demand high computational cost and suffer from limited interpretability. We introduce \texttt{AutoNumerics}, a multi-agent framework that autonomously designs, implements, debugs, and verifies numerical solvers for general PDEs directly from natural language descriptions. Unlike black-box neural solvers, our framework generates transparent solvers grounded in classical numerical analysis. We introduce a coarse-to-fine execution strategy and a residual-based self-verification mechanism. Experiments on 24 canonical and real-world PDE problems demonstrate that \texttt{AutoNumerics} achieves competitive or superior accuracy compared to existing neural and LLM-based baselines, and correctly selects numerical schemes based on PDE structural properties, suggesting its viability as an accessible paradigm for automated PDE solving.
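The residual-based self-verification idea can be sketched generically (this is not the paper's agent code): solve a discretized PDE, then substitute the solution back into the discrete operator and accept it only if the residual is small:

```python
import numpy as np

# Model problem: -u'' = f on (0, 1) with u(0) = u(1) = 0 and
# f = pi^2 sin(pi x), so the exact solution is sin(pi x).
n = 100
h = 1.0 / n
x = np.linspace(0.0, 1.0, n + 1)
f = np.pi**2 * np.sin(np.pi * x[1:-1])

# Standard second-order finite-difference Laplacian (tridiagonal).
A = (np.diag(np.full(n - 1, 2.0))
     - np.diag(np.ones(n - 2), 1)
     - np.diag(np.ones(n - 2), -1)) / h**2
u_inner = np.linalg.solve(A, f)

# Self-verification: the discrete residual needs no knowledge of the exact
# solution, so an autonomous pipeline can use it as an acceptance check.
residual = np.linalg.norm(A @ u_inner - f) / np.linalg.norm(f)
verified = residual < 1e-8
err = np.max(np.abs(u_inner - np.sin(np.pi * x[1:-1])))
```

The residual check catches implementation bugs (wrong assembly, wrong boundary handling) even when no reference solution exists, which is the role such a check plays in an automated solver-generation loop.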
Submitted 19 February, 2026;
originally announced February 2026.
-
Dislocation-ledge coupling drives non-conservative migration of semicoherent precipitate interfaces
Authors:
Jin-Yu Zhang,
Juan Du,
Lin Yang,
Frédéric Mompiou,
Shigenobu Ogata,
Wen-Zheng Zhang
Abstract:
Precipitate shape and size control the strength and stability of many structural alloys, yet the microscopic mechanism by which semicoherent precipitate interfaces migrate remains unclear. In particular, how dense interfacial dislocation networks move while accommodating transformation strain has resisted direct, time-resolved characterization. Here, we show that non-conservative motion of interfacial dislocations is intrinsically coupled to the nucleation and lateral propagation of nanoscale growth ledges, providing a defect-based kinetic description of lath growth. Phase-field-crystal simulations of a prototypical face-centered cubic/body-centered cubic (FCC/BCC) transformation resolve strongly anisotropic interface kinetics: the end face advances continuously along the lath long axis, whereas facets thicken by discrete ledge sweeps accompanied by mixed glide-climb reactions in a closed dislocation network. Crystallographic analyses predict the dislocation arrangements, rationalize the anisotropy via the geometry of misfit localization, and show how dislocation motion accommodates the transformation strain. In situ transmission electron microscopy of austenite precipitates in duplex stainless steel captures rapid ledge propagation on habit planes, consistent with the predicted migration mode. Our results bridge point-defect transport, dislocation reactions, and interface mobility, enabling quantitative, transferable predictions of precipitate morphology evolution.
Submitted 17 February, 2026;
originally announced February 2026.
-
GLM-5: from Vibe Coding to Agentic Engineering
Authors:
GLM-5-Team,
:,
Aohan Zeng,
Xin Lv,
Zhenyu Hou,
Zhengxiao Du,
Qinkai Zheng,
Bin Chen,
Da Yin,
Chendi Ge,
Chenghua Huang,
Chengxing Xie,
Chenzheng Zhu,
Congfeng Yin,
Cunxiang Wang,
Gengzheng Pan,
Hao Zeng,
Haoke Zhang,
Haoran Wang,
Huilong Chen,
Jiajie Zhang,
Jian Jiao,
Jiaqi Guo,
Jingsen Wang,
Jingzhao Du
, et al. (162 additional authors not shown)
Abstract:
We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that further improve RL quality, enabling the model to learn from complex, long-horizon interactions more effectively. Through these innovations, GLM-5 achieves state-of-the-art performance on major open benchmarks. Most critically, GLM-5 demonstrates unprecedented capability in real-world coding tasks, surpassing previous baselines in handling end-to-end software engineering challenges. Code, models, and more information are available at https://github.com/zai-org/GLM-5.
Submitted 24 February, 2026; v1 submitted 17 February, 2026;
originally announced February 2026.
-
Sub-1-Angstrom-Resolution Imaging Reveals Phase Contrast Transition in Ice Ih Caused by Basal Stacking Faults
Authors:
Jingshan S. Du,
Suvo Banik,
Lehan Yao,
Shuai Zhang,
Subramanian K. R. S. Sankaranarayanan,
James J. De Yoreo
Abstract:
Phase-contrast transmission electron microscopy (TEM) of hexagonal ice (Ih) along [0001] sometimes shows a honeycomb-like pattern, often interpreted as individual oxygen columns in single crystals. Here, we show that this pattern commonly arises from intrinsic basal stacking faults instead. A translational boundary separating domains of comparable thickness, with an in-plane offset of $(\frac{2}{3} a_{1} + \frac{1}{3} a_{2})$, produces this honeycomb-like contrast. Stacking domains translated in nonequivalent directions yields patterns resembling cubic ice (Ic) along [111] but with a 3-fold symmetry. We imaged this structure at a record-breaking line resolution of 89 picometers, finer than the O-H covalent bond length. These findings highlight the defect tolerance of ice's molecular packing and clarify the structural relationships among hexagonal, stacking-disordered, and cubic ice phases. This resolution milestone opens new avenues for characterizing subtle structural perturbations of water in the solid state.
Submitted 23 February, 2026; v1 submitted 16 February, 2026;
originally announced February 2026.
-
Modular Nahm sums for symmetrizable matrices of indices $({2,\ldots, 2},1)$ and $({1,\ldots, 1},2)$
Authors:
Julia Q. D. Du,
Kathy Q. Ji,
Erin Y. Y. Shen,
Clara X. Y. Xu
Abstract:
In this paper, we present three families of modular Nahm sums for symmetrizable matrices with arbitrary rank $r\geq 2$ of indices $({2,\ldots, 2},1)$ and $({1,\ldots, 1},2)$. Specifically, the cases corresponding to $r = 2$ and $r = 3$ of these families have been previously demonstrated by Mizuno, Warnaar, and B. Wang-L. Wang. Building upon these three families, we construct two vector-valued automorphic forms, one of which is a vector-valued modular function when $r$ is odd.
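For reference, the standard definition of a rank-$r$ Nahm sum associated with a positive definite symmetric (or symmetrizable) $r \times r$ matrix $A$, a vector $B$, and a scalar $C$ is

```latex
f_{A,B,C}(q) \;=\;
\sum_{n=(n_1,\ldots,n_r)\in\mathbb{Z}_{\geq 0}^{r}}
\frac{q^{\frac{1}{2}\, n^{\mathrm{T}} A\, n \;+\; n^{\mathrm{T}} B \;+\; C}}
     {(q;q)_{n_1} \cdots (q;q)_{n_r}},
\qquad
(q;q)_{n} = \prod_{k=1}^{n} \bigl(1 - q^{k}\bigr).
```

Nahm's conjecture concerns which triples $(A, B, C)$ make $f_{A,B,C}$ modular; the families above exhibit explicit modular examples for the stated index patterns.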
Submitted 6 March, 2026; v1 submitted 16 February, 2026;
originally announced February 2026.
-
Stacking theory for bilayer two-dimensional magnets
Authors:
Jun-Xi Du,
Sike Zeng,
Yu-Jun Zhao
Abstract:
Two-dimensional unconventional magnetism has recently attracted growing interest due to its intriguing physical properties and promising applications in spintronics. However, existing studies on stacking-induced unconventional magnetism mainly focus on specific materials and stacking configurations. Here, we develop a general symmetry-based stacking theory for two-dimensional magnets. We first introduce spin layer groups as the fundamental symmetry framework, providing the essential magnetic symmetry information for the stacking theory. Based on this framework, we construct the complete set of 448 collinear spin layer groups for describing two-dimensional collinear magnets. Subsequently, we develop a general magnetic stacking theory applicable to arbitrary magnetic systems and derive its general solutions. Using CrF$_3$ as an illustrative example, we show how this theory enables designs of two-dimensional unconventional magnetism, as validated by first-principles calculations. We realize two-dimensional fully compensated ferrimagnetism through our stacking theory. Our work provides a general symmetry-guided platform for discovering and designing stacking-induced unconventional magnetism.
Submitted 12 February, 2026;
originally announced February 2026.
-
Eliminating Delocalization Error through Localized Orbital Scaling Correction with Orbital Relaxation from Linear Response
Authors:
Yichen Fan,
Jincheng Yu,
Jiayi Du,
Weitao Yang
Abstract:
Despite the great success Kohn-Sham density functional theory (KS-DFT) has achieved, the delocalization error remains a major challenge for commonly used density functional approximations (DFAs), resulting in systematic errors in ionization energies, electron affinities, band structures, and charge distributions. A recently developed localized orbital scaling correction (LOSC) method, namely linear response LOSC (lrLOSC), addresses these challenges by incorporating a functional correction that includes the screening effect and orbital localization within the LOSC framework. The method has been shown to provide accurate descriptions of bulk systems and core-level binding energies in small molecular systems. In this work, we extend the applicability of lrLOSC to a broader range of molecular systems, spanning various sizes, with a focus on the corrections to valence orbital energies and total energies. To enable the calculation of large chemical systems, we developed an efficient implementation of lrLOSC with computational costs comparable to standard KS-DFT calculations. Numerical results show that, while screening provides modest improvements for small molecules, it becomes critical for achieving high accuracy in larger molecules, from linear to three-dimensional systems. With the screening effect well captured in a unified way, lrLOSC provides accurate descriptions for a wide range of chemical systems, including organic molecular systems of varying sizes and transition-metal oxide complexes, establishing it as a powerful tool for enhancing the reliability of computational simulations of chemical systems.
Submitted 11 February, 2026;
originally announced February 2026.
-
Generalizable and Robust Beam Prediction for 6G Networks: A Deep-Learning Framework with Positioning Feature Fusion
Authors:
Yanliang Jin,
Yunfan Li,
Jiang Jun,
Yuan Gao,
Shengli Liu,
Jianbo Du,
Zhaohui Yang,
Shugong Xu
Abstract:
Beamforming (BF) is essential for enhancing system capacity in fifth generation (5G) and beyond wireless networks, yet exhaustive beam training in ultra-massive multiple-input multiple-output (MIMO) systems incurs substantial overhead. To address this challenge, we propose a deep learning based framework that leverages position-aware features to improve beam prediction accuracy while reducing training costs. The proposed approach uses spatial coordinate labels to supervise a position extraction branch and integrates the resulting representations with beam-domain features through a feature fusion module. A dual-branch RegNet architecture is adopted to jointly learn location related and communication features for beam prediction. Two fusion strategies, namely adaptive fusion and adversarial fusion, are introduced to enable efficient feature integration. The proposed framework is evaluated on datasets generated by the DeepMIMO simulator across four urban scenarios at 3.5 GHz following 3GPP specifications, where both reference signal received power and user equipment location information are available. Simulation results under both in-distribution and out-of-distribution settings demonstrate that the proposed approach consistently outperforms traditional baselines and achieves more accurate and robust beam prediction by effectively incorporating positioning information.
Submitted 10 February, 2026;
originally announced February 2026.
-
Dataset Distillation via Relative Distribution Matching and Cognitive Heritage
Authors:
Qianxin Xia,
Jiawei Du,
Yuhan Zhang,
Jielei Wang,
Guoming Lu
Abstract:
Dataset distillation seeks to synthesize a highly compact dataset that achieves performance comparable to the original dataset on downstream tasks. For classification tasks that use pre-trained self-supervised models as backbones, previous linear gradient matching optimizes synthetic images by encouraging them to mimic the gradient updates induced by real images on the linear classifier. However, this batch-level formulation requires loading thousands of real images and applying multiple rounds of differentiable augmentations to synthetic images at each distillation step, leading to substantial computational and memory overhead. In this paper, we introduce statistical flow matching, a stable and efficient supervised learning framework that optimizes synthetic images by aligning constant statistical flows from target class centers to non-target class centers in the original data. Our approach loads raw statistics only once and performs a single augmentation pass on the synthetic data, achieving performance comparable to or better than the state-of-the-art methods with 10x lower GPU memory usage and 4x shorter runtime. Furthermore, we propose a classifier inheritance strategy that reuses the classifier trained on the original dataset for inference, requiring only an extremely lightweight linear projector and marginal storage while achieving substantial performance gains.
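A loose toy of center-based flow alignment as described above; the loss and setup below are an illustrative reading of "aligning flows from target to non-target class centers", not the paper's method. Synthetic points are optimized so their offsets to each non-target class center match the corresponding real target-center offsets:

```python
import numpy as np

# Two-class toy in feature space (encoder omitted: features are the data).
rng = np.random.default_rng(0)
d, n_per_class = 4, 50
centers = {0: rng.normal(0, 1, d), 1: rng.normal(3, 1, d)}
real = {c: mu + rng.normal(0, 0.1, (n_per_class, d)) for c, mu in centers.items()}

# Raw statistics are loaded once: per-class means of the real data.
real_means = {c: x.mean(axis=0) for c, x in real.items()}

# One synthetic vector per class; match its flow to each non-target center
# against the real target-center flow, by plain gradient descent.
synth = {c: rng.normal(0, 1, d) for c in real_means}
lr = 0.1
for _ in range(500):
    for c in synth:
        grad = np.zeros(d)
        for k, mu_k in real_means.items():
            if k == c:
                continue
            flow_real = real_means[c] - mu_k      # target-to-non-target flow
            flow_syn = synth[c] - mu_k            # same flow from the synthetic point
            grad += 2.0 * (flow_syn - flow_real)  # gradient of squared flow mismatch
        synth[c] -= lr * grad

err = max(np.linalg.norm(synth[c] - real_means[c]) for c in synth)
```

Because the flow targets are constant statistics, no real images need to be reloaded during distillation steps, which is the source of the efficiency claim.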
Submitted 5 February, 2026;
originally announced February 2026.
-
ERNIE 5.0 Technical Report
Authors:
Haifeng Wang,
Hua Wu,
Tian Wu,
Yu Sun,
Jing Liu,
Dianhai Yu,
Yanjun Ma,
Jingzhou He,
Zhongjun He,
Dou Hong,
Qiwen Liu,
Shuohuan Wang,
Junyuan Shang,
Zhenyu Zhang,
Yuchen Ding,
Jinle Zeng,
Jiabin Yang,
Liang Shen,
Ruibiao Chen,
Weichong Yin,
Siyu Ding,
Dai Dai,
Shikun Feng,
Siqi Bao,
Bolei He
, et al. (413 additional authors not shown)
Abstract:
In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model designed for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, based on an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm. Within a single pre-training run, the model learns a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency in memory- or time-constrained scenarios. Moreover, we systematically address the challenges of scaling reinforcement learning to unified foundation models, ensuring efficient and stable post-training under ultra-sparse MoE architectures and diverse multimodal settings. Extensive experiments demonstrate that ERNIE 5.0 achieves strong and balanced performance across multiple modalities. To the best of our knowledge, among publicly disclosed models, ERNIE 5.0 represents the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation. To facilitate further research, we present detailed visualizations of modality-agnostic expert routing in the unified model, alongside comprehensive empirical analysis of elastic training, aiming to offer profound insights to the community.
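The elastic training paradigm described above can be sketched in miniature: within one training run, each step samples a sub-model configuration (depth, routing sparsity, expert capacity) from a predefined family, so every family member is trained. The family values, function names, and the per-step sampling scheme here are illustrative assumptions, not details from the report.

```python
import random

# Hypothetical family of sub-model configurations trained jointly.
SUB_MODEL_FAMILY = [
    {"depth": 24, "experts_per_token": 2, "expert_capacity": 64},
    {"depth": 32, "experts_per_token": 4, "expert_capacity": 128},
    {"depth": 48, "experts_per_token": 8, "expert_capacity": 256},
]

def sample_sub_model(family):
    """Pick one sub-model configuration for the current training step."""
    return random.choice(family)

def train_step(batch, config):
    """Placeholder: run forward/backward only through `config`'s slice."""
    return f"trained depth={config['depth']} on {len(batch)} tokens"

random.seed(0)
for _ in range(3):
    print(train_step(range(8), sample_sub_model(SUB_MODEL_FAMILY)))
```

At deployment time, one would then export whichever family member fits the memory or latency budget, without retraining.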
Submitted 4 February, 2026;
originally announced February 2026.
-
TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization
Authors:
Shichao Ma,
Zhiyuan Ma,
Ming Yang,
Xiaofan Li,
Xing Wu,
Jintao Du,
Yu Cheng,
Weiqiang Wang,
Qiliang Liu,
Zhengyang Zhou,
Yang Wang
Abstract:
Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. However, current reinforcement learning (RL) frameworks for search-augmented reasoning predominantly rely on sparse outcome-level rewards, leading to a "Double Homogenization Dilemma." This manifests as (1) Process homogenization, where the thinking, reasoning, and tool use involved in generation are ignored, and (2) Intra-group homogenization, where coarse-grained outcome rewards lead to inefficient intra-group advantage estimation in methods like Group Relative Policy Optimization (GRPO) during sampling. To address this, we propose Turn-level Stage-aware Policy Optimization (TSPO). TSPO introduces the First-Occurrence Latent Reward (FOLR) mechanism, allocating partial rewards to the step where the ground-truth answer first appears, thereby preserving process-level signals and increasing reward variance within groups without requiring external reward models or any annotations. Extensive experiments demonstrate that TSPO significantly outperforms state-of-the-art baselines, achieving average performance gains of 24% and 13.6% on Qwen2.5-3B and 7B models, respectively. Code is available at https://github.com/Flipped-May/TSPO.
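The FOLR mechanism described in the abstract can be sketched as a simple turn-level reward shaper. This is an illustrative reading, not the authors' code: the partial/outcome split and the substring-containment check for "first appearance" are assumptions made for the example.

```python
def folr_rewards(turn_outputs, ground_truth, outcome_reward, partial=0.5):
    """First-Occurrence Latent Reward (illustrative sketch).

    Credits a partial reward at the first turn whose output contains the
    ground-truth answer (the process-level signal), and the remainder at
    the final turn (the usual outcome-level signal).
    """
    rewards = [0.0] * len(turn_outputs)
    for t, out in enumerate(turn_outputs):
        if ground_truth in out:
            rewards[t] += partial * outcome_reward  # first occurrence bonus
            break
    rewards[-1] += (1.0 - partial) * outcome_reward  # terminal outcome reward
    return rewards

turns = ["searching...", "found: Paris is the capital", "final answer: Paris"]
print(folr_rewards(turns, "Paris", outcome_reward=1.0))  # [0.0, 0.5, 0.5]
```

Spreading credit across turns this way differentiates rollouts that reach the answer at different stages, which is what raises reward variance within a GRPO group.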
Submitted 6 April, 2026; v1 submitted 30 January, 2026;
originally announced January 2026.