-
SlicerOrbitSurgerySim: An Open-Source Platform for Virtual Registration and Quantitative Comparison of Preformed Orbital Plates
Authors:
Chi Zhang,
Braedon Gunn,
Andrew M. Read-Fuller
Abstract:
Poor adaptation of orbital implants remains a major contributor to postoperative complications and revision surgery. Although preformed orbital plates are widely used to reduce cost and operative time compared with customized implants, surgeons currently lack publicly available tools and standardized metrics to quantitatively compare plate fit across vendors, sizes, and patient anatomy. We develop…
▽ More
Poor adaptation of orbital implants remains a major contributor to postoperative complications and revision surgery. Although preformed orbital plates are widely used to reduce cost and operative time compared with customized implants, surgeons currently lack publicly available tools and standardized metrics to quantitatively compare plate fit across vendors, sizes, and patient anatomy. We developed SlicerOrbitSurgerySim, an open-source extension for the 3D Slicer platform that enables interactive virtual registration, evaluation, and comparison of multiple preformed orbital plates in a patient-specific virtual planning environment. The software generates reproducible quantitative plate-to-orbit distance metrics and visualization tools that support both patient-specific planning and population-level statistical analysis of plate adaptability. By facilitating objective comparison of implant designs and placement strategies, this tool aims to improve preoperative decision-making, reduce intraoperative plate modification, and promote collaborative research and surgical education. Pilot studies, sample datasets, and detailed tutorials are provided to support testing, transparency, and reproducibility.
△ Less
Submitted 22 December, 2025;
originally announced December 2025.
-
An Agentic Framework for Autonomous Materials Computation
Authors:
Zeyu Xia,
Jinzhe Ma,
Congjie Zheng,
Shufei Zhang,
Yuqiang Li,
Hang Su,
P. Hu,
Changshui Zhang,
Xingao Gong,
Wanli Ouyang,
Lei Bai,
Dongzhan Zhou,
Mao Su
Abstract:
Large Language Models (LLMs) have emerged as powerful tools for accelerating scientific discovery, yet their static knowledge and hallucination issues hinder autonomous research applications. Recent advances integrate LLMs into agentic frameworks, enabling retrieval, reasoning, and tool use for complex scientific workflows. Here, we present a domain-specialized agent designed for reliable automati…
▽ More
Large Language Models (LLMs) have emerged as powerful tools for accelerating scientific discovery, yet their static knowledge and hallucination issues hinder autonomous research applications. Recent advances integrate LLMs into agentic frameworks, enabling retrieval, reasoning, and tool use for complex scientific workflows. Here, we present a domain-specialized agent designed for reliable automation of first-principles materials computations. By embedding domain expertise, the agent ensures physically coherent multi-step workflows and consistently selects convergent, well-posed parameters, thereby enabling reliable end-to-end computational execution. A new benchmark of diverse computational tasks demonstrates that our system significantly outperforms standalone LLMs in both accuracy and robustness. This work establishes a verifiable foundation for autonomous computational experimentation and represents a key step toward fully automated scientific discovery.
△ Less
Submitted 22 December, 2025;
originally announced December 2025.
-
LOCO: A Low-Cost SNU-Self-Resilient Latch Using an Output-Split C-Element
Authors:
Ruijun Ma,
Xin Chen,
Xiaoqing Wen,
Hui Xu,
Shengnan Ye,
Chuanjian Zhang,
Senling Wang
Abstract:
As the CMOS technology enters nanometer scales, integrated circuits (ICs) become increasingly sensitive to radiation-induced soft errors, which can corrupt the state of storage elements and cause severe reliability issues. Many hardened designs have been proposed to mitigate soft errors by using filtering elements. However, existing filtering elements only protect their inputs against soft errors…
▽ More
As the CMOS technology enters nanometer scales, integrated circuits (ICs) become increasingly sensitive to radiation-induced soft errors, which can corrupt the state of storage elements and cause severe reliability issues. Many hardened designs have been proposed to mitigate soft errors by using filtering elements. However, existing filtering elements only protect their inputs against soft errors and leave their outputs unprotected. Therefore, additional filtering elements must be added to protect outputs, resulting in extra overhead. In this paper, we first propose a novel Output-Split C-element (OSC) to protect both its input and output nodes, and then a novel LOw-COst single-node-upset (SNU) self-resilient latch (LOCO) to use OSCs to achieve both soft error resilience and low overhead. The usage of OSCs effectively reduce the short-circuit current of the LOCO latch during switching activities. Furthermore, the usage of clock gating and high-speed path reduces power consumption and delay, respectively. Compared with state-of-the-art SNU-resilient hardened designs, the LOCO latch achieves 19% fewer transistors, 63.58% lower power, 74% less delay, and 92% lower power-delay-product (PDP) on average. In addition, the LOCO latch exhibits better stability under variations in PVT (Process, Voltage, and Temperature).
△ Less
Submitted 22 December, 2025;
originally announced December 2025.
-
Translating Flow to Policy via Hindsight Online Imitation
Authors:
Yitian Zheng,
Zhangchen Ye,
Weijun Dong,
Shengjie Wang,
Yuyang Liu,
Chongjie Zhang,
Chuan Wen,
Yang Gao
Abstract:
Recent advances in hierarchical robot systems leverage a high-level planner to propose task plans and a low-level policy to generate robot actions. This design allows training the planner on action-free or even non-robot data sources (e.g., videos), providing transferable high-level guidance. Nevertheless, grounding these high-level plans into executable actions remains challenging, especially wit…
▽ More
Recent advances in hierarchical robot systems leverage a high-level planner to propose task plans and a low-level policy to generate robot actions. This design allows training the planner on action-free or even non-robot data sources (e.g., videos), providing transferable high-level guidance. Nevertheless, grounding these high-level plans into executable actions remains challenging, especially with the limited availability of high-quality robot data. To this end, we propose to improve the low-level policy through online interactions. Specifically, our approach collects online rollouts, retrospectively annotates the corresponding high-level goals from achieved outcomes, and aggregates these hindsight-relabeled experiences to update a goal-conditioned imitation policy. Our method, Hindsight Flow-conditioned Online Imitation (HinFlow), instantiates this idea with 2D point flows as the high-level planner. Across diverse manipulation tasks in both simulation and physical world, our method achieves more than $2\times$ performance improvement over the base policy, significantly outperforming the existing methods. Moreover, our framework enables policy acquisition from planners trained on cross-embodiment video data, demonstrating its potential for scalable and transferable robot learning.
△ Less
Submitted 22 December, 2025;
originally announced December 2025.
-
Understanding Chain-of-Thought in Large Language Models via Topological Data Analysis
Authors:
Chenghao Li,
Chaoning Zhang,
Yi Lu,
Shuxu Chen,
Xudong Wang,
Jiaquan Zhang,
Zhicheng Wang,
Zhengxun Jin,
Kuien Liu,
Sung-Ho Bae,
Guoqing Wang,
Yang Yang,
Hen Tao Shen
Abstract:
With the development of large language models (LLMs), particularly with the introduction of the long reasoning chain technique, the reasoning ability of LLMs in complex problem-solving has been significantly enhanced. While acknowledging the power of long reasoning chains, we cannot help but wonder: Why do different reasoning chains perform differently in reasoning? What components of the reasonin…
▽ More
With the development of large language models (LLMs), particularly with the introduction of the long reasoning chain technique, the reasoning ability of LLMs in complex problem-solving has been significantly enhanced. While acknowledging the power of long reasoning chains, we cannot help but wonder: Why do different reasoning chains perform differently in reasoning? What components of the reasoning chains play a key role? Existing studies mainly focus on evaluating reasoning chains from a functional perspective, with little attention paid to their structural mechanisms. To address this gap, this work is the first to analyze and evaluate the quality of the reasoning chain from a structural perspective. We apply persistent homology from Topological Data Analysis (TDA) to map reasoning steps into semantic space, extract topological features, and analyze structural changes. These changes reveal semantic coherence, logical redundancy, and identify logical breaks and gaps. By calculating homology groups, we assess connectivity and redundancy at various scales, using barcode and persistence diagrams to quantify stability and consistency. Our results show that the topological structural complexity of reasoning chains correlates positively with accuracy. More complex chains identify correct answers sooner, while successful reasoning exhibits simpler topologies, reducing redundancy and cycles, enhancing efficiency and interpretability. This work provides a new perspective on reasoning chain quality assessment and offers guidance for future optimization.
△ Less
Submitted 22 December, 2025;
originally announced December 2025.
-
ShadowBlock: Efficient Dynamic Anonymous Blocklisting and Its Cross-chain Application
Authors:
Haotian Deng,
Mengxuan Liu,
Chuan Zhang,
Wei Huang,
Licheng Wang,
Liehuang Zhu
Abstract:
Online harassment, incitement to violence, racist behavior, and other harmful content on social media can damage social harmony and even break the law. Traditional blocklisting technologies can block malicious users, but this comes at the expense of identity privacy. The anonymous blocklisting has emerged as an effective mechanism to restrict the abuse of freedom of speech while protecting user id…
▽ More
Online harassment, incitement to violence, racist behavior, and other harmful content on social media can damage social harmony and even break the law. Traditional blocklisting technologies can block malicious users, but this comes at the expense of identity privacy. The anonymous blocklisting has emerged as an effective mechanism to restrict the abuse of freedom of speech while protecting user identity privacy. However, the state-of-the-art anonymous blocklisting schemes suffer from either poor dynamism or low efficiency. In this paper, we propose $\mathsf{ShadowBlock}$, an efficient dynamic anonymous blocklisting scheme. Specifically, we utilize the pseudorandom function and cryptographic accumulator to construct the public blocklisting, enabling users to prove they are not on the blocklisting in an anonymous manner. To improve verification efficiency, we design an aggregation zero-knowledge proof mechanism that converts multiple verification operations into a single one. In addition, we leverage the accumulator's property to achieve efficient updates of the blocklisting, i.e., the original proof can be reused with minimal updates rather than regenerating the entire proof. Experiments show that $\mathsf{ShadowBlock}$ has better dynamics and efficiency than the existing schemes. Finally, the discussion on applications indicates that $\mathsf{ShadowBlock}$ also holds significant value and has broad prospects in emerging fields such as cross-chain identity management.
△ Less
Submitted 22 December, 2025;
originally announced December 2025.
-
Watch Closely: Mitigating Object Hallucinations in Large Vision-Language Models with Disentangled Decoding
Authors:
Ruiqi Ma,
Yu Yan,
Chunhong Zhang,
Minghao Yin,
XinChao Liu,
Zhihong Jin,
Zheng Hu
Abstract:
Large Vision-Language Models (LVLMs) bridge the gap between visual and linguistic modalities, demonstrating strong potential across a variety of domains. However, despite significant progress, LVLMs still suffer from severe hallucination issues in object recognition tasks. These models often fail to accurately identify certain objects, leading to text generation that appears fluent but does not co…
▽ More
Large Vision-Language Models (LVLMs) bridge the gap between visual and linguistic modalities, demonstrating strong potential across a variety of domains. However, despite significant progress, LVLMs still suffer from severe hallucination issues in object recognition tasks. These models often fail to accurately identify certain objects, leading to text generation that appears fluent but does not correspond to the visual content, which can have serious consequences in real-world applications. Recently, several methods have been proposed to alleviate LVLM hallucinations, but most focus solely on reducing hallucinations in the language modality. To mitigate hallucinations in both the language and visual modalities, we introduce Hallucination Disentangled Decoding (HDD) method that requires no training. HDD enhances the original image by segmenting it and selecting images that augment the original, while also utilizing a blank image to eliminate language prior hallucinations in both the original and segmented images. This design not only reduces the model's dependence on language priors but also enhances its visual performance. (Code: https://github.com/rickeyhhh/Hallucination-Disentangled-Decoding)
△ Less
Submitted 22 December, 2025;
originally announced December 2025.
-
Improving Pattern Recognition of Scheduling Anomalies through Structure-Aware and Semantically-Enhanced Graphs
Authors:
Ning Lyu,
Junjie Jiang,
Lu Chang,
Chihui Shao,
Feng Chen,
Chong Zhang
Abstract:
This paper proposes a structure-aware driven scheduling graph modeling method to improve the accuracy and representation capability of anomaly identification in scheduling behaviors of complex systems. The method first designs a structure-guided scheduling graph construction mechanism that integrates task execution stages, resource node states, and scheduling path information to build dynamically…
▽ More
This paper proposes a structure-aware driven scheduling graph modeling method to improve the accuracy and representation capability of anomaly identification in scheduling behaviors of complex systems. The method first designs a structure-guided scheduling graph construction mechanism that integrates task execution stages, resource node states, and scheduling path information to build dynamically evolving scheduling behavior graphs, enhancing the model's ability to capture global scheduling relationships. On this basis, a multi-scale graph semantic aggregation module is introduced to achieve semantic consistency modeling of scheduling features through local adjacency semantic integration and global topology alignment, thereby strengthening the model's capability to capture abnormal features in complex scenarios such as multi-task concurrency, resource competition, and stage transitions. Experiments are conducted on a real scheduling dataset with multiple scheduling disturbance paths set to simulate different types of anomalies, including structural shifts, resource changes, and task delays. The proposed model demonstrates significant performance advantages across multiple metrics, showing a sensitive response to structural disturbances and semantic shifts. Further visualization analysis reveals that, under the combined effect of structure guidance and semantic aggregation, the scheduling behavior graph exhibits stronger anomaly separability and pattern representation, validating the effectiveness and adaptability of the method in scheduling anomaly detection tasks.
△ Less
Submitted 21 December, 2025;
originally announced December 2025.
-
OpenView: Empowering MLLMs with Out-of-view VQA
Authors:
Qixiang Chen,
Cheng Zhang,
Chi-Wing Fu,
Jingwen Ye,
Jianfei Cai
Abstract:
Recent multimodal large language models (MLLMs) show great potential in natural image understanding. Yet, they perform well, mainly on reasoning in-view contents within the image frame. This paper presents the first study on out-of-view (OOV) understanding, i.e., the ability to reason objects, activities, and scenes beyond the visible frame of a perspective view. Our technical contributions are th…
▽ More
Recent multimodal large language models (MLLMs) show great potential in natural image understanding. Yet, they perform well, mainly on reasoning in-view contents within the image frame. This paper presents the first study on out-of-view (OOV) understanding, i.e., the ability to reason objects, activities, and scenes beyond the visible frame of a perspective view. Our technical contributions are threefold. First, we design OpenView, a four-stage pipeline to massively generate multi-choice VQA by leveraging panoramic imagery to enable context-rich and spatial-grounded VQA synthesis with free-view framing. Second, we curate OpenView-Dataset, a high-quality synthetic dataset from diverse real-world panoramas to empower MLLMs upon supervised fine-tuning. Third, we build OpenView-Bench, a benchmark that jointly measures choice and rationale accuracy for interpretable and diagnosable evaluation. Experimental results show that despite having a large gap from human performance in OOV VQA answer selection, upon empowered by OpenView, multiple MLLMs can consistently boost their performance, uplifted from 48.6% to 64.1% on average. Code, benchmark, and data will be available at https://github.com/q1xiangchen/OpenView.
△ Less
Submitted 20 December, 2025;
originally announced December 2025.
-
FlexAvatar: Flexible Large Reconstruction Model for Animatable Gaussian Head Avatars with Detailed Deformation
Authors:
Cheng Peng,
Zhuo Su,
Liao Wang,
Chen Guo,
Zhaohu Li,
Chengjiang Long,
Zheng Lv,
Jingxiang Sun,
Chenyangguang Zhang,
Yebin Liu
Abstract:
We present FlexAvatar, a flexible large reconstruction model for high-fidelity 3D head avatars with detailed dynamic deformation from single or sparse images, without requiring camera poses or expression labels. It leverages a transformer-based reconstruction model with structured head query tokens as canonical anchor to aggregate flexible input-number-agnostic, camera-pose-free and expression-fre…
▽ More
We present FlexAvatar, a flexible large reconstruction model for high-fidelity 3D head avatars with detailed dynamic deformation from single or sparse images, without requiring camera poses or expression labels. It leverages a transformer-based reconstruction model with structured head query tokens as canonical anchor to aggregate flexible input-number-agnostic, camera-pose-free and expression-free inputs into a robust canonical 3D representation. For detailed dynamic deformation, we introduce a lightweight UNet decoder conditioned on UV-space position maps, which can produce detailed expression-dependent deformations in real time. To better capture rare but critical expressions like wrinkles and bared teeth, we also adopt a data distribution adjustment strategy during training to balance the distribution of these expressions in the training set. Moreover, a lightweight 10-second refinement can further enhances identity-specific details in extreme identities without affecting deformation quality. Extensive experiments demonstrate that our FlexAvatar achieves superior 3D consistency, detailed dynamic realism compared with previous methods, providing a practical solution for animatable 3D avatar creation.
△ Less
Submitted 19 December, 2025;
originally announced December 2025.
-
Kinematics-Aware Diffusion Policy with Consistent 3D Observation and Action Space for Whole-Arm Robotic Manipulation
Authors:
Kangchen Lv,
Mingrui Yu,
Yongyi Jia,
Chenyu Zhang,
Xiang Li
Abstract:
Whole-body control of robotic manipulators with awareness of full-arm kinematics is crucial for many manipulation scenarios involving body collision avoidance or body-object interactions, which makes it insufficient to consider only the end-effector poses in policy learning. The typical approach for whole-arm manipulation is to learn actions in the robot's joint space. However, the unalignment bet…
▽ More
Whole-body control of robotic manipulators with awareness of full-arm kinematics is crucial for many manipulation scenarios involving body collision avoidance or body-object interactions, which makes it insufficient to consider only the end-effector poses in policy learning. The typical approach for whole-arm manipulation is to learn actions in the robot's joint space. However, the unalignment between the joint space and actual task space (i.e., 3D space) increases the complexity of policy learning, as generalization in task space requires the policy to intrinsically understand the non-linear arm kinematics, which is difficult to learn from limited demonstrations. To address this issue, this letter proposes a kinematics-aware imitation learning framework with consistent task, observation, and action spaces, all represented in the same 3D space. Specifically, we represent both robot states and actions using a set of 3D points on the arm body, naturally aligned with the 3D point cloud observations. This spatially consistent representation improves the policy's sample efficiency and spatial generalizability while enabling full-body control. Built upon the diffusion policy, we further incorporate kinematics priors into the diffusion processes to guarantee the kinematic feasibility of output actions. The joint angle commands are finally calculated through an optimization-based whole-body inverse kinematics solver for execution. Simulation and real-world experimental results demonstrate higher success rates and stronger spatial generalizability of our approach compared to existing methods in body-aware manipulation policy learning.
△ Less
Submitted 19 December, 2025;
originally announced December 2025.
-
Video Detective: Seek Critical Clues Recurrently to Answer Question from Long Videos
Authors:
Henghui Du,
Chang Zhou,
Chunjie Zhang,
Xi Chen,
Di Hu
Abstract:
Long Video Question-Answering (LVQA) presents a significant challenge for Multi-modal Large Language Models (MLLMs) due to immense context and overloaded information, which could also lead to prohibitive memory consumption. While existing methods attempt to address these issues by reducing visual tokens or extending model's context length, they may miss useful information or take considerable comp…
▽ More
Long Video Question-Answering (LVQA) presents a significant challenge for Multi-modal Large Language Models (MLLMs) due to immense context and overloaded information, which could also lead to prohibitive memory consumption. While existing methods attempt to address these issues by reducing visual tokens or extending model's context length, they may miss useful information or take considerable computation. In fact, when answering given questions, only a small amount of crucial information is required. Therefore, we propose an efficient question-aware memory mechanism, enabling MLLMs to recurrently seek these critical clues. Our approach, named VideoDetective, simplifies this task by iteratively processing video sub-segments. For each sub-segment, a question-aware compression strategy is employed by introducing a few special memory tokens to achieve purposefully compression. This allows models to effectively seek critical clues while reducing visual tokens. Then, due to history context could have a significant impact, we recurrently aggregate and store these memory tokens to update history context, which would be reused for subsequent sub-segments. Furthermore, to more effectively measure model's long video understanding ability, we introduce GLVC (Grounding Long Video Clues), a long video question-answering dataset, which features grounding critical and concrete clues scattered throughout entire videos. Experimental results demonstrate our method enables MLLMs with limited context length of 32K to efficiently process 100K tokens (3600 frames, an hour-long video sampled at 1fps), requiring only 2 minutes and 37GB GPU memory usage. Evaluation results across multiple long video benchmarks illustrate our method can more effectively seek critical clues from massive information.
△ Less
Submitted 18 December, 2025;
originally announced December 2025.
-
VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization
Authors:
Xiaoyan Cong,
Haotian Yang,
Angtian Wang,
Yizhi Wang,
Yiding Yang,
Canyu Zhang,
Chongyang Ma
Abstract:
Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse and complex, real-world instructions. To address this generali…
▽ More
Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse and complex, real-world instructions. To address this generalization gap, we propose VIVA, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually-grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which adapts Group Relative Policy Optimization to the domain of video editing, directly optimizing the model for instruction-faithful, content-preserving, and aesthetically pleasing edits using relative rewards. Furthermore, we propose a data construction pipeline designed to synthetically generate diverse, high-fidelity paired video-instruction data of basic editing operations. Extensive experiments show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods. Website: https://viva-paper.github.io
△ Less
Submitted 18 December, 2025;
originally announced December 2025.
-
InfoDCL: Informative Noise Enhanced Diffusion Based Contrastive Learning
Authors:
Xufeng Liang,
Zhida Qin,
Chong Zhang,
Tianyu Huang,
Gangyi Ding
Abstract:
Contrastive learning has demonstrated promising potential in recommender systems. Existing methods typically construct sparser views by randomly perturbing the original interaction graph, as they have no idea about the authentic user preferences. Owing to the sparse nature of recommendation data, this paradigm can only capture insufficient semantic information. To address the issue, we propose Inf…
▽ More
Contrastive learning has demonstrated promising potential in recommender systems. Existing methods typically construct sparser views by randomly perturbing the original interaction graph, as they have no idea about the authentic user preferences. Owing to the sparse nature of recommendation data, this paradigm can only capture insufficient semantic information. To address the issue, we propose InfoDCL, a novel diffusion-based contrastive learning framework for recommendation. Rather than injecting randomly sampled Gaussian noise, we employ a single-step diffusion process that integrates noise with auxiliary semantic information to generate signals and feed them to the standard diffusion process to generate authentic user preferences as contrastive views. Besides, based on a comprehensive analysis of the mutual influence between generation and preference learning in InfoDCL, we build a collaborative training objective strategy to transform the interference between them into mutual collaboration. Additionally, we employ multiple GCN layers only during inference stage to incorporate higher-order co-occurrence information while maintaining training efficiency. Extensive experiments on five real-world datasets demonstrate that InfoDCL significantly outperforms state-of-the-art methods. Our InfoDCL offers an effective solution for enhancing recommendation performance and suggests a novel paradigm for applying diffusion method in contrastive learning frameworks.
△ Less
Submitted 18 December, 2025;
originally announced December 2025.
-
Adaptation of Agentic AI
Authors:
Pengcheng Jiang,
Jiacheng Lin,
Zhiyi Shi,
Zifeng Wang,
Luxi He,
Yichen Wu,
Ming Zhong,
Peiyang Song,
Qizheng Zhang,
Heng Wang,
Xueqiang Xu,
Hanwen Xu,
Pengrui Han,
Dylan Zhang,
Jiashuo Sun,
Chaoqi Yang,
Kun Qian,
Tian Wang,
Changran Hu,
Manling Li,
Quanzheng Li,
Hao Peng,
Sheng Wang,
Jingbo Shang,
Chao Zhang
, et al. (9 additional authors not shown)
Abstract:
Cutting-edge agentic AI systems are built on foundation models that can be adapted to plan, reason, and interact with external tools to perform increasingly complex and specialized tasks. As these systems grow in capability and scope, adaptation becomes a central mechanism for improving performance, reliability, and generalization. In this paper, we unify the rapidly expanding research landscape i…
▽ More
Cutting-edge agentic AI systems are built on foundation models that can be adapted to plan, reason, and interact with external tools to perform increasingly complex and specialized tasks. As these systems grow in capability and scope, adaptation becomes a central mechanism for improving performance, reliability, and generalization. In this paper, we unify the rapidly expanding research landscape into a systematic framework that spans both agent adaptations and tool adaptations. We further decompose these into tool-execution-signaled and agent-output-signaled forms of agent adaptation, as well as agent-agnostic and agent-supervised forms of tool adaptation. We demonstrate that this framework helps clarify the design space of adaptation strategies in agentic AI, makes their trade-offs explicit, and provides practical guidance for selecting or switching among strategies during system design. We then review the representative approaches in each category, analyze their strengths and limitations, and highlight key open challenges and future opportunities. Overall, this paper aims to offer a conceptual foundation and practical roadmap for researchers and practitioners seeking to build more capable, efficient, and reliable agentic AI systems.
△ Less
Submitted 22 December, 2025; v1 submitted 18 December, 2025;
originally announced December 2025.
-
Physics-Informed Neural Networks for Modeling the Martian Induced Magnetosphere
Authors:
Jiawei Gao,
Chuanfei Dong,
Chi Zhang,
Yilan Qin,
Simin Shekarpaz,
Xinmin Li,
Liang Wang,
Hongyang Zhou,
Abigail Tadlock
Abstract:
Understanding the magnetic field environment around Mars and its response to upstream solar wind conditions provide key insights into the processes driving atmospheric ion escape. To date, global models of Martian induced magnetosphere have been exclusively physics-based, relying on computationally intensive simulations. For the first time, we develop a data-driven model of the Martian induced mag…
▽ More
Understanding the magnetic field environment around Mars and its response to upstream solar wind conditions provide key insights into the processes driving atmospheric ion escape. To date, global models of Martian induced magnetosphere have been exclusively physics-based, relying on computationally intensive simulations. For the first time, we develop a data-driven model of the Martian induced magnetospheric magnetic field using Physics-Informed Neural Network (PINN) combined with MAVEN observations and physical laws. Trained under varying solar wind conditions, including B_IMF, P_SW, and Īø_cone, the data-driven model accurately reconstructs the three-dimensional magnetic field configuration and its variability in response to upstream solar wind drivers. Based on the PINN results, we identify key dependencies of magnetic field configuration on solar wind parameters, including the hemispheric asymmetries of the draped field line strength in the Mars-Solar-Electric coordinates. These findings demonstrate the capability of PINNs to reconstruct complex magnetic field structures in the Martian induced magnetosphere, thereby offering a promising tool for advancing studies of solar wind-Mars interactions.
△ Less
Submitted 17 December, 2025;
originally announced December 2025.
-
Large Video Planner Enables Generalizable Robot Control
Authors:
Boyuan Chen,
Tianyuan Zhang,
Haoran Geng,
Kiwhan Song,
Caiyi Zhang,
Peihao Li,
William T. Freeman,
Jitendra Malik,
Pieter Abbeel,
Russ Tedrake,
Vincent Sitzmann,
Yilun Du
Abstract:
General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. These efforts are motivated by the intuition that MLLMs' large-scale language and image pretraining can be effectively transfe…
▽ More
General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. These efforts are motivated by the intuition that MLLMs' large-scale language and image pretraining can be effectively transferred to the action output modality. In this work, we explore an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models. Unlike static images and language, videos capture spatio-temporal sequences of states and actions in the physical world that are naturally aligned with robotic behavior. We curate an internet-scale video dataset of human activities and task demonstrations, and train, for the first time at a foundation-model scale, an open video model for generative robotics planning. The model produces zero-shot video plans for novel scenes and tasks, which we post-process to extract executable robot actions. We evaluate task-level generalization through third-party selected tasks in the wild and real-robot experiments, demonstrating successful physical execution. Together, these results show robust instruction following, strong generalization, and real-world feasibility. We release both the model and dataset to support open, reproducible video-based robot learning. Our website is available at https://www.boyuan.space/large-video-planner/.
△ Less
Submitted 17 December, 2025;
originally announced December 2025.
-
Beyond Training: Enabling Self-Evolution of Agents with MOBIMEM
Authors:
Zibin Liu,
Cheng Zhang,
Xi Zhao,
Yunfei Feng,
Bingyu Bai,
Dahu Feng,
Erhu Feng,
Yubin Xia,
Haibo Chen
Abstract:
Large Language Model (LLM) agents are increasingly deployed to automate complex workflows in mobile and desktop environments. However, current model-centric agent architectures struggle to self-evolve post-deployment: improving personalization, capability, and efficiency typically requires continuous model retraining/fine-tuning, which incurs prohibitive computational overheads and suffers from an…
▽ More
Large Language Model (LLM) agents are increasingly deployed to automate complex workflows in mobile and desktop environments. However, current model-centric agent architectures struggle to self-evolve post-deployment: improving personalization, capability, and efficiency typically requires continuous model retraining/fine-tuning, which incurs prohibitive computational overheads and suffers from an inherent trade-off between model accuracy and inference efficiency.
To enable iterative self-evolution without model retraining, we propose MOBIMEM, a memory-centric agent system. MOBIMEM first introduces three specialized memory primitives to decouple agent evolution from model weights: (1) Profile Memory uses a lightweight distance-graph (DisGraph) structure to align with user preferences, resolving the accuracy-latency trade-off in user profile retrieval; (2) Experience Memory employs multi-level templates to instantiate execution logic for new tasks, ensuring capability generalization; and (3) Action Memory records fine-grained interaction sequences, reducing the reliance on expensive model inference. Building upon this memory architecture, MOBIMEM further integrates a suite of OS-inspired services to orchestrate execution: a scheduler that coordinates parallel sub-task execution and memory operations; an agent record-and-replay (AgentRR) mechanism that enables safe and efficient action reuse; and a context-aware exception handling that ensures graceful recovery from user interruptions and runtime errors.
Evaluation on AndroidWorld and top-50 apps shows that MOBIMEM achieves 83.1% profile alignment with 23.83 ms retrieval time (280x faster than GraphRAG baselines), improves task success rates by up to 50.3%, and reduces end-to-end latency by up to 9x on mobile devices.
△ Less
Submitted 15 December, 2025;
originally announced December 2025.
-
TENG++: Time-Evolving Natural Gradient for Solving PDEs With Deep Neural Nets under General Boundary Conditions
Authors:
Xinjie He,
Chenggong Zhang
Abstract:
Partial Differential Equations (PDEs) are central to modeling complex systems across physical, biological, and engineering domains, yet traditional numerical methods often struggle with high-dimensional or complex problems. Physics-Informed Neural Networks (PINNs) have emerged as an efficient alternative by embedding physics-based constraints into deep learning frameworks, but they face challenges…
▽ More
Partial Differential Equations (PDEs) are central to modeling complex systems across physical, biological, and engineering domains, yet traditional numerical methods often struggle with high-dimensional or complex problems. Physics-Informed Neural Networks (PINNs) have emerged as an efficient alternative by embedding physics-based constraints into deep learning frameworks, but they face challenges in achieving high accuracy and handling complex boundary conditions. In this work, we extend the Time-Evolving Natural Gradient (TENG) framework to address Dirichlet boundary conditions, integrating natural gradient optimization with numerical time-stepping schemes, including Euler and Heun methods, to ensure both stability and accuracy. By incorporating boundary condition penalty terms into the loss function, the proposed approach enables precise enforcement of Dirichlet constraints. Experiments on the heat equation demonstrate the superior accuracy of the Heun method due to its second-order corrections and the computational efficiency of the Euler method for simpler scenarios. This work establishes a foundation for extending the framework to Neumann and mixed boundary conditions, as well as broader classes of PDEs, advancing the applicability of neural network-based solvers for real-world problems.
△ Less
Submitted 12 December, 2025;
originally announced December 2025.
-
Evaluating Large Language Models in Scientific Discovery
Authors:
Zhangde Song,
Jieyu Lu,
Yuanqi Du,
Botao Yu,
Thomas M. Pruyn,
Yue Huang,
Kehan Guo,
Xiuzhe Luo,
Yuanhao Qu,
Yi Qu,
Yinkai Wang,
Haorui Wang,
Jeff Guo,
Jingru Gan,
Parshin Shojaee,
Di Luo,
Andres M Bran,
Gen Li,
Qiyuan Zhao,
Shao-Xiong Lennon Luo,
Yuxuan Zhang,
Xiang Zou,
Wanru Zhao,
Yifan F. Zhang,
Wucheng Zhang
, et al. (31 additional authors not shown)
Abstract:
Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge and overlook the iterative reasoning, hypothesis generation, and observation interpretation that drive scientific discovery. We introduce a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics, where domain exp…
▽ More
Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge and overlook the iterative reasoning, hypothesis generation, and observation interpretation that drive scientific discovery. We introduce a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics, where domain experts define research projects of genuine interest and decompose them into modular research scenarios from which vetted questions are sampled. The framework assesses models at two levels: (i) question-level accuracy on scenario-tied items and (ii) project-level performance, where models must propose testable hypotheses, design simulations or experiments, and interpret results. Applying this two-phase scientific discovery evaluation (SDE) framework to state-of-the-art LLMs reveals a consistent performance gap relative to general science benchmarks, diminishing return of scaling up model sizes and reasoning, and systematic weaknesses shared across top-tier models from different providers. Large performance variation in research scenarios leads to changing choices of the best performing model on scientific discovery projects evaluated, suggesting all current LLMs are distant to general scientific "superintelligence". Nevertheless, LLMs already demonstrate promise in a great variety of scientific discovery projects, including cases where constituent scenario scores are low, highlighting the role of guided exploration and serendipity in discovery. This SDE framework offers a reproducible benchmark for discovery-relevant evaluation of LLMs and charts practical paths to advance their development toward scientific discovery.
△ Less
Submitted 17 December, 2025;
originally announced December 2025.
-
Bits for Privacy: Evaluating Post-Training Quantization via Membership Inference
Authors:
Chenxiang Zhang,
Tongxi Qu,
Zhong Li,
Tian Zhang,
Jun Pang,
Sjouke Mauw
Abstract:
Deep neural networks are widely deployed with quantization techniques to reduce memory and computational costs by lowering the numerical precision of their parameters. While quantization alters model parameters and their outputs, existing privacy analyses primarily focus on full-precision models, leaving a gap in understanding how bit-width reduction can affect privacy leakage. We present the firs…
▽ More
Deep neural networks are widely deployed with quantization techniques to reduce memory and computational costs by lowering the numerical precision of their parameters. While quantization alters model parameters and their outputs, existing privacy analyses primarily focus on full-precision models, leaving a gap in understanding how bit-width reduction can affect privacy leakage. We present the first systematic study of the privacy-utility relationship in post-training quantization (PTQ), a versatile family of methods that can be applied to pretrained models without further training. Using membership inference attacks as our evaluation framework, we analyze three popular PTQ algorithms-AdaRound, BRECQ, and OBC-across multiple precision levels (4-bit, 2-bit, and 1.58-bit) on CIFAR-10, CIFAR-100, and TinyImageNet datasets. Our findings consistently show that low-precision PTQs can reduce privacy leakage. In particular, lower-precision models demonstrate up to an order of magnitude reduction in membership inference vulnerability compared to their full-precision counterparts, albeit at the cost of decreased utility. Additional ablation studies on the 1.58-bit quantization level show that quantizing only the last layer at higher precision enables fine-grained control over the privacy-utility trade-off. These results offer actionable insights for practitioners to balance efficiency, utility, and privacy protection in real-world deployments.
△ Less
Submitted 17 December, 2025;
originally announced December 2025.
-
RFKG-CoT: Relation-Driven Adaptive Hop-count Selection and Few-Shot Path Guidance for Knowledge-Aware QA
Authors:
Chao Zhang,
Minghan Li,
Tianrui Lv,
Guodong Zhou
Abstract:
Large language models (LLMs) often generate hallucinations in knowledge-intensive QA due to parametric knowledge limitations. While existing methods like KG-CoT improve reliability by integrating knowledge graph (KG) paths, they suffer from rigid hop-count selection (solely question-driven) and underutilization of reasoning paths (lack of guidance). To address this, we propose RFKG-CoT: First, it…
▽ More
Large language models (LLMs) often generate hallucinations in knowledge-intensive QA due to parametric knowledge limitations. While existing methods like KG-CoT improve reliability by integrating knowledge graph (KG) paths, they suffer from rigid hop-count selection (solely question-driven) and underutilization of reasoning paths (lack of guidance). To address this, we propose RFKG-CoT: First, it replaces the rigid hop-count selector with a relation-driven adaptive hop-count selector that dynamically adjusts reasoning steps by activating KG relations (e.g., 1-hop for direct "brother" relations, 2-hop for indirect "father-son" chains), formalized via a relation mask. Second, it introduces a few-shot in-context learning path guidance mechanism with CoT (think) that constructs examples in a "question-paths-answer" format to enhance LLMs' ability to understand reasoning paths. Experiments on four KGQA benchmarks show RFKG-CoT improves accuracy by up to 14.7 pp (Llama2-7B on WebQSP) over KG-CoT. Ablations confirm the hop-count selector and the path prompt are complementary, jointly transforming KG evidence into more faithful answers.
△ Less
Submitted 17 December, 2025;
originally announced December 2025.
-
Update Strategy for Channel Knowledge Map in Complex Environments
Authors:
Ting Wang,
Chiya Zhang,
Chang Liu,
Zhuoyuan Hao,
Rubing Han,
Weizheng Zhang,
Chunlong He
Abstract:
The Channel Knowledge Map (CKM) maps position information to channel state information, leveraging environmental knowledge to reduce signaling overhead in sixth-generation networks. However, constructing a reliable CKM demands substantial data and computation, and in dynamic environments, a pre-built CKM becomes outdated, degrading performance. Frequent retraining restores accuracy but incurs sign…
▽ More
The Channel Knowledge Map (CKM) maps position information to channel state information, leveraging environmental knowledge to reduce signaling overhead in sixth-generation networks. However, constructing a reliable CKM demands substantial data and computation, and in dynamic environments, a pre-built CKM becomes outdated, degrading performance. Frequent retraining restores accuracy but incurs significant waste, creating a fundamental trade-off between CKM efficacy and update overhead. To address this, we introduce a Map Efficacy Function (MEF) capturing both gradual aging and abrupt environmental transitions, and formulate the update scheduling problem as fractional programming. We develop two Dinkelbach-based algorithms: Delta-P guarantees global optimality, while Delta-L achieves near-optimal performance with near-linear complexity. For unpredictable environments, we derive a threshold-based policy: immediate updates are optimal when the environmental degradation rate exceeds the resource consumption acceleration; otherwise, delay is preferable. For predictable environments, long-term strategies strategically relax these myopic rules to maximize global performance. Across this regime, the policy reveals that stronger entry loss and faster decay favor immediate updates, while weaker entry loss and slower decay favor delayed updates.
△ Less
Submitted 17 December, 2025;
originally announced December 2025.
-
Tracking spatial temporal details in ultrasound long video via wavelet analysis and memory bank
Authors:
Chenxiao Zhang,
Runshi Zhang,
Junchen Wang
Abstract:
Medical ultrasound videos are widely used for medical inspections, disease diagnosis and surgical planning. High-fidelity lesion area and target organ segmentation constitutes a key component of the computer-assisted surgery workflow. The low contrast levels and noisy backgrounds of ultrasound videos cause missegmentation of organ boundary, which may lead to small object losses and increase bounda…
▽ More
Medical ultrasound videos are widely used for medical inspections, disease diagnosis and surgical planning. High-fidelity lesion area and target organ segmentation constitutes a key component of the computer-assisted surgery workflow. The low contrast levels and noisy backgrounds of ultrasound videos cause missegmentation of organ boundary, which may lead to small object losses and increase boundary segmentation errors. Object tracking in long videos also remains a significant research challenge. To overcome these challenges, we propose a memory bank-based wavelet filtering and fusion network, which adopts an encoder-decoder structure to effectively extract fine-grained detailed spatial features and integrate high-frequency (HF) information. Specifically, memory-based wavelet convolution is presented to simultaneously capture category, detailed information and utilize adjacent information in the encoder. Cascaded wavelet compression is used to fuse multiscale frequency-domain features and expand the receptive field within each convolutional layer. A long short-term memory bank using cross-attention and memory compression mechanisms is designed to track objects in long video. To fully utilize the boundary-sensitive HF details of feature maps, an HF-aware feature fusion module is designed via adaptive wavelet filters in the decoder. In extensive benchmark tests conducted on four ultrasound video datasets (two thyroid nodule, the thyroid gland, the heart datasets) compared with the state-of-the-art methods, our method demonstrates marked improvements in segmentation metrics. In particular, our method can more accurately segment small thyroid nodules, demonstrating its effectiveness for cases involving small ultrasound objects in long video. The code is available at https://github.com/XiAooZ/MWNet.
△ Less
Submitted 16 December, 2025;
originally announced December 2025.
-
ISS Policy : Scalable Diffusion Policy with Implicit Scene Supervision
Authors:
Wenlong Xia,
Jinhao Zhang,
Ce Zhang,
Yaojia Wang,
Youmin Gong,
Jie Mei
Abstract:
Vision-based imitation learning has enabled impressive robotic manipulation skills, but its reliance on object appearance while ignoring the underlying 3D scene structure leads to low training efficiency and poor generalization. To address these challenges, we introduce \emph{Implicit Scene Supervision (ISS) Policy}, a 3D visuomotor DiT-based diffusion policy that predicts sequences of continuous…
▽ More
Vision-based imitation learning has enabled impressive robotic manipulation skills, but its reliance on object appearance while ignoring the underlying 3D scene structure leads to low training efficiency and poor generalization. To address these challenges, we introduce \emph{Implicit Scene Supervision (ISS) Policy}, a 3D visuomotor DiT-based diffusion policy that predicts sequences of continuous actions from point cloud observations. We extend DiT with a novel implicit scene supervision module that encourages the model to produce outputs consistent with the scene's geometric evolution, thereby improving the performance and robustness of the policy. Notably, ISS Policy achieves state-of-the-art performance on both single-arm manipulation tasks (MetaWorld) and dexterous hand manipulation (Adroit). In real-world experiments, it also demonstrates strong generalization and robustness. Additional ablation studies show that our method scales effectively with both data and parameters. Code and videos will be released.
△ Less
Submitted 16 December, 2025;
originally announced December 2025.
-
Revisiting the Reliability of Language Models in Instruction-Following
Authors:
Jianshuo Dong,
Yutong Zhang,
Yan Liu,
Zhenyu Zhong,
Tao Wei,
Chao Zhang,
Han Qiu
Abstract:
Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable services in real-world use, where users often vary their phrasing, contextual framing, and task formulations. In this paper, we study nuance-oriented reliability: whether models exhibit consistent competence across cousin pr…
▽ More
Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable services in real-world use, where users often vary their phrasing, contextual framing, and task formulations. In this paper, we study nuance-oriented reliability: whether models exhibit consistent competence across cousin prompts that convey analogous user intents but with subtle nuances. To quantify this, we introduce a new metric, reliable@k, and develop an automated pipeline that generates high-quality cousin prompts via data augmentation. Building upon this, we construct IFEval++ for systematic evaluation. Across 20 proprietary and 26 open-source LLMs, we find that current models exhibit substantial insufficiency in nuance-oriented reliability -- their performance can drop by up to 61.8% with nuanced prompt modifications. What's more, we characterize it and explore three potential improvement recipes. Our findings highlight nuance-oriented reliability as a crucial yet underexplored next step toward more dependable and trustworthy LLM behavior. Our code and benchmark are accessible: https://github.com/jianshuod/IFEval-pp.
△ Less
Submitted 14 December, 2025;
originally announced December 2025.
-
ART: Articulated Reconstruction Transformer
Authors:
Zizhang Li,
Cheng Zhang,
Zhengqin Li,
Henry Howard-Jenkins,
Zhaoyang Lv,
Chen Geng,
Jiajun Wu,
Richard Newcombe,
Jakob Engel,
Zhao Dong
Abstract:
We introduce ART, Articulated Reconstruction Transformer -- a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images. Previous methods for articulated object reconstruction either rely on slow optimization with fragile cross-state correspondences or use feed-forward models limited to specific object categories. In contrast,…
▽ More
We introduce ART, Articulated Reconstruction Transformer -- a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images. Previous methods for articulated object reconstruction either rely on slow optimization with fragile cross-state correspondences or use feed-forward models limited to specific object categories. In contrast, ART treats articulated objects as assemblies of rigid parts, formulating reconstruction as part-based prediction. Our newly designed transformer architecture maps sparse image inputs to a set of learnable part slots, from which ART jointly decodes unified representations for individual parts, including their 3D geometry, texture, and explicit articulation parameters. The resulting reconstructions are physically interpretable and readily exportable for simulation. Trained on a large-scale, diverse dataset with per-part supervision, and evaluated across diverse benchmarks, ART achieves significant improvements over existing baselines and establishes a new state of the art for articulated object reconstruction from image inputs.
△ Less
Submitted 16 December, 2025;
originally announced December 2025.
-
Astraea: A State-Aware Scheduling Engine for LLM-Powered Agents
Authors:
Hongqiu Ni,
Jiabao Zhang,
Guopeng Li,
Zilong Wang,
Ruiqi Wu,
Chi Zhang,
Haisheng Tan
Abstract:
Large Language Models (LLMs) are increasingly being deployed as intelligent agents. Their multi-stage workflows, which alternate between local computation and calls to external network services like Web APIs, introduce a mismatch in their execution pattern and the scheduling granularity of existing inference systems such as vLLM. Existing systems typically focus on per-segment optimization which p…
▽ More
Large Language Models (LLMs) are increasingly being deployed as intelligent agents. Their multi-stage workflows, which alternate between local computation and calls to external network services like Web APIs, introduce a mismatch in their execution pattern and the scheduling granularity of existing inference systems such as vLLM. Existing systems typically focus on per-segment optimization which prevents them from minimizing the end-to-end latency of the complete agentic workflow, i.e., the global Job Completion Time (JCT) over the entire request lifecycle. To address this limitation, we propose Astraea, a service engine designed to shift the optimization from local segments to the global request lifecycle. Astraea employs a state-aware, hierarchical scheduling algorithm that integrates a request's historical state with future predictions. It dynamically classifies requests by their I/O and compute intensive nature and uses an enhanced HRRN policy to balance efficiency and fairness. Astraea also implements an adaptive KV cache manager that intelligently handles the agent state during I/O waits based on the system memory pressure. Extensive experiments show that Astraea reduces average JCT by up to 25.5\% compared to baseline methods. Moreover, our approach demonstrates strong robustness and stability under high load across various model scales.
△ Less
Submitted 16 December, 2025;
originally announced December 2025.
-
ASAP-Textured Gaussians: Enhancing Textured Gaussians with Adaptive Sampling and Anisotropic Parameterization
Authors:
Meng Wei,
Cheng Zhang,
Jianmin Zheng,
Hamid Rezatofighi,
Jianfei Cai
Abstract:
Recent advances have equipped 3D Gaussian Splatting with texture parameterizations to capture spatially varying attributes, improving the performance of both appearance modeling and downstream tasks. However, the added texture parameters introduce significant memory efficiency challenges. Rather than proposing new texture formulations, we take a step back to examine the characteristics of existing…
▽ More
Recent advances have equipped 3D Gaussian Splatting with texture parameterizations to capture spatially varying attributes, improving the performance of both appearance modeling and downstream tasks. However, the added texture parameters introduce significant memory efficiency challenges. Rather than proposing new texture formulations, we take a step back to examine the characteristics of existing textured Gaussian methods and identify two key limitations in common: (1) Textures are typically defined in canonical space, leading to inefficient sampling that wastes textures' capacity on low-contribution regions; and (2) texture parameterization is uniformly assigned across all Gaussians, regardless of their visual complexity, resulting in over-parameterization. In this work, we address these issues through two simple yet effective strategies: adaptive sampling based on the Gaussian density distribution and error-driven anisotropic parameterization that allocates texture resources according to rendering error. Our proposed ASAP Textured Gaussians, short for Adaptive Sampling and Anisotropic Parameterization, significantly improve the quality efficiency tradeoff, achieving high-fidelity rendering with far fewer texture parameters.
△ Less
Submitted 15 December, 2025;
originally announced December 2025.
-
A Single Architecture for Representing Invariance Under Any Space Group
Authors:
Cindy Y. Zhang,
Elif Ertekin,
Peter Orbanz,
Ryan P. Adams
Abstract:
Incorporating known symmetries in data into machine learning models has consistently improved predictive accuracy, robustness, and generalization. However, achieving exact invariance to specific symmetries typically requires designing bespoke architectures for each group of symmetries, limiting scalability and preventing knowledge transfer across related symmetries. In the case of the space groups…
▽ More
Incorporating known symmetries in data into machine learning models has consistently improved predictive accuracy, robustness, and generalization. However, achieving exact invariance to specific symmetries typically requires designing bespoke architectures for each group of symmetries, limiting scalability and preventing knowledge transfer across related symmetries. In the case of the space groups, symmetries critical to modeling crystalline solids in materials science and condensed matter physics, this challenge is particularly salient as there are 230 such groups in three dimensions. In this work we present a new approach to such crystallographic symmetries by developing a single machine learning architecture that is capable of adapting its weights automatically to enforce invariance to any input space group. Our approach is based on constructing symmetry-adapted Fourier bases through an explicit characterization of constraints that group operations impose on Fourier coefficients. Encoding these constraints into a neural network layer enables weight sharing across different space groups, allowing the model to leverage structural similarities between groups and overcome data sparsity when limited measurements are available for specific groups. We demonstrate the effectiveness of this approach in achieving competitive performance on material property prediction tasks and performing zero-shot learning to generalize to unseen groups.
△ Less
Submitted 15 December, 2025;
originally announced December 2025.
-
Design in Tiles: Automating GEMM Deployment on Tile-Based Many-PE Accelerators
Authors:
Aofeng Shen,
Chi Zhang,
Yakup Budanaz,
Alexandru Calotoiu,
Torsten Hoefler,
Luca Benini
Abstract:
Tile-based many-Processing Element (PE) accelerators can achieve competitive performance on General Matrix Multiplication (GEMM), but they are extremely hard to program, as their optimal software mapping is deeply coupled with hardware design which is unwieldy to manual deployment. We propose "Design in Tiles (DiT)", an automated framework connecting a deployment toolchain with a configurable exec…
▽ More
Tile-based many-Processing Element (PE) accelerators can achieve competitive performance on General Matrix Multiplication (GEMM), but they are extremely hard to program, as their optimal software mapping is deeply coupled with hardware design which is unwieldy to manual deployment. We propose "Design in Tiles (DiT)", an automated framework connecting a deployment toolchain with a configurable executable model for these accelerators. For evaluation, we apply our framework to GEMM targeting a large acceleration configuration (e.g., 32x32 tiles, 1979 TFLOPS@FP8, 4 TB/s Bandwidth) comparable to an NVIDIA GH200. We achieve higher PE utilization than GH200 with its expert-tuned GEMM libraries, achieving 1.2-2.0x speedup across diverse matrix shapes.
△ Less
Submitted 15 December, 2025;
originally announced December 2025.
-
3D Human-Human Interaction Anomaly Detection
Authors:
Shun Maeda,
Chunzhi Gu,
Koichiro Kamide,
Katsuya Hotta,
Shangce Gao,
Chao Zhang
Abstract:
Human-centric anomaly detection (AD) has been primarily studied to specify anomalous behaviors in a single person. However, as humans by nature tend to act in a collaborative manner, behavioral anomalies can also arise from human-human interactions. Detecting such anomalies using existing single-person AD models is prone to low accuracy, as these approaches are typically not designed to capture th…
▽ More
Human-centric anomaly detection (AD) has been primarily studied to specify anomalous behaviors in a single person. However, as humans by nature tend to act in a collaborative manner, behavioral anomalies can also arise from human-human interactions. Detecting such anomalies using existing single-person AD models is prone to low accuracy, as these approaches are typically not designed to capture the complex and asymmetric dynamics of interactions. In this paper, we introduce a novel task, Human-Human Interaction Anomaly Detection (H2IAD), which aims to identify anomalous interactive behaviors within collaborative 3D human actions. To address H2IAD, we then propose Interaction Anomaly Detection Network (IADNet), which is formalized with a Temporal Attention Sharing Module (TASM). Specifically, in designing TASM, we share the encoded motion embeddings across both people such that collaborative motion correlations can be effectively synchronized. Moreover, we notice that in addition to temporal dynamics, human interactions are also characterized by spatial configurations between two people. We thus introduce a Distance-Based Relational Encoding Module (DREM) to better reflect social cues in H2IAD. The normalizing flow is eventually employed for anomaly scoring. Extensive experiments on human-human motion benchmarks demonstrate that IADNet outperforms existing Human-centric AD baselines in H2IAD.
△ Less
Submitted 15 December, 2025;
originally announced December 2025.
-
Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
Authors:
Heyi Chen,
Siyan Chen,
Xin Chen,
Yanfei Chen,
Ying Chen,
Zhuo Chen,
Feng Cheng,
Tianheng Cheng,
Xinqi Cheng,
Xuyan Chi,
Jian Cong,
Jing Cui,
Qinpeng Cui,
Qide Dong,
Junliang Fan,
Jing Fang,
Zetao Fang,
Chengjian Feng,
Han Feng,
Mingyuan Gao,
Yu Gao,
Dong Guo,
Qiushan Guo,
Boyang Hao,
Qingkai Hao
, et al. (171 additional authors not shown)
Abstract:
Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional au…
▽ More
Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.
△ Less
Submitted 16 December, 2025; v1 submitted 15 December, 2025;
originally announced December 2025.
-
CoRA: A Collaborative Robust Architecture with Hybrid Fusion for Efficient Perception
Authors:
Gong Chen,
Chaokun Zhang,
Pengcheng Lv,
Xiaohui Xie
Abstract:
Collaborative perception has garnered significant attention as a crucial technology to overcome the perceptual limitations of single-agent systems. Many state-of-the-art (SOTA) methods have achieved communication efficiency and high performance via intermediate fusion. However, they share a critical vulnerability: their performance degrades under adverse communication conditions due to the misalig…
▽ More
Collaborative perception has garnered significant attention as a crucial technology to overcome the perceptual limitations of single-agent systems. Many state-of-the-art (SOTA) methods have achieved communication efficiency and high performance via intermediate fusion. However, they share a critical vulnerability: their performance degrades under adverse communication conditions due to the misalignment induced by data transmission, which severely hampers their practical deployment. To bridge this gap, we re-examine different fusion paradigms, and recover that the strengths of intermediate and late fusion are not a trade-off, but a complementary pairing. Based on this key insight, we propose CoRA, a novel collaborative robust architecture with a hybrid approach to decouple performance from robustness with low communication. It is composed of two components: a feature-level fusion branch and an object-level correction branch. Its first branch selects critical features and fuses them efficiently to ensure both performance and scalability. The second branch leverages semantic relevance to correct spatial displacements, guaranteeing resilience against pose errors. Experiments demonstrate the superiority of CoRA. Under extreme scenarios, CoRA improves upon its baseline performance by approximately 19% in AP@0.7 with more than 5x less communication volume, which makes it a promising solution for robust collaborative perception.
△ Less
Submitted 15 December, 2025;
originally announced December 2025.
-
GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training
Authors:
Tong Wei,
Yijun Yang,
Changhao Zhang,
Junliang Xing,
Yuanchun Shi,
Zongqing Lu,
Deheng Ye
Abstract:
Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision-language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods densify the reward by querying a teacher that provides step-level feedback, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation, but rely on costly, often privileged models as the teacher, limiting pr…
▽ More
Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision-language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods densify the reward by querying a teacher that provides step-level feedback, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation, but rely on costly, often privileged models as the teacher, limiting practicality and reproducibility. We introduce GTR-Turbo, a highly efficient upgrade to GTR, which matches the performance without training or querying an expensive teacher model. Specifically, GTR-Turbo merges the weights of checkpoints produced during the ongoing RL training, and then uses this merged model as a "free" teacher to guide the subsequent RL via supervised fine-tuning or soft logit distillation. This design removes dependence on privileged VLMs (e.g., GPT or Gemini), mitigates the "entropy collapse" observed in prior work, and keeps training stable. Across diverse visual agentic tasks, GTR-Turbo improves the accuracy of the baseline model by 10-30% while reducing wall-clock training time by 50% and compute cost by 60% relative to GTR.
△ Less
Submitted 15 December, 2025;
originally announced December 2025.
-
FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive Operators via Inter-Core Connection
Authors:
Ziyu Huang,
Yangjie Zhou,
Zihan Liu,
Xinhao Luo,
Yijia Diao,
Minyi Guo,
Jidong Zhai,
Yu Feng,
Chen Zhang,
Anbang Wu,
Jingwen Leng
Abstract:
The scaling of computation throughput continues to outpace improvements in memory bandwidth, making many deep learning workloads memory-bound. Kernel fusion is a key technique to alleviate this problem, but the fusion strategies of existing compilers and frameworks are limited to using local scratchpad memory. When the intermediate results exceed the limited capacity (such as FFN), the fusion fail…
▽ More
The scaling of computation throughput continues to outpace improvements in memory bandwidth, making many deep learning workloads memory-bound. Kernel fusion is a key technique to alleviate this problem, but the fusion strategies of existing compilers and frameworks are limited to using local scratchpad memory. When the intermediate results exceed the limited capacity (such as FFN), the fusion fails. Although modern GPUs (like the NVIDIA H100) now incorporate an inter-core connection mechanism known as Distributed Shared Memory(DSM)--providing a larger, high-bandwidth, and low-latency on-chip memory pool--this hardware potential has yet to be exploited by software frameworks. To bridge this gap, we present FlashFuser, the first compiler framework to utilize inter-core connection for kernel fusion on modern GPUs. FlashFuser extends established fusion techniques to the DSM domain through three core contributions. First, we propose a powerful DSM-based communication abstraction that formalizes complex cluster-based data exchange patterns, such as reduce, shuffle and multiply. Second, we introduce a dataflow analyzer that generalizes loop scheduling, resource mapping, and tile selection to the distributed memory hierarchy; it determines the optimal execution order and tile sizes by quantifying data movement across memory levels. Finally, FlashFuser integrates these components into a unified search engine that employs analytical cost modeling and DSM-aware pruning strategies to efficiently discover the optimal execution plan. Our evaluation on an NVIDIA H100 GPU shows that FlashFuser reduces memory access by 58% and delivers kernel speedups of 3.3x against highly-tuned libraries and 4.1x against state-of-the-art compilers, resulting in a 1.24x end-to-end speedup.
△ Less
Submitted 14 December, 2025;
originally announced December 2025.
-
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
Authors:
Jingzhe Ding,
Shengda Long,
Changxin Pu,
Huan Zhou,
Hongwan Gao,
Xiang Gao,
Chao He,
Yue Hou,
Fei Hu,
Zhaojian Li,
Weiran Shi,
Zaiyuan Wang,
Daoguang Zan,
Chenchen Zhang,
Xiaoxu Zhang,
Qizhi Chen,
Xianfu Cheng,
Bo Deng,
Qingshui Gu,
Kai Hua,
Juntao Lin,
Pai Liu,
Mingchen Li,
Xuanguang Pan,
Zifan Peng
, et al. (23 additional authors not shown)
Abstract:
Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. Most prior evaluations focus on localized code generation, scaffolded completion, or short-term repair tasks, leaving open the question of whether agents can sustain coherent re…
▽ More
Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. Most prior evaluations focus on localized code generation, scaffolded completion, or short-term repair tasks, leaving open the question of whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we present NL2Repo Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents. Given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, implement multi-module logic, and produce a fully installable Python library. Our experiments across state-of-the-art open- and closed-source models reveal that long-horizon repository generation remains largely unsolved: even the strongest agents achieve below 40% average test pass rates and rarely complete an entire repository correctly. Detailed analysis uncovers fundamental long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps. NL2Repo Bench establishes a rigorous, verifiable testbed for measuring sustained agentic competence and highlights long-horizon reasoning as a central bottleneck for the next generation of autonomous coding agents.
△ Less
Submitted 14 December, 2025;
originally announced December 2025.
-
Hold Onto That Thought: Assessing KV Cache Compression On Reasoning
Authors:
Minghui Liu,
Aadi Palnitkar,
Tahseen Rabbani,
Hyunwoo Jae,
Kyle Rui Sang,
Dixi Yao,
Shayan Shabihi,
Fuheng Zhao,
Tian Li,
Ce Zhang,
Furong Huang,
Kunpeng Zhang
Abstract:
Large language models (LLMs) have demonstrated remarkable performance on long-context tasks, but are often bottlenecked by memory constraints. Namely, the KV cache, which is used to significantly speed up attention computations, grows linearly with context length. A suite of compression algorithms has been introduced to alleviate cache growth by evicting unimportant tokens. However, several popula…
▽ More
Large language models (LLMs) have demonstrated remarkable performance on long-context tasks, but are often bottlenecked by memory constraints. Namely, the KV cache, which is used to significantly speed up attention computations, grows linearly with context length. A suite of compression algorithms has been introduced to alleviate cache growth by evicting unimportant tokens. However, several popular strategies are targeted towards the prefill phase, i.e., processing long prompt context, and their performance is rarely assessed on reasoning tasks requiring long decoding. In particular, short but complex prompts, such as those in benchmarks like GSM8K and MATH500, often benefit from multi-step reasoning and self-reflection, resulting in thinking sequences thousands of tokens long. In this work, we benchmark the performance of several popular compression strategies on long-reasoning tasks. For the non-reasoning Llama-3.1-8B-Instruct, we determine that no singular strategy fits all, and that performance is heavily influenced by dataset type. However, we discover that H2O and our decoding-enabled variant of SnapKV are dominant strategies for reasoning models, indicating the utility of heavy-hitter tracking for reasoning traces. We also find that eviction strategies at low budgets can produce longer reasoning traces, revealing a tradeoff between cache size and inference costs.
△ Less
Submitted 12 December, 2025;
originally announced December 2025.
-
CogniSNN: Enabling Neuron-Expandability, Pathway-Reusability, and Dynamic-Configurability with Random Graph Architectures in Spiking Neural Networks
Authors:
Yongsheng Huang,
Peibo Duan,
Yujie Wu,
Kai Sun,
Zhipeng Liu,
Changsheng Zhang,
Bin Zhang,
Mingkun Xu
Abstract:
Spiking neural networks (SNNs), regarded as the third generation of artificial neural networks, are expected to bridge the gap between artificial intelligence and computational neuroscience. However, most mainstream SNN research directly adopts the rigid, chain-like hierarchical architecture of traditional artificial neural networks (ANNs), ignoring key structural characteristics of the brain. Bio…
▽ More
Spiking neural networks (SNNs), regarded as the third generation of artificial neural networks, are expected to bridge the gap between artificial intelligence and computational neuroscience. However, most mainstream SNN research directly adopts the rigid, chain-like hierarchical architecture of traditional artificial neural networks (ANNs), ignoring key structural characteristics of the brain. Biological neurons are stochastically interconnected, forming complex neural pathways that exhibit Neuron-Expandability, Pathway-Reusability, and Dynamic-Configurability. In this paper, we introduce a new SNN paradigm, named Cognition-aware SNN (CogniSNN), by incorporating Random Graph Architecture (RGA). Furthermore, we address the issues of network degradation and dimensional mismatch in deep pathways by introducing an improved pure spiking residual mechanism alongside an adaptive pooling strategy. Then, we design a Key Pathway-based Learning without Forgetting (KP-LwF) approach, which selectively reuses critical neural pathways while retaining historical knowledge, enabling efficient multi-task transfer. Finally, we propose a Dynamic Growth Learning (DGL) algorithm that allows neurons and synapses to grow dynamically along the internal temporal dimension. Extensive experiments demonstrate that CogniSNN achieves performance comparable to, or even surpassing, current state-of-the-art SNNs on neuromorphic datasets and Tiny-ImageNet. The Pathway-Reusability enhances the network's continuous learning capability across different scenarios, while the dynamic growth algorithm improves robustness against interference and mitigates the fixed-timestep constraints during neuromorphic chip deployment. This work demonstrates the potential of SNNs with random graph structures in advancing brain-inspired intelligence and lays the foundation for their practical application on neuromorphic hardware.
△ Less
Submitted 12 December, 2025;
originally announced December 2025.
-
Metaphor-based Jailbreaking Attacks on Text-to-Image Models
Authors:
Chenyu Zhang,
Yiwen Ma,
Lanjun Wang,
Wenhui Li,
Yi Tu,
An-An Liu
Abstract:
Text-to-image~(T2I) models commonly incorporate defense mechanisms to prevent the generation of sensitive images. Unfortunately, recent jailbreaking attacks have shown that adversarial prompts can effectively bypass these mechanisms and induce T2I models to produce sensitive content, revealing critical safety vulnerabilities. However, existing attack methods implicitly assume that the attacker kno…
▽ More
Text-to-image~(T2I) models commonly incorporate defense mechanisms to prevent the generation of sensitive images. Unfortunately, recent jailbreaking attacks have shown that adversarial prompts can effectively bypass these mechanisms and induce T2I models to produce sensitive content, revealing critical safety vulnerabilities. However, existing attack methods implicitly assume that the attacker knows the type of deployed defenses, which limits their effectiveness against unknown or diverse defense mechanisms. In this work, we introduce \textbf{MJA}, a \textbf{m}etaphor-based \textbf{j}ailbreaking \textbf{a}ttack method inspired by the Taboo game, aiming to effectively and efficiently attack diverse defense mechanisms without prior knowledge of their type by generating metaphor-based adversarial prompts. Specifically, MJA consists of two modules: an LLM-based multi-agent generation module~(MLAG) and an adversarial prompt optimization module~(APO). MLAG decomposes the generation of metaphor-based adversarial prompts into three subtasks: metaphor retrieval, context matching, and adversarial prompt generation. Subsequently, MLAG coordinates three LLM-based agents to generate diverse adversarial prompts by exploring various metaphors and contexts. To enhance attack efficiency, APO first trains a surrogate model to predict the attack results of adversarial prompts and then designs an acquisition strategy to adaptively identify optimal adversarial prompts. Extensive experiments on T2I models with various external and internal defense mechanisms demonstrate that MJA outperforms six baseline methods, achieving stronger attack performance while using fewer queries. Code is available in https://github.com/datar001/metaphor-based-jailbreaking-attack.
△ Less
Submitted 6 December, 2025;
originally announced December 2025.
-
ESS: An Offload-Centric Latent-Cache Management Architecture for DeepSeek-V3.2-Exp
Authors:
Xinhang Chen,
Chao Zhang,
Jiahuan He,
Wei Liu,
Jianming Zhang,
Wenlong Zhou,
Xiao Li,
Pai Zeng,
Shiyong Li,
Yuanpan Qian,
Dong Li,
Zhaogeng Li
Abstract:
DeepSeek-V3.2-Exp introduces a sparse attention mechanism that significantly reduces inference latency in long-context scenarios. Although the overall throughput has improved greatly, the Decode-stage of PD disaggregation remains to be a major bottleneck. This bottleneck primarily stems from the conflict between linear growth of Latent-Cache with sequence length and the limited GPU memory capacity…
▽ More
DeepSeek-V3.2-Exp introduces a sparse attention mechanism that significantly reduces inference latency in long-context scenarios. Although the overall throughput has improved greatly, the Decode-stage of PD disaggregation remains to be a major bottleneck. This bottleneck primarily stems from the conflict between linear growth of Latent-Cache with sequence length and the limited GPU memory capacity, which constrains the feasible batch-size and thereby suppresses Decode-stage throughput.
To address this challenge, we propose ESS (Extended Sparse Server), an offload-centric system design tailored for DeepSeek-V3.2-Exp. ESS selectively offloads Latent-Cache to CPU memory while preserving latency-critical components on GPU. By freeing up GPU memory, ESS effectively decoupling batch-size scaling from GPU memory constraints. This design significantly improves Decode-stage throughput, thereby reducing deployment costs in real-world settings.
Our high-fidelity simulations show that ESS delivers 69.4\% throughput improvement at 32K context length and up to 123\% throughput improvement at 128K, demonstrating its effectiveness for large-context inference workloads. These results highlight ESS as a practical and scalable solution for long-context LLM serving.
△ Less
Submitted 11 December, 2025;
originally announced December 2025.
-
RoboNeuron: A Modular Framework Linking Foundation Models and ROS for Embodied AI
Authors:
Weifan Guan,
Huasen Xi,
Chenxiao Zhang,
Aosheng Li,
Qinghao Hu,
Jian Cheng
Abstract:
Current embodied AI systems face severe engineering impediments, primarily characterized by poor cross-scenario adaptability, rigid inter-module coupling, and fragmented inference acceleration. To overcome these limitations, we propose RoboNeuron, a universal deployment framework for embodied intelligence. RoboNeuron is the first framework to deeply integrate the cognitive capabilities of Large La…
▽ More
Current embodied AI systems face severe engineering impediments, primarily characterized by poor cross-scenario adaptability, rigid inter-module coupling, and fragmented inference acceleration. To overcome these limitations, we propose RoboNeuron, a universal deployment framework for embodied intelligence. RoboNeuron is the first framework to deeply integrate the cognitive capabilities of Large Language Models (LLMs) and Vision-Language-Action (VLA) models with the real-time execution backbone of the Robot Operating System (ROS). We utilize the Model Context Protocol (MCP) as a semantic bridge, enabling the LLM to dynamically orchestrate underlying robotic tools. The framework establishes a highly modular architecture that strictly decouples sensing, reasoning, and control by leveraging ROS's unified communication interfaces. Crucially, we introduce an automated tool to translate ROS messages into callable MCP functions, significantly streamlining development. RoboNeuron significantly enhances cross-scenario adaptability and component flexibility, while establishing a systematic platform for horizontal performance benchmarking, laying a robust foundation for scalable real-world embodied applications.
△ Less
Submitted 11 December, 2025;
originally announced December 2025.
-
DMP-TTS: Disentangled multi-modal Prompting for Controllable Text-to-Speech with Chained Guidance
Authors:
Kang Yin,
Chunyu Qiang,
Sirui Zhao,
Xiaopeng Wang,
Yuzhe Liang,
Pengfei Cai,
Tong Xu,
Chen Zhang,
Enhong Chen
Abstract:
Controllable text-to-speech (TTS) systems face significant challenges in achieving independent manipulation of speaker timbre and speaking style, often suffering from entanglement between these attributes. We present DMP-TTS, a latent Diffusion Transformer (DiT) framework with explicit disentanglement and multi-modal prompting. A CLAP-based style encoder (Style-CLAP) aligns cues from reference aud…
▽ More
Controllable text-to-speech (TTS) systems face significant challenges in achieving independent manipulation of speaker timbre and speaking style, often suffering from entanglement between these attributes. We present DMP-TTS, a latent Diffusion Transformer (DiT) framework with explicit disentanglement and multi-modal prompting. A CLAP-based style encoder (Style-CLAP) aligns cues from reference audio and descriptive text in a shared space and is trained with contrastive learning plus multi-task supervision on style attributes. For fine-grained control during inference, we introduce chained classifier-free guidance (cCFG) trained with hierarchical condition dropout, enabling independent adjustment of content, timbre, and style guidance strengths. Additionally, we employ Representation Alignment (REPA) to distill acoustic-semantic features from a pretrained Whisper model into intermediate DiT representations, stabilizing training and accelerating convergence. Experiments show that DMP-TTS delivers stronger style controllability than open-source baselines while maintaining competitive intelligibility and naturalness. Code and demos will be available at https://y61329697.github.io/DMP-TTS/.
△ Less
Submitted 10 December, 2025;
originally announced December 2025.
-
Benchmarking Real-World Medical Image Classification with Noisy Labels: Challenges, Practice, and Outlook
Authors:
Yuan Ma,
Junlin Hou,
Chao Zhang,
Yukun Zhou,
Zongyuan Ge,
Haoran Xie,
Lie Ju
Abstract:
Learning from noisy labels remains a major challenge in medical image analysis, where annotation demands expert knowledge and substantial inter-observer variability often leads to inconsistent or erroneous labels. Despite extensive research on learning with noisy labels (LNL), the robustness of existing methods in medical imaging has not been systematically assessed. To address this gap, we introd…
▽ More
Learning from noisy labels remains a major challenge in medical image analysis, where annotation demands expert knowledge and substantial inter-observer variability often leads to inconsistent or erroneous labels. Despite extensive research on learning with noisy labels (LNL), the robustness of existing methods in medical imaging has not been systematically assessed. To address this gap, we introduce LNMBench, a comprehensive benchmark for Label Noise in Medical imaging. LNMBench encompasses \textbf{10} representative methods evaluated across 7 datasets, 6 imaging modalities, and 3 noise patterns, establishing a unified and reproducible framework for robustness evaluation under realistic conditions. Comprehensive experiments reveal that the performance of existing LNL methods degrades substantially under high and real-world noise, highlighting the persistent challenges of class imbalance and domain variability in medical data. Motivated by these findings, we further propose a simple yet effective improvement to enhance model robustness under such conditions. The LNMBench codebase is publicly released to facilitate standardized evaluation, promote reproducible research, and provide practical insights for developing noise-resilient algorithms in both research and real-world medical applications.The codebase is publicly available on https://github.com/myyy777/LNMBench.
△ Less
Submitted 9 December, 2025;
originally announced December 2025.
-
Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
Authors:
Chuhan Zhang,
Guillaume Le Moing,
Skanda Koppula,
Ignacio Rocco,
Liliane Momeni,
Junyu Xie,
Shuyang Sun,
Rahul Sukthankar,
Joƫlle K. Barral,
Raia Hadsell,
Zoubin Ghahramani,
Andrew Zisserman,
Junlin Zhang,
Mehdi S. M. Sajjadi
Abstract:
Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single…
▽ More
Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks. We refer to the project webpage for animated results: https://d4rt-paper.github.io/.
△ Less
Submitted 10 December, 2025; v1 submitted 9 December, 2025;
originally announced December 2025.
-
Multi-Agent Intelligence for Multidisciplinary Decision-Making in Gastrointestinal Oncology
Authors:
Rongzhao Zhang,
Junqiao Wang,
Shuyun Yang,
Mouxiao Bian,
Chao Ding,
Yuwei Bai,
Chihao Zhang,
Yuguang Shen,
Lei Wang,
Lei Zheng,
Qiujuan Yan,
Yun Zhong,
Meiling Liu,
Jiwei Yu,
Zheng Wang,
Jie Xu,
Meng Luo
Abstract:
Multimodal clinical reasoning in the field of gastrointestinal (GI) oncology necessitates the integrated interpretation of endoscopic imagery, radiological data, and biochemical markers. Despite the evident potential exhibited by Multimodal Large Language Models (MLLMs), they frequently encounter challenges such as context dilution and hallucination when confronted with intricate, heterogeneous me…
▽ More
Multimodal clinical reasoning in the field of gastrointestinal (GI) oncology necessitates the integrated interpretation of endoscopic imagery, radiological data, and biochemical markers. Despite the evident potential exhibited by Multimodal Large Language Models (MLLMs), they frequently encounter challenges such as context dilution and hallucination when confronted with intricate, heterogeneous medical histories. In order to address these limitations, a hierarchical Multi-Agent Framework is proposed, which emulates the collaborative workflow of a human Multidisciplinary Team (MDT). The system attained a composite expert evaluation score of 4.60/5.00, thereby demonstrating a substantial improvement over the monolithic baseline. It is noteworthy that the agent-based architecture yielded the most substantial enhancements in reasoning logic and medical accuracy. The findings indicate that mimetic, agent-based collaboration provides a scalable, interpretable, and clinically robust paradigm for automated decision support in oncology.
△ Less
Submitted 9 December, 2025;
originally announced December 2025.
-
CrowdLLM: Building LLM-Based Digital Populations Augmented with Generative Models
Authors:
Ryan Feng Lin,
Keyu Tian,
Hanming Zheng,
Congjing Zhang,
Li Zeng,
Shuai Huang
Abstract:
The emergence of large language models (LLMs) has sparked much interest in creating LLM-based digital populations that can be applied to many applications such as social simulation, crowdsourcing, marketing, and recommendation systems. A digital population can reduce the cost of recruiting human participants and alleviate many concerns related to human subject study. However, research has found th…
▽ More
The emergence of large language models (LLMs) has sparked much interest in creating LLM-based digital populations that can be applied to many applications such as social simulation, crowdsourcing, marketing, and recommendation systems. A digital population can reduce the cost of recruiting human participants and alleviate many concerns related to human subject study. However, research has found that most of the existing works rely solely on LLMs and could not sufficiently capture the accuracy and diversity of a real human population. To address this limitation, we propose CrowdLLM that integrates pretrained LLMs and generative models to enhance the diversity and fidelity of the digital population. We conduct theoretical analysis of CrowdLLM regarding its great potential in creating cost-effective, sufficiently representative, scalable digital populations that can match the quality of a real crowd. Comprehensive experiments are also conducted across multiple domains (e.g., crowdsourcing, voting, user rating) and simulation studies which demonstrate that CrowdLLM achieves promising performance in both accuracy and distributional fidelity to human data.
△ Less
Submitted 2 December, 2025;
originally announced December 2025.
-
Advancing physiological time series reconstruction and imputation via mixture of receptive fields and experts fusion
Authors:
Ci Zhang,
Huayu Li,
Changdi Yang,
Jiangnan Xia,
Yanzhi Wang,
Xiaolong Ma,
Jin Lu,
Ao Li,
Geng Yuan
Abstract:
Recent studies show that using diffusion models for time series signal reconstruction holds great promise. However, such approaches remain largely unexplored in the domain of medical time series. The unique characteristics of the physiological time series signals, such as multivariate, high temporal variability, highly noisy, and artifact-prone, make deep learning-based approaches still challengin…
▽ More
Recent studies show that using diffusion models for time series signal reconstruction holds great promise. However, such approaches remain largely unexplored in the domain of medical time series. The unique characteristics of the physiological time series signals, such as multivariate, high temporal variability, highly noisy, and artifact-prone, make deep learning-based approaches still challenging for tasks such as imputation. Hence, we propose a novel Mixture of Experts (MoE)-based noise estimator within a score-based diffusion framework. Specifically, the Receptive Field Adaptive MoE (RFAMoE) module is designed to enable each channel to adaptively select desired receptive fields throughout the diffusion process. Moreover, recent literature has found that when generating a physiological signal, performing multiple inferences and averaging the reconstructed signals can effectively reduce reconstruction errors, but at the cost of significant computational and latency overhead. We design a Fusion MoE module and innovatively leverage the nature of MoE module to generate K noise signals in parallel, fuse them using a routing mechanism, and complete signal reconstruction in a single inference step. This design not only improves performance over previous methods but also eliminates the substantial computational cost and latency associated with multiple inference processes. Extensive results demonstrate that our proposed framework consistently outperforms diffusion-based SOTA works on different tasks and datasets.
△ Less
Submitted 12 December, 2025; v1 submitted 26 November, 2025;
originally announced December 2025.
-
OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory
Authors:
Zhaochong An,
Menglin Jia,
Haonan Qiu,
Zijian Zhou,
Xiaoke Huang,
Zhiheng Liu,
Weiming Ren,
Kumara Kahatapitiya,
Ding Liu,
Sen He,
Chenyang Zhang,
Tao Xiang,
Fanny Yang,
Serge Belongie,
Tian Xie
Abstract:
Storytelling in real-world videos often unfolds through multiple shots -- discontinuous yet semantically connected clips that together convey a coherent narrative. However, existing multi-shot video generation (MSV) methods struggle to effectively model long-range cross-shot context, as they rely on limited temporal windows or single keyframe conditioning, leading to degraded performance under com…
▽ More
Storytelling in real-world videos often unfolds through multiple shots -- discontinuous yet semantically connected clips that together convey a coherent narrative. However, existing multi-shot video generation (MSV) methods struggle to effectively model long-range cross-shot context, as they rely on limited temporal windows or single keyframe conditioning, leading to degraded performance under complex narratives. In this work, we propose OneStory, enabling global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, enabling autoregressive shot synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. We introduce two key modules: a Frame Selection module that constructs a semantically-relevant global memory based on informative frames from prior shots, and an Adaptive Conditioner that performs importance-guided patchification to generate compact context for direct conditioning. We further curate a high-quality multi-shot dataset with referential captions to mirror real-world storytelling patterns, and design effective training strategies under the next-shot paradigm. Finetuned from a pretrained I2V model on our curated 60K dataset, OneStory achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings, enabling controllable and immersive long-form video storytelling.
△ Less
Submitted 8 December, 2025;
originally announced December 2025.
-
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
Authors:
Charlie Zhang,
Graham Neubig,
Xiang Yue
Abstract:
Recent reinforcement learning (RL) techniques have yielded impressive reasoning improvements in language models, yet it remains unclear whether post-training truly extends a model's reasoning ability beyond what it acquires during pre-training. A central challenge is the lack of control in modern training pipelines: large-scale pre-training corpora are opaque, mid-training is often underexamined,…
▽ More
Recent reinforcement learning (RL) techniques have yielded impressive reasoning improvements in language models, yet it remains unclear whether post-training truly extends a model's reasoning ability beyond what it acquires during pre-training. A central challenge is the lack of control in modern training pipelines: large-scale pre-training corpora are opaque, mid-training is often underexamined, and RL objectives interact with unknown prior knowledge in complex ways. To resolve this ambiguity, we develop a fully controlled experimental framework that isolates the causal contributions of pre-training, mid-training, and RL-based post-training. Our approach employs synthetic reasoning tasks with explicit atomic operations, parseable step-by-step reasoning traces, and systematic manipulation of training distributions. We evaluate models along two axes: extrapolative generalization to more complex compositions and contextual generalization across surface contexts. Using this framework, we reconcile competing views on RL's effectiveness. We show that: 1) RL produces true capability gains (pass@128) only when pre-training leaves sufficient headroom and when RL data target the model's edge of competence, tasks at the boundary that are difficult but not yet out of reach. 2) Contextual generalization requires minimal yet sufficient pre-training exposure, after which RL can reliably transfer. 3) Mid-training significantly enhances performance under fixed compute compared with RL only, demonstrating its central but underexplored role in training pipelines. 4) Process-level rewards reduce reward hacking and improve reasoning fidelity. Together, these results clarify the interplay between pre-training, mid-training, and RL, offering a foundation for understanding and improving reasoning LM training strategies.
△ Less
Submitted 8 December, 2025;
originally announced December 2025.