-
dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models
Authors:
Yi Xin,
Siqi Luo,
Qi Qin,
Haoxing Chen,
Kaiwen Zhu,
Zhiwei Zhang,
Yangfan He,
Rongchao Zhang,
Jinbin Bai,
Shuo Cao,
Bin Fu,
Junjun He,
Yihao Liu,
Yuewen Cao,
Xiaohong Liu
Abstract:
Diffusion Multi-modal Large Language Models (dMLLMs) have recently emerged as a novel architecture unifying image generation and understanding. However, developing effective and efficient Test-Time Scaling (TTS) methods to unlock their full generative potential remains an underexplored challenge. To address this, we propose dMLLM-TTS, a novel framework operating on two complementary scaling axes: (1) trajectory exploration scaling to enhance the diversity of generated hypotheses, and (2) iterative refinement scaling for stable generation. Conventional TTS approaches typically perform linear search across these two dimensions, incurring substantial computational costs of O(NT) and requiring an external verifier for best-of-N selection. To overcome these limitations, we propose two innovations. First, we design an efficient hierarchical search algorithm with O(N+T) complexity that adaptively expands and prunes sampling trajectories. Second, we introduce a self-verified feedback mechanism that leverages the dMLLMs' intrinsic image understanding capabilities to assess text-image alignment, eliminating the need for an external verifier. Extensive experiments on the GenEval benchmark across three representative dMLLMs (Lumina-DiMOO, MMaDA, and Muddit) show that our framework substantially improves generation quality while achieving up to 6x greater efficiency than linear search. Project page: https://github.com/Alpha-VLLM/Lumina-DiMOO.
Submitted 22 December, 2025;
originally announced December 2025.
-
Benchmarking neural surrogates on realistic spatiotemporal multiphysics flows
Authors:
Runze Mao,
Rui Zhang,
Xuan Bai,
Tianhao Wu,
Teng Zhang,
Zhenyi Chen,
Minqi Lin,
Bocheng Zeng,
Yangchen Xu,
Yingxuan Xiang,
Haoze Zhang,
Shubham Goswami,
Pierre A. Dawe,
Yifan Xu,
Zhenhua An,
Mengtao Yan,
Xiaoyi Lu,
Yi Wang,
Rongbo Bai,
Haobu Gao,
Xiaohang Fang,
Han Li,
Hao Sun,
Zhi X. Chen
Abstract:
Predicting multiphysics dynamics is computationally expensive and challenging due to the severe coupling of multi-scale, heterogeneous physical processes. While neural surrogates promise a paradigm shift, the field currently suffers from an "illusion of mastery", as repeatedly emphasized in top-tier commentaries: existing evaluations overly rely on simplified, low-dimensional proxies, which fail to expose the models' inherent fragility in realistic regimes. To bridge this critical gap, we present REALM (REalistic AI Learning for Multiphysics), a rigorous benchmarking framework designed to test neural surrogates on challenging, application-driven reactive flows. REALM features 11 high-fidelity datasets spanning from canonical multiphysics problems to complex propulsion and fire safety scenarios, alongside a standardized end-to-end training and evaluation protocol that incorporates multiphysics-aware preprocessing and a robust rollout strategy. Using this framework, we systematically benchmark over a dozen representative surrogate model families, including spectral operators, convolutional models, Transformers, pointwise operators, and graph/mesh networks, and identify three robust trends: (i) a scaling barrier governed jointly by dimensionality, stiffness, and mesh irregularity, leading to rapidly growing rollout errors; (ii) performance primarily controlled by architectural inductive biases rather than parameter count; and (iii) a persistent gap between nominal accuracy metrics and physically trustworthy behavior, where models with high correlations still miss key transient structures and integral quantities. Taken together, REALM exposes the limits of current neural surrogates on realistic multiphysics flows and offers a rigorous testbed to drive the development of next-generation physics-aware architectures.
Submitted 21 December, 2025;
originally announced December 2025.
-
Wireless Copilot: An AI-Powered Partner for Navigating Next-Generation Wireless Complexity
Authors:
Haoxiang Luo,
Ruichen Zhang,
Yinqiu Liu,
Gang Sun,
Hongfang Yu,
Dusit Niyato,
Shiwen Mao,
Dong In Kim
Abstract:
The sixth generation (6G) of wireless networks introduces a level of operational complexity that exceeds the limits of traditional automation and manual oversight. This paper introduces the "Wireless Copilot", an AI-powered technical assistant designed to function as a collaborative partner for human network designers, engineers, and operators. We posit that, by integrating Large Language Models (LLMs) with a robust cognitive framework, it will surpass existing AI tools and interact with wireless devices, carrying the user's intentions into the actual network execution process. Wireless Copilot can then translate high-level human intent into precise, optimized, and verifiable network actions. This framework bridges the gap between human expertise and machine-scale complexity, enabling more efficient, intelligent, and trustworthy management of 6G systems. Wireless Copilot will be a novel layer between the wireless infrastructure and the network operators. Moreover, we explore Wireless Copilot's methodology and analyze its application in Low-Altitude Wireless Networks (LAWNets) assisting 6G networking, including network design, configuration, evaluation, and optimization. Additionally, we present a case study on intent-based LAWNets resource allocation, demonstrating its superior adaptability compared to others. Finally, we outline future research directions toward creating a comprehensive human-AI collaborative ecosystem for the 6G era.
Submitted 20 December, 2025;
originally announced December 2025.
-
NL2CA: Auto-formalizing Cognitive Decision-Making from Natural Language Using an Unsupervised CriticNL2LTL Framework
Authors:
Zihao Deng,
Yijia Li,
Renrui Zhang,
Peijun Ye
Abstract:
Cognitive computing models offer a formal and interpretable way to characterize human deliberation and decision-making, yet their development remains labor-intensive. In this paper, we propose NL2CA, a novel method for auto-formalizing cognitive decision-making rules from natural language descriptions of human experience. Unlike most related work, which relies on either purely manual or human-guided interactive modeling, our method is fully automated without any human intervention. The approach first translates text into Linear Temporal Logic (LTL) using a fine-tuned large language model (LLM), then refines the logic via an unsupervised Critic Tree, and finally transforms the output into executable production rules compatible with symbolic cognitive frameworks. Based on the resulting rules, a cognitive agent is further constructed and optimized through cognitive reinforcement learning according to real-world behavioral data. Our method is validated in two domains: (1) NL-to-LTL translation, where our CriticNL2LTL module achieves consistent performance across both expert and large-scale benchmarks without human-in-the-loop feedback, and (2) cognitive driving simulation, where agents automatically constructed from human interviews have successfully learned the diverse decision patterns of about 70 trials in different critical scenarios. Experimental results demonstrate that NL2CA enables scalable, interpretable, and human-aligned cognitive modeling from unstructured textual data, offering a novel paradigm to automatically design symbolic cognitive agents.
Submitted 19 December, 2025;
originally announced December 2025.
-
Grad: Guided Relation Diffusion Generation for Graph Augmentation in Graph Fraud Detection
Authors:
Jie Yang,
Rui Zhang,
Ziyang Cheng,
Dawei Cheng,
Guang Yang,
Bo Wang
Abstract:
Nowadays, Graph Fraud Detection (GFD) in financial scenarios has become an urgent research topic to protect online payment security. However, as organized crime groups are becoming more professional in real-world scenarios, fraudsters are employing more sophisticated camouflage strategies. Specifically, fraudsters disguise themselves by mimicking the behavioral data collected by platforms, ensuring that their key characteristics are consistent with those of benign users to a high degree, which we call Adaptive Camouflage. Consequently, this narrows the differences in behavioral traits between them and benign users within the platform's database, thereby making current GFD models lose efficiency. To address this problem, we propose Grad, a relation diffusion-based graph augmentation model. In detail, Grad leverages a supervised graph contrastive learning module to enhance the fraud-benign difference and employs a guided relation diffusion generator to generate auxiliary homophilic relations from scratch. Based on these, weak fraudulent signals are enhanced during the aggregation process and thus become obvious enough to be captured. Extensive experiments have been conducted on two real-world datasets provided by WeChat Pay, one of the largest online payment platforms with billions of users, and three public datasets. The results show that Grad outperforms SOTA methods in various scenarios, achieving increases of up to 11.10% and 43.95% in AUC and AP, respectively. Our code is released at https://github.com/AI4Risk/antifraud and https://github.com/Muyiiiii/WWW25-Grad.
Submitted 19 December, 2025;
originally announced December 2025.
-
Transformer-Based Modeling of User Interaction Sequences for Dwell Time Prediction in Human-Computer Interfaces
Authors:
Rui Liu,
Runsheng Zhang,
Shixiao Wang
Abstract:
This study investigates the task of dwell time prediction and proposes a Transformer framework based on interaction behavior modeling. The method first represents user interaction sequences on the interface by integrating dwell duration, click frequency, scrolling behavior, and contextual features, which are mapped into a unified latent space through embedding and positional encoding. On this basis, a multi-head self-attention mechanism is employed to capture long-range dependencies, while a feed-forward network performs deep nonlinear transformations to model the dynamic patterns of dwell time. Multiple comparative experiments are conducted with BILSTM, DRFormer, FedFormer, and iTransformer as baselines under the same conditions. The results show that the proposed method achieves the best performance in terms of MSE, RMSE, MAPE, and RMAE, and more accurately captures the complex patterns in interaction behavior. In addition, sensitivity experiments are carried out on hyperparameters and environments to examine the impact of the number of attention heads, sequence window length, and device environment on prediction performance, which further demonstrates the robustness and adaptability of the method. Overall, this study provides a new solution for dwell time prediction from both theoretical and methodological perspectives and verifies its effectiveness in multiple aspects.
Submitted 18 December, 2025;
originally announced December 2025.
-
Nonparametric Stochastic Subspaces via the Bootstrap for Characterizing Model Error
Authors:
Akash Yadav,
Ruda Zhang
Abstract:
Reliable forward uncertainty quantification in engineering requires methods that account for aleatory and epistemic uncertainties. In many applications, epistemic effects arising from uncertain parameters and model form dominate prediction error and strongly influence engineering decisions. Because distinguishing and representing each source separately is often infeasible, their combined effect is typically analyzed using a unified model-error framework. Model error directly affects model credibility and predictive reliability; yet its characterization remains challenging. To address this need, we introduce a bootstrap-based stochastic subspace model for characterizing model error in the stochastic reduced-order modeling framework. Given a snapshot matrix of state vectors, the method leverages the empirical data distribution to induce a sampling distribution over principal subspaces for reduced order modeling. The resulting stochastic model enables improved characterization of model error in computational mechanics compared with existing approaches. The method offers several advantages: (1) it is assumption-free and leverages the empirical data distribution; (2) it enforces linear constraints (such as boundary conditions) by construction; (3) it requires only one hyperparameter, significantly simplifying the training process; and (4) its algorithm is straightforward to implement. We evaluate the method's performance against existing approaches using numerical examples in computational mechanics and structural dynamics.
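As a rough illustration of the idea stated in this abstract (resampling a snapshot matrix to induce a distribution over principal subspaces), here is a minimal sketch. It is not the authors' algorithm: the function name `bootstrap_subspaces`, the choice to resample snapshot columns with replacement, and the use of a truncated SVD are all assumptions made for illustration, and the abstract's single hyperparameter is not modeled.

```python
import numpy as np


def bootstrap_subspaces(snapshots, rank, n_draws, seed=0):
    """Draw random principal subspaces by bootstrapping snapshot columns.

    Hypothetical helper: `snapshots` is an (n_state, n_snapshots) array whose
    columns are state vectors. Each draw resamples the columns with
    replacement and keeps the leading `rank` left singular vectors.
    """
    rng = np.random.default_rng(seed)
    n_snapshots = snapshots.shape[1]
    bases = []
    for _ in range(n_draws):
        # Resample snapshot indices from the empirical distribution.
        idx = rng.integers(0, n_snapshots, size=n_snapshots)
        resampled = snapshots[:, idx]
        # Leading left singular vectors span the principal subspace.
        u, _, _ = np.linalg.svd(resampled, full_matrices=False)
        bases.append(u[:, :rank])
    return bases


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((6, 20))
    X[0, :] = 0.0  # a homogeneous constraint, e.g. a fixed boundary value
    for U in bootstrap_subspaces(X, rank=2, n_draws=3):
        print(U.shape, float(np.abs(U[0]).max()))
```

In this sketch every basis vector is a linear combination of snapshots, so a homogeneous linear constraint satisfied by all snapshots (here, a zero first entry) carries over to each sampled basis, provided `rank` does not exceed the rank of the resampled matrix.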
Submitted 17 December, 2025;
originally announced December 2025.
-
PMMD: A pose-guided multi-view multi-modal diffusion for person generation
Authors:
Ziyu Shang,
Haoran Liu,
Rongchao Zhang,
Zhiqian Wei,
Tongtong Feng
Abstract:
Generating consistent human images with controllable pose and appearance is essential for applications in virtual try on, image editing, and digital human creation. Current methods often suffer from occlusions, garment style drift, and pose misalignment. We propose Pose-guided Multi-view Multimodal Diffusion (PMMD), a diffusion framework that synthesizes photorealistic person images conditioned on multi-view references, pose maps, and text prompts. A multimodal encoder jointly models visual views, pose features, and semantic descriptions, which reduces cross modal discrepancy and improves identity fidelity. We further design a ResCVA module to enhance local detail while preserving global structure, and a cross modal fusion module that integrates image semantics with text throughout the denoising pipeline. Experiments on the DeepFashion MultiModal dataset show that PMMD outperforms representative baselines in consistency, detail preservation, and controllability. Project page and code are available at https://github.com/ZANMANGLOOPYE/PMMD.
Submitted 16 December, 2025;
originally announced December 2025.
-
Tracking spatial temporal details in ultrasound long video via wavelet analysis and memory bank
Authors:
Chenxiao Zhang,
Runshi Zhang,
Junchen Wang
Abstract:
Medical ultrasound videos are widely used for medical inspections, disease diagnosis and surgical planning. High-fidelity lesion area and target organ segmentation constitutes a key component of the computer-assisted surgery workflow. The low contrast levels and noisy backgrounds of ultrasound videos cause missegmentation of organ boundaries, which may lead to small object losses and increased boundary segmentation errors. Object tracking in long videos also remains a significant research challenge. To overcome these challenges, we propose a memory bank-based wavelet filtering and fusion network, which adopts an encoder-decoder structure to effectively extract fine-grained detailed spatial features and integrate high-frequency (HF) information. Specifically, memory-based wavelet convolution is presented to simultaneously capture category and detailed information and utilize adjacent information in the encoder. Cascaded wavelet compression is used to fuse multiscale frequency-domain features and expand the receptive field within each convolutional layer. A long short-term memory bank using cross-attention and memory compression mechanisms is designed to track objects in long videos. To fully utilize the boundary-sensitive HF details of feature maps, an HF-aware feature fusion module is designed via adaptive wavelet filters in the decoder. In extensive benchmark tests conducted on four ultrasound video datasets (two thyroid nodule datasets, one thyroid gland dataset, and one heart dataset) and compared with state-of-the-art methods, our method demonstrates marked improvements in segmentation metrics. In particular, our method can more accurately segment small thyroid nodules, demonstrating its effectiveness for cases involving small ultrasound objects in long videos. The code is available at https://github.com/XiAooZ/MWNet.
Submitted 16 December, 2025;
originally announced December 2025.
-
Agentic AI for Integrated Sensing and Communication: Analysis, Framework, and Case Study
Authors:
Wenwen Xie,
Geng Sun,
Ruichen Zhang,
Xuejie Liu,
Yinqiu Liu,
Jiacheng Wang,
Dusit Niyato,
Ping Zhang
Abstract:
Integrated sensing and communication (ISAC) has emerged as a key development direction in the sixth-generation (6G) era, which provides essential support for the collaborative sensing and communication of future intelligent networks. However, as wireless environments become increasingly dynamic and complex, ISAC systems require more intelligent processing and more autonomous operation to maintain efficiency and adaptability. Meanwhile, agentic artificial intelligence (AI) offers a feasible solution to address these challenges by enabling continuous perception-reasoning-action loops in dynamic environments to support intelligent, autonomous, and efficient operation for ISAC systems. As such, we delve into the application value and prospects of agentic AI in ISAC systems in this work. Firstly, we provide a comprehensive review of agentic AI and ISAC systems to demonstrate their key characteristics. Secondly, we show several common optimization approaches for ISAC systems and highlight the significant advantages of generative artificial intelligence (GenAI)-based agentic AI. Thirdly, we propose a novel agentic ISAC framework and present a case study to verify its superiority in optimizing ISAC performance. Finally, we clarify future research directions for agentic AI-based ISAC systems.
Submitted 16 December, 2025;
originally announced December 2025.
-
DreamPRM-Code: Function-as-Step Process Reward Model with Label Correction for LLM Coding
Authors:
Ruiyi Zhang,
Peijia Qin,
Qi Cao,
Pengtao Xie
Abstract:
Process Reward Models (PRMs) have become essential for improving Large Language Models (LLMs) via test-time scaling, yet their effectiveness in coding remains limited due to the lack of meaningful step decompositions in code and the noise of Monte-Carlo-generated partial labels. We propose DreamPRM-Code, a coding-focused PRM that treats functions as reasoning steps using a Chain-of-Function prompting strategy to induce modular code generation, enabling PRM training and application analogous to mathematical reasoning tasks. To address label noise, DreamPRM-Code introduces a meta-learning-based correction mechanism that leverages clean final-solution unit-test labels and performs bi-level optimization to refine intermediate labels. When applied to test-time scaling, DreamPRM-Code achieves state-of-the-art performance on LiveCodeBench with an 80.9 pass@1 rate, surpassing OpenAI o4-mini.
Submitted 16 December, 2025;
originally announced December 2025.
-
EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving
Authors:
Shaoting Feng,
Yuhan Liu,
Hanchen Li,
Xiaokun Chen,
Samuel Shen,
Kuntai Du,
Zhuohan Gu,
Rui Zhang,
Yuyang Huang,
Yihua Cheng,
Jiayi Yao,
Qizheng Zhang,
Ganesh Ananthanarayanan,
Junchen Jiang
Abstract:
Reusing KV cache is essential for high efficiency of Large Language Model (LLM) inference systems. With more LLM users, the KV cache footprint can easily exceed GPU memory capacity, so prior work has proposed to either evict KV cache to lower-tier storage devices, or compress KV cache so that more KV cache can be fit in the fast memory. However, prior work misses an important opportunity: jointly optimizing the eviction and compression decisions across all KV caches to minimize average generation latency without hurting quality.
We propose EVICPRESS, a KV-cache management system that applies lossy compression and adaptive eviction to KV cache across multiple storage tiers. Specifically, for each KV cache of a context, EVICPRESS considers the effect of compression and eviction of the KV cache on the average generation quality and delay across all contexts as a whole. To achieve this, EVICPRESS proposes a unified utility function that quantifies the effect of quality and delay of the lossy compression or eviction. To this end, EVICPRESS's profiling module periodically updates the utility function scores on all possible eviction-compression configurations for all contexts and places KV caches using a fast heuristic to rearrange KV caches on all storage tiers, with the goal of maximizing the utility function scores on each storage tier. Compared to the baselines that evict KV cache or compress KV cache, EVICPRESS achieves higher KV-cache hit rates on fast devices, i.e., lower delay, while preserving high generation quality by applying conservative compression to contexts that are sensitive to compression errors. Evaluation on 12 datasets and 5 models demonstrates that EVICPRESS achieves up to 2.19x faster time-to-first-token (TTFT) at equivalent generation quality.
Submitted 16 December, 2025;
originally announced December 2025.
-
Puzzle Curriculum GRPO for Vision-Centric Reasoning
Authors:
Ahmadreza Jeddi,
Hakki Can Karaimer,
Hue Nguyen,
Zhongling Wang,
Ke Zhao,
Javad Rajabi,
Ran Zhang,
Raghav Goyal,
Babak Taati,
Radek Grzeszczuk
Abstract:
Recent reinforcement learning (RL) approaches like outcome-supervised GRPO have advanced chain-of-thought reasoning in Vision Language Models (VLMs), yet key issues linger: (i) reliance on costly and noisy hand-curated annotations or external verifiers; (ii) flat and sparse reward schemes in GRPO; and (iii) logical inconsistency between a chain's reasoning and its final answer. We present Puzzle Curriculum GRPO (PC-GRPO), a supervision-free recipe for RL with Verifiable Rewards (RLVR) that strengthens visual reasoning in VLMs without annotations or external verifiers. PC-GRPO replaces labels with three self-supervised puzzle environments: PatchFit, Rotation (with binary rewards) and Jigsaw (with graded partial credit mitigating reward sparsity). To counter flat rewards and vanishing group-relative advantages, we introduce a difficulty-aware curriculum that dynamically weights samples and peaks at medium difficulty. We further monitor Reasoning-Answer Consistency (RAC) during post-training: mirroring reports for vanilla GRPO in LLMs, RAC typically rises early then degrades; our curriculum delays this decline, and consistency-enforcing reward schemes further boost RAC. RAC correlates with downstream accuracy. Across diverse benchmarks and on Qwen-7B and Qwen-3B backbones, PC-GRPO improves reasoning quality, training stability, and end-task accuracy, offering a practical path to scalable, verifiable, and interpretable RL post-training for VLMs.
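The "vanishing group-relative advantages" mentioned in this abstract can be seen in a small numeric sketch. It assumes the commonly used GRPO normalization, in which each sampled response's reward is centered and scaled by the mean and standard deviation of its group; the function name and the `eps` stabilizer are illustrative choices, not details taken from the paper.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one sampled group (assumed GRPO-style)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


if __name__ == "__main__":
    # Binary rewards where every sample in the group succeeds: no learning signal.
    print(group_relative_advantages([1, 1, 1, 1]))
    # Graded partial credit keeps the rewards distinct, so advantages are nonzero.
    print(group_relative_advantages([0.0, 0.25, 0.5, 1.0]))
```

When every response in a group receives the same reward, as happens on prompts that are too easy or too hard under binary scoring, all advantages are zero. Under this reading, graded rewards and a curriculum that favors medium difficulty both make such flat groups less frequent.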
Submitted 16 December, 2025;
originally announced December 2025.
-
A data-physics hybrid generative model for patient-specific post-stroke motor rehabilitation using wearable sensor data
Authors:
Yanning Dai,
Chenyu Tang,
Ruizhi Zhang,
Wenyu Yang,
Yilan Zhang,
Yuhui Wang,
Junliang Chen,
Xuhang Chen,
Ruimou Xie,
Yangyue Cao,
Qiaoying Li,
Jin Cao,
Tao Li,
Hubin Zhao,
Yu Pan,
Arokia Nathan,
Xin Gao,
Peter Smielewski,
Shuo Gao
Abstract:
Dynamic prediction of locomotor capacity after stroke is crucial for tailoring rehabilitation, yet current assessments provide only static impairment scores and do not indicate whether patients can safely perform specific tasks such as slope walking or stair climbing. Here, we develop a data-physics hybrid generative framework that reconstructs an individual stroke survivor's neuromuscular control from a single 20 m level-ground walking trial and predicts task-conditioned locomotion across rehabilitation scenarios. The system combines wearable-sensor kinematics, a proportional-derivative physics controller, a population Healthy Motion Atlas, and goal-conditioned deep reinforcement learning with behaviour cloning and generative adversarial imitation learning to generate physically plausible, patient-specific gait simulations for slopes and stairs. In 11 stroke survivors, the personalized controllers preserved idiosyncratic gait patterns while improving joint-angle and endpoint fidelity by 4.73% and 12.10%, respectively, and reducing training time to 25.56% relative to a physics-only baseline. In a multicentre pilot involving 21 inpatients, clinicians who used our locomotion predictions to guide task selection and difficulty obtained larger gains in Fugl-Meyer lower-extremity scores over 28 days of standard rehabilitation than control clinicians (mean change 6.0 versus 3.7 points). These findings indicate that our generative, task-predictive framework can augment clinical decision-making in post-stroke gait rehabilitation and provide a template for dynamically personalized motor recovery strategies.
Submitted 16 December, 2025;
originally announced December 2025.
-
Legitimizing, Developing, and Sustaining Feminist HCI in East Asia: Challenges and Opportunities
Authors:
Runhua Zhang,
Ruyuan Wan,
Jiaqi Li,
Daye Kang,
Yigang Qin,
Yijia Wang,
Ziqi Pan,
Tiffany Knearem,
Huamin Qu,
Xiaojuan Ma
Abstract:
Feminist HCI has been rapidly developing in East Asian contexts in recent years. The region's unique cultural and political backgrounds have contributed valuable, situated knowledge, revealing topics such as localized digital feminism practices, or women's complex navigation among social expectations. However, the very factors that ground these perspectives also create significant survival challenges for researchers in East Asia. These include a scarcity of dedicated funding, the stigma of being perceived as less valuable than productivity-oriented technologies, and the lack of senior researchers and established, resilient communities. Grounded in these challenges and our prior collective practices, we propose this meet-up with two focused goals: (1) to provide a legitimized channel for Feminist HCI researchers to connect and build community, and (2) to facilitate an action-oriented dialogue on how to legitimize, develop, and sustain Feminist HCI in the East Asian context. The website for this meet-up is: https://feminist-hci.github.io/
Submitted 15 December, 2025;
originally announced December 2025.
-
Decoding Human and AI Persuasion in National College Debate: Analyzing Prepared Arguments Through Aristotle's Rhetorical Principles
Authors:
Mengqian Wu,
Jiayi Zhang,
Raymond Z. Zhang
Abstract:
Debate has been widely adopted as a strategy to enhance critical thinking skills in English Language Arts (ELA). One important skill in debate is forming effective argumentation, which requires debaters to select supportive evidence from literature and construct compelling claims. However, the training of this skill largely depends on human coaching, which is labor-intensive and difficult to scale. To better support students in preparing for debates, this study explores the potential of leveraging artificial intelligence to generate effective arguments. Specifically, we prompted GPT-4 to create an evidence card and compared it to those produced by human debaters. The evidence cards outline the arguments students will present and how those arguments will be delivered, including components such as literature-based evidence quotations, summaries of core ideas, verbatim reading scripts, and tags (i.e., titles of the arguments). We compared the quality of the arguments in the evidence cards created by GPT and student debaters using Aristotle's rhetorical principles: ethos (credibility), pathos (emotional appeal), and logos (logical reasoning). Through a systematic qualitative and quantitative analysis, grounded in the rhetorical principles, we identify the strengths and limitations of human and GPT in debate reasoning, outlining areas where AI's focus and justifications align with or diverge from human reasoning. Our findings contribute to the evolving role of AI-assisted learning interventions, offering insights into how student debaters can develop strategies that enhance their argumentation and reasoning skills.
Submitted 14 December, 2025;
originally announced December 2025.
-
Scalable Quantum Error Mitigation with Neighbor-Informed Learning
Authors:
Zhenyu Chen,
Bin Cheng,
Minbo Gao,
Xiaodie Lin,
Ruiqi Zhang,
Zhaohui Wei,
Zhengfeng Ji
Abstract:
Noise in quantum hardware is the primary obstacle to realizing the transformative potential of quantum computing. Quantum error mitigation (QEM) offers a promising pathway to enhance computational accuracy on near-term devices, yet existing methods face a difficult trade-off between performance, resource overhead, and theoretical guarantees. In this work, we introduce neighbor-informed learning (NIL), a versatile and scalable QEM framework that unifies and strengthens existing methods such as zero-noise extrapolation (ZNE) and probabilistic error cancellation (PEC), while offering improved flexibility, accuracy, efficiency, and robustness.
NIL learns to predict the ideal output of a target quantum circuit from the noisy outputs of its structurally related "neighbor" circuits. A key innovation is our 2-design training method, which generates training data for our machine learning model. In contrast to conventional learning-based QEM protocols that create training circuits by replacing non-Clifford gates with uniformly random Clifford gates, our approach achieves higher accuracy and efficiency, as demonstrated by both theoretical analysis and numerical simulation. Furthermore, we prove that the required size of the training set scales only logarithmically with the total number of neighbor circuits, enabling NIL to be applied to problems involving large-scale quantum circuits. Our work establishes a theoretically grounded and practically efficient framework for QEM, paving a viable path toward achieving quantum advantage on noisy hardware.
Submitted 14 December, 2025;
originally announced December 2025.
-
More Than the Final Answer: Improving Visual Extraction and Logical Consistency in Vision-Language Models
Authors:
Hoang Anh Just,
Yifei Fan,
Handong Zhao,
Jiuxiang Gu,
Ruiyi Zhang,
Simon Jenni,
Kushal Kafle,
Ruoxi Jia,
Jing Shi
Abstract:
Reinforcement learning from verifiable rewards (RLVR) has recently been extended from text-only LLMs to vision-language models (VLMs) to elicit long-chain multimodal reasoning. However, RLVR-trained VLMs still exhibit two persistent failure modes: inaccurate visual extraction (missing or hallucinating details) and logically inconsistent chains-of-thought, largely because verifiable signals supervise only the final answer. We propose PeRL-VL (Perception and Reasoning Learning for Vision-Language Models), a decoupled framework that separately improves visual perception and textual reasoning on top of RLVR. For perception, PeRL-VL introduces a VLM-based description reward that scores the model's self-generated image descriptions for faithfulness and sufficiency. For reasoning, PeRL-VL adds a text-only Reasoning SFT stage on logic-rich chain-of-thought data, enhancing coherence and logical consistency independently of vision. Across diverse multimodal benchmarks, PeRL-VL improves average Pass@1 accuracy from 63.3% (base Qwen2.5-VL-7B) to 68.8%, outperforming standard RLVR, text-only reasoning SFT, and naive multimodal distillation from GPT-4o.
Submitted 13 December, 2025;
originally announced December 2025.
-
Rethinking Label Consistency of In-Context Learning: An Implicit Transductive Label Propagation Perspective
Authors:
Haoyang Chen,
Richong Zhang,
Junfan Chen
Abstract:
Large language models (LLMs) perform in-context learning (ICL) with minimal supervised examples, which benefits various natural language processing (NLP) tasks. One critical research focus is the selection of prompt demonstrations. Current approaches typically employ retrieval models to select the top-K most semantically similar examples as demonstrations. However, we argue that existing methods are limited because label consistency is not guaranteed during demonstration selection. Our insight derives from the Bayesian view of ICL and from rethinking ICL from the transductive label propagation perspective. We treat ICL as a transductive learning method, incorporate latent concepts from the Bayesian view, and deduce that similar demonstrations guide the concepts of the query, with consistent labels serving as estimates. Based on this understanding, we establish a label propagation framework to link label consistency with propagation error bounds. To model label consistency, we propose a data synthesis method that leverages both semantic and label information, and use TopK sampling with Synthetic Data (TopK-SD) to acquire demonstrations with consistent labels. TopK-SD outperforms original TopK sampling on multiple benchmarks. Our work provides a new perspective for understanding the working mechanisms within ICL.
Submitted 12 December, 2025;
originally announced December 2025.
-
BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models
Authors:
Zhengyang Wang,
Ziyue Liu,
Ruijie Zhang,
Avinash Maurya,
Paul Hovland,
Bogdan Nicolae,
Franck Cappello,
Zheng Zhang
Abstract:
The scale of transformer model pre-training is constrained by the increasing computation and communication cost. Low-rank bottleneck architectures offer a promising solution to significantly reduce the training time and memory footprint with minimal impact on accuracy. Despite algorithmic efficiency, bottleneck architectures scale poorly under standard tensor parallelism. Simply applying 3D parallelism designed for full-rank methods leads to excessive communication and poor GPU utilization. To address this limitation, we propose BOOST, an efficient training framework tailored for large-scale low-rank bottleneck architectures. BOOST introduces a novel Bottleneck-aware Tensor Parallelism, and combines optimizations such as online-RMSNorm, linear layer grouping, and low-rank activation checkpointing to achieve end-to-end training speedup. Evaluations on different low-rank bottleneck architectures demonstrate that BOOST achieves 1.46-1.91× speedup over full-rank model baselines and 1.87-2.27× speedup over low-rank models with naively integrated 3D parallelism, with improved GPU utilization and reduced communication overhead.
Submitted 12 December, 2025;
originally announced December 2025.
-
Stable spectral neural operator for learning stiff PDE systems from limited data
Authors:
Rui Zhang,
Han Wan,
Yang Liu,
Hao Sun
Abstract:
Accurate modeling of spatiotemporal dynamics is crucial to understanding complex phenomena across science and engineering. However, this task faces a fundamental challenge when the governing equations are unknown and observational data are sparse. System stiffness, the coupling of multiple time-scales, further exacerbates this problem and hinders long-term prediction. Existing methods fall short: purely data-driven methods demand massive datasets, whereas physics-aware approaches are constrained by their reliance on known equations and fine-grained time steps. To overcome these limitations, we introduce an equation-free learning framework, namely, the Stable Spectral Neural Operator (SSNO), for modeling stiff partial differential equation (PDE) systems based on limited data. Instead of encoding specific equation terms, SSNO embeds spectrally inspired structures in its architecture, yielding strong inductive biases for learning the underlying physics. It automatically learns local and global spatial interactions in the frequency domain, while handling system stiffness with a robust integrating factor time-stepping scheme. Demonstrated across multiple 2D and 3D benchmarks in Cartesian and spherical geometries, SSNO achieves prediction errors one to two orders of magnitude lower than leading models. Crucially, it shows remarkable data efficiency, requiring only very few (2-5) training trajectories for robust generalization to out-of-distribution conditions. This work offers a robust and generalizable approach to learning stiff spatiotemporal dynamics from limited data without explicit a priori knowledge of PDE terms.
Submitted 12 December, 2025;
originally announced December 2025.
-
Group Diffusion: Enhancing Image Generation by Unlocking Cross-Sample Collaboration
Authors:
Sicheng Mo,
Thao Nguyen,
Richard Zhang,
Nick Kolkin,
Siddharth Srinivasan Iyer,
Eli Shechtman,
Krishna Kumar Singh,
Yong Jae Lee,
Bolei Zhou,
Yuheng Li
Abstract:
In this work, we explore an untapped signal in diffusion model inference. While all previous methods generate images independently at inference, we instead ask if samples can be generated collaboratively. We propose Group Diffusion, unlocking the attention mechanism to be shared across images, rather than limited to just the patches within an image. This enables images to be jointly denoised at inference time, learning both intra and inter-image correspondence. We observe a clear scaling effect - larger group sizes yield stronger cross-sample attention and better generation quality. Furthermore, we introduce a qualitative measure to capture this behavior and show that its strength closely correlates with FID. Built on standard diffusion transformers, our GroupDiff achieves up to 32.2% FID improvement on ImageNet-256x256. Our work reveals cross-sample inference as an effective, previously unexplored mechanism for generative modeling.
Submitted 11 December, 2025;
originally announced December 2025.
-
Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
Authors:
Yiwen Tang,
Zoey Guo,
Kaixin Zhu,
Ray Zhang,
Qizhi Chen,
Dongzhi Jiang,
Junli Liu,
Bohan Zeng,
Haoming Song,
Delin Qu,
Tianyi Bai,
Dan Xu,
Wentao Zhang,
Bin Zhao
Abstract:
Reinforcement learning (RL), earlier proven to be effective in large language and multi-modal models, has recently been successfully extended to enhance 2D image generation. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation highly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide robust signals for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, expert from coarse shape to texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.
Submitted 11 December, 2025;
originally announced December 2025.
-
What matters for Representation Alignment: Global Information or Spatial Structure?
Authors:
Jaskirat Singh,
Xingjian Leng,
Zongze Wu,
Liang Zheng,
Richard Zhang,
Eli Shechtman,
Saining Xie
Abstract:
Representation alignment (REPA) guides generative training by distilling representations from a strong, pretrained vision encoder to intermediate diffusion features. We investigate a fundamental question: what aspect of the target representation matters for generation, its global semantic information (e.g., measured by ImageNet-1K accuracy) or its spatial structure (i.e., pairwise cosine similarity between patch tokens)? Prevalent wisdom holds that stronger global semantic performance leads to better generation as a target representation. To study this, we first perform a large-scale empirical analysis across 27 different vision encoders and different model scales. The results are surprising; spatial structure, rather than global performance, drives the generation performance of a target representation. To further study this, we introduce two straightforward modifications, which specifically accentuate the transfer of spatial information. We replace the standard MLP projection layer in REPA with a simple convolution layer and introduce a spatial normalization layer for the external representation. Surprisingly, our simple method (implemented in <4 lines of code), termed iREPA, consistently improves convergence speed of REPA, across a diverse set of vision encoders, model sizes, and training variants (such as REPA, REPA-E, Meanflow, JiT, etc.). Our work motivates revisiting the fundamental working mechanism of representational alignment and how it can be leveraged for improved training of generative models. The code and project page are available at https://end2end-diffusion.github.io/irepa
Submitted 11 December, 2025;
originally announced December 2025.
-
SimWorld-Robotics: Synthesizing Photorealistic and Dynamic Urban Environments for Multimodal Robot Navigation and Collaboration
Authors:
Yan Zhuang,
Jiawei Ren,
Xiaokang Ye,
Jianzhi Shen,
Ruixuan Zhang,
Tianai Yue,
Muhammad Faayez,
Xuhong He,
Ziqiao Ma,
Lianhui Qin,
Zhiting Hu,
Tianmin Shu
Abstract:
Recent advances in foundation models have shown promising results in developing generalist robotics that can perform diverse tasks in open-ended scenarios given multimodal inputs. However, current work has been mainly focused on indoor, household scenarios. In this work, we present SimWorld-Robotics (SWR), a simulation platform for embodied AI in large-scale, photorealistic urban environments. Built on Unreal Engine 5, SWR procedurally generates unlimited photorealistic urban scenes populated with dynamic elements such as pedestrians and traffic systems, surpassing prior urban simulations in realism, complexity, and scalability. It also supports multi-robot control and communication. With these key features, we build two challenging robot benchmarks: (1) a multimodal instruction-following task, where a robot must follow vision-language navigation instructions to reach a destination in the presence of pedestrians and traffic; and (2) a multi-agent search task, where two robots must communicate to cooperatively locate and meet each other. Unlike existing benchmarks, these two new benchmarks comprehensively evaluate a wide range of critical robot capacities in realistic scenarios, including (1) multimodal instruction grounding, (2) 3D spatial reasoning in large environments, (3) safe, long-range navigation with people and traffic, (4) multi-robot collaboration, and (5) grounded communication. Our experimental results demonstrate that state-of-the-art models, including vision-language models (VLMs), struggle with our tasks, lacking the robust perception, reasoning, and planning abilities necessary for urban environments.
Submitted 10 December, 2025;
originally announced December 2025.
-
Open Polymer Challenge: Post-Competition Report
Authors:
Gang Liu,
Sobin Alosious,
Subhamoy Mahajan,
Eric Inae,
Yihan Zhu,
Yuhan Liu,
Renzheng Zhang,
Jiaxin Xu,
Addison Howard,
Ying Li,
Tengfei Luo,
Meng Jiang
Abstract:
Machine learning (ML) offers a powerful path toward discovering sustainable polymer materials, but progress has been limited by the lack of large, high-quality, and openly accessible polymer datasets. The Open Polymer Challenge (OPC) addresses this gap by releasing the first community-developed benchmark for polymer informatics, featuring a dataset with 10K polymers and 5 properties: thermal conductivity, radius of gyration, density, fractional free volume, and glass transition temperature. The challenge centers on multi-task polymer property prediction, a core step in virtual screening pipelines for materials discovery. Participants developed models under realistic constraints that include small data, label imbalance, and heterogeneous simulation sources, using techniques such as feature-based augmentation, transfer learning, self-supervised pretraining, and targeted ensemble strategies. The competition also revealed important lessons about data preparation, distribution shifts, and cross-group simulation consistency, informing best practices for future large-scale polymer datasets. The resulting models, analysis, and released data create a new foundation for molecular AI in polymer science and are expected to accelerate the development of sustainable and energy-efficient materials. Along with the competition, we release the test dataset at https://www.kaggle.com/datasets/alexliu99/neurips-open-polymer-prediction-2025-test-data. We also release the data generation pipeline at https://github.com/sobinalosious/ADEPT, which simulates more than 25 properties, including thermal conductivity, radius of gyration, and density.
Submitted 9 December, 2025;
originally announced December 2025.
-
A Multi-Robot Platform for Robotic Triage Combining Onboard Sensing and Foundation Models
Authors:
Jason Hughes,
Marcel Hussing,
Edward Zhang,
Shenbagaraj Kannapiran,
Joshua Caswell,
Kenneth Chaney,
Ruichen Deng,
Michaela Feehery,
Agelos Kratimenos,
Yi Fan Li,
Britny Major,
Ethan Sanchez,
Sumukh Shrote,
Youkang Wang,
Jeremy Wang,
Daudi Zein,
Luying Zhang,
Ruijun Zhang,
Alex Zhou,
Tenzi Zhouga,
Jeremy Cannon,
Zaffir Qasim,
Jay Yelon,
Fernando Cladera,
Kostas Daniilidis
, et al. (2 additional authors not shown)
Abstract:
This report presents a heterogeneous robotic system designed for remote primary triage in mass-casualty incidents (MCIs). The system employs a coordinated air-ground team of unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs) to locate victims, assess their injuries, and prioritize medical assistance without risking the lives of first responders. The UAVs identify and provide overhead views of casualties, while UGVs equipped with specialized sensors measure vital signs and detect and localize physical injuries. Unlike previous work that focused on exploration or limited medical evaluation, this system addresses the complete triage process: victim localization, vital sign measurement, injury severity classification, mental status assessment, and data consolidation for first responders. Developed as part of the DARPA Triage Challenge, this approach demonstrates how multi-robot systems can augment human capabilities in disaster response scenarios to maximize lives saved.
Submitted 9 December, 2025;
originally announced December 2025.
-
Multi-Agent Intelligence for Multidisciplinary Decision-Making in Gastrointestinal Oncology
Authors:
Rongzhao Zhang,
Junqiao Wang,
Shuyun Yang,
Mouxiao Bian,
Chao Ding,
Yuwei Bai,
Chihao Zhang,
Yuguang Shen,
Lei Wang,
Lei Zheng,
Qiujuan Yan,
Yun Zhong,
Meiling Liu,
Jiwei Yu,
Zheng Wang,
Jie Xu,
Meng Luo
Abstract:
Multimodal clinical reasoning in the field of gastrointestinal (GI) oncology necessitates the integrated interpretation of endoscopic imagery, radiological data, and biochemical markers. Despite the evident potential exhibited by Multimodal Large Language Models (MLLMs), they frequently encounter challenges such as context dilution and hallucination when confronted with intricate, heterogeneous medical histories. In order to address these limitations, a hierarchical Multi-Agent Framework is proposed, which emulates the collaborative workflow of a human Multidisciplinary Team (MDT). The system attained a composite expert evaluation score of 4.60/5.00, thereby demonstrating a substantial improvement over the monolithic baseline. It is noteworthy that the agent-based architecture yielded the most substantial enhancements in reasoning logic and medical accuracy. The findings indicate that mimetic, agent-based collaboration provides a scalable, interpretable, and clinically robust paradigm for automated decision support in oncology.
Submitted 9 December, 2025;
originally announced December 2025.
-
Modular Neural Image Signal Processing
Authors:
Mahmoud Afifi,
Zhongling Wang,
Ran Zhang,
Michael S. Brown
Abstract:
This paper presents a modular neural image signal processing (ISP) framework that processes raw inputs and renders high-quality display-referred images. Unlike prior neural ISP designs, our method introduces a high degree of modularity, providing full control over multiple intermediate stages of the rendering process. This modular design not only achieves high rendering accuracy but also improves scalability, debuggability, generalization to unseen cameras, and flexibility to match different user-preference styles. To demonstrate the advantages of this design, we built a user-interactive photo-editing tool that leverages our neural ISP to support diverse editing operations and picture styles. The tool is carefully engineered to take advantage of the high-quality rendering of our neural ISP and to enable unlimited post-editable re-rendering. Our method is a fully learning-based framework with variants of different capacities, all of moderate size (ranging from ~0.5 M to ~3.9 M parameters for the entire pipeline), and consistently delivers competitive qualitative and quantitative results across multiple test sets. Watch the supplemental video at: https://youtu.be/ByhQjQSjxVM
Submitted 9 December, 2025;
originally announced December 2025.
-
InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs
Authors:
Bin Li,
Ruichi Zhang,
Han Liang,
Jingyan Zhang,
Juze Zhang,
Xin Chen,
Lan Xu,
Jingyi Yu,
Jingya Wang
Abstract:
Humanoid agents are expected to emulate the complex coordination inherent in human social behaviors. However, existing methods are largely confined to single-agent scenarios, overlooking the physically plausible interplay essential for multi-agent interactions. To bridge this gap, we propose InterAgent, the first end-to-end framework for text-driven physics-based multi-agent humanoid control. At its core, we introduce an autoregressive diffusion transformer equipped with multi-stream blocks, which decouples proprioception, exteroception, and action to mitigate cross-modal interference while enabling synergistic coordination. We further propose a novel interaction graph exteroception representation that explicitly captures fine-grained joint-to-joint spatial dependencies to facilitate network learning. Additionally, within it we devise a sparse edge-based attention mechanism that dynamically prunes redundant connections and emphasizes critical inter-agent spatial relations, thereby enhancing the robustness of interaction modeling. Extensive experiments demonstrate that InterAgent consistently outperforms multiple strong baselines, achieving state-of-the-art performance. It enables producing coherent, physically plausible, and semantically faithful multi-agent behaviors from only text prompts. Our code and data will be released to facilitate future research.
Submitted 12 December, 2025; v1 submitted 8 December, 2025;
originally announced December 2025.
-
Unsupervised Single-Channel Audio Separation with Diffusion Source Priors
Authors:
Runwu Shi,
Chang Li,
Jiang Wang,
Rui Zhang,
Nabeela Khan,
Benjamin Yen,
Takeshi Ashizawa,
Kazuhiro Nakadai
Abstract:
Single-channel audio separation aims to separate individual sources from a single-channel mixture. Most existing methods rely on supervised learning with synthetically generated paired data. However, obtaining high-quality paired data in real-world scenarios is often difficult. This data scarcity can degrade model performance under unseen conditions and limit generalization ability. To this end, in this work, we approach this problem from an unsupervised perspective, framing it as a probabilistic inverse problem. Our method requires only diffusion priors trained on individual sources. Separation is then achieved by iteratively guiding an initial state toward the solution through reconstruction guidance. Importantly, we introduce an advanced inverse problem solver specifically designed for separation, which mitigates gradient conflicts caused by interference between the diffusion prior and reconstruction guidance during inverse denoising. This design ensures high-quality and balanced separation performance across individual sources. Additionally, we find that initializing the denoising process with an augmented mixture instead of pure Gaussian noise provides an informative starting point that significantly improves the final performance. To further enhance audio prior modeling, we design a novel time-frequency attention-based network architecture that demonstrates strong audio modeling capability. Collectively, these improvements lead to significant performance gains, as validated across speech-sound event, sound event, and speech separation tasks.
Submitted 8 December, 2025;
originally announced December 2025.
-
MIND-V: Hierarchical Video Generation for Long-Horizon Robotic Manipulation with RL-based Physical Alignment
Authors:
Ruicheng Zhang,
Mingyang Zhang,
Jun Zhou,
Zhangrui Guo,
Xiaofan Liu,
Zunnan Xu,
Zhizhou Zhong,
Puxin Yan,
Haocheng Luo,
Xiu Li
Abstract:
Embodied imitation learning is constrained by the scarcity of diverse, long-horizon robotic manipulation data. Existing video generation models for this domain are limited to synthesizing short clips of simple actions and often rely on manually defined trajectories. To this end, we introduce MIND-V, a hierarchical framework designed to synthesize physically plausible and logically coherent videos of long-horizon robotic manipulation. Inspired by cognitive science, MIND-V bridges high-level reasoning with pixel-level synthesis through three core components: a Semantic Reasoning Hub (SRH) that leverages a pre-trained vision-language model for task planning; a Behavioral Semantic Bridge (BSB) that translates abstract instructions into domain-invariant representations; and a Motor Video Generator (MVG) for conditional video rendering. MIND-V employs Staged Visual Future Rollouts, a test-time optimization strategy to enhance long-horizon robustness. To align the generated videos with physical laws, we introduce a GRPO reinforcement learning post-training phase guided by a novel Physical Foresight Coherence (PFC) reward. PFC leverages the V-JEPA world model to enforce physical plausibility by aligning the predicted and actual dynamic evolutions in the feature space. MIND-V demonstrates state-of-the-art performance in long-horizon robotic manipulation video generation, establishing a scalable and controllable paradigm for embodied data synthesis.
Submitted 6 December, 2025;
originally announced December 2025.
-
One-Step Diffusion Samplers via Self-Distillation and Deterministic Flow
Authors:
Pascal Jutras-Dube,
Jiaru Zhang,
Ziran Wang,
Ruqi Zhang
Abstract:
Sampling from unnormalized target distributions is a fundamental yet challenging task in machine learning and statistics. Existing sampling algorithms typically require many iterative steps to produce high-quality samples, leading to high computational costs. We introduce one-step diffusion samplers which learn a step-conditioned ODE so that one large step reproduces the trajectory of many small ones via a state-space consistency loss. We further show that standard ELBO estimates in diffusion samplers degrade in the few-step regime because common discrete integrators yield mismatched forward/backward transition kernels. Motivated by this analysis, we derive a deterministic-flow (DF) importance weight for ELBO estimation without a backward kernel. To calibrate DF, we introduce a volume-consistency regularization that aligns the accumulated volume change along the flow across step resolutions. Our proposed sampler therefore achieves both sampling and stable evidence estimate in only one or few steps. Across challenging synthetic and Bayesian benchmarks, it achieves competitive sample quality with orders-of-magnitude fewer network evaluations while maintaining robust ELBO estimates.
Submitted 4 December, 2025;
originally announced December 2025.
-
RAG-IGBench: Innovative Evaluation for RAG-based Interleaved Generation in Open-domain Question Answering
Authors:
Rongyang Zhang,
Yuqing Huang,
Chengqiang Lu,
Qimeng Wang,
Yan Gao,
Yi Wu,
Yao Hu,
Yin Xu,
Wei Wang,
Hao Wang,
Enhong Chen
Abstract:
In real-world scenarios, providing user queries with visually enhanced responses can considerably benefit understanding and memory, underscoring the great value of interleaved image-text generation. Despite recent progress, like the visual autoregressive model that unifies text and image processing in a single transformer architecture, generating high-quality interleaved content remains challenging. Moreover, evaluations of these interleaved sequences largely remain underexplored, with existing benchmarks often limited by unimodal metrics that inadequately assess the intricacies of combined image-text outputs. To address these issues, we present RAG-IGBench, a thorough benchmark designed specifically to evaluate the task of Interleaved Generation based on Retrieval-Augmented Generation (RAG-IG) in open-domain question answering. RAG-IG integrates multimodal large language models (MLLMs) with retrieval mechanisms, enabling the models to access external image-text information for generating coherent multimodal content. Distinct from previous datasets, RAG-IGBench draws on the latest publicly available content from social platforms and introduces innovative evaluation metrics that measure the quality of text and images, as well as their consistency. Through extensive experiments with state-of-the-art MLLMs (both open-source and proprietary) on RAG-IGBench, we provide an in-depth analysis examining the capabilities and limitations of these models. Additionally, we validate our evaluation metrics by demonstrating their high correlation with human assessments. Models fine-tuned on RAG-IGBench's training set exhibit improved performance across multiple benchmarks, confirming both the quality and practical utility of our dataset. Our benchmark is available at https://github.com/USTC-StarTeam/RAG-IGBench.
Submitted 10 October, 2025;
originally announced December 2025.
-
DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation
Authors:
Dongzhi Jiang,
Renrui Zhang,
Haodong Li,
Zhuofan Zong,
Ziyu Guo,
Jun He,
Claire Guo,
Junyan Ye,
Rongyao Fang,
Weijia Li,
Rui Liu,
Hongsheng Li
Abstract:
Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual contents in CoT for better planning and verification. Our method first generates a low-resolution draft image as a preview, providing more concrete and structural visual planning and guidance. Then, we employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and input prompt, and perform refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty in generating rare attribute combinations. To support training, we curate DraCo-240K, aiming to enhance three atomic capabilities spanning general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves a tremendous increase on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other generation methods empowered by CoT.
Submitted 4 December, 2025;
originally announced December 2025.
-
NAWOA-XGBoost: A Novel Model for Early Prediction of Academic Potential in Computer Science Students
Authors:
Junhao Wei,
Yanzhao Gu,
Ran Zhang,
Mingjing Huang,
Jinhong Song,
Yanxiao Li,
Wenxuan Zhu,
Yapeng Wang,
Zikun Li,
Zhiwen Wang,
Xu Yang,
Ngai Cheong
Abstract:
The Whale Optimization Algorithm (WOA) suffers from limited global search ability, slow convergence, and a tendency to fall into local optima, restricting its effectiveness in hyperparameter optimization for machine learning models. To address these issues, this study proposes a Nonlinear Adaptive Whale Optimization Algorithm (NAWOA), which integrates strategies such as Good Nodes Set initialization, Leader-Followers Foraging, Dynamic Encircling Prey, Triangular Hunting, and a nonlinear convergence factor to enhance exploration, exploitation, and convergence stability. Experiments on 23 benchmark functions demonstrate NAWOA's superior optimization capability and robustness. Based on this optimizer, an NAWOA-XGBoost model was developed to predict academic potential using data from 495 Computer Science undergraduates at Macao Polytechnic University (2009-2019). Results show that NAWOA-XGBoost outperforms traditional XGBoost and WOA-XGBoost across key metrics, including Accuracy (0.8148), Macro F1 (0.8101), AUC (0.8932), and G-Mean (0.8172), demonstrating strong adaptability on multi-class imbalanced datasets.
Submitted 5 December, 2025; v1 submitted 4 December, 2025;
originally announced December 2025.
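For readers unfamiliar with the G-Mean figure quoted in the abstract above, the following is a minimal sketch of one common multi-class definition: the geometric mean of per-class recall. Whether the paper uses exactly this definition is an assumption, and the function names and toy labels are hypothetical.

```python
from math import prod

def per_class_recall(y_true, y_pred):
    """Recall for each class that appears in y_true, in sorted class order."""
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return recalls

def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls (assumed definition, see lead-in)."""
    recalls = per_class_recall(y_true, y_pred)
    return prod(recalls) ** (1 / len(recalls))

# Toy, made-up labels for an imbalanced three-class problem.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
print(per_class_recall(y_true, y_pred))
print(g_mean(y_true, y_pred))
```

Because the recalls are multiplied, a single poorly recognised class pulls the score down sharply, which is why this kind of metric is often reported for imbalanced data.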
-
Rotatable Antenna-Enhanced Cell-Free Communication
Authors:
Kecheng Pan,
Beixiong Zheng,
Yanhua Tan,
Emil Björnson,
Robert Schober,
Rui Zhang
Abstract:
Rotatable antenna (RA) is a promising technology that can exploit new spatial degrees-of-freedom (DoFs) by flexibly adjusting the three-dimensional (3D) boresight direction of antennas. In this letter, we investigate an RA-enhanced cell-free system for downlink transmission, where multiple RA-equipped access points (APs) cooperatively serve multiple single-antenna users over the same time-frequency resource. Specifically, we aim to maximize the sum rate of all users by jointly optimizing the AP-user associations and the RA boresight directions. Accordingly, we propose a two-stage strategy to solve the AP-user association problem, and then employ fractional programming (FP) and successive convex approximation (SCA) techniques to optimize the RA boresight directions. Numerical results demonstrate that the proposed RA-enhanced cell-free system significantly outperforms various benchmark schemes.
Submitted 12 December, 2025; v1 submitted 4 December, 2025;
originally announced December 2025.
-
Tutorial on Large Language Model-Enhanced Reinforcement Learning for Wireless Networks
Authors:
Lingyi Cai,
Wenjie Fu,
Yuxi Huang,
Ruichen Zhang,
Yinqiu Liu,
Jiawen Kang,
Zehui Xiong,
Tao Jiang,
Dusit Niyato,
Xianbin Wang,
Shiwen Mao,
Xuemin Shen
Abstract:
Reinforcement Learning (RL) has shown remarkable success in enabling adaptive and data-driven optimization for various applications in wireless networks. However, classical RL suffers from limitations in generalization, learning feedback, interpretability, and sample efficiency in dynamic wireless environments. Large Language Models (LLMs) have emerged as a transformative Artificial Intelligence (AI) paradigm with exceptional capabilities in knowledge generalization, contextual reasoning, and interactive generation, which have demonstrated strong potential to enhance classical RL. This paper serves as a comprehensive tutorial on LLM-enhanced RL for wireless networks. We propose a taxonomy to categorize the roles of LLMs into four critical functions: state perceiver, reward designer, decision-maker, and generator. Then, we review existing studies exploring how each role of LLMs enhances different stages of the RL pipeline. Moreover, we provide a series of case studies to illustrate how to design and apply LLM-enhanced RL in low-altitude economy networking, vehicular networks, and space-air-ground integrated networks. Finally, we conclude with a discussion on potential future directions for LLM-enhanced RL and offer insights into its future development in wireless networks.
Submitted 3 December, 2025;
originally announced December 2025.
-
Evaluating Hydro-Science and Engineering Knowledge of Large Language Models
Authors:
Shiruo Hu,
Wenbo Shan,
Yingjia Li,
Zhiqi Wan,
Xinpeng Yu,
Yunjia Qi,
Haotian Xia,
Yang Xiao,
Dingxiao Liu,
Jiaru Wang,
Chenxu Gong,
Ruixi Zhang,
Shuyue Wu,
Shibo Cui,
Chee Hui Lai,
Wei Luo,
Yubin He,
Bin Xu,
Jianshi Zhao
Abstract:
Hydro-Science and Engineering (Hydro-SE) is a critical and irreplaceable domain that secures human water supply, generates clean hydropower energy, and mitigates flood and drought disasters. Featuring multiple engineering objectives, Hydro-SE is an inherently interdisciplinary domain that integrates scientific knowledge with engineering expertise. This integration necessitates extensive expert collaboration in decision-making, which poses difficulties for intelligence. With the rapid advancement of large language models (LLMs), their potential application in the Hydro-SE domain is being increasingly explored. However, the knowledge and application abilities of LLMs in Hydro-SE have not been sufficiently evaluated. To address this issue, we propose the Hydro-SE LLM evaluation benchmark (Hydro-SE Bench), which contains 4,000 multiple-choice questions. Hydro-SE Bench covers nine subfields and enables evaluation of LLMs in aspects of basic conceptual knowledge, engineering application ability, and reasoning and calculation ability. The evaluation results on Hydro-SE Bench show that accuracy ranges from 0.74 to 0.80 for commercial LLMs and from 0.41 to 0.68 for small-parameter LLMs. While LLMs perform well in subfields closely related to natural and physical sciences, they struggle with domain-specific knowledge such as industry standards and hydraulic structures. Model scaling mainly improves reasoning and calculation abilities, but there is still great potential for LLMs to better handle problems in practical engineering application. This study highlights the strengths and weaknesses of LLMs for Hydro-SE tasks, providing model developers with clear training targets and Hydro-SE researchers with practical guidance for applying LLMs.
Submitted 3 December, 2025;
originally announced December 2025.
-
CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation
Authors:
Ruoxuan Zhang,
Bin Wen,
Hongxia Xie,
Yi Yao,
Songhan Zuo,
Jian-Yu Jiang-Lin,
Hong-Han Shuai,
Wen-Huang Cheng
Abstract:
Cooking is a sequential and visually grounded activity, where each step such as chopping, mixing, or frying carries both procedural logic and visual semantics. While recent diffusion models have shown strong capabilities in text-to-image generation, they struggle to handle structured multi-step scenarios like recipe illustration. Additionally, current recipe illustration methods are unable to adjust to the natural variability in recipe length, generating a fixed number of images regardless of the actual instruction structure. To address these limitations, we present CookAnything, a flexible and consistent diffusion-based framework that generates coherent, semantically distinct image sequences from textual cooking instructions of arbitrary length. The framework introduces three key components: (1) Step-wise Regional Control (SRC), which aligns textual steps with corresponding image regions within a single denoising process; (2) Flexible RoPE, a step-aware positional encoding mechanism that enhances both temporal coherence and spatial diversity; and (3) Cross-Step Consistency Control (CSCC), which maintains fine-grained ingredient consistency across steps. Experimental results on recipe illustration benchmarks show that CookAnything performs better than existing methods in training-based and training-free settings. The proposed framework supports scalable, high-quality visual synthesis of complex multi-step instructions and holds significant potential for broad applications in instructional media and procedural content creation.
Submitted 5 December, 2025; v1 submitted 3 December, 2025;
originally announced December 2025.
-
Learning Multimodal Embeddings for Traffic Accident Prediction and Causal Estimation
Authors:
Ziniu Zhang,
Minxuan Duan,
Haris N. Koutsopoulos,
Hongyang R. Zhang
Abstract:
We consider analyzing traffic accident patterns using both road network data and satellite images aligned to road graph nodes. Previous work for predicting accident occurrences relies primarily on road network structural features while overlooking physical and environmental information from the road surface and its surroundings. In this work, we construct a large multimodal dataset across six U.S. states, containing nine million traffic accident records from official sources, and one million high-resolution satellite images for each node of the road network. Additionally, every node is annotated with features such as the region's weather statistics and road type (e.g., residential vs. motorway), and each edge is annotated with traffic volume information (i.e., Average Annual Daily Traffic). Utilizing this dataset, we conduct a comprehensive evaluation of multimodal learning methods that integrate both visual and network embeddings. Our findings show that integrating both data modalities improves prediction accuracy, achieving an average AUROC of $90.1\%$, which is a $3.7\%$ gain over graph neural network models that only utilize graph structures. With the improved embeddings, we conduct a causal analysis based on a matching estimator to estimate the key contributing factors influencing traffic accidents. We find that accident rates rise by $24\%$ under higher precipitation, by $22\%$ on higher-speed roads such as motorways, and by $29\%$ due to seasonal patterns, after adjusting for other confounding factors. Ablation studies confirm that satellite imagery features are essential for achieving accurate prediction.
Submitted 17 December, 2025; v1 submitted 2 December, 2025;
originally announced December 2025.
-
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
Authors:
DeepSeek-AI,
Aixin Liu,
Aoxue Mei,
Bangcai Lin,
Bing Xue,
Bingxuan Wang,
Bingzheng Xu,
Bochao Wu,
Bowei Zhang,
Chaofan Lin,
Chen Dong,
Chengda Lu,
Chenggang Zhao,
Chengqi Deng,
Chenhao Xu,
Chong Ruan,
Damai Dai,
Daya Guo,
Dejian Yang,
Deli Chen,
Erhang Li,
Fangqi Zhou,
Fangyun Lin,
Fucong Dai,
Guangbo Hao
, et al. (239 additional authors not shown)
Abstract:
We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 performs comparably to GPT-5. Notably, our high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro, achieving gold-medal performance in both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). (3) Large-Scale Agentic Task Synthesis Pipeline: To integrate reasoning into tool-use scenarios, we developed a novel synthesis pipeline that systematically generates training data at scale. This methodology facilitates scalable agentic post-training, yielding substantial improvements in generalization and instruction-following robustness within complex, interactive environments.
Submitted 2 December, 2025;
originally announced December 2025.
-
scCluBench: Comprehensive Benchmarking of Clustering Algorithms for Single-Cell RNA Sequencing
Authors:
Ping Xu,
Zaitian Wang,
Zhirui Wang,
Pengjiang Li,
Jiajia Wang,
Ran Zhang,
Pengfei Wang,
Yuanchun Zhou
Abstract:
Cell clustering is crucial for uncovering cellular heterogeneity in single-cell RNA sequencing (scRNA-seq) data by identifying cell types and marker genes. Despite its importance, benchmarks for scRNA-seq clustering methods remain fragmented, often lacking standardized protocols and failing to incorporate recent advances in artificial intelligence. To fill these gaps, we present scCluBench, a comprehensive benchmark of clustering algorithms for scRNA-seq data. First, scCluBench provides 36 scRNA-seq datasets collected from diverse public sources, covering multiple tissues, which are uniformly processed and standardized to ensure consistency for systematic evaluation and downstream analyses. To evaluate performance, we collect and reproduce a range of scRNA-seq clustering methods, including traditional, deep learning-based, graph-based, and biological foundation models. We comprehensively evaluate each method both quantitatively and qualitatively, using core performance metrics as well as visualization analyses. Furthermore, we construct representative downstream biological tasks, such as marker gene identification and cell type annotation, to further assess the practical utility. scCluBench then investigates the performance differences and applicability boundaries of various clustering models across diverse analytical tasks, systematically assessing their robustness and scalability in real-world scenarios. Overall, scCluBench offers a standardized and user-friendly benchmark for scRNA-seq clustering, with curated datasets, unified evaluation protocols, and transparent analyses, facilitating informed method selection and providing valuable insights into model generalizability and application scope.
Submitted 2 December, 2025;
originally announced December 2025.
-
Beyond Playtesting: A Generative Multi-Agent Simulation System for Massively Multiplayer Online Games
Authors:
Ran Zhang,
Kun Ouyang,
Tiancheng Ma,
Yida Yang,
Dong Fang
Abstract:
Optimizing numerical systems and mechanism design is crucial for enhancing player experience in Massively Multiplayer Online (MMO) games. Traditional optimization approaches rely on large-scale online experiments or parameter tuning over predefined statistical models, which are costly, time-consuming, and may disrupt player experience. Although simplified offline simulation systems are often adopted as alternatives, their limited fidelity prevents agents from accurately mimicking real player reasoning and reactions to interventions. To address these limitations, we propose a generative agent-based MMO simulation system empowered by Large Language Models (LLMs). By applying Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on large-scale real player behavioral data, we adapt LLMs from general priors to game-specific domains, enabling realistic and interpretable player decision-making. In parallel, a data-driven environment model trained on real gameplay logs reconstructs dynamic in-game systems. Experiments demonstrate strong consistency with real-world player behaviors and plausible causal responses under interventions, providing a reliable, interpretable, and cost-efficient framework for data-driven numerical design optimization.
Submitted 1 December, 2025;
originally announced December 2025.
-
Quantum-Classical Separation in Bounded-Resource Tasks Arising from Measurement Contextuality
Authors:
Shashwat Kumar,
Eliott Rosenberg,
Alejandro Grajales Dau,
Rodrigo Cortinas,
Dmitri Maslov,
Richard Oliver,
Adam Zalcman,
Matthew Neeley,
Alice Pagano,
Aaron Szasz,
Ilya Drozdov,
Zlatko Minev,
Craig Gidney,
Noureldin Yosri,
Stijn J. de Graaf,
Aniket Maiti,
Dmitry Abanin,
Rajeev Acharya,
Laleh Aghababaie Beni,
Georg Aigeldinger,
Ross Alcaraz,
Sayra Alcaraz,
Trond I. Andersen,
Markus Ansmann,
Frank Arute
, et al. (258 additional authors not shown)
Abstract:
The prevailing view is that quantum phenomena can be harnessed to tackle certain problems beyond the reach of classical approaches. Quantifying this capability as a quantum-classical separation and demonstrating it on current quantum processors has remained elusive. Using a superconducting qubit processor, we show that quantum contextuality enables certain tasks to be performed with success probabilities beyond classical limits. With a few qubits, we illustrate quantum contextuality with the magic square game, as well as quantify it through a Kochen--Specker--Bell inequality violation. To examine many-body contextuality, we implement the N-player GHZ game and separately solve a 2D hidden linear function problem, exceeding the classical success rate in both. Our work proposes novel ways to benchmark quantum processors using contextuality-based algorithms.
Submitted 1 December, 2025;
originally announced December 2025.
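To make the "classical limits" mentioned in the abstract above concrete, here is a minimal brute-force sketch for the textbook Mermin-Peres magic square game. It assumes the standard rules (Alice fills a row with even parity, Bob fills a column with odd parity, and they win when they agree on the shared cell) with uniformly random questions; the paper's exact game setup may differ, and all names here are hypothetical.

```python
from itertools import product

def lines_with_parity(parity):
    """All 3-bit tuples whose bit sum has the given parity (0 = even, 1 = odd)."""
    return [bits for bits in product((0, 1), repeat=3) if sum(bits) % 2 == parity]

ALICE_ROWS = lines_with_parity(0)  # assumed rule: each of Alice's rows has even parity
BOB_COLS = lines_with_parity(1)    # assumed rule: each of Bob's columns has odd parity

def best_classical_win_rate():
    """Maximum win probability over all deterministic strategies."""
    best = 0
    # A deterministic strategy fixes one answer for each of the 3 possible questions.
    for alice in product(ALICE_ROWS, repeat=3):    # alice[r] = entries of row r
        for bob in product(BOB_COLS, repeat=3):    # bob[c] = entries of column c
            wins = sum(alice[r][c] == bob[c][r]
                       for r in range(3) for c in range(3))
            best = max(best, wins)
    return best / 9  # 9 equally likely (row, column) question pairs

print(best_classical_win_rate())
```

Shared randomness only mixes deterministic strategies, so it cannot beat the best deterministic one. Under these assumed rules the search confirms that no classical strategy wins all nine question pairs, whereas an ideal quantum strategy wins every time.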
-
ManualVLA: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation
Authors:
Chenyang Gu,
Jiaming Liu,
Hao Chen,
Runzhong Huang,
Qingpo Wuwu,
Zhuoyang Liu,
Xiaoqi Li,
Ying Li,
Renrui Zhang,
Peng Jia,
Pheng-Ann Heng,
Shanghang Zhang
Abstract:
Vision-Language-Action (VLA) models have recently emerged, demonstrating strong generalization in robotic scene understanding and manipulation. However, when confronted with long-horizon tasks that require defined goal states, such as LEGO assembly or object rearrangement, existing VLA models still face challenges in coordinating high-level planning with precise manipulation. Therefore, we aim to endow a VLA model with the capability to infer the "how" process from the "what" outcomes, transforming goal states into executable procedures. In this paper, we introduce ManualVLA, a unified VLA framework built upon a Mixture-of-Transformers (MoT) architecture, enabling coherent collaboration between multimodal manual generation and action execution. Unlike prior VLA models that directly map sensory inputs to actions, we first equip ManualVLA with a planning expert that generates intermediate manuals consisting of images, position prompts, and textual instructions. Building upon these multimodal manuals, we design a Manual Chain-of-Thought (ManualCoT) reasoning process that feeds them into the action expert, where each manual step provides explicit control conditions, while its latent representation offers implicit guidance for accurate manipulation. To alleviate the burden of data collection, we develop a high-fidelity digital-twin toolkit based on 3D Gaussian Splatting, which automatically generates manual data for planning expert training. ManualVLA demonstrates strong real-world performance, achieving an average success rate 32% higher than the previous hierarchical SOTA baseline on LEGO assembly and object rearrangement tasks.
Submitted 1 December, 2025;
originally announced December 2025.
-
Efficiently Learning Branching Networks for Multitask Algorithmic Reasoning
Authors:
Dongyue Li,
Zhenshuo Zhang,
Minxuan Duan,
Edgar Dobriban,
Hongyang R. Zhang
Abstract:
Algorithmic reasoning -- the ability to perform step-by-step logical inference -- has become a core benchmark for evaluating reasoning in graph neural networks (GNNs) and large language models (LLMs). Ideally, one would like to design a single model capable of performing well on multiple algorithmic reasoning tasks simultaneously. However, this is challenging when the execution steps of algorithms differ from one another, causing negative interference when they are trained together.
We propose branching neural networks, a principled architecture for multitask algorithmic reasoning. Searching for the optimal $k$-ary tree with $L$ layers over $n$ algorithmic tasks is combinatorial, requiring exploration of up to $k^{nL}$ possible structures. We develop AutoBRANE, an efficient algorithm that reduces this search to $O(nL)$ time by solving a convex relaxation at each layer to approximate an optimal task partition. The method clusters tasks using gradient-based affinity scores and can be used on top of any base model, including GNNs and LLMs.
We validate AutoBRANE on a broad suite of graph-algorithmic and text-based reasoning benchmarks. We show that gradient features estimate true task performance within 5% error across four GNNs and four LLMs (up to 34B parameters). On the CLRS benchmark, it outperforms the strongest single multitask GNN by 3.7% and the best baseline by 1.2%, while reducing runtime by 48% and memory usage by 26%. The learned branching structures reveal an intuitively reasonable hierarchical clustering of related algorithms. On three text-based graph reasoning benchmarks, AutoBRANE improves over the best non-branching multitask baseline by 3.2%. Finally, on a large graph dataset with 21M edges and 500 tasks, AutoBRANE achieves a 28% accuracy gain over existing multitask and branching architectures, along with a 4.5$\times$ reduction in runtime.
Submitted 30 November, 2025;
originally announced December 2025.
-
SimWorld: An Open-ended Realistic Simulator for Autonomous Agents in Physical and Social Worlds
Authors:
Jiawei Ren,
Yan Zhuang,
Xiaokang Ye,
Lingjun Mao,
Xuhong He,
Jianzhi Shen,
Mrinaal Dogra,
Yiming Liang,
Ruixuan Zhang,
Tianai Yue,
Yiqing Yang,
Eric Liu,
Ryan Wu,
Kevin Benavente,
Rajiv Mandya Nagaraju,
Muhammad Faayez,
Xiyan Zhang,
Dhruv Vivek Sharma,
Xianrui Zhong,
Ziqiao Ma,
Tianmin Shu,
Zhiting Hu,
Lianhui Qin
Abstract:
While LLM/VLM-powered AI agents have advanced rapidly in math, coding, and computer use, their applications in complex physical and social environments remain challenging. Building agents that can survive and thrive in the real world (for example, by autonomously earning income or running a business) requires massive-scale interaction, reasoning, training, and evaluation across diverse embodied scenarios. However, existing world simulators for such development fall short: they often rely on limited hand-crafted environments, simulate simplified game-like physics and social rules, and lack native support for LLM/VLM agents. We introduce SimWorld, a new simulator built on Unreal Engine 5, designed for developing and evaluating LLM/VLM agents in rich, real-world-like settings. SimWorld offers three core capabilities: (1) realistic, open-ended world simulation, including accurate physical and social dynamics and language-driven procedural environment generation; (2) a rich interface for LLM/VLM agents, with multimodal world inputs and open-vocabulary actions at varying levels of abstraction; and (3) diverse and extensible physical and social reasoning scenarios that are easily customizable by users. We demonstrate SimWorld by deploying frontier LLM agents (e.g., GPT-4o, Gemini-2.5-Flash, Claude-3.5, and DeepSeek-Prover-V2) on long-horizon multi-agent delivery tasks involving strategic cooperation and competition. The results reveal distinct reasoning patterns and limitations across models. We open-source SimWorld and hope it becomes a foundational platform for advancing real-world agent intelligence across disciplines: https://simworld.org.
Submitted 30 November, 2025;
originally announced December 2025.
-
"Why the face?": Exploring Robot Error Detection Using Instrumented Bystander Reactions
Authors:
Maria Teresa Parreira,
Ruidong Zhang,
Sukruth Gowdru Lingaraju,
Alexandra Bremers,
Xuanyu Fang,
Adolfo Ramirez-Aristizabal,
Manaswi Saha,
Michael Kuniavsky,
Cheng Zhang,
Wendy Ju
Abstract:
How do humans recognize and rectify social missteps? We achieve social competence by looking around at our peers, decoding subtle cues from bystanders - a raised eyebrow, a laugh - to evaluate the environment and our actions. Robots, however, struggle to perceive and make use of these nuanced reactions. By employing a novel neck-mounted device that records facial expressions from the chin region, we explore the potential of previously untapped data to capture and interpret human responses to robot error. First, we develop NeckNet-18, a 3D facial reconstruction model to map the reactions captured through the chin camera onto facial points and head motion. We then use these facial responses to develop a robot error detection model which outperforms standard methodologies such as using OpenFace or video data, generalizing well especially for within-participant data. Through this work, we argue for expanding human-in-the-loop robot sensing, fostering more seamless integration of robots into diverse human environments, pushing the boundaries of social cue detection and opening new avenues for adaptable robotics.
Submitted 28 November, 2025;
originally announced December 2025.
-
InF-ATPG: Intelligent FFR-Driven ATPG with Advanced Circuit Representation Guided Reinforcement Learning
Authors:
Bin Sun,
Rengang Zhang,
Zhiteng Chao,
Zizhen Liu,
Jianan Mu,
Jing Ye,
Huawei Li
Abstract:
Automatic test pattern generation (ATPG) is a crucial process in integrated circuit (IC) design and testing, responsible for efficiently generating test patterns. As semiconductor technology progresses, traditional ATPG struggles with long execution times to achieve the expected fault coverage, which impacts the time-to-market of chips. Recent machine learning techniques, like reinforcement learning (RL) and graph neural networks (GNNs), show promise but face issues such as reward delay in RL models and inadequate circuit representation in GNN-based methods. In this paper, we propose InF-ATPG, an intelligent FFR-driven ATPG framework that overcomes these challenges by using advanced circuit representation to guide RL. By partitioning circuits into fanout-free regions (FFRs) and incorporating ATPG-specific features into a novel QGNN architecture, InF-ATPG enhances test pattern generation efficiency. Experimental results show InF-ATPG reduces backtracks by 55.06\% on average compared to traditional methods and 38.31\% compared to the machine learning approach, while also improving fault coverage.
Submitted 25 November, 2025;
originally announced December 2025.