-
D2Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning
Authors:
Evelyn Zhang,
Fufu Yu,
Aoqi Wu,
Zichen Wen,
Ke Yan,
Shouhong Ding,
Biqing Qi,
Linfeng Zhang
Abstract:
Processing long visual token sequences poses a significant computational burden on Multimodal Large Language Models (MLLMs). While token pruning offers a path to acceleration, we find that current methods, while adequate for general understanding, catastrophically fail on fine-grained localization tasks. We attribute this failure to the inherent flaws of the two prevailing strategies: importance-b…
▽ More
Processing long visual token sequences poses a significant computational burden on Multimodal Large Language Models (MLLMs). While token pruning offers a path to acceleration, we find that current methods, while adequate for general understanding, catastrophically fail on fine-grained localization tasks. We attribute this failure to the inherent flaws of the two prevailing strategies: importance-based methods suffer from a strong positional bias, an inherent model artifact that distracts from semantic content, while diversity-based methods exhibit structural blindness, disregarding the user's prompt and spatial redundancy. To address this, we introduce D2Pruner, a framework that rectifies these issues by uniquely combining debiased importance with a structural pruning mechanism. Our method first secures a core set of the most critical tokens as pivots based on a debiased attention score. It then performs a Maximal Independent Set (MIS) selection on the remaining tokens, which are modeled on a hybrid graph where edges signify spatial proximity and semantic similarity. This process iteratively preserves the most important and available token while removing its neighbors, ensuring that the supplementary tokens are chosen to maximize importance and diversity. Extensive experiments demonstrate that D2Pruner has exceptional efficiency and fidelity. Applied to LLaVA-1.5-7B for general understanding tasks, it reduces FLOPs by 74.2\% while retaining 99.2\% of its original performance. Furthermore, in challenging localization benchmarks with InternVL-2.5-8B, it maintains 85.7\% performance at a 90\% token reduction rate, marking a significant advancement with up to 63. 53\% improvement over existing methods.
△ Less
Submitted 22 December, 2025;
originally announced December 2025.
-
KathDB: Explainable Multimodal Database Management System with Human-AI Collaboration
Authors:
Guorui Xiao,
Enhao Zhang,
Nicole Sullivan,
Will Hansen,
Magdalena Balazinska
Abstract:
Traditional DBMSs execute user- or application-provided SQL queries over relational data with strong semantic guarantees and advanced query optimization, but writing complex SQL is hard and focuses only on structured tables. Contemporary multimodal systems (which operate over relations but also text, images, and even videos) either expose low-level controls that force users to use (and possibly cr…
▽ More
Traditional DBMSs execute user- or application-provided SQL queries over relational data with strong semantic guarantees and advanced query optimization, but writing complex SQL is hard and focuses only on structured tables. Contemporary multimodal systems (which operate over relations but also text, images, and even videos) either expose low-level controls that force users to use (and possibly create) machine learning UDFs manually within SQL or offload execution entirely to black-box LLMs, sacrificing usability or explainability. We propose KathDB, a new system that combines relational semantics with the reasoning power of foundation models over multimodal data. Furthermore, KathDB includes human-AI interaction channels during query parsing, execution, and result explanation, such that users can iteratively obtain explainable answers across data modalities.
△ Less
Submitted 11 December, 2025;
originally announced December 2025.
-
MentraSuite: Post-Training Large Language Models for Mental Health Reasoning and Assessment
Authors:
Mengxi Xiao,
Kailai Yang,
Pengde Zhao,
Enze Zhang,
Ziyan Kuang,
Zhiwei Liu,
Weiguang Han,
Shu Liao,
Lianting Huang,
Jinpeng Hu,
Min Peng,
Qianqian Xie,
Sophia Ananiadou
Abstract:
Mental health disorders affect hundreds of millions globally, and the Web now serves as a primary medium for accessing support, information, and assessment. Large language models (LLMs) offer scalable and accessible assistance, yet their deployment in mental-health settings remains risky when their reasoning is incomplete, inconsistent, or ungrounded. Existing psychological LLMs emphasize emotiona…
▽ More
Mental health disorders affect hundreds of millions globally, and the Web now serves as a primary medium for accessing support, information, and assessment. Large language models (LLMs) offer scalable and accessible assistance, yet their deployment in mental-health settings remains risky when their reasoning is incomplete, inconsistent, or ungrounded. Existing psychological LLMs emphasize emotional understanding or knowledge recall but overlook the step-wise, clinically aligned reasoning required for appraisal, diagnosis, intervention planning, abstraction, and verification. To address these issues, we introduce MentraSuite, a unified framework for advancing reliable mental-health reasoning. We propose MentraBench, a comprehensive benchmark spanning five core reasoning aspects, six tasks, and 13 datasets, evaluating both task performance and reasoning quality across five dimensions: conciseness, coherence, hallucination avoidance, task understanding, and internal consistency. We further present Mindora, a post-trained model optimized through a hybrid SFT-RL framework with an inconsistency-detection reward to enforce faithful and coherent reasoning. To support training, we construct high-quality trajectories using a novel reasoning trajectory generation strategy, that strategically filters difficult samples and applies a structured, consistency-oriented rewriting process to produce concise, readable, and well-balanced trajectories. Across 20 evaluated LLMs, Mindora achieves the highest average performance on MentraBench and shows remarkable performances in reasoning reliability, demonstrating its effectiveness for complex mental-health scenarios.
△ Less
Submitted 16 December, 2025; v1 submitted 10 December, 2025;
originally announced December 2025.
-
A Multi-Robot Platform for Robotic Triage Combining Onboard Sensing and Foundation Models
Authors:
Jason Hughes,
Marcel Hussing,
Edward Zhang,
Shenbagaraj Kannapiran,
Joshua Caswell,
Kenneth Chaney,
Ruichen Deng,
Michaela Feehery,
Agelos Kratimenos,
Yi Fan Li,
Britny Major,
Ethan Sanchez,
Sumukh Shrote,
Youkang Wang,
Jeremy Wang,
Daudi Zein,
Luying Zhang,
Ruijun Zhang,
Alex Zhou,
Tenzi Zhouga,
Jeremy Cannon,
Zaffir Qasim,
Jay Yelon,
Fernando Cladera,
Kostas Daniilidis
, et al. (2 additional authors not shown)
Abstract:
This report presents a heterogeneous robotic system designed for remote primary triage in mass-casualty incidents (MCIs). The system employs a coordinated air-ground team of unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs) to locate victims, assess their injuries, and prioritize medical assistance without risking the lives of first responders. The UAV identify and provide overhe…
▽ More
This report presents a heterogeneous robotic system designed for remote primary triage in mass-casualty incidents (MCIs). The system employs a coordinated air-ground team of unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs) to locate victims, assess their injuries, and prioritize medical assistance without risking the lives of first responders. The UAV identify and provide overhead views of casualties, while UGVs equipped with specialized sensors measure vital signs and detect and localize physical injuries. Unlike previous work that focused on exploration or limited medical evaluation, this system addresses the complete triage process: victim localization, vital sign measurement, injury severity classification, mental status assessment, and data consolidation for first responders. Developed as part of the DARPA Triage Challenge, this approach demonstrates how multi-robot systems can augment human capabilities in disaster response scenarios to maximize lives saved.
△ Less
Submitted 9 December, 2025;
originally announced December 2025.
-
Estimation of Kinematic Motion from Dashcam Footage
Authors:
Evelyn Zhang,
Alex Richardson,
Jonathan Sprinkle
Abstract:
The goal of this paper is to explore the accuracy of dashcam footage to predict the actual kinematic motion of a car-like vehicle. Our approach uses ground truth information from the vehicle's on-board data stream, through the controller area network, and a time-synchronized dashboard camera, mounted to a consumer-grade vehicle, for 18 hours of footage and driving. The contributions of the paper i…
▽ More
The goal of this paper is to explore the accuracy of dashcam footage to predict the actual kinematic motion of a car-like vehicle. Our approach uses ground truth information from the vehicle's on-board data stream, through the controller area network, and a time-synchronized dashboard camera, mounted to a consumer-grade vehicle, for 18 hours of footage and driving. The contributions of the paper include neural network models that allow us to quantify the accuracy of predicting the vehicle speed and yaw, as well as the presence of a lead vehicle, and its relative distance and speed. In addition, the paper describes how other researchers can gather their own data to perform similar experiments, using open-source tools and off-the-shelf technology.
△ Less
Submitted 30 November, 2025;
originally announced December 2025.
-
Dynamic Multi-Species Bird Soundscape Generation with Acoustic Patterning and 3D Spatialization
Authors:
Ellie L. Zhang,
Duoduo Liao,
Callie C. Liao
Abstract:
Generation of dynamic, scalable multi-species bird soundscapes remains a significant challenge in computer music and algorithmic sound design. Birdsongs involve rapid frequency-modulated chirps, complex amplitude envelopes, distinctive acoustic patterns, overlapping calls, and dynamic inter-bird interactions, all of which require precise temporal and spatial control in 3D environments. Existing ap…
▽ More
Generation of dynamic, scalable multi-species bird soundscapes remains a significant challenge in computer music and algorithmic sound design. Birdsongs involve rapid frequency-modulated chirps, complex amplitude envelopes, distinctive acoustic patterns, overlapping calls, and dynamic inter-bird interactions, all of which require precise temporal and spatial control in 3D environments. Existing approaches, whether Digital Signal Processing (DSP)-based or data-driven, typically focus only on single species modeling, static call structures, or synthesis directly from recordings, and often suffer from noise, limited flexibility, or large data needs. To address these challenges, we present a novel, fully algorithm-driven framework that generates dynamic multi-species bird soundscapes using DSP-based chirp generation and 3D spatialization, without relying on recordings or training data. Our approach simulates multiple independently-moving birds per species along different moving 3D trajectories, supporting controllable chirp sequences, overlapping choruses, and realistic 3D motion in scalable soundscapes while preserving species-specific acoustic patterns. A visualization interface provides bird trajectories, spectrograms, activity timelines, and sound waves for analytical and creative purposes. Both visual and audio evaluations demonstrate the ability of the system to generate dense, immersive, and ecologically inspired soundscapes, highlighting its potential for computer music, interactive virtual environments, and computational bioacoustics research.
△ Less
Submitted 24 November, 2025;
originally announced November 2025.
-
MusicAIR: A Multimodal AI Music Generation Framework Powered by an Algorithm-Driven Core
Authors:
Callie C. Liao,
Duoduo Liao,
Ellie L. Zhang
Abstract:
Recent advances in generative AI have made music generation a prominent research focus. However, many neural-based models rely on large datasets, raising concerns about copyright infringement and high-performance costs. In contrast, we propose MusicAIR, an innovative multimodal AI music generation framework powered by a novel algorithm-driven symbolic music core, effectively mitigating copyright i…
▽ More
Recent advances in generative AI have made music generation a prominent research focus. However, many neural-based models rely on large datasets, raising concerns about copyright infringement and high-performance costs. In contrast, we propose MusicAIR, an innovative multimodal AI music generation framework powered by a novel algorithm-driven symbolic music core, effectively mitigating copyright infringement risks. The music core algorithms connect critical lyrical and rhythmic information to automatically derive musical features, creating a complete, coherent melodic score solely from the lyrics. The MusicAIR framework facilitates music generation from lyrics, text, and images. The generated score adheres to established principles of music theory, lyrical structure, and rhythmic conventions. We developed Generate AI Music (GenAIM), a web tool using MusicAIR for lyric-to-song, text-to-music, and image-to-music generation. In our experiments, we evaluated AI-generated music scores produced by the system using both standard music metrics and innovative analysis that compares these compositions with original works. The system achieves an average key confidence of 85%, outperforming human composers at 79%, and aligns closely with established music theory standards, demonstrating its ability to generate diverse, human-like compositions. As a co-pilot tool, GenAIM can serve as a reliable music composition assistant and a possible educational composition tutor while simultaneously lowering the entry barrier for all aspiring musicians, which is innovative and significantly contributes to AI for music generation.
△ Less
Submitted 21 November, 2025;
originally announced November 2025.
-
Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views
Authors:
Junyi Ma,
Wentao Bao,
Jingyi Xu,
Guanzhong Sun,
Yu Zheng,
Erhang Zhang,
Xieyuanli Chen,
Hesheng Wang
Abstract:
Forecasting how human hands move in egocentric views is critical for applications like augmented reality and human-robot policy transfer. Recently, several hand trajectory prediction (HTP) methods have been developed to generate future possible hand waypoints, which still suffer from insufficient prediction targets, inherent modality gaps, entangled hand-head motion, and limited validation in down…
▽ More
Forecasting how human hands move in egocentric views is critical for applications like augmented reality and human-robot policy transfer. Recently, several hand trajectory prediction (HTP) methods have been developed to generate future possible hand waypoints, which still suffer from insufficient prediction targets, inherent modality gaps, entangled hand-head motion, and limited validation in downstream tasks. To address these limitations, we present a universal hand motion forecasting framework considering multi-modal input, multi-dimensional and multi-target prediction patterns, and multi-task affordances for downstream applications. We harmonize multiple modalities by vision-language fusion, global context incorporation, and task-aware text embedding injection, to forecast hand waypoints in both 2D and 3D spaces. A novel dual-branch diffusion is proposed to concurrently predict human head and hand movements, capturing their motion synergy in egocentric vision. By introducing target indicators, the prediction model can forecast the specific joint waypoints of the wrist or the fingers, besides the widely studied hand center points. In addition, we enable Uni-Hand to additionally predict hand-object interaction states (contact/separation) to facilitate downstream tasks better. As the first work to incorporate downstream task evaluation in the literature, we build novel benchmarks to assess the real-world applicability of hand motion forecasting algorithms. The experimental results on multiple publicly available datasets and our newly proposed benchmarks demonstrate that Uni-Hand achieves the state-of-the-art performance in multi-dimensional and multi-target hand motion forecasting. Extensive validation in multiple downstream tasks also presents its impressive human-robot policy transfer to enable robotic manipulation, and effective feature enhancement for action anticipation/recognition.
△ Less
Submitted 4 December, 2025; v1 submitted 16 November, 2025;
originally announced November 2025.
-
Context-Emotion Aware Therapeutic Dialogue Generation: A Multi-component Reinforcement Learning Approach to Language Models for Mental Health Support
Authors:
Eric Hua Qing Zhang,
Julia Ive
Abstract:
Mental health illness represents a substantial global socioeconomic burden, with COVID-19 further exacerbating accessibility challenges and driving increased demand for telehealth mental health support. While large language models (LLMs) offer promising solutions through 24/7 availability and non-judgmental interactions, pre-trained models often lack the contextual and emotional awareness necessar…
▽ More
Mental health illness represents a substantial global socioeconomic burden, with COVID-19 further exacerbating accessibility challenges and driving increased demand for telehealth mental health support. While large language models (LLMs) offer promising solutions through 24/7 availability and non-judgmental interactions, pre-trained models often lack the contextual and emotional awareness necessary for appropriate therapeutic responses. This paper investigated the application of supervised fine-tuning (SFT) and reinforcement learning (RL) techniques to enhance GPT-2's capacity for therapeutic dialogue generation. The methodology restructured input formats to enable simultaneous processing of contextual information and emotional states alongside user input, employing a multi-component reward function that aligned model outputs with professional therapist responses and annotated emotions. Results demonstrated improvements through reinforcement learning over baseline GPT-2 across multiple evaluation metrics: BLEU (0.0111), ROUGE-1 (0.1397), ROUGE-2 (0.0213), ROUGE-L (0.1317), and METEOR (0.0581). LLM evaluation confirmed high contextual relevance and professionalism, while reinforcement learning achieved 99.34% emotion accuracy compared to 66.96% for baseline GPT-2. These findings demonstrate reinforcement learning's effectiveness in developing therapeutic dialogue systems that can serve as valuable assistive tools for therapists while maintaining essential human clinical oversight.
△ Less
Submitted 14 November, 2025;
originally announced November 2025.
-
GPR: Towards a Generative Pre-trained One-Model Paradigm for Large-Scale Advertising Recommendation
Authors:
Jun Zhang,
Yi Li,
Yue Liu,
Changping Wang,
Yuan Wang,
Yuling Xiong,
Xun Liu,
Haiyang Wu,
Qian Li,
Enming Zhang,
Jiawei Sun,
Xin Xu,
Zishuai Zhang,
Ruoran Liu,
Suyuan Huang,
Zhaoxin Zhang,
Zhengkai Guo,
Shuojin Yang,
Meng-Hao Guo,
Huan Yu,
Jie Jiang,
Shi-Min Hu
Abstract:
As an intelligent infrastructure connecting users with commercial content, advertising recommendation systems play a central role in information flow and value creation within the digital economy. However, existing multi-stage advertising recommendation systems suffer from objective misalignment and error propagation, making it difficult to achieve global optimality, while unified generative recom…
▽ More
As an intelligent infrastructure connecting users with commercial content, advertising recommendation systems play a central role in information flow and value creation within the digital economy. However, existing multi-stage advertising recommendation systems suffer from objective misalignment and error propagation, making it difficult to achieve global optimality, while unified generative recommendation models still struggle to meet the demands of practical industrial applications. To address these issues, we propose GPR (Generative Pre-trained Recommender), the first one-model framework that redefines advertising recommendation as an end-to-end generative task, replacing the traditional cascading paradigm with a unified generative approach. To realize GPR, we introduce three key innovations spanning unified representation, network architecture, and training strategy. First, we design a unified input schema and tokenization method tailored to advertising scenarios, mapping both ads and organic content into a shared multi-level semantic ID space, thereby enhancing semantic alignment and modeling consistency across heterogeneous data. Second, we develop the Heterogeneous Hierarchical Decoder (HHD), a dual-decoder architecture that decouples user intent modeling from ad generation, achieving a balance between training efficiency and inference flexibility while maintaining strong modeling capacity. Finally, we propose a multi-stage joint training strategy that integrates Multi-Token Prediction (MTP), Value-Aware Fine-Tuning and the Hierarchy Enhanced Policy Optimization (HEPO) algorithm, forming a complete generative recommendation pipeline that unifies interest modeling, value alignment, and policy optimization. GPR has been fully deployed in the Tencent Weixin Channels advertising system, delivering significant improvements in key business metrics including GMV and CTCVR.
△ Less
Submitted 21 November, 2025; v1 submitted 13 November, 2025;
originally announced November 2025.
-
A Matter of Interest: Understanding Interestingness of Math Problems in Humans and Language Models
Authors:
Shubhra Mishra,
Yuka Machino,
Gabriel Poesia,
Albert Jiang,
Joy Hsu,
Adrian Weller,
Challenger Mishra,
David Broman,
Joshua B. Tenenbaum,
Mateja Jamnik,
Cedegao E. Zhang,
Katherine M. Collins
Abstract:
The evolution of mathematics has been guided in part by interestingness. From researchers choosing which problems to tackle next, to students deciding which ones to engage with, people's choices are often guided by judgments about how interesting or challenging problems are likely to be. As AI systems, such as LLMs, increasingly participate in mathematics with people -- whether for advanced resear…
▽ More
The evolution of mathematics has been guided in part by interestingness. From researchers choosing which problems to tackle next, to students deciding which ones to engage with, people's choices are often guided by judgments about how interesting or challenging problems are likely to be. As AI systems, such as LLMs, increasingly participate in mathematics with people -- whether for advanced research or education -- it becomes important to understand how well their judgments align with human ones. Our work examines this alignment through two empirical studies of human and LLM assessment of mathematical interestingness and difficulty, spanning a range of mathematical experience. We study two groups: participants from a crowdsourcing platform and International Math Olympiad competitors. We show that while many LLMs appear to broadly agree with human notions of interestingness, they mostly do not capture the distribution observed in human judgments. Moreover, most LLMs only somewhat align with why humans find certain math problems interesting, showing weak correlation with human-selected interestingness rationales. Together, our findings highlight both the promises and limitations of current LLMs in capturing human interestingness judgments for mathematical AI thought partnerships.
△ Less
Submitted 11 November, 2025;
originally announced November 2025.
-
EchoMark: Perceptual Acoustic Environment Transfer with Watermark-Embedded Room Impulse Response
Authors:
Chenpei Huang,
Lingfeng Yao,
Kyu In Lee,
Lan Emily Zhang,
Xun Chen,
Miao Pan
Abstract:
Acoustic Environment Matching (AEM) is the task of transferring clean audio into a target acoustic environment, enabling engaging applications such as audio dubbing and auditory immersive virtual reality (VR). Recovering similar room impulse response (RIR) directly from reverberant speech offers more accessible and flexible AEM solution. However, this capability also introduces vulnerabilities of…
▽ More
Acoustic Environment Matching (AEM) is the task of transferring clean audio into a target acoustic environment, enabling engaging applications such as audio dubbing and auditory immersive virtual reality (VR). Recovering similar room impulse response (RIR) directly from reverberant speech offers more accessible and flexible AEM solution. However, this capability also introduces vulnerabilities of arbitrary ``relocation" if misused by malicious user, such as facilitating advanced voice spoofing attacks or undermining the authenticity of recorded evidence. To address this issue, we propose EchoMark, the first deep learning-based AEM framework that generates perceptually similar RIRs with embedded watermark. Our design tackle the challenges posed by variable RIR characteristics, such as different durations and energy decays, by operating in the latent domain. By jointly optimizing the model with a perceptual loss for RIR reconstruction and a loss for watermark detection, EchoMark achieves both high-quality environment transfer and reliable watermark recovery. Experiments on diverse datasets validate that EchoMark achieves room acoustic parameter matching performance comparable to FiNS, the state-of-the-art RIR estimator. Furthermore, a high Mean Opinion Score (MOS) of 4.22 out of 5, watermark detection accuracy exceeding 99\%, and bit error rates (BER) below 0.3\% collectively demonstrate the effectiveness of EchoMark in preserving perceptual quality while ensuring reliable watermark embedding.
△ Less
Submitted 9 November, 2025;
originally announced November 2025.
-
Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behavior
Authors:
Daniel Aarao Reis Arturi,
Eric Zhang,
Andrew Ansah,
Kevin Zhu,
Ashwinee Panda,
Aishwarya Balwani
Abstract:
Recent work has discovered that large language models can develop broadly misaligned behaviors after being fine-tuned on narrowly harmful datasets, a phenomenon known as emergent misalignment (EM). However, the fundamental mechanisms enabling such harmful generalization across disparate domains remain poorly understood. In this work, we adopt a geometric perspective to study EM and demonstrate tha…
▽ More
Recent work has discovered that large language models can develop broadly misaligned behaviors after being fine-tuned on narrowly harmful datasets, a phenomenon known as emergent misalignment (EM). However, the fundamental mechanisms enabling such harmful generalization across disparate domains remain poorly understood. In this work, we adopt a geometric perspective to study EM and demonstrate that it exhibits a fundamental cross-task linear structure in how harmful behavior is encoded across different datasets. Specifically, we find a strong convergence in EM parameters across tasks, with the fine-tuned weight updates showing relatively high cosine similarities, as well as shared lower-dimensional subspaces as measured by their principal angles and projection overlaps. Furthermore, we also show functional equivalence via linear mode connectivity, wherein interpolated models across narrow misalignment tasks maintain coherent, broadly misaligned behavior. Our results indicate that EM arises from different narrow tasks discovering the same set of shared parameter directions, suggesting that harmful behaviors may be organized into specific, predictable regions of the weight landscape. By revealing this fundamental connection between parametric geometry and behavioral outcomes, we hope our work catalyzes further research on parameter space interpretability and weight-based interventions.
△ Less
Submitted 3 November, 2025;
originally announced November 2025.
-
Building a Silver-Standard Dataset from NICE Guidelines for Clinical LLMs
Authors:
Qing Ding,
Eric Hua Qing Zhang,
Felix Jozsa,
Julia Ive
Abstract:
Large language models (LLMs) are increasingly used in healthcare, yet standardised benchmarks for evaluating guideline-based clinical reasoning are missing. This study introduces a validated dataset derived from publicly available guidelines across multiple diagnoses. The dataset was created with the help of GPT and contains realistic patient scenarios, as well as clinical questions. We benchmark…
▽ More
Large language models (LLMs) are increasingly used in healthcare, yet standardised benchmarks for evaluating guideline-based clinical reasoning are missing. This study introduces a validated dataset derived from publicly available guidelines across multiple diagnoses. The dataset was created with the help of GPT and contains realistic patient scenarios, as well as clinical questions. We benchmark a range of recent popular LLMs to showcase the validity of our dataset. The framework supports systematic evaluation of LLMs' clinical utility and guideline adherence.
△ Less
Submitted 2 November, 2025;
originally announced November 2025.
-
VeriStruct: AI-assisted Automated Verification of Data-Structure Modules in Verus
Authors:
Chuyue Sun,
Yican Sun,
Daneshvar Amrollahi,
Ethan Zhang,
Shuvendu Lahiri,
Shan Lu,
David Dill,
Clark Barrett
Abstract:
We introduce VeriStruct, a novel framework that extends AI-assisted automated verification from single functions to more complex data structure modules in Verus. VeriStruct employs a planner module to orchestrate the systematic generation of abstractions, type invariants, specifications, and proof code. To address the challenge that LLMs often misunderstand Verus' annotation syntax and verificatio…
▽ More
We introduce VeriStruct, a novel framework that extends AI-assisted automated verification from single functions to more complex data structure modules in Verus. VeriStruct employs a planner module to orchestrate the systematic generation of abstractions, type invariants, specifications, and proof code. To address the challenge that LLMs often misunderstand Verus' annotation syntax and verification-specific semantics, VeriStruct embeds syntax guidance within prompts and includes a repair stage to automatically correct annotation errors. In an evaluation on eleven Rust data structure modules, VeriStruct succeeds on ten of the eleven, successfully verifying 128 out of 129 functions (99.2%) in total. These results represent an important step toward the goal of automatic AI-assisted formal verification.
△ Less
Submitted 16 November, 2025; v1 submitted 28 October, 2025;
originally announced October 2025.
-
Code-enabled language models can outperform reasoning models on diverse tasks
Authors:
Cedegao E. Zhang,
Cédric Colas,
Gabriel Poesia,
Joshua B. Tenenbaum,
Jacob Andreas
Abstract:
Reasoning models (RMs), language models (LMs) trained with reinforcement learning to produce long-form natural language reasoning, have been remarkably successful, but they still require large amounts of computation and data to train, and can be slow and expensive to run. In this paper, we show that standard instruct LMs can already be elicited to be strong reasoners at a level comparable to or ev…
▽ More
Reasoning models (RMs), language models (LMs) trained with reinforcement learning to produce long-form natural language reasoning, have been remarkably successful, but they still require large amounts of computation and data to train, and can be slow and expensive to run. In this paper, we show that standard instruct LMs can already be elicited to be strong reasoners at a level comparable to or even surpassing their corresponding RMs (e.g., DeepSeek V3 vs R1) without finetuning, across diverse domains from instruction following and creative generation to mathematical reasoning. This is achieved by CodeAdapt, our simple recipe that combines the CodeAct framework, where LMs interleave natural language reasoning with code execution in a multi-step fashion, with few-shot bootstrap in-context learning from as few as five training problems. Analyzing four matched pairs of LMs and RMs, we find that CodeAdapt enables three LMs to outperform the corresponding RMs on average over eight tasks (up to 22.9%) while being 10-81% more token efficient, and delivers superior performance on six tasks when averaged over the four models (up to 35.7%). Furthermore, the code-augmented reasoning traces display rich and varied problem-solving strategies. Our findings support that (1) CodeAdapt-style learning and reasoning may be robust and domain general and (2) code-enabled LMs are cognitively grounded and powerful systems, potentially providing a strong foundation for in-weight reinforcement learning.
△ Less
Submitted 23 October, 2025;
originally announced October 2025.
-
RoboChallenge: Large-scale Real-robot Evaluation of Embodied Policies
Authors:
Adina Yakefu,
Bin Xie,
Chongyang Xu,
Enwen Zhang,
Erjin Zhou,
Fan Jia,
Haitao Yang,
Haoqiang Fan,
Haowei Zhang,
Hongyang Peng,
Jing Tan,
Junwen Huang,
Kai Liu,
Kaixin Liu,
Kefan Gu,
Qinglun Zhang,
Ruitao Zhang,
Saike Huang,
Shen Cheng,
Shuaicheng Liu,
Tiancai Wang,
Tiezhen Wang,
Wei Sun,
Wenbin Tang,
Yajun Wei
, et al. (12 additional authors not shown)
Abstract:
Testing on real machines is indispensable for robotic control algorithms. In the context of learning-based algorithms, especially VLA models, demand for large-scale evaluation, i.e. testing a large number of models on a large number of tasks, is becoming increasingly urgent. However, doing this right is highly non-trivial, especially when scalability and reproducibility is taken into account. In t…
▽ More
Testing on real machines is indispensable for robotic control algorithms. In the context of learning-based algorithms, especially VLA models, demand for large-scale evaluation, i.e. testing a large number of models on a large number of tasks, is becoming increasingly urgent. However, doing this right is highly non-trivial, especially when scalability and reproducibility is taken into account. In this report, we describe our methodology for constructing RoboChallenge, an online evaluation system to test robotic control algorithms, and our survey of recent state-of-the-art VLA models using our initial benchmark Table30.
△ Less
Submitted 20 October, 2025;
originally announced October 2025.
-
Seeing Through the Brain: New Insights from Decoding Visual Stimuli with fMRI
Authors:
Zheng Huang,
Enpei Zhang,
Yinghao Cai,
Weikang Qiu,
Carl Yang,
Elynn Chen,
Xiang Zhang,
Rex Ying,
Dawei Zhou,
Yujun Yan
Abstract:
Understanding how the brain encodes visual information is a central challenge in neuroscience and machine learning. A promising approach is to reconstruct visual stimuli, essentially images, from functional Magnetic Resonance Imaging (fMRI) signals. This involves two stages: transforming fMRI signals into a latent space and then using a pretrained generative model to reconstruct images. The recons…
▽ More
Understanding how the brain encodes visual information is a central challenge in neuroscience and machine learning. A promising approach is to reconstruct visual stimuli, essentially images, from functional Magnetic Resonance Imaging (fMRI) signals. This involves two stages: transforming fMRI signals into a latent space and then using a pretrained generative model to reconstruct images. The reconstruction quality depends on how similar the latent space is to the structure of neural activity and how well the generative model produces images from that space. Yet, it remains unclear which type of latent space best supports this transformation and how it should be organized to represent visual stimuli effectively. We present two key findings. First, fMRI signals are more similar to the text space of a language model than to either a vision based space or a joint text image space. Second, text representations and the generative model should be adapted to capture the compositional nature of visual stimuli, including objects, their detailed attributes, and relationships. Building on these insights, we propose PRISM, a model that Projects fMRI sIgnals into a Structured text space as an interMediate representation for visual stimuli reconstruction. It includes an object centric diffusion module that generates images by composing individual objects to reduce object detection errors, and an attribute relationship search module that automatically identifies key attributes and relationships that best align with the neural activity. Extensive experiments on real world datasets demonstrate that our framework outperforms existing methods, achieving up to an 8% reduction in perceptual loss. These results highlight the importance of using structured text as the intermediate space to bridge fMRI signals and image reconstruction.
△ Less
Submitted 17 October, 2025;
originally announced October 2025.
-
COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes
Authors:
Yunwen Li,
Shuangshuang Ying,
Xingwei Qu,
Xin Li,
Sheng Jin,
Minghao Liu,
Zhoufutu Wen,
Tianyu Zheng,
Xeron Du,
Qiguang Chen,
Jiajun Shi,
Wangchunshu Zhou,
Jiazhan Feng,
Wanjun Zhong,
Libo Qin,
Stephen Huang,
Wanxiang Che,
Chenghua Lin,
Eli Zhang
Abstract:
Large language models exhibit systematic deficiencies in creative writing, particularly in non-English contexts where training data is scarce and lacks process-level supervision. We present COIG-Writer, a novel Chinese creative writing dataset that captures both diverse outputs and their underlying thought processes through systematic reverse-engineering of high-quality texts. Unlike existing data…
▽ More
Large language models exhibit systematic deficiencies in creative writing, particularly in non-English contexts where training data is scarce and lacks process-level supervision. We present COIG-Writer, a novel Chinese creative writing dataset that captures both diverse outputs and their underlying thought processes through systematic reverse-engineering of high-quality texts. Unlike existing datasets that provide only input-output pairs, COIG-Writer comprises 1,665 meticulously curated triplets spanning 51 genres, each containing: (1) a reverse-engineered prompt, (2) detailed creative reasoning documenting decision-making processes, and (3) the final text. Through comprehensive experiments, we identify a two-component model of creative writing: narrative logic (provided by process supervision) and linguistic expression (maintained by general-purpose data). Our findings reveal three critical insights: (1) Process supervision is highly effective but requires stabilization with general data. A ratio of at least one creative sample to twelve general samples is needed to achieve optimal performance; below this threshold, the win rate progressively degrades (from 62.75% down to 35.78%)., (2) creative capabilities are culturally-bound with no cross-lingual transfer (89.26pp gap between Chinese and English performance), and (3) lexical diversity inversely correlates with creative quality (TTR paradox), suggesting high diversity signals compensatory behavior for logical deficiencies. These findings establish that creative excellence emerges from the interaction between logical scaffolding and linguistic grounding, analogous to how mathematical reasoning enhances but cannot replace linguistic competence in foundation models.
△ Less
Submitted 16 October, 2025;
originally announced October 2025.
-
Lifting Manifolds to Mitigate Pseudo-Alignment in LLM4TS
Authors:
Liangwei Nathan Zheng,
Wenhao Liang,
Wei Emma Zhang,
Miao Xu,
Olaf Maennel,
Weitong Chen
Abstract:
Pseudo-Alignment is a pervasive challenge in many large language models for time series (LLM4TS) models, often causing them to underperform compared to linear models or randomly initialised backbones. However, there is limited discussion in the community for the reasons that pseudo-alignment occurs. In this work, we conduct a thorough investigation into the root causes of pseudo-alignment in LLM4T…
▽ More
Pseudo-Alignment is a pervasive challenge in many large language models for time series (LLM4TS) models, often causing them to underperform compared to linear models or randomly initialised backbones. However, there is limited discussion in the community for the reasons that pseudo-alignment occurs. In this work, we conduct a thorough investigation into the root causes of pseudo-alignment in LLM4TS and build a connection of pseudo-alignment to the cone effect in LLM. We demonstrate that pseudo-alignment arises from the interplay of cone effect within pretrained LLM components and the intrinsically low-dimensional manifold of time-series data. In addition, we also introduce \textit{\textbf{TimeSUP}}, a novel technique designed to mitigate this issue and improve forecast performance in existing LLM4TS approaches. TimeSUP addresses this by increasing the time series manifold to more closely match the intrinsic dimension of language embeddings, allowing the model to distinguish temporal signals clearly while still capturing shared structures across modalities. As a result, representations for time and language tokens remain distinct yet exhibit high cosine similarity, signifying that the model preserves each modality unique features while learning their commonalities in a unified embedding space. Empirically, TimeSUP consistently outperforms state-of-the-art LLM4TS methods and other lightweight baselines on long-term forecasting performance. Furthermore, it can be seamlessly integrated into four existing LLM4TS pipelines and delivers significant improvements in forecasting performance.
△ Less
Submitted 14 October, 2025;
originally announced October 2025.
-
People use fast, flat goal-directed simulation to reason about novel problems
Authors:
Katherine M. Collins,
Cedegao E. Zhang,
Lionel Wong,
Mauricio Barba da Costa,
Graham Todd,
Adrian Weller,
Samuel J. Cheyette,
Thomas L. Griffiths,
Joshua B. Tenenbaum
Abstract:
Games have long been a microcosm for studying planning and reasoning in both natural and artificial intelligence, especially with a focus on expert-level or even super-human play. But real life also pushes human intelligence along a different frontier, requiring people to flexibly navigate decision-making problems that they have never thought about before. Here, we use novice gameplay to study how…
▽ More
Games have long been a microcosm for studying planning and reasoning in both natural and artificial intelligence, especially with a focus on expert-level or even super-human play. But real life also pushes human intelligence along a different frontier, requiring people to flexibly navigate decision-making problems that they have never thought about before. Here, we use novice gameplay to study how people make decisions and form judgments in new problem settings. We show that people are systematic and adaptively rational in how they play a game for the first time, or evaluate a game (e.g., how fair or how fun it is likely to be) before they have played it even once. We explain these capacities via a computational cognitive model that we call the "Intuitive Gamer". The model is based on mechanisms of fast and flat (depth-limited) goal-directed probabilistic simulation--analogous to those used in Monte Carlo tree-search models of expert game-play, but scaled down to use very few stochastic samples, simple goal heuristics for evaluating actions, and no deep search. In a series of large-scale behavioral studies with over 1000 participants and 121 two-player strategic board games (almost all novel to our participants), our model quantitatively captures human judgments and decisions varying the amount and kind of experience people have with a game--from no experience at all ("just thinking"), to a single round of play, to indirect experience watching another person and predicting how they should play--and does so significantly better than much more compute-intensive expert-level models. More broadly, our work offers new insights into how people rapidly evaluate, act, and make suggestions when encountering novel problems, and could inform the design of more flexible and human-like AI systems that can determine not just how to solve new tasks, but whether a task is worth thinking about at all.
△ Less
Submitted 13 October, 2025;
originally announced October 2025.
-
Evaluating Language Models' Evaluations of Games
Authors:
Katherine M. Collins,
Cedegao E. Zhang,
Graham Todd,
Lance Ying,
Mauricio Barba da Costa,
Ryan Liu,
Prafull Sharma,
Adrian Weller,
Ionatan Kuperwajs,
Lionel Wong,
Joshua B. Tenenbaum,
Thomas L. Griffiths
Abstract:
Reasoning is not just about solving problems -- it is also about evaluating which problems are worth solving at all. Evaluations of artificial intelligence (AI) systems primarily focused on problem solving, historically by studying how models play games such as chess and Go. In this paper, we advocate for a new paradigm that assesses AI systems' evaluation of games. First, we introduce a formalism…
▽ More
Reasoning is not just about solving problems -- it is also about evaluating which problems are worth solving at all. Evaluations of artificial intelligence (AI) systems primarily focused on problem solving, historically by studying how models play games such as chess and Go. In this paper, we advocate for a new paradigm that assesses AI systems' evaluation of games. First, we introduce a formalism for evaluating such evaluations. We then leverage a large-scale dataset of over $100$ novel board games and over 450 human judgments to compare evaluations produced by modern language and reasoning models against those of people and symbolic computational agents. We consider two kinds of evaluative queries: assessing the payoff (or fairness) and the funness of games. These queries span two dimensions relevant to the design of evaluations of AI evaluations: how complex a query is to compute and how difficult a query is to quantify. Our results show that reasoning models are generally more aligned to people in their evaluations of games than non-reasoning language models. However, we observe a non-monotonic relationship: as models get closer to game-theoretic optimal, their fit to human data weakens. We also observe more "jaggedness" across models for assessing funness, in line with the greater difficulty of quantifying this query. Across queries and games, reasoning models show highly variable and unpredictable resource usage when assessing queries, pointing to the importance of imbuing more resource-rational meta-reasoning in language and reasoning models.
△ Less
Submitted 12 October, 2025;
originally announced October 2025.
-
DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation
Authors:
Enze Zhang,
Jiaying Wang,
Mengxi Xiao,
Jifei Liu,
Ziyan Kuang,
Rui Dong,
Eric Dong,
Sophia Ananiadou,
Min Peng,
Qianqian Xie
Abstract:
Large language models (LLMs) have substantially advanced machine translation (MT), yet their effectiveness in translating web novels remains unclear. Existing benchmarks rely on surface-level metrics that fail to capture the distinctive traits of this genre. To address these gaps, we introduce DITING, the first comprehensive evaluation framework for web novel translation, assessing narrative and c…
▽ More
Large language models (LLMs) have substantially advanced machine translation (MT), yet their effectiveness in translating web novels remains unclear. Existing benchmarks rely on surface-level metrics that fail to capture the distinctive traits of this genre. To address these gaps, we introduce DITING, the first comprehensive evaluation framework for web novel translation, assessing narrative and cultural fidelity across six dimensions: idiom translation, lexical ambiguity, terminology localization, tense consistency, zero-pronoun resolution, and cultural safety, supported by over 18K expert-annotated Chinese-English sentence pairs. We further propose AgentEval, a reasoning-driven multi-agent evaluation framework that simulates expert deliberation to assess translation quality beyond lexical overlap, achieving the highest correlation with human judgments among seven tested automatic metrics. To enable metric comparison, we develop MetricAlign, a meta-evaluation dataset of 300 sentence pairs annotated with error labels and scalar quality scores. Comprehensive evaluation of fourteen open, closed, and commercial models reveals that Chinese-trained LLMs surpass larger foreign counterparts, and that DeepSeek-V3 delivers the most faithful and stylistically coherent translations. Our work establishes a new paradigm for exploring LLM-based web novel translation and provides public resources to advance future research.
△ Less
Submitted 13 October, 2025; v1 submitted 10 October, 2025;
originally announced October 2025.
-
Autonomy Matters: A Study on Personalization-Privacy Dilemma in LLM Agents
Authors:
Zhiping Zhang,
Yi Evie Zhang,
Freda Shi,
Tianshi Li
Abstract:
Large Language Model (LLM) agents require personal information for personalization in order to better act on users' behalf in daily tasks, but this raises privacy concerns and a personalization-privacy dilemma. Agent's autonomy introduces both risks and opportunities, yet its effects remain unclear. To better understand this, we conducted a 3$\times$3 between-subjects experiment ($N=450$) to study…
▽ More
Large Language Model (LLM) agents require personal information for personalization in order to better act on users' behalf in daily tasks, but this raises privacy concerns and a personalization-privacy dilemma. Agent's autonomy introduces both risks and opportunities, yet its effects remain unclear. To better understand this, we conducted a 3$\times$3 between-subjects experiment ($N=450$) to study how agent's autonomy level and personalization influence users' privacy concerns, trust and willingness to use, as well as the underlying psychological processes. We find that personalization without considering users' privacy preferences increases privacy concerns and decreases trust and willingness to use. Autonomy moderates these effects: Intermediate autonomy flattens the impact of personalization compared to No- and Full autonomy conditions. Our results suggest that rather than aiming for perfect model alignment in output generation, balancing autonomy of agent's action and user control offers a promising path to mitigate the personalization-privacy dilemma.
△ Less
Submitted 5 October, 2025;
originally announced October 2025.
-
Flow Autoencoders are Effective Protein Tokenizers
Authors:
Rohit Dilip,
Evan Zhang,
Ayush Varshney,
David Van Valen
Abstract:
Protein structure tokenizers enable the creation of multimodal models of protein structure, sequence, and function. Current approaches to protein structure tokenization rely on bespoke components that are invariant to spatial symmetries, but that are challenging to optimize and scale. We present Kanzi, a flow-based tokenizer for tokenization and generation of protein structures. Kanzi consists of…
▽ More
Protein structure tokenizers enable the creation of multimodal models of protein structure, sequence, and function. Current approaches to protein structure tokenization rely on bespoke components that are invariant to spatial symmetries, but that are challenging to optimize and scale. We present Kanzi, a flow-based tokenizer for tokenization and generation of protein structures. Kanzi consists of a diffusion autoencoder trained with a flow matching loss. We show that this approach simplifies several aspects of protein structure tokenizers: frame-based representations can be replaced with global coordinates, complex losses are replaced with a single flow matching loss, and SE(3)-invariant attention operations can be replaced with standard attention. We find that these changes stabilize the training of parameter-efficient models that outperform existing tokenizers on reconstruction metrics at a fraction of the model size and training cost. An autoregressive model trained with Kanzi outperforms similar generative models that operate over tokens, although it does not yet match the performance of state-of-the-art continuous diffusion models. Code is available here: https://github.com/rdilip/kanzi/.
△ Less
Submitted 30 September, 2025;
originally announced October 2025.
-
Data Melodification FM: Where Musical Rhetoric Meets Sonification
Authors:
Ke Er Amy Zhang,
David Grellscheid,
Laura Garrison
Abstract:
We propose a design space for data melodification, where standard visualization idioms and fundamental data characteristics map to rhetorical devices of music for a more affective experience of data. Traditional data sonification transforms data into sound by mapping it to different parameters such as pitch, volume, and duration. Often and regrettably, this mapping leaves behind melody, harmony, r…
▽ More
We propose a design space for data melodification, where standard visualization idioms and fundamental data characteristics map to rhetorical devices of music for a more affective experience of data. Traditional data sonification transforms data into sound by mapping it to different parameters such as pitch, volume, and duration. Often and regrettably, this mapping leaves behind melody, harmony, rhythm and other musical devices that compose the centuries-long persuasive and expressive power of music. What results is the occasional, unintentional sense of tinnitus and horror film-like impending doom caused by a disconnect between the semantics of data and sound. Through this work we ask, can the aestheticization of sonification through (classical) music theory make data simultaneously accessible, meaningful, and pleasing to one's ears?
△ Less
Submitted 30 September, 2025;
originally announced October 2025.
-
EVO-LRP: Evolutionary Optimization of LRP for Interpretable Model Explanations
Authors:
Emerald Zhang,
Julian Weaver,
Samantha R Santacruz,
Edward Castillo
Abstract:
Explainable AI (XAI) methods help identify which image regions influence a model's prediction, but often face a trade-off between detail and interpretability. Layer-wise Relevance Propagation (LRP) offers a model-aware alternative. However, LRP implementations commonly rely on heuristic rule sets that are not optimized for clarity or alignment with model behavior. We introduce EVO-LRP, a method th…
▽ More
Explainable AI (XAI) methods help identify which image regions influence a model's prediction, but often face a trade-off between detail and interpretability. Layer-wise Relevance Propagation (LRP) offers a model-aware alternative. However, LRP implementations commonly rely on heuristic rule sets that are not optimized for clarity or alignment with model behavior. We introduce EVO-LRP, a method that applies Covariance Matrix Adaptation Evolution Strategy (CMA-ES) to tune LRP hyperparameters based on quantitative interpretability metrics, such as faithfulness or sparseness. EVO-LRP outperforms traditional XAI approaches in both interpretability metric performance and visual coherence, with strong sensitivity to class-specific features. These findings demonstrate that attribution quality can be systematically improved through principled, task-specific optimization.
△ Less
Submitted 30 September, 2025; v1 submitted 27 September, 2025;
originally announced September 2025.
-
Semi-Supervised Synthetic Data Generation with Fine-Grained Relevance Control for Short Video Search Relevance Modeling
Authors:
Haoran Li,
Zhiming Su,
Junyan Yao,
Enwei Zhang,
Yang Ji,
Yan Chen,
Kan Zhou,
Chao Feng,
Jiao Ran
Abstract:
Synthetic data is widely adopted in embedding models to ensure diversity in training data distributions across dimensions such as difficulty, length, and language. However, existing prompt-based synthesis methods struggle to capture domain-specific data distributions, particularly in data-scarce domains, and often overlook fine-grained relevance diversity. In this paper, we present a Chinese short…
▽ More
Synthetic data is widely adopted in embedding models to ensure diversity in training data distributions across dimensions such as difficulty, length, and language. However, existing prompt-based synthesis methods struggle to capture domain-specific data distributions, particularly in data-scarce domains, and often overlook fine-grained relevance diversity. In this paper, we present a Chinese short video dataset with 4-level relevance annotations, filling a critical resource void. Further, we propose a semi-supervised synthetic data pipeline where two collaboratively trained models generate domain-adaptive short video data with controllable relevance labels. Our method enhances relevance-level diversity by synthesizing samples for underrepresented intermediate relevance labels, resulting in a more balanced and semantically rich training data set. Extensive offline experiments show that the embedding model trained on our synthesized data outperforms those using data generated based on prompting or vanilla supervised fine-tuning(SFT). Moreover, we demonstrate that incorporating more diverse fine-grained relevance levels in training data enhances the model's sensitivity to subtle semantic distinctions, highlighting the value of fine-grained relevance supervision in embedding learning. In the search enhanced recommendation pipeline of Douyin's dual-column scenario, through online A/B testing, the proposed model increased click-through rate(CTR) by 1.45%, raised the proportion of Strong Relevance Ratio (SRR) by 4.9%, and improved the Image User Penetration Rate (IUPR) by 0.1054%.
△ Less
Submitted 4 December, 2025; v1 submitted 20 September, 2025;
originally announced September 2025.
-
Deploying UDM Series in Real-Life Stuttered Speech Applications: A Clinical Evaluation Framework
Authors:
Eric Zhang,
Li Wei,
Sarah Chen,
Michael Wang
Abstract:
Stuttered and dysfluent speech detection systems have traditionally suffered from the trade-off between accuracy and clinical interpretability. While end-to-end deep learning models achieve high performance, their black-box nature limits clinical adoption. This paper looks at the Unconstrained Dysfluency Modeling (UDM) series-the current state-of-the-art framework developed by Berkeley that combin…
▽ More
Stuttered and dysfluent speech detection systems have traditionally suffered from the trade-off between accuracy and clinical interpretability. While end-to-end deep learning models achieve high performance, their black-box nature limits clinical adoption. This paper looks at the Unconstrained Dysfluency Modeling (UDM) series-the current state-of-the-art framework developed by Berkeley that combines modular architecture, explicit phoneme alignment, and interpretable outputs for real-world clinical deployment. Through extensive experiments involving patients and certified speech-language pathologists (SLPs), we demonstrate that UDM achieves state-of-the-art performance (F1: 0.89+-0.04) while providing clinically meaningful interpretability scores (4.2/5.0). Our deployment study shows 87% clinician acceptance rate and 34% reduction in diagnostic time. The results provide strong evidence that UDM represents a practical pathway toward AI-assisted speech therapy in clinical environments.
△ Less
Submitted 17 September, 2025;
originally announced September 2025.
-
Progressing Level-of-Detail Animation of Volumetric Elastodynamics
Authors:
Jiayi Eris Zhang,
Doug L. James,
Danny M. Kaufman
Abstract:
We extend the progressive dynamics model (Zhang et al., 2024) from cloth and shell simulation to volumetric finite elements, enabling an efficient level-of-detail (LOD) animation-design pipeline with predictive coarse-resolution previews facilitating rapid iterative design for a final, to-be-generated, high-resolution animation of volumetric elastodynamics. This extension to volumetric domains pos…
▽ More
We extend the progressive dynamics model (Zhang et al., 2024) from cloth and shell simulation to volumetric finite elements, enabling an efficient level-of-detail (LOD) animation-design pipeline with predictive coarse-resolution previews facilitating rapid iterative design for a final, to-be-generated, high-resolution animation of volumetric elastodynamics. This extension to volumetric domains poses significant new challenges, including the construction of suitable mesh hierarchies and the definition of effective prolongation operators for codimension-0 progressive dynamics. To address these challenges, we propose a practical method for defining multiresolution hierarchies and, more importantly, introduce a simple yet effective topology-aware algorithm for constructing prolongation operators between overlapping (but not necessarily conforming) volumetric meshes. Our key insight is a boundary binding strategy that enables the computation of barycentric coordinates, allowing several off-the-shelf interpolants -- such as standard barycentric coordinates, Biharmonic Coordinates (Wang et al., 2015), and Phong Deformation (James, 2020) -- to serve as "plug-and-play" components for prolongation with minimal modification. We show that our progressive volumetric simulation framework achieves high-fidelity matching LOD animation across resolutions including challenging dynamics with high speeds, large deformations, and frictional contact.
△ Less
Submitted 16 September, 2025;
originally announced September 2025.
-
FediLoRA: Heterogeneous LoRA for Federated Multimodal Fine-tuning under Missing Modalities
Authors:
Lishan Yang,
Wei Emma Zhang,
Nam Kha Nguygen,
Po Hu,
Yanjun Shu,
Weitong Chen,
Mong Yuan Sim
Abstract:
Foundation models have demonstrated remarkable performance across a wide range of tasks, yet their large parameter sizes pose challenges for practical deployment, especially in decentralized environments. Parameter-efficient fine-tuning (PEFT), such as Low-Rank Adaptation (LoRA), reduces local computing and memory overhead, making it attractive for federated learning. However, existing federated L…
▽ More
Foundation models have demonstrated remarkable performance across a wide range of tasks, yet their large parameter sizes pose challenges for practical deployment, especially in decentralized environments. Parameter-efficient fine-tuning (PEFT), such as Low-Rank Adaptation (LoRA), reduces local computing and memory overhead, making it attractive for federated learning. However, existing federated LoRA methods typically assume uniform rank configurations and unimodal inputs, overlooking two key real-world challenges: (1) heterogeneous client resources have different LoRA ranks, and (2) multimodal data settings with potentially missing modalities. In this work, we propose FediLoRA, a simple yet effective framework for federated multimodal fine-tuning under heterogeneous LoRA ranks and missing modalities. FediLoRA introduces a dimension-wise aggregation strategy that reweights LoRA updates without information dilution during aggregation. It also includes a lightweight layer-wise model editing method that selectively incorporates global parameters to repair local components which improves both client and global model performances. Experimental results on three multimodal benchmark datasets demonstrate that FediLoRA achieves superior performance over competitive baselines in both global and personalized settings, particularly in the presence of modality incompleteness.
△ Less
Submitted 23 September, 2025; v1 submitted 1 September, 2025;
originally announced September 2025.
-
On the Same Wavelength? Evaluating Pragmatic Reasoning in Language Models across Broad Concepts
Authors:
Linlu Qiu,
Cedegao E. Zhang,
Joshua B. Tenenbaum,
Yoon Kim,
Roger P. Levy
Abstract:
Language use is shaped by pragmatics -- i.e., reasoning about communicative goals and norms in context. As language models (LMs) are increasingly used as conversational agents, it becomes ever more important to understand their pragmatic reasoning abilities. We propose an evaluation framework derived from Wavelength, a popular communication game where a speaker and a listener communicate about a b…
▽ More
Language use is shaped by pragmatics -- i.e., reasoning about communicative goals and norms in context. As language models (LMs) are increasingly used as conversational agents, it becomes ever more important to understand their pragmatic reasoning abilities. We propose an evaluation framework derived from Wavelength, a popular communication game where a speaker and a listener communicate about a broad range of concepts in a granular manner. We study a range of LMs on both language comprehension and language production using direct and Chain-of-Thought (CoT) prompting, and further explore a Rational Speech Act (RSA) approach to incorporating Bayesian pragmatic reasoning into LM inference. We find that state-of-the-art LMs, but not smaller ones, achieve strong performance on language comprehension, obtaining similar-to-human accuracy and exhibiting high correlations with human judgments even without CoT prompting or RSA. On language production, CoT can outperform direct prompting, and using RSA provides significant improvements over both approaches. Our study helps identify the strengths and limitations in LMs' pragmatic reasoning abilities and demonstrates the potential for improving them with RSA, opening up future avenues for understanding conceptual representation, language understanding, and social reasoning in LMs and humans.
△ Less
Submitted 26 September, 2025; v1 submitted 8 September, 2025;
originally announced September 2025.
-
Why Language Models Hallucinate
Authors:
Adam Tauman Kalai,
Ofir Nachum,
Santosh S. Vempala,
Edwin Zhang
Abstract:
Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and…
▽ More
Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. Hallucinations need not be mysterious -- they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures. We then argue that hallucinations persist due to the way most evaluations are graded -- language models are optimized to be good test-takers, and guessing when uncertain improves test performance. This "epidemic" of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. This change may steer the field toward more trustworthy AI systems.
△ Less
Submitted 4 September, 2025;
originally announced September 2025.
-
Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers
Authors:
Xingyue Huang,
Rishabh,
Gregor Franke,
Ziyi Yang,
Jiamu Bai,
Weijie Bai,
Jinhe Bi,
Zifeng Ding,
Yiqun Duan,
Chengyu Fan,
Wendong Fan,
Xin Gao,
Ruohao Guo,
Yuan He,
Zhuangzhuang He,
Xianglong Hu,
Neil Johnson,
Bowen Li,
Fangru Lin,
Siyu Lin,
Tong Liu,
Yunpu Ma,
Hao Shen,
Hao Sun,
Beibei Wang
, et al. (21 additional authors not shown)
Abstract:
Recent advances in Large Language Models (LLMs) have shown that their reasoning capabilities can be significantly improved through Reinforcement Learning with Verifiable Reward (RLVR), particularly in domains like mathematics and programming, where ground-truth correctness can be automatically evaluated. However, extending this success to other reasoning-intensive domains remains challenging due t…
▽ More
Recent advances in Large Language Models (LLMs) have shown that their reasoning capabilities can be significantly improved through Reinforcement Learning with Verifiable Reward (RLVR), particularly in domains like mathematics and programming, where ground-truth correctness can be automatically evaluated. However, extending this success to other reasoning-intensive domains remains challenging due to the scarcity of high-quality, verifiable datasets and the high cost of human supervision. In this work, we introduce the Loong Project: an open-source framework for scalable synthetic data generation and verification across a diverse range of reasoning-intensive domains. The framework consists of two key components: (1) LoongBench, a curated seed dataset containing 8,729 human-vetted examples across 12 domains (e.g., Advanced Mathematics, Chemistry, Logic), each paired with executable code and rich metadata; and (2) LoongEnv, a modular synthetic data generation environment that supports multiple prompting strategies to produce new question-answer-code triples. Together, these components form an agent-environment loop that enables reinforcement learning, where an LLM-based agent is rewarded for generating Chain-of-Thought (CoT) solutions that align with code-executed answers. Empirically, we benchmark LoongBench on a broad suite of both open-source and proprietary LLMs to evaluate domain coverage and reveal performance bottlenecks. In addition, we conduct a comprehensive analysis of synthetic data generated by LoongEnv, examining correctness, difficulty, and diversity. Code and documentation are available at https://github.com/camel-ai/loong.
△ Less
Submitted 3 September, 2025;
originally announced September 2025.
-
Data Humanism Decoded: A Characterization of its Principles to Bridge Data Visualization Researchers and Practitioners
Authors:
Ibrahim Al-Hazwani,
Ke Er Zhang,
Laura Garrison,
Jürgen Bernard
Abstract:
Data Humanism is a human-centered design approach that emphasizes the personal, contextual, and imperfect nature of data. Despite its growing influence among practitioners, the 13 principles outlined in Giorgia Lupi's visual manifesto remain loosely defined in research contexts, creating a gap between design practice and systematic application. Through a mixed-methods approach, including a systema…
▽ More
Data Humanism is a human-centered design approach that emphasizes the personal, contextual, and imperfect nature of data. Despite its growing influence among practitioners, the 13 principles outlined in Giorgia Lupi's visual manifesto remain loosely defined in research contexts, creating a gap between design practice and systematic application. Through a mixed-methods approach, including a systematic literature review, multimedia analysis, and expert interviews, we present a characterization of Data Humanism principles for visualization researchers. Our characterization provides concrete definitions that maintain interpretive flexibility in operationalizing design choices. We validate our work through direct consultation with Lupi. Moreover, we leverage the characterization to decode a visualization work, mapping Data Humanism principles to specific visual design choices. Our work creates a common language for human-centered visualization, bridging the gap between practice and research for future applications and evaluations.
△ Less
Submitted 30 August, 2025;
originally announced September 2025.
-
A Comparative Study of Controllability, Explainability, and Performance in Dysfluency Detection Models
Authors:
Eric Zhang,
Li Wei,
Sarah Chen,
Michael Wang
Abstract:
Recent advances in dysfluency detection have introduced a variety of modeling paradigms, ranging from lightweight object-detection inspired networks (YOLOStutter) to modular interpretable frameworks (UDM). While performance on benchmark datasets continues to improve, clinical adoption requires more than accuracy: models must be controllable and explainable. In this paper, we present a systematic c…
▽ More
Recent advances in dysfluency detection have introduced a variety of modeling paradigms, ranging from lightweight object-detection inspired networks (YOLOStutter) to modular interpretable frameworks (UDM). While performance on benchmark datasets continues to improve, clinical adoption requires more than accuracy: models must be controllable and explainable. In this paper, we present a systematic comparative analysis of four representative approaches--YOLO-Stutter, FluentNet, UDM, and SSDM--along three dimensions: performance, controllability, and explainability. Through comprehensive evaluation on multiple datasets and expert clinician assessment, we find that YOLO-Stutter and FluentNet provide efficiency and simplicity, but with limited transparency; UDM achieves the best balance of accuracy and clinical interpretability; and SSDM, while promising, could not be fully reproduced in our experiments. Our analysis highlights the trade-offs among competing approaches and identifies future directions for clinically viable dysfluency modeling. We also provide detailed implementation insights and practical deployment considerations for each approach.
△ Less
Submitted 25 August, 2025;
originally announced September 2025.
-
A Taxonomy of Transcendence
Authors:
Natalie Abreu,
Edwin Zhang,
Eran Malach,
Naomi Saphra
Abstract:
Although language models are trained to mimic humans, the resulting systems display capabilities beyond the scope of any one person. To understand this phenomenon, we use a controlled setting to identify properties of the training data that lead a model to transcend the performance of its data sources. We build on previous work to outline three modes of transcendence, which we call skill denoising…
▽ More
Although language models are trained to mimic humans, the resulting systems display capabilities beyond the scope of any one person. To understand this phenomenon, we use a controlled setting to identify properties of the training data that lead a model to transcend the performance of its data sources. We build on previous work to outline three modes of transcendence, which we call skill denoising, skill selection, and skill generalization. We then introduce a knowledge graph-based setting in which simulated experts generate data based on their individual expertise. We highlight several aspects of data diversity that help to enable the model's transcendent capabilities. Additionally, our data generation setting offers a controlled testbed that we hope is valuable for future research in the area.
△ Less
Submitted 25 August, 2025;
originally announced August 2025.
-
Revisiting Rule-Based Stuttering Detection: A Comprehensive Analysis of Interpretable Models for Clinical Applications
Authors:
Eric Zhang
Abstract:
Stuttering affects approximately 1% of the global population, impacting communication and quality of life. While recent advances in deep learning have pushed the boundaries of automatic speech dysfluency detection, rule-based approaches remain crucial for clinical applications where interpretability and transparency are paramount. This paper presents a comprehensive analysis of rule-based stutteri…
▽ More
Stuttering affects approximately 1% of the global population, impacting communication and quality of life. While recent advances in deep learning have pushed the boundaries of automatic speech dysfluency detection, rule-based approaches remain crucial for clinical applications where interpretability and transparency are paramount. This paper presents a comprehensive analysis of rule-based stuttering detection systems, synthesizing insights from multiple corpora including UCLASS, FluencyBank, and SEP-28k. We propose an enhanced rule-based framework that incorporates speaking-rate normalization, multi-level acoustic feature analysis, and hierarchical decision structures. Our approach achieves competitive performance while maintaining complete interpretability-critical for clinical adoption. We demonstrate that rule-based systems excel particularly in prolongation detection (97-99% accuracy) and provide stable performance across varying speaking rates. Furthermore, we show how these interpretable models can be integrated with modern machine learning pipelines as proposal generators or constraint modules, bridging the gap between traditional speech pathology practices and contemporary AI systems. Our analysis reveals that while neural approaches may achieve marginally higher accuracy in unconstrained settings, rule-based methods offer unique advantages in clinical contexts where decision auditability, patient-specific tuning, and real-time feedback are essential.
△ Less
Submitted 21 August, 2025;
originally announced August 2025.
-
EgoLoc: A Generalizable Solution for Temporal Interaction Localization in Egocentric Videos
Authors:
Junyi Ma,
Erhang Zhang,
Yin-Dong Zheng,
Yuchen Xie,
Yixuan Zhou,
Hesheng Wang
Abstract:
Analyzing hand-object interaction in egocentric vision facilitates VR/AR applications and human-robot policy transfer. Existing research has mostly focused on modeling the behavior paradigm of interactive actions (i.e., ``how to interact''). However, the more challenging and fine-grained problem of capturing the critical moments of contact and separation between the hand and the target object (i.e…
▽ More
Analyzing hand-object interaction in egocentric vision facilitates VR/AR applications and human-robot policy transfer. Existing research has mostly focused on modeling the behavior paradigm of interactive actions (i.e., ``how to interact''). However, the more challenging and fine-grained problem of capturing the critical moments of contact and separation between the hand and the target object (i.e., ``when to interact'') is still underexplored, which is crucial for immersive interactive experiences in mixed reality and robotic motion planning. Therefore, we formulate this problem as temporal interaction localization (TIL). Some recent works extract semantic masks as TIL references, but suffer from inaccurate object grounding and cluttered scenarios. Although current temporal action localization (TAL) methods perform well in detecting verb-noun action segments, they rely on category annotations during training and exhibit limited precision in localizing hand-object contact/separation moments. To address these issues, we propose a novel zero-shot approach dubbed EgoLoc to localize hand-object contact and separation timestamps in egocentric videos. EgoLoc introduces hand-dynamics-guided sampling to generate high-quality visual prompts. It exploits the vision-language model to identify contact/separation attributes, localize specific timestamps, and provide closed-loop feedback for further refinement. EgoLoc eliminates the need for object masks and verb-noun taxonomies, leading to generalizable zero-shot implementation. Comprehensive experiments on the public dataset and our novel benchmarks demonstrate that EgoLoc achieves plausible TIL for egocentric videos. It is also validated to effectively facilitate multiple downstream applications in egocentric vision and robotic manipulation tasks. Code and relevant data will be released at https://github.com/IRMVLab/EgoLoc.
△ Less
Submitted 17 August, 2025;
originally announced August 2025.
-
gpt-oss-120b & gpt-oss-20b Model Card
Authors:
OpenAI,
:,
Sandhini Agarwal,
Lama Ahmad,
Jason Ai,
Sam Altman,
Andy Applebaum,
Edwin Arbus,
Rahul K. Arora,
Yu Bai,
Bowen Baker,
Haiming Bao,
Boaz Barak,
Ally Bennett,
Tyler Bertao,
Nivedita Brett,
Eugene Brevdo,
Greg Brockman,
Sebastien Bubeck,
Che Chang,
Kai Chen,
Mark Chen,
Enoch Cheung,
Aidan Clark,
Dan Cook
, et al. (102 additional authors not shown)
Abstract:
We present gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-expert transformer architecture and are trained using large-scale distillation and reinforcement learning. We optimize the models to have strong agentic capabilities (deep research browsing, python tool use, and support for develope…
▽ More
We present gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-expert transformer architecture and are trained using large-scale distillation and reinforcement learning. We optimize the models to have strong agentic capabilities (deep research browsing, python tool use, and support for developer-provided functions), all while using a rendered chat format that enables clear instruction following and role delineation. Both models achieve strong results on benchmarks ranging from mathematics, coding, and safety. We release the model weights, inference implementations, tool environments, and tokenizers under an Apache 2.0 license to enable broad use and further research.
△ Less
Submitted 8 August, 2025;
originally announced August 2025.
-
Generation and Evaluation in the Human Invention Process through the Lens of Game Design
Authors:
Katherine M. Collins,
Graham Todd,
Cedegao E. Zhang,
Adrian Weller,
Julian Togelius,
Junyi Chu,
Lionel Wong,
Thomas L. Griffiths,
Joshua B. Tenenbaum
Abstract:
The human ability to learn rules and solve problems has been a central concern of cognitive science research since the field's earliest days. But we do not just follow rules and solve problems given to us by others: we modify those rules, create new problems, and set new goals and tasks for ourselves and others. Arguably, even more than rule following and problem solving, human intelligence is abo…
▽ More
The human ability to learn rules and solve problems has been a central concern of cognitive science research since the field's earliest days. But we do not just follow rules and solve problems given to us by others: we modify those rules, create new problems, and set new goals and tasks for ourselves and others. Arguably, even more than rule following and problem solving, human intelligence is about creatively breaking and stretching the rules, changing the game, and inventing new problems worth thinking about. Creating a good rule or a good problem depends not just on the ideas one can think up but on how one evaluates such proposals. Here, we study invention through the lens of game design. We focus particularly on the early stages of novice, "everyday" game creation, where the stakes are low. We draw on a dataset of over 450 human created games, created by participants who saw an initial seed set of two-player grid-based strategy games. We consider two different cognitive mechanisms that may be at work during the early processes of intuitive game invention: an associative proposal based on previous games one has seen and compute-bounded model-based evaluation that an everyday game creator may use to refine their initial draft proposals. In our preliminary work, we conduct a model-based analysis of how people invented new games based on prior experience and find that generated games are best described by a model which incorporates model-based estimates of game quality at a population level. Our work points to how human invention is based not only on what people propose, but how they evaluate and offers a computational toolkit to scale empirical studies of model-based simulation in open-ended human innovation.
△ Less
Submitted 31 July, 2025;
originally announced August 2025.
-
Neural Bandit Based Optimal LLM Selection for a Pipeline of Tasks
Authors:
Baran Atalar,
Eddie Zhang,
Carlee Joe-Wong
Abstract:
With the increasing popularity of large language models (LLMs) for a variety of tasks, there has been a growing interest in strategies that can predict which out of a set of LLMs will yield a successful answer at low cost. This problem promises to become more and more relevant as providers like Microsoft allow users to easily create custom LLM "assistants" specialized to particular types of querie…
▽ More
With the increasing popularity of large language models (LLMs) for a variety of tasks, there has been a growing interest in strategies that can predict which out of a set of LLMs will yield a successful answer at low cost. This problem promises to become more and more relevant as providers like Microsoft allow users to easily create custom LLM "assistants" specialized to particular types of queries. However, some tasks (i.e., queries) may be too specialized and difficult for a single LLM to handle alone. These applications often benefit from breaking down the task into smaller subtasks, each of which can then be executed by a LLM expected to perform well on that specific subtask. For example, in extracting a diagnosis from medical records, one can first select an LLM to summarize the record, select another to validate the summary, and then select another, possibly different, LLM to extract the diagnosis from the summarized record. Unlike existing LLM selection or routing algorithms, this setting requires that we select a sequence of LLMs, with the output of each LLM feeding into the next and potentially influencing its success. Thus, unlike single LLM selection, the quality of each subtask's output directly affects the inputs, and hence the cost and success rate, of downstream LLMs, creating complex performance dependencies that must be learned and accounted for during selection. We propose a neural contextual bandit-based algorithm that trains neural networks that model LLM success on each subtask in an online manner, thus learning to guide the LLM selections for the different subtasks, even in the absence of historical LLM performance data. Experiments on telecommunications question answering and medical diagnosis prediction datasets illustrate the effectiveness of our proposed approach compared to other LLM selection algorithms.
△ Less
Submitted 17 August, 2025; v1 submitted 13 August, 2025;
originally announced August 2025.
-
Calibration Attention: Instance-wise Temperature Scaling for Vision Transformers
Authors:
Wenhao Liang,
Wei Emma Zhang,
Lin Yue,
Miao Xu,
Olaf Maennel,
Weitong Chen
Abstract:
Probability calibration is critical when Vision Transformers are deployed in risk-sensitive applications. The standard fix, post-hoc temperature scaling, uses a single global scalar and requires a held-out validation set. We introduce Calibration Attention (CalAttn), a drop-in module that learns an adaptive, per-instance temperature directly from the ViT's CLS token. Across CIFAR-10/100, MNIST, Ti…
▽ More
Probability calibration is critical when Vision Transformers are deployed in risk-sensitive applications. The standard fix, post-hoc temperature scaling, uses a single global scalar and requires a held-out validation set. We introduce Calibration Attention (CalAttn), a drop-in module that learns an adaptive, per-instance temperature directly from the ViT's CLS token. Across CIFAR-10/100, MNIST, Tiny-ImageNet, and ImageNet-1K, CalAttn reduces calibration error by up to 4x on ViT-224, DeiT, and Swin, while adding under 0.1 percent additional parameters. The learned temperatures cluster tightly around 1.0, in contrast to the large global values used by standard temperature scaling. CalAttn is simple, efficient, and architecture-agnostic, and yields more trustworthy probabilities without sacrificing accuracy. Code: [https://github.com/EagleAdelaide/CalibrationAttention-CalAttn-](https://github.com/EagleAdelaide/CalibrationAttention-CalAttn-)
△ Less
Submitted 11 August, 2025;
originally announced August 2025.
-
Offline-to-Online Reinforcement Learning with Classifier-Free Diffusion Generation
Authors:
Xiao Huang,
Xu Liu,
Enze Zhang,
Tong Yu,
Shuai Li
Abstract:
Offline-to-online Reinforcement Learning (O2O RL) aims to perform online fine-tuning on an offline pre-trained policy to minimize costly online interactions. Existing work used offline datasets to generate data that conform to the online data distribution for data augmentation. However, generated data still exhibits a gap with the online data, limiting overall performance. To address this, we prop…
▽ More
Offline-to-online Reinforcement Learning (O2O RL) aims to perform online fine-tuning on an offline pre-trained policy to minimize costly online interactions. Existing work used offline datasets to generate data that conform to the online data distribution for data augmentation. However, generated data still exhibits a gap with the online data, limiting overall performance. To address this, we propose a new data augmentation approach, Classifier-Free Diffusion Generation (CFDG). Without introducing additional classifier training overhead, CFDG leverages classifier-free guidance diffusion to significantly enhance the generation quality of offline and online data with different distributions. Additionally, it employs a reweighting method to enable more generated data to align with the online data, enhancing performance while maintaining the agent's stability. Experimental results show that CFDG outperforms replaying the two data types or using a standard diffusion model to generate new data. Our method is versatile and can be integrated with existing offline-to-online RL algorithms. By implementing CFDG to popular methods IQL, PEX and APL, we achieve a notable 15% average improvement in empirical performance on the D4RL benchmark such as MuJoCo and AntMaze.
△ Less
Submitted 8 August, 2025;
originally announced August 2025.
-
MAO-ARAG: Multi-Agent Orchestration for Adaptive Retrieval-Augmented Generation
Authors:
Yiqun Chen,
Erhan Zhang,
Lingyong Yan,
Shuaiqiang Wang,
Jizhou Huang,
Dawei Yin,
Jiaxin Mao
Abstract:
In question-answering (QA) systems, Retrieval-Augmented Generation (RAG) has become pivotal in enhancing response accuracy and reducing hallucination issues. The architecture of RAG systems varies significantly, encompassing single-round RAG, iterative RAG, and reasoning RAG, each tailored to address different types of queries. Due to the varying complexity of real-world queries, a fixed RAG pipel…
▽ More
In question-answering (QA) systems, Retrieval-Augmented Generation (RAG) has become pivotal in enhancing response accuracy and reducing hallucination issues. The architecture of RAG systems varies significantly, encompassing single-round RAG, iterative RAG, and reasoning RAG, each tailored to address different types of queries. Due to the varying complexity of real-world queries, a fixed RAG pipeline often struggles to balance performance and cost efficiency across different queries. To address this challenge, we propose an adaptive RAG framework called MAO-ARAG, which leverages multi-agent orchestration. Our adaptive RAG is conceived as a multi-turn framework. Specifically, we define multiple executor agents, representing typical RAG modules such as query reformulation agents, document selection agent, and generation agents. A planner agent intelligently selects and integrates the appropriate agents from these executors into a suitable workflow tailored for each query, striving for high-quality answers while maintaining reasonable costs. During each turn, the planner agent is trained using reinforcement learning, guided by an outcome-based reward (F1 score) and a cost-based penalty, continuously improving answer quality while keeping costs within a reasonable range. Experiments conducted on multiple QA datasets demonstrate that our approach, which dynamically plans workflows for each query, not only achieves high answer quality but also maintains both cost and latency within acceptable limits.The code of MAO-ARAG is on https://github.com/chenyiqun/Agentic-RAG.
△ Less
Submitted 1 August, 2025;
originally announced August 2025.
-
FedDPG: An Adaptive Yet Efficient Prompt-tuning Approach in Federated Learning Settings
Authors:
Ali Shakeri,
Wei Emma Zhang,
Amin Beheshti,
Weitong Chen,
Jian Yang,
Lishan Yang
Abstract:
Pre-trained Language Models (PLMs) have demonstrated impressive performance in various NLP tasks. However, traditional fine-tuning methods for leveraging PLMs for downstream tasks entail significant computational overhead. Prompt-tuning has emerged as an efficient alternative that involves prepending a limited number of parameters to the input sequence and only updating them while the PLM's parame…
▽ More
Pre-trained Language Models (PLMs) have demonstrated impressive performance in various NLP tasks. However, traditional fine-tuning methods for leveraging PLMs for downstream tasks entail significant computational overhead. Prompt-tuning has emerged as an efficient alternative that involves prepending a limited number of parameters to the input sequence and only updating them while the PLM's parameters are frozen. However, this technique's prompts remain fixed for all inputs, reducing the model's flexibility. The Federated Learning (FL) technique has gained attention in recent years to address the growing concerns around data privacy. However, challenges such as communication and computation limitations of clients still need to be addressed. To mitigate these challenges, this paper introduces the Federated Dynamic Prompt Generator (FedDPG), which incorporates a dynamic prompt generator network to generate context-aware prompts based on the given input, ensuring flexibility and adaptability while prioritising data privacy in federated learning settings. Our experiments on three NLP benchmark datasets showcase that FedDPG outperforms the state-of-the-art parameter-efficient fine-tuning methods in terms of global model performance, and has significantly reduced the calculation time and the number of parameters to be sent through the FL network.
△ Less
Submitted 21 July, 2025;
originally announced July 2025.
-
AtrousMamaba: An Atrous-Window Scanning Visual State Space Model for Remote Sensing Change Detection
Authors:
Tao Wang,
Tiecheng Bai,
Chao Xu,
Bin Liu,
Erlei Zhang,
Jiyun Huang,
Hongming Zhang
Abstract:
Recently, a novel visual state space (VSS) model, referred to as Mamba, has demonstrated significant progress in modeling long sequences with linear complexity, comparable to Transformer models, thereby enhancing its adaptability for processing visual data. Although most methods aim to enhance the global receptive field by directly modifying Mamba's scanning mechanism, they tend to overlook the cr…
▽ More
Recently, a novel visual state space (VSS) model, referred to as Mamba, has demonstrated significant progress in modeling long sequences with linear complexity, comparable to Transformer models, thereby enhancing its adaptability for processing visual data. Although most methods aim to enhance the global receptive field by directly modifying Mamba's scanning mechanism, they tend to overlook the critical importance of local information in dense prediction tasks. Additionally, whether Mamba can effectively extract local features as convolutional neural networks (CNNs) do remains an open question that merits further investigation. In this paper, We propose a novel model, AtrousMamba, which effectively balances the extraction of fine-grained local details with the integration of global contextual information. Specifically, our method incorporates an atrous-window selective scan mechanism, enabling a gradual expansion of the scanning range with adjustable rates. This design shortens the distance between adjacent tokens, enabling the model to effectively capture fine-grained local features and global context. By leveraging the atrous window scan visual state space (AWVSS) module, we design dedicated end-to-end Mamba-based frameworks for binary change detection (BCD) and semantic change detection (SCD), referred to as AWMambaBCD and AWMambaSCD, respectively. Experimental results on six benchmark datasets show that the proposed framework outperforms existing CNN-based, Transformer-based, and Mamba-based methods. These findings clearly demonstrate that Mamba not only captures long-range dependencies in visual data but also effectively preserves fine-grained local details.
△ Less
Submitted 21 July, 2025;
originally announced July 2025.
-
Modeling Open-World Cognition as On-Demand Synthesis of Probabilistic Models
Authors:
Lionel Wong,
Katherine M. Collins,
Lance Ying,
Cedegao E. Zhang,
Adrian Weller,
Tobias Gerstenberg,
Timothy O'Donnell,
Alexander K. Lew,
Jacob D. Andreas,
Joshua B. Tenenbaum,
Tyler Brooke-Wilson
Abstract:
When faced with novel situations, people are able to marshal relevant considerations from a wide range of background knowledge and put these to use in inferences and predictions. What permits us to draw in globally relevant information and reason over it coherently? Here, we explore the hypothesis that people use a combination of distributed and symbolic representations to construct bespoke mental…
▽ More
When faced with novel situations, people are able to marshal relevant considerations from a wide range of background knowledge and put these to use in inferences and predictions. What permits us to draw in globally relevant information and reason over it coherently? Here, we explore the hypothesis that people use a combination of distributed and symbolic representations to construct bespoke mental models tailored to novel situations. We propose a computational implementation of this idea -- a ``Model Synthesis Architecture'' (MSA) -- using language models to implement global relevance-based retrieval and model synthesis and probabilistic programs to implement bespoke, coherent world models. We evaluate our MSA as a model of human judgments on a novel reasoning dataset. The dataset -- built around a `Model Olympics` domain of sports vignettes -- tests models' capacity for human-like, open-ended reasoning by requiring (i) judgments about novel causal structures described in language; (ii) drawing on large bodies of background knowledge; and (iii) doing both in light of observations that introduce arbitrary novel variables. Our MSA approach captures human judgments better than language model-only baselines, under both direct and chain-of-thought generations from the LM that supports model synthesis. These results suggest that MSAs can be implemented in a way that mirrors people's ability to deliver locally coherent reasoning over globally relevant variables, offering a path to understanding and replicating human reasoning in open-ended domains.
△ Less
Submitted 18 July, 2025; v1 submitted 16 July, 2025;
originally announced July 2025.
-
Deconstructing Implicit Beliefs in Visual Data Journalism: Unstable Meanings Behind Data as Truth & Design for Insight
Authors:
Ke Er Amy Zhang,
Jodie Jenkinson,
Laura Garrison
Abstract:
We conduct a deconstructive reading of a qualitative interview study with 17 visual data journalists from newsrooms across the globe. We borrow a deconstruction approach from literary critique to explore the instability of meaning in language and reveal implicit beliefs in words and ideas. Through our analysis we surface two sets of opposing implicit beliefs in visual data journalism: objectivity/…
▽ More
We conduct a deconstructive reading of a qualitative interview study with 17 visual data journalists from newsrooms across the globe. We borrow a deconstruction approach from literary critique to explore the instability of meaning in language and reveal implicit beliefs in words and ideas. Through our analysis we surface two sets of opposing implicit beliefs in visual data journalism: objectivity/subjectivity and humanism/mechanism. We contextualize these beliefs through a genealogical analysis, which brings deconstruction theory into practice by providing a historic backdrop for these opposing perspectives. Our analysis shows that these beliefs held within visual data journalism are not self-enclosed but rather a product of external societal forces and paradigm shifts over time. Through this work, we demonstrate how thinking with critical theories such as deconstruction and genealogy can reframe "success" in visual data storytelling and diversify visualization research outcomes. These efforts push the ways in which we as researchers produce domain knowledge to examine the sociotechnical issues of today's values towards datafication and data visualization. All supplemental materials for this work are available at osf.io/5fr48.
△ Less
Submitted 18 July, 2025; v1 submitted 16 July, 2025;
originally announced July 2025.
-
Optimizing Sequential Multi-Step Tasks with Parallel LLM Agents
Authors:
Enhao Zhang,
Erkang Zhu,
Gagan Bansal,
Adam Fourney,
Hussein Mozannar,
Jack Gerrits
Abstract:
Large language model (LLM)-based multi-agent systems have demonstrated remarkable promise for tackling complex tasks by breaking them down into subtasks that are iteratively planned, executed, observed, and refined. Despite their effectiveness, these systems often incur high latency because real-world problems frequently demand multiple iterative cycles of reasoning steps. To address this challeng…
▽ More
Large language model (LLM)-based multi-agent systems have demonstrated remarkable promise for tackling complex tasks by breaking them down into subtasks that are iteratively planned, executed, observed, and refined. Despite their effectiveness, these systems often incur high latency because real-world problems frequently demand multiple iterative cycles of reasoning steps. To address this challenge, we propose M1-Parallel, a framework that concurrently runs multiple multi-agent teams in parallel to uncover distinct solution paths. By leveraging an event-driven communication model with asynchronous messaging, M1-Parallel efficiently capitalizes on the inherent diversity of valid plans to either reduce end-to-end latency or boost task completion rates. Our experiments on complex tasks show that M1-Parallel with early termination achieves up to $2.2\times$ speedup while preserving accuracy, and that M1-Parallel with aggregation yields higher task completion rates. We further investigate strategies aimed at encouraging diverse execution plans but observe no additional performance gains over repeated sampling. Overall, these findings underscore the potential of parallel plan execution for optimizing multi-agent systems for real-world, high-complexity reasoning tasks.
△ Less
Submitted 11 July, 2025;
originally announced July 2025.