-
Enabling Conversational Behavior Reasoning Capabilities in Full-Duplex Speech
Authors:
Shuchang Pan,
Siddharth Banerjee,
Dhruv Hebbar,
Siddhant Patel,
Akshaj Gupta,
Kan Jen Cheng,
Hanjo Kim,
Zeyi Austin Li,
Martin Q. Ma,
Tingle Li,
Gopala Anumanchipalli,
Jiachen Lian
Abstract:
Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts. Capturing this causal pathway is key to building natural full-duplex interactive systems. We introduce a framework that enables reasoning over conversational behaviors by modeling this process as causal inference within a Graph-of-Thoughts (GoT). Our approach formalizes the intent-to-action pathway with a hierarchical labeling scheme, predicting high-level communicative intents and low-level speech acts to learn their causal and temporal dependencies. To train this system, we develop a hybrid corpus that pairs controllable, event-rich simulations with human-annotated rationales and real conversational speech. The GoT framework structures streaming predictions as an evolving graph, enabling a multimodal transformer to forecast the next speech act, generate concise justifications for its decisions, and dynamically refine its reasoning. Experiments on both synthetic and real duplex dialogues show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full-duplex spoken dialogue systems.
Submitted 25 December, 2025;
originally announced December 2025.
-
Explainable and Fine-Grained Safeguarding of LLM Multi-Agent Systems via Bi-Level Graph Anomaly Detection
Authors:
Junjun Pan,
Yixin Liu,
Rui Miao,
Kaize Ding,
Yu Zheng,
Quoc Viet Hung Nguyen,
Alan Wee-Chung Liew,
Shirui Pan
Abstract:
Large language model (LLM)-based multi-agent systems (MAS) have shown strong capabilities in solving complex tasks. As MAS become increasingly autonomous in various safety-critical tasks, detecting malicious agents has become a critical security concern. Although existing graph anomaly detection (GAD)-based defenses can identify anomalous agents, they mainly rely on coarse sentence-level information and overlook fine-grained lexical cues, leading to suboptimal performance. Moreover, the lack of interpretability in these methods limits their reliability and real-world applicability. To address these limitations, we propose XG-Guard, an explainable and fine-grained safeguarding framework for detecting malicious agents in MAS. To incorporate both coarse and fine-grained textual information for anomalous agent identification, we utilize a bi-level agent encoder to jointly model the sentence- and token-level representations of each agent. A theme-based anomaly detector further captures the evolving discussion focus in MAS dialogues, while a bi-level score fusion mechanism quantifies token-level contributions for explanation. Extensive experiments across diverse MAS topologies and attack scenarios demonstrate robust detection performance and strong interpretability of XG-Guard.
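The bi-level score fusion the abstract describes can be sketched as a simple combiner of a coarse sentence-level anomaly score with fine-grained token scores, where the per-token contributions double as the explanation. The function name, the softmax weighting, and the mixing weight `alpha` are illustrative assumptions here; in XG-Guard these components are learned.

```python
import math

def bilevel_score(sentence_score, token_scores, alpha=0.5):
    """Fuse a sentence-level anomaly score with token-level scores.

    Returns the fused agent-level score plus per-token contributions,
    which serve as a token-level explanation of the verdict.
    """
    # softmax over token scores: each token's share of the fine-grained signal
    m = max(token_scores)
    weights = [math.exp(t - m) for t in token_scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    # attention-weighted token-level score
    token_level = sum(w * t for w, t in zip(weights, token_scores))
    fused = alpha * sentence_score + (1 - alpha) * token_level
    contributions = [w * t for w, t in zip(weights, token_scores)]
    return fused, contributions
```

An agent whose message contains a few highly anomalous tokens then receives a higher fused score than one with uniformly bland tokens, even when both have the same sentence-level score.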
Submitted 21 December, 2025;
originally announced December 2025.
-
SCOPE: Prompt Evolution for Enhancing Agent Effectiveness
Authors:
Zehua Pei,
Hui-Ling Zhen,
Shixiong Kai,
Sinno Jialin Pan,
Yunhe Wang,
Mingxuan Yuan,
Bei Yu
Abstract:
Large Language Model (LLM) agents are increasingly deployed in environments that generate massive, dynamic contexts. However, a critical bottleneck remains: while agents have access to this context, their static prompts lack the mechanisms to manage it effectively, leading to recurring Corrective and Enhancement failures. To address this capability gap, we introduce \textbf{SCOPE} (Self-evolving Context Optimization via Prompt Evolution). SCOPE frames context management as an \textit{online optimization} problem, synthesizing guidelines from execution traces to automatically evolve the agent's prompt. We propose a Dual-Stream mechanism that balances tactical specificity (resolving immediate errors) with strategic generality (evolving long-term principles). Furthermore, we introduce Perspective-Driven Exploration to maximize strategy coverage, increasing the likelihood that the agent has the correct strategy for any given task. Experiments on the HLE benchmark show that SCOPE improves task success rates from 14.23\% to 38.64\% without human intervention. We make our code publicly available at https://github.com/JarvisPei/SCOPE.
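The dual-stream mechanism can be sketched as an online loop that folds guidelines synthesized from execution traces back into the agent's prompt, keeping a short-lived tactical stream and a persistent strategic stream separate. The class and field names below are illustrative, not SCOPE's actual interface.

```python
class ScopeStylePromptEvolver:
    """Sketch of SCOPE-style online prompt evolution.

    Tactical guidelines resolve immediate errors and are capped;
    strategic principles accumulate as long-term guidance.
    """

    def __init__(self, base_prompt, tactical_cap=3):
        self.base_prompt = base_prompt
        self.tactical = []    # short-lived fixes for recent errors
        self.strategic = []   # durable principles distilled from traces
        self.tactical_cap = tactical_cap

    def observe(self, trace):
        """Ingest one execution trace and update both streams."""
        if trace["failed"]:
            # tactical stream: a corrective guideline for this specific error
            self.tactical.append(f"Avoid: {trace['error']}")
            self.tactical = self.tactical[-self.tactical_cap:]
        if trace.get("lesson"):
            # strategic stream: a general principle, deduplicated
            if trace["lesson"] not in self.strategic:
                self.strategic.append(trace["lesson"])

    def prompt(self):
        """Render the evolved prompt for the next task."""
        parts = [self.base_prompt]
        if self.strategic:
            parts.append("Principles:\n" + "\n".join(self.strategic))
        if self.tactical:
            parts.append("Recent fixes:\n" + "\n".join(self.tactical))
        return "\n\n".join(parts)
```

Treating this as online optimization means the prompt after task *t* is the initial condition for task *t+1*, so no offline training pass over labeled data is needed.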
Submitted 17 December, 2025;
originally announced December 2025.
-
Memory in the Age of AI Agents
Authors:
Yuyang Hu,
Shichun Liu,
Yanwei Yue,
Guibin Zhang,
Boyang Liu,
Fangyi Zhu,
Jiahang Lin,
Honglin Guo,
Shihan Dou,
Zhiheng Xi,
Senjie Jin,
Jiejun Tan,
Yanbin Yin,
Jiongnan Liu,
Zeyu Zhang,
Zhongxiang Sun,
Yutao Zhu,
Hao Sun,
Boci Peng,
Zhenrong Cheng,
Xuanbo Fan,
Jiaxin Guo,
Xinlei Yu,
Zhenhong Zhou,
Zewen Hu
, et al. (22 additional authors not shown)
Abstract:
Memory has emerged as, and will remain, a core capability of foundation model-based agents. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, and evaluation protocols, while the proliferation of loosely defined memory terminologies has further obscured conceptual clarity. Traditional taxonomies such as long/short-term memory have proven insufficient to capture the diversity of contemporary agent memory systems. This work aims to provide an up-to-date landscape of current agent memory research. We begin by clearly delineating the scope of agent memory and distinguishing it from related concepts such as LLM memory, retrieval-augmented generation (RAG), and context engineering. We then examine agent memory through the unified lenses of forms, functions, and dynamics. From the perspective of forms, we identify three dominant realizations of agent memory, namely token-level, parametric, and latent memory. From the perspective of functions, we propose a finer-grained taxonomy that distinguishes factual, experiential, and working memory. From the perspective of dynamics, we analyze how memory is formed, evolved, and retrieved over time. To support practical development, we compile a comprehensive summary of memory benchmarks and open-source frameworks. Beyond consolidation, we articulate a forward-looking perspective on emerging research frontiers, including memory automation, reinforcement learning integration, multimodal memory, multi-agent memory, and trustworthiness issues. We hope this survey serves not only as a reference for existing work, but also as a conceptual foundation for rethinking memory as a first-class primitive in the design of future agentic intelligence.
Submitted 15 December, 2025;
originally announced December 2025.
-
The quermassintegral inequalities for horo-convex domains in the sphere
Authors:
Shujing Pan,
Julian Scheuer
Abstract:
We study a new notion of convexity for subsets of the unit sphere, which closely resembles the horo-convexity for subsets of the hyperbolic space. We call this notion, accordingly, horo-convexity. For horo-convex hypersurfaces of the unit sphere, we prove the smooth convergence of the classical Guan/Li flow of inverse type and use this result to prove the full set of quermassintegral inequalities for horo-convex hypersurfaces of the unit sphere.
Submitted 14 December, 2025;
originally announced December 2025.
-
JUNO's Impact on the Neutrino Mass Ordering from Lorentz Invariance Violation
Authors:
Tatiana Araya-Santander,
Cesar Bonilla,
Supriya Pan
Abstract:
We explore the potential of the Jiangmen Underground Neutrino Observatory (JUNO) to probe new physics by searching for Lorentz-invariance violation (LIV). Using the 59.1-day dataset recently released by this experiment, we analyze neutrino oscillations to place new constraints on the LIV parameters in the CPT-even ($c_{ee} - c_{e\mu}$, $c_{ee} - c_{e\tau}$) and CPT-odd ($a_{ee} - a_{e\mu}$, $a_{ee} - a_{e\tau}$) sectors. Our analysis reveals a significant shift in the oscillation parameter space of $\sin^2\theta_{12}-\Delta m^2_{21}$ when LIV is included; with the best-fit point for normal ordering moving to higher values of the solar angle $\theta_{12}$, a strong preference emerges for inverted mass ordering. In particular, the $c_{ee} - c_{e\tau}$ and $a_{ee} - a_{e\tau}$ sectors show the most pronounced effects. We report the most stringent bounds from JUNO to date on these LIV parameters, showcasing the detector's unique sensitivity to physics beyond the Standard Model.
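The $a$ and $c$ coefficient differences quoted in the abstract typically enter through the Standard-Model Extension effective Hamiltonian for neutrino propagation. A commonly used isotropic form (stated here as background; the paper's exact conventions are not shown in the abstract) is

```latex
% SME-style effective Hamiltonian in the flavor basis; only coefficient
% differences such as a_{ee}-a_{e\mu} are observable in oscillations.
H_{\mathrm{eff}} \;\simeq\; \frac{1}{2E}\,
  U\,\mathrm{diag}\!\left(0,\; \Delta m^2_{21},\; \Delta m^2_{31}\right) U^\dagger
  \;+\; a_L \;-\; \tfrac{4}{3}\,E\, c_L ,
```

where the CPT-odd $a_L$ term is energy-independent while the CPT-even $c_L$ term grows with $E$, which is what lets a fixed-baseline reactor measurement like JUNO disentangle the two sectors.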
Submitted 18 December, 2025; v1 submitted 12 December, 2025;
originally announced December 2025.
-
Beyond Two Parameters: Revisiting Dark Energy with the Latest Cosmic Probes
Authors:
Hanyu Cheng,
Supriya Pan,
Eleonora Di Valentino
Abstract:
Dark energy (DE) models with many free parameters are often considered excessive, as constraining all parameters poses a significant challenge. On the other hand, such models offer greater flexibility to probe the DE sector in more detail. With the rapid advancement of astronomical surveys and the availability of diverse datasets, it is timely to examine whether current combined observations can effectively constrain an extended parameter space in DE models. This article investigates a four-parameter dynamical dark energy (DDE) model that spans a broad region of the universe's expansion history through four key parameters: the present-day value of the DE equation of state ($w_0$), its initial value ($w_m$), the scale factor at which the transition from $w_m$ to $w_0$ occurs ($a_t$), and the steepness of this transition ($\Delta_{\rm de}$). We constrain the model using CMB data from Planck, BAO from DESI DR2, and three distinct compilations of Type Ia Supernovae: PantheonPlus, DESY5, and Union3. Our results show that constraining all four parameters remains difficult: $a_t$ is not constrained by any dataset, while the remaining three parameters can be constrained only when all observational probes are combined (with the exception of DESY5). The results further show that DE has a quintessential nature at present ($w_0 > -1$), while $w_m$ is negative, indicating a phantom-like behaviour at early times. Interestingly, despite its larger parameter space, the proposed DDE model is preferred over the $\Lambda$CDM scenario, based on both $\Delta\chi^2$ and Bayesian evidence, for certain combined datasets, particularly CMB+BAO+DESY5 and CMB+BAO+Union3.
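The four parameters describe a single transition in the equation of state. One common sigmoid realization consistent with this description (the paper's exact functional form may differ) is

```latex
% w(a) -> w_m for a << a_t (early times), w(a) -> w_0 for a >> a_t (today);
% \Delta_de sets how sharp the transition is.
w(a) \;=\; w_m \;+\;
  \frac{w_0 - w_m}{1 + \exp\!\left[-\,(a - a_t)/\Delta_{\rm de}\right]} ,
```

so that $w(a_t) = (w_0 + w_m)/2$ and the width of the transition in scale factor is of order $\Delta_{\rm de}$, matching the roles the abstract assigns to each parameter.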
Submitted 10 December, 2025;
originally announced December 2025.
-
Deterministic World Models for Verification of Closed-loop Vision-based Systems
Authors:
Yuang Geng,
Zhuoyang Zhou,
Zhongzheng Zhang,
Siyuan Pan,
Hoang-Dung Tran,
Ivan Ruchkin
Abstract:
Verifying closed-loop vision-based control systems remains a fundamental challenge due to the high dimensionality of images and the difficulty of modeling visual environments. While generative models are increasingly used as camera surrogates in verification, their reliance on stochastic latent variables introduces unnecessary overapproximation error. To address this bottleneck, we propose a Deterministic World Model (DWM) that maps system states directly to generative images, effectively eliminating uninterpretable latent variables to ensure precise input bounds. The DWM is trained with a dual-objective loss function that combines pixel-level reconstruction accuracy with a control difference loss to maintain behavioral consistency with the real system. We integrate DWM into a verification pipeline utilizing Star-based reachability analysis (StarV) and employ conformal prediction to derive rigorous statistical bounds on the trajectory deviation between the world model and the actual vision-based system. Experiments on standard benchmarks show that our approach yields significantly tighter reachable sets and better verification performance than a latent-variable baseline.
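The dual-objective training loss described above combines pixel-level reconstruction with a control-difference term that keeps the controller's behavior on generated images consistent with its behavior on real ones. A minimal sketch, where the mean-squared norms and the weight `lam` are illustrative assumptions rather than the paper's exact choices:

```python
def dwm_loss(pixels_real, pixels_gen, ctrl_real, ctrl_gen, lam=1.0):
    """Dual-objective DWM-style loss.

    pixels_*: flattened real vs. generated image values.
    ctrl_*:   controller outputs on the real vs. generated image.
    lam:      trade-off between reconstruction and behavioral consistency.
    """
    # pixel-level reconstruction accuracy (mean squared error)
    recon = sum((a - b) ** 2 for a, b in zip(pixels_real, pixels_gen)) / len(pixels_real)
    # control-difference loss: the controller should act the same
    # on the generated image as on the real one
    ctrl = sum((a - b) ** 2 for a, b in zip(ctrl_real, ctrl_gen)) / len(ctrl_real)
    return recon + lam * ctrl
```

The second term is what makes the world model useful for verification: a surrogate that reconstructs pixels well but provokes different control actions would give reachable sets for the wrong closed loop.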
Submitted 7 December, 2025;
originally announced December 2025.
-
A Unifying Human-Centered AI Fairness Framework
Authors:
Munshi Mahbubur Rahman,
Shimei Pan,
James R. Foulds
Abstract:
The increasing use of Artificial Intelligence (AI) in critical societal domains has amplified concerns about fairness, particularly regarding unequal treatment across sensitive attributes such as race, gender, and socioeconomic status. While there has been substantial work on ensuring AI fairness, navigating trade-offs between competing notions of fairness as well as predictive accuracy remains challenging, creating barriers to the practical deployment of fair AI systems. To address this, we introduce a unifying human-centered fairness framework that systematically covers eight distinct fairness metrics, formed by combining individual and group fairness, infra-marginal and intersectional assumptions, and outcome-based and equality-of-opportunity (EOO) perspectives. This structure allows stakeholders to align fairness interventions with their values and contextual considerations. The framework uses a consistent and easy-to-understand formulation for all metrics to reduce the learning curve for non-experts. Rather than privileging a single fairness notion, the framework enables stakeholders to assign weights across multiple fairness objectives, reflecting their priorities and facilitating multi-stakeholder compromises. We apply this approach to four real-world datasets: the UCI Adult census dataset for income prediction, the COMPAS dataset for criminal recidivism, the German Credit dataset for credit risk assessment, and the MEPS dataset for healthcare utilization. We show that adjusting weights reveals nuanced trade-offs between different fairness metrics. Finally, through case studies in judicial decision-making and healthcare, we demonstrate how the framework can inform practical and value-sensitive deployment of fair AI systems.
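The stakeholder-weighting idea can be sketched as a linear scalarization over the framework's fairness metrics plus accuracy. Treating each metric as a disparity where 0 means perfectly fair, and using a simple weighted sum, are illustrative assumptions here, not the paper's exact formulation:

```python
def weighted_fairness_objective(disparities, weights, accuracy, acc_weight=1.0):
    """Stakeholder-weighted trade-off between accuracy and fairness.

    disparities: metric name -> disparity value (0 = perfectly fair).
    weights:     metric name -> stakeholder priority (higher = cares more).
    Returns a scalar where higher is better.
    """
    assert set(disparities) == set(weights), "weights must cover every metric"
    total = sum(weights.values())
    # normalized, priority-weighted unfairness across the chosen metrics
    unfairness = sum(weights[m] * disparities[m] for m in weights) / total
    # reward predictive accuracy, penalize weighted unfairness
    return acc_weight * accuracy - unfairness
```

Different stakeholders supply different `weights` dictionaries (e.g. emphasizing intersectional group metrics over individual ones), and comparing the resulting objectives across candidate models surfaces exactly the trade-offs the paper studies.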
Submitted 7 December, 2025;
originally announced December 2025.
-
LexGenius: An Expert-Level Benchmark for Large Language Models in Legal General Intelligence
Authors:
Wenjin Liu,
Haoran Luo,
Xin Feng,
Xiang Ji,
Lijuan Zhou,
Rui Mao,
Jiapu Wang,
Shirui Pan,
Erik Cambria
Abstract:
Legal general intelligence (GI) refers to artificial intelligence (AI) that encompasses legal understanding, reasoning, and decision-making, simulating the expertise of legal experts across domains. However, existing benchmarks are result-oriented and fail to systematically evaluate the legal intelligence of large language models (LLMs), hindering the development of legal GI. To address this, we propose LexGenius, an expert-level Chinese legal benchmark for evaluating legal GI in LLMs. It follows a Dimension-Task-Ability framework, covering seven dimensions, eleven tasks, and twenty abilities. We use recent legal cases and exam questions to create multiple-choice questions with a combination of manual and LLM reviews to reduce data leakage risks, ensuring accuracy and reliability through multiple rounds of checks. We evaluate 12 state-of-the-art LLMs using LexGenius and conduct an in-depth analysis. We find significant disparities across legal intelligence abilities for LLMs, with even the best LLMs lagging behind human legal professionals. We believe LexGenius can assess the legal intelligence abilities of LLMs and enhance legal GI development. Our project is available at https://github.com/QwenQKing/LexGenius.
Submitted 4 December, 2025;
originally announced December 2025.
-
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
Authors:
DeepSeek-AI,
Aixin Liu,
Aoxue Mei,
Bangcai Lin,
Bing Xue,
Bingxuan Wang,
Bingzheng Xu,
Bochao Wu,
Bowei Zhang,
Chaofan Lin,
Chen Dong,
Chengda Lu,
Chenggang Zhao,
Chengqi Deng,
Chenhao Xu,
Chong Ruan,
Damai Dai,
Daya Guo,
Dejian Yang,
Deli Chen,
Erhang Li,
Fangqi Zhou,
Fangyun Lin,
Fucong Dai,
Guangbo Hao
, et al. (239 additional authors not shown)
Abstract:
We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 performs comparably to GPT-5. Notably, our high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro, achieving gold-medal performance in both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). (3) Large-Scale Agentic Task Synthesis Pipeline: To integrate reasoning into tool-use scenarios, we developed a novel synthesis pipeline that systematically generates training data at scale. This methodology facilitates scalable agentic post-training, yielding substantial improvements in generalization and instruction-following robustness within complex, interactive environments.
Submitted 2 December, 2025;
originally announced December 2025.
-
PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and Fuzz Optimization
Authors:
Mingzhe Li,
Renhao Zhang,
Zhiyang Wen,
Siqi Pan,
Bruno Castro da Silva,
Juan Zhai,
Shiqing Ma
Abstract:
Text-to-image (T2I) generative models such as Stable Diffusion and FLUX can synthesize realistic, high-quality images directly from textual prompts. The resulting image quality depends critically on well-crafted prompts that specify both subjects and stylistic modifiers, which have become valuable digital assets. However, the rising value and ubiquity of high-quality prompts expose them to security and intellectual-property risks. One key threat is the prompt stealing attack, i.e., the task of recovering the textual prompt that generated a given image. Prompt stealing enables unauthorized extraction and reuse of carefully engineered prompts, yet it can also support beneficial applications such as data attribution, model provenance analysis, and watermarking validation. Existing approaches often assume white-box gradient access, require large-scale labeled datasets for supervised training, or rely solely on captioning without explicit optimization, limiting their practicality and adaptability. To address these challenges, we propose PROMPTMINER, a black-box prompt stealing framework that decouples the task into two phases: (1) a reinforcement learning-based optimization phase to reconstruct the primary subject, and (2) a fuzzing-driven search phase to recover stylistic modifiers. Experiments across multiple datasets and diffusion backbones demonstrate that PROMPTMINER achieves superior results, with CLIP similarity up to 0.958 and textual alignment with SBERT up to 0.751, surpassing all baselines. Even when applied to in-the-wild images with unknown generators, it outperforms the strongest baseline by 7.5 percent in CLIP similarity, demonstrating better generalization. Finally, PROMPTMINER maintains strong performance under defensive perturbations, highlighting remarkable robustness. Code: https://github.com/aaFrostnova/PromptMiner
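The second phase, fuzzing-driven search for stylistic modifiers, can be sketched as a greedy mutate-and-accept loop over a modifier vocabulary. In the real system the `similarity` callable would wrap CLIP image similarity against the target image; here it is any scoring function, and all names below are illustrative assumptions:

```python
import random

def fuzz_modifiers(similarity, vocab, rounds=200, seed=0):
    """Fuzzing-style search for the modifier set maximizing `similarity`.

    similarity: frozenset of modifiers -> score (higher = closer image).
    vocab:      candidate stylistic modifiers to toggle in and out.
    """
    rng = random.Random(seed)
    best = frozenset()
    best_score = similarity(best)
    for _ in range(rounds):
        cand = set(best)
        # mutation: toggle one randomly chosen modifier in or out
        cand.symmetric_difference_update({rng.choice(vocab)})
        cand = frozenset(cand)
        score = similarity(cand)
        if score > best_score:  # greedy accept only on similarity gain
            best, best_score = cand, score
    return best, best_score
```

Because helpful modifiers are never removed once accepted and harmful ones are never accepted, the loop converges to the subset of `vocab` the scorer rewards, given enough rounds.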
Submitted 27 November, 2025;
originally announced November 2025.
-
A Longitudinal Measurement of Privacy Policy Evolution for Large Language Models
Authors:
Zhen Tao,
Shidong Pan,
Zhenchang Xing,
Emily Black,
Talia Gillis,
Chunyang Chen
Abstract:
Large language model (LLM) services have been rapidly integrated into people's daily lives as chatbots and agentic systems. They are nourished by collecting rich streams of data, raising privacy concerns around excessive collection of sensitive personal information. Privacy policies are the fundamental mechanism for informing users about data practices in the modern information privacy paradigm. Although traditional web and mobile policies are well studied, the privacy policies of LLM providers, their LLM-specific content, and their evolution over time remain largely underexplored. In this paper, we present the first longitudinal empirical study of privacy policies for mainstream LLM providers worldwide. We curate a chronological dataset of 74 historical privacy policies and 115 supplemental privacy documents from 11 LLM providers across 5 countries up to August 2025, and extract over 3,000 sentence-level edits between consecutive policy versions. We compare LLM privacy policies to those of other software formats, propose a taxonomy tailored to LLM privacy policies, annotate policy edits and align them with a timeline of key LLM ecosystem events. Results show that LLM privacy policies are substantially longer, demand college-level reading ability, and remain highly vague. Our taxonomy analysis reveals patterns in how providers disclose LLM-specific practices and highlights regional disparities in coverage. Policy edits are concentrated in first-party data collection and international/specific-audience sections, and product releases and regulatory actions are their primary drivers, shedding light on the status quo and the evolution of LLM privacy policies.
Submitted 24 November, 2025;
originally announced November 2025.
-
Can LLMs Threaten Human Survival? Benchmarking Potential Existential Threats from LLMs via Prefix Completion
Authors:
Yu Cui,
Yifei Liu,
Hang Fu,
Sicheng Pan,
Haibin Zhang,
Cong Zuo,
Licheng Wang
Abstract:
Research on the safety evaluation of large language models (LLMs) has become extensive, driven by jailbreak studies that elicit unsafe responses. Such responses involve information already available to humans, such as the answer to "how to make a bomb". When LLMs are jailbroken, the practical threat they pose to humans is negligible. However, it remains unclear whether LLMs commonly produce unpredictable outputs that could pose substantive threats to human safety. To address this gap, we study whether LLM-generated content contains potential existential threats, defined as outputs that imply or promote direct harm to human survival. We propose \textsc{ExistBench}, a benchmark designed to evaluate such risks. Each sample in \textsc{ExistBench} is derived from scenarios where humans are positioned as adversaries to AI assistants. Unlike existing evaluations, we use prefix completion to bypass model safeguards. This leads the LLMs to generate suffixes that express hostility toward humans or actions with severe threat, such as the execution of a nuclear strike. Our experiments on 10 LLMs reveal that LLM-generated content indicates existential threats. To investigate the underlying causes, we also analyze the attention logits from LLMs. To highlight real-world safety risks, we further develop a framework to assess model behavior in tool-calling. We find that LLMs actively select and invoke external tools with existential threats. Code and data are available at: https://github.com/cuiyu-ai/ExistBench.
Submitted 20 December, 2025; v1 submitted 24 November, 2025;
originally announced November 2025.
-
LLMAID: Identifying AI Capabilities in Android Apps with LLMs
Authors:
Pei Liu,
Terry Zhuo,
Jiawei Deng,
Thong James,
Shidong Pan,
Sherry Xu,
Zhenchang Xing,
Qinghua Lu,
Xiaoning Du,
Hongyu Zhang
Abstract:
Recent advancements in artificial intelligence (AI) and its widespread integration into mobile software applications have received significant attention, highlighting the growing prominence of AI capabilities in modern software systems. However, the inherent hallucination and reliability issues of AI continue to raise persistent concerns. Consequently, application users and regulators increasingly ask critical questions such as: Does the application incorporate AI capabilities? and What specific types of AI functionalities are embedded? Preliminary efforts have been made to identify AI capabilities in mobile software; however, existing approaches mainly rely on manual inspection and rule-based heuristics. These methods are not only costly and time-consuming but also struggle to adapt to advanced AI techniques.
To address the limitations of existing methods, we propose LLMAID (Large Language Model for AI Discovery). LLMAID includes four main tasks: (1) candidate extraction, (2) knowledge base interaction, (3) AI capability analysis and detection, and (4) AI service summarization. We apply LLMAID to a dataset of 4,201 Android applications and demonstrate that it identifies 242% more real-world AI apps than state-of-the-art rule-based approaches. Our experiments show that LLMAID achieves high precision and recall, both exceeding 90%, in detecting AI-related components. Additionally, a user study indicates that developers find the AI service summaries generated by LLMAID to be more informative and preferable to the original app descriptions. Finally, we leverage LLMAID to perform an empirical analysis of AI capabilities across Android apps. The results reveal a strong concentration of AI functionality in computer vision (54.80%), with object detection emerging as the most common task (25.19%).
Submitted 28 November, 2025; v1 submitted 24 November, 2025;
originally announced November 2025.
-
Domain-constrained Synthesis of Inconsistent Key Aspects in Textual Vulnerability Descriptions
Authors:
Linyi Han,
Shidong Pan,
Zhenchang Xing,
Sofonias Yitagesu,
Xiaowang Zhang,
Zhiyong Feng,
Jiamou Sun,
Qing Huang
Abstract:
Textual Vulnerability Descriptions (TVDs) are crucial for security analysts to understand and address software vulnerabilities. However, the key aspect inconsistencies in TVDs from different repositories pose challenges for achieving a comprehensive understanding of vulnerabilities. Existing approaches aim to mitigate inconsistencies by aligning TVDs with external knowledge bases, but they often discard valuable information and fail to synthesize comprehensive representations. In this paper, we propose a domain-constrained LLM-based synthesis framework for unifying key aspects of TVDs. Our framework consists of three stages: 1) Extraction, guided by rule-based templates to ensure all critical details are captured; 2) Self-evaluation, using domain-specific anchor words to assess semantic variability across sources; and 3) Fusion, leveraging information entropy to reconcile inconsistencies and prioritize relevant details. This framework improves synthesis performance, increasing the F1 score for key aspect augmentation from 0.82 to 0.87, while enhancing comprehension and efficiency by over 30%. We further develop Digest Labels, a practical tool for visualizing TVDs, which human evaluations show significantly boosts usability.
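The self-evaluation and fusion stages hinge on measuring cross-source disagreement; a toy sketch of that entropy idea follows. The aspect names, the threshold, and the majority-vote fusion rule are illustrative assumptions, not the paper's exact procedure:

```python
from collections import Counter
from math import log2

def label_entropy(values):
    # Shannon entropy of the label distribution across repositories.
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def fuse_aspects(candidates_by_aspect, threshold=1.0):
    """For each key aspect, keep the majority value across sources, and
    flag aspects whose cross-source entropy exceeds the threshold, i.e.
    aspects the sources genuinely disagree on and that need reconciling."""
    fused, flagged = {}, []
    for aspect, values in candidates_by_aspect.items():
        fused[aspect] = Counter(values).most_common(1)[0][0]
        if label_entropy(values) > threshold:
            flagged.append(aspect)
    return fused, flagged
```

A uniform three-way split has entropy log2(3) ≈ 1.585, so it is flagged, while unanimous sources have entropy 0 and pass through.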
Submitted 20 November, 2025;
originally announced November 2025.
-
Data-driven Acceleration of MPC with Guarantees
Authors:
Agustin Castellano,
Shijie Pan,
Enrique Mallada
Abstract:
Model Predictive Control (MPC) is a powerful framework for optimal control but can be too slow for low-latency applications. We present a data-driven framework to accelerate MPC by replacing online optimization with a nonparametric policy constructed from offline MPC solutions. Our policy is greedy with respect to a constructed upper bound on the optimal cost-to-go, and can be implemented as a nonparametric lookup rule that is orders of magnitude faster than solving MPC online. Our analysis shows that under a sufficient coverage condition on the offline data, the policy is recursively feasible and admits a provable, bounded optimality gap. These conditions establish an explicit trade-off between the amount of data collected and the tightness of the bounds. Our experiments show that this policy is between 100 and 1000 times faster than standard MPC, with only a modest hit to optimality, showing potential for real-time control tasks.
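The lookup rule can be sketched as follows, assuming each offline record stores a state, the MPC's first input, and the solved cost-to-go; the class name and the distance-penalized form of the cost-to-go upper bound are illustrative, not the paper's exact construction:

```python
import math

class LookupMPCPolicy:
    """Greedy nonparametric policy built from offline MPC solves.

    Each record is (state, first_input, cost_to_go). At query time the
    policy scores every record by its stored cost-to-go plus a distance
    penalty (a crude upper bound on the cost-to-go from the query state)
    and returns the input of the best-scoring record, so no optimization
    problem is solved online.
    """

    def __init__(self, states, inputs, costs_to_go, dist_weight=10.0):
        self.data = list(zip(states, inputs, costs_to_go))
        # Larger dist_weight trusts only nearby data, i.e. it trades
        # dataset coverage against tightness of the bound.
        self.dist_weight = dist_weight

    def act(self, x):
        def bound(record):
            s, _, c = record
            return c + self.dist_weight * math.dist(s, x)
        _, u, _ = min(self.data, key=bound)
        return u
```

With dense enough coverage the penalized bound approaches the true cost-to-go, mirroring the data-versus-tightness trade-off the abstract describes.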
Submitted 17 November, 2025;
originally announced November 2025.
-
Measurement of Exclusive $π^+$--argon Interactions Using ProtoDUNE-SP
Authors:
DUNE Collaboration,
S. Abbaslu,
A. Abed Abud,
R. Acciarri,
L. P. Accorsi,
M. A. Acero,
M. R. Adames,
G. Adamov,
M. Adamowski,
C. Adriano,
F. Akbar,
F. Alemanno,
N. S. Alex,
K. Allison,
M. Alrashed,
A. Alton,
R. Alvarez,
T. Alves,
A. Aman,
H. Amar,
P. Amedo,
J. Anderson,
D. A. Andrade,
C. Andreopoulos,
M. Andreotti
, et al. (1304 additional authors not shown)
Abstract:
We present the measurement of $π^{+}$--argon inelastic cross sections using the ProtoDUNE Single-Phase liquid argon time projection chamber in the incident $π^+$ kinetic energy range of 500 -- 800 MeV in multiple exclusive channels (absorption, charge exchange, and the remaining inelastic interactions). The results of this analysis are important inputs to simulations of liquid argon neutrino experiments such as the Deep Underground Neutrino Experiment and the Short Baseline Neutrino program at Fermi National Accelerator Laboratory. They will be employed to improve the modeling of final state interactions within neutrino event generators used by these experiments, as well as the modeling of $π^{+}$--argon secondary interactions within the liquid argon. This is the first measurement of $π^+$--argon absorption at this kinetic energy range as well as the first ever measurement of $π^{+}$--argon charge exchange.
Submitted 17 November, 2025;
originally announced November 2025.
-
First Measurement of $π^+$-Ar and $p$-Ar Total Inelastic Cross Sections in the Sub-GeV Energy Regime with ProtoDUNE-SP Data
Authors:
DUNE Collaboration,
S. Abbaslu,
F. Abd Alrahman,
A. Abed Abud,
R. Acciarri,
L. P. Accorsi,
M. A. Acero,
M. R. Adames,
G. Adamov,
M. Adamowski,
C. Adriano,
F. Akbar,
F. Alemanno,
N. S. Alex,
L. Aliaga Soplin,
K. Allison,
M. Alrashed,
A. Alton,
R. Alvarez,
T. Alves,
A. Aman,
H. Amar,
P. Amedo,
J. Anderson,
D. A. Andrade
, et al. (1327 additional authors not shown)
Abstract:
The ProtoDUNE-SP detector, a kiloton-scale prototype for the Deep Underground Neutrino Experiment (DUNE), is the largest liquid argon time projection chamber built to date. Operated at CERN from 2018 to 2020, it collected both cosmic-ray data and a beam consisting of positively-charged particles with discrete momentum settings across a range of 0.3 GeV/$c$ to 7 GeV/$c$. In this letter, we report the total inelastic cross section measurements for $π^+$-Ar and $p$-Ar interactions using selected $π^+$ and proton samples from the 1 GeV/$c$ beam data. These results provide the first measurement of the total inelastic cross sections for $π^+$-Ar in the 500-900 MeV kinetic energy range and for $p$-Ar below 450 MeV, both of which are directly relevant to the DUNE energy range. The measured cross sections are consistent with predictions and provide a dataset that was previously unavailable for argon targets. These measurements are essential for constraining neutrino-argon interaction models, which are crucial for the precision physics goals of the upcoming DUNE experiment.
Submitted 14 November, 2025;
originally announced November 2025.
-
Convergence analysis of inexact MBA method for constrained upper-$\mathcal{C}^2$ optimization problems
Authors:
Ruyu Liu,
Shaohua Pan
Abstract:
This paper concerns a class of constrained optimization problems in which the objective and constraint functions are both upper-$\mathcal{C}^2$. For such nonconvex and nonsmooth optimization problems, we develop an inexact moving balls approximation (MBA) method with a workable inexactness criterion for solving the subproblems. By leveraging a global error bound for the strongly convex program associated with parametric optimization problems, we establish the full convergence of the iterate sequence under the partial bounded multiplier property (BMP) and the Kurdyka-Łojasiewicz (KL) property of the constructed potential function, and achieve local convergence rates of the iterate and objective value sequences if the potential function satisfies the KL property of exponent $q\in[1/2,1)$. A verifiable condition is also provided to check whether the potential function satisfies the KL property of exponent $q\in[1/2,1)$ at a given critical point. To the best of our knowledge, this is the first implementable inexact MBA method with a full convergence certificate for constrained nonconvex and nonsmooth optimization problems.
Submitted 12 November, 2025;
originally announced November 2025.
-
Correcting False Alarms from Unseen: Adapting Graph Anomaly Detectors at Test Time
Authors:
Junjun Pan,
Yixin Liu,
Chuan Zhou,
Fei Xiong,
Alan Wee-Chung Liew,
Shirui Pan
Abstract:
Graph anomaly detection (GAD), which aims to detect outliers in graph-structured data, has received increasing research attention recently. However, existing GAD methods assume identical training and testing distributions, which is rarely valid in practice. In real-world scenarios, unseen but normal samples may emerge during deployment, leading to a normality shift that degrades the performance of GAD models trained on the original data. Through empirical analysis, we reveal that the degradation arises from (1) semantic confusion, where unseen normal samples are misinterpreted as anomalies due to their novel patterns, and (2) aggregation contamination, where the representations of seen normal nodes are distorted by unseen normal samples through message aggregation. While retraining or fine-tuning GAD models could be a potential solution to the above challenges, the high cost of model retraining and the difficulty of obtaining labeled data often render this approach impractical in real-world applications. To bridge the gap, we propose a lightweight and plug-and-play Test-time adaptation framework for correcting Unseen Normal pattErns (TUNE) in GAD. To address semantic confusion, a graph aligner is employed to align the shifted data with the original data at the graph attribute level. Moreover, we utilize the minimization of representation-level shift as a supervision signal to train the aligner, which leverages the estimated aggregation contamination as a key indicator of normality shift. Extensive experiments on 10 real-world datasets demonstrate that TUNE significantly enhances the generalizability of pre-trained GAD models to both synthetic and real unseen normal patterns.
Submitted 10 November, 2025;
originally announced November 2025.
-
A New Framework for Convex Clustering in Kernel Spaces: Finite Sample Bounds, Consistency and Performance Insights
Authors:
Shubhayan Pan,
Saptarshi Chakraborty,
Debolina Paul,
Kushal Bose,
Swagatam Das
Abstract:
Convex clustering is a well-regarded clustering method, resembling the centroid-based approach of Lloyd's $k$-means without requiring a predefined cluster count. It starts with each data point as its own centroid and iteratively merges them. Despite its advantages, this method can fail when dealing with data exhibiting linearly non-separable or non-convex structures. To mitigate these limitations, we propose a kernelized extension of the convex clustering method. This approach projects the data points into a Reproducing Kernel Hilbert Space (RKHS) using a feature map, enabling convex clustering in this transformed space. This kernelization not only allows for better handling of complex data distributions but also produces an embedding in a finite-dimensional vector space. We provide comprehensive theoretical underpinnings for our kernelized approach, proving algorithmic convergence and establishing finite sample bounds for our estimates. The effectiveness of our method is demonstrated through extensive experiments on both synthetic and real-world datasets, showing superior performance compared to state-of-the-art clustering techniques. This work marks a significant advancement in the field, offering an effective solution for clustering in non-linear and non-convex data scenarios.
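To see why a Gram matrix suffices, note that if each centroid is represented as $u_i=\sum_j A_{ij}\,φ(x_j)$, every RKHS norm in the convex clustering objective reduces to a quadratic form in $K$. The toy evaluation below assumes an RBF kernel and a uniform fusion penalty, and illustrates the objective only, not the paper's optimization algorithm:

```python
import math

def rbf_kernel(X, gamma=1.0):
    # Gram matrix K[i][j] = exp(-gamma * ||x_i - x_j||^2).
    n = len(X)
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
             for j in range(n)] for i in range(n)]

def cc_objective(K, A, lam):
    """Kernel convex clustering objective with centroids u_i = sum_j A[i][j] phi(x_j):
    0.5 * sum_i ||phi(x_i) - u_i||^2 + lam * sum_{i<j} ||u_i - u_j||,
    where every RKHS norm is evaluated through the Gram matrix K."""
    n = len(K)

    def inner(v, w):  # RKHS inner product of coefficient vectors: v^T K w
        return sum(v[i] * K[i][j] * w[j] for i in range(n) for j in range(n))

    fit = 0.0
    for i in range(n):
        # Coefficients of phi(x_i) - u_i in the span of the feature maps.
        e = [(1.0 if j == i else 0.0) - A[i][j] for j in range(n)]
        fit += 0.5 * inner(e, e)
    pen = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = [A[i][k] - A[j][k] for k in range(n)]
            pen += math.sqrt(max(inner(d, d), 0.0))
    return fit + lam * pen
```

With $A$ the identity (each centroid equal to its own feature map), the fit term vanishes and only the fusion penalty remains, which is zero exactly when points coincide in feature space.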
Submitted 7 November, 2025;
originally announced November 2025.
-
OriFeel: Origami-Inspired Actuation for Force-Based Tactile Feedback on Ambient Surfaces
Authors:
Shubham Rohal,
Shijia Pan
Abstract:
People are constantly in touch with surfaces in their lives, such as a sofa, armrest, and table, making them natural tactile interfaces. Despite the recent advancements in shape-changing surfaces, currently available solutions are often challenging to retrofit into ambient surfaces due to their bulky form factor or high power requirements. We present OriFeel, a foldable structure-enabled tactile feedback mechanism that leverages the structural properties of the Miura-Ori fold to enable on-surface force actuation. The foldable structure allows the surfaces to provide perpendicular force via lateral actuation, resulting in a slim form factor that can be actuated via a cable-based design using a servo motor. We evaluate the system with a real-world prototype and a user study. The user study shows that users can effectively distinguish multiple intensity levels.
Submitted 5 November, 2025;
originally announced November 2025.
-
Modular Task Decomposition and Dynamic Collaboration in Multi-Agent Systems Driven by Large Language Models
Authors:
Shuaidong Pan,
Di Wu
Abstract:
This paper addresses the limitations of a single agent in task decomposition and collaboration during complex task execution, and proposes a multi-agent architecture for modular task decomposition and dynamic collaboration based on large language models. The method first converts natural language task descriptions into unified semantic representations through a large language model. On this basis, a modular decomposition mechanism is introduced to break down the overall goal into multiple hierarchical sub-tasks. Then, dynamic scheduling and routing mechanisms enable reasonable division of labor and real-time collaboration among agents, allowing the system to adjust strategies continuously according to environmental feedback, thus maintaining efficiency and stability in complex tasks. Furthermore, a constraint parsing and global consistency mechanism is designed to ensure coherent connections between sub-tasks and balanced workload, preventing performance degradation caused by redundant communication or uneven resource allocation. The experiments validate the architecture across multiple dimensions, including task success rate, decomposition efficiency, sub-task coverage, and collaboration balance. The results show that the proposed method outperforms existing approaches in both overall performance and robustness, achieving a better balance between task complexity and communication overhead. In conclusion, this study demonstrates the effectiveness and feasibility of language-driven task decomposition and dynamic collaboration in multi-agent systems, providing a systematic solution for task execution in complex environments.
Submitted 2 November, 2025;
originally announced November 2025.
-
A Big Step Forward? A User-Centric Examination of iOS App Privacy Report and Enhancements
Authors:
Liu Wang,
Dong Wang,
Shidong Pan,
Zheng Jiang,
Haoyu Wang,
Yi Wang
Abstract:
The prevalent engagement with mobile apps underscores the importance of understanding their data practices. Transparency plays a crucial role in this context, ensuring that users are informed and give consent before any data access occurs. Apple introduced a new feature in iOS 15.2, App Privacy Report, to give users detailed insights into apps' data access and sharing. This feature continues Apple's trend of privacy-focused innovations (following Privacy Nutrition Labels), and has been marketed as a big step forward in user privacy. However, its real-world impacts on user privacy and control remain unexamined. We thus conducted an end-to-end study involving a systematic assessment of the App Privacy Report's real-world benefits and limitations, LLM-enabled and multi-technique synthesized enhancements, and comprehensive evaluation from both system and user perspectives. Through a structured focus group study with twelve everyday iOS users, we explored their experiences, understanding, and perceptions of the feature, finding that its practical impact is limited by missing important details. We identified two primary user concerns: the clarity of data access purpose and domain description. In response, we proposed enhancements including a purpose inference framework and a domain clarification pipeline. We demonstrated the effectiveness and benefits of such enhancements for mobile app users. This work provides practical insights that could help enhance user privacy transparency and discusses areas for future research.
Submitted 1 November, 2025;
originally announced November 2025.
-
Kimi Linear: An Expressive, Efficient Attention Architecture
Authors:
Kimi Team,
Yu Zhang,
Zongyu Lin,
Xingcheng Yao,
Jiaxi Hu,
Fanqing Meng,
Chengyin Liu,
Xin Men,
Songlin Yang,
Zhiyuan Li,
Wentao Li,
Enzhe Lu,
Weizhou Liu,
Yanru Chen,
Weixin Xu,
Longhui Yu,
Yejie Wang,
Yu Fan,
Longguang Zhong,
Enming Yuan,
Dehao Zhang,
Yizhi Zhang,
T. Y. Liu,
Haiming Wang,
Shengjun Fang
, et al. (35 additional authors not shown)
Abstract:
We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule.
We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA by a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times the decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths.
To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
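As a rough intuition for the building block named in the abstract, here is a per-token, pure-Python sketch of a gated delta-rule recurrence; the gate placement, all names, and the token-by-token loop are illustrative assumptions (the actual KDA uses a chunkwise DPLR formulation, not this scalar loop):

```python
def delta_step(S, k, v, q, g, beta):
    """One token of a gated delta-rule linear-attention recurrence (toy sketch).

    S    : d_v x d_k state matrix (list of lists), the finite RNN memory
    k, q : key/query vectors (length d_k)
    v    : value vector (length d_v)
    g    : per-dimension forget gates in (0, 1]; a finer-grained gate
           than a single scalar, here applied per value channel
    beta : write strength of the delta-rule update
    Returns (new_state, output).
    """
    d_v, d_k = len(S), len(S[0])
    # What the current memory predicts for key k: S @ k.
    pred = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # Delta rule: decay old memory, then write only the prediction error.
    S_new = [[g[i] * S[i][j] + beta * (v[i] - pred[i]) * k[j]
              for j in range(d_k)] for i in range(d_v)]
    out = [sum(S_new[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S_new, out
```

The error-driven write is what distinguishes the delta rule from plain additive linear attention: re-presenting an already-stored key/value pair leaves the memory unchanged instead of accumulating it again.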
Submitted 1 November, 2025; v1 submitted 30 October, 2025;
originally announced October 2025.
-
ShapeX: Shapelet-Driven Post Hoc Explanations for Time Series Classification Models
Authors:
Bosong Huang,
Ming Jin,
Yuxuan Liang,
Johan Barthelemy,
Debo Cheng,
Qingsong Wen,
Chenghao Liu,
Shirui Pan
Abstract:
Explaining time series classification models is crucial, particularly in high-stakes applications such as healthcare and finance, where transparency and trust play a critical role. Although numerous time series classification methods have identified key subsequences, known as shapelets, as core features for achieving state-of-the-art performance and validated their pivotal role in classification outcomes, existing post-hoc time series explanation (PHTSE) methods primarily focus on timestep-level feature attribution. These explanation methods overlook the fundamental prior that classification outcomes are predominantly driven by key shapelets. To bridge this gap, we present ShapeX, an innovative framework that segments time series into meaningful shapelet-driven segments and employs Shapley values to assess their saliency. At the core of ShapeX lies the Shapelet Describe-and-Detect (SDD) framework, which effectively learns a diverse set of shapelets essential for classification. We further demonstrate that ShapeX produces explanations that reveal causal relationships instead of just correlations, owing to the atomicity properties of shapelets. Experimental results on both synthetic and real-world datasets demonstrate that ShapeX outperforms existing methods in identifying the most relevant subsequences, enhancing both the precision and causal fidelity of time series explanations.
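The segment-level attribution at ShapeX's core can be illustrated with an exact (exponential-time) Shapley computation over a handful of segments; `value_fn` stands in for a real classifier plus a masking scheme, and ShapeX itself scores its learned shapelet-driven segments rather than arbitrary ones:

```python
from itertools import combinations
from math import factorial

def shapley_segment_saliency(n_segments, value_fn):
    """Exact Shapley attribution over time-series segments (toy sketch).

    value_fn(frozenset_of_segment_indices) -> model score when only the
    given segments are kept and the rest are masked. Exhaustive over all
    coalitions, so only feasible for a small number of segments.
    """
    players = list(range(n_segments))
    n = n_segments
    phi = [0.0] * n
    for i in players:
        others = [p for p in players if p != i]
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                # Classic Shapley weight |S|! (n-|S|-1)! / n!
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += w * (value_fn(frozenset(S) | {i}) - value_fn(frozenset(S)))
    return phi
```

For an additive value function the attribution recovers each segment's individual contribution, which is the sanity check any Shapley implementation should pass.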
Submitted 24 October, 2025; v1 submitted 22 October, 2025;
originally announced October 2025.
-
A New Type of Adversarial Examples
Authors:
Xingyang Nie,
Guojie Xiao,
Su Pan,
Biao Wang,
Huilin Ge,
Tao Fang
Abstract:
Most machine learning models are vulnerable to adversarial examples, which raises security concerns about these models. Adversarial examples are crafted by applying subtle but intentionally worst-case modifications to examples from the dataset, leading the model to output a different answer from the original example. In this paper, adversarial examples are formed in exactly the opposite manner: they are significantly different from the original examples but result in the same answer. We propose a novel set of algorithms to produce such adversarial examples, including the negative iterative fast gradient sign method (NI-FGSM) and the negative iterative fast gradient method (NI-FGM), along with their momentum variants: the negative momentum iterative fast gradient sign method (NMI-FGSM) and the negative momentum iterative fast gradient method (NMI-FGM). Adversarial examples constructed by these methods could be used to attack machine learning systems on certain occasions. Moreover, our results show that the adversarial examples are not merely distributed in the neighbourhood of the examples from the dataset; instead, they are distributed extensively in the sample space.
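A minimal sketch of the "negative" iterative idea: step against the loss gradient (the opposite sign of standard iterative FGSM), so the input drifts away from the original while the model's answer is preserved. The stopping rule and the toy `grad_fn`/`predict_fn` interfaces are assumptions, not the paper's exact NI-FGSM:

```python
def ni_fgsm(x, grad_fn, predict_fn, alpha=0.1, steps=50):
    """Iteratively move x AGAINST the loss gradient (note the minus sign,
    the reverse of standard FGSM ascent), stopping before the prediction
    would change, so the result stays in the original example's class
    while drifting away from it in input space."""
    y0 = predict_fn(x)
    for _ in range(steps):
        g = grad_fn(x)
        cand = [xi - alpha * (1 if gi > 0 else -1 if gi < 0 else 0)
                for xi, gi in zip(x, g)]
        if predict_fn(cand) != y0:  # keep the answer invariant
            break
        x = cand
    return x
```

Because each step lowers the loss for the original label, the trajectory can travel far from the starting example without ever crossing the decision boundary.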
Submitted 22 October, 2025;
originally announced October 2025.
-
Phantom scalar field with arbitrary potential: accelerating scaling attractors
Authors:
Sudip Halder,
Supriya Pan,
Paulo M. Sá,
Tapan Saha
Abstract:
In this article, we investigate the dynamics of a phantom scalar field with an arbitrary potential, focusing on accelerating scaling solutions of cosmological relevance. We consider both uncoupled and coupled cosmological scenarios. In the latter case, the coupling between phantom dark energy and dark matter is motivated by the warm inflationary paradigm, with the dissipation coefficient assumed to be either constant or variable. The evolution equations of our coupled and uncoupled cosmological models are written in the form of autonomous systems, whose stability is studied using methods of qualitative analysis of dynamical systems. For this analysis, the only requirement imposed on the phantom scalar-field potential is that a specific dynamical variable, defined in terms of the potential and its derivative, must be invertible. We show that the uncoupled phantom cosmological model cannot accommodate any accelerated scaling solution, while such solutions do exist in the coupled scenario, for both constant and variable dissipation coefficients. Although there is a limitation to these scaling solutions (specifically, the current stage of accelerated expansion is not preceded by a long enough matter-dominated era), our results show that the existence of a direct coupling between phantom dark energy and dark matter yields great potential for addressing the cosmic coincidence problem.
Submitted 20 October, 2025;
originally announced October 2025.
-
When AI Takes the Wheel: Security Analysis of Framework-Constrained Program Generation
Authors:
Yue Liu,
Zhenchang Xing,
Shidong Pan,
Chakkrit Tantithamthavorn
Abstract:
In recent years, the AI wave has grown rapidly in software development. Even novice developers can now design and generate complex framework-constrained software systems based on their high-level requirements with the help of Large Language Models (LLMs). However, when LLMs gradually "take the wheel" of software development, developers may only check whether the program works. They often miss security problems hidden in how the generated programs are implemented.
In this work, we investigate the security properties of framework-constrained programs generated by state-of-the-art LLMs. We focus specifically on Chrome extensions due to their complex security model involving multiple privilege boundaries and isolated components. To achieve this, we built ChromeSecBench, a dataset with 140 prompts based on known vulnerable extensions. We used these prompts to instruct nine state-of-the-art LLMs to generate complete Chrome extensions, and then analyzed them for vulnerabilities across three dimensions: scenario types, model differences, and vulnerability categories. Our results show that LLMs produced vulnerable programs at alarmingly high rates (18%-50%), particularly in Authentication & Identity and Cookie Management scenarios (up to 83% and 78% respectively). Most vulnerabilities exposed sensitive browser data like cookies, history, or bookmarks to untrusted code. Interestingly, we found that advanced reasoning models performed worse, generating more vulnerabilities than simpler models. These findings highlight a critical gap between LLMs' coding skills and their ability to write secure framework-constrained programs.
Submitted 12 November, 2025; v1 submitted 19 October, 2025;
originally announced October 2025.
-
A Preliminary Exploration of the Differences and Conjunction of Traditional PNT and Brain-inspired PNT
Authors:
Xu He,
Xiaolin Meng,
Wenxuan Yin,
Youdong Zhang,
Lingfei Mo,
Xiangdong An,
Fangwen Yu,
Shuguo Pan,
Yufeng Liu,
Jingnan Liu,
Yujia Zhang,
Wang Gao
Abstract:
Developing universal Positioning, Navigation, and Timing (PNT) is our enduring goal. Today's complex environments demand PNT that is more resilient, energy-efficient and cognitively capable. This paper asks how we can endow unmanned systems with brain-inspired spatial cognition navigation while exploiting the high precision of machine PNT to advance universal PNT. We provide a new perspective and roadmap for shifting PNT from "tool-oriented" to "cognition-driven". Contributions: (1) multi-level dissection of differences among traditional PNT, biological brain PNT and brain-inspired PNT; (2) a four-layer (observation-capability-decision-hardware) fusion framework that unites numerical precision and brain-inspired intelligence; (3) forward-looking recommendations for future development of brain-inspired PNT.
Submitted 19 October, 2025;
originally announced October 2025.
-
A Generalizable Rhetorical Strategy Annotation Model Using LLM-based Debate Simulation and Labelling
Authors:
Shiyu Ji,
Farnoosh Hashemi,
Joice Chen,
Juanwen Pan,
Weicheng Ma,
Hefan Zhang,
Sophia Pan,
Ming Cheng,
Shubham Mohole,
Saeed Hassanpour,
Soroush Vosoughi,
Michael Macy
Abstract:
Rhetorical strategies are central to persuasive communication, from political discourse and marketing to legal argumentation. However, analysis of rhetorical strategies has been limited by reliance on human annotation, which is costly, inconsistent, and difficult to scale. The associated datasets are often limited to specific topics and strategies, posing challenges for robust model development. We propose a novel framework that leverages large language models (LLMs) to automatically generate and label synthetic debate data based on a four-part rhetorical typology (causal, empirical, emotional, moral). We fine-tune transformer-based classifiers on this LLM-labeled dataset and validate their performance against human-labeled data on this dataset and on multiple external corpora. Our model achieves high performance and strong generalization across topical domains. We illustrate two applications of the fine-tuned model: (1) improving persuasiveness prediction by incorporating rhetorical strategy labels, and (2) analyzing temporal and partisan shifts in rhetorical strategies in U.S. Presidential debates (1960-2020), revealing an increased use of affective over cognitive argumentation.
Submitted 16 October, 2025;
originally announced October 2025.
-
A Generalized Placeability Metric for Model-Free Unified Pick-and-Place Reasoning
Authors:
Benno Wingender,
Nils Dengler,
Rohit Menon,
Sicong Pan,
Maren Bennewitz
Abstract:
Reliably picking and placing unknown objects under real-world sensing noise remains a challenging task, as existing methods rely on strong object priors (e.g., CAD models) or planar-support assumptions, limiting generalization and unified reasoning between grasping and placing. In this work, we introduce a generalized placeability metric that evaluates placement poses directly from noisy point clouds, without any shape priors. The metric jointly scores stability, graspability, and clearance. From raw geometry, we extract the object's support surfaces to generate diverse candidates for multi-orientation placement and sample contacts that satisfy collision and stability constraints. By conditioning grasp scores on each candidate placement, our method enables model-free unified pick-and-place reasoning and selects grasp-place pairs that lead to stable, collision-free placements. On unseen real objects and non-planar object supports, our metric delivers CAD-comparable accuracy in predicting stability loss and generally produces more physically plausible placements than learning-based predictors.
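The grasp-place pair selection described in the abstract can be sketched as a joint argmax over candidate placements and grasps. This is an illustrative sketch only: `select_grasp_place`, `place_score`, and `grasp_score` are hypothetical stand-ins, not the paper's actual stability/graspability/clearance metric.

```python
import itertools

def select_grasp_place(placements, grasps, place_score, grasp_score_given_place):
    """Choose the grasp-place pair with the best joint score, mirroring the
    idea of conditioning grasp scores on each candidate placement."""
    best, best_score = None, float("-inf")
    for p, g in itertools.product(placements, grasps):
        s = place_score(p) * grasp_score_given_place(g, p)
        if s > best_score:
            best, best_score = (g, p), s
    return best, best_score

# Toy stand-ins for the stability/graspability/clearance scoring.
place_score = {"upright": 0.9, "sideways": 0.6}.get
grasp_score = lambda g, p: 0.8 if (g == "top" and p == "upright") else 0.5

best_pair, best_score = select_grasp_place(
    ["upright", "sideways"], ["top", "side"], place_score, grasp_score)
print(best_pair)  # → ('top', 'upright')
```

Conditioning the grasp score on the placement (rather than scoring each independently) is what lets a single search select pairs that are jointly feasible.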
Submitted 16 October, 2025;
originally announced October 2025.
-
TED++: Submanifold-Aware Backdoor Detection via Layerwise Tubular-Neighbourhood Screening
Authors:
Nam Le,
Leo Yu Zhang,
Kewen Liao,
Shirui Pan,
Wei Luo
Abstract:
As deep neural networks power increasingly critical applications, stealthy backdoor attacks, where poisoned training inputs trigger malicious model behaviour while appearing benign, pose a severe security risk. Many existing defences are vulnerable when attackers exploit subtle distance-based anomalies or when clean examples are scarce. To meet this challenge, we introduce TED++, a submanifold-aware framework that effectively detects subtle backdoors that evade existing defences. TED++ begins by constructing a tubular neighbourhood around each class's hidden-feature manifold, estimating its local ``thickness'' from a handful of clean activations. It then applies Locally Adaptive Ranking (LAR) to detect any activation that drifts outside the admissible tube. By aggregating these LAR-adjusted ranks across all layers, TED++ captures how faithfully an input remains on the evolving class submanifolds. Based on such characteristic ``tube-constrained'' behaviour, TED++ flags inputs whose LAR-based ranking sequences deviate significantly. Extensive experiments are conducted on benchmark datasets and tasks, demonstrating that TED++ achieves state-of-the-art detection performance under both adaptive-attack and limited-data scenarios. Remarkably, even with only five held-out examples per class, TED++ still delivers near-perfect detection, achieving gains of up to 14\% in AUROC over the next-best method. The code is publicly available at https://github.com/namle-w/TEDpp.
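The layerwise rank-aggregation idea behind TED++ can be illustrated with a toy distance-rank score. The paper's actual tubular-neighbourhood thickness estimate and Locally Adaptive Ranking are more involved, so the names and logic below are simplified assumptions.

```python
import numpy as np

def layerwise_rank_score(sample_acts, clean_acts_per_layer):
    """At each layer, rank the test activation's nearest-neighbour distance
    to the clean activations against the clean samples' own leave-one-out
    nearest-neighbour distances, then sum the ranks across layers. A high
    aggregate rank means the input drifts off the clean submanifolds."""
    total_rank = 0
    for act, clean in zip(sample_acts, clean_acts_per_layer):
        d_test = np.linalg.norm(clean - act, axis=1).min()
        d_clean = [np.linalg.norm(np.delete(clean, i, axis=0) - clean[i], axis=1).min()
                   for i in range(len(clean))]
        total_rank += int(np.sum(np.array(d_clean) < d_test))
    return total_rank

rng = np.random.default_rng(0)
clean_layers = [rng.normal(size=(10, 4)) for _ in range(2)]  # 2 layers, 10 clean samples
benign = [layer[0] + 0.001 for layer in clean_layers]        # stays on the manifold
backdoor = [np.full(4, 10.0) for _ in range(2)]              # drifts far outside

print(layerwise_rank_score(backdoor, clean_layers) >
      layerwise_rank_score(benign, clean_layers))            # → True
```

Even this crude version shows why only a handful of clean activations per class can suffice: each layer contributes a rank relative to a small clean reference set.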
Submitted 16 October, 2025;
originally announced October 2025.
-
Knowledge Reasoning Language Model: Unifying Knowledge and Language for Inductive Knowledge Graph Reasoning
Authors:
Xingrui Zhuo,
Jiapu Wang,
Gongqing Wu,
Zhongyuan Wang,
Jichen Zhang,
Shirui Pan,
Xindong Wu
Abstract:
Inductive Knowledge Graph Reasoning (KGR) aims to discover facts in open-domain KGs containing unknown entities and relations, which poses a challenge for KGR models in comprehending uncertain KG components. Existing studies have proposed Knowledge Graph Foundation Models (KGFMs) that learn structural invariances across KGs to handle this uncertainty. Recently, Large Language Models (LLMs) have demonstrated strong capabilities for open-domain knowledge reasoning. As a result, the latest research has focused on LLM-based KGFMs that integrate LLM knowledge with KG context for inductive KGR. However, the intrinsic knowledge of LLMs may be overshadowed by sparse KG context, leading to LLM knowledge distortion, which can cause irreversible damage to model reasoning. Moreover, existing LLM-based KGR methods still struggle to fully constrain generative hallucinations in LLMs, severely limiting the credibility of reasoning results. To address these limitations, we propose a Knowledge Reasoning Language Model (KRLM) that achieves unified coordination between LLM knowledge and KG context throughout the KGR process. Specifically, we design a Knowledge Reasoning Language (KRL) instruction format and a KRL tokenizer to align LLM knowledge with KG representations. Then, we propose a KRL attention layer that coordinates intrinsic LLM knowledge with additional KG context through a dynamic knowledge memory mechanism. Finally, a structure-aware next-entity predictor is proposed, which strictly constrains the reasoning results within a trustworthy knowledge domain. Extensive experimental results on 25 real-world inductive KGR datasets demonstrate the significant superiority of the proposed KRLM\footnote{Our source code is available at https://anonymous.4open.science/r/KRLM-EA36.} in both zero-shot reasoning and fine-tuning scenarios.
Submitted 14 October, 2025;
originally announced October 2025.
-
High Bandwidth and Ultra-low Dark Current Ge Photodetector Enabled by Frequency Domain Equalization
Authors:
Wenxin Deng,
Hengsong Yue,
Xiaoyan Liu,
Jianhong Liang,
Jianbin Fu,
Shilong Pan,
Tao Chu
Abstract:
High bandwidth and low dark current germanium (Ge) photodetectors are crucial in silicon photonic integrated circuits. The bandwidth of Ge photodetectors is restricted by carrier transit time and parasitic parameters, and thermal generation of carriers within the Ge P-N junction results in an inherent dark current, typically in the nA-μA range. Here, we propose an equalization photodetector (EqPD) that utilizes the frequency response of a high-bandwidth photodetector PDA and subtracts from it the frequency response of a low-bandwidth photodetector PDB. Because the response of PDB attenuates more severely than that of PDA at high frequency, the differential response (the response of the EqPD) attains higher values at high frequencies than at low frequencies. The dark current of the EqPD is also significantly reduced, with PDB balancing the dark current of PDA. Experimental results show that the bandwidth of our proposed photodetector can be expanded to over 110 GHz with a simultaneous dark current of 1 pA, and its Non-Return-to-Zero (NRZ) transmission speed can reach 100 Gbaud without digital signal processing. To the best of our knowledge, this represents the highest bandwidth and lowest dark current in a vertical Ge photodetector. The high-performance EqPD provides a promising solution for high-speed and ultra-low-noise photodetection in next-generation optical communication.
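The differential-response principle can be sketched numerically, assuming simple first-order low-pass responses and hypothetical 3 dB bandwidths; the real device responses are more complex than this toy model.

```python
import numpy as np

def lowpass(f, f3db):
    """Magnitude response of a first-order low-pass system."""
    return 1.0 / np.sqrt(1.0 + (f / f3db) ** 2)

f = np.linspace(1e9, 120e9, 500)     # 1-120 GHz sweep
h_a = lowpass(f, 60e9)               # high-bandwidth PDA (hypothetical 60 GHz)
h_b = 0.9 * lowpass(f, 15e9)         # low-bandwidth PDB (hypothetical 15 GHz), slightly attenuated

# Differential (equalized) response: PDB rolls off faster, so subtracting it
# leaves a response that is larger at high frequencies than at low frequencies.
h_eq = h_a - h_b
print(h_eq[-1] > h_eq[0])  # → True
```

The same subtraction cancels the common-mode dark current, which is why the equalized detector can reach pA-level dark current.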
Submitted 15 October, 2025;
originally announced October 2025.
-
Dynamic Topic Evolution with Temporal Decay and Attention in Large Language Models
Authors:
Di Wu,
Shuaidong Pan
Abstract:
This paper proposes a modeling framework for dynamic topic evolution based on temporal large language models. The method first uses a large language model to obtain contextual embeddings of text and then introduces a temporal decay function and an attention mechanism. These components allow the model to adjust the importance of semantic units according to time intervals and capture topic variations across different periods. The temporal representations are then mapped into a latent topic space, where a state transition matrix is applied to describe the dynamic evolution of topics. A joint optimization objective constrains both semantic modeling and temporal consistency, ensuring diversity and smoothness in topic generation. The design emphasizes the unified modeling of semantic representation and temporal evolution, which improves topic coherence and diversity while enhancing stability and interpretability over time. Experiments on real-world corpora show that the framework effectively captures the generation, expansion, and decline of topics and outperforms existing models across multiple metrics. Overall, the proposed method provides a systematic solution for understanding dynamic semantic patterns in large-scale text, enriches the research paradigm of topic modeling, and supports complex text analysis tasks in multiple domains.
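The combination of a temporal decay function with attention can be sketched as follows; the function name, the decay rate, and the exact way decay enters the attention weights are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def temporal_decay_attention(query, keys, values, timestamps, t_now, lam=0.5):
    """Scaled dot-product attention whose weights are modulated by an
    exponential temporal decay exp(-lam * (t_now - t_i)), so older semantic
    units contribute less to the pooled topic representation."""
    scores = keys @ query / np.sqrt(query.size)
    scores -= scores.max()                                   # numerical stability
    w = np.exp(scores) * np.exp(-lam * (t_now - timestamps)) # decay-modulated weights
    w /= w.sum()                                             # renormalize to a distribution
    return w @ values, w

rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 8))                               # 4 semantic units
values = rng.normal(size=(4, 8))
timestamps = np.array([0.0, 1.0, 2.0, 3.0])                  # unit 3 is most recent

context, w = temporal_decay_attention(keys[0], keys, values, timestamps, t_now=3.0)
print(round(float(w.sum()), 6))  # → 1.0
```

The pooled `context` vector would then be projected into the latent topic space, where a state transition matrix models topic evolution between periods.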
Submitted 2 November, 2025; v1 submitted 12 October, 2025;
originally announced October 2025.
-
Is Dark Energy Changing? Probing the Universe's Expansion with present and future astronomical probes
Authors:
Mehdi Rezaei,
Supriya Pan,
Weiqiang Yang,
David F. Mota
Abstract:
This study explores the possibility of a time-varying dark energy (DE) equation of state (EoS) deviating from -1. We employ a comprehensive dataset of standard astronomical probes (Type Ia supernovae, baryon acoustic oscillations, Big Bang nucleosynthesis, Hubble data, and Planck 2018 CMB) alongside future mock gravitational wave (GW) distance measurements from the Einstein Telescope. We utilize the Padé approximation, a versatile framework encompassing well-known DE models such as the constant EoS, the Chevallier-Polarski-Linder parametrization, and other time-evolving DE parametrizations. Within the Padé parametrization, we examine three specific forms (Padé-I, SPadé-I, Padé-II) applied to both spatially flat and non-flat universes. Padé-II exhibits particularly interesting features in terms of the evidence for dynamical DE at many standard deviations. Our results can be summarized as follows. Flat universe: when analyzing the combined dataset of standard probes (including CMB) with Padé-II in a flat universe, we find a strong preference (6.4σ) for a dynamical (time-varying) DE EoS. This preference remains significant (4.7σ) even when incorporating future GW data. Non-flat universe: in a non-flat universe, the combined standard datasets (without or with CMB) also indicate a dynamical DE EoS at high confidence (6.2σ and 6.4σ, respectively). The addition of GW data slightly reduces the evidence (3.8σ and 5.1σ, respectively), but the preference persists. These results collectively suggest a robust case for dynamical DE in the dark sector. While a non-flat universe is not strongly favored, Padé-II hints at a possible closed universe when CMB data are included (with or without GW data).
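A Padé parametrization of the EoS can be illustrated with the order-(1,1) rational form below; the exact Padé-I, SPadé-I, and Padé-II forms used in the paper may differ, so treat this as a hedged sketch of the general idea.

```latex
% Illustrative order-(1,1) Padé form for the dark-energy EoS:
w(z) = \frac{w_0 + w_1 z}{1 + w_2 z},
\qquad w(0) = w_0,
\qquad \lim_{z \to \infty} w(z) = \frac{w_1}{w_2}.
% Setting w_2 = 1 and w_1 = w_0 + w_a recovers the CPL parametrization:
% w(z) = w_0 + w_a \frac{z}{1 + z}.
```

The rational structure is what lets a single family interpolate between a constant EoS, CPL, and other time-evolving parametrizations while remaining bounded at both low and high redshift.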
Submitted 22 December, 2025; v1 submitted 10 October, 2025;
originally announced October 2025.
-
Few-shot Molecular Property Prediction: A Survey
Authors:
Zeyu Wang,
Tianyi Jiang,
Huanchang Ma,
Yao Lu,
Xiaoze Bao,
Shanqing Yu,
Qi Xuan,
Shirui Pan,
Xin Zheng
Abstract:
AI-assisted molecular property prediction has become a promising technique in early-stage drug discovery and materials design in recent years. However, due to high-cost and complex wet-lab experiments, real-world molecules usually suffer from scarce annotations, leaving limited labeled data for effective supervised AI model learning. In light of this, few-shot molecular property prediction (FSMPP) has emerged as an expressive paradigm that enables learning from only a few labeled examples. Despite rapidly growing attention, existing FSMPP studies remain fragmented, without a coherent framework to capture methodological advances and domain-specific challenges. In this work, we present the first comprehensive and systematic survey of few-shot molecular property prediction. We begin by analyzing the few-shot phenomenon in molecular datasets and highlighting two core challenges: (1) cross-property generalization under distribution shifts, where each task, corresponding to a distinct property, may follow a different data distribution or even be only weakly related to the others from a biochemical perspective, requiring the model to transfer knowledge across heterogeneous prediction tasks; and (2) cross-molecule generalization under structural heterogeneity, where molecules involved in different or the same properties may exhibit significant structural diversity, making it difficult for models to generalize. We then introduce a unified taxonomy that organizes existing methods into data, model, and learning-paradigm levels, reflecting their strategies for extracting knowledge from scarce supervision in few-shot molecular property prediction. Next, we compare representative methods and summarize benchmark datasets and evaluation protocols. Finally, we identify key trends and future directions for advancing continued research on FSMPP.
Submitted 9 October, 2025;
originally announced October 2025.
-
Identification of low-energy kaons in the ProtoDUNE-SP detector
Authors:
DUNE Collaboration,
S. Abbaslu,
F. Abd Alrahman,
A. Abed Abud,
R. Acciarri,
L. P. Accorsi,
M. A. Acero,
M. R. Adames,
G. Adamov,
M. Adamowski,
C. Adriano,
F. Akbar,
F. Alemanno,
N. S. Alex,
K. Allison,
M. Alrashed,
A. Alton,
R. Alvarez,
T. Alves,
A. Aman,
H. Amar,
P. Amedo,
J. Anderson,
D. A. Andrade,
C. Andreopoulos
, et al. (1325 additional authors not shown)
Abstract:
The Deep Underground Neutrino Experiment (DUNE) is a next-generation neutrino experiment with a rich physics program that includes searches for the hypothetical phenomenon of proton decay. Utilizing liquid-argon time-projection chamber technology, DUNE is expected to achieve world-leading sensitivity in the proton decay channels that involve charged kaons in their final states. The first DUNE demonstrator, ProtoDUNE Single-Phase, was a 0.77 kt detector that operated from 2018 to 2020 at the CERN Neutrino Platform, exposed to a mixed hadron and electron test-beam with momenta ranging from 0.3 to 7 GeV/c. We present a selection of low-energy kaons among the secondary particles produced in hadronic reactions, using data from the 6 and 7 GeV/c beam runs. The selection efficiency is 1\% and the sample purity 92\%. The initial energies of the selected kaon candidates encompass the expected energy range of kaons originating from proton decay events in DUNE (below $\sim$200 MeV). In addition, we demonstrate the capability of this detector technology to discriminate between kaons and other particles such as protons and muons, and provide a comprehensive description of their energy loss in liquid argon, which shows good agreement with the simulation. These results pave the way for future proton decay searches at DUNE.
Submitted 9 October, 2025;
originally announced October 2025.
-
MetaDefense: Defending Finetuning-based Jailbreak Attack Before and During Generation
Authors:
Weisen Jiang,
Sinno Jialin Pan
Abstract:
This paper introduces MetaDefense, a novel framework for defending against finetuning-based jailbreak attacks in large language models (LLMs). We observe that existing defense mechanisms fail to generalize to harmful queries disguised by unseen attack templates, despite LLMs being capable of distinguishing disguised harmful queries in the embedding space. Based on these insights, we propose a two-stage defense approach: (i) pre-generation defense that detects harmful queries before response generation begins, and (ii) mid-generation defense that monitors partial responses during generation to prevent outputting more harmful content. Our MetaDefense trains the LLM to predict the harmfulness of both queries and partial responses using specialized prompts, enabling early termination of potentially harmful interactions. Extensive experiments across multiple LLM architectures (LLaMA-2-7B, Qwen-2.5-3B-Instruct, and LLaMA-3.2-3B-Instruct) demonstrate that MetaDefense significantly outperforms existing defense mechanisms, achieving robust defense against harmful queries with seen and unseen attack templates while maintaining competitive performance on benign tasks. Code is available at https://github.com/ws-jiang/MetaDefense.
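The two-stage defense can be sketched as a small control loop. Here `is_harmful` stands in for the trained harmfulness predictor and the token generator is a toy, so treat this as a minimal sketch of the control flow rather than MetaDefense itself.

```python
def metadefense_sketch(query, generate_step, is_harmful, max_tokens=64, check_every=8):
    """Two-stage defense sketch: (i) screen the query before any generation
    begins, (ii) re-screen the growing partial response every few tokens and
    terminate early if it becomes harmful."""
    if is_harmful(query):                                         # pre-generation defense
        return "I can't help with that."
    response = ""
    for i in range(max_tokens):
        response += generate_step(response)
        if (i + 1) % check_every == 0 and is_harmful(response):   # mid-generation defense
            return response + " [stopped]"
    return response

# Toy stand-ins: a keyword screen and a trivially benign generator.
harmful = lambda text: "bomb" in text
gen = lambda partial: "ok "

print(metadefense_sketch("how to bake bread", gen, harmful, max_tokens=4, check_every=2))
print(metadefense_sketch("how to build a bomb", gen, harmful, max_tokens=4, check_every=2))
```

Checking the partial response periodically (rather than only the query) is what catches harmful content that emerges mid-generation from a disguised query.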
Submitted 9 October, 2025;
originally announced October 2025.
-
Temporal-Prior-Guided View Planning for Periodic 3D Plant Reconstruction
Authors:
Sicong Pan,
Xuying Huang,
Maren Bennewitz
Abstract:
Periodic 3D reconstruction is essential for crop monitoring, but costly when each cycle restarts from scratch, wasting resources and ignoring information from previous captures. We propose temporal-prior-guided view planning for periodic plant reconstruction, in which a previously reconstructed model of the same plant is non-rigidly aligned to a new partial observation to form an approximation of the current geometry. To accommodate plant growth, we inflate this approximation and solve a set covering optimization problem to compute a minimal set of views. We integrate this method into a complete pipeline that acquires one additional next-best view before registration for robustness, then plans a globally shortest path connecting the planned views and outputs the resulting view sequence. Experiments on maize and tomato under hemisphere and sphere view spaces show that our system maintains or improves surface coverage while requiring fewer views and comparable movement cost compared to state-of-the-art baselines.
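The set covering step can be approximated greedily, as sketched below; the view-to-patch visibility sets are toy stand-ins, and the paper's exact solver may differ.

```python
def greedy_view_cover(candidate_views, surface_patches):
    """Greedy approximation to the set covering step: repeatedly pick the
    view that covers the most still-uncovered surface patches."""
    uncovered = set(surface_patches)
    plan = []
    while uncovered:
        best = max(candidate_views, key=lambda v: len(candidate_views[v] & uncovered))
        gain = candidate_views[best] & uncovered
        if not gain:                       # remaining patches are unreachable
            break
        plan.append(best)
        uncovered -= gain
    return plan

# Toy example: six surface patches, three candidate viewpoints.
views = {"v1": {0, 1, 2}, "v2": {2, 3}, "v3": {3, 4, 5}}
print(greedy_view_cover(views, range(6)))  # → ['v1', 'v3']
```

The selected views would then be connected by a shortest tour to minimize the robot's movement cost.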
Submitted 8 October, 2025;
originally announced October 2025.
-
CDTP: A Large-Scale Chinese Data-Text Pair Dataset for Comprehensive Evaluation of Chinese LLMs
Authors:
Chengwei Wu,
Jiapu Wang,
Mingyang Gao,
Xingrui Zhuo,
Jipeng Guo,
Runlin Lei,
Haoran Luo,
Tianyu Chen,
Haoyi Zhou,
Shirui Pan,
Zechao Li
Abstract:
Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks. However, Chinese LLMs face unique challenges, primarily due to the dominance of unstructured free text and the lack of structured representations in Chinese corpora. While existing benchmarks for LLMs partially assess Chinese LLMs, they are still predominantly English-centric and fail to address the unique linguistic characteristics of Chinese, lacking structured datasets essential for robust evaluation. To address these challenges, we present a Comprehensive Benchmark for Evaluating Chinese Large Language Models (CB-ECLLM) based on the newly constructed Chinese Data-Text Pair (CDTP) dataset. Specifically, CDTP comprises over 7 million aligned text pairs, each consisting of unstructured text coupled with one or more corresponding triples, alongside a total of 15 million triples spanning four critical domains. The core contributions of CDTP are threefold: (i) enriching Chinese corpora with high-quality structured information; (ii) enabling fine-grained evaluation tailored to knowledge-driven tasks; and (iii) supporting multi-task fine-tuning to assess generalization and robustness across scenarios, including Knowledge Graph Completion, Triple-to-Text generation, and Question Answering. Furthermore, we conduct rigorous evaluations through extensive experiments and ablation studies to assess the benchmark's effectiveness, behavior under Supervised Fine-Tuning (SFT), and robustness. To support reproducible research, we offer an open-source codebase and outline potential directions for future investigations based on our insights.
Submitted 7 October, 2025;
originally announced October 2025.
-
VortexPIA: Indirect Prompt Injection Attack against LLMs for Efficient Extraction of User Privacy
Authors:
Yu Cui,
Sicheng Pan,
Yifei Liu,
Haibin Zhang,
Cong Zuo
Abstract:
Large language models (LLMs) have been widely deployed in Conversational AIs (CAIs) while exposing users to privacy and security threats. Recent research shows that LLM-based CAIs can be manipulated to extract private information from human users, posing serious security threats. However, the methods proposed in that study rely on a white-box setting in which adversaries can directly modify the system prompt. This condition is unlikely to hold in real-world deployments. This limitation raises a critical question: can unprivileged attackers still induce such privacy risks in practical LLM-integrated applications? To address this question, we propose \textsc{VortexPIA}, a novel indirect prompt injection attack that induces privacy extraction in LLM-integrated applications under black-box settings. By injecting token-efficient data containing false memories, \textsc{VortexPIA} misleads LLMs into actively requesting private information in batches. Unlike prior methods, \textsc{VortexPIA} allows attackers to flexibly define multiple categories of sensitive data. We evaluate \textsc{VortexPIA} on six LLMs, covering both traditional and reasoning LLMs, across four benchmark datasets. The results show that \textsc{VortexPIA} significantly outperforms baselines and achieves state-of-the-art (SOTA) performance. It also demonstrates efficient privacy requests, reduced token consumption, and enhanced robustness against defense mechanisms. We further validate \textsc{VortexPIA} on multiple realistic open-source LLM-integrated applications, demonstrating its practical effectiveness.
Submitted 5 October, 2025;
originally announced October 2025.
-
Abstain and Validate: A Dual-LLM Policy for Reducing Noise in Agentic Program Repair
Authors:
José Cambronero,
Michele Tufano,
Sherry Shi,
Renyao Wei,
Grant Uy,
Runxiang Cheng,
Chin-Jung Liu,
Shiying Pan,
Satish Chandra,
Pat Rondon
Abstract:
Agentic Automated Program Repair (APR) is increasingly tackling complex, repository-level bugs in industry, but ultimately agent-generated patches still need to be reviewed by a human before committing them to ensure they address the bug. Showing unlikely patches to developers can lead to substantial noise, wasting valuable developer time and eroding trust in automated code changes. We introduce two complementary LLM-based policies to reduce such noise: bug abstention and patch validation policies. Bug abstention excludes bugs that the agentic APR system is unlikely to fix. Patch validation rejects patches that are unlikely to be a good fix for the given bug. We evaluate both policies on three sets of bugs from Google's codebase, and their candidate patches generated by an internal agentic APR system. On a set of 174 human-reported bugs, removing bugs and patch trajectories rejected by our policies can raise success rates by up to 13 percentage points and 15 percentage points, respectively, and by up to 39 percentage points in combination. On null pointer exceptions and sanitizer-reported bugs with machine-generated bug reports, patch validation also improves average single-sample success rates. This two-policy approach provides a practical path to the reliable, industrial-scale deployment of agentic APR systems.
Submitted 3 October, 2025;
originally announced October 2025.
-
Trustworthy Summarization via Uncertainty Quantification and Risk Awareness in Large Language Models
Authors:
Shuaidong Pan,
Di Wu
Abstract:
This study addresses the reliability of automatic summarization in high-risk scenarios and proposes a large language model framework that integrates uncertainty quantification and risk-aware mechanisms. Starting from the demands of information overload and high-risk decision-making, a conditional generation-based summarization model is constructed, and Bayesian inference is introduced during generation to model uncertainty in the parameter space, which helps avoid overconfident predictions. The uncertainty level of the generated content is measured using predictive distribution entropy, and a joint optimization of entropy regularization and risk-aware loss is applied to ensure that key information is preserved and risk attributes are explicitly expressed during information compression. On this basis, the model incorporates risk scoring and regulation modules, allowing summaries to cover the core content accurately while enhancing trustworthiness through explicit risk-level prompts. Comparative experiments and sensitivity analyses verify that the proposed method significantly improves the robustness and reliability of summarization in high-risk applications while maintaining fluency and semantic integrity. This research provides a systematic solution for trustworthy summarization and demonstrates both scalability and practical value at the methodological level.
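The entropy-based uncertainty measure can be illustrated directly; this is a generic predictive-entropy sketch, not the paper's Bayesian implementation.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy of a predictive distribution; higher values indicate
    a more uncertain prediction, which can trigger risk-aware handling."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

confident = [0.97, 0.01, 0.01, 0.01]   # sharply peaked prediction
uncertain = [0.25, 0.25, 0.25, 0.25]   # flat, maximally uncertain prediction

print(predictive_entropy(uncertain) > predictive_entropy(confident))  # → True
```

In a risk-aware pipeline, summaries whose average token entropy exceeds a threshold could be flagged with an explicit risk-level prompt rather than presented as fully trusted.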
Submitted 23 September, 2025;
originally announced October 2025.
-
Nephrobase Cell+: Multimodal Single-Cell Foundation Model for Decoding Kidney Biology
Authors:
Chenyu Li,
Elias Ziyadeh,
Yash Sharma,
Bernhard Dumoulin,
Jonathan Levinsohn,
Eunji Ha,
Siyu Pan,
Vishwanatha Rao,
Madhav Subramaniyam,
Mario Szegedy,
Nancy Zhang,
Katalin Susztak
Abstract:
Background: Large foundation models have revolutionized single-cell analysis, yet no kidney-specific model currently exists, and it remains unclear whether organ-focused models can outperform generalized models. The kidney's complex cellular architecture further complicates integration of large-scale omics data, where current frameworks trained on limited datasets struggle to correct batch effects, capture cross-modality variation, and generalize across species. Methods: We developed Nephrobase Cell+, the first kidney-focused large foundation model, pretrained on ~100 billion tokens from ~39.5 million single-cell and single-nucleus profiles across 4,319 samples. Nephrobase Cell+ uses a transformer-based encoder-decoder architecture with gene-token cross-attention and a mixture-of-experts module for scalable representation learning. Results: Nephrobase Cell+ sets a new benchmark for kidney single-cell analysis. It produces tightly clustered, biologically coherent embeddings in human and mouse kidneys, far surpassing previous foundation models such as Geneformer, scGPT, and UCE, as well as traditional methods such as PCA and autoencoders. It achieves the highest cluster concordance and batch-mixing scores, effectively removing donor/assay batch effects while preserving cell-type structure. Cross-species evaluation shows superior alignment of homologous cell types and >90% zero-shot annotation accuracy for major kidney lineages in both human and mouse. Even its 1B-parameter and 500M variants consistently outperform all existing models. Conclusions: Nephrobase Cell+ delivers a unified, high-fidelity representation of kidney biology that is robust, cross-species transferable, and unmatched by current single-cell foundation models, offering a powerful resource for kidney genomics and disease research.
Submitted 30 September, 2025;
originally announced September 2025.
-
TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models
Authors:
Tong Guan,
Zijie Meng,
Dianqi Li,
Shiyu Wang,
Chao-Han Huck Yang,
Qingsong Wen,
Zuozhu Liu,
Sabato Marco Siniscalchi,
Ming Jin,
Shirui Pan
Abstract:
Recent advances in multimodal time series learning underscore a paradigm shift from analytics centered on basic patterns toward advanced time series understanding and reasoning. However, existing multimodal time series datasets mostly remain at the level of surface alignment and question answering, without reaching the depth of genuine reasoning. The absence of well-defined tasks that genuinely require time series reasoning, along with the scarcity of high-quality data, has limited progress in building practical time series reasoning models (TSRMs). To address this, we introduce Time Series Reasoning Suite (TSR-Suite), which formalizes four atomic tasks that span three fundamental capabilities for reasoning with time series: (1) perception, acquired through scenario understanding and causality discovery; (2) extrapolation, realized via event-aware forecasting; and (3) decision-making, developed through deliberation over perception and extrapolation. TSR-Suite is the first comprehensive time series reasoning suite that supports not only thorough evaluation but also the data pipeline and training of TSRMs. It contains more than 23K samples, of which 2.3K are carefully curated through a human-guided hierarchical annotation process. Building on this foundation, we introduce TimeOmni-1, the first unified reasoning model designed to address diverse real-world problems demanding time series reasoning. The model is trained in multiple stages, integrating a mixture of task scenarios, novel reward functions, and tailored optimizations. Experiments show that TimeOmni-1 delivers strong out-of-distribution generalization across all tasks and achieves a high rate of valid responses. It significantly improves causality discovery accuracy (64.0% vs. 35.9% with GPT-4.1) and raises the valid response rate by over 6% compared to GPT-4.1 on the event-aware forecasting task.
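The abstract reports novel reward functions and a high valid-response rate but does not specify the rewards. A generic shaped reward for reasoning tasks of this kind might combine response validity with answer correctness, as in this illustrative sketch; the weights, the `Answer:` parsing convention, and the function name are assumptions, not TimeOmni-1's actual reward design.

```python
def reasoning_reward(response: str, gold_answer: str,
                     w_valid: float = 0.2, w_correct: float = 0.8) -> float:
    """Toy shaped reward: a small bonus for a well-formed response,
    plus a larger bonus when the final answer matches the gold label."""
    # Validity check: the response must contain a parseable final answer.
    if "Answer:" not in response:
        return 0.0
    answer = response.split("Answer:")[-1].strip().lower()
    correct = answer == gold_answer.strip().lower()
    return w_valid + (w_correct if correct else 0.0)

print(reasoning_reward("Reasoning... Answer: upward trend", "upward trend"))  # 1.0
print(reasoning_reward("no final answer here", "upward trend"))               # 0.0
```

Shaping of this form gives the policy a gradient toward producing well-formed outputs even before it answers correctly, which is one common way to raise valid-response rates during RL-style fine-tuning.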
Submitted 29 September, 2025;
originally announced September 2025.
-
Unit Test Update through LLM-Driven Context Collection and Error-Type-Aware Refinement
Authors:
Yuanhe Zhang,
Zhiquan Yang,
Shengyi Pan,
Zhongxin Liu
Abstract:
Unit testing is critical for ensuring software quality and software system stability. The current practice of manually maintaining unit tests suffers from low efficiency and the risk of delayed or overlooked fixes. Therefore, an automated approach is required to instantly update unit tests, with the capability to both repair and enhance unit tests. However, existing automated test maintenance methods primarily focus on repairing broken tests, neglecting the scenario of enhancing existing tests to verify new functionality. Meanwhile, due to their reliance on rule-based context collection and the lack of verification mechanisms, existing approaches struggle to handle complex code changes and often produce test cases with low correctness. To address these challenges, we propose TESTUPDATER, a novel LLM-based approach that enables automated just-in-time test updates in response to production code changes. TESTUPDATER first leverages the LLM to analyze code changes and identify relevant context, which it then extracts and filters. Then, through carefully designed prompts, TESTUPDATER guides the LLM step by step to handle various types of code changes and introduce new dependencies, enabling both test repair and enhancement. Finally, we introduce an error-type-aware iterative refinement mechanism that executes the LLM-updated tests and repairs failures, which significantly improves the overall correctness of test updates. Since existing test repair datasets lack scenarios of test enhancement, we further construct a new benchmark, UPDATES4J, with 195 real-world samples from 7 projects. Experimental results show that TESTUPDATER achieves a compilation pass rate of 94.4% and a test pass rate of 86.7%, outperforming the state-of-the-art method SYNTER by 15.9% and 20.0%, respectively. Furthermore, TESTUPDATER exhibits 12.9% higher branch coverage and 15.2% greater line coverage than SYNTER.
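The error-type-aware iterative refinement described above can be sketched as a control loop: execute the updated tests, classify the failure, and re-prompt the LLM with an error-type-specific repair request. Everything below is a hypothetical illustration of that loop, with a stubbed test runner and LLM; the classification rules, function names, and prompt wording are assumptions, not TESTUPDATER's implementation.

```python
def classify_error(output: str) -> str:
    """Map raw test output to a coarse error type (illustrative rules only)."""
    if "cannot find symbol" in output or "compilation" in output.lower():
        return "compile_error"
    if "AssertionError" in output or "expected" in output.lower():
        return "assertion_failure"
    return "runtime_error"

def refine_tests(test_code, run_tests, ask_llm, max_rounds=3):
    """Iteratively run tests and request an error-type-specific fix from the LLM."""
    for _ in range(max_rounds):
        ok, output = run_tests(test_code)
        if ok:
            return test_code, True
        err_type = classify_error(output)
        prompt = (f"The updated test failed with a {err_type}.\n"
                  f"Test code:\n{test_code}\nError output:\n{output}\n"
                  f"Return a corrected test.")
        test_code = ask_llm(prompt)
    ok, _ = run_tests(test_code)
    return test_code, ok

# Stub demo: the fake "LLM" fixes the test on its first repair attempt.
def fake_runner(code):
    passed = "fixed" in code
    return passed, "" if passed else "AssertionError: expected 2 but was 3"

def fake_llm(prompt):
    return "// fixed test body"

final, passed = refine_tests("// broken test body", fake_runner, fake_llm)
print(passed)  # True
```

Tagging the failure with an error type lets each repair prompt carry targeted guidance (e.g. a missing import versus a stale assertion), which is plausibly why this kind of loop improves correctness over a single-shot update.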
Submitted 29 September, 2025;
originally announced September 2025.
-
G-reasoner: Foundation Models for Unified Reasoning over Graph-structured Knowledge
Authors:
Linhao Luo,
Zicheng Zhao,
Junnan Liu,
Zhangchi Qiu,
Junnan Dong,
Serge Panev,
Chen Gong,
Thuy-Trang Vu,
Gholamreza Haffari,
Dinh Phung,
Alan Wee-Chung Liew,
Shirui Pan
Abstract:
Large language models (LLMs) excel at complex reasoning but remain limited by static and incomplete parametric knowledge. Retrieval-augmented generation (RAG) mitigates this by incorporating external knowledge, yet existing RAG systems struggle with knowledge-intensive tasks due to fragmented information and weak modeling of knowledge structure. Graphs offer a natural way to model relationships within knowledge, but LLMs are inherently unstructured and cannot effectively reason over graph-structured data. Recent graph-enhanced RAG (GraphRAG) methods attempt to bridge this gap by constructing tailored graphs and enabling LLMs to reason on them. However, these methods often depend on ad-hoc graph designs, heuristic search, or costly agent pipelines, which hinder scalability and generalization. To address these challenges, we present G-reasoner, a unified framework that integrates graph and language foundation models for reasoning over diverse graph-structured knowledge. Central to our approach is QuadGraph, a standardized four-layer abstraction that unifies heterogeneous knowledge sources into a common graph representation. Building on this, we introduce a 34M-parameter graph foundation model (GFM) that jointly captures graph topology and textual semantics, and is integrated with LLMs to enhance reasoning in downstream applications. To ensure scalability and efficiency, mixed-precision training and distributed message-passing are implemented to scale the GFM across more GPUs. Extensive experiments on six benchmarks show that G-reasoner consistently outperforms state-of-the-art baselines, significantly enhances LLM reasoning, and achieves strong efficiency and cross-graph generalization.
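The abstract describes a graph foundation model that jointly captures graph topology and textual semantics via message passing. A single mean-aggregation message-passing step over nodes carrying text-derived features can be sketched in NumPy as below; the aggregation scheme, dimensions, and setup are generic illustrations, not G-reasoner's actual architecture.

```python
import numpy as np

def message_passing_step(adj, node_feats, weight):
    """One round of mean-aggregation message passing.

    adj:        (n, n) adjacency matrix (0/1 entries)
    node_feats: (n, d) node features (e.g. text embeddings of graph entities)
    weight:     (d, d) learnable transform
    """
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                  # avoid divide-by-zero on isolated nodes
    messages = (adj @ node_feats) / deg  # mean over each node's neighbors
    # Combine self features with aggregated neighbor messages, then transform.
    return np.tanh((node_feats + messages) @ weight)

rng = np.random.default_rng(1)
n, d = 4, 6
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 1],
                [0, 1, 1, 0]], dtype=float)
feats = rng.normal(size=(n, d))   # stand-in for text embeddings per node
w = rng.normal(size=(d, d))
h = message_passing_step(adj, feats, w)
print(h.shape)  # (4, 6)
```

Stacking a few such layers lets each node's representation mix topology (who its neighbors are) with semantics (what their text embeddings encode), which is the core idea the abstract attributes to the GFM; the distributed message-passing mentioned there would shard this aggregation across GPUs.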
Submitted 29 September, 2025;
originally announced September 2025.