-
V2E: Validating Smart Contract Vulnerabilities through Profit-driven Exploit Generation and Execution
Authors:
Jingwen Zhang,
Yuhong Nan,
Kaiwen Ning,
Mingxi Ye,
Wei Li,
Yuming Xiao,
Yuming Feng,
Weizhe Zhang,
Zibin Zheng
Abstract:
Smart contracts are a critical component of blockchain systems. Due to the large amount of digital assets carried by smart contracts, their security is of critical importance. Although numerous tools have been developed for detecting smart contract vulnerability, their effectiveness remains limited, particularly due to the high false positives included in the reported results. Therefore, developer…
▽ More
Smart contracts are a critical component of blockchain systems. Due to the large amount of digital assets carried by smart contracts, their security is of critical importance. Although numerous tools have been developed for detecting smart contract vulnerability, their effectiveness remains limited, particularly due to the high false positives included in the reported results. Therefore, developers and auditors are often overwhelmed with manually verifying the reported issues. A fundamental reason behind this is that while a reported vulnerability satisfies specific vulnerable patterns, it may not actually be exploitable, either because the vulnerable code cannot be triggered or it does not result in any financial loss.
In this paper, we propose V2E, a new framework for validating whether a reported vulnerability is truly exploitable. The core idea of V2E is to automatically generate executable Proof-of-Concept Exploit (PoC for short), and then assess if the vulnerability could be triggered and incur any real damage (i.e., causing financial loss) by the PoC. While LLMs have shown proficiency in PoC generation, achieving our task is by no means trivial. In detail, it is difficult for LLM to: (1) generate and update PoC to trigger a specific vulnerability, (2) evaluate the PoC's effectiveness to validate exploitable vulnerability. To this end, V2E automates the whole process through a novel combination of PoC generation, validation, and refinement: (1) Firstly, V2E generates targeted PoCs by analyzing potential vulnerability paths. (2) Then, V2E verifies the validity of PoCs through triggerability and profitability analysis. (3) In addition, V2E iteratively refines the generated PoC based on PoC execution feedback, therefore, increasing the chance to confirm the vulnerability. Evaluation on 264 manually labeled contracts shows that V2E outperforms the baseline approach.
△ Less
Submitted 15 April, 2026;
originally announced April 2026.
-
EA-Agent: A Structured Multi-Step Reasoning Agent for Entity Alignment
Authors:
Yixuan Nan,
Xixun Lin,
Yanmin Shang,
Ge Zhang,
Zheng Fang,
Fang Fang,
Yanan Cao
Abstract:
Entity alignment (EA) aims to identify entities across different knowledge graphs (KGs) that refer to the same real-world object and plays a critical role in knowledge fusion and integration. Traditional EA methods mainly rely on knowledge representation learning, but their performance is often limited under noisy or sparsely supervised scenarios. Recently, large language models (LLMs) have been i…
▽ More
Entity alignment (EA) aims to identify entities across different knowledge graphs (KGs) that refer to the same real-world object and plays a critical role in knowledge fusion and integration. Traditional EA methods mainly rely on knowledge representation learning, but their performance is often limited under noisy or sparsely supervised scenarios. Recently, large language models (LLMs) have been introduced to EA and achieved notable improvements by leveraging rich semantic knowledge. However, existing LLM-based EA approaches typically treat LLMs as black-box decision makers, resulting in limited interpretability, and the direct use of large-scale triples substantially increases inference cost. To address these challenges, we propose \textbf{EA-Agent}, a reasoning-driven agent for EA. EA-Agent formulates EA as a structured reasoning process with multi-step planning and execution, enabling interpretable alignment decisions. Within this process, it introduces attribute and relation triple selectors to filter redundant triples before feeding them into the LLM, effectively addressing efficiency challenges. Experimental results on three benchmark datasets demonstrate that EA-Agent consistently outperforms existing EA methods and achieves state-of-the-art performance. The source code is available at https://github.com/YXNan0110/EA-Agent.
△ Less
Submitted 13 April, 2026;
originally announced April 2026.
-
ASI-Evolve: AI Accelerates AI
Authors:
Weixian Xu,
Tiantian Mi,
Yixiu Liu,
Yang Nan,
Zhimeng Zhou,
Lyumanshan Ye,
Lin Zhang,
Yu Qiao,
Pengfei Liu
Abstract:
Can AI accelerate the development of AI itself? While recent agentic systems have shown strong performance on well-scoped tasks with rapid feedback, it remains unclear whether they can tackle the costly, long-horizon, and weakly supervised research loops that drive real AI progress. We present ASI-Evolve, an agentic framework for AI-for-AI research that closes this loop through a learn-design-expe…
▽ More
Can AI accelerate the development of AI itself? While recent agentic systems have shown strong performance on well-scoped tasks with rapid feedback, it remains unclear whether they can tackle the costly, long-horizon, and weakly supervised research loops that drive real AI progress. We present ASI-Evolve, an agentic framework for AI-for-AI research that closes this loop through a learn-design-experiment-analyze cycle. ASI-Evolve augments standard evolutionary agents with two key components: a cognition base that injects accumulated human priors into each round of exploration, and a dedicated analyzer that distills complex experimental outcomes into reusable insights for future iterations. To our knowledge, ASI-Evolve is the first unified framework to demonstrate AI-driven discovery across three central components of AI development: data, architectures, and learning algorithms. In neural architecture design, it discovered 105 SOTA linear attention architectures, with the best discovered model surpassing DeltaNet by +0.97 points, nearly 3x the gain of recent human-designed improvements. In pretraining data curation, the evolved pipeline improves average benchmark performance by +3.96 points, with gains exceeding 18 points on MMLU. In reinforcement learning algorithm design, discovered algorithms outperform GRPO by up to +12.5 points on AMC32, +11.67 points on AIME24, and +5.04 points on OlympiadBench. We further provide initial evidence that this AI-for-AI paradigm can transfer beyond the AI stack through experiments in mathematics and biomedicine. Together, these results suggest that ASI-Evolve represents a promising step toward enabling AI to accelerate AI across the foundational stages of development, offering early evidence for the feasibility of closed-loop AI research.
△ Less
Submitted 31 March, 2026;
originally announced March 2026.
-
ChainFuzzer: Greybox Fuzzing for Workflow-Level Multi-Tool Vulnerabilities in LLM Agents
Authors:
Jiangrong Wu,
Zitong Yao,
Yuhong Nan,
Zibin Zheng
Abstract:
Tool-augmented LLM agents increasingly rely on multi-step, multi-tool workflows to complete real tasks. This design expands the attack surface, because data produced by one tool can be persisted and later reused as input to another tool, enabling exploitable source-to-sink dataflows that only emerge through tool composition. We study this risk as multi-tool vulnerabilities in LLM agents, and show…
▽ More
Tool-augmented LLM agents increasingly rely on multi-step, multi-tool workflows to complete real tasks. This design expands the attack surface, because data produced by one tool can be persisted and later reused as input to another tool, enabling exploitable source-to-sink dataflows that only emerge through tool composition. We study this risk as multi-tool vulnerabilities in LLM agents, and show that existing discovery efforts focused on single-tool or single-hop testing miss these long-horizon behaviors and provide limited debugging value. We present ChainFuzzer, a greybox framework for discovering and reproducing multi-tool vulnerabilities with auditable evidence. ChainFuzzer (i) identifies high-impact operations with strict source-to-sink dataflow evidence and extracts plausible upstream candidate tool chains based on cross-tool dependencies, (ii) uses Trace-guided Prompt Solving (TPS) to synthesize stable prompts that reliably drive the agent to execute target chains, and (iii) performs guardrail-aware fuzzing to reproduce vulnerabilities under LLM guardrails via payload mutation and sink-specific oracles. We evaluate ChainFuzzer on 20 popular open-source LLM agent apps (998 tools). ChainFuzzer extracts 2,388 candidate tool chains and synthesizes 2,213 stable prompts, confirming 365 unique, reproducible vulnerabilities across 19/20 apps (302 require multi-tool execution). Component evaluation shows tool-chain extraction achieves 96.49% edge precision and 91.50% strict chain precision; TPS increases chain reachability from 27.05% to 95.45%; guardrail-aware fuzzing boosts payload-level trigger rate from 18.20% to 88.60%. Overall, ChainFuzzer achieves 3.02 vulnerabilities per 1M tokens, providing a practical foundation for testing and hardening real-world multi-tool agent systems.
△ Less
Submitted 12 March, 2026;
originally announced March 2026.
-
AgentRaft: Automated Detection of Data Over-Exposure in LLM Agents
Authors:
Yixi Lin,
Jiangrong Wu,
Yuhong Nan,
Xueqiang Wang,
Xinyuan Zhang,
Zibin Zheng
Abstract:
The rapid integration of Large Language Model (LLM) agents into autonomous task execution has introduced significant privacy concerns within cross-tool data flows. In this paper, we systematically investigate and define a novel risk termed Data Over-Exposure (DOE) in LLM Agent, where an Agent inadvertently transmits sensitive data beyond the scope of user intent and functional necessity. We identi…
▽ More
The rapid integration of Large Language Model (LLM) agents into autonomous task execution has introduced significant privacy concerns within cross-tool data flows. In this paper, we systematically investigate and define a novel risk termed Data Over-Exposure (DOE) in LLM Agent, where an Agent inadvertently transmits sensitive data beyond the scope of user intent and functional necessity. We identify that DOE is primarily driven by the broad data paradigms in tool design and the coarse-grained data processing inherent in LLMs. In this paper, we present AgentRaft, the first automated framework for detecting DOE risks in LLM agents. AgentRaft combines program analysis with semantic reasoning through three synergistic modules: (1) it constructs a Cross-Tool Function Call Graph (FCG) to model the interaction landscape of heterogeneous tools; (2) it traverses the FCG to synthesize high-quality testing user prompts that act as deterministic triggers for deep-layer tool execution; and (3) it performs runtime taint tracking and employs a multi-LLM voting committee grounded in global privacy regulations (e.g., GDPR, CCPA, PIPL) to accurately identify privacy violations. We evaluate AgentRaft on a testing environment of 6,675 real-world agent tools. Our findings reveal that DOE is indeed a systemic risk, prevalent in 57.07% of potential tool interaction paths. AgentRaft achieves a high detection accuracy and effectiveness, outperforming baselines by 87.24%. Furthermore, AgentRaft reaches near-total DOE coverage (99%) within only 150 prompts while reducing per-chain verification costs by 88.6%. Our work provides a practical foundation for building auditable and privacy-compliant LLM agent systems.
△ Less
Submitted 8 March, 2026;
originally announced March 2026.
-
When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation
Authors:
Mubashara Akhtar,
Anka Reuel,
Prajna Soni,
Sanchit Ahuja,
Pawan Sasanka Ammanamanchi,
Ruchit Rawal,
Vilém Zouhar,
Srishti Yadav,
Chenxi Whitehouse,
Dayeon Ki,
Jennifer Mickel,
Leshem Choshen,
Marek Šuppa,
Jan Batzner,
Jenny Chim,
Jeba Sania,
Yanan Long,
Hossein A. Rahmani,
Christina Knight,
Yiyang Nan,
Jyoutir Raj,
Yu Fan,
Shubham Singh,
Subramanyam Sahoo,
Eliya Habba
, et al. (12 additional authors not shown)
Abstract:
Artificial Intelligence (AI) benchmarks play a central role in measuring progress in model development and guiding deployment decisions. However, many benchmarks quickly become saturated, meaning that they can no longer differentiate between the best-performing models, diminishing their long-term value. In this study, we analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks…
▽ More
Artificial Intelligence (AI) benchmarks play a central role in measuring progress in model development and guiding deployment decisions. However, many benchmarks quickly become saturated, meaning that they can no longer differentiate between the best-performing models, diminishing their long-term value. In this study, we analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers. To identify factors driving saturation, we characterize benchmarks along 14 properties spanning task design, data construction, and evaluation format. We test five hypotheses examining how each property contributes to saturation rates. Our analysis reveals that nearly half of the benchmarks exhibit saturation, with rates increasing as benchmarks age. Notably, hiding test data (i.e., public vs. private) shows no protective effect, while expert-curated benchmarks resist saturation better than crowdsourced ones. Our findings highlight which design choices extend benchmark longevity and inform strategies for more durable evaluation.
△ Less
Submitted 18 February, 2026;
originally announced February 2026.
-
DIVER: A Robust Text-to-SQL System with Dynamic Interactive Value Linking and Evidence Reasoning
Authors:
Yafeng Nan,
Haifeng Sun,
Zirui Zhuang,
Qi Qi,
Guojun Chu,
Jianxin Liao,
Dan Pei,
Jingyu Wang
Abstract:
In the era of large language models, Text-to-SQL, as a natural language interface for databases, is playing an increasingly important role. The sota Text-to-SQL models have achieved impressive accuracy, but their performance critically relies on expert-written evidence, which typically clarifies schema and value linking that existing models struggle to identify. Such limitations stem from the ambi…
▽ More
In the era of large language models, Text-to-SQL, as a natural language interface for databases, is playing an increasingly important role. The sota Text-to-SQL models have achieved impressive accuracy, but their performance critically relies on expert-written evidence, which typically clarifies schema and value linking that existing models struggle to identify. Such limitations stem from the ambiguity of user queries and, more importantly, the complexity of comprehending large-scale and dynamic database values. Consequently, in real-world scenarios where expert assistance is unavailable, existing methods suffer a severe performance collapse, with execution accuracy dropping by over 10%. This underscores their lack of robustness. To address this, we propose DIVER, a robust system that automates evidence reasoning with dynamic interactive value linking. It leverages a compatible toolbox containing diverse tools to probe the database. Then, restricted by a structured workspace (CoTF, Chain of Thoughts and Facts), it reflects based on probe results and selects a new tool for next round of probing. Through this automatically iterative process, DIVER identifies schema and value linking missed by existing methods. Based on these accurate linkings, DIVER is able to infer correct usage of SQL functions and formulas and generate high-quality evidence, achieving robust Text-to-SQL without expert assistance. Extensive experiments demonstrate that: 1) The DIVER system significantly enhances the robustness of various Text-to-SQL models, improving performance by up to 10.82% in Execution Accuracy (EX) and 16.09% in Valid Efficiency Score (VES). 2) Our dynamic interactive value linking significantly improves the robustness of existing systems and the accuracy of schema and value linking, especially when confronted with challenges posed by large-scale, dynamic database values.
△ Less
Submitted 12 February, 2026;
originally announced February 2026.
-
Embedding Perturbation may Better Reflect the Uncertainty in LLM Reasoning
Authors:
Qihao Wen,
Jiahao Wang,
Yang Nan,
Pengfei He,
Ravi Tandon,
Han Xu
Abstract:
Large language Models (LLMs) have achieved significant breakthroughs across diverse domains; however, they can still produce unreliable or misleading outputs. For responsible LLM application, Uncertainty Quantification (UQ) techniques are used to estimate a model's uncertainty about its outputs, indicating the likelihood that those outputs may be problematic. For LLM reasoning tasks, it is essenti…
▽ More
Large language Models (LLMs) have achieved significant breakthroughs across diverse domains; however, they can still produce unreliable or misleading outputs. For responsible LLM application, Uncertainty Quantification (UQ) techniques are used to estimate a model's uncertainty about its outputs, indicating the likelihood that those outputs may be problematic. For LLM reasoning tasks, it is essential to estimate the uncertainty not only for the final answer, but also for the intermediate steps of the reasoning, as this can enable more fine-grained and targeted interventions. In this study, we explore what UQ metrics better reflect the LLM's ``intermediate uncertainty''during reasoning. Our study reveals that an LLMs' incorrect reasoning steps tend to contain tokens which are highly sensitive to the perturbations on the preceding token embeddings. In this way, incorrect (uncertain) intermediate steps can be readily identified using this sensitivity score as guidance in practice. In our experiments, we show such perturbation-based metric achieves stronger uncertainty quantification performance compared with baseline methods such as token (generation) probability and token entropy. Besides, different from approaches that rely on multiple sampling, the perturbation-based metrics offer better simplicity and efficiency.
△ Less
Submitted 2 February, 2026;
originally announced February 2026.
-
Is My RPC Response Reliable? Detecting RPC Bugs in Ethereum Blockchain Client under Context
Authors:
Zhijie Zhong,
Yuhong Nan,
Mingxi Ye,
Qing Xue,
Jiashui Wang,
Xinlei Ying,
Long Liu,
Zibin Zheng
Abstract:
Blockchain clients are fundamental software for running blockchain nodes. They provide users with various RPC (Remote Procedure Call) interfaces to interact with the blockchain. These RPC methods are expected to follow the same specification across different blockchain nodes, providing users with seamless interaction. However, there have been continuous reports on various RPC bugs that can cause u…
▽ More
Blockchain clients are fundamental software for running blockchain nodes. They provide users with various RPC (Remote Procedure Call) interfaces to interact with the blockchain. These RPC methods are expected to follow the same specification across different blockchain nodes, providing users with seamless interaction. However, there have been continuous reports on various RPC bugs that can cause unexpected responses or even Denial of Service weakness. Existing studies on blockchain RPC bug detection mainly focus on generating the RPC method calls for testing blockchain clients. However, a wide range of the reported RPC bugs are triggered in various blockchain contexts. To the best of our knowledge, little attention is paid to generating proper contexts that can trigger these context-dependent RPC bugs.
In this work, we propose EthCRAFT, a Context-aware RPC Analysis and Fuzzing Tool for client RPC bug detection. EthCRAFT first proposes to explore the state transition program space of blockchain clients and generate various transactions to construct the context. EthCRAFT then designs a context-aware RPC method call generation method to send RPC calls to the blockchain clients. The responses of 5 different client implementations are used as cross-referring oracles to detect the RPC bugs. We evaluate EthCRAFT on real-world RPC bugs collected from the GitHub issues of Ethereum client implementations. Experiment results show that EthCRAFT outperforms existing client RPC detectors by detecting more RPC bugs. Moreover, EthCRAFT has found six new bugs in major Ethereum clients and reported them to the developers. One of the bug fixes has been written into breaking changes in the client's updates. Three of our bug reports have been offered a vulnerability bounty by the Ethereum Foundation.
△ Less
Submitted 29 January, 2026;
originally announced January 2026.
-
Interpretable Probability Estimation with LLMs via Shapley Reconstruction
Authors:
Yang Nan,
Qihao Wen,
Jiahao Wang,
Pengfei He,
Ravi Tandon,
Yong Ge,
Han Xu
Abstract:
Large Language Models (LLMs) demonstrate potential to estimate the probability of uncertain events, by leveraging their extensive knowledge and reasoning capabilities. This ability can be applied to support intelligent decision-making across diverse fields, such as financial forecasting and preventive healthcare. However, directly prompting LLMs for probability estimation faces significant challen…
▽ More
Large Language Models (LLMs) demonstrate potential to estimate the probability of uncertain events, by leveraging their extensive knowledge and reasoning capabilities. This ability can be applied to support intelligent decision-making across diverse fields, such as financial forecasting and preventive healthcare. However, directly prompting LLMs for probability estimation faces significant challenges: their outputs are often noisy, and the underlying predicting process is opaque. In this paper, we propose PRISM: Probability Reconstruction via Shapley Measures, a framework that brings transparency and precision to LLM-based probability estimation. PRISM decomposes an LLM's prediction by quantifying the marginal contribution of each input factor using Shapley values. These factor-level contributions are then aggregated to reconstruct a calibrated final estimate. In our experiments, we demonstrate PRISM improves predictive accuracy over direct prompting and other baselines, across multiple domains including finance, healthcare, and agriculture. Beyond performance, PRISM provides a transparent prediction pipeline: our case studies visualize how individual factors shape the final estimate, helping build trust in LLM-based decision support systems.
△ Less
Submitted 13 January, 2026;
originally announced January 2026.
-
BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping
Authors:
Zhiheng Xi,
Xin Guo,
Yang Nan,
Enyu Zhou,
Junrui Shen,
Wenxiang Chen,
Jiaqi Liu,
Jixuan Huang,
Zhihao Zhang,
Honglin Guo,
Xun Deng,
Zhikai Lei,
Miao Zheng,
Guoteng Wang,
Shuo Zhang,
Peng Sun,
Rui Zheng,
Hang Yan,
Tao Gui,
Qi Zhang,
Xuanjing Huang
Abstract:
Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Yet, applying RL in off-policy settings--where stale data from past policies are used for training--improves sample efficiency, but remains challenging: policy entropy declines sharply, optimization often becomes unstable and may even collapse. Through theoretical and empi…
▽ More
Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Yet, applying RL in off-policy settings--where stale data from past policies are used for training--improves sample efficiency, but remains challenging: policy entropy declines sharply, optimization often becomes unstable and may even collapse. Through theoretical and empirical analysis, we identify two key insights: (i) an imbalance in optimization, where negative-advantage samples dominate the policy gradient, suppressing useful behaviors and risking gradient explosions; and (ii) the derived Entropy-Clip Rule, which reveals that the fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, thereby driving the policy toward over-exploitation at the expense of exploration. Building on these insights, we propose BAlanced Policy Optimization with Adaptive Clipping (BAPO), a simple yet effective method that dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. Across diverse off-policy scenarios--including sample replay and partial rollout--BAPO achieves fast, stable, and data-efficient training. On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B, while our 32B BAPO model not only achieves state-of-the-art results among models of the same scale but also outperforms leading proprietary systems like o3-mini and Gemini-2.5-Flash-Thinking.
△ Less
Submitted 21 October, 2025;
originally announced October 2025.
-
ShortcutBreaker: Low-Rank Noisy Bottleneck and Frequency Filtering Block for Multi-Class Unsupervised Anomaly Detection
Authors:
Peng Tang,
Xiaobin Hu,
Tingcheng Li,
Yang Nan,
Tobias Lasser,
Hongwei Bran Li
Abstract:
Multi-class unsupervised anomaly detection (MUAD) has garnered growing research interest, as it seeks to develop a unified model for anomaly detection across multiple classes, i.e., eliminating the need to train separate models for distinct objects and thereby saving substantial computational resources. Under the MUAD setting, while advanced Transformer-based architectures have brought significant…
▽ More
Multi-class unsupervised anomaly detection (MUAD) has garnered growing research interest, as it seeks to develop a unified model for anomaly detection across multiple classes, i.e., eliminating the need to train separate models for distinct objects and thereby saving substantial computational resources. Under the MUAD setting, while advanced Transformer-based architectures have brought significant performance improvements, identity shortcuts persist: they directly copy inputs to outputs, narrowing the gap in reconstruction errors between normal and abnormal cases, and thereby making the two harder to distinguish. Therefore, we propose ShortcutBreaker, a novel unified feature-reconstruction framework for MUAD tasks, featuring two key innovations to address the issue of shortcuts. First, drawing on matrix rank inequality, we design a low-rank noisy bottleneck (LRNB) to project highdimensional features into a low-rank latent space, and theoretically demonstrate its capacity to prevent trivial identity reproduction. Second, leveraging ViTs global modeling capability instead of merely focusing on local features, we incorporate a global perturbation attention to prevent information shortcuts in the decoders. Extensive experiments are performed on four widely used anomaly detection benchmarks, including three industrial datasets (MVTec-AD, ViSA, and Real-IAD) and one medical dataset (Universal Medical). The proposed method achieves a remarkable image-level AUROC of 99.8%, 98.9%, 90.6%, and 87.8% on these four datasets, respectively, consistently outperforming previous MUAD methods across different scenarios Our code will be released..
△ Less
Submitted 27 March, 2026; v1 submitted 21 October, 2025;
originally announced October 2025.
-
Satellite: Detecting and Analyzing Smart Contract Vulnerabilities caused by Subcontract Misuse
Authors:
Zeqin Liao,
Yuhong Nan,
Zixu Gao,
Henglong Liang,
Sicheng Hao,
Jiajing Wu,
Zibin Zheng
Abstract:
Developers of smart contracts pervasively reuse subcontracts to improve development efficiency. Like any program language, such subcontract reuse may unexpectedly include, or introduce vulnerabilities to the end-point smart contract. Unfortunately, automatically detecting such issues poses several unique challenges. Particularly, in most cases, smart contracts are compiled as bytecode, whose class…
▽ More
Developers of smart contracts pervasively reuse subcontracts to improve development efficiency. Like any program language, such subcontract reuse may unexpectedly include, or introduce vulnerabilities to the end-point smart contract. Unfortunately, automatically detecting such issues poses several unique challenges. Particularly, in most cases, smart contracts are compiled as bytecode, whose class-level information (e.g., inheritance, virtual function table), and even semantics (e.g., control flow and data flow) are fully obscured as a single smart contract after compilation.
In this paper, we propose Satellite, a new bytecode-level static analysis framework for subcontract misuse vulnerability (SMV) detection in smart contracts. Satellite incorporates a series of novel designs to enhance its overall effectiveness.. Particularly, Satellite utilizes a transfer learning method to recover the inherited methods, which are critical for identifying subcontract reuse in smart contracts. Further, Satellite extracts a set of fine-grained method-level features and performs a method-level comparison, for identifying the reuse part of subcontract in smart contracts. Finally, Satellite summarizes a set of SMV indicators according to their types, and hence effectively identifies SMVs. To evaluate Satellite, we construct a dataset consisting of 58 SMVs derived from real-world attacks and collect additional 56 SMV patterns from SOTA studies. Experiment results indicate that Satellite exhibits good performance in identifying SMV, with a precision rate of 84.68% and a recall rate of 92.11%. In addition, Satellite successfully identifies 14 new/unknown SMV over 10,011 real-world smart contracts, affecting a total amount of digital assets worth 201,358 USD.
△ Less
Submitted 28 September, 2025;
originally announced September 2025.
-
Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from Token and Parameter Levels
Authors:
Junjie Ye,
Yuming Yang,
Yang Nan,
Shuo Li,
Qi Zhang,
Tao Gui,
Xuanjing Huang,
Peng Wang,
Zhongchao Shi,
Jianping Fan
Abstract:
Large language models (LLMs) acquire substantial world knowledge during pre-training, which is further shaped by post-training techniques such as supervised fine-tuning (SFT). However, the impact of SFT on a model's knowledge remains underexplored, limiting our ability to control knowledge change behavior in fine-tuned models. To address this gap, we evaluate closed-book question answering (CBQA)…
▽ More
Large language models (LLMs) acquire substantial world knowledge during pre-training, which is further shaped by post-training techniques such as supervised fine-tuning (SFT). However, the impact of SFT on a model's knowledge remains underexplored, limiting our ability to control knowledge change behavior in fine-tuned models. To address this gap, we evaluate closed-book question answering (CBQA) performance across five LLMs from the LLaMA-2 and LLaMA-3 families. Surprisingly, models fine-tuned on 1,920 samples perform up to 14% worse than those fine-tuned on only 240 samples. Furthermore, varying the level of knowledge mastery in the fine-tuning data leads to performance fluctuations of over 12%. To investigate these effects, we analyze model behavior at both the token and parameter levels. Our analysis reveals that up to 90% of parameter updates during SFT do not contribute to knowledge enhancement. Restoring these updates can improve performance on the CBQA task, depending on the characteristics of the fine-tuning data. These insights offer practical guidance for developing fine-tuning strategies that more effectively strengthen model knowledge.
△ Less
Submitted 10 February, 2026; v1 submitted 20 September, 2025;
originally announced September 2025.
-
MITS: A Large-Scale Multimodal Benchmark Dataset for Intelligent Traffic Surveillance
Authors:
Kaikai Zhao,
Zhaoxiang Liu,
Peng Wang,
Xin Wang,
Zhicheng Ma,
Yajun Xu,
Wenjing Zhang,
Yibing Nan,
Kai Wang,
Shiguo Lian
Abstract:
General-domain large multimodal models (LMMs) have achieved significant advances in various image-text tasks. However, their performance in the Intelligent Traffic Surveillance (ITS) domain remains limited due to the absence of dedicated multimodal datasets. To address this gap, we introduce MITS (Multimodal Intelligent Traffic Surveillance), the first large-scale multimodal benchmark dataset spec…
▽ More
General-domain large multimodal models (LMMs) have achieved significant advances in various image-text tasks. However, their performance in the Intelligent Traffic Surveillance (ITS) domain remains limited due to the absence of dedicated multimodal datasets. To address this gap, we introduce MITS (Multimodal Intelligent Traffic Surveillance), the first large-scale multimodal benchmark dataset specifically designed for ITS. MITS includes 170,400 independently collected real-world ITS images sourced from traffic surveillance cameras, annotated with eight main categories and 24 subcategories of ITS-specific objects and events under diverse environmental conditions. Additionally, through a systematic data generation pipeline, we generate high-quality image captions and 5 million instruction-following visual question-answer pairs, addressing five critical ITS tasks: object and event recognition, object counting, object localization, background analysis, and event reasoning. To demonstrate MITS's effectiveness, we fine-tune mainstream LMMs on this dataset, enabling the development of ITS-specific applications. Experimental results show that MITS significantly improves LMM performance in ITS applications, increasing LLaVA-1.5's performance from 0.494 to 0.905 (+83.2%), LLaVA-1.6's from 0.678 to 0.921 (+35.8%), Qwen2-VL's from 0.584 to 0.926 (+58.6%), and Qwen2.5-VL's from 0.732 to 0.930 (+27.0%). We release the dataset, code, and models as open-source, providing high-value resources to advance both ITS and LMM research.
△ Less
Submitted 10 September, 2025;
originally announced September 2025.
-
Can Multiple Responses from an LLM Reveal the Sources of Its Uncertainty?
Authors:
Yang Nan,
Pengfei He,
Ravi Tandon,
Han Xu
Abstract:
Large language models (LLMs) have delivered significant breakthroughs across diverse domains but can still produce unreliable or misleading outputs, posing critical challenges for real-world applications. While many recent studies focus on quantifying model uncertainty, relatively little work has been devoted to \textit{diagnosing the source of uncertainty}. In this study, we show that, when an LL…
▽ More
Large language models (LLMs) have delivered significant breakthroughs across diverse domains but can still produce unreliable or misleading outputs, posing critical challenges for real-world applications. While many recent studies focus on quantifying model uncertainty, relatively little work has been devoted to \textit{diagnosing the source of uncertainty}. In this study, we show that, when an LLM is uncertain, the patterns of disagreement among its multiple generated responses contain rich clues about the underlying cause of uncertainty. To illustrate this point, we collect multiple responses from a target LLM and employ an auxiliary LLM to analyze their patterns of disagreement. The auxiliary model is tasked to reason about the likely source of uncertainty, such as whether it stems from ambiguity in the input question, a lack of relevant knowledge, or both. In cases involving knowledge gaps, the auxiliary model also identifies the specific missing facts or concepts contributing to the uncertainty. In our experiment, we validate our framework on AmbigQA, OpenBookQA, and MMLU-Pro, confirming its generality in diagnosing distinct uncertainty sources. Such diagnosis shows the potential for relevant manual interventions that improve LLM performance and reliability.
△ Less
Submitted 28 August, 2025;
originally announced September 2025.
-
Leveraging Generative Models for Real-Time Query-Driven Text Summarization in Large-Scale Web Search
Authors:
Zeyu Xiong,
Yixuan Nan,
Li Gao,
Hengzhu Tang,
Shuaiqiang Wang,
Junfeng Wang,
Dawei Yin
Abstract:
In the dynamic landscape of large-scale web search, Query-Driven Text Summarization (QDTS) aims to generate concise and informative summaries from textual documents based on a given query, which is essential for improving user engagement and facilitating rapid decision-making. Traditional extractive summarization models, based primarily on ranking candidate summary segments, have been the dominant…
▽ More
In the dynamic landscape of large-scale web search, Query-Driven Text Summarization (QDTS) aims to generate concise and informative summaries from textual documents based on a given query, which is essential for improving user engagement and facilitating rapid decision-making. Traditional extractive summarization models, based primarily on ranking candidate summary segments, have been the dominant approach in industrial applications. However, these approaches suffer from two key limitations: 1) The multi-stage pipeline often introduces cumulative information loss and architectural bottlenecks due to its weakest component; 2) Traditional models lack sufficient semantic understanding of both user queries and documents, particularly when dealing with complex search intents. In this study, we propose a novel framework to pioneer the application of generative models to address real-time QDTS in industrial web search. Our approach integrates large model distillation, supervised fine-tuning, direct preference optimization, and lookahead decoding to transform a lightweight model with only 0.1B parameters into a domain-specialized QDTS expert. Evaluated on multiple industry-relevant metrics, our model outperforms the production baseline and achieves a new state of the art. Furthermore, it demonstrates excellent deployment efficiency, requiring only 334 NVIDIA L20 GPUs to handle \textasciitilde50,000 queries per second under 55~ms average latency per query.
△ Less
Submitted 28 August, 2025;
originally announced August 2025.
-
Data-Driven Deepfake Image Detection Method -- The 2024 Global Deepfake Image Detection Challenge
Authors:
Xiaoya Zhu,
Yibing Nan,
Shiguo Lian
Abstract:
With the rapid development of technology in the field of AI, deepfake technology has emerged as a double-edged sword. It has not only created a large amount of AI-generated content but also posed unprecedented challenges to digital security. The task of the competition is to determine whether a face image is a Deepfake image and output its probability score of being a Deepfake image. In the image…
▽ More
With the rapid development of technology in the field of AI, deepfake technology has emerged as a double-edged sword. It has not only created a large amount of AI-generated content but also posed unprecedented challenges to digital security. The task of the competition is to determine whether a face image is a Deepfake image and output its probability score of being a Deepfake image. In the image track competition, our approach is based on the Swin Transformer V2-B classification network. And online data augmentation and offline sample generation methods are employed to enrich the diversity of training samples and increase the generalization ability of the model. Finally, we got the award of excellence in Deepfake image detection.
△ Less
Submitted 15 August, 2025;
originally announced August 2025.
-
RANA: Robust Active Learning for Noisy Network Alignment
Authors:
Yixuan Nan,
Xixun Lin,
Yanmin Shang,
Zhuofan Li,
Can Zhao,
Yanan Cao
Abstract:
Network alignment has attracted widespread attention in various fields. However, most existing works mainly focus on the problem of label sparsity, while overlooking the issue of noise in network alignment, which can substantially undermine model performance. Such noise mainly includes structural noise from noisy edges and labeling noise caused by human-induced and process-driven errors. To addres…
▽ More
Network alignment has attracted widespread attention in various fields. However, most existing works mainly focus on the problem of label sparsity, while overlooking the issue of noise in network alignment, which can substantially undermine model performance. Such noise mainly includes structural noise from noisy edges and labeling noise caused by human-induced and process-driven errors. To address these problems, we propose RANA, a Robust Active learning framework for noisy Network Alignment. RANA effectively tackles both structure noise and label noise while addressing the sparsity of anchor link annotations, which can improve the robustness of network alignment models. Specifically, RANA introduces the proposed Noise-aware Selection Module and the Label Denoising Module to address structural noise and labeling noise, respectively. In the first module, we design a noise-aware maximization objective to select node pairs, incorporating a cleanliness score to address structural noise. In the second module, we propose a novel multi-source fusion denoising strategy that leverages model and twin node pairs labeling to provide more accurate labels for node pairs. Empirical results on three real-world datasets demonstrate that RANA outperforms state-of-the-art active learning-based methods in alignment accuracy. Our code is available at https://github.com/YXNan0110/RANA.
△ Less
Submitted 7 August, 2025; v1 submitted 30 July, 2025;
originally announced July 2025.
-
An Empirical Study on Embodied Artificial Intelligence Robot (EAIR) Software Bugs
Authors:
Zeqin Liao,
Zibin Zheng,
Peifan Reng,
Henglong Liang,
Zixu Gao,
Zhixiang Chen,
Wei Li,
Yuhong Nan
Abstract:
Embodied Artificial Intelligence Robots (EAIR) is an emerging and rapidly evolving technological domain. Ensuring their program correctness is fundamental to their successful deployment. However, a general and in-depth understanding of EAIR system bugs remains lacking, which hinders the development of practices and techniques to tackle EAIR system bugs.
To bridge this gap, we conducted the first…
▽ More
Embodied Artificial Intelligence Robots (EAIR) is an emerging and rapidly evolving technological domain. Ensuring their program correctness is fundamental to their successful deployment. However, a general and in-depth understanding of EAIR system bugs remains lacking, which hinders the development of practices and techniques to tackle EAIR system bugs.
To bridge this gap, we conducted the first systematic study of 885 EAIR system bugs collected from 80 EAIR system projects to investigate their symptoms, underlying causes, and module distribution. Our analysis takes considerable effort, which classifies these bugs into 18 underlying causes, 15 distinct symptoms, and identifies 13 affected modules. It reveals several new interesting findings and implications which help shed light on future research on tackling or repairing EAIR system bugs. First, among the 15 identified symptoms, our findings highlight 8 symptoms specific to EAIR systems, which is characterized by severe functional failures and potential physical hazards. Second, within the 18 underlying causes, we define 8 EAIR-specific causes, the majority of which stem from the intricate issues of AI- agent reasoning and decision making. Finally, to facilitate precise and efficient bug prediction, detection, and repair, we constructed a mapping between underlying causes and the modules in which they most frequently occur, which enables researchers to focus diagnostic efforts on the modules most susceptible to specific bug types.
△ Less
Submitted 24 July, 2025;
originally announced July 2025.
-
AlphaGo Moment for Model Architecture Discovery
Authors:
Yixiu Liu,
Yang Nan,
Weixian Xu,
Xiangkun Hu,
Lyumanshan Ye,
Zhen Qin,
Pengfei Liu
Abstract:
While AI systems demonstrate exponentially improving capabilities, the pace of AI research itself remains linearly bounded by human cognitive capacity, creating an increasingly severe development bottleneck. We present ASI-Arch, the first demonstration of Artificial Superintelligence for AI research (ASI4AI) in the critical domain of neural architecture discovery--a fully autonomous system that sh…
▽ More
While AI systems demonstrate exponentially improving capabilities, the pace of AI research itself remains linearly bounded by human cognitive capacity, creating an increasingly severe development bottleneck. We present ASI-Arch, the first demonstration of Artificial Superintelligence for AI research (ASI4AI) in the critical domain of neural architecture discovery--a fully autonomous system that shatters this fundamental constraint by enabling AI to conduct its own architectural innovation. Moving beyond traditional Neural Architecture Search (NAS), which is fundamentally limited to exploring human-defined spaces, we introduce a paradigm shift from automated optimization to automated innovation. ASI-Arch can conduct end-to-end scientific research in the domain of architecture discovery, autonomously hypothesizing novel architectural concepts, implementing them as executable code, training and empirically validating their performance through rigorous experimentation and past experience. ASI-Arch conducted 1,773 autonomous experiments over 20,000 GPU hours, culminating in the discovery of 106 innovative, state-of-the-art (SOTA) linear attention architectures. Like AlphaGo's Move 37 that revealed unexpected strategic insights invisible to human players, our AI-discovered architectures demonstrate emergent design principles that systematically surpass human-designed baselines and illuminate previously unknown pathways for architectural innovation. Crucially, we establish the first empirical scaling law for scientific discovery itself--demonstrating that architectural breakthroughs can be scaled computationally, transforming research progress from a human-limited to a computation-scalable process. We provide comprehensive analysis of the emergent design patterns and autonomous research capabilities that enabled these breakthroughs, establishing a blueprint for self-accelerating AI systems.
△ Less
Submitted 23 July, 2025;
originally announced July 2025.
-
Interaction as Intelligence: Deep Research With Human-AI Partnership
Authors:
Lyumanshan Ye,
Xiaojie Cai,
Xinkai Wang,
Junfei Wang,
Xiangkun Hu,
Jiadi Su,
Yang Nan,
Sihan Wang,
Bohan Zhang,
Xiaoze Fan,
Jinbin Luo,
Yuxiang Zheng,
Tianze Xu,
Dayuan Fu,
Yunze Wu,
Pengrui Lu,
Zengzhi Wang,
Yiwei Qin,
Zhen Huang,
Yan Ma,
Zhulin Hu,
Haoyang Zou,
Tiantian Mi,
Yixin Ye,
Ethan Chern
, et al. (1 additional authors not shown)
Abstract:
This paper introduces "Interaction as Intelligence" research series, presenting a reconceptualization of human-AI relationships in deep research tasks. Traditional approaches treat interaction merely as an interface for accessing AI capabilities-a conduit between human intent and machine output. We propose that interaction itself constitutes a fundamental dimension of intelligence. As AI systems e…
▽ More
This paper introduces "Interaction as Intelligence" research series, presenting a reconceptualization of human-AI relationships in deep research tasks. Traditional approaches treat interaction merely as an interface for accessing AI capabilities-a conduit between human intent and machine output. We propose that interaction itself constitutes a fundamental dimension of intelligence. As AI systems engage in extended thinking processes for research tasks, meaningful interaction transitions from an optional enhancement to an essential component of effective intelligence. Current deep research systems adopt an "input-wait-output" paradigm where users initiate queries and receive results after black-box processing. This approach leads to error cascade effects, inflexible research boundaries that prevent question refinement during investigation, and missed opportunities for expertise integration. To address these limitations, we introduce Deep Cognition, a system that transforms the human role from giving instructions to cognitive oversight-a mode of engagement where humans guide AI thinking processes through strategic intervention at critical junctures. Deep cognition implements three key innovations: (1)Transparent, controllable, and interruptible interaction that reveals AI reasoning and enables intervention at any point; (2)Fine-grained bidirectional dialogue; and (3)Shared cognitive context where the system observes and adapts to user behaviors without explicit instruction. User evaluation demonstrates that this cognitive oversight paradigm outperforms the strongest baseline across six key metrics: Transparency(+20.0%), Fine-Grained Interaction(+29.2%), Real-Time Intervention(+18.5%), Ease of Collaboration(+27.7%), Results-Worth-Effort(+8.8%), and Interruptibility(+20.7%). Evaluations on challenging research problems show 31.8% to 50.0% points of improvements over deep research systems.
△ Less
Submitted 21 July, 2025;
originally announced July 2025.
-
Control at Stake: Evaluating the Security Landscape of LLM-Driven Email Agents
Authors:
Jiangrong Wu,
Yuhong Nan,
Jianliang Wu,
Zitong Yao,
Zibin Zheng
Abstract:
The increasing capabilities of LLMs have led to the rapid proliferation of LLM agent apps, where developers enhance LLMs with access to external resources to support complex task execution. Among these, LLM email agent apps represent one of the widely used categories, as email remains a critical communication medium for users. LLM email agents are capable of managing and responding to email using…
▽ More
The increasing capabilities of LLMs have led to the rapid proliferation of LLM agent apps, where developers enhance LLMs with access to external resources to support complex task execution. Among these, LLM email agent apps represent one of the widely used categories, as email remains a critical communication medium for users. LLM email agents are capable of managing and responding to email using LLM-driven reasoning and autonomously executing user instructions via external email APIs (e.g., send email). However, despite their growing deployment and utility, the security mechanism of LLM email agent apps remains underexplored. Currently, there is no comprehensive study into the potential security risk within these agent apps and their broader implications.
In this paper, we conduct the first in-depth and systematic security study of LLM email agents. We propose the Email Agent Hijacking (EAH) attack, which overrides the original prompts of the email agent via external email resources, allowing attackers to gain control of the email agent remotely and further perform specific attack scenarios without user awareness.
To facilitate the large-scale evaluation, we propose EAHawk, a pipeline to evaluate the EAH attack of LLM email agent apps. By EAHawk, we performed an empirical study spanning 14 representative LLM agent frameworks, 63 agent apps, 12 LLMs, and 20 email services, which led to the generation of 1,404 real-world email agent instances for evaluation. Experimental results indicate that all 1,404 instances were successfully hijacked; on average, only 2.03 attack attempts are required to control an email agent instance. Even worse, for some LLMs, the average number of attempts needed to achieve full agent control drops to as few as 1.23.
△ Less
Submitted 3 July, 2025;
originally announced July 2025.
-
Understanding the Sneaky Patterns of Pop-up Windows in the Mobile Ecosystem
Authors:
Dongpeng Wu,
Yuhong Nan,
Shaojiang Wang,
Jiawei Wang,
Luwa Li,
Xueqiang Wang
Abstract:
In mobile applications, Pop-up window (PoW) plays a crucial role in improving user experience, guiding user actions, and delivering key information. Unfortunately, the excessive use of PoWs severely degrades the user experience. These PoWs often sneakily mislead users in their choices, employing tactics that subtly manipulate decision-making processes. In this paper, we provide the first in-depth…
▽ More
In mobile applications, Pop-up window (PoW) plays a crucial role in improving user experience, guiding user actions, and delivering key information. Unfortunately, the excessive use of PoWs severely degrades the user experience. These PoWs often sneakily mislead users in their choices, employing tactics that subtly manipulate decision-making processes. In this paper, we provide the first in-depth study on the Sneaky patterns in the mobile ecosystem. Our research first highlights five distinct Sneaky patterns that compromise user experience, including text mislead, UI mislead, forced action, out of context and privacy-intrusive by default. To further evaluate the impact of such Sneaky patterns at large, we developed an automated analysis pipeline called Poker, to tackle the challenges of identifying, dismissing, and collecting diverse PoWs in real-world apps. Evaluation results showed that Poker achieves high precision and recall in detecting PoWs, efficiently dismissed over 88% of PoWs with minimal user interaction, with good robustness and reliability in comprehensive app exploration. Further, our systematic analysis over the top 100 popular apps in China and U.S. revealing that both regions displayed significant ratios of Sneaky patterns, particularly in promotional contexts, with high occurrences in categories such as shopping and video apps. The findings highlight the strategic deployment of Sneaky tactics that compromise user trust and ethical app design.
△ Less
Submitted 17 May, 2025;
originally announced May 2025.
-
Aya Vision: Advancing the Frontier of Multilingual Multimodality
Authors:
Saurabh Dash,
Yiyang Nan,
John Dang,
Arash Ahmadian,
Shivalika Singh,
Madeline Smith,
Bharat Venkitesh,
Vlad Shmyhlo,
Viraat Aryabumi,
Walter Beller-Morales,
Jeremy Pekmez,
Jason Ozuzu,
Pierre Richemond,
Acyr Locatelli,
Nick Frosst,
Phil Blunsom,
Aidan Gomez,
Ivan Zhang,
Marzieh Fadaee,
Manoj Govindassamy,
Sudip Roy,
Matthias Gallé,
Beyza Ermis,
Ahmet Üstün,
Sara Hooker
Abstract:
Building multimodal language models is fundamentally challenging: it requires aligning vision and language modalities, curating high-quality instruction data, and avoiding the degradation of existing text-only capabilities once vision is introduced. These difficulties are further magnified in the multilingual setting, where the need for multimodal data in different languages exacerbates existing d…
▽ More
Building multimodal language models is fundamentally challenging: it requires aligning vision and language modalities, curating high-quality instruction data, and avoiding the degradation of existing text-only capabilities once vision is introduced. These difficulties are further magnified in the multilingual setting, where the need for multimodal data in different languages exacerbates existing data scarcity, machine translation often distorts meaning, and catastrophic forgetting is more pronounced. To address the aforementioned challenges, we introduce novel techniques spanning both data and modeling. First, we develop a synthetic annotation framework that curates high-quality, diverse multilingual multimodal instruction data, enabling Aya Vision models to produce natural, human-preferred responses to multimodal inputs across many languages. Complementing this, we propose a cross-modal model merging technique that mitigates catastrophic forgetting, effectively preserving text-only capabilities while simultaneously enhancing multimodal generative performance. Aya-Vision-8B achieves best-in-class performance compared to strong multimodal models such as Qwen-2.5-VL-7B, Pixtral-12B, and even much larger Llama-3.2-90B-Vision. We further scale this approach with Aya-Vision-32B, which outperforms models more than twice its size, such as Molmo-72B and LLaMA-3.2-90B-Vision. Our work advances multilingual progress on the multi-modal frontier, and provides insights into techniques that effectively bend the need for compute while delivering extremely high performance.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
The Leaderboard Illusion
Authors:
Shivalika Singh,
Yiyang Nan,
Alex Wang,
Daniel D'Souza,
Sayash Kapoor,
Ahmet Üstün,
Sanmi Koyejo,
Yuntian Deng,
Shayne Longpre,
Noah A. Smith,
Beyza Ermis,
Marzieh Fadaee,
Sara Hooker
Abstract:
Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private test…
▽ More
Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. We show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on the arena distribution, based on our conservative estimates. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena's evaluation framework and promote fairer, more transparent benchmarking for the field
△ Less
Submitted 12 May, 2025; v1 submitted 29 April, 2025;
originally announced April 2025.
-
Unpaired Translation of Chest X-ray Images for Lung Opacity Diagnosis via Adaptive Activation Masks and Cross-Domain Alignment
Authors:
Junzhi Ning,
Dominic Marshall,
Yijian Gao,
Xiaodan Xing Yang Nan,
Yingying Fang,
Sheng Zhang,
Matthieu Komorowski,
Guang Yang
Abstract:
Chest X-ray radiographs (CXRs) play a pivotal role in diagnosing and monitoring cardiopulmonary diseases. However, lung opacities in CXRs frequently obscure anatomical structures, impeding clear identification of lung borders and complicating the localization of pathology. This challenge significantly hampers segmentation accuracy and precise lesion identification, which are crucial for diagnosis.…
▽ More
Chest X-ray radiographs (CXRs) play a pivotal role in diagnosing and monitoring cardiopulmonary diseases. However, lung opacities in CXRs frequently obscure anatomical structures, impeding clear identification of lung borders and complicating the localization of pathology. This challenge significantly hampers segmentation accuracy and precise lesion identification, which are crucial for diagnosis. To tackle these issues, our study proposes an unpaired CXR translation framework that converts CXRs with lung opacities into counterparts without lung opacities while preserving semantic features. Central to our approach is the use of adaptive activation masks to selectively modify opacity regions in lung CXRs. Cross-domain alignment ensures translated CXRs without opacity issues align with feature maps and prediction labels from a pre-trained CXR lesion classifier, facilitating the interpretability of the translation process. We validate our method using RSNA, MIMIC-CXR-JPG and JSRT datasets, demonstrating superior translation quality through lower Frechet Inception Distance (FID) and Kernel Inception Distance (KID) scores compared to existing methods (FID: 67.18 vs. 210.4, KID: 0.01604 vs. 0.225). Evaluation on RSNA opacity, MIMIC acute respiratory distress syndrome (ARDS) patient CXRs and JSRT CXRs show our method enhances segmentation accuracy of lung borders and improves lesion classification, further underscoring its potential in clinical settings (RSNA: mIoU: 76.58% vs. 62.58%, Sensitivity: 85.58% vs. 77.03%; MIMIC ARDS: mIoU: 86.20% vs. 72.07%, Sensitivity: 92.68% vs. 86.85%; JSRT: mIoU: 91.08% vs. 85.6%, Sensitivity: 97.62% vs. 95.04%). Our approach advances CXR imaging analysis, especially in investigating segmentation impacts through image translation techniques.
△ Less
Submitted 25 March, 2025;
originally announced March 2025.
-
Revisiting Medical Image Retrieval via Knowledge Consolidation
Authors:
Yang Nan,
Huichi Zhou,
Xiaodan Xing,
Giorgos Papanastasiou,
Lei Zhu,
Zhifan Gao,
Alejandro F Fangi,
Guang Yang
Abstract:
As artificial intelligence and digital medicine increasingly permeate healthcare systems, robust governance frameworks are essential to ensure ethical, secure, and effective implementation. In this context, medical image retrieval becomes a critical component of clinical data management, playing a vital role in decision-making and safeguarding patient information. Existing methods usually learn ha…
▽ More
As artificial intelligence and digital medicine increasingly permeate healthcare systems, robust governance frameworks are essential to ensure ethical, secure, and effective implementation. In this context, medical image retrieval becomes a critical component of clinical data management, playing a vital role in decision-making and safeguarding patient information. Existing methods usually learn hash functions using bottleneck features, which fail to produce representative hash codes from blended embeddings. Although contrastive hashing has shown superior performance, current approaches often treat image retrieval as a classification task, using category labels to create positive/negative pairs. Moreover, many methods fail to address the out-of-distribution (OOD) issue when models encounter external OOD queries or adversarial attacks. In this work, we propose a novel method to consolidate knowledge of hierarchical features and optimisation functions. We formulate the knowledge consolidation by introducing Depth-aware Representation Fusion (DaRF) and Structure-aware Contrastive Hashing (SCH). DaRF adaptively integrates shallow and deep representations into blended features, and SCH incorporates image fingerprints to enhance the adaptability of positive/negative pairings. These blended features further facilitate OOD detection and content-based recommendation, contributing to a secure AI-driven healthcare environment. Moreover, we present a content-guided ranking to improve the robustness and reproducibility of retrieval results. Our comprehensive assessments demonstrate that the proposed method could effectively recognise OOD samples and significantly outperform existing approaches in medical image retrieval (p<0.05). In particular, our method achieves a 5.6-38.9% improvement in mean Average Precision on the anatomical radiology dataset.
△ Less
Submitted 12 March, 2025;
originally announced March 2025.
-
Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric
Authors:
Yuming Yang,
Yang Nan,
Junjie Ye,
Shihan Dou,
Xiao Wang,
Shuo Li,
Huijie Lv,
Mingqi Wu,
Tao Gui,
Qi Zhang,
Xuanjing Huang
Abstract:
Data diversity is crucial for the instruction tuning of large language models. Existing studies have explored various diversity-aware data selection methods to construct high-quality datasets and enhance model performance. However, the fundamental problem of precisely defining and measuring data diversity remains underexplored, limiting clear guidance for data engineering. To address this, we syst…
▽ More
Data diversity is crucial for the instruction tuning of large language models. Existing studies have explored various diversity-aware data selection methods to construct high-quality datasets and enhance model performance. However, the fundamental problem of precisely defining and measuring data diversity remains underexplored, limiting clear guidance for data engineering. To address this, we systematically analyze 11 existing diversity measurement methods by evaluating their correlation with model performance through extensive fine-tuning experiments. Our results indicate that a reliable diversity measure should properly account for both inter-sample differences and the information density in the sample space. Building on this, we propose NovelSum, a new diversity metric based on sample-level "novelty." Experiments on both simulated and real-world data show that NovelSum accurately captures diversity variations and achieves a 0.97 correlation with instruction-tuned model performance, highlighting its value in guiding data engineering practices. With NovelSum as an optimization objective, we further develop a greedy, diversity-oriented data selection strategy that outperforms existing approaches, validating both the effectiveness and practical significance of our metric. The code is available at https://github.com/UmeanNever/NovelSum.
△ Less
Submitted 2 June, 2025; v1 submitted 24 February, 2025;
originally announced February 2025.
-
Augmenting Smart Contract Decompiler Output through Fine-grained Dependency Analysis and LLM-facilitated Semantic Recovery
Authors:
Zeqin Liao,
Yuhong Nan,
Zixu Gao,
Henglong Liang,
Sicheng Hao,
Peifan Reng,
Zibin Zheng
Abstract:
Decompiler is a specialized type of reverse engineering tool extensively employed in program analysis tasks, particularly in program comprehension and vulnerability detection. However, current Solidity smart contract decompilers face significant limitations in reconstructing the original source code. In particular, the bottleneck of SOTA decompilers lies in inaccurate method identification, incorr…
▽ More
Decompiler is a specialized type of reverse engineering tool extensively employed in program analysis tasks, particularly in program comprehension and vulnerability detection. However, current Solidity smart contract decompilers face significant limitations in reconstructing the original source code. In particular, the bottleneck of SOTA decompilers lies in inaccurate method identification, incorrect variable type recovery, and missing contract attributes. These deficiencies hinder downstream tasks and understanding of the program logic. To address these challenges, we propose SmartHalo, a new framework that enhances decompiler output by combining static analysis (SA) and large language models (LLM). SmartHalo leverages the complementary strengths of SA's accuracy in control and data flow analysis and LLM's capability in semantic prediction. More specifically, \system{} constructs a new data structure - Dependency Graph (DG), to extract semantic dependencies via static analysis. Then, it takes DG to create prompts for LLM optimization. Finally, the correctness of LLM outputs is validated through symbolic execution and formal verification. Evaluation on a dataset consisting of 465 randomly selected smart contract methods shows that SmartHalo significantly improves the quality of the decompiled code, compared to SOTA decompilers (e.g., Gigahorse). Notably, integrating GPT-4o with SmartHalo further enhances its performance, achieving precision rates of 87.39% for method boundaries, 90.39% for variable types, and 80.65% for contract attributes.
△ Less
Submitted 16 October, 2025; v1 submitted 15 January, 2025;
originally announced January 2025.
-
INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
Authors:
Angelika Romanou,
Negar Foroutan,
Anna Sotnikova,
Zeming Chen,
Sree Harsha Nelaturu,
Shivalika Singh,
Rishabh Maheshwary,
Micol Altomare,
Mohamed A. Haggag,
Snegha A,
Alfonso Amayuelas,
Azril Hafizi Amirudin,
Viraat Aryabumi,
Danylo Boiko,
Michael Chang,
Jenny Chim,
Gal Cohen,
Aditya Kumar Dalmia,
Abraham Diress,
Sharad Duwal,
Daniil Dzenhaliou,
Daniel Fernando Erazo Florez,
Fabian Farestam,
Joseph Marvin Imperial,
Shayekh Bin Islam
, et al. (34 additional authors not shown)
Abstract:
The performance differential of large language models (LLM) between languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI tools in many communities. However, the development of functional LLMs in many languages (\ie, multilingual LLMs) is bottlenecked by the lack of high-quality evaluation resources in languages other th…
▽ More
The performance differential of large language models (LLM) between languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI tools in many communities. However, the development of functional LLMs in many languages (\ie, multilingual LLMs) is bottlenecked by the lack of high-quality evaluation resources in languages other than English. Moreover, current practices in multilingual benchmark construction often translate English resources, ignoring the regional and cultural knowledge of the environments in which multilingual systems would be used. In this work, we construct an evaluation suite of 197,243 QA pairs from local exam sources to measure the capabilities of multilingual LLMs in a variety of regional contexts. Our novel resource, INCLUDE, is a comprehensive knowledge- and reasoning-centric benchmark across 44 written languages that evaluates multilingual LLMs for performance in the actual language environments where they would be deployed.
△ Less
Submitted 29 November, 2024;
originally announced November 2024.
-
Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward
Authors:
Zhiwei Jia,
Yuesong Nan,
Huixi Zhao,
Gengdai Liu
Abstract:
Recent research has shown that fine-tuning diffusion models (DMs) with arbitrary rewards, including non-differentiable ones, is feasible with reinforcement learning (RL) techniques, enabling flexible model alignment. However, applying existing RL methods to step-distilled DMs is challenging for ultra-fast ($\le2$-step) image generation. Our analysis suggests several limitations of policy-based RL…
▽ More
Recent research has shown that fine-tuning diffusion models (DMs) with arbitrary rewards, including non-differentiable ones, is feasible with reinforcement learning (RL) techniques, enabling flexible model alignment. However, applying existing RL methods to step-distilled DMs is challenging for ultra-fast ($\le2$-step) image generation. Our analysis suggests several limitations of policy-based RL methods such as PPO or DPO toward this goal. Based on the insights, we propose fine-tuning DMs with learned differentiable surrogate rewards. Our method, named LaSRO, learns surrogate reward models in the latent space of SDXL to convert arbitrary rewards into differentiable ones for effective reward gradient guidance. LaSRO leverages pre-trained latent DMs for reward modeling and tailors reward optimization for $\le2$-step image generation with efficient off-policy exploration. LaSRO is effective and stable for improving ultra-fast image generation with different reward objectives, outperforming popular RL methods including DDPO and Diffusion-DPO. We further show LaSRO's connection to value-based RL, providing theoretical insights. See our webpage \href{https://sites.google.com/view/lasro}{here}.
△ Less
Submitted 11 March, 2025; v1 submitted 22 November, 2024;
originally announced November 2024.
-
SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering
Authors:
Tianyu Yang,
Yiyang Nan,
Lisen Dai,
Zhenwen Liang,
Yapeng Tian,
Xiangliang Zhang
Abstract:
Audio-Visual Question Answering (AVQA) is a challenging task that involves answering questions based on both auditory and visual information in videos. A significant challenge is interpreting complex multi-modal scenes, which include both visual objects and sound sources, and connecting them to the given question. In this paper, we introduce the Source-aware Semantic Representation Network (SaSR-N…
▽ More
Audio-Visual Question Answering (AVQA) is a challenging task that involves answering questions based on both auditory and visual information in videos. A significant challenge is interpreting complex multi-modal scenes, which include both visual objects and sound sources, and connecting them to the given question. In this paper, we introduce the Source-aware Semantic Representation Network (SaSR-Net), a novel model designed for AVQA. SaSR-Net utilizes source-wise learnable tokens to efficiently capture and align audio-visual elements with the corresponding question. It streamlines the fusion of audio and visual information using spatial and temporal attention mechanisms to identify answers in multi-modal scenes. Extensive experiments on the Music-AVQA and AVQA-Yang datasets show that SaSR-Net outperforms state-of-the-art AVQA methods.
△ Less
Submitted 10 November, 2024; v1 submitted 7 November, 2024;
originally announced November 2024.
-
Continual Audio-Visual Sound Separation
Authors:
Weiguo Pian,
Yiyang Nan,
Shijian Deng,
Shentong Mo,
Yunhui Guo,
Yapeng Tian
Abstract:
In this paper, we introduce a novel continual audio-visual sound separation task, aiming to continuously separate sound sources for new classes while preserving performance on previously learned classes, with the aid of visual guidance. This problem is crucial for practical visually guided auditory perception as it can significantly enhance the adaptability and robustness of audio-visual sound sep…
▽ More
In this paper, we introduce a novel continual audio-visual sound separation task, aiming to continuously separate sound sources for new classes while preserving performance on previously learned classes, with the aid of visual guidance. This problem is crucial for practical visually guided auditory perception as it can significantly enhance the adaptability and robustness of audio-visual sound separation models, making them more applicable for real-world scenarios where encountering new sound sources is commonplace. The task is inherently challenging as our models must not only effectively utilize information from both modalities in current tasks but also preserve their cross-modal association in old tasks to mitigate catastrophic forgetting during audio-visual continual learning. To address these challenges, we propose a novel approach named ContAV-Sep (\textbf{Cont}inual \textbf{A}udio-\textbf{V}isual Sound \textbf{Sep}aration). ContAV-Sep presents a novel Cross-modal Similarity Distillation Constraint (CrossSDC) to uphold the cross-modal semantic similarity through incremental tasks and retain previously acquired knowledge of semantic similarity in old models, mitigating the risk of catastrophic forgetting. The CrossSDC can seamlessly integrate into the training process of different audio-visual sound separation frameworks. Experiments demonstrate that ContAV-Sep can effectively mitigate catastrophic forgetting and achieve significantly better performance compared to other continual learning baselines for audio-visual sound separation. Code is available at: \url{https://github.com/weiguoPian/ContAV-Sep_NeurIPS2024}.
△ Less
Submitted 5 November, 2024;
originally announced November 2024.
-
Deep Generative Models Unveil Patterns in Medical Images Through Vision-Language Conditioning
Authors:
Xiaodan Xing,
Junzhi Ning,
Yang Nan,
Guang Yang
Abstract:
Deep generative models have significantly advanced medical imaging analysis by enhancing dataset size and quality. Beyond mere data augmentation, our research in this paper highlights an additional, significant capacity of deep generative models: their ability to reveal and demonstrate patterns in medical images. We employ a generative structure with hybrid conditions, combining clinical data and…
▽ More
Deep generative models have significantly advanced medical imaging analysis by enhancing dataset size and quality. Beyond mere data augmentation, our research in this paper highlights an additional, significant capacity of deep generative models: their ability to reveal and demonstrate patterns in medical images. We employ a generative structure with hybrid conditions, combining clinical data and segmentation masks to guide the image synthesis process. Furthermore, we innovatively transformed the tabular clinical data into textual descriptions. This approach simplifies the handling of missing values and also enables us to leverage large pre-trained vision-language models that investigate the relations between independent clinical entries and comprehend general terms, such as gender and smoking status. Our approach differs from and presents a more challenging task than traditional medical report-guided synthesis due to the less visual correlation of our clinical information with the images. To overcome this, we introduce a text-visual embedding mechanism that strengthens the conditions, ensuring the network effectively utilizes the provided information. Our pipeline is generalizable to both GAN-based and diffusion models. Experiments on chest CT, particularly focusing on the smoking status, demonstrated a consistent intensity shift in the lungs which is in agreement with clinical observations, indicating the effectiveness of our method in capturing and visualizing the impact of specific attributes on medical image patterns. Our methods offer a new avenue for the early detection and precise visualization of complex clinical conditions with deep generative models. All codes are https://github.com/junzhin/DGM-VLC.
△ Less
Submitted 17 October, 2024;
originally announced October 2024.
-
MONICA: Benchmarking on Long-tailed Medical Image Classification
Authors:
Lie Ju,
Siyuan Yan,
Yukun Zhou,
Yang Nan,
Xiaodan Xing,
Peibo Duan,
Zongyuan Ge
Abstract:
Long-tailed learning is considered to be an extremely challenging problem in data imbalance learning. It aims to train well-generalized models from a large number of images that follow a long-tailed class distribution. In the medical field, many diagnostic imaging exams such as dermoscopy and chest radiography yield a long-tailed distribution of complex clinical findings. Recently, long-tailed lea…
▽ More
Long-tailed learning is considered to be an extremely challenging problem in data imbalance learning. It aims to train well-generalized models from a large number of images that follow a long-tailed class distribution. In the medical field, many diagnostic imaging exams such as dermoscopy and chest radiography yield a long-tailed distribution of complex clinical findings. Recently, long-tailed learning in medical image analysis has garnered significant attention. However, the field currently lacks a unified, strictly formulated, and comprehensive benchmark, which often leads to unfair comparisons and inconclusive results. To help the community improve the evaluation and advance, we build a unified, well-structured codebase called Medical OpeN-source Long-taIled ClassifiCAtion (MONICA), which implements over 30 methods developed in relevant fields and evaluated on 12 long-tailed medical datasets covering 6 medical domains. Our work provides valuable practical guidance and insights for the field, offering detailed analysis and discussion on the effectiveness of individual components within the inbuilt state-of-the-art methodologies. We hope this codebase serves as a comprehensive and reproducible benchmark, encouraging further advancements in long-tailed medical image learning. The codebase is publicly available on https://github.com/PyJulie/MONICA.
△ Less
Submitted 2 October, 2024;
originally announced October 2024.
-
SmartReco: Detecting Read-Only Reentrancy via Fine-Grained Cross-DApp Analysis
Authors:
Jingwen Zhang,
Zibin Zheng,
Yuhong Nan,
Mingxi Ye,
Kaiwen Ning,
Yu Zhang,
Weizhe Zhang
Abstract:
Despite the increasing popularity of Decentralized Applications (DApps), they are suffering from various vulnerabilities that can be exploited by adversaries for profits. Among such vulnerabilities, Read-Only Reentrancy (called ROR in this paper), is an emerging type of vulnerability that arises from the complex interactions between DApps. In the recent three years, attack incidents of ROR have al…
▽ More
Despite the increasing popularity of Decentralized Applications (DApps), they are suffering from various vulnerabilities that can be exploited by adversaries for profits. Among such vulnerabilities, Read-Only Reentrancy (called ROR in this paper), is an emerging type of vulnerability that arises from the complex interactions between DApps. In the recent three years, attack incidents of ROR have already caused around 30M USD losses to the DApp ecosystem. Existing techniques for vulnerability detection in smart contracts can hardly detect Read-Only Reentrancy attacks, due to the lack of tracking and analyzing the complex interactions between multiple DApps. In this paper, we propose SmartReco, a new framework for detecting Read-Only Reentrancy vulnerability in DApps through a novel combination of static and dynamic analysis (i.e., fuzzing) over smart contracts. The key design behind SmartReco is threefold: (1) SmartReco identifies the boundary between different DApps from the heavy-coupled cross-contract interactions. (2) SmartReco performs fine-grained static analysis to locate points of interest (i.e., entry functions) that may lead to ROR. (3) SmartReco utilizes the on-chain transaction data and performs multi-function fuzzing (i.e., the entry function and victim function) across different DApps to verify the existence of ROR. Our evaluation of a manual-labeled dataset with 45 RORs shows that SmartReco achieves a precision of 88.63% and a recall of 86.36%. In addition, SmartReco successfully detects 43 new RORs from 123 popular DApps. The total assets affected by such RORs reach around 520,000 USD.
△ Less
Submitted 9 December, 2024; v1 submitted 27 September, 2024;
originally announced September 2024.
-
CA-BERT: Leveraging Context Awareness for Enhanced Multi-Turn Chat Interaction
Authors:
Minghao Liu,
Mingxiu Sui,
Yi Nan,
Cangqing Wang,
Zhijie Zhou
Abstract:
Effective communication in automated chat systems hinges on the ability to understand and respond to context. Traditional models often struggle with determining when additional context is necessary for generating appropriate responses. This paper introduces Context-Aware BERT (CA-BERT), a transformer-based model specifically fine-tuned to address this challenge. CA-BERT innovatively applies deep l…
▽ More
Effective communication in automated chat systems hinges on the ability to understand and respond to context. Traditional models often struggle with determining when additional context is necessary for generating appropriate responses. This paper introduces Context-Aware BERT (CA-BERT), a transformer-based model specifically fine-tuned to address this challenge. CA-BERT innovatively applies deep learning techniques to discern context necessity in multi-turn chat interactions, enhancing both the relevance and accuracy of responses.
We describe the development of CA-BERT, which adapts the robust architecture of BERT with a novel training regimen focused on a specialized dataset of chat dialogues. The model is evaluated on its ability to classify context necessity, demonstrating superior performance over baseline BERT models in terms of accuracy and efficiency. Furthermore, CA-BERT's implementation showcases significant reductions in training time and resource usage, making it feasible for real-time applications.
The results indicate that CA-BERT can effectively enhance the functionality of chatbots by providing a nuanced understanding of context, thereby improving user experience and interaction quality in automated systems. This study not only advances the field of NLP in chat applications but also provides a framework for future research into context-sensitive AI developments.
△ Less
Submitted 1 October, 2024; v1 submitted 5 September, 2024;
originally announced September 2024.
-
CONNECTOR: Enhancing the Traceability of Decentralized Bridge Applications via Automatic Cross-chain Transaction Association
Authors:
Dan Lin,
Jiajing Wu,
Yuxin Su,
Ziye Zheng,
Yuhong Nan,
Qinnan Zhang,
Bowen Song,
Zibin Zheng
Abstract:
Decentralized bridge applications are important software that connects various blockchains and facilitates cross-chain asset transfer in the decentralized finance (DeFi) ecosystem which currently operates in a multi-chain environment. Cross-chain transaction association identifies and matches unique transactions executed by bridge DApps, which is important research to enhance the traceability of c…
▽ More
Decentralized bridge applications are important software that connects various blockchains and facilitates cross-chain asset transfer in the decentralized finance (DeFi) ecosystem which currently operates in a multi-chain environment. Cross-chain transaction association identifies and matches unique transactions executed by bridge DApps, which is important research to enhance the traceability of cross-chain bridge DApps. However, existing methods rely entirely on unobservable internal ledgers or APIs, violating the open and decentralized properties of blockchain. In this paper, we analyze the challenges of this issue and then present CONNECTOR, an automated cross-chain transaction association analysis method based on bridge smart contracts. Specifically, CONNECTOR first identifies deposit transactions by extracting distinctive and generic features from the transaction traces of bridge contracts. With the accurate deposit transactions, CONNECTOR mines the execution logs of bridge contracts to achieve withdrawal transaction matching. We conduct real-world experiments on different types of bridges to demonstrate the effectiveness of CONNECTOR. The experiment demonstrates that CONNECTOR successfully identifies 100% deposit transactions, associates 95.81% withdrawal transactions, and surpasses methods for CeFi bridges. Based on the association results, we obtain interesting findings about cross-chain transaction behaviors in DeFi bridges and analyze the tracing abilities of CONNECTOR to assist the DeFi bridge apps.
△ Less
Submitted 19 December, 2024; v1 submitted 7 September, 2024;
originally announced September 2024.
-
Coupling AI and Citizen Science in Creation of Enhanced Training Dataset for Medical Image Segmentation
Authors:
Amir Syahmi,
Xiangrong Lu,
Yinxuan Li,
Haoxuan Yao,
Hanjun Jiang,
Ishita Acharya,
Shiyi Wang,
Yang Nan,
Xiaodan Xing,
Guang Yang
Abstract:
Recent advancements in medical imaging and artificial intelligence (AI) have greatly enhanced diagnostic capabilities, but the development of effective deep learning (DL) models is still constrained by the lack of high-quality annotated datasets. The traditional manual annotation process by medical experts is time- and resource-intensive, limiting the scalability of these datasets. In this work, w…
▽ More
Recent advancements in medical imaging and artificial intelligence (AI) have greatly enhanced diagnostic capabilities, but the development of effective deep learning (DL) models is still constrained by the lack of high-quality annotated datasets. The traditional manual annotation process by medical experts is time- and resource-intensive, limiting the scalability of these datasets. In this work, we introduce a robust and versatile framework that combines AI and crowdsourcing to improve both the quality and quantity of medical image datasets across different modalities. Our approach utilises a user-friendly online platform that enables a diverse group of crowd annotators to label medical images efficiently. By integrating the MedSAM segmentation AI with this platform, we accelerate the annotation process while maintaining expert-level quality through an algorithm that merges crowd-labelled images. Additionally, we employ pix2pixGAN, a generative AI model, to expand the training dataset with synthetic images that capture realistic morphological features. These methods are combined into a cohesive framework designed to produce an enhanced dataset, which can serve as a universal pre-processing pipeline to boost the training of any medical deep learning segmentation model. Our results demonstrate that this framework significantly improves model performance, especially when training data is limited.
△ Less
Submitted 20 July, 2025; v1 submitted 4 September, 2024;
originally announced September 2024.
-
Beyond the Hype: A dispassionate look at vision-language models in medical scenario
Authors:
Yang Nan,
Huichi Zhou,
Xiaodan Xing,
Guang Yang
Abstract:
Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across diverse tasks, garnering significant attention in AI communities. However, their performance and reliability in specialized domains such as medicine remain insufficiently assessed. In particular, most assessments over-concentrate on evaluating VLMs based on simple Visual Question Answering…
▽ More
Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across diverse tasks, garnering significant attention in AI communities. However, their performance and reliability in specialized domains such as medicine remain insufficiently assessed. In particular, most assessments over-concentrate on evaluating VLMs based on simple Visual Question Answering (VQA) on multi-modality data, while ignoring the in-depth characteristics of LVLMs. In this study, we introduce RadVUQA, a novel Radiological Visual Understanding and Question Answering benchmark, to comprehensively evaluate existing LVLMs. RadVUQA mainly validates LVLMs across five dimensions: 1) Anatomical understanding, assessing the models' ability to visually identify biological structures; 2) Multimodal comprehension, which involves the capability of interpreting linguistic and visual instructions to produce desired outcomes; 3) Quantitative and spatial reasoning, evaluating the models' spatial awareness and proficiency in combining quantitative analysis with visual and linguistic information; 4) Physiological knowledge, measuring the models' capability to comprehend functions and mechanisms of organs and systems; and 5) Robustness, which assesses the models' capabilities against unharmonized and synthetic data. The results indicate that both generalized LVLMs and medical-specific LVLMs have critical deficiencies with weak multimodal comprehension and quantitative reasoning capabilities. Our findings reveal the large gap between existing LVLMs and clinicians, highlighting the urgent need for more robust and intelligent LVLMs. The code is available at https://github.com/Nandayang/RadVUQA
△ Less
Submitted 9 April, 2025; v1 submitted 16 August, 2024;
originally announced August 2024.
-
Probing Perfection: The Relentless Art of Meddling for Pulmonary Airway Segmentation from HRCT via a Human-AI Collaboration Based Active Learning Method
Authors:
Shiyi Wang,
Yang Nan,
Sheng Zhang,
Federico Felder,
Xiaodan Xing,
Yingying Fang,
Javier Del Ser,
Simon L F Walsh,
Guang Yang
Abstract:
In pulmonary tracheal segmentation, the scarcity of annotated data is a prevalent issue in medical segmentation. Additionally, Deep Learning (DL) methods face challenges: the opacity of 'black box' models and the need for performance enhancement. Our Human-Computer Interaction (HCI) based models (RS_UNet, LC_UNet, UUNet, and WD_UNet) address these challenges by combining diverse query strategies w…
▽ More
In pulmonary tracheal segmentation, the scarcity of annotated data is a prevalent issue in medical segmentation. Additionally, Deep Learning (DL) methods face challenges: the opacity of 'black box' models and the need for performance enhancement. Our Human-Computer Interaction (HCI) based models (RS_UNet, LC_UNet, UUNet, and WD_UNet) address these challenges by combining diverse query strategies with various DL models. We train four HCI models and repeat these steps: (1) Query Strategy: The HCI models select samples that provide the most additional representative information when labeled in each iteration and identify unlabeled samples with the greatest predictive disparity using Wasserstein Distance, Least Confidence, Entropy Sampling, and Random Sampling. (2) Central line correction: Selected samples are used for expert correction of system-generated tracheal central lines in each training round. (3) Update training dataset: Experts update the training dataset after each DL model's training epoch, enhancing the trustworthiness and performance of the models. (4) Model training: The HCI model is trained using the updated dataset and an enhanced UNet version. Experimental results confirm the effectiveness of these HCI-based approaches, showing that WD-UNet, LC-UNet, UUNet, and RS-UNet achieve comparable or superior performance to state-of-the-art DL models. Notably, WD-UNet achieves this with only 15%-35% of the training data, reducing physician annotation time by 65%-85%.
△ Less
Submitted 23 July, 2024; v1 submitted 3 July, 2024;
originally announced July 2024.
-
Learning Pareto Set for Multi-Objective Continuous Robot Control
Authors:
Tianye Shu,
Ke Shang,
Cheng Gong,
Yang Nan,
Hisao Ishibuchi
Abstract:
For a control problem with multiple conflicting objectives, there exists a set of Pareto-optimal policies called the Pareto set instead of a single optimal policy. When a multi-objective control problem is continuous and complex, traditional multi-objective reinforcement learning (MORL) algorithms search for many Pareto-optimal deep policies to approximate the Pareto set, which is quite resource-c…
▽ More
For a control problem with multiple conflicting objectives, there exists a set of Pareto-optimal policies called the Pareto set instead of a single optimal policy. When a multi-objective control problem is continuous and complex, traditional multi-objective reinforcement learning (MORL) algorithms search for many Pareto-optimal deep policies to approximate the Pareto set, which is quite resource-consuming. In this paper, we propose a simple and resource-efficient MORL algorithm that learns a continuous representation of the Pareto set in a high-dimensional policy parameter space using a single hypernet. The learned hypernet can directly generate various well-trained policy networks for different user preferences. We compare our method with two state-of-the-art MORL algorithms on seven multi-objective continuous robot control problems. Experimental results show that our method achieves the best overall performance with the least training parameters. An interesting observation is that the Pareto set is well approximated by a curved line or surface in a high-dimensional parameter space. This observation will provide insight for researchers to design new MORL algorithms.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
Fuzzy Attention-based Border Rendering Network for Lung Organ Segmentation
Authors:
Sheng Zhang,
Yang Nan,
Yingying Fang,
Shiyi Wang,
Xiaodan Xing,
Zhifan Gao,
Guang Yang
Abstract:
Automatic lung organ segmentation on CT images is crucial for lung disease diagnosis. However, the unlimited voxel values and class imbalance of lung organs can lead to false-negative/positive and leakage issues in advanced methods. Additionally, some slender lung organs are easily lost during the recycled down/up-sample procedure, e.g., bronchioles & arterioles, causing severe discontinuity issue…
▽ More
Automatic lung organ segmentation on CT images is crucial for lung disease diagnosis. However, the unlimited voxel values and class imbalance of lung organs can lead to false-negative/positive and leakage issues in advanced methods. Additionally, some slender lung organs are easily lost during the recycled down/up-sample procedure, e.g., bronchioles & arterioles, causing severe discontinuity issue. Inspired by these, this paper introduces an effective lung organ segmentation method called Fuzzy Attention-based Border Rendering (FABR) network. Since fuzzy logic can handle the uncertainty in feature extraction, hence the fusion of deep networks and fuzzy sets should be a viable solution for better performance. Meanwhile, unlike prior top-tier methods that operate on all regular dense points, our FABR depicts lung organ regions as cube-trees, focusing only on recycle-sampled border vulnerable points, rendering the severely discontinuous, false-negative/positive organ regions with a novel Global-Local Cube-tree Fusion (GLCF) module. All experimental results, on four challenging datasets of airway & artery, demonstrate that our method can achieve the favorable performance significantly.
△ Less
Submitted 1 July, 2024; v1 submitted 23 June, 2024;
originally announced June 2024.
-
SmartAxe: Detecting Cross-Chain Vulnerabilities in Bridge Smart Contracts via Fine-Grained Static Analysis
Authors:
Zeqin Liao,
Yuhong Nan,
Henglong Liang,
Sicheng Hao,
Juan Zhai,
Jiajing Wu,
Zibin Zheng
Abstract:
With the increasing popularity of blockchain, different blockchain platforms coexist in the ecosystem (e.g., Ethereum, BNB, EOSIO, etc.), which prompts the high demand for cross-chain communication. Cross-chain bridge is a specific type of decentralized application for asset exchange across different blockchain platforms. Securing the smart contracts of cross-chain bridges is in urgent need, as th…
▽ More
With the increasing popularity of blockchain, different blockchain platforms coexist in the ecosystem (e.g., Ethereum, BNB, EOSIO, etc.), which prompts the high demand for cross-chain communication. Cross-chain bridge is a specific type of decentralized application for asset exchange across different blockchain platforms. Securing the smart contracts of cross-chain bridges is in urgent need, as there are a number of recent security incidents with heavy financial losses caused by vulnerabilities in bridge smart contracts, as we call them Cross-Chain Vulnerabilities (CCVs). However, automatically identifying CCVs in smart contracts poses several unique challenges. Particularly, it is non-trivial to (1) identify application-specific access control constraints needed for cross-bridge asset exchange, and (2) identify inconsistent cross-chain semantics between the two sides of the bridge.
In this paper, we propose SmartAxe, a new framework to identify vulnerabilities in cross-chain bridge smart contracts. Particularly, to locate vulnerable functions that have access control incompleteness, SmartAxe models the heterogeneous implementations of access control and finds necessary security checks in smart contracts through probabilistic pattern inference. Besides, SmartAxe constructs cross-chain control-flow graph (xCFG) and data-flow graph (xDFG), which help to find semantic inconsistency during cross-chain data communication. To evaluate SmartAxe, we collect and label a dataset of 88 CCVs from real-attacks cross-chain bridge contracts. Evaluation results show that SmartAxe achieves a precision of 84.95% and a recall of 89.77%. In addition, SmartAxe successfully identifies 232 new/unknown CCVs from 129 real-world cross-chain bridge applications (i.e., from 1,703 smart contracts). These identified CCVs affect a total amount of digital assets worth 1,885,250 USD.
△ Less
Submitted 22 June, 2024;
originally announced June 2024.
-
SmartState: Detecting State-Reverting Vulnerabilities in Smart Contracts via Fine-Grained State-Dependency Analysis
Authors:
Zeqin Liao,
Sicheng Hao,
Yuhong Nan,
Zibin Zheng
Abstract:
Smart contracts written in Solidity are widely used in different blockchain platforms such as Ethereum, TRON and BNB Chain. One of the unique designs in Solidity smart contracts is its state-reverting mechanism for error handling and access control. Unfortunately, a number of recent security incidents showed that adversaries also utilize this mechanism to manipulate critical states of smart contra…
▽ More
Smart contracts written in Solidity are widely used in different blockchain platforms such as Ethereum, TRON and BNB Chain. One of the unique designs in Solidity smart contracts is its state-reverting mechanism for error handling and access control. Unfortunately, a number of recent security incidents showed that adversaries also utilize this mechanism to manipulate critical states of smart contracts, and hence, bring security consequences such as illegal profit-gain and Deny-of-Service (DoS). In this paper, we call such vulnerabilities as the State-reverting Vulnerability (SRV). Automatically identifying SRVs poses unique challenges, as it requires an in-depth analysis and understanding of the state-dependency relations in smart contracts.
This paper presents SmartState, a new framework for detecting state-reverting vulnerability in Solidity smart contracts via fine-grained state-dependency analysis. SmartState integrates a set of novel mechanisms to ensure its effectiveness. Particularly, Smart-State extracts state dependencies from both contract bytecode and historical transactions. Both of them are critical for inferring dependencies related to SRVs. Further, SmartState models the generic patterns of SRVs (i.e., profit-gain and DoS) as SRV indicators, and hence effectively identify SRVs based on the constructed state-dependency graph. To evaluate SmartState, we manually annotated a ground-truth dataset which contains 91 SRVs in the real world. Evaluation results showed that SmartState achieves a precision of 87.23% and a recall of 89.13%. In addition, SmartState successfully identifies 406 new SRVs from 47,351 real-world smart contracts. 11 of these SRVs are from popular smart contracts with high transaction amounts (i.e., top 2000). In total, our reported SRVs affect a total amount of digital assets worth 428,600 USD.
△ Less
Submitted 22 June, 2024;
originally announced June 2024.
-
Beyond Boundaries: Learning a Universal Entity Taxonomy across Datasets and Languages for Open Named Entity Recognition
Authors:
Yuming Yang,
Wantong Zhao,
Caishuang Huang,
Junjie Ye,
Xiao Wang,
Huiyuan Zheng,
Yang Nan,
Yuran Wang,
Xueying Xu,
Kaixin Huang,
Yunke Zhang,
Tao Gui,
Qi Zhang,
Xuanjing Huang
Abstract:
Open Named Entity Recognition (NER), which involves identifying arbitrary types of entities from arbitrary domains, remains challenging for Large Language Models (LLMs). Recent studies suggest that fine-tuning LLMs on extensive NER data can boost their performance. However, training directly on existing datasets neglects their inconsistent entity definitions and redundant data, limiting LLMs to da…
▽ More
Open Named Entity Recognition (NER), which involves identifying arbitrary types of entities from arbitrary domains, remains challenging for Large Language Models (LLMs). Recent studies suggest that fine-tuning LLMs on extensive NER data can boost their performance. However, training directly on existing datasets neglects their inconsistent entity definitions and redundant data, limiting LLMs to dataset-specific learning and hindering out-of-domain adaptation. To address this, we present B2NERD, a compact dataset designed to guide LLMs' generalization in Open NER under a universal entity taxonomy. B2NERD is refined from 54 existing English and Chinese datasets using a two-step process. First, we detect inconsistent entity definitions across datasets and clarify them by distinguishable label names to construct a universal taxonomy of 400+ entity types. Second, we address redundancy using a data pruning strategy that selects fewer samples with greater category and semantic diversity. Comprehensive evaluation shows that B2NERD significantly enhances LLMs' Open NER capabilities. Our B2NER models, trained on B2NERD, outperform GPT-4 by 6.8-12.0 F1 points and surpass previous methods in 3 out-of-domain benchmarks across 15 datasets and 6 languages. The data, models, and code are publicly available at https://github.com/UmeanNever/B2NER.
△ Less
Submitted 21 April, 2025; v1 submitted 16 June, 2024;
originally announced June 2024.
-
Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition
Authors:
Shijian Deng,
Erin E. Kosloski,
Siddhi Patel,
Zeke A. Barnett,
Yiyang Nan,
Alexander Kaplan,
Sisira Aarukapalli,
William T. Doan,
Matthew Wang,
Harsh Singh,
Pamela R. Rollins,
Yapeng Tian
Abstract:
In this article, we introduce a novel problem of audio-visual autism behavior recognition, which includes social behavior recognition, an essential aspect previously omitted in AI-assisted autism screening research. We define the task at hand as one that is audio-visual autism behavior recognition, which uses audio and visual cues, including any speech present in the audio, to recognize autism-rel…
▽ More
In this article, we introduce a novel problem of audio-visual autism behavior recognition, which includes social behavior recognition, an essential aspect previously omitted in AI-assisted autism screening research. We define the task at hand as one that is audio-visual autism behavior recognition, which uses audio and visual cues, including any speech present in the audio, to recognize autism-related behaviors. To facilitate this new research direction, we collected an audio-visual autism spectrum dataset (AV-ASD), currently the largest video dataset for autism screening using a behavioral approach. It covers an extensive range of autism-associated behaviors, including those related to social communication and interaction. To pave the way for further research on this new problem, we intensively explored leveraging foundation models and multimodal large language models across different modalities. Our experiments on the AV-ASD dataset demonstrate that integrating audio, visual, and speech modalities significantly enhances the performance in autism behavior recognition. Additionally, we explored the use of a post-hoc to ad-hoc pipeline in a multimodal large language model to investigate its potential to augment the model's explanatory capability during autism behavior recognition. We will release our dataset, code, and pre-trained models.
△ Less
Submitted 22 March, 2024;
originally announced June 2024.
-
Deep asymmetric mixture model for unsupervised cell segmentation
Authors:
Yang Nan,
Guang Yang
Abstract:
Automated cell segmentation has become increasingly crucial for disease diagnosis and drug discovery, as manual delineation is excessively laborious and subjective. To address this issue with limited manual annotation, researchers have developed semi/unsupervised segmentation approaches. Among these approaches, the Deep Gaussian mixture model plays a vital role due to its capacity to facilitate co…
▽ More
Automated cell segmentation has become increasingly crucial for disease diagnosis and drug discovery, as manual delineation is excessively laborious and subjective. To address this issue with limited manual annotation, researchers have developed semi/unsupervised segmentation approaches. Among these approaches, the Deep Gaussian mixture model plays a vital role due to its capacity to facilitate complex data distributions. However, these models assume that the data follows symmetric normal distributions, which is inapplicable for data that is asymmetrically distributed. These models also obstacles weak generalization capacity and are sensitive to outliers. To address these issues, this paper presents a novel asymmetric mixture model for unsupervised cell segmentation. This asymmetric mixture model is built by aggregating certain multivariate Gaussian mixture models with log-likelihood and self-supervised-based optimization functions. The proposed asymmetric mixture model outperforms (nearly 2-30% gain in dice coefficient, p<0.05) the existing state-of-the-art unsupervised models on cell segmentation including the segment anything.
△ Less
Submitted 3 June, 2024;
originally announced June 2024.
-
Enhancing Global Sensitivity and Uncertainty Quantification in Medical Image Reconstruction with Monte Carlo Arbitrary-Masked Mamba
Authors:
Jiahao Huang,
Liutao Yang,
Fanwen Wang,
Yang Nan,
Weiwen Wu,
Chengyan Wang,
Kuangyu Shi,
Angelica I. Aviles-Rivero,
Carola-Bibiane Schönlieb,
Daoqiang Zhang,
Guang Yang
Abstract:
Deep learning has been extensively applied in medical image reconstruction, where Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) represent the predominant paradigms, each possessing distinct advantages and inherent limitations: CNNs exhibit linear complexity with local sensitivity, whereas ViTs demonstrate quadratic complexity with global sensitivity. The emerging Mamba has sh…
▽ More
Deep learning has been extensively applied in medical image reconstruction, where Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) represent the predominant paradigms, each possessing distinct advantages and inherent limitations: CNNs exhibit linear complexity with local sensitivity, whereas ViTs demonstrate quadratic complexity with global sensitivity. The emerging Mamba has shown superiority in learning visual representation, which combines the advantages of linear scalability and global sensitivity. In this study, we introduce MambaMIR, an Arbitrary-Masked Mamba-based model with wavelet decomposition for joint medical image reconstruction and uncertainty estimation. A novel Arbitrary Scan Masking (ASM) mechanism "masks out" redundant information to introduce randomness for further uncertainty estimation. Compared to the commonly used Monte Carlo (MC) dropout, our proposed MC-ASM provides an uncertainty map without the need for hyperparameter tuning and mitigates the performance drop typically observed when applying dropout to low-level tasks. For further texture preservation and better perceptual quality, we employ the wavelet transformation into MambaMIR and explore its variant based on the Generative Adversarial Network, namely MambaMIR-GAN. Comprehensive experiments have been conducted for multiple representative medical image reconstruction tasks, demonstrating that the proposed MambaMIR and MambaMIR-GAN outperform other baseline and state-of-the-art methods in different reconstruction tasks, where MambaMIR achieves the best reconstruction fidelity and MambaMIR-GAN has the best perceptual quality. In addition, our MC-ASM provides uncertainty maps as an additional tool for clinicians, while mitigating the typical performance drop caused by the commonly used dropout.
△ Less
Submitted 25 June, 2024; v1 submitted 27 May, 2024;
originally announced May 2024.