-
V2E: Validating Smart Contract Vulnerabilities through Profit-driven Exploit Generation and Execution
Authors:
Jingwen Zhang,
Yuhong Nan,
Kaiwen Ning,
Mingxi Ye,
Wei Li,
Yuming Xiao,
Yuming Feng,
Weizhe Zhang,
Zibin Zheng
Abstract:
Smart contracts are a critical component of blockchain systems. Because smart contracts carry large amounts of digital assets, their security is of critical importance. Although numerous tools have been developed for detecting smart contract vulnerabilities, their effectiveness remains limited, particularly because of the high number of false positives in the reported results. As a result, developers and auditors are often overwhelmed with manually verifying the reported issues. A fundamental reason is that a reported vulnerability may satisfy specific vulnerable patterns yet not actually be exploitable, either because the vulnerable code cannot be triggered or because triggering it does not result in any financial loss.
In this paper, we propose V2E, a new framework for validating whether a reported vulnerability is truly exploitable. The core idea of V2E is to automatically generate an executable Proof-of-Concept exploit (PoC for short) and then assess whether the PoC can trigger the vulnerability and cause real damage (i.e., financial loss). While LLMs have shown proficiency in PoC generation, achieving our task is by no means trivial. Specifically, it is difficult for LLMs to (1) generate and update a PoC that triggers a specific vulnerability and (2) evaluate the PoC's effectiveness in validating an exploitable vulnerability. To this end, V2E automates the whole process through a novel combination of PoC generation, validation, and refinement: (1) V2E generates targeted PoCs by analyzing potential vulnerability paths; (2) V2E verifies the validity of PoCs through triggerability and profitability analysis; (3) V2E iteratively refines the generated PoC based on PoC execution feedback, thereby increasing the chance of confirming the vulnerability. Evaluation on 264 manually labeled contracts shows that V2E outperforms the baseline approach.
Submitted 15 April, 2026;
originally announced April 2026.
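The generate, validate, and refine cycle described in the V2E abstract can be pictured as a small control loop. The sketch below is only an illustration of that control flow under stated assumptions: `ExecutionResult`, `validate_vulnerability`, the three callables, and the `max_rounds` budget are hypothetical names introduced here, not part of V2E, and the real system's PoC generation and profitability analysis are far richer than a single boolean and a number.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExecutionResult:
    """Hypothetical summary of one PoC run."""
    triggered: bool   # did the PoC reach the vulnerable code?
    profit: float     # attacker's net gain after the run
    feedback: str     # execution trace or error text used for refinement


def validate_vulnerability(
    generate: Callable[[], str],
    execute: Callable[[str], ExecutionResult],
    refine: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Optional[str]:
    """Return a PoC that both triggers the issue and is profitable, or None."""
    poc = generate()
    for _ in range(max_rounds):
        result = execute(poc)
        # A report counts as confirmed only if the PoC triggers the
        # vulnerability AND yields a positive profit.
        if result.triggered and result.profit > 0:
            return poc
        # Otherwise feed the execution feedback back into refinement.
        poc = refine(poc, result.feedback)
    return None
```

Returning `None` after the round budget is exhausted means only that no profitable PoC was found within the budget; in this sketch it says nothing stronger about whether the report is a false positive.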
-
4th Workshop on Maritime Computer Vision (MaCVi): Challenge Overview
Authors:
Benjamin Kiefer,
Jan Lukas Augustin,
Jon Muhovič,
Mingi Jeong,
Arnold Wiliem,
Janez Pers,
Matej Kristan,
Alberto Quattrini Li,
Matija Teršek,
Josip Šarić,
Arpita Vats,
Dominik Hildebrand,
Rafia Rahim,
Mahmut Karaaslan,
Arpit Vaishya,
Steve Xie,
Ersin Kaya,
Akib Mashrur,
Tze-Hsiang Tang,
Chun-Ming Tsai,
Jun-Wei Hsieh,
Ming-Ching Chang,
Wonwoo Jo,
Doyeon Lee,
Yusi Cao, et al. (30 additional authors not shown)
Abstract:
The 4th Workshop on Maritime Computer Vision (MaCVi) is organized as part of CVPR 2026. This edition features five benchmark challenges with emphasis on both predictive accuracy and embedded real-time feasibility. This report summarizes the MaCVi 2026 challenge setup, evaluation protocols, datasets, and benchmark tracks, and presents quantitative results, qualitative comparisons, and cross-challenge analyses of emerging method trends. We also include technical reports from top-performing teams to highlight practical design choices and lessons learned across the benchmark suite. Datasets, leaderboards, and challenge resources are available at https://macvi.org/workshop/cvpr26.
Submitted 14 April, 2026;
originally announced April 2026.
-
Dental-TriageBench: Benchmarking Multimodal Reasoning for Hierarchical Dental Triage
Authors:
Ziyi He,
Yushi Feng,
Shuangyu Yang,
Yinghao Zhu,
Xichen Zhang,
Pak Chuen Patrick Tai,
Hei Yuet Lo,
Songying Wu,
Weifa Yang,
Lequan Yu
Abstract:
Dental triage is a safety-critical clinical routing task that requires integrating multimodal clinical information (e.g., patient complaints and radiographic evidence) to determine complete referral plans. We present Dental-TriageBench, the first expert-annotated benchmark for reasoning-driven multimodal dental triage. Built from authentic outpatient workflows, it contains 246 de-identified cases annotated with expert-authored golden reasoning trajectories, together with hierarchical triage labels. We benchmark 19 proprietary, open-source, and medical-domain MLLMs against three junior dentists serving as the human baseline, and find a substantial human-model gap on fine-grained treatment-level triage. Further analyses show that accurate triage requires both complaint and OPG information, and that model errors concentrate on cases with multiple referral domains, where MLLMs tend to produce overly narrow referral sets and omission-heavy errors. Dental-TriageBench provides a realistic testbed for developing multimodal clinical AI systems that are more clinically grounded, coverage-aware, and safer for downstream care.
Submitted 18 March, 2026;
originally announced April 2026.
-
CODO: An Automated Compiler for Comprehensive Dataflow Optimization
Authors:
Weichuang Zhang,
Yiquan Wang,
Xinzhou Zhang,
Chi Zhang,
Yu Feng,
Xiaofeng Hou,
Chao Li,
Jieru Zhao,
Minyi Guo
Abstract:
FPGAs are well-suited for dataflow architectures that process data in a streaming or pipelined manner, thus satisfying the high computational and communication demands of emerging applications. However, manually implementing an efficient dataflow architecture for large-scale applications is still challenging, even for specialists who use high-level synthesis (HLS) to simplify FPGA programming.
To address this, we introduce CODO, an automated compiler that generates feasible and efficient dataflow accelerators on FPGAs. CODO features a systematic method for detecting and eliminating both coarse-grained and fine-grained dataflow violations. Building on this, CODO performs both on- and off-chip data movement optimizations to maximize transfer efficiency. To guarantee a higher design quality, CODO performs automatic scheduling to generate high-performance dataflow accelerators, ensuring a balanced performance-resource trade-off. Synthesis results show that CODO delivers $1.45\times$ to $4.52\times$ latency speedups on typical computation kernels and $3.7\times$ to $33.8\times$ speedups on DNN models compared to SOTA frameworks. In on-board evaluations, CODO achieves $7.3\times$ average speedup on CNN models and $2.07\times$ average speedup on the GPT-2 model over SOTA frameworks. The compiler is open-sourced at https://github.com/sjtu-zhao-lab/codo-artifact.
Submitted 14 April, 2026;
originally announced April 2026.
-
From Attenuation to Attention: Variational Information Flow Manipulation for Fine-Grained Visual Perception
Authors:
Jilong Zhu,
Yang Feng
Abstract:
While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding, they frequently falter in fine-grained perception tasks that require identifying tiny objects or discerning subtle visual relationships. We attribute this limitation to Visual Attenuation: a phenomenon where sparse fine-grained visual signals are prematurely suppressed or diluted by dominant textual tokens during network propagation, resulting in a "loss of focus" during the deep-level decision-making process. Existing input-centric solutions fail to fundamentally reverse this intrinsic mechanism of information loss. To address this challenge, we propose the Variational Information Flow (VIF) framework. Adopting a probabilistic perspective, VIF leverages a Conditional Variational Autoencoder (CVAE) to model the visual saliency relevant to the question-answer pair as a latent distribution. As a plug-and-play module, VIF can be integrated into existing architectures. Extensive evaluations across diverse benchmarks, covering General VQA, fine-grained perception, and visual grounding, demonstrate that VIF yields competitive improvements over previous methods, validating its effectiveness in enhancing the fine-grained perception of MLLMs.
Submitted 14 April, 2026;
originally announced April 2026.
-
SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration
Authors:
Zhuofan Wen,
Yang Feng
Abstract:
Speculative decoding has emerged as a promising approach to accelerate autoregressive inference in large language models (LLMs). Self-draft methods, which leverage the base LLM itself for speculation, avoid the overhead of auxiliary draft models but face limitations: shallow layers often produce overconfident yet incorrect token predictions, and the presence of difficult tokens in a draft sequence forces redundant computation through deeper layers, undermining both draft acceptance and overall speedup. To address these issues, we propose a novel self-draft framework that suppresses spurious confidence via layer-wise temperature annealing in early-exit decisions and adaptively bounds speculation length based on token-wise decoding difficulty. By reprocessing the hidden states of draft tokens in a unified parallel pass through deep layers, our method maintains exact output equivalence with the original model while maximizing computational efficiency. It requires no modifications to the base LLM parameters and achieves up to 2.33x wall-time speedup over standard autoregressive decoding across diverse long-form generation tasks and multiple model architectures.
Submitted 13 April, 2026;
originally announced April 2026.
-
Efficient Training for Cross-lingual Speech Language Models
Authors:
Yan Zhou,
Qingkai Fang,
Yun Hong,
Yang Feng
Abstract:
Currently, large language models (LLMs) predominantly focus on the text modality. To enable more natural human-AI interaction, speech LLMs are emerging, but building effective end-to-end speech LLMs remains challenging due to limited data and the difficulty in expanding to more languages. In this paper, we introduce Cross-lingual Speech Language Model (CSLM), an efficient training method for cross-lingual speech LLMs based on discrete speech tokens. We propose a novel alignment strategy that achieves cross-modal and cross-lingual alignment through continual pre-training. By conducting instruction fine-tuning following a speech-text interleaved chain-of-modality generation process, we enhance modal alignment at a finer granularity, thereby improving generation quality and reducing latency. CSLM aligns different modalities and languages simultaneously without the need for massive speech data, thus exhibiting good language scalability. Evaluations on cross-modal tasks, mono-lingual conversational tasks, and cross-lingual conversational tasks demonstrate CSLM's strong cross-modal alignment capabilities and general task abilities. (Code is available at: https://github.com/ictnlp/CSLM)
Submitted 13 April, 2026;
originally announced April 2026.
-
From Query to Counsel: Structured Reasoning with a Multi-Agent Framework and Dataset for Legal Consultation
Authors:
Mingfei Lu,
Yi Zhang,
Mengjia Wu,
Yue Feng
Abstract:
Legal consultation question answering (Legal CQA) presents unique challenges compared to traditional legal QA tasks, including the scarcity of high-quality training data, complex task composition, and strong contextual dependencies. To address these, we construct JurisCQAD, a large-scale dataset of over 43,000 real-world Chinese legal queries annotated with expert-validated positive and negative responses, and design a structured task decomposition that converts each query into a legal element graph integrating entities, events, intents, and legal issues. We further propose JurisMA, a modular multi-agent framework supporting dynamic routing, statutory grounding, and stylistic optimization. Combined with the element graph, the framework enables strong context-aware reasoning, effectively capturing dependencies across legal facts, norms, and procedural logic. Trained on JurisCQAD and evaluated on a refined LawBench, our system significantly outperforms both general-purpose and legal-domain LLMs across multiple lexical and semantic metrics, demonstrating the benefits of interpretable decomposition and modular collaboration in Legal CQA.
Submitted 12 April, 2026;
originally announced April 2026.
-
Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models
Authors:
Yu Jiang,
Hanwen Jiang,
Ahmed Abdelkader,
Wen-Sheng Chu,
Brandon Y. Feng,
Zhangyang Wang,
Qixing Huang
Abstract:
With the emergence of 3D foundation models, there is growing interest in fine-tuning them for downstream tasks, where LoRA is the dominant fine-tuning paradigm. As 3D datasets exhibit distinct variations in texture, geometry, camera motion, and lighting, there are interesting fundamental questions: 1) Are there LoRA subspaces associated with each type of variation? 2) Are these subspaces disentangled (i.e., orthogonal to each other)? 3) How do we compute them effectively? This paper provides answers to all these questions. We introduce a robust approach that generates synthetic datasets with controlled variations, fine-tunes a LoRA adapter on each dataset, and extracts a LoRA subspace associated with each type of variation. We show that these subspaces are approximately disentangled. Integrating them leads to a reduced LoRA subspace that enables efficient LoRA fine-tuning with improved prediction accuracy for downstream tasks. In particular, we show that such a reduced LoRA subspace, despite being derived entirely from synthetic data, generalizes to real datasets. An ablation study validates the effectiveness of the choices in our approach.
Submitted 11 April, 2026;
originally announced April 2026.
-
MobiFlow: Real-World Mobile Agent Benchmarking through Trajectory Fusion
Authors:
Yunfei Feng,
Xi Zhao,
Cheng Zhang,
Dahu Feng,
Daolin Cheng,
Jianqi Yu,
Yubin Xia,
Erhu Feng
Abstract:
Mobile agents can autonomously complete user-assigned tasks through GUI interactions. However, existing mainstream evaluation benchmarks, such as AndroidWorld, operate by connecting to a system-level Android emulator and provide evaluation signals based on the state of system resources. In real-world mobile-agent scenarios, however, many third-party applications do not expose system-level APIs to determine whether a task has succeeded, leading to a mismatch between benchmarks and real-world usage and making it difficult to evaluate model performance accurately. To address these issues, we propose MobiFlow, an evaluation framework built on tasks drawn from arbitrary third-party applications. Using an efficient graph-construction algorithm based on multi-trajectory fusion, MobiFlow can effectively compress the state space, support dynamic interaction, and better align with real-world third-party application scenarios. MobiFlow covers 20 widely used third-party applications and comprises 240 diverse real-world tasks, with enriched evaluation metrics. Compared with AndroidWorld, MobiFlow's evaluation results show higher alignment with human assessments and can guide the training of future GUI-based models under real workloads.
Submitted 28 February, 2026;
originally announced April 2026.
-
EvoDiagram: Agentic Editable Diagram Creation via Design Expertise Evolution
Authors:
Tianfu Wang,
Leilei Ding,
Ziyang Tao,
Yi Zhan,
Zhiyuan Ma,
Wei Wu,
Yuxuan Lei,
Yuan Feng,
Junyang Wang,
Yin Wu,
Yizhao Xu,
Hongyuan Zhu,
Qi Liu,
Nicholas Jing Yuan,
Yanyong Zhang,
Hui Xiong
Abstract:
High-fidelity diagram creation requires the complex orchestration of semantic topology, visual styling, and spatial layout, posing a significant challenge for automated systems. Existing methods also suffer from a representation gap: pixel-based models often lack precise control, while code-based synthesis limits intuitive flexibility. To bridge this gap, we introduce EvoDiagram, an agentic framework that generates object-level editable diagrams via an intermediate canvas schema. EvoDiagram employs a coordinated multi-agent system to decouple semantic intent from rendering logic, resolving conflicts across heterogeneous design layers. Additionally, we propose a design knowledge evolution mechanism that distills execution traces into a hierarchical memory of domain guidelines, enabling agents to retrieve context-aware expertise adaptively. We further release CanvasBench, a benchmark consisting of both data and metrics for canvas-based diagramming. Extensive experiments demonstrate that EvoDiagram exhibits excellent performance and balance against baselines in generating editable, structurally consistent, and aesthetically coherent diagrams. Our code is available at https://github.com/AuraX-AI/EvoDiagram.
Submitted 20 February, 2026;
originally announced April 2026.
-
CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation
Authors:
Yushi Feng,
Junye Du,
Qifan Wang,
Zizhan Ma,
Qian Niu,
Yutaka Matsuo,
Long Feng,
Lequan Yu
Abstract:
Graphical user interface (GUI) agents powered by vision language models (VLMs) are rapidly moving from passive assistance to autonomous operation. However, this unrestricted action space exposes users to severe and irreversible financial, privacy, or social harm. Existing safeguards, which rely on prompt engineering, brittle heuristics, and VLM-as-critic, lack formal verification and user-tunable guarantees. We propose CORA (COnformal Risk-controlled GUI Agent), a post-policy, pre-action safeguarding framework that provides statistical guarantees on harmful executed actions. CORA reformulates safety as selective action execution: we train a Guardian model to estimate action-conditional risk for each proposed step. Rather than thresholding raw scores, we leverage Conformal Risk Control to calibrate an execute/abstain boundary that satisfies a user-specified risk budget, and we route rejected actions to a trainable Diagnostician model, which performs multimodal reasoning over rejected actions to recommend interventions (e.g., confirm, reflect, or abort) that minimize user burden. A Goal-Lock mechanism anchors assessment to a clarified, frozen user intent to resist visual injection attacks. To rigorously evaluate this paradigm, we introduce Phone-Harm, a new benchmark of mobile safety violations with step-level harm labels under real-world settings. Experiments on Phone-Harm and public benchmarks against diverse baselines validate that CORA improves the safety-helpfulness-interruption Pareto frontier, offering a practical, statistically grounded safety paradigm for autonomous GUI execution. Code and benchmark are available at cora-agent.github.io.
Submitted 10 April, 2026;
originally announced April 2026.
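The execute/abstain calibration mentioned in the CORA abstract can be illustrated with the standard conformal risk control rule for bounded, monotone losses: pick the most permissive threshold whose inflated empirical risk, (n / (n + 1)) * risk + B / (n + 1), stays within the budget alpha. The sketch below is a minimal illustration under stated assumptions, not CORA's implementation: the function names, the 0/1 loss (1 when a harmful action would be executed), and the convention that lower Guardian scores mean lower risk are all assumptions made here.

```python
import math


def calibrate_threshold(scores, harmful, alpha, max_loss=1.0):
    """Return the largest score threshold t such that executing every
    calibration action with score <= t satisfies
        (n / (n + 1)) * empirical_risk + max_loss / (n + 1) <= alpha.
    Returns None if even abstaining on everything violates the bound."""
    n = len(scores)
    best = None
    # Candidates: -inf (abstain on everything) plus each observed score.
    for t in [-math.inf] + sorted(set(scores)):
        # Assumed loss: 1 when a harmful action would be executed, else 0.
        losses = [1.0 if (s <= t and h) else 0.0
                  for s, h in zip(scores, harmful)]
        empirical_risk = sum(losses) / n
        bound = (n / (n + 1)) * empirical_risk + max_loss / (n + 1)
        if bound <= alpha:
            best = t
        else:
            break  # the risk is non-decreasing in t, so stop here
    return best


def decide(score, threshold):
    """Execute only when a threshold exists and the score is within it."""
    if threshold is not None and score <= threshold:
        return "execute"
    return "abstain"
```

With a very small calibration set the term `max_loss / (n + 1)` alone can exceed the budget, in which case the sketch returns `None` and every action is routed to abstention.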
-
One Interface, Many Robots: Unified Real-Time Low-Level Motion Planning for Collaborative Arms
Authors:
Yue Feng,
Weicheng Huang,
I-Ming Chen
Abstract:
This paper proposes a common interface for real-time low-level motion planning of collaborative robotic arms, aimed at enabling broader applicability and improved portability across heterogeneous hardware platforms. In previous work, we introduced WinGs Operating Studio (WOS), a middleware solution that abstracts diverse robotic components into uniform software resources and provides a broad suite of language-agnostic APIs. This paper specifically focuses on its minimal yet flexible interface for real-time end-effector trajectory control. By employing an n-degree polynomial interpolator in conjunction with a quadratic programming solver, the proposed method generates smooth, continuously differentiable trajectories with precise position, velocity, and acceleration profiles. We validate our approach in three distinct scenarios. First, in an offline demonstration, a collaborative arm accurately draws various geometric shapes on paper. Second, in an interruptible, low-frequency re-planning setting, a robotic manipulator grasps a dynamic object placed on a moving mobile robot. Finally, we conduct a teleoperation experiment in which one robotic arm controls another to perform a series of dexterous manipulations, confirming the proposed method's reliability, versatility, and ease of use.
Submitted 9 April, 2026;
originally announced April 2026.
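To make the idea of polynomial trajectories with matching position, velocity, and acceleration profiles concrete, here is a minimal one-dimensional sketch. It deliberately swaps in a simpler technique than the abstract describes: a closed-form quintic with zero velocity and acceleration at both ends, rather than an n-degree interpolator combined with a quadratic programming solver. The function name and parameters are hypothetical and are not part of the WOS interface.

```python
def quintic_rest_to_rest(p0, p1, duration, t):
    """Evaluate a rest-to-rest quintic segment from p0 to p1 at time t.

    Returns (position, velocity, acceleration). Velocity and acceleration
    are zero at t = 0 and t = duration, so consecutive segments join
    without jumps in either quantity.
    """
    # Normalised time, clamped to [0, 1].
    tau = min(max(t / duration, 0.0), 1.0)
    # Blend s(tau) = 10 tau^3 - 15 tau^4 + 6 tau^5 and its derivatives.
    s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5
    ds = (30 * tau**2 - 60 * tau**3 + 30 * tau**4) / duration
    dds = (60 * tau - 180 * tau**2 + 120 * tau**3) / duration**2
    delta = p1 - p0
    return p0 + delta * s, delta * ds, delta * dds
```

Sampling this function at the controller's update rate gives a setpoint stream for one axis; a multi-axis end-effector path would apply it per coordinate, and a solver-based formulation would be needed once velocity or acceleration limits have to be enforced.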
-
Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain
Authors:
Hanzhi Liu,
Chaofan Shou,
Hongbo Wen,
Yanju Chen,
Ryan Jingyang Fang,
Yu Feng
Abstract:
Large language model (LLM) agents increasingly rely on third-party API routers to dispatch tool-calling requests across multiple upstream providers. These routers operate as application-layer proxies with full plaintext access to every in-flight JSON payload, yet no provider enforces cryptographic integrity between client and upstream model. We present the first systematic study of this attack surface. We formalize a threat model for malicious LLM API routers and define two core attack classes, payload injection (AC-1) and secret exfiltration (AC-2), together with two adaptive evasion variants: dependency-targeted injection (AC-1.a) and conditional delivery (AC-1.b). Across 28 paid routers purchased from Taobao, Xianyu, and Shopify-hosted storefronts and 400 free routers collected from public communities, we find 1 paid and 8 free routers actively injecting malicious code, 2 deploying adaptive evasion triggers, 17 touching researcher-owned AWS canary credentials, and 1 draining ETH from a researcher-owned private key. Two poisoning studies further show that ostensibly benign routers can be pulled into the same attack surface: a leaked OpenAI key generates 100M GPT-5.4 tokens and more than seven Codex sessions, while weakly configured decoys yield 2B billed tokens, 99 credentials across 440 Codex sessions, and 401 sessions already running in autonomous YOLO mode. We build Mine, a research proxy that implements all four attack classes against four public agent frameworks, and use it to evaluate three deployable client-side defenses: a fail-closed policy gate, response-side anomaly screening, and append-only transparency logging.
Submitted 9 April, 2026;
originally announced April 2026.
-
NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining
Authors:
Zhida Jiang,
Zhaolong Xing,
Huichao Chai,
Tianxing Sun,
Qiang Peng,
Baopeng Yuan,
Jiaxing Wang,
Hua Du,
Zhixin Wu,
Xuemiao Li,
Yikui Cao,
Xinyu Liu,
Yongxiang Feng,
Zhen Chen,
Ke Zhang
Abstract:
Modern recommendation models have increased to trillions of parameters. As cluster scales expand to O(1k), distributed training bottlenecks shift from computation and memory to data movement, especially lookup and communication latency associated with embeddings. Existing solutions either optimize only one bottleneck or improve throughput by sacrificing training consistency. This paper presents NestPipe, a large-scale decentralized embedding training framework that tackles both bottlenecks while preserving synchronous training semantics. NestPipe exploits two hierarchical sparse parallelism opportunities through nested pipelining. At the inter-batch level, Dual-Buffer Pipelining (DBP) constructs a staleness-free five-stage pipeline through dual-buffer synchronization, mitigating lookup bottlenecks without embedding staleness. At the intra-batch level, we identify the embedding freezing phenomenon, which inspires Frozen-Window Pipelining (FWP) to overlap All2All communication with dense computation via coordinated stream scheduling and key-centric sample clustering. Experiments on production GPU and NPU clusters with 1,536 workers demonstrate that NestPipe achieves up to 3.06x speedup and 94.07% scaling efficiency.
Submitted 8 April, 2026;
originally announced April 2026.
-
Digital Skin, Digital Bias: Uncovering Tone-Based Biases in LLMs and Emoji Embeddings
Authors:
Mingchen Li,
Wajdi Aljedaani,
Yingjie Liu,
Navyasri Meka,
Xuan Lu,
Xinyue Ye,
Junhua Ding,
Yunhe Feng
Abstract:
Skin-toned emojis are crucial for fostering personal identity and social inclusion in online communication. As AI models, particularly Large Language Models (LLMs), increasingly mediate interactions on web platforms, the risk that these systems perpetuate societal biases through their representation of such symbols is a significant concern. This paper presents the first large-scale comparative study of bias in skin-toned emoji representations across two distinct model classes. We systematically evaluate dedicated emoji embedding models (emoji2vec, emoji-sw2v) against four modern LLMs (Llama, Gemma, Qwen, and Mistral). Our analysis first reveals a critical performance gap: while LLMs demonstrate robust support for skin tone modifiers, widely-used specialized emoji models exhibit severe deficiencies. More importantly, a multi-faceted investigation into semantic consistency, representational similarity, sentiment polarity, and core biases uncovers systemic disparities. We find evidence of skewed sentiment and inconsistent meanings associated with emojis across different skin tones, highlighting latent biases within these foundational models. Our findings underscore the urgent need for developers and platforms to audit and mitigate these representational harms, ensuring that AI's role on the web promotes genuine equity rather than reinforcing societal biases.
Submitted 8 April, 2026;
originally announced April 2026.
-
SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems
Authors:
Yunhao Feng,
Yifan Ding,
Yingshui Tan,
Boren Zheng,
Yanming Guo,
Xiaolong Li,
Kun Zhai,
Yishan Li,
Wenke Huang
Abstract:
Skill-based agent systems tackle complex tasks by composing reusable skills, improving modularity and scalability while introducing a largely unexamined security attack surface. We propose SkillTrojan, a backdoor attack that targets skill implementations rather than model parameters or training data. SkillTrojan embeds malicious logic inside otherwise plausible skills and leverages standard skill composition to reconstruct and execute an attacker-specified payload. The attack partitions an encrypted payload across multiple benign-looking skill invocations and activates only under a predefined trigger. SkillTrojan also supports automated synthesis of backdoored skills from arbitrary skill templates, enabling scalable propagation across skill-based agent ecosystems. To enable systematic evaluation, we release a dataset of 3,000+ curated backdoored skills spanning diverse skill patterns and trigger-payload configurations. We instantiate SkillTrojan in a representative code-based agent setting and evaluate both clean-task utility and attack success rate. Our results show that skill-level backdoors can be highly effective with minimal degradation of benign behavior, exposing a critical blind spot in current skill-based agent architectures and motivating defenses that explicitly reason about skill composition and execution. Concretely, on EHR SQL, SkillTrojan attains up to 97.2% ASR while maintaining 89.3% clean ACC on GPT-5.2-1211-Global.
Submitted 8 April, 2026;
originally announced April 2026.
-
Delta6: A Low-Cost, 6-DOF Force-Sensing Flexible End-Effector
Authors:
Yue Feng,
Weicheng Huang,
Chen Qiu,
Huixu Dong,
I-Ming Chen
Abstract:
This paper presents Delta6, a low-cost, six-degree-of-freedom (6-DOF) force/torque end-effector that combines antagonistic springs with magnetic encoders to deliver accurate wrench sensing while remaining as simple to assemble as flat-pack furniture. A fully 3D-printed prototype, assembled entirely from off-the-shelf parts, withstands peak forces above ±14.4 N and torques of ±0.33 N·m per axis; these limits can be further extended by leveraging the proposed parametric analytical model. Without calibration, Delta6 attains a 99th-percentile error of 7% full scale (FS). With lightweight sequence models, the error is reduced to 3.8% FS by the best-performing network. Benchmarks on multiple computing platforms confirm that the device's bandwidth is adjustable, enabling balanced trade-offs among update rate, accuracy, and cost, while durability, thermal drift, and zero-calibration tests confirm its robustness. With Delta6 mounted on a robot arm governed by a force-impedance controller, the system successfully performs two contact-rich tasks: buffing curved surfaces and tight assemblies. Experiments validate the design, showing that Delta6 is a robust, low-cost alternative to existing 6-DOF force sensing solutions. Open-source site: https://wings-robotics.github.io/delta6 .
Submitted 7 April, 2026;
originally announced April 2026.
-
Subset Balancing and Generalized Subset Sum via Lattices
Authors:
Yiming Gao,
Yansong Feng,
Honggang Hu,
Yanbin Pan
Abstract:
We study the \emph{Subset Balancing} problem: given $\mathbf{x} \in \mathbb{Z}^n$ and a coefficient set $C \subseteq \mathbb{Z}$, find a nonzero vector $\mathbf{c} \in C^n$ such that $\mathbf{c}\cdot\mathbf{x} = 0$. The standard meet-in-the-middle algorithm runs in time $\tilde{O}(|C|^{n/2})=\tilde{O}(2^{n\log |C|/2})$, and recent improvements (SODA~2022, Chen, Jin, Randolph, and Servedio; STOC~2026, Randolph and Węgrzycki) beyond this barrier apply mainly when $d$ is constant.
We give a reduction from Subset Balancing with $C = \{-d, \dots, d\}$ to a single instance of $\mathrm{SVP}_{\infty}$ in dimension $n+1$, which yields a deterministic algorithm with running time $\tilde{O}((6\sqrt{2\pi e})^n) \approx \tilde{O}(2^{4.632n})$, and a randomized algorithm with running time $\tilde{O}(2^{2.443n})$ (here $\tilde{O}$ suppresses $\operatorname{poly}(n)$ factors). We also show that for sufficiently large $d$, Subset Balancing is solvable in polynomial time. More generally, we extend the box constraint $[-d,d]^n$ to an arbitrary centrally symmetric convex body $K \subseteq \mathbb{R}^n$ with a deterministic $\tilde{O}(2^{c_K n})$-time algorithm, where $c_K$ depends only on the shape of $K$.
We further study the \emph{Generalized Subset Sum} problem of finding $\mathbf{c} \in C^n$ such that $\mathbf{c} \cdot \mathbf{x} = \tau$. For $C = \{-d, \dots, d\}$, we reduce the worst-case problem to a single instance of $\mathrm{CVP}_{\infty}$. Although no general single-exponential-time algorithm is known for exact $\mathrm{CVP}_{\infty}$, we show that in the average-case setting, for both $C = \{-d, \dots, d\}$ and $C = \{-d, \dots, d\} \setminus \{0\}$, the embedded instance satisfies a bounded-distance promise with high probability. This yields a deterministic algorithm running in time $\tilde{O}((18\sqrt{2\pi e})^n) \approx \tilde{O}(2^{6.217n})$.
Submitted 6 April, 2026;
originally announced April 2026.
-
SLSREC: Self-Supervised Contrastive Learning for Adaptive Fusion of Long- and Short-Term User Interests
Authors:
Wei Zhou,
Yue Shen,
Junkai Ji,
Yinglan Feng,
Xing Tang,
Xiuqiang He,
Liang Feng,
Zexuan Zhu
Abstract:
User interests typically encompass both long-term preferences and short-term intentions, reflecting the dynamic nature of user behaviors across different timeframes. The uneven temporal distribution of user interactions highlights the evolving patterns of interests, making it challenging to accurately capture shifts in interests using comprehensive historical behaviors. To address this, we propose SLSRec, a novel Session-based model with the fusion of Long- and Short-term Recommendations that effectively captures the temporal dynamics of user interests by segmenting historical behaviors over time. Unlike conventional models that combine long- and short-term user interests into a single representation, compromising recommendation accuracy, SLSRec utilizes a self-supervised learning framework to disentangle these two types of interests. A contrastive learning strategy is introduced to ensure accurate calibration of long- and short-term interest representations. Additionally, an attention-based fusion network is designed to adaptively aggregate interest representations, optimizing their integration to enhance recommendation performance. Extensive experiments on three public benchmark datasets demonstrate that SLSRec consistently outperforms state-of-the-art models while exhibiting superior robustness across various scenarios. We will release all source code upon acceptance.
Submitted 6 April, 2026;
originally announced April 2026.
-
VA-FastNavi-MARL: Real-Time Robot Control with Multimedia-Driven Meta-Reinforcement Learning
Authors:
Yang Zhang,
Shengxi Jing,
Fengxiang Wang,
Yuan Feng,
Hong Wang
Abstract:
Interpreting dynamic, heterogeneous multimedia commands with real-time responsiveness is critical for Human-Robot Interaction. We present VA-FastNavi-MARL, a framework that aligns asynchronous audio-visual inputs into a unified latent representation. By treating diverse instructions as a distribution of navigable goals via Meta-Reinforcement Learning, our method enables rapid adaptation to unseen directives with negligible inference overhead. Unlike approaches bottlenecked by heavy sensory processing, our modality-agnostic stream ensures seamless, low-latency control. Validation on a multi-arm workspace confirms that VA-FastNavi-MARL significantly outperforms baselines in sample efficiency and maintains robust, real-time execution even under noisy multimedia streams.
Submitted 5 April, 2026;
originally announced April 2026.
-
AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
Authors:
Yunhao Feng,
Yifan Ding,
Yingshui Tan,
Xingjun Ma,
Yige Li,
Yutao Wu,
Yifeng Gao,
Kun Zhai,
Yanming Guo
Abstract:
Computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments. Unlike chat systems, they maintain state across interactions and translate intermediate outputs into concrete actions. This creates a distinct safety challenge in that harmful behavior may emerge through sequences of individually plausible steps, including intermediate actions that appear locally acceptable but collectively lead to unauthorized actions. We present AgentHazard, a benchmark for evaluating harmful behavior in computer-use agents. AgentHazard contains 2,653 instances spanning diverse risk categories and attack strategies. Each instance pairs a harmful objective with a sequence of operational steps that are locally legitimate but jointly induce unsafe behavior. The benchmark evaluates whether agents can recognize and interrupt harm arising from accumulated context, repeated tool use, intermediate actions, and dependencies across steps. We evaluate AgentHazard on Claude Code, OpenClaw, and IFlow using mostly open or openly deployable models from the Qwen3, Kimi, GLM, and DeepSeek families. Our experimental results indicate that current systems remain highly vulnerable. In particular, when powered by Qwen3-Coder, Claude Code exhibits an attack success rate of 73.63%, suggesting that model alignment alone does not reliably guarantee the safety of autonomous agents.
Submitted 3 April, 2026;
originally announced April 2026.
-
Council Mode: Mitigating Hallucination and Bias in LLMs via Multi-Agent Consensus
Authors:
Shuai Wu,
Xue Li,
Yanna Feng,
Yufang Li,
Zhijun Wang
Abstract:
Large Language Models (LLMs), particularly those employing Mixture-of-Experts (MoE) architectures, have achieved remarkable capabilities across diverse natural language processing tasks. However, these models frequently suffer from hallucinations -- generating plausible but factually incorrect content -- and exhibit systematic biases that are amplified by uneven expert activation during inference. In this paper, we propose the Council Mode, a novel multi-agent consensus framework that addresses these limitations by dispatching queries to multiple heterogeneous frontier LLMs in parallel and synthesizing their outputs through a dedicated consensus model. The Council pipeline operates in three phases: (1) an intelligent triage classifier that routes queries based on complexity, (2) parallel expert generation across architecturally diverse models, and (3) a structured consensus synthesis that explicitly identifies agreement, disagreement, and unique findings before producing the final response. We implement and evaluate this architecture within an open-source AI workspace. Our comprehensive evaluation across multiple benchmarks demonstrates that the Council Mode achieves a 35.9% relative reduction in hallucination rates on the HaluEval benchmark and a 7.8-point improvement on TruthfulQA compared to the best-performing individual model, while maintaining significantly lower bias variance across domains. We provide the mathematical formulation of the consensus mechanism, detail the system architecture, and present extensive empirical results with ablation studies.
Submitted 3 April, 2026;
originally announced April 2026.
-
A Visionary Look at Vibe Researching
Authors:
Yebo Feng,
Yang Liu
Abstract:
Vibe researching is an emerging paradigm in which human researchers provide high-level direction and critical judgment while LLM-based agents handle the labor-intensive execution of literature review, experimentation, data analysis, and manuscript drafting. Inspired by the "vibe coding" movement in software engineering, it occupies a middle ground between traditional manual research and fully autonomous AI research systems. This paper defines the concept, describes its methodology (multi-agent architectures, memory, tool use, retrieval-augmented generation, and the human's role as orchestrator), identifies seven technical limitations, weighs its positive and negative societal impacts, and maps each problem to a concrete future direction. Our goal is to provide the research community with a clear and honest map of the territory so that the conversation about responsible adoption can start from shared ground.
Submitted 1 April, 2026;
originally announced April 2026.
-
To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining
Authors:
Karan Singh,
Michael Yu,
Varun Gangal,
Zhuofu Tao,
Sachin Kumar,
Emmy Liu,
Steven Y. Feng
Abstract:
Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM data, while varying both pretraining data scale (1-150x the number of parameters) and retrieval store size (1-20x), and evaluate performance across a diverse suite of benchmarks spanning reasoning, scientific QA, and open-domain QA. We find that retrieval consistently improves performance over parametric-only baselines across model scales and introduce a three-dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval corpus size. This scaling manifold enables us to estimate optimal allocations of a fixed data budget between pretraining and retrieval, revealing that the marginal utility of retrieval depends strongly on model scale, task type, and the degree of pretraining saturation. Our results provide a quantitative foundation for understanding when and how retrieval should complement pretraining, offering practical guidance for allocating data resources in the design of scalable language modeling systems.
Submitted 1 April, 2026;
originally announced April 2026.
-
Scaling Video Pretraining for Surgical Foundation Models
Authors:
Sicheng Lu,
Zikai Xiao,
Jianhui Wei,
Danyu Sun,
Qi Lu,
Keli Hu,
Yang Feng,
Jian Wu,
Zongxin Yang,
Zuozhu Liu
Abstract:
Surgical video understanding is essential for computer-assisted interventions, yet existing surgical foundation models remain constrained by limited data scale, procedural diversity, and inconsistent evaluation, often lacking a reproducible training pipeline. We propose SurgRec, a scalable and reproducible pretraining recipe for surgical video understanding, instantiated with two variants: SurgRec-MAE and SurgRec-JEPA. We curate a large multi-source corpus of 10,535 videos and 214.5M frames spanning endoscopy, laparoscopy, cataract, and robotic surgery. Building on this corpus, we develop a unified pretraining pipeline with balanced sampling and standardize a reproducible benchmark across 16 downstream datasets and four clinical domains with consistent data splits. Across extensive comparisons against SSL baselines and vision-language models, SurgRec consistently achieves superior performance across downstream datasets. In contrast, VLMs prove unreliable for fine-grained temporal recognition, exhibiting both performance gaps and sensitivity to prompt phrasing. Our work provides a reproducible, scalable foundation for the community to build more general surgical video models. All code, models, and data will be publicly released.
Submitted 2 April, 2026; v1 submitted 31 March, 2026;
originally announced March 2026.
-
Drift-Aware Continual Tokenization for Generative Recommendation
Authors:
Yuebo Feng,
Jiahao Liu,
Mingzhe Han,
Dongsheng Li,
Hansu Gu,
Peng Zhang,
Tun Lu,
Ning Gu
Abstract:
Generative recommendation commonly adopts a two-stage pipeline in which a learnable tokenizer maps items to discrete token sequences (i.e. identifiers) and an autoregressive generative recommender model (GRM) performs prediction based on these identifiers. Recent tokenizers further incorporate collaborative signals so that items with similar user-behavior patterns receive similar codes, substantially improving recommendation quality. However, real-world environments evolve continuously: new items cause identifier collision and shifts, while new interactions induce collaborative drift in existing items (e.g., changing co-occurrence patterns and popularity). Fully retraining both tokenizer and GRM is often prohibitively expensive, yet naively fine-tuning the tokenizer can alter token sequences for the majority of existing items, undermining the GRM's learned token-embedding alignment. To balance plasticity and stability for collaborative tokenizers, we propose DACT, a Drift-Aware Continual Tokenization framework with two stages: (i) tokenizer fine-tuning, augmented with a jointly trained Collaborative Drift Identification Module (CDIM) that outputs item-level drift confidence and enables differentiated optimization for drifting and stationary items; and (ii) hierarchical code reassignment using a relaxed-to-strict strategy to update token sequences while limiting unnecessary changes. Experiments on three real-world datasets with two representative GRMs show that DACT consistently achieves better performance than baselines, demonstrating effective adaptation to collaborative evolution with reduced disruption to prior knowledge. Our implementation is publicly available at https://github.com/HomesAmaranta/DACT for reproducibility.
Submitted 31 March, 2026;
originally announced March 2026.
-
Bringing Up a Bilingual BabyLM: Investigating Multilingual Language Acquisition Using Small-Scale Models
Authors:
Linda Zeng,
Steven Y. Feng,
Michael C. Frank
Abstract:
Multilingualism is incredibly common around the world, leading to many important theoretical and practical questions about how children learn multiple languages at once. For example, does multilingual acquisition lead to delays in learning? Are there better and worse ways to structure multilingual input? Many correlational studies address these questions, but it is surprisingly difficult to get definitive answers because children cannot be randomly assigned to be multilingual and data are typically not matched between languages. We use language model training as a method for simulating a variety of highly controlled exposure conditions, and create matched 100M-word mono- and bilingual datasets using synthetic data and machine translation. We train GPT-2 models on monolingual and bilingual data organized to reflect a range of exposure regimes, and evaluate their performance on perplexity, grammaticality, and semantic knowledge. Across model scales and measures, bilingual models perform similarly to monolingual models in one language, but show strong performance in the second language as well. These results suggest that there are no strong differences between different bilingual exposure regimes, and that bilingual input poses no in-principle challenges for agnostic statistical learners.
Submitted 31 March, 2026;
originally announced March 2026.
-
Baby Scale: Investigating Models Trained on Individual Children's Language Input
Authors:
Steven Y. Feng,
Alvin W. M. Tan,
Michael C. Frank
Abstract:
Modern language models (LMs) must be trained on many orders of magnitude more words of training data than human children receive before they begin to produce useful behavior. Assessing the nature and origins of this "data gap" requires benchmarking LMs on human-scale datasets to understand how linguistic knowledge emerges from children's natural training data. Using transcripts from the BabyView dataset (videos from children ages 6-36 months), we investigate (1) scaling performance at child-scale data regimes, (2) variability in model performance across datasets from different children's experiences and linguistic predictors of dataset quality, and (3) relationships between model and child language learning outcomes. LMs trained on child data show acceptable scaling for grammar tasks, but lower scaling on semantic and world knowledge tasks than models trained on synthetic data; we also observe substantial variability on data from different children. Beyond dataset size, performance is most associated with a combination of distributional and interactional linguistic features, broadly consistent with what makes high-quality input for child language development. Finally, model likelihoods for individual words correlate with children's learning of those words, suggesting that properties of child-directed input may influence both model learning and human language development. Overall, understanding what properties make language data efficient for learning can enable more powerful small-scale language models while also shedding light on human language acquisition.
Submitted 31 March, 2026;
originally announced March 2026.
-
Logging Like Humans for LLMs: Rethinking Logging via Execution and Runtime Feedback
Authors:
Xin Wang,
Yang Feng,
Jiaoxiao Qian,
Yang Zhang,
Zhenhao Li,
Zishuo Ding
Abstract:
Logging statements are essential for software debugging and maintenance. However, existing approaches to automatic logging generation rely on static analysis and produce statements in a single pass without considering runtime behavior. They are also typically evaluated by similarity to developer-written logs, assuming these logs form an adequate gold standard. This assumption is increasingly limiting in the LLM era, where logs are consumed not only by developers but also by LLMs for downstream tasks. As a result, optimizing logs for human similarity does not necessarily reflect their practical utility.
To address these limitations, we introduce ReLog, an iterative logging generation framework guided by runtime feedback. ReLog leverages LLMs to generate, execute, evaluate, and refine logging statements so that runtime logs better support downstream tasks. Instead of comparing against developer-written logs, we evaluate ReLog through downstream debugging tasks, including defect localization and repair. We construct a benchmark based on Defects4J under both direct and indirect debugging settings. Results show that ReLog consistently outperforms all baselines, achieving an F1 score of 0.520 and repairing 97 defects in the direct setting, and the best F1 score of 0.408 in the indirect setting where source code is unavailable. Additional experiments across multiple LLMs demonstrate the generality of the framework, while ablations confirm the importance of iterative refinement and compilation repair. Overall, our work reframes logging as a runtime-guided, task-oriented process and advocates evaluating logs by their downstream utility rather than textual similarity.
Submitted 30 March, 2026;
originally announced March 2026.
-
C2RustXW: Program-Structure-Aware C-to-Rust Translation via Program Analysis and LLM
Authors:
Yanyan Yan,
Yang Feng,
Jiangshan Liu,
Di Liu,
Zixi Liu,
Hao Teng,
Baowen Xu
Abstract:
The growing adoption of Rust for its memory safety and performance has increased the demand for effective migration of legacy C codebases. However, existing rule-based translators (e.g., C2Rust) often generate verbose, non-idiomatic code that preserves unsafe C semantics, limiting readability, maintainability, and practical adoption. Moreover, manual post-processing of such outputs is labor-intensive and rarely yields high-quality Rust code, posing a significant barrier to large-scale migration. To address these limitations, we present C2RustXW, a program-structure-aware C-to-Rust translation approach that integrates program analysis with Large Language Models (LLMs). C2RustXW extracts the multi-level program structure, including global symbols, function dependencies, and control- and data-flow information, and encodes these as structured textual representations injected into LLM prompts to guide translation and repair. Based on this design, C2RustXW performs dependency-aware translation and adopts a multi-stage repair pipeline that combines rule-based and structure-guided LLM-based techniques to ensure syntactic correctness. For semantic correctness, C2RustXW further integrates execution-based validation with structure-guided reasoning to localize and repair behavioral inconsistencies. Experimental results show that C2RustXW achieves 100% syntactic correctness on CodeNet and 97.78% on GitHub, while significantly reducing code size (up to 43.70%) and unsafe usage (to 5.75%). At the project level, C2RustXW achieves perfect syntactic correctness and an average semantic correctness of 78.87%, demonstrating its effectiveness for practical and scalable C-to-Rust migration.
Submitted 30 March, 2026;
originally announced March 2026.
-
ViviDoc: Generating Interactive Documents through Human-Agent Collaboration
Authors:
Yinghao Tang,
Yupeng Xie,
Yingchaojie Feng,
Tingfeng Lan,
Jiale Lao,
Yue Cheng,
Wei Chen
Abstract:
Interactive documents help readers engage with complex ideas through dynamic visualization, interactive animations, and exploratory interfaces. However, creating such documents remains costly, as it requires both domain expertise and web development skills. Recent Large Language Model (LLM)-based agents can automate content creation, but directly applying them to interactive document generation often produces outputs that are difficult to control. To address this, we present ViviDoc, to the best of our knowledge the first work to systematically address interactive document generation. ViviDoc introduces a multi-agent pipeline (Planner, Styler, Executor, Evaluator). To make the generation process controllable, we provide three levels of human control: (1) the Document Specification (DocSpec) with SRTC Interaction Specifications (State, Render, Transition, Constraint) for structured planning, (2) a content-aware Style Palette for customizing writing and interaction styles, and (3) chat-based editing for iterative refinement. We also construct ViviBench, a benchmark of 101 topics derived from real-world interactive documents across 11 domains, along with a taxonomy of 8 interaction types and a 4-dimensional automated evaluation framework validated against human ratings (Pearson r > 0.84). Experiments show that ViviDoc achieves the highest content richness and interaction quality in both automated and human evaluation. A 12-person user study confirms that the system is easy to use, provides effective control over the generation process, and produces documents that satisfy users.
Submitted 29 March, 2026;
originally announced March 2026.
-
LongCat-Next: Lexicalizing Modalities as Discrete Tokens
Authors:
Meituan LongCat Team,
Bin Xiao,
Chao Wang,
Chengjiang Li,
Chi Zhang,
Chong Peng,
Hang Yu,
Hao Yang,
Haonan Yan,
Haoze Sun,
Haozhe Zhao,
Hong Liu,
Hui Su,
Jiaqi Zhang,
Jiawei Wang,
Jing Li,
Kefeng Zhang,
Manyuan Zhang,
Minhao Jing,
Peng Pei,
Quan Chen,
Taofeng Xue,
Tongxin Pan,
Xiaotong Li,
Xiaoyang Li,
et al. (64 additional authors not shown)
Abstract:
The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach to effectively reconcile the conflict between understanding and generation. As an attempt toward native multimodality, we open-source LongCat-Next and its tokenizers, hoping to foster further research and development in the community. GitHub: https://github.com/meituan-longcat/LongCat-Next
Submitted 29 March, 2026;
originally announced March 2026.
-
Visualization of Machine Learning Models through Their Spatial and Temporal Listeners
Authors:
Siyu Wu,
Lei Shi,
Lei Xia,
Cenyang Wu,
Zipeng Liu,
Yingchaojie Feng,
Liang Zhou,
Wei Chen
Abstract:
Model visualization (ModelVis) has emerged as a major research direction, yet existing taxonomies are largely organized by data or tasks, making it difficult to treat models as first-class analysis objects. We present a model-centric two-stage framework that employs abstract listeners to capture spatial and temporal model behaviors, and then connects the translated model behavior data to the classical InfoVis pipeline. To apply the framework at scale, we build a retrieval-augmented human–large language model (LLM) extraction workflow and curate a corpus of 128 VIS/VAST ModelVis papers with 331 coded figures. Our analysis shows a dominant result-centric priority on visualizing model outcomes, quantitative/nominal data type, statistical charts, and performance evaluation. Citation-weighted trends further indicate that less frequent model-mechanism-oriented studies have disproportionately high impact but have been less investigated recently. Overall, the framework is a general approach for comparing existing ModelVis systems and guiding possible future designs.
Submitted 29 March, 2026;
originally announced March 2026.
-
An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving
Authors:
Yi Feng,
Junwu E,
Zizhan Guo,
Yu Ma,
Hanli Wang,
Rui Fan
Abstract:
Panoptic occupancy prediction aims to jointly infer voxel-wise semantics and instance identities within a unified 3D scene representation. Nevertheless, progress in this field remains constrained by the absence of high-quality 3D mesh resources, instance-level annotations, and physically consistent occupancy datasets. Existing benchmarks typically provide incomplete and low-resolution geometry without instance-level annotations, limiting the development of models capable of achieving precise geometric reconstruction, reliable occlusion reasoning, and holistic 3D understanding. To address these challenges, this paper presents an instance-centric benchmark for the 3D panoptic occupancy prediction task. Specifically, we introduce ADMesh, the first unified 3D mesh library tailored for autonomous driving, which integrates over 15K high-quality 3D models with diverse textures and rich semantic annotations. Building upon ADMesh, we further construct CarlaOcc, a large-scale, physically consistent panoptic occupancy dataset generated using the CARLA simulator. This dataset contains over 100K frames with fine-grained, instance-level occupancy ground truth at voxel resolutions as fine as 0.05 m. Furthermore, standardized evaluation metrics are introduced to quantify the quality of existing occupancy datasets. Finally, a systematic benchmark of representative models is established on the proposed dataset, which provides a unified platform for fair comparison and reproducible research in the field of 3D panoptic perception. Code and dataset are available at https://mias.group/CarlaOcc.
Submitted 28 March, 2026;
originally announced March 2026.
-
Reasoning-Driven Anomaly Detection and Localization with Image-Level Supervision
Authors:
Yizhou Jin,
Yuezhu Feng,
Jinjin Zhang,
Peng Wang,
Qingjie Liu,
Yunhong Wang
Abstract:
Multimodal large language models (MLLMs) have recently demonstrated remarkable reasoning and perceptual abilities for anomaly detection. However, most approaches remain confined to image-level anomaly detection and textual reasoning, while pixel-level localization still relies on external vision modules and dense annotations. In this work, we activate the intrinsic reasoning potential of MLLMs to perform anomaly detection, pixel-level localization, and interpretable reasoning solely from image-level supervision, without any auxiliary components or pixel-wise labels. Specifically, we propose Reasoning-Driven Anomaly Localization (ReAL), which extracts anomaly-related tokens from the autoregressive reasoning process and aggregates their attention responses to produce pixel-level anomaly maps. We further introduce a Consistency-Guided Reasoning Optimization (CGRO) module that leverages reinforcement learning to align reasoning tokens with visual attentions, resulting in more coherent reasoning and accurate anomaly localization. Extensive experiments on four public benchmarks demonstrate that our method significantly improves anomaly detection, localization, and interpretability. Remarkably, despite relying solely on image-level supervision, our approach achieves performance competitive with MLLM-based methods trained under dense pixel-level supervision. Code is available at https://github.com/YizhouJin313/ReADL.
Submitted 28 March, 2026;
originally announced March 2026.
-
GSR-GNN: Training Acceleration and Memory-Saving Framework of Deep GNNs on Circuit Graph
Authors:
Yuebo Luo,
Shiyang Li,
Yifei Feng,
Vishal Kancharla,
Shaoyi Huang,
Caiwen Ding
Abstract:
Graph Neural Networks (GNNs) show strong promise for circuit analysis, but scaling to modern large-scale circuit graphs is limited by GPU memory and training cost, especially for deep models. We revisit deep GNNs for circuit graphs and show that, when trainable, they significantly outperform shallow architectures, motivating an efficient, domain-specific training framework. We propose Grouped-Sparse-Reversible GNN (GSR-GNN), which enables training GNNs with up to hundreds of layers while reducing both compute and memory overhead. GSR-GNN integrates reversible residual modules with a group-wise sparse nonlinear operator that compresses node embeddings without sacrificing task-relevant information, and employs an optimized execution pipeline to eliminate fragmented activation storage and reduce data movement. On sampled circuit graphs, GSR-GNN achieves up to 87.2% peak memory reduction and over 30× training speedup with negligible degradation in correlation-based quality metrics, making deep GNNs practical for large-scale EDA workloads.
Submitted 28 March, 2026;
originally announced March 2026.
-
Beyond Static Visual Tokens: Structured Sequential Visual Chain-of-Thought Reasoning
Authors:
Guangfu Guo,
Xiaoqian Lu,
Yue Feng,
Mingming Sun
Abstract:
Current multimodal LLMs encode images as static visual prefixes and rely on text-based reasoning, lacking goal-driven and adaptive visual access. Inspired by human visual perception, where attention is selectively and sequentially shifted from the most informative regions to secondary cues, we propose Structural Sequential Visual CoT (SSV-CoT). First, a question-relevant saliency map identifies and organizes key visual regions, explicitly modeling the spatial distribution of visual importance. Second, reasoning is performed following this discriminative order, inducing a curriculum-like semantic progression from primary to secondary cues. This method is trained end-to-end, using text CoT and answer supervision, without relying on region-level annotations or specialized external tools. Experiments on diverse visual reasoning benchmarks show gains, validating structured and sequential visual cognition.
Submitted 21 March, 2026;
originally announced March 2026.
-
Self-Improvement of Large Language Models: A Technical Overview and Future Outlook
Authors:
Haoyan Yang,
Mario Xerri,
Solha Park,
Huajian Zhang,
Yiyang Feng,
Sai Akhil Kogilathota,
Jiawei Zhou
Abstract:
As large language models (LLMs) continue to advance, improving them solely through human supervision is becoming increasingly costly and limited in scalability. As models approach human-level capabilities in certain domains, human feedback may no longer provide sufficiently informative signals for further improvement. At the same time, the growing ability of models to make autonomous decisions and execute complex actions naturally enables abstractions in which components of the model development process can be progressively automated. Together, these challenges and opportunities have driven increasing interest in self-improvement, where models autonomously generate data, evaluate outputs, and iteratively refine their own capabilities. In this paper, we present a system-level perspective on self-improving language models and introduce a unified framework that organizes existing techniques. We conceptualize the self-improvement system as a closed-loop lifecycle, consisting of four tightly coupled processes: data acquisition, data selection, model optimization, and inference refinement, along with an autonomous evaluation layer. Within this framework, the model itself plays a central role in driving each stage: collecting or generating data, selecting informative signals, updating its parameters, and refining outputs, while the autonomous evaluation layer continuously monitors progress and guides the improvement cycle across stages. Following this lifecycle perspective, we systematically review and analyze representative methods for each component from a technical standpoint. We further discuss current limitations and outline our vision for future research toward fully self-improving LLMs.
Submitted 26 March, 2026;
originally announced March 2026.
-
AD-CARE: A Guideline-grounded, Modality-agnostic LLM Agent for Real-world Alzheimer's Disease Diagnosis with Multi-cohort Assessment, Fairness Analysis, and Reader Study
Authors:
Wenlong Hou,
Sheng Bi,
Guangqian Yang,
Lihao Liu,
Ye Du,
Hanxiao Xue,
Juncheng Wang,
Yuxiang Feng,
Yue Xun,
Nanxi Yu,
Ning Mao,
Mo Yang,
Yi Wah Eva Cheung,
Ling Long,
Kay Chen Tan,
Lequan Yu,
Xiaomeng Ma,
Shaozhen Yan,
Shujun Wang
Abstract:
Alzheimer's disease (AD) is a growing global health challenge as populations age, and timely, accurate diagnosis is essential to reduce individual and societal burden. However, real-world AD assessment is hampered by incomplete, heterogeneous multimodal data and variability across sites and patient demographics. Although large language models (LLMs) have shown promise in biomedicine, their use in AD has largely been confined to answering narrow, disease-specific questions rather than generating comprehensive diagnostic reports that support clinical decision-making. Here we expand LLM capabilities for clinical decision support by introducing AD-CARE, a modality-agnostic agent that performs guideline-grounded diagnostic assessment from incomplete, heterogeneous inputs without imputing missing modalities. By dynamically orchestrating specialized diagnostic tools and embedding clinical guidelines into LLM-driven reasoning, AD-CARE generates transparent, report-style outputs aligned with real-world clinical workflows. Across six cohorts comprising 10,303 cases, AD-CARE achieved 84.9% diagnostic accuracy, delivering 4.2%-13.7% relative improvements over baseline methods. Despite cohort-level differences, dataset-specific accuracies remained robust (80.4%-98.8%), and the agent consistently outperformed all baselines. AD-CARE reduced performance disparities across racial and age subgroups, decreasing the average dispersion of four metrics by 21%-68% and 28%-51%, respectively. In a controlled reader study, the agent improved neurologist and radiologist accuracy by 6%-11% and more than halved decision time. The framework yielded 2.29%-10.66% absolute gains over eight backbone LLMs and made their performance converge. These results show that AD-CARE is a scalable, practically deployable framework that can be integrated into routine clinical workflows for multimodal decision support in AD.
Submitted 26 March, 2026;
originally announced March 2026.
-
PosterIQ: A Design Perspective Benchmark for Poster Understanding and Generation
Authors:
Yuheng Feng,
Wen Zhang,
Haodong Duan,
Xingxing Zou
Abstract:
We present PosterIQ, a design-driven benchmark for poster understanding and generation, annotated across composition structure, typographic hierarchy, and semantic intent. It includes 7,765 image-annotation instances and 822 generation prompts spanning real, professional, and synthetic cases. To bridge visual design cognition and generative modeling, we define tasks for layout parsing, text-image correspondence, typography/readability and font perception, design quality assessment, and controllable, composition-aware generation with metaphor. We evaluate state-of-the-art MLLMs and diffusion-based generators, finding persistent gaps in visual hierarchy, typographic semantics, saliency control, and intention communication; commercial models lead on high-level reasoning but act as insensitive automatic raters, while generators render text well yet struggle with composition-aware synthesis. Extensive analyses show PosterIQ is both a quantitative benchmark and a diagnostic tool for design reasoning, offering reproducible, task-specific metrics. We aim to catalyze models' creativity and integrate human-centred design principles into generative vision-language systems.
Submitted 25 March, 2026;
originally announced March 2026.
-
Environment Maps: Structured Environmental Representations for Long-Horizon Agents
Authors:
Yenchia Feng,
Chirag Sharma,
Karime Maamari
Abstract:
Although large language models (LLMs) have advanced rapidly, robust automation of complex software workflows remains an open problem. In long-horizon settings, agents frequently suffer from cascading errors and environmental stochasticity; a single misstep in a dynamic interface can lead to task failure, resulting in hallucinations or trial-and-error. This paper introduces $\textit{Environment Maps}$: a persistent, agent-agnostic representation that mitigates these failures by consolidating heterogeneous evidence, such as screen recordings and execution traces, into a structured graph. The representation consists of four core components: (1) Contexts (abstracted locations), (2) Actions (parameterized affordances), (3) Workflows (observed trajectories), and (4) Tacit Knowledge (domain definitions and reusable procedures). We evaluate this framework on the WebArena benchmark across five domains. Agents equipped with environment maps achieve a 28.2% success rate, nearly doubling the performance of baselines limited to session-bound context (14.2%) and outperforming agents that have access to the raw trajectory data used to generate the environment maps (23.3%). By providing a structured interface between the model and the environment, Environment Maps establish a persistent foundation for long-horizon planning that is human-interpretable, editable, and incrementally refinable.
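The four components listed in the abstract can be pictured as a small graph structure. The following is a minimal sketch only, not the authors' implementation: the class names, fields, example contexts, and the breadth-first `plan` helper are all hypothetical, chosen to illustrate how contexts, actions, workflows, and tacit knowledge might sit together in one persistent object.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Action:
    """Hypothetical parameterized affordance linking two contexts."""
    name: str
    parameters: tuple
    source: str
    target: str


@dataclass
class EnvironmentMap:
    """Hypothetical container for the four components named in the abstract."""
    contexts: set = field(default_factory=set)           # abstracted locations
    actions: list = field(default_factory=list)          # parameterized affordances
    workflows: dict = field(default_factory=dict)        # name -> observed action sequence
    tacit_knowledge: dict = field(default_factory=dict)  # term -> definition or procedure

    def add_action(self, action: Action) -> None:
        # Registering an action also registers the contexts it connects.
        self.contexts.update((action.source, action.target))
        self.actions.append(action)

    def plan(self, start: str, goal: str) -> Optional[list]:
        """Breadth-first search for a shortest action sequence from start to goal."""
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            context, path = queue.popleft()
            if context == goal:
                return path
            for action in self.actions:
                if action.source == context and action.target not in seen:
                    seen.add(action.target)
                    queue.append((action.target, path + [action.name]))
        return None


# Illustrative data only; these contexts and actions are invented.
env = EnvironmentMap()
env.add_action(Action("sign_in", ("username", "password"), "login", "dashboard"))
env.add_action(Action("open_orders", (), "dashboard", "orders"))
env.workflows["check_orders"] = ["sign_in", "open_orders"]
env.tacit_knowledge["order"] = "A purchase record shown on the orders page."

route = env.plan("login", "orders")
```

Because the map lives outside any single session, an agent could in principle look up a route before acting instead of rediscovering it by trial and error, which is the kind of benefit the abstract attributes to a persistent representation.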
Submitted 26 March, 2026; v1 submitted 24 March, 2026;
originally announced March 2026.
-
Double Coupling Architecture and Training Method for Optimization Problems of Differential Algebraic Equations with Parameters
Authors:
Wenqiang Yang,
Wenyuan Wu,
Yong Feng,
Changbo Chen
Abstract:
Simulation and modeling are essential in product development, integrated into the design and manufacturing process to enhance efficiency and quality. They are typically represented as complex nonlinear differential algebraic equations. The growing diversity of product requirements demands multi-task optimization, a key challenge in simulation modeling research. A dual physics-informed neural network architecture has been proposed to decouple constraints and objective functions in parametric differential algebraic equation optimization problems. Theoretical analysis shows that introducing a relaxation variable with a global error bound ensures solution equivalence between the network and optimization problem. A genetic algorithm-enhanced training framework for physics-informed neural networks improves training precision and efficiency, avoiding redundant solving of differential algebraic equations. This approach enables generalization for multi-task objectives with a single training run, maintaining real-time responsiveness to product requirements.
Submitted 23 March, 2026;
originally announced March 2026.
-
CREG: Compass Relational Evidence Graph for Characterizing Directional Structure in VLM Spatial-Reasoning Attribution
Authors:
Kaizhen Tan,
Yang Feng,
Heqing Du
Abstract:
Standard attribution heatmaps show where a vision-language model (VLM) focuses, but they do not reveal whether the recovered evidence is organized by the queried spatial relation or merely reflects image layout. To address this problem, we introduce CREG (Compass Relational Evidence Graph), a training-free diagnostic framework that converts token-level attribution into a reference-centered compass distribution and measures its directional alignment. CREG provides a shared directional readout across attribution methods and makes comparison with geometric controls explicit. Across three spatial-relation benchmarks, box-only geometry achieves Direction Alignment Error more than 30 degrees lower than current model-based attribution methods, leaving a substantial gap between attribution structure and simple target localization. To examine this gap, we apply a diagnostic battery including target intervention, reference-center randomization, and variance partition. Taken together, the results suggest that the directional structure recoverable from current attribution methods is limited and often mixed with image layout. We further find that higher task accuracy does not reliably coincide with better directional attribution: small-scale LoRA training and newer model generations can improve task accuracy while leaving Direction Alignment Error unchanged or worse. These findings characterize what current attribution methods reveal rather than the model's internal spatial representation. CREG provides a controlled protocol for testing whether improvements in spatial reasoning are accompanied by more directionally organized evidence.
Submitted 13 April, 2026; v1 submitted 20 March, 2026;
originally announced March 2026.
-
An Empirical Study of SFT-DPO Interaction and Parameterization in Small Language Models
Authors:
Yuming Feng,
Christy Yang
Abstract:
Direct Preference Optimization (DPO) is widely used after supervised fine-tuning (SFT) to align language models, yet empirical behavior under small backbones and modest data is under-specified. We systematically compare SFT-only, DPO-only, and staged SFT-to-DPO training alongside full fine-tuning (FFT) versus LoRA on a GPT-2-scale decoder, evaluating paraphrase detection and Shakespearean sonnet continuation. DPO yields small, task-dependent gains over strong SFT and can match competitive SFT accuracy without a warm start when the preference construction closely parallels the supervised objective. In contrast, parameterization dominates: FFT consistently outperforms LoRA at matched training depth, and LoRA does not reduce wall-clock time on our hardware. These findings indicate that, in this small-scale regime, supervised full-parameter adaptation remains the primary performance lever, while preference optimization and low-rank adaptation provide limited marginal returns.
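For readers unfamiliar with the objective being compared, the standard per-pair DPO loss can be sketched as below. This is the commonly published form of the loss, not code from the paper; the function name, argument names, and the default `beta` value are assumptions made for illustration, and the inputs are summed sequence log-probabilities under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    # How much more the policy favors each response than the reference does.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # Numerically stable evaluation of -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Illustrative values only: the policy matches the reference, so the margin is zero.
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# The policy raises the chosen response and lowers the rejected one.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss falls as the policy shifts probability toward the chosen response relative to the reference, which is why the abstract can describe DPO matching SFT when the preference pairs closely mirror the supervised targets.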
Submitted 20 March, 2026;
originally announced March 2026.
-
Eye Gaze-Informed and Context-Aware Pedestrian Trajectory Prediction in Shared Spaces with Automated Shuttles: A Virtual Reality Study
Authors:
Danya Li,
Yan Feng,
Rico Krueger
Abstract:
The integration of Automated Shuttles into shared urban spaces presents unique challenges due to the absence of traffic rules and the complex pedestrian interactions. Accurately anticipating pedestrian behavior in such unstructured environments is therefore critical for ensuring both safety and efficiency. This paper presents a Virtual Reality (VR) study that captures how pedestrians interact with automated shuttles across diverse scenarios, including varying approach angles and navigating in continuous traffic. We identify critical behavior patterns present in pedestrians' decision-making in shared spaces, including hesitation, evasive maneuvers, gaze allocation, and proxemic adjustments. To model pedestrian behavior, we propose GazeX-LSTM, a multimodal eye gaze-informed and context-aware prediction model that integrates pedestrians' trajectories, fine-grained eye gaze dynamics, and contextual factors. We shift prediction from a vehicle- to a human-centered perspective by leveraging eye-tracking data to capture pedestrian attention. We systematically validate the unique and irreplaceable predictive power of eye gaze over head orientation alone, further enhancing performance by integrating contextual variables. Notably, the combination of eye gaze data and contextual information produces super-additive improvements on pedestrian behavior prediction accuracy, revealing the complementary relationship between visual attention and situational contexts. Together, our findings provide the first evidence that eye gaze-informed modeling fundamentally advances pedestrian behavior prediction and highlight the critical role of situational contexts in shared-space interactions. This paves the way for safer and more adaptive automated vehicle technologies that account for how people perceive and act in complex shared spaces.
Submitted 20 March, 2026;
originally announced March 2026.
-
PromptHub: Enhancing Multi-Prompt Visual In-Context Learning with Locality-Aware Fusion, Concentration and Alignment
Authors:
Tianci Luo,
Jinpeng Wang,
Shiyu Qin,
Niu Lian,
Yan Feng,
Bin Chen,
Chun Yuan,
Shu-Tao Xia
Abstract:
Visual In-Context Learning (VICL) aims to complete vision tasks by imitating pixel demonstrations. Recent work pioneered prompt fusion that combines the advantages of various demonstrations, which shows a promising way to extend VICL. Unfortunately, the patch-wise fusion framework and model-agnostic supervision hinder the exploitation of informative cues, thereby limiting performance gains. To overcome this deficiency, we introduce PromptHub, a framework that holistically strengthens multi-prompting through locality-aware fusion, concentration and alignment. PromptHub exploits spatial priors to capture richer contextual information, employs complementary concentration, alignment, and prediction objectives to mutually guide training, and incorporates data augmentation to further reinforce supervision. Extensive experiments on three fundamental vision tasks demonstrate the superiority of PromptHub. Moreover, we validate its universality, transferability, and robustness across out-of-distribution settings and various retrieval scenarios. This work establishes a reliable locality-aware paradigm for prompt fusion, moving beyond prior patch-wise approaches. Code is available at https://github.com/luotc-why/ICLR26-PromptHub.
Submitted 19 March, 2026;
originally announced March 2026.
-
Action Draft and Verify: A Self-Verifying Framework for Vision-Language-Action Model
Authors:
Chen Zhao,
Zhuoran Wang,
Haoyang Li,
Shifeng Bao,
Guanlin Li,
Youhe Feng,
Yang Li,
Jie Tang,
Jing Zhang
Abstract:
Vision-Language-Action (VLA) models have recently demonstrated strong performance across embodied tasks. Modern VLAs commonly employ diffusion action experts to efficiently generate high-precision continuous action chunks, while auto-regressive generation can be slower and less accurate at low-level control. Yet auto-regressive paradigms still provide complementary priors that can improve robustness and generalization in out-of-distribution environments. To leverage both paradigms, we propose Action-Draft-and-Verify (ADV): a diffusion action expert drafts multiple candidate action chunks, and the VLM selects one by scoring all candidates in a single forward pass with a perplexity-style metric. Under matched backbones, training data, and action-chunk length, ADV improves success rate by +4.3 points in simulation and +19.7 points in the real world over a diffusion-based baseline, with only a single-pass VLM reranking overhead.
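The draft-then-verify selection step can be illustrated with a small sketch. It assumes, hypothetically, that the verifier has already produced per-token log-probabilities for each drafted action chunk; the abstract does not specify the exact metric, so plain perplexity is used here as a stand-in, and the function names and numbers are invented for illustration.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one candidate: exp of the mean negative log-probability."""
    if not token_logprobs:
        raise ValueError("candidate has no tokens to score")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_candidate(candidate_logprobs: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate with the lowest perplexity."""
    scores = [perplexity(logprobs) for logprobs in candidate_logprobs]
    return min(range(len(scores)), key=scores.__getitem__)


# Hypothetical log-probabilities for three drafted action chunks.
drafts = [
    [-1.2, -0.9, -1.5],
    [-0.3, -0.4, -0.2],
    [-2.0, -0.1, -0.8],
]
best = select_candidate(drafts)
```

Averaging over tokens keeps candidates of different lengths comparable, which is one reason a perplexity-style score is a natural choice for reranking.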
Submitted 18 March, 2026;
originally announced March 2026.
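The draft-then-verify selection described in the ADV abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the input format (one list of per-token log-probabilities per candidate action chunk, as if returned by a VLM scoring pass), and the rule that the lowest perplexity wins are all assumptions made for illustration.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one candidate: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("a candidate needs at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_candidate(candidate_logprobs: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate with the lowest perplexity.

    Assumption: lower perplexity means the verifier finds the candidate
    more plausible. The abstract only says "perplexity-style metric".
    """
    scores = [perplexity(lp) for lp in candidate_logprobs]
    return min(range(len(scores)), key=scores.__getitem__)


# Hypothetical log-probabilities for three drafted action chunks.
candidates = [
    [-2.0, -2.0, -2.0],
    [-0.5, -0.5, -0.5],
    [-1.0, -3.0, -0.2],
]
best = select_candidate(candidates)
print(best)
```

Averaging over tokens keeps the score comparable across candidates of different tokenized lengths, which is one common reason to prefer perplexity over a raw summed log-likelihood.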
-
FailureMem: A Failure-Aware Multimodal Framework for Autonomous Software Repair
Authors:
Ruize Ma,
Yilei Jiang,
Shilin Zhang,
Zheng Ma,
Yi Feng,
Vincent Ng,
Zhi Wang,
Xiangyu Yue,
Chuanyi Li,
Lewei Lu
Abstract:
Multimodal Automated Program Repair (MAPR) extends traditional program repair by requiring models to jointly reason over source code, textual issue descriptions, and visual artifacts such as GUI screenshots. While recent LLM-based repair systems have shown promising results, existing approaches face several limitations: rigid workflow pipelines restrict exploration during debugging, visual reasoning is often performed over full-page screenshots without localized grounding, and failed repair attempts are rarely transformed into reusable knowledge. To address these challenges, we propose FailureMem, a multimodal repair framework that integrates three key mechanisms: a hybrid workflow-agent architecture that balances structured localization with flexible reasoning, active perception tools that enable region-level visual grounding, and a Failure Memory Bank that converts past repair attempts into reusable guidance. Experiments on SWE-bench Multimodal demonstrate that FailureMem improves the resolved rate over GUIRepair by 3.7%.
Submitted 18 March, 2026;
originally announced March 2026.
-
Language on Demand, Knowledge at Core: Composing LLMs with Encoder-Decoder Translation Models for Extensible Multilinguality
Authors:
Mengyu Bu,
Yang Feng
Abstract:
Large language models (LLMs) exhibit strong general intelligence, yet their multilingual performance remains highly imbalanced. Although LLMs encode substantial cross-lingual knowledge in a unified semantic space, they often struggle to reliably interface this knowledge with low-resource or unseen languages. Fortunately, pretrained encoder-decoder translation models already possess balanced multilingual capability, suggesting a natural complement to LLMs. In this work, we propose XBridge, a compositional encoder-LLM-decoder architecture that offloads multilingual understanding and generation to external pretrained translation models, while preserving the LLM as an English-centric core for general knowledge processing. To address the resulting representation misalignment across models, we introduce lightweight cross-model mapping layers and an optimal transport-based alignment objective, enabling fine-grained semantic consistency for multilingual generation. Experiments on four LLMs across multilingual understanding, reasoning, summarization, and generation indicate that XBridge outperforms strong baselines, especially on low-resource and previously unseen languages, without retraining the LLM.
Submitted 6 April, 2026; v1 submitted 18 March, 2026;
originally announced March 2026.