Showing 1–50 of 1,002 results for author: Ma, W

Searching in archive cs.
  1. arXiv:2512.21329

    cs.CL

    Your Reasoning Benchmark May Not Test Reasoning: Revealing Perception Bottleneck in Abstract Reasoning Benchmarks

    Authors: Xinhe Wang, Jin Huang, Xingjian Zhang, Tianhao Wang, Jiaqi W. Ma

    Abstract: Reasoning benchmarks such as the Abstraction and Reasoning Corpus (ARC) and ARC-AGI are widely used to assess progress in artificial intelligence and are often interpreted as probes of core, so-called "fluid" reasoning abilities. Despite their apparent simplicity for humans, these tasks remain challenging for frontier vision-language models (VLMs), a gap commonly attributed to deficiencies in ma…

    Submitted 24 December, 2025; originally announced December 2025.

  2. arXiv:2512.18604

    cs.LG

    Trajectory Planning for UAV-Based Smart Farming Using Imitation-Based Triple Deep Q-Learning

    Authors: Wencan Mao, Quanxi Zhou, Tomas Couso Coddou, Manabu Tsukada, Yunling Liu, Yusheng Ji

    Abstract: Unmanned aerial vehicles (UAVs) have emerged as a promising auxiliary platform for smart agriculture, capable of simultaneously performing weed detection, recognition, and data collection from wireless sensors. However, trajectory planning for UAV-based smart agriculture is challenging due to the high uncertainty of the environment, partial observations, and limited battery capacity of UAVs. To ad…

    Submitted 21 December, 2025; originally announced December 2025.

  3. arXiv:2512.18596

    cs.LG

    EIA-SEC: Improved Actor-Critic Framework for Multi-UAV Collaborative Control in Smart Agriculture

    Authors: Quanxi Zhou, Wencan Mao, Yilei Liang, Manabu Tsukada, Yunling Liu, Jon Crowcroft

    Abstract: The widespread application of wireless communication technology has promoted the development of smart agriculture, where unmanned aerial vehicles (UAVs) play a multifunctional role. We target a multi-UAV smart agriculture system where UAVs cooperatively perform data collection, image acquisition, and communication tasks. In this context, we model a Markov decision process to solve the multi-UAV tr…

    Submitted 21 December, 2025; originally announced December 2025.

  4. arXiv:2512.17495

    cs.CV

    GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation

    Authors: Rang Li, Lei Li, Shuhuai Ren, Hao Tian, Shuhao Gu, Shicheng Li, Zihao Yue, Yudong Wang, Wenhan Ma, Zhe Yang, Jingyuan Ma, Zhifang Sui, Fuli Luo

    Abstract: Visual grounding, localizing objects from natural language descriptions, represents a critical bridge between language and vision understanding. While multimodal large language models (MLLMs) achieve impressive scores on existing benchmarks, a fundamental question remains: can MLLMs truly ground language in vision with human-like sophistication, or are they merely pattern-matching on simplified da…

    Submitted 19 December, 2025; originally announced December 2025.

  5. arXiv:2512.17253

    cs.CV

    Mitty: Diffusion-based Human-to-Robot Video Generation

    Authors: Yiren Song, Cheng Liu, Weijia Mao, Mike Zheng Shou

    Abstract: Learning directly from human demonstration videos is a key milestone toward scalable and generalizable robot learning. Yet existing methods rely on intermediate representations such as keypoints or trajectories, introducing information loss and cumulative errors that harm temporal and visual consistency. We present Mitty, a Diffusion Transformer that enables video In-Context Learning for end-to-en…

    Submitted 19 December, 2025; originally announced December 2025.

  6. arXiv:2512.16917

    cs.AI cs.CL cs.LG

    Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning

    Authors: Qihao Liu, Luoxin Ye, Wufei Ma, Yu-Cheng Chou, Alan Yuille

    Abstract: Large language models (LLMs) with explicit reasoning capabilities excel at mathematical reasoning yet still commit process errors, such as incorrect calculations, brittle logic, and superficially plausible but invalid steps. In this paper, we introduce Generative Adversarial Reasoner, an on-policy joint training framework designed to enhance reasoning by co-evolving an LLM reasoner and an LLM-base…

    Submitted 18 December, 2025; originally announced December 2025.

  7. arXiv:2512.16881

    cs.RO cs.LG

    PolaRiS: Scalable Real-to-Sim Evaluations for Generalist Robot Policies

    Authors: Arhan Jain, Mingtong Zhang, Kanav Arora, William Chen, Marcel Torne, Muhammad Zubair Irshad, Sergey Zakharov, Yue Wang, Sergey Levine, Chelsea Finn, Wei-Chiu Ma, Dhruv Shah, Abhishek Gupta, Karl Pertsch

    Abstract: A significant challenge for robot learning research is our ability to accurately measure and compare the performance of robot policies. Benchmarking in robotics is historically challenging due to the stochasticity, reproducibility, and time-consuming nature of real-world rollouts. This challenge is exacerbated for recent generalist policies, which have to be evaluated across a wide variety of scene…

    Submitted 18 December, 2025; originally announced December 2025.

    Comments: Website: https://polaris-evals.github.io/

  8. arXiv:2512.15430

    cs.LG cs.AI

    FM-EAC: Feature Model-based Enhanced Actor-Critic for Multi-Task Control in Dynamic Environments

    Authors: Quanxi Zhou, Wencan Mao, Manabu Tsukada, John C. S. Lui, Yusheng Ji

    Abstract: Model-based reinforcement learning (MBRL) and model-free reinforcement learning (MFRL) evolve along distinct paths but converge in the design of Dyna-Q [1]. However, modern RL methods still struggle with effective transferability across tasks and scenarios. Motivated by this limitation, we propose a generalized algorithm, Feature Model-Based Enhanced Actor-Critic (FM-EAC), that integrates planning…

    Submitted 17 December, 2025; originally announced December 2025.

  9. arXiv:2512.14200

    cs.CV

    Beyond a Single Light: A Large-Scale Aerial Dataset for Urban Scene Reconstruction Under Varying Illumination

    Authors: Zhuoxiao Li, Wenzong Ma, Taoyu Wu, Jinjing Zhu, Zhenchao Q, Shuai Zhang, Jing Ou, Yinrui Ren, Weiqing Qi, Guobin Shen, Hui Xiong, Wufan Zhao

    Abstract: Recent advances in Neural Radiance Fields and 3D Gaussian Splatting have demonstrated strong potential for large-scale UAV-based 3D reconstruction tasks by fitting the appearance of images. However, real-world large-scale captures are often based on multi-temporal data capture, where illumination inconsistencies across different times of day can significantly lead to color artifacts, geometric ina…

    Submitted 16 December, 2025; originally announced December 2025.

  10. arXiv:2512.12615

    cs.OS

    gpu_ext: Extensible OS Policies for GPUs via eBPF

    Authors: Yusheng Zheng, Tong Yu, Yiwei Yang, Minghui Jiang, Xiangyu Gao, Jianchang Su, Yanpeng Hu, Wenan Mao, Wei Zhang, Dan Williams, Andi Quinn

    Abstract: Performance in modern GPU-centric systems increasingly depends on resource management policies, including memory placement, scheduling, and observability. However, uniform policies typically yield suboptimal performance across diverse workloads. Existing approaches present a tradeoff: user-space runtimes provide programmability and flexibility but lack cross-tenant visibility and fine-grained cont…

    Submitted 20 December, 2025; v1 submitted 14 December, 2025; originally announced December 2025.

  11. arXiv:2512.11480

    cs.CV

    CADMorph: Geometry-Driven Parametric CAD Editing via a Plan-Generate-Verify Loop

    Authors: Weijian Ma, Shizhao Sun, Ruiyu Wang, Jiang Bian

    Abstract: A Computer-Aided Design (CAD) model encodes an object in two coupled forms: a parametric construction sequence and its resulting visible geometric shape. During iterative design, adjustments to the geometric shape inevitably require synchronized edits to the underlying parametric sequence, called geometry-driven parametric CAD editing. The task calls for 1) preserving the original sequence's struc…

    Submitted 12 December, 2025; originally announced December 2025.

    Comments: NeurIPS 2025

  12. arXiv:2512.10702

    cs.AI

    COMPARE: Clinical Optimization with Modular Planning and Assessment via RAG-Enhanced AI-OCT: Superior Decision Support for Percutaneous Coronary Intervention Compared to ChatGPT-5 and Junior Operators

    Authors: Wei Fang, Chiyao Wang, Wenshuai Ma, Hui Liu, Jianqiang Hu, Xiaona Niu, Yi Chu, Mingming Zhang, Jingxiao Yang, Dongwei Zhang, Zelin Li, Pengyun Liu, Jiawei Zheng, Pengke Zhang, Chaoshi Qin, Wangang Guo, Bin Wang, Yugang Xue, Wei Zhang, Zikuan Wang, Rui Zhu, Yihui Cao, Quanmao Lu, Rui Meng, Yan Li

    Abstract: Background: While intravascular imaging, particularly optical coherence tomography (OCT), improves percutaneous coronary intervention (PCI) outcomes, its interpretation is operator-dependent. General-purpose artificial intelligence (AI) shows promise but lacks domain-specific reliability. We evaluated the performance of CA-GPT, a novel large model deployed on an AI-OCT system, against that of the…

    Submitted 11 December, 2025; originally announced December 2025.

  13. arXiv:2512.08944

    cs.CL

    Enhancing Reliability across Short and Long-Form QA via Reinforcement Learning

    Authors: Yudong Wang, Zhe Yang, Wenhan Ma, Zhifang Sui, Liang Zhao

    Abstract: While reinforcement learning has unlocked unprecedented complex reasoning in large language models, it has also amplified their propensity for hallucination, creating a critical trade-off between capability and reliability. This work confronts this challenge by introducing a targeted RL framework designed to mitigate both intrinsic and extrinsic hallucinations across short and long-form question a…

    Submitted 19 November, 2025; originally announced December 2025.

  14. arXiv:2512.07582

    cs.RO

    See Once, Then Act: Vision-Language-Action Model with Task Learning from One-Shot Video Demonstrations

    Authors: Guangyan Chen, Meiling Wang, Qi Shao, Zichen Zhou, Weixin Mao, Te Cui, Minzhao Zhu, Yinan Deng, Luojie Yang, Zhanqi Zhang, Yi Yang, Hua Chen, Yufeng Yue

    Abstract: Developing robust and general-purpose manipulation policies represents a fundamental objective in robotics research. While Vision-Language-Action (VLA) models have demonstrated promising capabilities for end-to-end robot control, existing approaches still exhibit limited generalization to tasks beyond their training distributions. In contrast, humans possess remarkable proficiency in acquiring nov…

    Submitted 8 December, 2025; originally announced December 2025.

  15. arXiv:2512.06013

    cs.CV cs.RO

    VAT: Vision Action Transformer by Unlocking Full Representation of ViT

    Authors: Wenhao Li, Chengwei Ma, Weixin Mao

    Abstract: In robot learning, Vision Transformers (ViTs) are standard for visual perception, yet most methods discard valuable information by using only the final layer's features. We argue this provides an insufficient representation and propose the Vision Action Transformer (VAT), a novel architecture that is extended from ViT and unlocks the full feature hierarchy of ViT. VAT processes specialized action…

    Submitted 3 December, 2025; originally announced December 2025.

  16. arXiv:2512.05593

    cs.CV

    Learning High-Fidelity Cloth Animation via Skinning-Free Image Transfer

    Authors: Rong Wang, Wei Mao, Changsheng Lu, Hongdong Li

    Abstract: We present a novel method for generating 3D garment deformations from given body poses, which is key to a wide range of applications, including virtual try-on and extended reality. To simplify the cloth dynamics, existing methods mostly rely on linear blend skinning to obtain low-frequency posed garment shape and only regress high-frequency wrinkles. However, due to the lack of explicit skinning s…

    Submitted 5 December, 2025; originally announced December 2025.

    Comments: Accepted to 3DV 2026

  17. arXiv:2512.05110

    cs.CV cs.AI cs.GR

    ShadowDraw: From Any Object to Shadow-Drawing Compositional Art

    Authors: Rundong Luo, Noah Snavely, Wei-Chiu Ma

    Abstract: We introduce ShadowDraw, a framework that transforms ordinary 3D objects into shadow-drawing compositional art. Given a 3D object, our system predicts scene parameters, including object pose and lighting, together with a partial line drawing, such that the cast shadow completes the drawing into a recognizable image. To this end, we optimize scene configurations to reveal meaningful shadows, employ…

    Submitted 4 December, 2025; originally announced December 2025.

    Comments: Project page: https://red-fairy.github.io/ShadowDraw/

  18. arXiv:2512.04733

    cs.CV cs.AI

    E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving

    Authors: Yihong Tang, Haicheng Liao, Tong Nie, Junlin He, Ao Qu, Kehua Chen, Wei Ma, Zhenning Li, Lijun Sun, Chengzhong Xu

    Abstract: End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger's emotional state, which is central to comfort and AD acceptance. We introduce Open-Domain End-to-End (OD-E2E) autonomous driving, where an autonomous vehicle (AV) must interpret free-form natural-language commands, infer the emotion, and plan a physically feas…

    Submitted 4 December, 2025; originally announced December 2025.

  19. arXiv:2512.01274

    cs.CL cs.AI cs.LG

    SUPERChem: A Multimodal Reasoning Benchmark in Chemistry

    Authors: Zehua Zhao, Zhixian Huang, Junren Li, Siyu Lin, Junting Zhou, Fengqi Cao, Kun Zhou, Rui Ge, Tingting Long, Yuexiang Zhu, Yan Liu, Jie Zheng, Junnian Wei, Rong Zhu, Peng Zou, Wenyu Li, Zekai Cheng, Tian Ding, Yaxuan Wang, Yizhao Yan, Tingru Wei, Haowei Ming, Weijie Mao, Chen Sun, Yiming Liu, et al. (6 additional authors not shown)

    Abstract: Current benchmarks for evaluating the chemical reasoning capabilities of Large Language Models (LLMs) are limited by oversimplified tasks, lack of process-level evaluation, and misalignment with expert-level chemistry skills. To address these issues, we introduce SUPERChem, a benchmark of 500 expert-curated reasoning-intensive chemistry problems, covering diverse subfields and provided in both mul…

    Submitted 30 November, 2025; originally announced December 2025.

    Comments: 35 pages, 11 figures, 5 tables

  20. arXiv:2511.20256

    cs.CV

    The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation

    Authors: Weijia Mao, Hao Chen, Zhenheng Yang, Mike Zheng Shou

    Abstract: A reliable reward function is essential for reinforcement learning (RL) in image generation. Most current RL approaches depend on pre-trained preference models that output scalar rewards to approximate human preferences. However, these rewards often fail to capture human perception and are vulnerable to reward hacking, where higher scores do not correspond to better images. To address this, we int…

    Submitted 25 November, 2025; originally announced November 2025.

  21. arXiv:2511.19526

    cs.CV

    Perceptual Taxonomy: Evaluating and Guiding Hierarchical Scene Reasoning in Vision-Language Models

    Authors: Jonathan Lee, Xingrui Wang, Jiawei Peng, Luoxin Ye, Zehan Zheng, Tiezheng Zhang, Tao Wang, Wufei Ma, Siyi Chen, Yu-Cheng Chou, Prakhar Kaushik, Alan Yuille

    Abstract: We propose Perceptual Taxonomy, a structured process of scene understanding that first recognizes objects and their spatial configurations, then infers task-relevant properties such as material, affordance, function, and physical attributes to support goal-directed reasoning. While this form of reasoning is fundamental to human cognition, current vision-language benchmarks lack comprehensive evalu…

    Submitted 24 November, 2025; originally announced November 2025.

  22. arXiv:2511.18601

    cs.CV

    RigAnyFace: Scaling Neural Facial Mesh Auto-Rigging with Unlabeled Data

    Authors: Wenchao Ma, Dario Kneubuehler, Maurice Chu, Ian Sachs, Haomiao Jiang, Sharon Xiaolei Huang

    Abstract: In this paper, we present RigAnyFace (RAF), a scalable neural auto-rigging framework for facial meshes of diverse topologies, including those with multiple disconnected components. RAF deforms a static neutral facial mesh into industry-standard FACS poses to form an expressive blendshape rig. Deformations are predicted by a triangulation-agnostic surface learning network augmented with our tailore…

    Submitted 23 November, 2025; originally announced November 2025.

    Comments: Accepted by NeurIPS 2025

  23. arXiv:2511.16662

    cs.CV

    TriDiff-4D: Fast 4D Generation through Diffusion-based Triplane Re-posing

    Authors: Eddie Pokming Sheung, Qihao Liu, Wufei Ma, Prakhar Kaushik, Jianwen Xie, Alan Yuille

    Abstract: With the increasing demand for 3D animation, generating high-fidelity, controllable 4D avatars from textual descriptions remains a significant challenge. Despite notable efforts in 4D generative modeling, existing methods exhibit fundamental limitations that impede their broader applicability, including temporal and geometric inconsistencies, perceptual artifacts, motion irregularities, high compu…

    Submitted 20 November, 2025; originally announced November 2025.

    Comments: 8 pages, 10 figures, Under review at a conference

  24. arXiv:2511.16049

    cs.CV

    LiSTAR: Ray-Centric World Models for 4D LiDAR Sequences in Autonomous Driving

    Authors: Pei Liu, Songtao Wang, Lang Zhang, Xingyue Peng, Yuandong Lyu, Jiaxin Deng, Songxin Lu, Weiliang Ma, Xueyang Zhang, Yifei Zhan, XianPeng Lang, Jun Ma

    Abstract: Synthesizing high-fidelity and controllable 4D LiDAR data is crucial for creating scalable simulation environments for autonomous driving. This task is inherently challenging due to the sensor's unique spherical geometry, the temporal sparsity of point clouds, and the complexity of dynamic scenes. To address these challenges, we present LiSTAR, a novel generative world model that operates directly…

    Submitted 20 November, 2025; originally announced November 2025.

  25. arXiv:2511.14357

    cs.CV

    IBGS: Image-Based Gaussian Splatting

    Authors: Hoang Chuong Nguyen, Wei Mao, Jose M. Alvarez, Miaomiao Liu

    Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a fast, high-quality method for novel view synthesis (NVS). However, its use of low-degree spherical harmonics limits its ability to capture spatially varying color and view-dependent effects such as specular highlights. Existing works augment Gaussians with either a global texture map, which struggles with complex scenes, or per-Gaussian textur…

    Submitted 18 November, 2025; originally announced November 2025.

    Comments: Accepted to NeurIPS 2025

  26. arXiv:2511.14275

    cs.CL

    Don't Miss the Forest for the Trees: In-Depth Confidence Estimation for LLMs via Reasoning over the Answer Space

    Authors: Ante Wang, Weizhi Ma, Yang Liu

    Abstract: Knowing the reliability of a model's response is essential in application. With the strong generation capabilities of LLMs, research has focused on generating verbalized confidence. This is further enhanced by combining chain-of-thought reasoning, which provides logical and transparent estimation. However, how reasoning strategies affect the estimated confidence is still under-explored. In this wo…

    Submitted 18 November, 2025; originally announced November 2025.

  27. arXiv:2511.11944

    cs.CV

    From Events to Clarity: The Event-Guided Diffusion Framework for Dehazing

    Authors: Ling Wang, Yunfan Lu, Wenzong Ma, Huizai Yao, Pengteng Li, Hui Xiong

    Abstract: Clear imaging under hazy conditions is a critical task. Prior-based and neural methods have improved results. However, they operate on RGB frames, which suffer from limited dynamic range. Therefore, dehazing remains ill-posed and can erase structure and illumination details. To address this, we use event cameras for dehazing for the first time. Event cameras offer much higher HDR (…

    Submitted 14 November, 2025; originally announced November 2025.

    Comments: 11 pages, 8 figures. Completed in April 2025

  28. arXiv:2511.11438

    cs.CV

    VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models

    Authors: Mingjie Xu, Jinpeng Chen, Yuzhi Zhao, Jason Chun Lok Li, Yue Qiu, Zekang Du, Mengyang Wu, Pingping Zhang, Kun Li, Hongzheng Yang, Wenao Ma, Jiaheng Wei, Qinbin Li, Kangcheng Liu, Wenqiang Lei

    Abstract: Multimodal large language models (MLLMs) have enabled a wide range of advanced vision-language applications, including fine-grained object recognition and contextual understanding. When querying specific regions or objects in an image, human users naturally use "visual prompts" (VPs), such as bounding boxes, to provide reference. However, no existing benchmark systematically evaluates the ability…

    Submitted 14 November, 2025; originally announced November 2025.

    Comments: This is the extended version of the paper accepted at AAAI 2026, which includes all technical appendices and additional experimental details

  29. arXiv:2511.10356

    cs.AI

    SITA: A Framework for Structure-to-Instance Theorem Autoformalization

    Authors: Chenyi Li, Wanli Ma, Zichen Wang, Zaiwen Wen

    Abstract: While large language models (LLMs) have shown progress in mathematical reasoning, they still face challenges in formalizing theorems that arise from instantiating abstract structures in concrete settings. With the goal of auto-formalizing mathematical results at the research level, we develop a framework for structure-to-instance theorem autoformalization (SITA), which systematically bridges the g…

    Submitted 13 November, 2025; originally announced November 2025.

  30. arXiv:2511.10229

    cs.CL

    LangGPS: Language Separability Guided Data Pre-Selection for Joint Multilingual Instruction Tuning

    Authors: Yangfan Ye, Xiaocheng Feng, Xiachong Feng, Lei Huang, Weitao Ma, Qichen Hong, Yunfei Lu, Duyu Tang, Dandan Tu, Bing Qin

    Abstract: Joint multilingual instruction tuning is a widely adopted approach to improve the multilingual instruction-following ability and downstream performance of large language models (LLMs), but the resulting multilingual capability remains highly sensitive to the composition and selection of the training data. Existing selection methods, often based on features like text quality, diversity, or task rel…

    Submitted 13 November, 2025; originally announced November 2025.

    Comments: AAAI2026 Main Track Accepted

  31. arXiv:2511.09965

    cs.CV

    Equivariant Sampling for Improving Diffusion Model-based Image Restoration

    Authors: Chenxu Wu, Qingpeng Kong, Peiang Zhao, Wendi Yang, Wenxin Ma, Fenghe Tang, Zihang Jiang, S. Kevin Zhou

    Abstract: Recent advances in generative models, especially diffusion models, have significantly improved image restoration (IR) performance. However, existing problem-agnostic diffusion model-based image restoration (DMIR) methods face challenges in fully leveraging diffusion priors, resulting in suboptimal performance. In this paper, we address the limitations of current problem-agnostic DMIR methods by an…

    Submitted 12 November, 2025; originally announced November 2025.

    Comments: 12 pages, 9 figures

  32. arXiv:2511.08035

    cs.LG

    From Sequential to Recursive: Enhancing Decision-Focused Learning with Bidirectional Feedback

    Authors: Xinyu Wang, Jinxiao Du, Yiyang Peng, Wei Ma

    Abstract: Decision-focused learning (DFL) has emerged as a powerful end-to-end alternative to conventional predict-then-optimize (PTO) pipelines by directly optimizing predictive models through downstream decision losses. Existing DFL frameworks are limited by their strictly sequential structure, referred to as sequential DFL (S-DFL). However, S-DFL fails to capture the bidirectional feedback between predic…

    Submitted 27 November, 2025; v1 submitted 11 November, 2025; originally announced November 2025.

    Comments: Accepted by The 40th Annual AAAI Conference on Artificial Intelligence (AAAI-26), Main track

  33. arXiv:2511.07738

    cs.LG cs.CV

    From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training

    Authors: Donglai Xu, Hongzheng Yang, Yuzhi Zhao, Pingping Zhang, Jinpeng Chen, Wenao Ma, Zhijian Hou, Mengyang Wu, Xiaolei Li, Senkang Hu, Ziyi Guan, Jason Chun Lok Li, Lai Man Po

    Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) for Multimodal Large Language Models (MLLMs) is highly dependent on high-quality labeled data, which is often scarce and prone to substantial annotation noise in real-world scenarios. Existing unsupervised RLVR methods, including pure entropy minimization, can overfit to incorrect labels and limit the crucial reward ranking signal for Group-Rel…

    Submitted 10 November, 2025; originally announced November 2025.

  34. arXiv:2511.07006

    cs.LG cs.AI

    S$^2$Drug: Bridging Protein Sequence and 3D Structure in Contrastive Representation Learning for Virtual Screening

    Authors: Bowei He, Bowen Gao, Yankai Chen, Yanyan Lan, Chen Ma, Philip S. Yu, Ya-Qin Zhang, Wei-Ying Ma

    Abstract: Virtual screening (VS) is an essential task in drug discovery, focusing on the identification of small-molecule ligands that bind to specific protein pockets. Existing deep learning methods, from early regression models to recent contrastive learning approaches, primarily rely on structural data while overlooking protein sequences, which are more accessible and can enhance generalizability. Howeve…

    Submitted 10 November, 2025; originally announced November 2025.

    Comments: Accepted by AAAI 2026 Main Technical Track

  35. arXiv:2511.06882

    cs.IT

    Rate-Optimal Streaming Codes Under an Extended Delay Profile for Three-Node Relay Networks With Burst Erasures

    Authors: Zhipeng Li, Wenjie Ma

    Abstract: This paper investigates streaming codes for three-node relay networks under burst packet erasures with a delay constraint $T$. In any sliding window of $T+1$ consecutive packets, the source-to-relay and relay-to-destination channels may introduce burst erasures of lengths at most $b_1$ and $b_2$, respectively. Let $u = \max\{b_1, b_2\}$ and $v = \min\{b_1, b_2\}$. Singhvi et al. proposed a constru…

    Submitted 10 November, 2025; originally announced November 2025.

  36. arXiv:2511.06449

    cs.LG cs.AI

    FLEX: Continuous Agent Evolution via Forward Learning from Experience

    Authors: Zhicheng Cai, Xinyuan Guo, Yu Pei, Jiangtao Feng, Jinsong Su, Jiangjie Chen, Ya-Qin Zhang, Wei-Ying Ma, Mingxuan Wang, Hao Zhou

    Abstract: Autonomous agents driven by Large Language Models (LLMs) have revolutionized reasoning and problem-solving but remain static after training, unable to grow with experience as intelligent beings do during deployment. We introduce Forward Learning with EXperience (FLEX), a gradient-free learning paradigm that enables LLM agents to continuously evolve through accumulated experience. Specifically, FLE…

    Submitted 7 December, 2025; v1 submitted 9 November, 2025; originally announced November 2025.

  37. arXiv:2511.05923

    cs.CV

    Causal Tracing of Object Representations in Large Vision Language Models: Mechanistic Interpretability and Hallucination Mitigation

    Authors: Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Weitao Ma, Xiachong Feng

    Abstract: Despite the remarkable advancements of Large Vision-Language Models (LVLMs), the mechanistic interpretability remains underexplored. Existing analyses are insufficiently comprehensive and lack examination covering visual and textual tokens, model components, and the full range of layers. This limitation restricts actionable insights to improve the faithfulness of model output and the development o…

    Submitted 19 November, 2025; v1 submitted 8 November, 2025; originally announced November 2025.

    Comments: AAAI2026 Oral

  38. arXiv:2511.04977

    cs.CV cs.MM

    GSE: Evaluating Sticker Visual Semantic Similarity via a General Sticker Encoder

    Authors: Heng Er Metilda Chee, Jiayin Wang, Zhiqiang Guo, Weizhi Ma, Min Zhang

    Abstract: Stickers have become a popular form of visual communication, yet understanding their semantic relationships remains challenging due to their highly diverse and symbolic content. In this work, we formally define the Sticker Semantic Similarity task and introduce Triple-S, the first benchmark for this task, consisting of 905 human-annotated positive and negative sticker pairs. Through extensive…

    Submitted 6 November, 2025; originally announced November 2025.

  39. arXiv:2511.04671

    cs.RO cs.AI cs.CV

    X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations

    Authors: Maximus A. Pace, Prithwish Dan, Chuanruo Ning, Atiksh Bhardwaj, Audrey Du, Edward W. Duan, Wei-Chiu Ma, Kushal Kedia

    Abstract: Human videos can be recorded quickly and at scale, making them an appealing source of training data for robot learning. However, humans and robots differ fundamentally in embodiment, resulting in mismatched action execution. Direct kinematic retargeting of human hand motion can therefore produce actions that are physically infeasible for robots. Despite these low-level differences, human demonstra…

    Submitted 6 November, 2025; originally announced November 2025.

  40. arXiv:2510.26583

    cs.CV

    Emu3.5: Native Multimodal Models are World Learners

    Authors: Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, Yueze Wang, Chengyuan Wang, Fan Zhang, Yingli Zhao, Ting Pan, Xianduo Li, Zecheng Hao, Wenxuan Ma, Zhuo Chen, Yulong Ao, Tiejun Huang, Zhongyuan Wang, Xinlong Wang

    Abstract: We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interle…

    Submitted 30 October, 2025; originally announced October 2025.

    Comments: project page: https://emu.world

  41. arXiv:2510.25801

    cs.LG cs.AI cs.CL cs.CV

    Metis-SPECS: Decoupling Multimodal Learning via Self-distilled Preference-based Cold Start

    Authors: Kun Chen, Peng Shi, Haibo Qiu, Zhixiong Zeng, Siqi Yang, Wenji Mao, Lin Ma

    Abstract: Reinforcement learning (RL) with verifiable rewards has recently catalyzed a wave of "MLLM-r1" approaches that bring RL to vision language models. Most representative paradigms begin with a cold start, typically employing supervised fine-tuning (SFT), to initialize the policy before RL. However, SFT-based cold start adopts the reasoning paradigm intertwined with task solution and output format, wh…

    Submitted 18 November, 2025; v1 submitted 28 October, 2025; originally announced October 2025.

    Comments: Project Page: https://github.com/Kwen-Chen/SPECS-VL

  42. arXiv:2510.24345  [pdf, ps, other]

    cs.CL cs.AI

    LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability

    Authors: Zikai Xiao, Fei Huang, Jianhong Tu, Jianhui Wei, Wen Ma, Yuxuan Zhou, Jian Wu, Bowen Yu, Zuozhu Liu, Junyang Lin

    Abstract: Generating long, informative, and factual outputs remains a major challenge for Large Language Models (LLMs). Existing benchmarks for long-form generation typically assess real-world queries with hard-to-verify metrics or use synthetic setups that ease evaluation but overlook real-world intricacies. In this paper, we introduce LongWeave, which balances real-world and verifiable assessment…

    Submitted 28 October, 2025; originally announced October 2025.

    Comments: EMNLP Findings 2025

  43. arXiv:2510.17315  [pdf, ps, other]

    cs.RO

    Implicit State Estimation via Video Replanning

    Authors: Po-Chen Ko, Jiayuan Mao, Yu-Hsiang Fu, Hsien-Jeng Yeh, Chu-Rong Chen, Wei-Chiu Ma, Yilun Du, Shao-Hua Sun

    Abstract: Video-based representations have gained prominence in planning and decision-making due to their ability to encode rich spatiotemporal dynamics and geometric relationships. These representations enable flexible and generalizable solutions for complex tasks such as object manipulation and navigation. However, existing video planning frameworks often struggle to adapt to failures at interaction time…

    Submitted 20 October, 2025; originally announced October 2025.

  44. arXiv:2510.17245  [pdf, ps, other]

    cs.IR

    On Efficiency-Effectiveness Trade-off of Diffusion-based Recommenders

    Authors: Wenyu Mao, Jiancan Wu, Guoqing Hu, Zhengyi Yang, Wei Ji, Xiang Wang

    Abstract: Diffusion models have emerged as a powerful paradigm for generative sequential recommendation, which typically generate next items to recommend guided by user interaction histories with a multi-step denoising process. However, the multi-step process relies on discrete approximations, introducing discretization error that creates a trade-off between computational efficiency and recommendation effec…

    Submitted 22 October, 2025; v1 submitted 20 October, 2025; originally announced October 2025.

  45. arXiv:2510.16688  [pdf, ps, other]

    cs.CV cs.AI

    Pursuing Minimal Sufficiency in Spatial Reasoning

    Authors: Yejie Guo, Yunzhong Hou, Wufei Ma, Meng Tang, Ming-Hsuan Yang

    Abstract: Spatial reasoning, the ability to ground language in 3D understanding, remains a persistent challenge for Vision-Language Models (VLMs). We identify two fundamental bottlenecks: inadequate 3D understanding capabilities stemming from 2D-centric pre-training, and reasoning failures induced by redundant 3D information. To address these, we first construct a Minimal Sufficient Set (MSS) of information…

    Submitted 18 October, 2025; originally announced October 2025.

  46. arXiv:2510.16366  [pdf, ps, other]

    cs.CY

    Integrating LLM and Diffusion-Based Agents for Social Simulation

    Authors: Xinyi Li, Zhiqiang Guo, Qinglang Guo, Hao Jin, Weizhi Ma, Min Zhang

    Abstract: Agent-based social simulation provides a valuable methodology for predicting social information diffusion, yet existing approaches face two primary limitations. Traditional agent models often rely on rigid behavioral rules and lack semantic understanding of textual content, while emerging large language model (LLM)-based agents incur prohibitive computational costs at scale. To address these chall…

    Submitted 18 October, 2025; originally announced October 2025.

    Comments: 10 pages, 3 figures, 4 tables

  47. arXiv:2510.15081  [pdf, ps, other]

    cs.CL cs.SI

    A Generalizable Rhetorical Strategy Annotation Model Using LLM-based Debate Simulation and Labelling

    Authors: Shiyu Ji, Farnoosh Hashemi, Joice Chen, Juanwen Pan, Weicheng Ma, Hefan Zhang, Sophia Pan, Ming Cheng, Shubham Mohole, Saeed Hassanpour, Soroush Vosoughi, Michael Macy

    Abstract: Rhetorical strategies are central to persuasive communication, from political discourse and marketing to legal argumentation. However, analysis of rhetorical strategies has been limited by reliance on human annotation, which is costly, inconsistent, and difficult to scale. Their associated datasets are often limited to specific topics and strategies, posing challenges for robust model development. We…

    Submitted 16 October, 2025; originally announced October 2025.

    Comments: The first two authors contributed equally

  48. arXiv:2510.14455  [pdf, ps, other]

    cs.LG q-bio.BM

    Coder as Editor: Code-driven Interpretable Molecular Optimization

    Authors: Wenyu Zhu, Chengzhu Li, Xiaohe Tian, Yifan Wang, Yinjun Jia, Jianhui Wang, Bowen Gao, Ya-Qin Zhang, Wei-Ying Ma, Yanyan Lan

    Abstract: Molecular optimization is a central task in drug discovery that requires precise structural reasoning and domain knowledge. While large language models (LLMs) have shown promise in generating high-level editing intentions in natural language, they often struggle to faithfully execute these modifications, particularly when operating on non-intuitive representations like SMILES. We introduce MECo, a…

    Submitted 16 October, 2025; originally announced October 2025.

  49. arXiv:2510.13888  [pdf, ps, other]

    cs.CL cs.AI

    Reliable Fine-Grained Evaluation of Natural Language Math Proofs

    Authors: Wenjie Ma, Andrei Cojocaru, Neel Kolhe, Bradley Louie, Robin Said Sharif, Haihan Zhang, Vincent Zhuang, Matei Zaharia, Sewon Min

    Abstract: Recent advances in large language models (LLMs) for mathematical reasoning have largely focused on tasks with easily verifiable final answers; however, generating and verifying natural language math proofs remains an open challenge. We identify the absence of a reliable, fine-grained evaluator for LLM-generated math proofs as a critical gap. To address this, we propose a systematic methodology for…

    Submitted 13 October, 2025; originally announced October 2025.

    Comments: 31 pages, 6 figures, 10 tables

  50. arXiv:2510.13352  [pdf, ps, other]

    cs.LG

    Kernel Representation and Similarity Measure for Incomplete Data

    Authors: Yang Cao, Sikun Yang, Kai He, Wenjun Ma, Ming Liu, Yujiu Yang, Jian Weng

    Abstract: Measuring similarity between incomplete data is a fundamental challenge in web mining, recommendation systems, and user behavior analysis. Traditional approaches either discard incomplete data or perform imputation as a preprocessing step, leading to information loss and biased similarity estimates. This paper presents the proximity kernel, a new similarity measure that directly computes similarit…

    Submitted 15 October, 2025; originally announced October 2025.