Showing 1–50 of 1,852 results for author: Mao, Z

Searching in archive cs.
  1. arXiv:2604.09244  [pdf, ps, other]

    cs.MM cs.CV cs.RO

    2D or 3D: Who Governs Salience in VLA Models? -- Tri-Stage Token Pruning Framework with Modality Salience Awareness

    Authors: Zihao Zheng, Sicheng Tian, Zhihao Mao, Lingyue Zhang, Chenyue Li, Ziyun Zhang, Hong Gao, Yuchen Huang, Yutong Xu, Guojie Luo, Xiang Chen

    Abstract: Vision-Language-Action (VLA) models have emerged as the mainstream of embodied intelligence. Recent VLA models have expanded their input modalities from 2D-only to 2D+3D paradigms, forming multi-visual-modal VLA (MVLA) models. Despite achieving improved spatial perception, MVLA faces a greater acceleration demand due to the increased number of input tokens caused by modal expansion. Token pruning…

    Submitted 10 April, 2026; originally announced April 2026.

  2. arXiv:2604.09155  [pdf, ps, other]

    cs.LG cs.AI

    CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation

    Authors: Yushi Feng, Junye Du, Qifan Wang, Zizhan Ma, Qian Niu, Yutaka Matsuo, Long Feng, Lequan Yu

    Abstract: Graphical user interface (GUI) agents powered by vision language models (VLMs) are rapidly moving from passive assistance to autonomous operation. However, this unrestricted action space exposes users to severe and irreversible financial, privacy or social harm. Existing safeguards, which rely on prompt engineering, brittle heuristics and VLM-as-critic, lack formal verification and user-tunable guarantees…

    Submitted 10 April, 2026; originally announced April 2026.

  3. arXiv:2604.08826  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    HiFloat4 Format for Language Model Pre-training on Ascend NPUs

    Authors: Mehran Taghian, Yunke Peng, Xing Huang, Yao Wang, Yaoyuan Wang, Wei Guo, Yuanyong Luo, Tianchi Hu, Junsong Wang, Xin Wang, Hu Liu, Yu Cheng, Ziwei Yu, Hongliang Li, Mehdi Rahimifar, Lei Yan, Xuefei Wang, Zhuang Ma, Lei Liu, Hui Yu, Anandharaju Durai Raju, Hoang Le, Hei Yi Mak, Tanzila Rahman, Shadan Golestan

    Abstract: Large foundation models have become central to modern machine learning, with performance scaling predictably with model size and data. However, training and deploying such models incur substantial computational and memory costs, motivating the development of low-precision training techniques. Recent work has demonstrated that 4-bit floating-point (FP4) formats--such as MXFP4 and NVFP4--can be succ…

    Submitted 9 April, 2026; originally announced April 2026.

  4. arXiv:2604.08516  [pdf, ps, other]

    cs.CV

    MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

    Authors: Tanmay Gupta, Piper Wolters, Zixian Ma, Peter Sushko, Rock Yuren Pang, Diego Llanes, Yue Yang, Taira Anderson, Boyuan Zheng, Zhongzheng Ren, Harsh Trivedi, Taylor Blanton, Caleb Ouellette, Winson Han, Ali Farhadi, Ranjay Krishna

    Abstract: Web agents--autonomous systems that navigate and execute tasks on the web on behalf of users--have the potential to transform how people interact with the digital world. However, the most capable web agents today rely on proprietary models with undisclosed training data and recipes, limiting scientific understanding, reproducibility, and community-driven progress. We believe agents for the open…

    Submitted 9 April, 2026; originally announced April 2026.

    Comments: https://allenai.org/blog/molmoweb

  5. arXiv:2604.08377  [pdf, ps, other]

    cs.AI cs.CL

    SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

    Authors: Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, Xiangxiang Chu

    Abstract: Large language model (LLM) agents such as OpenClaw rely on reusable skills to perform complex tasks, yet these skills remain largely static after deployment. As a result, similar workflows, tool usage patterns, and failure modes are repeatedly rediscovered across users, preventing the system from improving with experience. While interactions from different users provide complementary signals about…

    Submitted 9 April, 2026; originally announced April 2026.

    Comments: Work in progress

  6. arXiv:2604.07350  [pdf, ps, other]

    cs.CV cs.GR cs.LG

    Fast Spatial Memory with Elastic Test-Time Training

    Authors: Ziqiao Ma, Xueyang Yu, Haoyu Zhen, Yuncong Yang, Joyce Chai, Chuang Gan

    Abstract: Large Chunk Test-Time Training (LaCT) has shown strong performance on long-context 3D reconstruction, but its fully plastic inference-time updates remain vulnerable to catastrophic forgetting and overfitting. As a result, LaCT is typically instantiated with a single large chunk spanning the full input sequence, falling short of the broader goal of handling arbitrarily long sequences in a single pa…

    Submitted 8 April, 2026; originally announced April 2026.

    Comments: Project Page: https://fast-spatial-memory.github.io/

  7. arXiv:2604.06950  [pdf, ps, other]

    cs.CV

    Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation

    Authors: Zhiheng Li, Zongyang Ma, Yuntong Pan, Ziqi Zhang, Xiaolei Lv, Bo Li, Jun Gao, Jianing Zhang, Chunfeng Yuan, Bing Li, Weiming Hu

    Abstract: Multimodal Large Language Models (MLLMs) are increasingly being deployed as automated content moderators. Within this landscape, we uncover a critical threat: Adversarial Smuggling Attacks. Unlike adversarial perturbations (for misclassification) and adversarial jailbreaks (for harmful output generation), adversarial smuggling exploits the Human-AI capability gap. It encodes harmful content into h…

    Submitted 8 April, 2026; v1 submitted 8 April, 2026; originally announced April 2026.

    Comments: Accepted to ACL 2026. 19 pages, 6 figures

  8. arXiv:2604.06664  [pdf, ps, other]

    cs.DC cs.LG

    Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start

    Authors: Xueshen Liu, Yongji Wu, Yuncheng Yao, Danyang Zhuo, Ion Stoica, Z. Morley Mao

    Abstract: Modern LLM service providers increasingly rely on autoscaling and parallelism reconfiguration to respond to rapidly changing workloads, but cold-start latency remains a major bottleneck. While recent systems have reduced model weight loading to seconds, CUDA graph capture still takes tens of seconds to minutes and often dominates startup. Unfortunately, CUDA graphs cannot be naively serialized: be…

    Submitted 8 April, 2026; originally announced April 2026.

  9. arXiv:2604.05634  [pdf, ps, other]

    cs.AI

    PECKER: A Precisely Efficient Critical Knowledge Erasure Recipe For Machine Unlearning in Diffusion Models

    Authors: Zhiyong Ma, Zhitao Deng, Huan Tang, Jialin Chen, Zhijun Zheng, Zhengping Li, Qingyuan Chuai

    Abstract: Machine unlearning (MU) has become a critical technique for GenAI models' safe and compliant operation. While existing MU methods are effective, most impose prohibitive training time and computational overhead. Our analysis suggests the root cause lies in poorly directed gradient updates, which reduce training efficiency and destabilize convergence. To mitigate these issues, we propose PECKER, an…

    Submitted 7 April, 2026; originally announced April 2026.

    Comments: Accepted by ICPR 2026

  10. arXiv:2604.05623  [pdf, ps, other]

    cs.CV cs.CL cs.MM

    DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions

    Authors: Xinran Wang, Yuxuan Zhang, Xiao Zhang, Haolong Yan, Muxi Diao, Songyu Xu, Zhonghao Yan, Hongbing Li, Kongming Liang, Zhanyu Ma

    Abstract: Accurately detecting and localizing hallucinations is a critical task for ensuring high reliability of image captions. In the era of Multimodal Large Language Models (MLLMs), captions have evolved from brief sentences into comprehensive narratives, often spanning hundreds of words. This shift exponentially increases the challenge: models must now pinpoint specific erroneous spans or words within e…

    Submitted 7 April, 2026; originally announced April 2026.

    Comments: 8 pages, 5 figures. The dataset and code are available at https://zyx-hhnkh.github.io/DetailVerifyBench/

  11. arXiv:2604.05581  [pdf, ps, other]

    cs.CV

    High-Resolution Single-Shot Polarimetric Imaging Made Easy

    Authors: Shuangfan Zhou, Chu Zhou, Heng Guo, Youwei Lyu, Boxin Shi, Zhanyu Ma, Imari Sato

    Abstract: Polarization-based vision has gained increasing attention for providing richer physical cues beyond RGB images. While achieving single-shot capture is highly desirable for practical applications, existing Division-of-Focal-Plane (DoFP) sensors inherently suffer from reduced spatial resolution and artifacts due to their spatial multiplexing mechanism. To overcome these limitations without sacrifici…

    Submitted 7 April, 2026; originally announced April 2026.

  12. arXiv:2604.05482  [pdf, ps, other]

    cs.CV cs.AI

    Unifying VLM-Guided Flow Matching and Spectral Anomaly Detection for Interpretable Veterinary Diagnosis

    Authors: Pu Wang, Zhixuan Mao, Jialu Li, Zhuoran Zheng, Dianjie Lu, Youshan Zhang

    Abstract: Automatic diagnosis of canine pneumothorax is challenged by data scarcity and the need for trustworthy models. To address this, we first introduce a public, pixel-level annotated dataset to facilitate research. We then propose a novel diagnostic paradigm that reframes the task as a synergistic process of signal localization and spectral detection. For localization, our method employs a Vision-Lang…

    Submitted 7 April, 2026; originally announced April 2026.

  13. arXiv:2604.05393  [pdf, ps, other]

    cs.CV cs.MM

    Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval

    Authors: Yuxin Yang, Yinan Zhou, Yuxin Chen, Ziqi Zhang, Zongyang Ma, Chunfeng Yuan, Bing Li, Jun Gao, Weiming Hu

    Abstract: Composed Image Retrieval (CIR) has demonstrated significant potential by enabling flexible multimodal queries that combine a reference image and modification text. However, CIR inherently prioritizes semantic matching, struggling to reliably retrieve a user-specified instance across contexts. In practice, emphasizing concrete instance fidelity over broad semantics is often more consequential. In t…

    Submitted 6 April, 2026; originally announced April 2026.

    Comments: Accepted to CVPR 2026. Project page, dataset, and code are available at: https://hahajun1101.github.io/OACIR/

  14. arXiv:2604.04942  [pdf, ps, other]

    cs.CL cs.AI

    TDA-RC: Task-Driven Alignment for Knowledge-Based Reasoning Chains in Large Language Models

    Authors: Jiaquan Zhang, Qigan Sun, Chaoning Zhang, Xudong Wang, Zhenzhen Huang, Yitian Zhou, Pengcheng Zheng, Chi-lok Andy Tai, Sung-Ho Bae, Zeyu Ma, Caiyan Qin, Jinyu Guo, Yang Yang, Hengtao Shen

    Abstract: Enhancing the reasoning capability of large language models (LLMs) remains a core challenge in natural language processing. The Chain-of-Thought (CoT) paradigm dominates practical applications for its single-round efficiency, yet its reasoning chains often exhibit logical gaps. While multi-round paradigms like Graph-of-Thoughts (GoT), Tree-of-Thoughts (ToT), and Atom of Thought (AoT) achieve stron…

    Submitted 13 March, 2026; originally announced April 2026.

    Comments: 14 pages, 4 figures

  15. arXiv:2604.04925  [pdf, ps, other]

    cs.CV

    SimpleProc: Fully Procedural Synthetic Data from Simple Rules for Multi-View Stereo

    Authors: Zeyu Ma, Alexander Raistrick, Jia Deng

    Abstract: In this paper, we explore the design space of procedural rules for multi-view stereo (MVS). We demonstrate that we can generate effective training data using SimpleProc: a new, fully procedural generator driven by a very small set of rules using Non-Uniform Rational Basis Splines (NURBS), as well as basic displacement and texture patterns. At a modest scale of 8,000 images, our approach achieves s…

    Submitted 7 April, 2026; v1 submitted 6 April, 2026; originally announced April 2026.

  16. arXiv:2604.04805  [pdf, ps, other]

    cs.CR

    Unpacking .zip: A First Look at Domain and File Name Confusion

    Authors: Predrag Despotovic, Pranab Mishra, Kevin Rossel, Athanasios Avgetidis, Zane Ma

    Abstract: The namespace for filenames and DNS names has overlapped since the introduction of DNS in 1985: .com was the original binary format used for DOS and CP/M systems. Recently the introduction of gTLDs such as .zip and .mov, coupled with the growing prevalence of web resources, has ignited new concerns about potential issues related to DNS and filename confusion. Thus far, t…

    Submitted 7 April, 2026; v1 submitted 6 April, 2026; originally announced April 2026.

  17. arXiv:2604.04451  [pdf, ps, other]

    cs.CV

    Beyond Few-Step Inference: Accelerating Video Diffusion Transformer Model Serving with Inter-Request Caching Reuse

    Authors: Hao Liu, Ye Huang, Chenghuan Huang, Zhenyi Zheng, Jiangsu Du, Ziyang Ma, Jing Lyu, Yutong Lu

    Abstract: Video Diffusion Transformer (DiT) models are a dominant approach for high-quality video generation but suffer from high inference cost due to iterative denoising. Existing caching approaches primarily exploit similarity within the diffusion process of a single request to skip redundant denoising steps. In this paper, we introduce Chorus, a caching approach that leverages similarity across requests…

    Submitted 6 April, 2026; originally announced April 2026.

  18. arXiv:2604.02183  [pdf, ps, other]

    cs.AI

    TRU: Targeted Reverse Update for Efficient Multimodal Recommendation Unlearning

    Authors: Zhanting Zhou, KaHou Tam, Ziqiang Zheng, Zeyu Ma, Yang Yang

    Abstract: Multimodal recommendation systems (MRS) jointly model user-item interaction graphs and rich item content, but this tight coupling makes user data difficult to remove once learned. Approximate machine unlearning offers an efficient alternative to full retraining, yet existing methods for MRS mainly rely on a largely uniform reverse update across the model. We show that this assumption is fundamenta…

    Submitted 10 April, 2026; v1 submitted 2 April, 2026; originally announced April 2026.

  19. arXiv:2604.01889  [pdf, ps, other]

    cs.LG

    LI-DSN: A Layer-wise Interactive Dual-Stream Network for EEG Decoding

    Authors: Chenghao Yue, Zhiyuan Ma, Zhongye Xia, Xinche Zhang, Yisi Zhang, Xinke Shen, Sen Song

    Abstract: Electroencephalography (EEG) provides a non-invasive window into brain activity, offering high temporal resolution crucial for understanding and interacting with neural processes through brain-computer interfaces (BCIs). Current dual-stream neural networks for EEG often process temporal and spatial features independently through parallel branches, delaying their integration until a final, late-sta…

    Submitted 2 April, 2026; originally announced April 2026.

  20. arXiv:2604.01251  [pdf, ps, other]

    cs.CV eess.IV

    Camouflage-aware Image-Text Retrieval via Expert Collaboration

    Authors: Yao Jiang, Zhongkuan Mao, Xuan Wu, Keren Fu, Qijun Zhao

    Abstract: Camouflaged scene understanding (CSU) has attracted significant attention due to its broad practical implications. However, in this field, robust image-text cross-modal alignment remains under-explored, hindering deeper understanding of camouflaged scenarios and their related applications. To this end, we focus on the typical image-text retrieval task, and formulate a new task dubbed "camouflage-…

    Submitted 31 March, 2026; originally announced April 2026.

  21. arXiv:2604.01241  [pdf, ps, other]

    cs.NE cs.AI cs.LG

    A Learning-Based Cooperative Coevolution Framework for Heterogeneous Large-Scale Global Optimization

    Authors: Wenjie Qiu, Zixin Wang, Hongyu Fang, Zeyuan Ma, Yue-Jiao Gong

    Abstract: Cooperative Coevolution (CC) effectively addresses Large-Scale Global Optimization (LSGO) via decomposition but struggles with the emerging class of Heterogeneous LSGO (H-LSGO) problems arising from real-world applications, where subproblems exhibit diverse dimensions and distinct landscapes. The prevailing CC paradigm, relying on a fixed low-dimensional optimizer, often fails to navigate this het…

    Submitted 29 March, 2026; originally announced April 2026.

    Comments: 13 pages, 5 figures, 3 tables. Accepted for publication in GECCO 2026

  22. arXiv:2604.01155  [pdf, ps, other]

    cs.SD

    FineLAP: Taming Heterogeneous Supervision for Fine-grained Language-Audio Pretraining

    Authors: Xiquan Li, Xuenan Xu, Ziyang Ma, Wenxi Chen, Haolin He, Qiuqiang Kong, Xie Chen

    Abstract: Contrastively pretrained audio-language models (e.g., CLAP) excel at clip-level understanding but struggle with frame-level tasks. Existing extensions fail to exploit the varying granularity of real-world audio-text data, where massive clip-level textual descriptions coexist with limited frame-level annotations. This paper proposes Fine-grained Language-Audio Pretraining (FineLAP), a novel trainin…

    Submitted 1 April, 2026; originally announced April 2026.

  23. arXiv:2603.28691  [pdf, ps, other]

    cs.RO

    DRIVE-Nav: Directional Reasoning, Inspection, and Verification for Efficient Open-Vocabulary Navigation

    Authors: Maoguo Gao, Zejun Zhu, Zhiming Sun, Zhengwei Ma, Longze Yuan, Zhongjing Ma, Zhigang Gao, Jinhui Zhang, Suli Zou

    Abstract: Open-Vocabulary Object Navigation (OVON) requires an embodied agent to locate a language-specified target in unknown environments. Existing zero-shot methods often reason over dense frontier points under incomplete observations, causing unstable route selection, repeated revisits, and unnecessary action overhead. We present DRIVE-Nav, a structured framework that organizes exploration around persis…

    Submitted 30 March, 2026; originally announced March 2026.

    Comments: 8 pages, 4 figures. Project page: https://coolmaoguo.github.io/drive-nav-page/

  24. arXiv:2603.28474  [pdf, ps, other]

    cs.CV cs.AI

    CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains

    Authors: Wenhan Wang, Zhixiang Zhou, Zhongtian Ma, Yanzhu Chen, Ziyu Lin, Hao Sheng, Pengfei Liu, Honglin Ma, Wenqi Shao, Qiaosheng Zhang, Yu Qiao

    Abstract: The connoisseurship of antique Chinese porcelain demands extensive historical expertise, material understanding, and aesthetic sensitivity, making it difficult for non-specialists to engage. To democratize cultural-heritage understanding and assist expert connoisseurship, we introduce CiQi-Agent -- a domain-specific Porcelain Connoisseurship Agent for intelligent analysis of antique Chinese porcel…

    Submitted 30 March, 2026; originally announced March 2026.

  25. arXiv:2603.28069  [pdf, ps, other]

    cs.CV cs.AI

    MolmoPoint: Better Pointing for VLMs with Grounding Tokens

    Authors: Christopher Clark, Yue Yang, Jae Sung Park, Zixian Ma, Jieyu Zhang, Rohun Tripathi, Mohammadreza Salehi, Sangho Lee, Taira Anderson, Winson Han, Ranjay Krishna

    Abstract: Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens that contain the target concept. Our model generates…

    Submitted 30 March, 2026; originally announced March 2026.

  26. arXiv:2603.27705  [pdf, ps, other]

    cs.CV cs.AI

    RAP: Retrieve, Adapt, and Prompt-Fit for Training-Free Few-Shot Medical Image Segmentation

    Authors: Zhihao Mao, Bangpu Chen

    Abstract: Few-shot medical image segmentation (FSMIS) has achieved notable progress, yet most existing methods mainly rely on semantic correspondences from scarce annotations while under-utilizing a key property of medical imagery: anatomical targets exhibit repeatable high-frequency morphology (e.g., boundary geometry and spatial layout) across patients and acquisitions. We propose RAP, a training-free fra…

    Submitted 29 March, 2026; originally announced March 2026.

    Comments: This paper has been accepted by IJCNN 2026

  27. arXiv:2603.27537  [pdf, ps, other]

    cs.RO

    Learning Smooth and Robust Space Robotic Manipulation of Dynamic Target via Inter-frame Correlation

    Authors: Siyi Lang, Hongyi Gao, Yingxin Zhang, Zihao Liu, Hanlin Dong, Zhaoke Ning, Zhiqiang Ma, Panfeng Huang

    Abstract: On-orbit servicing represents a critical frontier in future aerospace engineering, with the manipulation of dynamic non-cooperative targets serving as a key technology. In microgravity environments, objects are typically free-floating, lacking the support and frictional constraints found on Earth, which significantly escalates the complexity of tasks involving space robotic manipulation. Conventio…

    Submitted 29 March, 2026; originally announced March 2026.

    Comments: none

  28. arXiv:2603.26653  [pdf, ps, other]

    cs.CV cs.AI cs.CL cs.LG

    PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning

    Authors: Shaoxuan Li, Zhixuan Zhao, Hanze Deng, Zirun Ma, Shulin Tian, Zuyan Liu, Yushi Hu, Haoning Wu, Yuhao Dong, Benlin Liu, Ziwei Liu, Ranjay Krishna

    Abstract: We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning. PerceptionComp is designed so that no single moment is sufficient: answering each question requires multiple temporally separated pieces of visual evidence and compositional constraints under conjunctive and sequential logic, spanning perceptual subtasks such as objects, attri…

    Submitted 27 March, 2026; originally announced March 2026.

    Comments: Project Page: https://perceptioncomp.github.io

  29. arXiv:2603.26049  [pdf, ps, other]

    cs.CV cs.AI

    Seeing Like Radiologists: Context- and Gaze-Guided Vision-Language Pretraining for Chest X-rays

    Authors: Kang Liu, Zhuoqi Ma, Siyu Liang, Yunan Li, Xiyue Gao, Chao Liang, Kun Xie, Qiguang Miao

    Abstract: Despite recent advances in medical vision-language pretraining, existing models still struggle to capture the diagnostic workflow: radiographs are typically treated as context-agnostic images, while radiologists' gaze -- a crucial cue for visual reasoning -- remains largely underexplored by existing methods. These limitations hinder the modeling of disease-specific patterns and weaken cross-modal…

    Submitted 26 March, 2026; originally announced March 2026.

    Comments: Code: https://github.com/mk-runner/CoGaze

  30. arXiv:2603.25969  [pdf, ps, other]

    cs.AR

    FireBridge: Cycle-Accurate Hardware + Firmware Co-Verification for Modern Accelerators

    Authors: G Abarajithan, Zhenghua Ma, Francesco Restuccia, Ryan Kastner

    Abstract: Hardware-firmware integration is becoming a productivity bottleneck due to the increasing complexity of accelerators, characterized by intricate memory hierarchies and firmware-intensive execution. While numerous verification techniques focus on early-stage, approximate modeling of such systems to speed up initial development, developers still rely heavily on FPGA emulation to integrate firmware w…

    Submitted 26 March, 2026; originally announced March 2026.

  31. arXiv:2603.25423  [pdf, ps, other]

    cs.SI cs.AI

    From Manipulation to Mistrust: Explaining Diverse Micro-Video Misinformation for Robust Debunking in the Wild

    Authors: Zhi Zeng, Yifei Yang, Jiaying Wu, Xulang Zhang, Xiangzheng Kong, Herun Wan, Zihan Ma, Minnan Luo

    Abstract: The rise of micro-videos has reshaped how misinformation spreads, amplifying its speed, reach, and impact on public trust. Existing benchmarks typically focus on a single deception type, overlooking the diversity of real-world cases that involve multimodal manipulation, AI-generated content, cognitive bias, and out-of-context reuse. Meanwhile, most detection models lack fine-grained attribution, l…

    Submitted 26 March, 2026; originally announced March 2026.

    Comments: Accepted at WWW 2026

  32. arXiv:2603.25040  [pdf, ps, other]

    cs.LG cs.CL cs.CV

    Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale

    Authors: Yicheng Zou, Dongsheng Zhu, Lin Zhu, Tong Zhu, Yunhua Zhou, Peiheng Zhou, Xinyu Zhou, Dongzhan Zhou, Zhiwang Zhou, Yuhao Zhou, Bowen Zhou, Zhanping Zhong, Zhijie Zhong, Haiteng Zhao, Penghao Zhao, Xiaomeng Zhao, Zhiyuan Zhao, Yechen Zhang, Jin Zhang, Wenwei Zhang, Hongjie Zhang, Zhuo Zhang, Wenlong Zhang, Bo Zhang, Chao Zhang, et al. (152 additional authors not shown)

    Abstract: We introduce Intern-S1-Pro, the first one-trillion-parameter scientific multimodal foundation model. Scaling to this unprecedented size, the model delivers a comprehensive enhancement across both general and scientific domains. Beyond stronger reasoning and image-text understanding capabilities, its intelligence is augmented with advanced agent capabilities. Simultaneously, its scientific expertis…

    Submitted 2 April, 2026; v1 submitted 26 March, 2026; originally announced March 2026.

  33. arXiv:2603.24575  [pdf, ps, other]

    cs.CV cs.AI

    VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

    Authors: Qijia He, Xunmei Liu, Hammaad Memon, Ziang Li, Zixian Ma, Jaemin Cho, Jason Ren, Daniel S Weld, Ranjay Krishna

    Abstract: Scalable Vector Graphics (SVG) are an essential format for technical illustration and digital design, offering precise resolution independence and flexible semantic editability. In practice, however, original vector source files are frequently lost or inaccessible, leaving only "flat" rasterized versions (e.g., PNG or JPEG) that are difficult to modify or scale. Manually reconstructing these figur…

    Submitted 25 March, 2026; originally announced March 2026.

  34. arXiv:2603.23376  [pdf, ps, other]

    cs.CV cs.RO

    ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment

    Authors: Yuzhi Chen, Ronghan Chen, Dongjie Huo, Yandan Yang, Dekang Qi, Haoyun Liu, Tong Lin, Shuang Zeng, Junjin Xiao, Xinyuan Chang, Feng Xiong, Xing Wei, Zhiheng Ma, Mu Xu

    Abstract: Video-based world models offer a powerful paradigm for embodied simulation and planning, yet state-of-the-art models often generate physically implausible manipulations - such as object penetration and anti-gravity motion - due to training on generic visual data and likelihood-based objectives that ignore physical laws. We present ABot-PhysWorld, a 14B Diffusion Transformer model that generates vi…

    Submitted 27 March, 2026; v1 submitted 24 March, 2026; originally announced March 2026.

    Comments: Code: https://github.com/amap-cvlab/ABot-PhysWorld.git

  35. Detecting Non-Membership in LLM Training Data via Rank Correlations

    Authors: Pranav Shetty, Mirazul Haque, Zhiqiang Ma, Xiaomo Liu

    Abstract: As large language models (LLMs) are trained on increasingly vast and opaque text corpora, determining which data contributed to training has become essential for copyright enforcement, compliance auditing, and user trust. While prior work focuses on detecting whether a dataset was used in training (membership inference), the complementary problem -- verifying that a dataset was not used -- has rec…

    Submitted 23 March, 2026; originally announced March 2026.

    Comments: Accepted to EACL 2026 Main Conference

  36. arXiv:2603.22314  [pdf, ps, other]

    cs.LG cs.AI

    Enhancing AI-Based Tropical Cyclone Track and Intensity Forecasting via Systematic Bias Correction

    Authors: Peisong Niu, Haifan Zhang, Yang Zhao, Tian Zhou, Ziqing Ma, Wenqiang Shen, Junping Zhao, Huiling Yuan, Liang Sun

    Abstract: Tropical cyclones (TCs) pose severe threats to life, infrastructure, and economies in tropical and subtropical regions, underscoring the critical need for accurate and timely forecasts of both track and intensity. Recent advances in AI-based weather forecasting have shown promise in improving TC track forecasts. However, these systems are typically trained on coarse-resolution reanalysis data (e.g…

    Submitted 19 March, 2026; originally announced March 2026.

  37. arXiv:2603.22309  [pdf, ps, other]

    cs.LG cs.AI

    UniFluids: Unified Neural Operator Learning with Conditional Flow-matching

    Authors: Haosen Li, Qi Meng, Jiahao Li, Rui Zhang, Ruihua Song, Liang Ma, Zhi-Ming Ma

    Abstract: Partial differential equation (PDE) simulation holds extensive significance in scientific research. Currently, the integration of deep neural networks to learn solution operators of PDEs has introduced great potential. In this paper, we present UniFluids, a conditional flow-matching framework that harnesses the scalability of diffusion Transformer to unify learning of solution operators across div…

    Submitted 19 March, 2026; originally announced March 2026.

    Comments: Preprint version. Work in progress

  38. arXiv:2603.21664  [pdf, ps, other]

    cs.CV

    HumanOmni-Speaker: Identifying Who said What and When

    Authors: Detao Bai, Shimin Yao, Weixuan Chen, Xihan Wei, Zhiheng Ma

    Abstract: While Omni-modal Large Language Models have made strides in joint sensory processing, they fundamentally struggle with a cornerstone of human interaction: deciphering complex, multi-person conversational dynamics to accurately answer "Who said what and when." Current models suffer from an "illusion of competence" -- they exploit visual biases in conventional benchmarks to bypass genuine cross-…

    Submitted 23 March, 2026; originally announced March 2026.

  39. arXiv:2603.20296  [pdf, ps, other]

    cs.LG cs.AI

    Collaborative Adaptive Curriculum for Progressive Knowledge Distillation

    Authors: Jing Liu, Zhenchao Ma, Han Yu, Bobo Ju, Wenliang Yang, Chengfang Li, Bo Hu, Liang Song

    Abstract: Recent advances in collaborative knowledge distillation have demonstrated cutting-edge performance for resource-constrained distributed multimedia learning scenarios. However, achieving such competitiveness requires addressing a fundamental mismatch: high-dimensional teacher knowledge complexity versus heterogeneous client learning capacities, which currently prohibits deployment in edge-based vis…

    Submitted 19 March, 2026; originally announced March 2026.

    Comments: Accepted by IEEE ICME 2026

  40. arXiv:2603.19536  [pdf, ps, other]

    cs.SI

    Politicized Attention Shifts Amplify Polarization in the Information Ecosystem during California Wildfires

    Authors: Yiheng Chen, Alina Hagen, Fan Yang, Ratna B. Dougherty, Zihui Ma, Lingyao Li, Runlong Yu

    Abstract: Wildfires require governments to communicate under conditions of urgency, uncertainty, and intense public scrutiny, yet such communication now unfolds within a digitally mediated environment shaped by polarization and engagement-based amplification. We analyze over 1.3 million wildfire-related social media posts from California (2016-2025) to examine how institutional actors are evaluated within t…

    Submitted 19 March, 2026; originally announced March 2026.

  41. arXiv:2603.19384  [pdf, ps, other]

    cs.RO

    SOFTMAP: Sim2Real Soft Robot Forward Modeling via Topological Mesh Alignment and Physics Prior

    Authors: Ziyong Ma, Uksang Yoo, Jonathan Francis, Weiming Zhi, Jeffrey Ichnowski, Jean Oh

    Abstract: While soft robot manipulators offer compelling advantages over rigid counterparts, including inherent compliance, safe human-robot interaction, and the ability to conform to complex geometries, accurate forward modeling from low-dimensional actuation commands remains an open challenge due to nonlinear material phenomena such as hysteresis and manufacturing variability. We present SOFTMAP, a sim-to…

    Submitted 19 March, 2026; originally announced March 2026.

  42. arXiv:2603.18937  [pdf, ps, other]

    cs.IT math.PR

    Theoretical Analyses of Detectors for Additive Noise Channels with Mean-Variance Uncertainty under Nonlinear Expectation Theory

    Authors: Wen-Xuan Lang, Guiying Yan, Zhi-Ming Ma

    Abstract: In classical information theory, both the form and performance of the optimal detector for additive noise channels can be precisely derived, based on the assumption that the channel noise follows a specific probability distribution or a mixture of known distributions, or that the exact distribution exists but is unknown. In this paper, we extend the analyses of detectors for additive noise channel…

    Submitted 19 March, 2026; originally announced March 2026.

    Comments: 24 pages, 4 figures

  43. arXiv:2603.18466  [pdf, ps, other]

    cs.CV

    Recolour What Matters: Region-Aware Colour Editing via Token-Level Diffusion

    Authors: Yuqi Yang, Dongliang Chang, Yijia Ling, Ruoyi Du, Zhanyu Ma

    Abstract: Colour is one of the most perceptually salient yet least controllable attributes in image generation. Although recent diffusion models can modify object colours from user instructions, their results often deviate from the intended hue, especially for fine-grained and local edits. Early text-driven methods rely on discrete language descriptions that cannot accurately represent continuous chromatic…

    Submitted 18 March, 2026; originally announced March 2026.

    Comments: 18 pages, 12 figures

  44. arXiv:2603.18443  [pdf, ps, other]

    cs.CV

    SR-Nav: Spatial Relationships Matter for Zero-shot Object Goal Navigation

    Authors: Leyuan Fang, Zan Mao, Zijing Wang, Yinlong Yan

    Abstract: Zero-shot object-goal navigation aims to find target objects in unseen environments using only egocentric observation. Recent methods leverage foundation models' comprehension and reasoning capabilities to enhance navigation performance. However, when faced with poor viewpoints or weak semantic cues, foundation models often fail to support reliable reasoning in both perception and planning, result…

    Submitted 18 March, 2026; originally announced March 2026.

  45. arXiv:2603.17826  [pdf, ps, other]

    cs.SE cs.AI

    FailureMem: A Failure-Aware Multimodal Framework for Autonomous Software Repair

    Authors: Ruize Ma, Yilei Jiang, Shilin Zhang, Zheng Ma, Yi Feng, Vincent Ng, Zhi Wang, Xiangyu Yue, Chuanyi Li, Lewei Lu

    Abstract: Multimodal Automated Program Repair (MAPR) extends traditional program repair by requiring models to jointly reason over source code, textual issue descriptions, and visual artifacts such as GUI screenshots. While recent LLM-based repair systems have shown promising results, existing approaches face several limitations: rigid workflow pipelines restrict exploration during debugging, visual reasoni…

    Submitted 18 March, 2026; originally announced March 2026.

  46. arXiv:2603.17573  [pdf, ps, other]

    cs.RO cs.DB cs.LG

    HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness

    Authors: Zihao Zheng, Zhihao Mao, Sicheng Tian, Maoliang Li, Jiayu Chen, Xinhao Sun, Zhaobo Zhang, Xuanzhe Liu, Donggang Cao, Hong Mei, Xiang Chen

    Abstract: Vision-Language-Action (VLA) Models have become the mainstream solution for robot control, but suffer from slow inference speeds. Speculative Decoding (SD) is a promising acceleration method which can be divided into two categories: drafter-based SD and retrieval-based SD. Existing methods fail to analyze the advantages and disadvantages of these two types of SD in VLA models, leading to their sol…

    Submitted 18 March, 2026; originally announced March 2026.

  47. arXiv:2603.17547  [pdf]

    eess.IV cs.CV

    Deep Learning-Based Airway Segmentation in Systemic Lupus Erythematosus Patients with Interstitial Lung Disease (SLE-ILD): A Comparative High-Resolution CT Analysis

    Authors: Sirong Piao, Ying Ming, Ruijie Zhao, Jiaru Wang, Ran Xiao, Rui Zhao, Zicheng Liao, Qiqi Xu, Shaoze Luo, Bing Li, Lin Li, Zhuangfei Ma, Fuling Zheng, Wei Song

    Abstract: To characterize lobar and segmental airway volume differences between systemic lupus erythematosus (SLE) patients with interstitial lung disease (ILD) and those without ILD (non-ILD) using a deep learning-based approach on non-contrast chest high-resolution CT (HRCT). Methods: A retrospective analysis was conducted on 106 SLE patients (27 SLE-ILD, 79 SLE-non-ILD) who underwent HRCT. A customized d…

    Submitted 18 March, 2026; originally announced March 2026.

  48. arXiv:2603.17455  [pdf]

    cs.CV

    FACE-net: Factual Calibration and Emotion Augmentation for Retrieval-enhanced Emotional Video Captioning

    Authors: Weidong Chen, Cheng Ye, Zhendong Mao, Peipei Song, Xinyan Liu, Lei Zhang, Xiaojun Chang, Yongdong Zhang

    Abstract: Emotional Video Captioning (EVC) is an emerging task, which aims to describe factual content with the intrinsic emotions expressed in videos. Existing works perceive global emotional cues and then combine with video content to generate descriptions. However, insufficient factual and emotional cues mining and coordination during generation make their methods difficult to deal with the factual-emoti…

    Submitted 18 March, 2026; originally announced March 2026.

    Comments: Submitted to TPAMI. 16 pages, 9 figures

  49. arXiv:2603.16700  [pdf, ps, other]

    cs.IT math.PR

    Nonlinear Information Theory: Characterizing Distributional Uncertainty in Communication Models with Sublinear Expectation

    Authors: Wen-Xuan Lang, Shaoshi Yang, Jianhua Zhang, Zhiming Ma

    Abstract: A mathematical framework for information-theoretic analysis is established, with a new viewpoint of describing transmitted messages and communication channels by the nonlinear expectation theory, beyond the framework of classical probability theory. The major motivation of this research is to emphasize the probabilistic distribution uncertainty within the ever increasingly complex communication ne…

    Submitted 17 March, 2026; originally announced March 2026.

    Comments: 48 pages, 8 figures

  50. arXiv:2603.16393  [pdf, ps, other]

    math.NA cs.AI

    Robust Physics-Guided Diffusion for Full-Waveform Inversion

    Authors: Jishen Peng, Enze Jiang, Zheng Ma, Xiongbin Yan

    Abstract: We develop a robust physics-guided diffusion framework for full-waveform inversion that combines a score-based generative prior with likelihood guidance computed through wave-equation simulations. We adopt a transport-based data-consistency potential (Wasserstein-2), incorporating wavefield enhancement via bounded weighting and observation-dependent normalization, thereby improving robustness to a…

    Submitted 17 March, 2026; originally announced March 2026.