Showing 1–50 of 3,500 results for author: Zhang, M

Searching in archive cs.
  1. arXiv:2604.08337  [pdf, ps, other]

    cs.CV cs.AI

    InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

    Authors: Ashutosh Kumar, Rajat Saini, Jingjing Pan, Mustafa Erdogan, Mingfang Zhang, Betty Le Dem, Norimasa Kobori, Quan Kong

    Abstract: Current vision-language pre-training (VLP) paradigms excel at global scene understanding but struggle with instance-level reasoning due to global-only supervision. We introduce InstAP, an Instance-Aware Pre-training framework that jointly optimizes global vision-text alignment and fine-grained, instance-level contrastive alignment by grounding textual mentions to specific spatial-temporal regions.…

    Submitted 9 April, 2026; originally announced April 2026.

  2. arXiv:2604.08304  [pdf, ps, other]

    cs.CR cs.AI

    Securing Retrieval-Augmented Generation: A Taxonomy of Attacks, Defenses, and Future Directions

    Authors: Yuming Xu, Mingtao Zhang, Zhuohan Ge, Haoyang Li, Nicole Hu, Jason Chen Zhang, Qing Li, Lei Chen

    Abstract: Retrieval-augmented generation (RAG) significantly enhances large language models (LLMs) but introduces novel security risks through external knowledge access. While existing studies cover various RAG vulnerabilities, they often conflate inherent LLM risks with those specifically introduced by RAG. In this paper, we propose that secure RAG is fundamentally about the security of the external knowle…

    Submitted 9 April, 2026; originally announced April 2026.

  3. arXiv:2604.08281  [pdf, ps, other]

    cs.CL

    When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning

    Authors: Ruotao Xu, Yixin Ji, Yu Luo, Jinpeng Li, Dong Li, Peifeng Li, Juntao Li, Min Zhang

    Abstract: Large reasoning models (LRMs) have achieved strong performance enhancement through scaling test time computation, but due to the inherent limitations of the underlying language models, they still have shortcomings in tasks that require precise computation and extensive knowledge reserves. Tool-Integrated Reasoning (TIR) has emerged as a promising paradigm that incorporates tool call and execution…

    Submitted 9 April, 2026; originally announced April 2026.

  4. arXiv:2604.07958  [pdf, ps, other]

    cs.CV

    ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks

    Authors: Jiayang Xu, Fan Zhuo, Majun Zhang, Changhao Pan, Zehan Wang, Siyu Chen, Xiaoda Yang, Tao Jin, Zhou Zhao

    Abstract: Current video editing models often rely on expensive paired video data, which limits their practical scalability. In essence, most video editing tasks can be formulated as a decoupled spatiotemporal process, where the temporal dynamics of the pretrained model are preserved while spatial content is selectively and precisely modified. Based on this insight, we propose ImVideoEdit, an efficient frame…

    Submitted 9 April, 2026; originally announced April 2026.

  5. arXiv:2604.07922  [pdf, ps, other]

    cs.AI cs.CL

    SAT: Balancing Reasoning Accuracy and Efficiency with Stepwise Adaptive Thinking

    Authors: Weiyang Huang, Xuefeng Bai, Kehai Chen, Xinyang Chen, Yibin Chen, Weili Guan, Min Zhang

    Abstract: Large Reasoning Models (LRMs) have revolutionized complex problem-solving, yet they exhibit a pervasive "overthinking", generating unnecessarily long reasoning chains. While current solutions improve token efficiency, they often sacrifice fine-grained control or risk disrupting the logical integrity of the reasoning process. To address this, we introduce Stepwise Adaptive Thinking (SAT), a framewo…

    Submitted 9 April, 2026; originally announced April 2026.

    Comments: accepted to ACL2026 main conference

  6. arXiv:2604.07812  [pdf, ps, other]

    cs.CV

    HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models

    Authors: Qihui Zhu, Tao Zhang, Yuchen Wang, Zijian Wen, Mengjie Zhang, Shuangwu Chen, Xiaobin Tan, Jian Yang, Yang Liu, Zhenhua Dong, Xianzhi Yu, Yinfei Pan

    Abstract: In multimodal large language models (MLLMs), the surge of visual tokens significantly increases the inference time and computational overhead, making them impractical for real-time or resource-constrained applications. Visual token pruning is a promising strategy for reducing the cost of MLLM inference by removing redundant visual tokens. Existing research usually assumes that all attention heads…

    Submitted 9 April, 2026; originally announced April 2026.

    Comments: CVPR 2026

  7. arXiv:2604.07765  [pdf, ps, other]

    cs.CV

    RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs

    Authors: Liang Yao, Shengxiang Xu, Fan Liu, Chuanyi Zhang, Bishun Yao, Rui Min, Yongjun Li, Chaoqian Ouyang, Shimin Di, Min-Ling Zhang

    Abstract: Earth Observation (EO) systems are essentially designed to support domain experts who often express their requirements through vague natural language rather than precise, machine-friendly instructions. Depending on the specific application scenario, these vague queries can demand vastly different levels of visual precision. Consequently, a practical EO AI system must bridge the gap between ambiguo…

    Submitted 8 April, 2026; originally announced April 2026.

  8. arXiv:2604.07717  [pdf]

    cs.CL cs.AI

    Detecting HIV-Related Stigma in Clinical Narratives Using Large Language Models

    Authors: Ziyi Chen, Yasir Khan, Mengyuan Zhang, Cheng Peng, Mengxian Lyu, Yiyang Liu, Krishna Vaddiparti, Robert L Cook, Mattia Prosperi, Yonghui Wu

    Abstract: Human immunodeficiency virus (HIV)-related stigma is a critical psychosocial determinant of health for people living with HIV (PLWH), influencing mental health, engagement in care, and treatment outcomes. Although stigma-related experiences are documented in clinical narratives, there is a lack of off-the-shelf tools to extract and categorize them. This study aims to develop a large language model…

    Submitted 8 April, 2026; originally announced April 2026.

  9. arXiv:2604.07394  [pdf, ps, other]

    cs.LG cs.CL

    Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference

    Authors: Quantong Qiu, Zhiyi Hong, Yi Yang, Haitian Wang, Kebin Liu, Qingqing Dang, Juntao Li, Min Zhang

    Abstract: The quadratic computational complexity of standard attention mechanisms presents a severe scalability bottleneck for LLMs in long-context scenarios. While hybrid attention mechanisms combining Full Attention (FA) and Sparse Attention (SA) offer a potential solution, existing methods typically rely on static allocation ratios that fail to accommodate the variable retrieval demands of different task…

    Submitted 8 April, 2026; originally announced April 2026.

  10. arXiv:2604.06787  [pdf, ps, other]

    cs.CL

    When Is Thinking Enough? Early Exit via Sufficiency Assessment for Efficient Reasoning

    Authors: Yang Xiang, Yixin Ji, Ruotao Xu, Dan Qiao, Zheming Yang, Juntao Li, Min Zhang

    Abstract: Large reasoning models (LRMs) have achieved remarkable performance in complex reasoning tasks, driven by their powerful inference-time scaling capability. However, LRMs often suffer from overthinking, which results in substantial computational redundancy and significantly reduces efficiency. Early-exit methods aim to mitigate this issue by terminating reasoning once sufficient evidence has been ge…

    Submitted 8 April, 2026; originally announced April 2026.

    Comments: ACL 2026 Main Conference

  11. arXiv:2604.06747  [pdf]

    cs.AI

    TurboAgent: An LLM-Driven Autonomous Multi-Agent Framework for Turbomachinery Aerodynamic Design

    Authors: Juan Du, Yueteng Wu, Pan Zhao, Yuze Liu, Min Zhang, Xiaobin Xu, Xinglong Zhang

    Abstract: The aerodynamic design of turbomachinery is a complex and tightly coupled multi-stage process involving geometry generation, performance prediction, optimization, and high-fidelity physical validation. Existing intelligent design approaches typically focus on individual stages or rely on loosely coupled pipelines, making fully autonomous end-to-end design challenging. To address this issue, this s…

    Submitted 8 April, 2026; v1 submitted 8 April, 2026; originally announced April 2026.

  12. arXiv:2604.05682  [pdf, ps, other]

    cs.IT

    Non-GRS type MDS and AMDS codes from extended TGRS codes

    Authors: Meiying Zhang, Shudi Yang, Yanbin Zheng

    Abstract: Maximum distance separable (MDS) and almost maximum distance separable (AMDS) codes have been widely used in various fields such as communication systems, data storage, and quantum codes because of their algebraic properties and excellent error-correcting capabilities. In this paper, we construct a class of extended twisted generalized Reed-Solomon (TGRS) codes and determine the necessary and suff…

    Submitted 7 April, 2026; originally announced April 2026.

  13. arXiv:2604.05430  [pdf, ps, other]

    cs.RO

    Synergizing Efficiency and Reliability for Continuous Mobile Manipulation

    Authors: Chengkai Wu, Ruilin Wang, Yixin Zeng, Jiayuan Wang, Mingjie Zhang, Guiyong Zheng, Qun Niu, Juepeng Zheng, Jun Ma, Boyu Zhou

    Abstract: Humans seamlessly fuse anticipatory planning with immediate feedback to perform successive mobile manipulation tasks without stopping, achieving both high efficiency and reliability. Replicating this fluid and reliable behavior in robots remains fundamentally challenging, not only due to conflicts between long-horizon planning and real-time reactivity, but also because excessively pursuing efficie…

    Submitted 7 April, 2026; originally announced April 2026.

    Comments: 33 pages, 26 figures, 4 tables. Video: https://www.bilibili.com/video/BV1YWP4zxEQD

  14. arXiv:2604.05005  [pdf, ps, other]

    cs.CY cs.AI cs.CL

    EduIllustrate: Towards Scalable Automated Generation Of Multimodal Educational Content

    Authors: Shuzhen Bi, Mingzi Zhang, Zhuoxuan Li, Xiaolong Wang, Keqian Li, Aimin Zhou

    Abstract: Large language models are increasingly used as educational assistants, yet evaluation of their educational capabilities remains concentrated on question-answering and tutoring tasks. A critical gap exists for multimedia instructional content generation -- the ability to produce coherent, diagram-rich explanations that combine geometrically accurate visuals with step-by-step reasoning. We present E…

    Submitted 6 April, 2026; originally announced April 2026.

  15. arXiv:2604.04986  [pdf, ps, other]

    cs.LG

    Enhancing sample efficiency in reinforcement-learning-based flow control: replacing the critic with an adaptive reduced-order model

    Authors: Zesheng Yao, Zhen-Hua Wan, Canjun Yang, Qingchao Xia, Mengqi Zhang

    Abstract: Model-free deep reinforcement learning (DRL) methods suffer from poor sample efficiency. To overcome this limitation, this work introduces an adaptive reduced-order-model (ROM)-based reinforcement learning framework for active flow control. In contrast to conventional actor--critic architectures, the proposed approach leverages a ROM to estimate the gradient information required for controller opt…

    Submitted 4 April, 2026; originally announced April 2026.

    Comments: 43 pages, 26 figures

  16. arXiv:2604.04783  [pdf, ps, other]

    cs.CR cs.AR

    GPU Acceleration of TFHE-Based High-Precision Nonlinear Layers for Encrypted LLM Inference

    Authors: Guoci Chen, Xiurui Pan, Qiao Li, Bo Mao, Congming Gao, Chengying Huan, Mingzhe Zhang, Jie Zhang

    Abstract: Deploying large language models (LLMs) as cloud services raises privacy concerns as inference may leak sensitive data. Fully Homomorphic Encryption (FHE) allows computation on encrypted data, but current FHE methods struggle with efficient and precise nonlinear function evaluation. Specifically, CKKS-based approaches require high-degree polynomial approximations, which are costly when target preci…

    Submitted 6 April, 2026; originally announced April 2026.

    Comments: 11 pages, 7 figures

  17. arXiv:2604.04074  [pdf, ps, other]

    cs.AI cs.LG

    FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification

    Authors: Hang Xu, Ling Yue, Chaoqian Ouyang, Yuchen Liu, Libin Zheng, Shaowu Pan, Shimin Di, Min-Ling Zhang

    Abstract: Peer review in machine learning is under growing pressure from rising submission volume and limited reviewer time. Most LLM-based reviewing systems read only the manuscript and generate comments from the paper's own narrative. This makes their outputs sensitive to presentation quality and leaves them weak when the evidence needed for review lies in related work or released code. We present FactRev…

    Submitted 7 April, 2026; v1 submitted 5 April, 2026; originally announced April 2026.

  18. arXiv:2604.04009  [pdf, ps, other]

    cs.SE

    Benchmarking and Evaluating VLMs for Software Architecture Diagram Understanding

    Authors: Shuyin Ouyang, Jie M. Zhang, Jingzhi Gong, Gunel Jahangirova, Mohammad Reza Mousavi, Jack Johns, Beum Seuk Lee, Adam Ziolkowski, Botond Virginas, Joost Noppen

    Abstract: Software architecture diagrams are important design artifacts for communicating system structure, behavior, and data organization throughout the software development lifecycle. Although recent progress in large language models has substantially advanced code-centric software engineering tasks such as code generation, testing, and maintenance, the ability of modern vision-language models (VLMs) to…

    Submitted 5 April, 2026; originally announced April 2026.

  19. arXiv:2604.03964  [pdf, ps, other]

    cs.AI

    SKILLFOUNDRY: Building Self-Evolving Agent Skill Libraries from Heterogeneous Scientific Resources

    Authors: Shuaike Shen, Wenduo Cheng, Mingqian Ma, Alistair Turcan, Martin Jinye Zhang, Jian Ma

    Abstract: Modern scientific ecosystems are rich in procedural knowledge across repositories, APIs, scripts, notebooks, documentation, databases, and papers, yet much of this knowledge remains fragmented across heterogeneous artifacts that agents cannot readily operationalize. This gap between abundant scientific know-how and usable agent capabilities is a key bottleneck for building effective scientific age…

    Submitted 5 April, 2026; originally announced April 2026.

  20. arXiv:2604.03259  [pdf, ps, other]

    cs.CY

    From Pre-trained Models to Large Language Models: A Comprehensive Survey of AI-Driven Psychological Computing

    Authors: Huiyao Chen, Ruimeng Liu, Yan Luo, Jiawen Zhang, Meishan Zhang, Baotian Hu, Min Zhang

    Abstract: The intersection of artificial intelligence and psychological science has experienced remarkable growth, with annual publications expanding from 859 papers in 2000 to 29,979 by 2025. However, this rapid evolution has created methodological fragmentation where similar computational techniques are independently developed across isolated psychological domains. This survey introduces the first systema…

    Submitted 12 March, 2026; originally announced April 2026.

    Comments: 56 pages, Psychological Computing with AI

    MSC Class: 68U35 ACM Class: K.4.2

  21. ChatSVA: Bridging SVA Generation for Hardware Verification via Task-Specific LLMs

    Authors: Lik Tung Fu, Jie Zhou, Shaokai Ren, Mengli Zhang, Jia Xiong, Hugo Jiang, Nan Guan, Xi Wang, Jun Yang

    Abstract: Functional verification consumes over 50% of the IC development lifecycle, where SystemVerilog Assertions (SVAs) are indispensable for formal property verification and enhanced simulation-based debugging. However, manual SVA authoring is labor-intensive and error-prone. While Large Language Models (LLMs) show promise, their direct deployment is hindered by low functional accuracy and a severe scar…

    Submitted 3 April, 2026; originally announced April 2026.

    Comments: Accepted by DAC 2026

  22. arXiv:2604.02759  [pdf, ps, other]

    cs.RO

    OMNI-PoseX: A Fast Vision Model for 6D Object Pose Estimation in Embodied Tasks

    Authors: Michael Zhang, Wei Ying, Fangwen Chen, Shifeng Bai, Hanwen Kang

    Abstract: Accurate 6D object pose estimation is a fundamental capability for embodied agents, yet remains highly challenging in open-world environments. Many existing methods often rely on closed-set assumptions or geometry-agnostic regression schemes, limiting their generalization, stability, and real-time applicability in robotic systems. We present OMNI-PoseX, a vision foundation model that introduces a…

    Submitted 3 April, 2026; originally announced April 2026.

  23. arXiv:2604.02686  [pdf, ps, other]

    cs.LG cs.AI

    Beyond Semantic Manipulation: Token-Space Attacks on Reward Models

    Authors: Yuheng Zhang, Mingyue Huo, Minghao Zhu, Mengxue Zhang, Nan Jiang

    Abstract: Reward models (RMs) are widely used as optimization targets in reinforcement learning from human feedback (RLHF), yet they remain vulnerable to reward hacking. Existing attacks mainly operate within the semantic space, constructing human-readable adversarial outputs that exploit RM biases. In this work, we introduce a fundamentally different paradigm: Token Mapping Perturbation Attack (TOMPA), a f…

    Submitted 2 April, 2026; originally announced April 2026.

  24. arXiv:2604.02647  [pdf, ps, other]

    cs.SE

    Runtime Execution Traces Guided Automated Program Repair with Multi-Agent Debate

    Authors: Jiaqing Wu, Tong Wu, Manqing Zhang, Yunwei Dong, Bo Shen

    Abstract: Automated Program Repair (APR) struggles with complex logic errors and silent failures. Current LLM-based APR methods are mostly static, relying on source code and basic test outputs, which fail to accurately capture complex runtime behaviors and dynamic data dependencies. While incorporating runtime evidence like execution traces exposes concrete state transitions, a single LLM interpreting this…

    Submitted 2 April, 2026; originally announced April 2026.

    Comments: 12 pages, 4 figures, 8 tables

    ACM Class: D.2.5; I.2.2

  25. arXiv:2604.01826  [pdf, ps, other]

    cs.CV

    SafeRoPE: Risk-specific Head-wise Embedding Rotation for Safe Generation in Rectified Flow Transformers

    Authors: Xiang Yang, Feifei Li, Mi Zhang, Geng Hong, Xiaoyu You, Min Yang

    Abstract: Recent Text-to-Image (T2I) models based on rectified-flow transformers (e.g., SD3, FLUX) achieve high generative fidelity but remain vulnerable to unsafe semantics, especially when triggered by multi-token interactions. Existing mitigation methods largely rely on fine-tuning or attention modulation for concept unlearning; however, their expensive computational overhead and design tailored to U-Net…

    Submitted 2 April, 2026; originally announced April 2026.

    Comments: CVPR26

  26. arXiv:2604.01538  [pdf]

    cs.CL cs.AI

    Countering Catastrophic Forgetting of Large Language Models for Better Instruction Following via Weight-Space Model Merging

    Authors: Mengxian Lyu, Cheng Peng, Ziyi Chen, Mengyuan Zhang, Jieting Li Lu, Yonghui Wu

    Abstract: Large language models have been adopted in the medical domain for clinical documentation to reduce clinician burden. However, studies have reported that LLMs often "forget" a significant amount of instruction-following ability when fine-tuned using a task-specific medical dataset, a critical challenge in adopting general-purpose LLMs for clinical applications. This study presents a model merging f…

    Submitted 1 April, 2026; originally announced April 2026.

  27. arXiv:2604.01092  [pdf, ps, other]

    cs.CR cs.AR cs.NI

    LightGuard: Transparent WiFi Security via Physical-Layer LiFi Key Bootstrapping

    Authors: Shiqi Xu, Yuyang Du, Mingyue Zhang, Hongwei Cui, Soung Chang Liew

    Abstract: WiFi is inherently vulnerable to eavesdropping because RF signals may penetrate many physical boundaries, such as walls and floors. LiFi, by contrast, is an optical method confined to line-of-sight and blocked by opaque surfaces. We present LightGuard, a dual-link architecture built on this insight: cryptographic key establishment can be offloaded from WiFi to a physically confined LiFi channel to…

    Submitted 1 April, 2026; originally announced April 2026.

  28. arXiv:2604.00835  [pdf, ps, other]

    cs.CL

    Agentic Tool Use in Large Language Models

    Authors: Jinchao Hu, Meizhi Zhong, Kehai Chen, Xuefeng Bai, Min Zhang

    Abstract: Large language models are increasingly being deployed as autonomous agents yet their real world effectiveness depends on reliable tools for information retrieval, computation and external action. Existing studies remain fragmented across tasks, tool types, and training settings, lacking a unified view of how tool-use methods differ and evolve. This paper organizes the literature into three paradig…

    Submitted 1 April, 2026; originally announced April 2026.

  29. arXiv:2604.00702  [pdf, ps, other]

    cs.SE cs.CR

    Enhancing REST API Fuzzing with Access Policy Violation Checks and Injection Attacks

    Authors: Omur Sahin, Man Zhang, Andrea Arcuri

    Abstract: Due to their widespread use in industry, several techniques have been proposed in the literature to fuzz REST APIs. Existing fuzzers for REST APIs have been focusing on detecting crashes (e.g., 500 HTTP server error status code). However, security vulnerabilities can have major drastic consequences on existing cloud infrastructures. In this paper, we propose a series of novel automated oracles a…

    Submitted 1 April, 2026; originally announced April 2026.

  30. arXiv:2604.00368  [pdf, ps, other]

    cs.DC

    TENT: A Declarative Slice Spraying Engine for Performant and Resilient Data Movement in Disaggregated LLM Serving

    Authors: Feng Ren, Ruoyu Qin, Teng Ma, Shangming Cai, Zheng Liu, Chao Lei, Dejiang Zhu, Ke Yang, Zheming Li, Jialei Cui, Weixiao Huang, Yikai Zhao, Yineng Zhang, Hao Wu, Xiang Gao, Yuhao Fu, Jinlei Jiang, Yongwei Wu, Mingxing Zhang

    Abstract: Modern GPU clusters are built upon a complex hierarchy of heterogeneous interconnects, ranging from multi-rail RDMA to proprietary fabrics such as Multi-Node NVLink and Ascend UB. Orchestrating these diverse links effectively remains a critical challenge in disaggregated LLM serving. Operating Mooncake TE on thousands of GPUs exposed a critical limitation shared by existing frameworks: imperative,…

    Submitted 31 March, 2026; originally announced April 2026.

  31. arXiv:2603.29620  [pdf, ps, other]

    cs.CV cs.MM

    Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis

    Authors: Shuang Chen, Quanxin Shou, Hangting Chen, Yucheng Zhou, Kaituo Feng, Wenbo Hu, Yi-Fan Zhang, Yunlong Lin, Wenxuan Huang, Mingyang Song, Dasen Dai, Bolin Jiang, Manyuan Zhang, Shi-Xue Zhang, Zhengkai Jiang, Lucas Wang, Zhao Zhong, Yu Cheng, Nanyun Peng

    Abstract: Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. Inspired by the broad success of agents on real-worl…

    Submitted 1 April, 2026; v1 submitted 31 March, 2026; originally announced March 2026.

    Comments: Project Page: https://github.com/shawn0728/Unify-Agent

  32. arXiv:2603.29587  [pdf, ps, other]

    cs.GR

    Style-Instructed Mask-Free Virtual Try On

    Authors: Mengqi Zhang, Qi Li, Mehmet Saygin Seyfioglu, Karim Bouyarmane

    Abstract: Virtual Try-On is a promising research area with broad applications in e-commerce and everyday life, enabling users to visualize garments on themselves or others before purchase. Most existing methods depend on predefined or user-specified masks to guide garment placement, but their performance is highly sensitive to mask quality, often causing misalignment or artifacts, and introduces redundant s…

    Submitted 4 February, 2026; originally announced March 2026.

    Comments: Project page: https://smf-vto.github.io

  33. arXiv:2603.29407  [pdf, ps, other]

    cs.LG cs.AI

    Hybrid Quantum-Classical Spatiotemporal Forecasting for 3D Cloud Fields

    Authors: Fu Wang, Qifeng Lu, Xinyu Long, Meng Zhang, Xiaofei Yang, Weijia Cao, Xiaowen Chu

    Abstract: Accurate forecasting of three-dimensional (3D) cloud fields is important for atmospheric analysis and short-range numerical weather prediction, yet it remains challenging because cloud evolution involves cross-layer interactions, nonlocal dependencies, and multiscale spatiotemporal dynamics. Existing spatiotemporal prediction models based on convolutions, recurrence, or attention often rely on loc…

    Submitted 31 March, 2026; originally announced March 2026.

  34. arXiv:2603.28767  [pdf, ps, other]

    cs.CV

    Gen-Searcher: Reinforcing Agentic Search for Image Generation

    Authors: Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jiang, Hongyu Li, Dian Zheng, Chenyang Wang, Xiangyu Yue

    Abstract: Recent image generation models have shown strong capabilities in generating high-fidelity and photorealistic images. However, they are fundamentally constrained by frozen internal knowledge, thus often failing on real-world scenarios that are knowledge-intensive or require up-to-date information. In this paper, we present Gen-Searcher, as the first attempt to train a search-augmented image generat…

    Submitted 30 March, 2026; originally announced March 2026.

    Comments: Project page: https://gen-searcher.vercel.app Code: https://github.com/tulerfeng/Gen-Searcher

  35. arXiv:2603.28560  [pdf, ps, other]

    cs.CV

    Curriculum-Guided Myocardial Scar Segmentation for Ischemic and Non-ischemic Cardiomyopathy

    Authors: Nivetha Jayakumar, Jonathan Pan, Shuo Wang, Bishow Paudel, Nisha Hosadurg, Cristiane C. Singulane, Sivam Bhatt, Amit R. Patel, Miaomiao Zhang

    Abstract: Identification and quantification of myocardial scar is important for diagnosis and prognosis of cardiovascular diseases. However, reliable scar segmentation from Late Gadolinium Enhancement Cardiac Magnetic Resonance (LGE-CMR) images remains a challenge due to variations in contrast enhancement across patients, suboptimal imaging conditions such as post contrast washout, and inconsistencies in gr…

    Submitted 30 March, 2026; originally announced March 2026.

  36. arXiv:2603.28458  [pdf, ps, other]

    cs.LG cs.AI

    HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention

    Authors: Yufei Xu, Fanxu Meng, Fan Jiang, Yuxuan Wang, Ruijie Zhou, Zhaohui Wang, Jiexi Wu, Zhixin Pan, Xiaojuan Tang, Wenjie Pei, Tongxuan Liu, Di Yin, Xing Sun, Muhan Zhang

    Abstract: Token-level sparse attention mechanisms, exemplified by DeepSeek Sparse Attention (DSA), achieve fine-grained key selection by scoring every historical key for each query through a lightweight indexer, then computing attention only on the selected subset. While the downstream sparse attention itself scales favorably, the indexer must still scan the entire prefix for every query, introducing an per…

    Submitted 6 April, 2026; v1 submitted 30 March, 2026; originally announced March 2026.

  37. arXiv:2603.28452  [pdf, ps, other]

    cs.SE

    Detecting and Mitigating Flakiness in REST API Fuzzing

    Authors: Man Zhang, Chongyang Shen, Andrea Arcuri, Tao Yue

    Abstract: Test flakiness is a common problem in industry, which hinders the reliability of automated build and testing workflows. Most existing research on test flakiness has primarily focused on unit and small-scale integration tests. In contrast, flakiness in system-level testing such as REST APIs are comparatively under-explored. A large body of literature has been dedicated to the topic of fuzzing REST…

    Submitted 30 March, 2026; originally announced March 2026.

  38. arXiv:2603.28362  [pdf]

    cs.RO cond-mat.mtrl-sci cond-mat.soft physics.app-ph

    A Foldable and Agile Soft Electromagnetic Robot for Multimodal Navigation in Confined and Unstructured Environments

    Authors: Zhihao Lv, Xiaoyong Zhang, Mengfan Zhang, Xiaoyu Song, Xingyue Liu, Yide Liu, Shaoxing Qu, Guoyong Mao

    Abstract: Multimodal locomotion is crucial for an animal's adaptability in unstructured wild environments. Similarly, in the human gastrointestinal tract, characterized by viscoelastic mucus, complex rugae, and narrow sphincters like the cardia, multimodal locomotion is also essential for a small-scale soft robot to conduct tasks. Here, we introduce a small-scale compact, foldable, and robust soft electroma…

    Submitted 30 March, 2026; originally announced March 2026.

  39. arXiv:2603.27850  [pdf, ps, other]

    cs.SE cs.CL

    EffiSkill: Agent Skill Based Automated Code Efficiency Optimization

    Authors: Zimu Wang, Yuling Shi, Mengfan Li, Zijun Liu, Jie M. Zhang, Chengcheng Wan, Xiaodong Gu

    Abstract: Code efficiency is a fundamental aspect of software quality, yet how to harness large language models (LLMs) to optimize programs remains challenging. Prior approaches have sought for one-shot rewriting, retrieved exemplars, or prompt-based search, but they do not explicitly distill reusable optimization knowledge, which limits generalization beyond individual instances. In this paper, we presen…

    Submitted 29 March, 2026; originally announced March 2026.

  40. arXiv:2603.27703  [pdf, ps, other]

    cs.CL cs.LG

    KAT-Coder-V2 Technical Report

    Authors: Fengxiang Li, Han Zhang, Haoyang Huang, Jinghui Wang, Jinhua Hao, Kun Yuan, Mengtong Li, Minglei Zhang, Pengcheng Xu, Wenhao Zhuang, Yizhen Shao, Zongxian Feng, Can Tang, Chao Wang, Chengxiao Tong, Fan Yang, Gang Xiong, Haixuan Gao, Han Gao, Hao Wang, Haochen Liu, Hongliang Sun, Jiabao Li, Jingwen Chang, Jun Du, et al. (21 additional authors not shown)

    Abstract: We present KAT-Coder-V2, an agentic coding model developed by the KwaiKAT team at Kuaishou. KAT-Coder-V2 adopts a "Specialize-then-Unify" paradigm that decomposes agentic coding into five expert domains - SWE, WebCoding, Terminal, WebSearch, and General - each undergoing independent supervised fine-tuning and reinforcement learning, before being consolidated into a single model via on-policy disti…

    Submitted 29 March, 2026; originally announced March 2026.

    Comments: 22 pages, 7 figures

  41. arXiv:2603.27538  [pdf, ps, other]

    cs.CV cs.CL

    LongCat-Next: Lexicalizing Modalities as Discrete Tokens

    Authors: Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, Haozhe Zhao, Hong Liu, Hui Su, Jiaqi Zhang, Jiawei Wang, Jing Li, Kefeng Zhang, Manyuan Zhang, Minhao Jing, Peng Pei, Quan Chen, Taofeng Xue, Tongxin Pan, Xiaotong Li, Xiaoyang Li, et al. (64 additional authors not shown)

    Abstract: The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Aut…

    Submitted 29 March, 2026; originally announced March 2026.

    Comments: LongCat-Next Technical Report

  42. arXiv:2603.27460  [pdf, ps, other]

    cs.CV cs.AI

    Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development

    Authors: Zhongying Deng, Cheng Tang, Ziyan Huang, Jiashi Lin, Ying Chen, Junzhi Ning, Chenglong Ma, Jiyao Liu, Wei Li, Yinghao Zhu, Shujian Gao, Yanyan Huang, Sibo Ju, Yanzhou Su, Pengcheng Chen, Wenhao Tang, Tianbin Li, Haoyu Wang, Yuanfeng Ji, Hui Sun, Shaobo Min, Liang Peng, Feilong Tang, Haochen Xue, Rulin Zhou, et al. (102 additional authors not shown)

    Abstract: Foundation models have demonstrated remarkable success across diverse domains and tasks, primarily due to the thrive of large-scale, diverse, and high-quality datasets. However, in the field of medical imaging, the curation and assembling of such medical datasets are highly challenging due to the reliance on clinical expertise and strict ethical and privacy constraints, resulting in a scarcity of…

    Submitted 28 March, 2026; originally announced March 2026.

    Comments: 157 pages, 19 figures, 26 tables. Project repo: https://github.com/uni-medical/Project-Imaging-X

  43. arXiv:2603.26546  [pdf, ps, other]

    cs.CV

    AutoWeather4D: Autonomous Driving Video Weather Conversion via G-Buffer Dual-Pass Editing

    Authors: Tianyu Liu, Weitao Xiong, Kunming Luo, Manyuan Zhang, Peng Li, Yuan Liu, Ping Tan

    Abstract: Generative video models have significantly advanced the photorealistic synthesis of adverse weather for autonomous driving; however, they consistently demand massive datasets to learn rare weather scenarios. While 3D-aware editing methods alleviate these data constraints by augmenting existing video footage, they are fundamentally bottlenecked by costly per-scene optimization and suffer from inher…

    Submitted 1 April, 2026; v1 submitted 27 March, 2026; originally announced March 2026.

    Comments: Project Page: https://lty2226262.github.io/autoweather4d/ | GitHub: https://github.com/lty2226262/AutoWeather4D

  44. arXiv:2603.26535  [pdf, ps, other]

    cs.AI

    PAPO: Stabilizing Rubric Integration Training via Decoupled Advantage Normalization

    Authors: Zelin Tan, Zhouliang Yu, Bohan Lin, Zijie Geng, Hejia Geng, Yudong Zhang, Mulei Zhang, Yang Chen, Shuyue Hu, Zhenfei Yin, Chen Zhang, Lei Bai

    Abstract: We propose Process-Aware Policy Optimization (PAPO), a method that integrates process-level evaluation into Group Relative Policy Optimization (GRPO) through decoupled advantage normalization, to address two limitations of existing reward designs. Outcome reward models (ORM) evaluate only final-answer correctness, treating all correct responses identically regardless of reasoning quality, and grad…

    Submitted 3 April, 2026; v1 submitted 27 March, 2026; originally announced March 2026.

    Comments: 16 pages, 9 figures

  45. arXiv:2603.26496  [pdf, ps, other]

    cs.NI

    Innovation Discovery System for Networking Research

    Authors: Mengrui Zhang, Bang Huang, Yunxin Xu, Haiying Huang, Luxi Zhao, Mochun Long, Qingyu Song, Qiao Xiang, Xue Liu, Jiwu Shu

    Abstract: As networking systems become increasingly complex, achieving disruptive innovation grows more challenging. At the same time, recent progress in Large Language Models (LLMs) has shown strong potential for scientific hypothesis formation and idea generation. Nevertheless, applying LLMs effectively to networking research remains difficult for two main reasons: standalone LLMs tend to generate ideas b…

    Submitted 27 March, 2026; originally announced March 2026.

  46. arXiv:2603.26380  [pdf, ps, other]

    cs.CL

    Switch Attention: Towards Dynamic and Fine-grained Hybrid Transformers

    Authors: Yusheng Zhao, Hourun Li, Bohan Wu, Jingyang Yuan, Meng Zhang, Yichun Yin, Lifeng Shang, Ming Zhang

    Abstract: The attention mechanism has been the core component in modern transformer architectures. However, the computation of standard full attention scales quadratically with the sequence length, serving as a major bottleneck in long-context language modeling. Sliding window attention restricts the context length for better efficiency at the cost of narrower receptive fields. While existing efforts attemp…

    Submitted 27 March, 2026; originally announced March 2026.

  47. arXiv:2603.26341  [pdf, ps, other]

    cs.CV

    HINT: Composed Image Retrieval with Dual-path Compositional Contextualized Network

    Authors: Mingyu Zhang, Zixu Li, Zhiwei Chen, Zhiheng Fu, Xiaowei Zhu, Jiajia Nie, Yinwei Wei, Yupeng Hu

    Abstract: Composed Image Retrieval (CIR) is a challenging image retrieval paradigm. It aims to retrieve target images from large-scale image databases that are consistent with the modification semantics, based on a multimodal query composed of a reference image and modification text. Although existing methods have made significant progress in cross-modal alignment and feature fusion, a key flaw remains: the…

    Submitted 27 March, 2026; originally announced March 2026.

    Comments: Accepted by ICASSP 2026

  48. arXiv:2603.26250  [pdf, ps, other]

    cs.CV

    Real-Time Branch-to-Tool Distance Estimation for Autonomous UAV Pruning: Benchmarking Five DEFOM-Stereo Variants from Simulation to Jetson Deployment

    Authors: Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green

    Abstract: Autonomous tree pruning with unmanned aerial vehicles (UAVs) is a safety-critical real-world task: the onboard perception system must estimate the metric distance from a cutting tool to thin tree branches in real time so that the UAV can approach, align, and actuate the pruner without collision. We address this problem by training five variants of DEFOM-Stereo - a recent foundation-model-based ste…

    Submitted 27 March, 2026; originally announced March 2026.

  49. arXiv:2603.26108  [pdf, ps, other]

    cs.LG cs.CV

    Accurate Precipitation Forecast by Efficiently Learning from Massive Atmospheric Variables and Unbalanced Distribution

    Authors: Shuangliang Li, Siwei Li, Li Li, Weijie Zou, Jie Yang, Maolin Zhang

    Abstract: Short-term (0-24 hours) precipitation forecasting is highly valuable to socioeconomic activities and public safety. However, the highly complex evolution patterns of precipitation events, the extreme imbalance between precipitation and non-precipitation samples, and the inability of existing models to efficiently and effectively utilize large volumes of multi-source atmospheric observation data hi…

    Submitted 27 March, 2026; originally announced March 2026.

  50. arXiv:2603.25500  [pdf, ps, other]

    cs.CR cs.IR

    Unveiling the Resilience of LLM-Enhanced Search Engines against Black-Hat SEO Manipulation

    Authors: Pei Chen, Geng Hong, Xinyi Wu, Mengying Wu, Zixuan Zhu, Mingxuan Liu, Baojun Liu, Mi Zhang, Min Yang

    Abstract: The emergence of Large Language Model-enhanced Search Engines (LLMSEs) has revolutionized information retrieval by integrating web-scale search capabilities with AI-powered summarization. While these systems demonstrate improved efficiency over traditional search engines, their security implications against well-established black-hat Search Engine Optimization (SEO) attacks remain unexplored. In…

    Submitted 26 March, 2026; originally announced March 2026.

    Comments: Accepted at The ACM Web Conference 2026 (WWW 2026)