
Showing 1–50 of 144 results for author: Zhuang, B

Searching in archive cs.
  1. arXiv:2604.04921  [pdf, ps, other]

    cs.CL cs.CV

    TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

    Authors: Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, Yukang Chen

    Abstract: Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, queries rotate with position during RoPE, leaving very few representative queries and resulting in poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-Ro…

    Submitted 6 April, 2026; originally announced April 2026.

    Comments: Code is available at https://github.com/WeianMao/triattention

  2. arXiv:2604.04887  [pdf, ps, other]

    cs.CV

    HorizonWeaver: Generalizable Multi-Level Semantic Editing for Driving Scenes

    Authors: Mauricio Soroco, Francesco Pittaluga, Zaid Tasneem, Abhishek Aich, Bingbing Zhuang, Wuyang Chen, Manmohan Chandraker, Ziyu Jiang

    Abstract: Ensuring safety in autonomous driving requires scalable generation of realistic, controllable driving scenes beyond what real-world testing provides. Yet existing instruction-guided image editors, trained on object-centric or artistic data, struggle with dense, safety-critical driving layouts. We propose HorizonWeaver, which tackles three fundamental challenges in driving scene editing: (1) multi-…

    Submitted 6 April, 2026; originally announced April 2026.

    Comments: CVPR Findings 2026

  3. arXiv:2604.04838  [pdf, ps, other]

    cs.CV

    Less Detail, Better Answers: Degradation-Driven Prompting for VQA

    Authors: Haoxuan Han, Weijie Wang, Zeyu Zhang, Yefei He, Bohan Zhuang

    Abstract: Recent advancements in Vision-Language Models (VLMs) have significantly pushed the boundaries of Visual Question Answering (VQA). However, high-resolution details can sometimes become noise that leads to hallucinations or reasoning errors. In this paper, we propose Degradation-Driven Prompting (DDP), a novel framework that improves VQA performance by strategically reducing image fidelity to force mod…

    Submitted 7 April, 2026; v1 submitted 6 April, 2026; originally announced April 2026.

    Comments: Accepted to CVPRW 2026. Project page: https://hhx-jpg.github.io/ddp/ , Code: https://github.com/ziplab/DDP

    MSC Class: 68T45

  4. arXiv:2603.27460  [pdf, ps, other]

    cs.CV cs.AI

    Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development

    Authors: Zhongying Deng, Cheng Tang, Ziyan Huang, Jiashi Lin, Ying Chen, Junzhi Ning, Chenglong Ma, Jiyao Liu, Wei Li, Yinghao Zhu, Shujian Gao, Yanyan Huang, Sibo Ju, Yanzhou Su, Pengcheng Chen, Wenhao Tang, Tianbin Li, Haoyu Wang, Yuanfeng Ji, Hui Sun, Shaobo Min, Liang Peng, Feilong Tang, Haochen Xue, Rulin Zhou , et al. (102 additional authors not shown)

    Abstract: Foundation models have demonstrated remarkable success across diverse domains and tasks, primarily due to the availability of large-scale, diverse, and high-quality datasets. However, in the field of medical imaging, the curation and assembly of such datasets are highly challenging due to the reliance on clinical expertise and strict ethical and privacy constraints, resulting in a scarcity of…

    Submitted 28 March, 2026; originally announced March 2026.

    Comments: 157 pages, 19 figures, 26 tables. Project repo: https://github.com/uni-medical/Project-Imaging-X

  5. arXiv:2603.13398  [pdf, ps, other]

    cs.CV

    Qianfan-OCR: A Unified End-to-End Model for Document Intelligence

    Authors: Daxiang Dong, Mingming Zheng, Dong Xu, Chunhua Luo, Bairong Zhuang, Yuxuan Li, Ruoyun He, Haoran Wang, Wenyu Zhang, Wenbo Wang, Yicheng Wang, Xue Xiong, Ayong Zheng, Xiaoying Zuo, Ziwei Ou, Jingnan Gu, Quanhao Guo, Jianmin Wu, Dawei Yin, Dou Shen

    Abstract: We present Qianfan-OCR, a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture. It performs direct image-to-Markdown conversion and supports diverse prompt-driven tasks including table extraction, chart understanding, document QA, and key information extraction. To address the loss of explicit layout a…

    Submitted 11 March, 2026; originally announced March 2026.

  6. HiSync: Spatio-Temporally Aligning Hand Motion from Wearable IMU and On-Robot Camera for Command Source Identification in Long-Range HRI

    Authors: Chengwen Zhang, Chun Yu, Borong Zhuang, Haopeng Jin, Qingyang Wan, Zhuojun Li, Zhe He, Zhoutong Ye, Yu Mei, Chang Liu, Weinan Shi, Yuanchun Shi

    Abstract: Long-range Human-Robot Interaction (HRI) remains underexplored. Within it, Command Source Identification (CSI) - determining who issued a command - is especially challenging due to multi-user and distance-induced sensor ambiguity. We introduce HiSync, an optical-inertial fusion framework that treats hand motion as binding cues by aligning robot-mounted camera optical flow with hand-worn IMU signal…

    Submitted 25 March, 2026; v1 submitted 12 March, 2026; originally announced March 2026.

    Report number: chi26-82 ACM Class: H.5.2; I.2.9

  7. arXiv:2602.09587  [pdf, ps, other]

    cs.CV cs.AI

    MieDB-100k: A Comprehensive Dataset for Medical Image Editing

    Authors: Yongfan Lai, Wen Qian, Bo Liu, Hongyan Li, Hao Luo, Fan Wang, Bohan Zhuang, Shenda Hong

    Abstract: The scarcity of high-quality data remains a primary bottleneck in adapting multimodal generative models for medical image editing. Existing medical image editing datasets often suffer from limited diversity, neglect of medical image understanding, and an inability to balance quality with scalability. To address these gaps, we propose MieDB-100k, a large-scale, high-quality and diverse dataset for text…

    Submitted 10 February, 2026; originally announced February 2026.

  8. arXiv:2602.05305  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion

    Authors: Zhuokun Chen, Jianfei Cai, Bohan Zhuang

    Abstract: Generating long-form content, such as minute-long videos and extended texts, is increasingly important for modern generative models. Block diffusion improves inference efficiency via KV caching and block-wise causal inference and has been widely adopted in diffusion language models and video generation. However, in long-context settings, block diffusion still incurs substantial overhead from repea…

    Submitted 6 February, 2026; v1 submitted 4 February, 2026; originally announced February 2026.

  9. arXiv:2601.05172  [pdf, ps, other]

    cs.CV cs.AI

    CoV: Chain-of-View Prompting for Spatial Reasoning

    Authors: Haoyu Zhao, Akide Liu, Zeyu Zhang, Weijie Wang, Feng Chen, Ruihan Zhu, Gholamreza Haffari, Bohan Zhuang

    Abstract: Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision-language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose…

    Submitted 9 January, 2026; v1 submitted 8 January, 2026; originally announced January 2026.

    Comments: Code link https://github.com/ziplab/CoV

  10. arXiv:2512.13006  [pdf, ps, other]

    cs.CV

    Few-Step Distillation for Text-to-Image Generation: A Practical Guide

    Authors: Yifan Pu, Yizeng Han, Zhiwei Tang, Jiasheng Tang, Fan Wang, Bohan Zhuang, Gao Huang

    Abstract: Diffusion distillation has dramatically accelerated class-conditional image synthesis, but its applicability to open-ended text-to-image (T2I) generation is still unclear. We present the first systematic study that adapts and compares state-of-the-art distillation techniques on a strong T2I teacher model, FLUX.1-lite. By casting existing methods into a unified framework, we identify the key obstac…

    Submitted 15 December, 2025; originally announced December 2025.

  11. arXiv:2512.04025  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation

    Authors: Xiaolong Li, Youping Gu, Xi Lin, Weijie Wang, Bohan Zhuang

    Abstract: Attention mechanisms are the core of foundation models, but their quadratic complexity remains a critical bottleneck for scaling. This challenge has driven the development of efficient attention mechanisms, with sparsity emerging as the dominant paradigm. Current methods typically retain or discard entire key-value blocks with binary masks, resulting in substantial information loss under high spar…

    Submitted 3 December, 2025; originally announced December 2025.

    Comments: Tech report

  12. arXiv:2511.22973  [pdf, ps, other]

    cs.CV

    BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation

    Authors: Zeyu Zhang, Shuning Chang, Yuanyu He, Yizeng Han, Jiasheng Tang, Fan Wang, Bohan Zhuang

    Abstract: Generating minute-long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. The emerging semi-autoregressive (block diffusion) paradigm integrates the strengths of diffusion and autoregressive models, enabling arbitrary-length video generation and improving inference efficiency through KV caching and parallel sam…

    Submitted 28 November, 2025; originally announced November 2025.

  13. arXiv:2511.22659  [pdf, ps, other]

    cs.AI cs.CV

    Geometrically-Constrained Agent for Spatial Reasoning

    Authors: Zeren Chen, Xiaoya Lu, Zhijie Zheng, Pengrui Li, Lehan He, Yijin Zhou, Jing Shao, Bohan Zhuang, Lu Sheng

    Abstract: Vision Language Models (VLMs) exhibit a fundamental semantic-to-geometric gap in spatial reasoning: they excel at qualitative semantic inference but their reasoning operates within a lossy semantic space, misaligned with high-fidelity geometry. Current paradigms fail to bridge this gap. Training-based methods suffer from an "oracle paradox," learning flawed spatial logic from imperfect oracles.…

    Submitted 27 November, 2025; originally announced November 2025.

    Comments: 27 pages, 13 figures

  14. arXiv:2511.20714  [pdf, ps, other]

    cs.CV cs.AI

    Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation

    Authors: Inferix Team, Tianyu Feng, Yizeng Han, Jiahao He, Yuanyu He, Xi Lin, Teng Liu, Hanfeng Lu, Jiasheng Tang, Wei Wang, Zhiyuan Wang, Jichao Wu, Mingyang Yang, Yinghao Yu, Zeyu Zhang, Bohan Zhuang

    Abstract: World models serve as core simulators for fields such as agentic AI, embodied AI, and gaming, capable of generating long, physically realistic, and interactive high-quality videos. Moreover, scaling these models could unlock emergent capabilities in visual perception, understanding, and reasoning, paving the way for a new paradigm that moves beyond current LLM-centric vision foundation models. A k…

    Submitted 24 November, 2025; originally announced November 2025.

  15. arXiv:2511.12201  [pdf, ps, other]

    cs.CV

    OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs

    Authors: Feng Chen, Yefei He, Shaoxuan He, Yuanyu He, Jing Liu, Lequan Lin, Akide Liu, Zhaoyang Li, Jiyuan Zhang, Zhenbang Sun, Bohan Zhuang, Qi Wu

    Abstract: Existing sparse attention methods primarily target inference-time acceleration by selecting critical tokens under predefined sparsity patterns. However, they often fail to bridge the training-inference gap and lack the capacity for fine-grained token selection across multiple dimensions such as queries, key-values (KV), and heads, leading to suboptimal performance and limited acceleration gains. I…

    Submitted 18 November, 2025; v1 submitted 15 November, 2025; originally announced November 2025.

    Comments: Accepted by AAAI2026

  16. arXiv:2511.08246  [pdf, ps, other]

    cs.AI

    Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning

    Authors: Ziyu Ma, Chenhui Gou, Yiming Hu, Yong Wang, Xiangxiang Chu, Bohan Zhuang, Jianfei Cai

    Abstract: Large Multimodal Models (LMMs) have shown promising in-context learning (ICL) capabilities, but scaling to many-shot settings remains difficult due to limited context length and high inference cost. To address these challenges, task-vector-based methods have been explored by inserting compact representations of many-shot in-context demonstrations into model activations. However, existing task-vect…

    Submitted 11 November, 2025; originally announced November 2025.

    Comments: Accepted by AAAI 2026

  17. arXiv:2510.26136  [pdf, ps, other]

    cs.AI

    Beyond Benchmarks: The Economics of AI Inference

    Authors: Boqin Zhuang, Jiacheng Qiao, Mingqian Liu, Mingxing Yu, Ping Hong, Rui Li, Xiaoxia Song, Xiangjun Xu, Xu Chen, Yaoyao Ma, Yujie Gao

    Abstract: The inference cost of Large Language Models (LLMs) has become a critical factor in determining their commercial viability and widespread adoption. This paper introduces a quantitative "economics of inference" framework, treating the LLM inference process as a compute-driven intelligent production activity. We analyze its marginal cost, economies of scale, and quality of output under various perf…

    Submitted 30 October, 2025; originally announced October 2025.

  18. arXiv:2510.20726  [pdf, ps, other]

    cs.CV

    AutoScape: Geometry-Consistent Long-Horizon Scene Generation

    Authors: Jiacheng Chen, Ziyu Jiang, Mingfu Liang, Bingbing Zhuang, Jong-Chyi Su, Sparsh Garg, Ying Wu, Manmohan Chandraker

    Abstract: This paper proposes AutoScape, a long-horizon driving scene generation framework. At its core is a novel RGB-D diffusion model that iteratively generates sparse, geometrically consistent keyframes, serving as reliable anchors for the scene's appearance and geometry. To maintain long-range geometric consistency, the model 1) jointly handles image and depth in a shared latent space, 2) explicitly co…

    Submitted 23 October, 2025; originally announced October 2025.

    Comments: ICCV 2025. Project page: https://auto-scape.github.io

  19. arXiv:2509.22323  [pdf, ps, other]

    cs.CV

    RAPID^3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformer

    Authors: Wangbo Zhao, Yizeng Han, Zhiwei Tang, Jiasheng Tang, Pengfei Zhou, Kai Wang, Bohan Zhuang, Zhangyang Wang, Fan Wang, Yang You

    Abstract: Diffusion Transformers (DiTs) excel at visual generation yet remain hampered by slow sampling. Existing training-free accelerators - step reduction, feature caching, and sparse attention - enhance inference speed but typically rely on a uniform heuristic or a manually designed adaptive strategy for all images, leaving quality on the table. Alternatively, dynamic neural networks offer per-image ada…

    Submitted 26 September, 2025; originally announced September 2025.

  20. arXiv:2509.19552  [pdf, ps, other]

    cs.CV

    iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning

    Authors: Manyi Yao, Bingbing Zhuang, Sparsh Garg, Amit Roy-Chowdhury, Christian Shelton, Manmohan Chandraker, Abhishek Aich

    Abstract: Grounding large language models (LLMs) in domain-specific tasks like post-hoc dash-cam driving video analysis is challenging due to their general-purpose training and lack of structured inductive biases. As vision is often the sole modality available for such analysis (i.e., no LiDAR, GPS, etc.), existing video-based vision-language models (V-VLMs) struggle with spatial reasoning, causal inference…

    Submitted 5 December, 2025; v1 submitted 23 September, 2025; originally announced September 2025.

    Comments: Accepted at NeurIPS 2025

  21. arXiv:2509.19297  [pdf, ps, other]

    cs.CV

    VolSplat: Rethinking Feed-Forward 3D Gaussian Splatting with Voxel-Aligned Prediction

    Authors: Weijie Wang, Yeqing Chen, Zeyu Zhang, Hengyu Liu, Haoxiao Wang, Zhiyuan Feng, Wenkang Qin, Feng Chen, Zheng Zhu, Donny Y. Chen, Bohan Zhuang

    Abstract: Feed-forward 3D Gaussian Splatting (3DGS) has emerged as a highly effective solution for novel view synthesis. Existing methods predominantly rely on a pixel-aligned Gaussian prediction paradigm, where each 2D pixel is mapped to a 3D Gaussian. We rethink this widely adopted formulation and identify several inherent limitations: it renders the reconstructed 3D models heavily dependent on the…

    Submitted 12 March, 2026; v1 submitted 23 September, 2025; originally announced September 2025.

    Comments: Project Page: https://lhmd.top/volsplat, Code: https://github.com/ziplab/VolSplat

  22. arXiv:2509.18189  [pdf, ps, other]

    cs.CV cs.AI

    Qianfan-VL: Domain-Enhanced Universal Vision-Language Models

    Authors: Daxiang Dong, Mingming Zheng, Dong Xu, Bairong Zhuang, Wenyu Zhang, Chunhua Luo, Haoran Wang, Zijian Zhao, Jie Li, Yuxuan Li, Hanjun Zhong, Mengyue Liu, Jieting Chen, Shupeng Li, Lun Tian, Yaping Feng, Xin Li, Donggang Jiang, Yong Chen, Yehua Xu, Duohao Qin, Chen Feng, Dan Wang, Henghua Zhang, Jingjing Ha , et al. (10 additional authors not shown)

    Abstract: We present Qianfan-VL, a series of multimodal large language models ranging from 3B to 70B parameters, achieving state-of-the-art performance through innovative domain enhancement techniques. Our approach employs multi-stage progressive training and high-precision data synthesis pipelines, which prove to be critical technologies for enhancing domain-specific capabilities while maintaining strong g…

    Submitted 19 September, 2025; originally announced September 2025.

    Comments: 12 pages

  23. arXiv:2508.15360  [pdf, ps, other]

    cs.CV

    An Empirical Study on How Video-LLMs Answer Video Questions

    Authors: Chenhui Gou, Ziyu Ma, Zicheng Duan, Haoyu He, Feng Chen, Akide Liu, Bohan Zhuang, Jianfei Cai, Hamid Rezatofighi

    Abstract: Taking advantage of large-scale data and pretrained language models, Video Large Language Models (Video-LLMs) have shown strong capabilities in answering video questions. However, most existing efforts focus on improving performance, with limited attention to understanding their internal mechanisms. This paper aims to bridge this gap through a systematic empirical study. To interpret existing Vide…

    Submitted 21 August, 2025; originally announced August 2025.

  24. arXiv:2508.10774  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation

    Authors: Youping Gu, Xiaolong Li, Yuhao Hu, Minqi Chen, Bohan Zhuang

    Abstract: Diffusion Transformers currently lead the field in high-quality video generation, but their slow iterative denoising process and prohibitive quadratic attention costs for long sequences create significant inference bottlenecks. While both step distillation and sparse attention mechanisms have shown promise as independent acceleration strategies, effectively combining these approaches presents crit…

    Submitted 29 September, 2025; v1 submitted 14 August, 2025; originally announced August 2025.

    Comments: Tech report

  25. arXiv:2507.20454  [pdf, ps, other]

    cs.CV cs.LG

    Frequency-Aware Autoregressive Modeling for Efficient High-Resolution Image Synthesis

    Authors: Zhuokun Chen, Jugang Fan, Zhuowei Yu, Bohan Zhuang, Mingkui Tan

    Abstract: Visual autoregressive modeling, based on the next-scale prediction paradigm, exhibits notable advantages in image quality and model scalability over traditional autoregressive and diffusion models. It generates images by progressively refining resolution across multiple stages. However, the computational overhead in high-resolution stages remains a critical challenge due to the substantial number…

    Submitted 27 July, 2025; originally announced July 2025.

  26. arXiv:2507.17307  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning

    Authors: Zhuokun Chen, Zeren Chen, Jiahao He, Lu Sheng, Mingkui Tan, Jianfei Cai, Bohan Zhuang

    Abstract: Chain-of-thought (CoT) enhances the problem-solving ability of large language models (LLMs) but incurs substantial inference cost due to long autoregressive trajectories. Existing acceleration strategies either shorten traces via early stopping or compression, or adopt speculative decoding with a smaller model. However, speculative decoding provides limited gains when model agreement is low and ri…

    Submitted 8 February, 2026; v1 submitted 23 July, 2025; originally announced July 2025.

  27. arXiv:2506.08541  [pdf, ps, other]

    cs.CV cs.AI

    TrajFlow: Multi-modal Motion Prediction via Flow Matching

    Authors: Qi Yan, Brian Zhang, Yutong Zhang, Daniel Yang, Joshua White, Di Chen, Jiachao Liu, Langechuan Liu, Binnan Zhuang, Shaoshuai Shi, Renjie Liao

    Abstract: Efficient and accurate motion prediction is crucial for ensuring safety and informed decision-making in autonomous driving, particularly under dynamic real-world conditions that necessitate multi-modal forecasts. We introduce TrajFlow, a novel flow matching-based motion prediction framework that addresses the scalability and efficiency challenges of existing generative trajectory prediction method…

    Submitted 5 July, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

    Comments: IROS 2025

  28. arXiv:2506.05327  [pdf, ps, other]

    cs.CV

    Revisiting Depth Representations for Feed-Forward 3D Gaussian Splatting

    Authors: Duochao Shi, Weijie Wang, Donny Y. Chen, Zeyu Zhang, Jia-Wang Bian, Bohan Zhuang, Chunhua Shen

    Abstract: Depth maps are widely used in feed-forward 3D Gaussian Splatting (3DGS) pipelines by unprojecting them into 3D point clouds for novel view synthesis. This approach offers advantages such as efficient training, the use of known camera poses, and accurate geometry estimation. However, depth discontinuities at object boundaries often lead to fragmented or sparse point clouds, degrading rendering qual…

    Submitted 5 June, 2025; originally announced June 2025.

    Comments: Project page: https://aim-uofa.github.io/PMLoss

  29. arXiv:2506.04648  [pdf, ps, other]

    cs.CV

    FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion

    Authors: Akide Liu, Zeyu Zhang, Zhexin Li, Xuehai Bai, Yizeng Han, Jiasheng Tang, Yuanjie Xing, Jichao Wu, Mingyang Yang, Weihua Chen, Jiahao He, Yuanyu He, Fan Wang, Gholamreza Haffari, Bohan Zhuang

    Abstract: Diffusion generative models have become the standard for producing high-quality, coherent video content, yet their slow inference speeds and high computational demands hinder practical deployment. Although both quantization and sparsity can independently accelerate inference while maintaining generation quality, naively combining these techniques in existing training-free approaches leads to signi…

    Submitted 5 June, 2025; v1 submitted 5 June, 2025; originally announced June 2025.

    Comments: Project Page: https://fps.ziplab.co

  30. arXiv:2505.23734  [pdf, ps, other]

    cs.CV

    ZPressor: Bottleneck-Aware Compression for Scalable Feed-Forward 3DGS

    Authors: Weijie Wang, Donny Y. Chen, Zeyu Zhang, Duochao Shi, Akide Liu, Bohan Zhuang

    Abstract: Feed-forward 3D Gaussian Splatting (3DGS) models have recently emerged as a promising solution for novel view synthesis, enabling one-pass inference without the need for per-scene 3DGS optimization. However, their scalability is fundamentally constrained by the limited capacity of their models, leading to degraded performance or excessive memory consumption as the number of input views increases.…

    Submitted 17 November, 2025; v1 submitted 29 May, 2025; originally announced May 2025.

    Comments: NeurIPS 2025, Project Page: https://lhmd.top/zpressor, Code: https://github.com/ziplab/ZPressor

  31. arXiv:2505.17387  [pdf, ps, other]

    cs.CL

    WiNGPT-3.0 Technical Report

    Authors: Boqin Zhuang, Chenxiao Song, Huitong Lu, Jiacheng Qiao, Mingqian Liu, Mingxing Yu, Ping Hong, Rui Li, Xiaoxia Song, Xiangjun Xu, Xu Chen, Yaoyao Ma, Yujie Gao

    Abstract: Current Large Language Models (LLMs) exhibit significant limitations, notably in structured, interpretable, and verifiable medical reasoning, alongside practical deployment challenges related to computational resources and data privacy. This report focuses on the development of WiNGPT-3.0, a 32-billion-parameter LLM, engineered with the objective of enhancing its capacity for medical reasoning…

    Submitted 4 June, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

  32. arXiv:2504.18579  [pdf, ps, other]

    cs.LG

    Sparsity Forcing: Reinforcing Token Sparsity of MLLMs

    Authors: Feng Chen, Yefei He, Lequan Lin, Chenhui Gou, Jing Liu, Bohan Zhuang, Qi Wu

    Abstract: Sparse attention mechanisms aim to reduce computational overhead with minimal accuracy loss by selectively processing salient tokens. Despite their effectiveness, most methods merely exploit a model's inherent sparsity and thus plateau at moderate budgets (about 50% token reduction), with little headroom to push the budget lower without hurting accuracy. Other approaches attempt to enforce sparsity t…

    Submitted 27 February, 2026; v1 submitted 22 April, 2025; originally announced April 2025.

    Comments: Accepted by ICLR 2026

  33. arXiv:2503.10696  [pdf, other]

    cs.CV eess.IV

    Neighboring Autoregressive Modeling for Efficient Visual Generation

    Authors: Yefei He, Yuanyu He, Shaoxuan He, Feng Chen, Hong Zhou, Kaipeng Zhang, Bohan Zhuang

    Abstract: Visual autoregressive models typically adhere to a raster-order "next-token prediction" paradigm, which overlooks the spatial and temporal locality inherent in visual content. Specifically, visual tokens exhibit significantly stronger correlations with their spatially or temporally adjacent tokens compared to those that are distant. In this paper, we propose Neighboring Autoregressive Modeling (N…

    Submitted 12 March, 2025; originally announced March 2025.

    Comments: 16 pages

  34. arXiv:2503.06955  [pdf, other]

    cs.CV

    Motion Anything: Any to Motion Generation

    Authors: Zeyu Zhang, Yiran Wang, Wei Mao, Danning Li, Rui Zhao, Biao Wu, Zirui Song, Bohan Zhuang, Ian Reid, Richard Hartley

    Abstract: Conditional motion generation has been extensively studied in computer vision, yet two critical challenges remain. First, while masked autoregressive methods have recently outperformed diffusion-based approaches, existing masking models lack a mechanism to prioritize dynamic frames and body parts based on given conditions. Second, existing methods for different conditioning modalities often fail t…

    Submitted 11 March, 2025; v1 submitted 10 March, 2025; originally announced March 2025.

  35. arXiv:2503.02358  [pdf, other]

    cs.CV cs.AI cs.CL

    Are Large Vision Language Models Good Game Players?

    Authors: Xinyu Wang, Bohan Zhuang, Qi Wu

    Abstract: Large Vision Language Models (LVLMs) have demonstrated remarkable abilities in understanding and reasoning about both visual and textual information. However, existing evaluation methods for LVLMs, primarily based on benchmarks like Visual Question Answering and image captioning, often fail to capture the full scope of LVLMs' capabilities. These benchmarks are limited by issues such as inadequate…

    Submitted 4 March, 2025; originally announced March 2025.

    Comments: ICLR 2025

  36. arXiv:2412.15283  [pdf, other]

    cs.CL cs.AI cs.LG

    Channel Merging: Preserving Specialization for Merged Experts

    Authors: Mingyang Zhang, Jing Liu, Ganggui Ding, Xinyi Yu, Linlin Ou, Bohan Zhuang

    Abstract: Task-specific fine-tuning has recently been adopted to improve the performance of large language models (LLMs) in downstream tasks. Through the integration of diverse LLMs, the overall competency of LLMs is significantly boosted. Nevertheless, traditional ensemble methods are notably memory-intensive, necessitating the simultaneous loading of all specialized models into…

    Submitted 18 December, 2024; originally announced December 2024.

    Comments: accepted by AAAI 2025

  37. arXiv:2412.14494  [pdf, other]

    cs.CV

    Drive-1-to-3: Enriching Diffusion Priors for Novel View Synthesis of Real Vehicles

    Authors: Chuang Lin, Bingbing Zhuang, Shanlin Sun, Ziyu Jiang, Jianfei Cai, Manmohan Chandraker

    Abstract: The recent advent of large-scale 3D data, e.g. Objaverse, has led to impressive progress in training pose-conditioned diffusion models for novel view synthesis. However, due to the synthetic nature of such 3D data, their performance drops significantly when applied to real-world images. This paper consolidates a set of good practices to finetune large pretrained models for a real-world task -- har…

    Submitted 18 December, 2024; originally announced December 2024.

  38. arXiv:2412.04062  [pdf, ps, other]

    cs.CV cs.AI

    ZipAR: Parallel Auto-regressive Image Generation through Spatial Locality

    Authors: Yefei He, Feng Chen, Yuanyu He, Shaoxuan He, Hong Zhou, Kaipeng Zhang, Bohan Zhuang

    Abstract: In this paper, we propose ZipAR, a training-free, plug-and-play parallel decoding framework for accelerating auto-regressive (AR) visual generation. The motivation stems from the observation that images exhibit local structures, and spatially distant regions tend to have minimal interdependence. Given a partially decoded set of visual tokens, in addition to the original next-token prediction schem…

    Submitted 29 June, 2025; v1 submitted 5 December, 2024; originally announced December 2024.

    Comments: 11 pages

  39. arXiv:2412.01289  [pdf, other]

    cs.CV cs.AI

    Enhancing Perception Capabilities of Multimodal LLMs with Training-Free Fusion

    Authors: Zhuokun Chen, Jinwu Hu, Zeshuai Deng, Yufeng Wang, Bohan Zhuang, Mingkui Tan

    Abstract: Multimodal LLMs (MLLMs) equip language models with visual capabilities by aligning vision encoders with language models. Existing methods to enhance the visual perception of MLLMs often involve designing more powerful vision encoders, which requires exploring a vast design space and re-aligning each potential encoder with the language model, resulting in prohibitively high training costs. In this…

    Submitted 4 December, 2024; v1 submitted 2 December, 2024; originally announced December 2024.

  40. arXiv:2411.14725  [pdf, other]

    cs.CV cs.CL cs.LG

    Evaluating and Advancing Multimodal Large Language Models in Perception Ability Lens

    Authors: Feng Chen, Chenhui Gou, Jing Liu, Yang Yang, Zhaoyang Li, Jiyuan Zhang, Zhenbang Sun, Bohan Zhuang, Qi Wu

    Abstract: As multimodal large language models (MLLMs) advance rapidly, rigorous evaluation has become essential, providing further guidance for their development. In this work, we focus on a unified and robust evaluation of vision perception abilities, the foundational skill of MLLMs. We find that existing perception benchmarks, each focusing on different question types, domains, and evaluation met…

    Submitted 3 June, 2025; v1 submitted 21 November, 2024; originally announced November 2024.

    Comments: Code repository: https://github.com/Chenfeng1271/AbilityLens/tree/main

  41. arXiv:2411.09728  [pdf, other]

    cs.LG math.NA stat.CO

    Physics-informed neural networks (PINNs) for numerical model error approximation and superresolution

    Authors: Bozhou Zhuang, Sashank Rana, Brandon Jones, Danny Smyl

    Abstract: Numerical modeling errors are unavoidable in finite element analysis. The presence of model errors inherently reflects both model accuracy and uncertainty. To date, there have been few methods for explicitly quantifying errors at points of interest (e.g., at finite element nodes). The lack of explicit model error approximators has been addressed recently with the emergence of machine learning (ML),…

    Submitted 14 November, 2024; originally announced November 2024.
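
    A minimal sketch of the setup this abstract describes, with an assumed data layout (plain lists of nodal values, not the paper's actual pipeline): the pointwise model error at shared nodes is the reference solution minus the coarse finite-element prediction, and that residual field is the supervised target an ML-based error approximator would learn.

```python
def nodal_model_error(coarse_solution, reference_solution):
    # Pointwise model error at each shared node: reference minus coarse
    # FE prediction. This residual field is the training target for an
    # ML-based error approximator (illustrative setup only).
    assert len(coarse_solution) == len(reference_solution)
    return [ref - c for c, ref in zip(coarse_solution, reference_solution)]

# Toy 1D nodal values: a coarse model that consistently under-predicts.
errors = nodal_model_error([1.0, 2.0, 3.0], [1.2, 2.1, 3.4])
```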

  42. arXiv:2411.06481  [pdf, other]

    cs.CV

    KMM: Key Frame Mask Mamba for Extended Motion Generation

    Authors: Zeyu Zhang, Hang Gao, Akide Liu, Qi Chen, Feng Chen, Yiran Wang, Danning Li, Rui Zhao, Zhenming Li, Zhongwen Zhou, Hao Tang, Bohan Zhuang

    Abstract: Human motion generation is a cutting-edge area of research in generative computer vision, with promising applications in video creation, game development, and robotic manipulation. The recent Mamba architecture shows promising results in efficiently modeling long and complex sequences, yet two significant challenges remain: Firstly, directly applying Mamba to extended motion generation is ineffective,…

    Submitted 16 April, 2025; v1 submitted 10 November, 2024; originally announced November 2024.

  43. arXiv:2411.04924  [pdf, other]

    cs.CV

    MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views

    Authors: Yuedong Chen, Chuanxia Zheng, Haofei Xu, Bohan Zhuang, Andrea Vedaldi, Tat-Jen Cham, Jianfei Cai

    Abstract: We introduce MVSplat360, a feed-forward approach for 360° novel view synthesis (NVS) of diverse real-world scenes, using only sparse observations. This setting is inherently ill-posed due to the minimal overlap among input views and the limited visual information they provide, making it challenging for conventional methods to achieve high-quality results. Our MVSplat360 addresses this by effectively comb…

    Submitted 7 November, 2024; originally announced November 2024.

    Comments: NeurIPS 2024, Project page: https://donydchen.github.io/mvsplat360, Code: https://github.com/donydchen/mvsplat360

  44. arXiv:2410.08584  [pdf, other]

    cs.CV cs.AI

    ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification

    Authors: Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhuang

    Abstract: The efficiency of large vision-language models (LVLMs) is constrained by the computational bottleneck of the attention mechanism during the prefill phase and the memory bottleneck of fetching the key-value (KV) cache in the decoding phase, particularly in scenarios involving high-resolution images or videos. Visual content often exhibits substantial redundancy, resulting in highly sparse attention…

    Submitted 18 December, 2024; v1 submitted 11 October, 2024; originally announced October 2024.

    Comments: 13 pages
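
    The sparsity observation above can be sketched as a coverage-based token selection (the function and threshold rule here are illustrative assumptions, not ZipVL's exact criterion): rank visual tokens by attention score and keep the smallest set that captures a target share of the total attention mass.

```python
def select_important_tokens(attn_scores, coverage=0.9):
    """Keep the smallest set of tokens whose share of total attention
    mass reaches `coverage`; the rest can be dropped or offloaded."""
    total = sum(attn_scores)
    ranked = sorted(range(len(attn_scores)),
                    key=lambda i: attn_scores[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += attn_scores[i] / total
        if mass >= coverage:
            break
    return sorted(kept)

# Tokens 0 and 1 carry 80% of the mass, exceeding the 75% target,
# so the other tokens' KV entries could be pruned.
kept = select_important_tokens([6.0, 2.0, 1.0, 1.0], coverage=0.75)
```

    Because attention over visual tokens is highly concentrated, a small kept set typically covers most of the mass, shrinking both prefill compute and the decoding-phase KV cache.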

  45. arXiv:2409.00585  [pdf, other]

    cs.CV

    McCaD: Multi-Contrast MRI Conditioned, Adaptive Adversarial Diffusion Model for High-Fidelity MRI Synthesis

    Authors: Sanuwani Dayarathna, Kh Tohidul Islam, Bohan Zhuang, Guang Yang, Jianfei Cai, Meng Law, Zhaolin Chen

    Abstract: Magnetic Resonance Imaging (MRI) is instrumental in clinical diagnosis, offering diverse contrasts that provide comprehensive diagnostic information. However, acquiring multiple MRI contrasts is often constrained by high costs, long scanning durations, and patient discomfort. Current synthesis methods, typically focused on single-image contrasts, fall short in capturing the collective nuances acro…

    Submitted 31 August, 2024; originally announced September 2024.

  46. arXiv:2408.03361  [pdf, other]

    eess.IV cs.CV

    GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI

    Authors: Pengcheng Chen, Jin Ye, Guoan Wang, Yanjun Li, Zhongying Deng, Wei Li, Tianbin Li, Haodong Duan, Ziyan Huang, Yanzhou Su, Benyou Wang, Shaoting Zhang, Bin Fu, Jianfei Cai, Bohan Zhuang, Eric J Seibel, Junjun He, Yu Qiao

    Abstract: Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals, and can be applied in various fields. In the medical field, LVLMs have a high potential to offer substantial assistance for diagnosis and treatment. Before that, it is crucial to develop benchmarks to evaluate LVLMs' effectiveness in various medical applications. Curren…

    Submitted 21 October, 2024; v1 submitted 6 August, 2024; originally announced August 2024.

    Comments: GitHub: https://github.com/uni-medical/GMAI-MMBench Hugging face: https://huggingface.co/datasets/OpenGVLab/GMAI-MMBench

  47. arXiv:2407.10061  [pdf, other]

    cs.CV

    InfiniMotion: Mamba Boosts Memory in Transformer for Arbitrary Long Motion Generation

    Authors: Zeyu Zhang, Akide Liu, Qi Chen, Feng Chen, Ian Reid, Richard Hartley, Bohan Zhuang, Hao Tang

    Abstract: Text-to-motion generation holds potential for film, gaming, and robotics, yet current methods often prioritize short motion generation, making it challenging to produce long motion sequences effectively: (1) Current methods struggle to handle long motion sequences as a single input due to prohibitively high computational cost; (2) Breaking down the generation of long motion sequences into shorter…

    Submitted 13 July, 2024; originally announced July 2024.

  48. arXiv:2407.04938  [pdf, other]

    cs.CV

    SAM-Med3D-MoE: Towards a Non-Forgetting Segment Anything Model via Mixture of Experts for 3D Medical Image Segmentation

    Authors: Guoan Wang, Jin Ye, Junlong Cheng, Tianbin Li, Zhaolin Chen, Jianfei Cai, Junjun He, Bohan Zhuang

    Abstract: Volumetric medical image segmentation is pivotal in enhancing disease diagnosis, treatment planning, and advancing medical research. While existing volumetric foundation models for medical image segmentation, such as SAM-Med3D and SegVol, have shown remarkable performance on general organs and tumors, their ability to segment certain categories in clinical downstream tasks remains limited. Supervi…

    Submitted 5 July, 2024; originally announced July 2024.

    Journal ref: MICCAI 2024

  49. arXiv:2406.09041  [pdf, other]

    cs.CL cs.AI cs.LG

    ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models

    Authors: Jing Liu, Ruihao Gong, Mingyang Zhang, Yefei He, Jianfei Cai, Bohan Zhuang

    Abstract: LLM development involves pre-training a foundation model on massive data, followed by fine-tuning on task-specific data to create specialized experts. Serving these experts can pose significant memory challenges, as loading all experts onto devices is impractical, and frequent switching between experts in response to user requests can incur substantial I/O costs. Previous approaches decompose the…

    Submitted 26 October, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: Tech report

  50. arXiv:2405.19726  [pdf, other]

    cs.CV

    Streaming Video Diffusion: Online Video Editing with Diffusion Models

    Authors: Feng Chen, Zhen Yang, Bohan Zhuang, Qi Wu

    Abstract: We present a novel task called online video editing, which is designed to edit streaming frames while maintaining temporal consistency. Unlike existing offline video editing, which assumes all frames are pre-established and accessible, online video editing is tailored to real-life applications such as live streaming and online chat, requiring (1) fast continual step inference, (2) long-term tem…

    Submitted 30 May, 2024; originally announced May 2024.