Computer Science > Computer Vision and Pattern Recognition

arXiv:2512.16906 (cs)

[Submitted on 18 Dec 2025]

Title:VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization

Authors:Xiaoyan Cong, Haotian Yang, Angtian Wang, Yizhi Wang, Yiding Yang, Canyu Zhang, Chongyang Ma

Abstract:Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse and complex, real-world instructions. To address this generalization gap, we propose VIVA, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually-grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which adapts Group Relative Policy Optimization to the domain of video editing, directly optimizing the model for instruction-faithful, content-preserving, and aesthetically pleasing edits using relative rewards. Furthermore, we propose a data construction pipeline designed to synthetically generate diverse, high-fidelity paired video-instruction data of basic editing operations. Extensive experiments show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods. Website: this https URL

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2512.16906 [cs.CV]
	(or arXiv:2512.16906v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2512.16906

Submission history

From: Xiaoyan Cong [view email]
[v1] Thu, 18 Dec 2025 18:58:42 UTC (34,207 KB)

Computer Science > Computer Vision and Pattern Recognition

Title:VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computer Vision and Pattern Recognition

Title:VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators