Computer Science > Computer Vision and Pattern Recognition

arXiv:2512.03540 (cs)

[Submitted on 3 Dec 2025 (v1), last revised 5 Dec 2025 (this version, v2)]

Title: CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation

Authors: Ruoxuan Zhang, Bin Wen, Hongxia Xie, Yi Yao, Songhan Zuo, Jian-Yu Jiang-Lin, Hong-Han Shuai, Wen-Huang Cheng

Abstract: Cooking is a sequential and visually grounded activity, where each step, such as chopping, mixing, or frying, carries both procedural logic and visual semantics. While recent diffusion models have shown strong capabilities in text-to-image generation, they struggle to handle structured multi-step scenarios like recipe illustration. Additionally, current recipe illustration methods cannot adjust to the natural variability in recipe length: they generate a fixed number of images regardless of the actual structure of the instructions. To address these limitations, we present CookAnything, a flexible and consistent diffusion-based framework that generates coherent, semantically distinct image sequences from textual cooking instructions of arbitrary length. The framework introduces three key components: (1) Step-wise Regional Control (SRC), which aligns textual steps with corresponding image regions within a single denoising process; (2) Flexible RoPE, a step-aware positional encoding mechanism that enhances both temporal coherence and spatial diversity; and (3) Cross-Step Consistency Control (CSCC), which maintains fine-grained ingredient consistency across steps. Experimental results on recipe illustration benchmarks show that CookAnything performs better than existing methods in both training-based and training-free settings. The proposed framework supports scalable, high-quality visual synthesis of complex multi-step instructions and holds significant potential for broad applications in instructional media and procedural content creation.
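
The abstract does not spell out how SRC ties each textual step to its image region. As an illustration only, one common way to express this kind of step-to-region alignment is a block attention mask over a joint token sequence. The sketch below is not the authors' implementation: the function name `build_regional_mask`, the token layout (all step texts first, then one image region per step), and the `share_image_context` flag are assumptions made for this example.

```python
from typing import List


def build_regional_mask(
    text_lens: List[int],
    region_len: int,
    share_image_context: bool = True,
) -> List[List[bool]]:
    """Build a boolean attention mask over [step texts | image regions].

    mask[q][k] is True when query token q may attend to key token k.
    Assumed (hypothetical) layout: the text tokens of every step are
    concatenated first, followed by `region_len` image tokens per step.
    The number of steps is taken from len(text_lens), so it can vary.
    """
    n_steps = len(text_lens)
    total_text = sum(text_lens)
    total = total_text + n_steps * region_len

    # owner[i] is the index of the step that token i belongs to.
    owner: List[int] = []
    for step, length in enumerate(text_lens):
        owner.extend([step] * length)
    for step in range(n_steps):
        owner.extend([step] * region_len)

    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            same_step = owner[q] == owner[k]
            both_image = q >= total_text and k >= total_text
            # Text and image tokens interact only within their own step.
            # Assumption: image regions may also see each other, so the
            # sequence can share visual context across steps.
            mask[q][k] = same_step or (share_image_context and both_image)
    return mask


# Two steps with 2 and 3 text tokens, and 4 image tokens per region.
mask = build_regional_mask([2, 3], region_len=4)
print(len(mask))  # 13 tokens in total: 5 text + 8 image
```

Because the mask is derived from the list of step lengths, the same construction covers recipes with any number of steps, which is the flexibility the abstract emphasizes. How the actual framework handles positions and cross-step consistency (Flexible RoPE, CSCC) is not covered by this sketch.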

Comments:	Accepted by ACM Multimedia 2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2512.03540 [cs.CV]
	(or arXiv:2512.03540v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2512.03540
Related DOI:	https://doi.org/10.1145/3746027.3755174

Submission history

From: Ruoxuan Zhang
[v1] Wed, 3 Dec 2025 08:01:48 UTC (35,972 KB)
[v2] Fri, 5 Dec 2025 10:31:39 UTC (35,972 KB)