CookAnything: A Flexible Diffusion Framework for Recipe Image Generation

Ruoxuan Zhang1, Bin Wen1, Hongxia Xie1, Yi Yao2, Songhan Zuo1, Jian-Yu Jiang-Lin3, Hong-Han Shuai2, Wen-Huang Cheng3

1Jilin University, 2National Yang Ming Chiao Tung University, 3National Taiwan University

RecipeGen Preview
Preview: Demonstration of our CookAnything model generating multi-step cooking instructions in a single pass.

Proposed Method

Overall structure of our CookAnything model, illustrated with a 3-step vegetable pancake recipe.

Model Framework

Figure: Overview of the CookAnything framework.

Visualization


Figure: Qualitative comparisons. SKD refers to StackedDiffusion, and SD3.5 refers to Stable Diffusion 3.5.

Experiments

Comparison with Other Models in RecipeGen.

Table 2: Comparison with other models in RecipeGen. SF denotes Step Faithfulness, IA denotes Ingredient Accuracy, and UB denotes Usability. (C) denotes CLIP score, (G) denotes GPT-Score, and (H) denotes human evaluation. The UB metric applies only to methods capable of Joint Generation. TF means Training-Free and TB means Training-Based.
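| Category | Method | Step Flexibility | Joint Generation | Goal Faithfulness (↑) | Cross-Step Consistency (↓) | SF (C) (↑) | SF (G) (↑) | IA (G) (↑) | IA (H) (↑) | UB (G) (↑) | UB (H) (↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| UNet-based | SD1.5 [32] | | | 26.84 | 5.42 | 28.40 | 5.30 | 6.14 | 3.41 | - | - |
| UNet-based | SD2.1 [34] | | | 26.88 | 7.54 | 28.51 | 6.06 | 6.81 | 3.45 | - | - |
| UNet-based | SDXL [26] | | | 27.46 | 2.98 | 29.37 | 6.79 | 7.51 | 3.71 | - | - |
| UNet-based | SKD [18] | | | 26.62 | 0.7 | 28.53 | 4.57 | 6.67 | 2.59 | 6.43 | 3.59 |
| DiT-based | SD3.5 [35] | | | 27.42 | 2.97 | 28.77 | 6.73 | 7.58 | 3.97 | - | - |
| DiT-based | Flux.1-dev [11] | | | 26.47 | 3.47 | 28.31 | 5.31 | 5.93 | 5.71 | - | - |
| DiT-based | IC-LoRA [9] | | | 26.07 | 9.03 | 26.58 | 4.03 | 5.50 | 3.91 | 5.34 | 4.45 |
| DiT-based | RPF [4] | | | 27.19 | 8.73 | 25.99 | 4.45 | 7.05 | 3.02 | 3.89 | 4.24 |
| Layout-aware | GLIGEN | | | 26.99 | 2.17 | 26.72 | 5.17 | 6.16 | 5.28 | - | - |
| Layout-aware | A + R [25] | | | 26.31 | 2.46 | 27.63 | 4.58 | 5.46 | 5.29 | - | - |
| DiT-based | Ours (TF) | | | 30.12 | 0.17 | 29.80 | 8.52 | 9.12 | 6.92 | 9.89 | 7.66 |
| DiT-based | Ours (TB) | | | 30.59 | 0.19 | 30.45 | 8.69 | 9.27 | 7.15 | 9.70 | 8.48 |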
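The arrows in the Table 2 header give the direction of each metric: (↑) means higher is better and (↓) means lower is better. As a reading aid, the minimal Python sketch below picks the best method for the first two numeric columns while respecting those directions. The numbers are copied from Table 2. Their assignment to Goal Faithfulness and Cross-Step Consistency assumes the first two values in each row belong to those two columns, and the names `ROWS`, `METRICS`, `best_method`, and `best_baseline` are hypothetical helpers, not part of any released code.

```python
# Values copied from Table 2. Assumption: the first two numbers in each row
# are Goal Faithfulness and Cross-Step Consistency, in that order.
ROWS = {
    "SD1.5": (26.84, 5.42),
    "SD2.1": (26.88, 7.54),
    "SDXL": (27.46, 2.98),
    "SKD": (26.62, 0.7),
    "SD3.5": (27.42, 2.97),
    "Flux.1-dev": (26.47, 3.47),
    "IC-LoRA": (26.07, 9.03),
    "RPF": (27.19, 8.73),
    "GLIGEN": (26.99, 2.17),
    "A + R": (26.31, 2.46),
    "Ours (TF)": (30.12, 0.17),
    "Ours (TB)": (30.59, 0.19),
}

# Column index and direction for each metric (True = higher is better).
METRICS = {
    "Goal Faithfulness": (0, True),
    "Cross-Step Consistency": (1, False),
}


def best_method(metric, rows=ROWS):
    """Return the method name with the best value for the given metric."""
    col, higher_is_better = METRICS[metric]
    pick = max if higher_is_better else min
    return pick(rows, key=lambda name: rows[name][col])


def best_baseline(metric):
    """Same as best_method, but ignoring the two 'Ours' rows."""
    baselines = {k: v for k, v in ROWS.items() if not k.startswith("Ours")}
    return best_method(metric, baselines)


if __name__ == "__main__":
    for metric in METRICS:
        print(metric, "| best overall:", best_method(metric),
              "| best baseline:", best_baseline(metric))
```

The same pattern extends to the SF, IA, and UB columns by adding their values and directions, skipping rows where a metric is reported as "-".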