GEBench: Benchmarking Image Generation Models as GUI Environments

700 Samples · 5 Task Categories · GUI Environments
Haodong Li1,2, Jingwei Wu1, Quan Sun1,†, Guopeng Li1, Juanxi Tian7, Huanyu Zhang5, Yanlin Lai1,4, Ruichuan An3,
Hongbo Peng1, Yuhong Dai1, Chenxi Li6, Chunmei Qing2,∗, Jia Wang1, Ziyang Meng1, Zheng Ge1,∗, Xiangyu Zhang1, Daxin Jiang1
1StepFun     2South China University of Technology     3Peking University     4Tsinghua University
5Institute of Automation, Chinese Academy of Sciences     6The University of Chicago     7Nanyang Technological University
†Project Leader    ∗Corresponding Author

GEBench Leaderboard

Main evaluation results on GEBench across the Chinese and English subsets. The tables compare 8 commercial models and 4 open-source models across five core dimensions: single-step, multi-step, fiction-app, real-app, and grounding.

Chinese Subset

Model             Single-step  Multi-step  Fiction-app  Real-app  Grounding  GE-Score
Nano Banana Pro   84.50        68.65       65.75        64.35     64.83      69.62
Nano Banana       64.36        34.16       64.82        65.89     54.48      56.74
GPT-image-1.5     83.79        56.97       60.11        55.65     53.33      63.22
GPT-image-1.0     64.72        49.20       57.31        59.04     31.68      52.39
Seedream 4.5      63.64        53.11       56.48        53.44     52.90      55.91
Seedream 4.0      62.04        48.64       49.28        50.93     53.53      52.88
Wan 2.6           64.20        50.11       52.72        50.40     59.58      55.40
Flux-2-Pro        68.83        55.07       58.13        55.41     50.24      57.54
Bagel             34.84        13.45       27.36        33.52     35.10      28.85
UniWorld-V2       55.33        24.95       32.03        21.39     49.60      36.66
Qwen-Image-Edit   41.12        26.79       23.78        26.10     50.80      33.72
Longcat-Image     48.76        12.75       30.03        30.00     51.02      34.51

English Subset

Model             Single-step  Multi-step  Fiction-app  Real-app  Grounding  GE-Score
Nano Banana Pro   84.32        69.51       46.33        47.20     58.64      61.20
Nano Banana       64.80        50.75       48.88        47.12     49.04      52.12
GPT-image-1.5     80.80        58.87       63.68        58.93     49.23      63.16
GPT-image-1.0     60.92        64.33       58.94        56.16     37.84      55.64
Seedream 4.5      49.49        45.30       53.81        51.80     49.63      50.01
Seedream 4.0      53.28        37.57       47.92        49.36     44.17      46.46
Wan 2.6           60.17        44.36       49.55        44.80     53.36      50.45
Flux-2-Pro        61.00        52.17       49.92        47.16     45.67      51.18
Bagel             32.91        8.61        26.08        35.12     37.30      28.00
UniWorld-V2       42.68        14.14       30.08        26.83     47.04      32.15
Qwen-Image-Edit   40.12        18.61       25.80        25.95     54.55      33.01
Longcat-Image     36.69        8.44        37.30        36.83     47.12      33.28
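
The per-subset GE-Score is consistent with an unweighted mean of the five dimension scores; for example, the Nano Banana Pro Chinese row gives (84.50 + 68.65 + 65.75 + 64.35 + 64.83) / 5 = 69.62. A minimal sketch of this averaging, inferred from the published numbers rather than stated here as the official formula:

# Verify the per-subset GE-Score against the five dimension scores.
# NOTE: the unweighted averaging rule is inferred from the tables above;
# GEBench may define the aggregation differently.

def ge_score(single_step, multi_step, fiction_app, real_app, grounding):
    """Unweighted mean of the five GEBench dimension scores."""
    return (single_step + multi_step + fiction_app + real_app + grounding) / 5

# Nano Banana Pro, Chinese subset
print(f"{ge_score(84.50, 68.65, 65.75, 64.35, 64.83):.2f}")  # 69.62

# Nano Banana Pro, English subset
print(f"{ge_score(84.32, 69.51, 46.33, 47.20, 58.64):.2f}")  # 61.20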

Model Rankings

Overall model rankings based on average GE-Score across Chinese and English subsets.

Rank  Model            Avg GE-Score  Chinese  English
1     Nano Banana Pro  65.41         69.62    61.20
2     GPT-image-1.5    63.19         63.22    63.16
3     Nano Banana      54.43         56.74    52.12
4     Flux-2-Pro       54.36         57.54    51.18
5     GPT-image-1.0    54.02         52.39    55.64
6     Seedream 4.5     52.96         55.91    50.01
7     Wan 2.6          52.92         55.40    50.45
8     Seedream 4.0     49.67         52.88    46.46
9     UniWorld-V2      34.41         36.66    32.15
10    Longcat-Image    33.89         34.51    33.28
11    Qwen-Image-Edit  33.36         33.72    33.01
12    Bagel            28.43         28.85    28.00
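
The Avg GE-Score used for ranking is the simple mean of the Chinese and English GE-Scores, e.g. (69.62 + 61.20) / 2 = 65.41 for Nano Banana Pro. A short sketch reproducing the top of the ranking from the subset scores; only the top four models are listed, for brevity:

# Reproduce the ranking: Avg GE-Score is the mean of the two subset GE-Scores.
subset_scores = {
    "Nano Banana Pro": (69.62, 61.20),
    "GPT-image-1.5": (63.22, 63.16),
    "Nano Banana": (56.74, 52.12),
    "Flux-2-Pro": (57.54, 51.18),
}

ranking = sorted(
    ((model, (cn + en) / 2) for model, (cn, en) in subset_scores.items()),
    key=lambda item: item[1],
    reverse=True,
)
for rank, (model, avg) in enumerate(ranking, start=1):
    print(f"{rank}  {model}  {avg:.2f}")
# 1  Nano Banana Pro  65.41
# 2  GPT-image-1.5  63.19
# 3  Nano Banana  54.43
# 4  Flux-2-Pro  54.36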

Figures

Framework Comparison

Benchmark Positioning. GEBench compared with traditional text-to-image (T2I) benchmarks and video generation benchmarks.

Task Suites

Task suites. Examples of the five task types in GEBench.

Visual Cases

Visual Cases. Model performance comparison on key challenging scenarios, including text rendering, icon interpretation, and localization precision.

BibTeX


@article{li2026gebench,
  title={GEBench: Benchmarking Image Generation Models as GUI Environments},
  author={Haodong Li and Jingwei Wu and Quan Sun and Guopeng Li and Juanxi Tian and Huanyu Zhang and Yanlin Lai and Ruichuan An and Hongbo Peng and Yuhong Dai and Chenxi Li and Chunmei Qing and Jia Wang and Ziyang Meng and Zheng Ge and Xiangyu Zhang and Daxin Jiang},
  journal={arXiv preprint arXiv:2602.09007},
  year={2026}
}