GEBench: Benchmarking Image Generation Models as GUI Environments

700 Samples · 5 Task Categories · GUI Environments
Haodong Li1,2, Jingwei Wu1, Quan Sun1,†, Guopeng Li1, Juanxi Tian7, Huanyu Zhang5, Yanlin Lai1,4, Ruichuan An3,
Hongbo Peng1, Yuhong Dai1, Chenxi Li6, Chunmei Qing2,∗, Jia Wang1, Ziyang Meng1, Zheng Ge1,∗, Xiangyu Zhang1, Daxin Jiang1
1StepFun     2South China University of Technology     3Peking University     4Tsinghua University
5Institute of Automation, Chinese Academy of Sciences     6The University of Chicago     7Nanyang Technological University
†Project Leader    ∗Corresponding Author

GEBench Leaderboard

Main evaluation results on GEBench across the Chinese and English subsets. The tables compare 8 commercial models and 4 open-source models across five core dimensions: single-step, multi-step, fiction-app, real-app, and grounding.

Chinese Subset

Model             Single-step  Multi-step  Fiction-app  Real-app  Grounding  GE-Score
Nano Banana Pro   84.50        68.65       65.75        64.35     64.83      69.62
Nano Banana       64.36        34.16       64.82        65.89     54.48      56.74
GPT-image-1.5     83.79        56.97       60.11        55.65     53.33      63.22
GPT-image-1.0     64.72        49.20       57.31        59.04     31.68      52.39
Seedream 4.5      63.64        53.11       56.48        53.44     52.90      55.91
Seedream 4.0      62.04        48.64       49.28        50.93     53.53      52.88
Wan 2.6           64.20        50.11       52.72        50.40     59.58      55.40
Flux-2-Pro        68.83        55.07       58.13        55.41     50.24      57.54
Bagel             34.84        13.45       27.36        33.52     35.10      28.85
UniWorld-V2       55.33        24.95       32.03        21.39     49.60      36.66
Qwen-Image-Edit   41.12        26.79       23.78        26.10     50.80      33.72
Longcat-Image     48.76        12.75       30.03        30.00     51.02      34.51

English Subset

Model             Single-step  Multi-step  Fiction-app  Real-app  Grounding  GE-Score
Nano Banana Pro   84.32        69.51       46.33        47.20     58.64      61.20
Nano Banana       64.80        50.75       48.88        47.12     49.04      52.12
GPT-image-1.5     80.80        58.87       63.68        58.93     49.23      63.16
GPT-image-1.0     60.92        64.33       58.94        56.16     37.84      55.64
Seedream 4.5      49.49        45.30       53.81        51.80     49.63      50.01
Seedream 4.0      53.28        37.57       47.92        49.36     44.17      46.46
Wan 2.6           60.17        44.36       49.55        44.80     53.36      50.45
Flux-2-Pro        61.00        52.17       49.92        47.16     45.67      51.18
Bagel             32.91        8.61        26.08        35.12     37.30      28.00
UniWorld-V2       42.68        14.14       30.08        26.83     47.04      32.15
Qwen-Image-Edit   40.12        18.61       25.80        25.95     54.55      33.01
Longcat-Image     36.69        8.44        37.30        36.83     47.12      33.28
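
The per-subset GE-Score is consistent with an unweighted mean of the five dimension scores; for example, the Nano Banana Pro Chinese row gives (84.50 + 68.65 + 65.75 + 64.35 + 64.83) / 5 = 69.62. A minimal sketch of this averaging, inferred from the published numbers rather than stated here as the official formula:

# Verify the per-subset GE-Score against the five dimension scores.
# NOTE: the unweighted averaging rule is inferred from the tables above;
# GEBench may define the aggregation differently.

def ge_score(single_step, multi_step, fiction_app, real_app, grounding):
    """Unweighted mean of the five GEBench dimension scores."""
    return (single_step + multi_step + fiction_app + real_app + grounding) / 5

# Nano Banana Pro, Chinese subset
print(f"{ge_score(84.50, 68.65, 65.75, 64.35, 64.83):.2f}")  # 69.62

# Nano Banana Pro, English subset
print(f"{ge_score(84.32, 69.51, 46.33, 47.20, 58.64):.2f}")  # 61.20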

Model Rankings

Overall model rankings based on average GE-Score across Chinese and English subsets.

Rank  Model            Avg GE-Score  Chinese  English
1     Nano Banana Pro  65.41         69.62    61.20
2     GPT-image-1.5    63.19         63.22    63.16
3     Nano Banana      54.43         56.74    52.12
4     Flux-2-Pro       54.36         57.54    51.18
5     GPT-image-1.0    54.02         52.39    55.64
6     Seedream 4.5     52.96         55.91    50.01
7     Wan 2.6          52.92         55.40    50.45
8     Seedream 4.0     49.67         52.88    46.46
9     UniWorld-V2      34.41         36.66    32.15
10    Longcat-Image    33.89         34.51    33.28
11    Qwen-Image-Edit  33.36         33.72    33.01
12    Bagel            28.43         28.85    28.00
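
The Avg GE-Score used for ranking is the simple mean of the Chinese and English GE-Scores, e.g. (69.62 + 61.20) / 2 = 65.41 for Nano Banana Pro. A short sketch reproducing the top of the ranking from the subset scores; only the top four models are listed, for brevity:

# Reproduce the ranking: Avg GE-Score is the mean of the two subset GE-Scores.
subset_scores = {
    "Nano Banana Pro": (69.62, 61.20),
    "GPT-image-1.5": (63.22, 63.16),
    "Nano Banana": (56.74, 52.12),
    "Flux-2-Pro": (57.54, 51.18),
}

ranking = sorted(
    ((model, (cn + en) / 2) for model, (cn, en) in subset_scores.items()),
    key=lambda item: item[1],
    reverse=True,
)
for rank, (model, avg) in enumerate(ranking, start=1):
    print(f"{rank}  {model}  {avg:.2f}")
# 1  Nano Banana Pro  65.41
# 2  GPT-image-1.5  63.19
# 3  Nano Banana  54.43
# 4  Flux-2-Pro  54.36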

Figures

Framework Comparison

Benchmark Positioning. GEBench compared with traditional text-to-image (T2I) benchmarks and video generation benchmarks.

Task Suites

Task suites. Examples of the five task types in GEBench.

Visual Cases

Visual Cases. Model performance comparison on key challenging scenarios, including text rendering, icon interpretation, and localization precision.

BibTeX


@article{li2026gebench,
  title={GEBench: Benchmarking Image Generation Models as GUI Environments},
  author={Haodong Li and Jingwei Wu and Quan Sun and Guopeng Li and Juanxi Tian and Huanyu Zhang and Yanlin Lai and Ruichuan An and Hongbo Peng and Yuhong Dai and Chenxi Li and Chunmei Qing and Jia Wang and Ziyang Meng and Zheng Ge and Xiangyu Zhang and Daxin Jiang},
  journal={arXiv preprint arXiv:2602.09007},
  year={2026}
}