Main evaluation results on GEBench across the Chinese (ZH) and English (EN) subsets. The table compares 8 commercial and 4 open-source models across five core dimensions.
| Model | Single-step (ZH) | Multi-step (ZH) | Fiction-app (ZH) | Real-app (ZH) | Grounding (ZH) | GE-Score (ZH) | Single-step (EN) | Multi-step (EN) | Fiction-app (EN) | Real-app (EN) | Grounding (EN) | GE-Score (EN) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Nano Banana Pro | 84.50 | 68.65 | 65.75 | 64.35 | 64.83 | 69.62 | 84.32 | 69.51 | 46.33 | 47.20 | 58.64 | 61.20 |
| Nano Banana | 64.36 | 34.16 | 64.82 | 65.89 | 54.48 | 56.74 | 64.80 | 50.75 | 48.88 | 47.12 | 49.04 | 52.12 |
| GPT-image-1.5 | 83.79 | 56.97 | 60.11 | 55.65 | 53.33 | 63.22 | 80.80 | 58.87 | 63.68 | 58.93 | 49.23 | 63.16 |
| GPT-image-1.0 | 64.72 | 49.20 | 57.31 | 59.04 | 31.68 | 52.39 | 60.92 | 64.33 | 58.94 | 56.16 | 37.84 | 55.64 |
| Seedream 4.5 | 63.64 | 53.11 | 56.48 | 53.44 | 52.90 | 55.91 | 49.49 | 45.30 | 53.81 | 51.80 | 49.63 | 50.01 |
| Seedream 4.0 | 62.04 | 48.64 | 49.28 | 50.93 | 53.53 | 52.88 | 53.28 | 37.57 | 47.92 | 49.36 | 44.17 | 46.46 |
| Wan 2.6 | 64.20 | 50.11 | 52.72 | 50.40 | 59.58 | 55.40 | 60.17 | 44.36 | 49.55 | 44.80 | 53.36 | 50.45 |
| Flux-2-Pro | 68.83 | 55.07 | 58.13 | 55.41 | 50.24 | 57.54 | 61.00 | 52.17 | 49.92 | 47.16 | 45.67 | 51.18 |
| Bagel | 34.84 | 13.45 | 27.36 | 33.52 | 35.10 | 28.85 | 32.91 | 8.61 | 26.08 | 35.12 | 37.30 | 28.00 |
| UniWorld-V2 | 55.33 | 24.95 | 32.03 | 21.39 | 49.60 | 36.66 | 42.68 | 14.14 | 30.08 | 26.83 | 47.04 | 32.15 |
| Qwen-Image-Edit | 41.12 | 26.79 | 23.78 | 26.10 | 50.80 | 33.72 | 40.12 | 18.61 | 25.80 | 25.95 | 54.55 | 33.01 |
| Longcat-Image | 48.76 | 12.75 | 30.03 | 30.00 | 51.02 | 34.51 | 36.69 | 8.44 | 37.30 | 36.83 | 47.12 | 33.28 |
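The per-subset GE-Score appears to be the unweighted mean of the five dimension scores (e.g. Nano Banana Pro's Chinese-subset dimensions average to 69.62). A minimal sketch under that equal-weighting assumption, which is inferred from the table values rather than stated here:

```python
# Sketch: GE-Score as the unweighted mean of the five core-dimension scores.
# Equal weighting is an assumption inferred from the table, not documented.

def ge_score(single_step, multi_step, fiction_app, real_app, grounding):
    """Average the five dimension scores into one GE-Score, rounded to 2 dp."""
    dims = [single_step, multi_step, fiction_app, real_app, grounding]
    return round(sum(dims) / len(dims), 2)

# Nano Banana Pro, Chinese subset (values from the table above):
print(ge_score(84.50, 68.65, 65.75, 64.35, 64.83))  # -> 69.62
```

The same formula reproduces the other rows, e.g. Nano Banana's Chinese-subset score of 56.74.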
Overall model rankings based on average GE-Score across Chinese and English subsets.
| Rank | Model | Avg GE-Score | Chinese | English |
|---|---|---|---|---|
| 1 | Nano Banana Pro | 65.41 | 69.62 | 61.20 |
| 2 | GPT-image-1.5 | 63.19 | 63.22 | 63.16 |
| 3 | Nano Banana | 54.43 | 56.74 | 52.12 |
| 4 | Flux-2-Pro | 54.36 | 57.54 | 51.18 |
| 5 | GPT-image-1.0 | 54.02 | 52.39 | 55.64 |
| 6 | Seedream 4.5 | 52.96 | 55.91 | 50.01 |
| 7 | Wan 2.6 | 52.92 | 55.40 | 50.45 |
| 8 | Seedream 4.0 | 49.67 | 52.88 | 46.46 |
| 9 | UniWorld-V2 | 34.41 | 36.66 | 32.15 |
| 10 | Longcat-Image | 33.89 | 34.51 | 33.28 |
| 11 | Qwen-Image-Edit | 33.36 | 33.72 | 33.01 |
| 12 | Bagel | 28.43 | 28.85 | 28.00 |
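The Avg GE-Score column is the arithmetic mean of the Chinese and English subset scores. A minimal sketch reproducing the ranking for an illustrative subset of models (values taken from the tables above):

```python
# Sketch: rank models by the mean of their Chinese and English GE-Scores.
# Only the top four models are included here for brevity.
scores = {
    "Nano Banana Pro": (69.62, 61.20),
    "GPT-image-1.5": (63.22, 63.16),
    "Nano Banana": (56.74, 52.12),
    "Flux-2-Pro": (57.54, 51.18),
}

# Average the two subset scores, then sort descending by the average.
ranking = sorted(
    ((name, round((zh + en) / 2, 2)) for name, (zh, en) in scores.items()),
    key=lambda item: item[1],
    reverse=True,
)
for rank, (name, avg) in enumerate(ranking, start=1):
    print(rank, name, avg)  # e.g. 1 Nano Banana Pro 65.41
```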
Benchmark Positioning. GEBench compared with traditional text-to-image (T2I) benchmarks and video-generation benchmarks.
Task Suites. Examples of the five task types in GEBench.
Visual Cases. Model performance comparison on key challenging scenarios, including text rendering, icon interpretation, and localization precision.
```bibtex
@article{li2026gebench,
  title={GEBench: Benchmarking Image Generation Models as GUI Environments},
  author={Haodong Li and Jingwei Wu and Quan Sun and Guopeng Li and Juanxi Tian and Huanyu Zhang and Yanlin Lai and Ruichuan An and Hongbo Peng and Yuhong Dai and Chenxi Li and Chunmei Qing and Jia Wang and Ziyang Meng and Zheng Ge and Xiangyu Zhang and Daxin Jiang},
  journal={arXiv preprint arXiv:2602.09007},
  year={2026}
}
```