
Towards a unified evaluation framework: integrating human perception and metrics for AI-generated images

  • Regular Paper
  • Published:
Multimedia Systems

Abstract

The rapid growth of AI-generated images in fields such as entertainment, e-commerce, and media has heightened the demand for robust evaluation methods to ensure high-quality, photorealistic outputs. However, current computational metrics often lack alignment with human perception, creating a gap in accurately assessing the quality of AI-generated visuals. This study introduces subjective human assessments, named Visual Verity, alongside objective computational metrics to evaluate photorealism, image quality, and text-image alignment in AI-generated images. We designed a comprehensive questionnaire and benchmarked these assessments against human judgments. The experiments are conducted using state-of-the-art models, including DALL·E 2, DALL·E 3, GLIDE, and Stable Diffusion, comparing their outputs with camera-generated images. Our findings show that while AI models excel in image quality, camera-generated images surpass them in photorealism and text-image alignment. Further analysis benchmarks traditional metrics, such as SSIM and PSNR, against human judgments and highlights the Interpolative Binning Scale as a more interpretable approach to metric scores. The framework provides a structured pathway for advancing the evaluation of AI-generated images and informing future developments in AI-driven visual media.
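
To make the traditional metrics mentioned above concrete, the sketch below shows how PSNR and SSIM can be computed between a camera-captured reference image and an AI-generated image. This is a minimal illustration, not the paper's Visual Verity protocol or its Interpolative Binning Scale; the file names are placeholders and scikit-image is an assumed dependency.

# Minimal sketch: full-reference PSNR and SSIM between a camera-captured
# reference and an AI-generated candidate (file names are placeholders).
from skimage.io import imread
from skimage.metrics import peak_signal_noise_ratio, structural_similarity
from skimage.transform import resize

reference = imread("camera_image.png")        # camera-generated reference
generated = imread("ai_generated_image.png")  # AI-generated candidate

# Full-reference metrics require identical shapes, so resize the candidate
# to the reference resolution, preserving the original intensity range.
generated = resize(generated, reference.shape, preserve_range=True).astype(reference.dtype)

# data_range=255 assumes 8-bit images; channel_axis=-1 assumes color (H, W, C) arrays.
psnr = peak_signal_noise_ratio(reference, generated, data_range=255)
ssim = structural_similarity(reference, generated, channel_axis=-1, data_range=255)

print(f"PSNR: {psnr:.2f} dB")
print(f"SSIM: {ssim:.4f}")

Higher values indicate closer agreement with the reference; the abstract's point is that such scores do not always track human judgments of photorealism.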



Data availability

Data and code supporting the research presented in this manuscript are available at: https://github.com/udanish50/VisualVerity.


Funding

Funding was provided by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants program (Grant No. RGPIN-2024-05191).

Author information


Contributions

M.A. and U.R. conceptualized the study and designed the research methodology. M.A. developed the Visual Verity questionnaire and coordinated the data collection process. M.U.D. played a central role in the comparative evaluations of AI models, prepared Figures 1–5 and Tables 1–18, and contributed extensively to drafting the results and discussion sections. S.A.S. conducted the literature review and provided insights for the comparative evaluation. A.Z.A. contributed to the development and statistical validation of the Visual Verity questionnaire. M.A. and M.U.D. wrote the initial draft of the manuscript. U.R. reviewed and revised the manuscript for intellectual content, ensuring methodological rigor and clarity. All authors reviewed the manuscript, provided critical feedback, and approved the final version for submission.

Corresponding author

Correspondence to Umair Rehman.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Communicated by Bing-kun Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Aziz, M., Rehman, U., Danish, M.U. et al. Towards a unified evaluation framework: integrating human perception and metrics for AI-generated images. Multimedia Systems 31, 180 (2025). https://doi.org/10.1007/s00530-025-01769-7


  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s00530-025-01769-7

