Abstract
The rapid growth of AI-generated images in fields such as entertainment, e-commerce, and media has heightened the demand for robust evaluation methods that ensure high-quality, photorealistic outputs. However, current computational metrics often misalign with human perception, creating a gap in accurately assessing the quality of AI-generated visuals. This study introduces Visual Verity, a set of subjective human assessments, alongside objective computational metrics, to evaluate photorealism, image quality, and text-image alignment in AI-generated images. We designed a comprehensive questionnaire and benchmarked these assessments against human judgments. Experiments were conducted with state-of-the-art models, including DALL·E 2, DALL·E 3, GLIDE, and Stable Diffusion, comparing their outputs with camera-generated images. Our findings show that while AI models excel in image quality, camera-generated images surpass them in photorealism and text-image alignment. Further analysis benchmarks traditional metrics, such as SSIM and PSNR, against human judgments and highlights the Interpolative Binning Scale as a more interpretable approach to metric scores. The framework provides a structured pathway for advancing the evaluation of AI-generated images and informing future developments in AI-driven visual media.
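As a brief illustration of the traditional full-reference metrics mentioned above, the following minimal sketch computes SSIM and PSNR between an AI-generated image and a camera-generated reference using scikit-image; the file names, image size, and resizing step are assumptions for the example, not part of the study's pipeline.

```python
# Minimal sketch: compare an AI-generated image with a camera-generated
# reference using SSIM and PSNR (scikit-image). File paths are placeholders.
import numpy as np
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def load_rgb(path, size=(512, 512)):
    """Load an image as an RGB uint8 array, resized so both inputs match."""
    return np.asarray(Image.open(path).convert("RGB").resize(size))

reference = load_rgb("camera_image.png")  # camera-generated reference (assumed path)
generated = load_rgb("ai_image.png")      # AI-generated output (assumed path)

# SSIM over the colour channels; PSNR on 8-bit intensities.
ssim = structural_similarity(reference, generated, channel_axis=-1, data_range=255)
psnr = peak_signal_noise_ratio(reference, generated, data_range=255)
print(f"SSIM: {ssim:.4f}  PSNR: {psnr:.2f} dB")
```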







Data availability
Data and code supporting the research presented in this manuscript are available at: https://github.com/udanish50/VisualVerity.
References
Everypixel Journal, AI Image Statistics Report (2024). https://journal.everypixel.com/ai-image-statistics
Tech Report, AI Image Generator Market Statistics (2024). https://techreport.com/statistics/ai-image-generator-market-statistics/
Talebi, H., Milanfar, P.: Nima: neural image assessment. IEEE Trans. Image Process. 21, 3339–3352 (2018)
Caramiaux, B., Fdili Alaoui, S.: "Explorers of unknown planets": practices and politics of artificial intelligence in visual arts. In: Proceedings of the ACM on Human-Computer Interaction (2022)
Li, C., Li, K., Liu, J., Lin, X., Luo, J.: A diverse dataset for automated visual aesthetics assessment. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4977–4986 (2019)
Shi, Y., Gao, T., Jiao, X., Cao, N.: Understanding design collaboration between designers and artificial intelligence: a systematic literature review. In: Proceedings of the ACM on Human-Computer Interaction (2023)
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vision 115, 211–252 (2014)
Hinz, T., Heinrich, S., Wermter, S.: Semantic object accuracy for generative text-to-image synthesis. IEEE Trans. Pattern Anal. Mach. Intell. 44, 1552–1565 (2020)
Hore, A., Ziou, D.: Image quality metrics: Psnr vs. ssim. In: 2010 20th International Conference on Pattern Recognition, pp. 2366–2369. IEEE (2010)
Wang, K., Doneva, M., Meineke, J., Amthor, T., Karasan, E., Tan, F., Tamir, J.I., Yu, S.X., Lustig, M.: High-fidelity direct contrast synthesis from magnetic resonance fingerprinting. Magn. Reson. Med. 90(5), 2116–2129 (2023)
Talebi, H., Milanfar, P.: Learned perceptual image enhancement. In: 2018 IEEE International Conference on Computational Photography. IEEE (2018)
Lin, T., Maire, M., Belongie, S.J., Bourdev, L.D., Girshick, R.B., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. CoRR abs/1405.0312 (2014). arXiv:1405.0312
Jiang, Y., Yang, S., Qiu, H., Wu, W., Loy, C.C., Liu, Z.: Text2human: text-driven controllable human image generation. ACM Trans. Graph. (TOG) 44, 1–11 (2022)
Rokh, B., Azarpeyvand, A., Khanteymoori, A.: A comprehensive survey on model quantization for deep neural networks in image classification. ACM Trans. Intell. Syst. Technol. 22, 182–191 (2023)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30 (2017)
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training gans. In: Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS 2016) (2016)
Dayma, B., Patil, S., Cuenca, P., Saifullah, K., Abraham, T., Le Khac, P., Melas, L., Ghosh, R.: Dall·e mini, HuggingFace.com. https://huggingface.co/spaces/dallemini/dalle-mini. Accessed 29 Sept 2022 (2021)
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding (2018). arXiv:1810.04805
Ko, H., Lee, D.Y., Cho, S., Bovik, A.C.: Quality prediction on deep generative images. IEEE Trans. Image Process. 29, 5964–5979 (2020)
Lu, L.-F., Zimmerman, E.: A descriptive study of preservice art teacher responses to computer-generated and noncomputer-generated art images, Ph.D. thesis (2000)
Lu, Z., Huang, D., Bai, L., Qu, J., Wu, C., Liu, X., Ouyang, W.: Seeing is not always believing: Benchmarking human and model perception of AI-generated images. Advances in Neural Information Processing Systems 36 (2024)
Ragot, M., Martin, N., Cojean, S.: AI-generated vs. human artworks: a perception bias towards artificial intelligence? In: Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–10 (2020)
Zhou, Y., Kawabata, H.: Eyes can tell: Assessment of implicit attitudes toward AI art. i-Perception 14(5), 20416695231209850 (2023)
Treder, M.S., Codrai, R., Tsvetanov, K.A.: Quality assessment of anatomical mri images from generative adversarial networks: human assessment and image quality metrics. J. Neurosci. Methods 374, 109579 (2022)
Aziz, M., Rehman, U., Danish, M.U., Grolinger, K.: Global-local image perceptual score (glips): evaluating photorealistic quality of AI-generated images (2024). arXiv:2405.09426
Aziz, M., Rehman, U., Danish, M.U., Grolinger, K.: Global-local image perceptual score (glips): evaluating photorealistic quality of AI-generated images. IEEE Trans. Hum. Mach. Syst. 55, 223–233 (2025)
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., Sutskever, I.: Zero-shot text-to-image generation. In: International Conference on Machine Learning, pp. 8821–8831. PMLR (2021)
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents (2022). arXiv:2204.06125
Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y., et al.: Improving image generation with better captions. Comput. Sci. (2023). https://cdn.openai.com/papers/dall-e-3.pdf
Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M.: Glide: Towards photorealistic image generation and editing with text-guided diffusion models (2021). arXiv:2112.10741
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695 (2022)
Sheikh, H.R., Bovik, A.C., de Veciana, G.: Image information and visual quality. IEEE Trans. Image Process. 15, 430–444 (2006)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition (2015). arXiv:1512.03385
Bińkowski, M., Sutherland, D.J., Arbel, M., Gretton, A.: Demystifying mmd gans (2018). arXiv:1801.01401
Tang, Z., Wang, Z., Peng, B., Dong, J.: Clip-agiqa: boosting the performance of AI-generated image quality assessment with clip. In: International Conference on Pattern Recognition, pp. 48–61. Springer (2025)
Antonelli, G., Libanio, D., De Groof, A.J., van der Sommen, F., Mascagni, P., Sinonquel, P., Abdelrahim, M., Ahmad, O., Berzin, T., Bhandari, P., et al.: Quaide: quality assessment of AI preclinical studies in diagnostic endoscopy. Gut 74, 153–161 (2025)
Zhu, H., Terashi, G., Farheen, F., Nakamura, T., Kihara, D.: AI-based quality assessment methods for protein structure models from cryo-em. Curr. Res. Struct. Biol. 9, 100164 (2025)
Zhang, J., Zhao, D., Zhang, D., Lv, C., Song, M., Peng, Q., Wang, N., Xu, C.: No-reference image quality assessment based on information entropy vision transformer. Imaging Sci. J. (2025). https://doi.org/10.1080/13682199.2025.2456431
Li, C., Zhang, Z., Wu, H., Sun, W., Min, X., Liu, X., Zhai, G., Lin, W.: Agiqa-3k: an open database for AI-generated image quality assessment. IEEE Trans. Circuits Syst. Video Technol. 34(8), 6833–6846 (2023)
Qiao, S., Eglin, R.: Accurate behaviour and believability of computer generated images of human head. In: Proceedings of the 10th International Conference on Virtual Reality Continuum and Its Applications in Industry, pp. 545–548. ACM (2011)
Sarkar, K., Liu, L., Golyanik, V., Theobalt, C.: Humangan: a generative model of human images. In: 2021 International Conference on 3D Vision (3DV), pp. 258–267. IEEE (2021)
Ha, A.Y.J., Passananti, J., Bhaskar, R., Shan, S., Southen, R., Zheng, H., Zhao, B.Y.: Organic or diffused: can we distinguish human art from AI-generated images? (2024). arXiv:2402.03214
Rassin, R., Ravfogel, S., Goldberg, Y.: Dalle-2 is seeing double: flaws in word-to-concept mapping in text2image models (2022). arXiv:2210.10606
Hulzebosch, N., Ibrahimi, S., Worring, M.: Detecting cnn-generated facial images in real-world scenarios. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2020)
Degardin, B., Lopes, V., Proença, H.: Fake it till you recognize it: quality assessment for human action generative models. IEEE Trans. Biom. Behav. Identity Sci. 6, 261–271 (2024)
Sun, L., Wei, M., Sun, Y., Suh, Y.J., Shen, L., Yang, S.: Smiling women pitching down: auditing representational and presentational gender biases in image-generative AI. J. Comput. Mediat. Commun. 29(1), 1–15 (2024)
Xu, S., Hou, D., Pang, L., Deng, J., Xu, J., Shen, H., Cheng, X.: Ai-generated images introduce invisible relevance bias to text-image retrieval (2023). arXiv:2311.14084
Zhou, T., Tan, S., Zhou, W., Luo, Y., Wang, Y.-G., Yue, G.: Adaptive mixed-scale feature fusion network for blind AI-generated image quality assessment. IEEE Trans. Broadcast. 70(3), 833–843 (2024)
Yang, L., Duan, H., Teng, L., Zhu, Y., Liu, X., Hu, M., Min, X., Zhai, G., Le Callet, P.: Aigcoiqa2024: perceptual quality assessment of AI-generated omnidirectional images. In: 2024 IEEE International Conference on Image Processing (ICIP), pp. 1239–1245. IEEE (2024)
Wang, J., Duan, H., Zhai, G., Min, X.: Understanding and evaluating human preferences for AI generated images with instruction tuning (2024). arXiv:2405.07346
Rehman, A., Zeng, K., Wang, Z.: Display device-adapted video quality-of-experience assessment. In: Human vision and electronic imaging XX, SPIE (2015)
Luna, R., Zabaleta, I., Bertalmío, M.: State-of-the-art image and video quality assessment with a metric based on an intrinsically non-linear neural summation model. Front. Neurosci. 17, 1222815 (2023)
Varga, D.: Saliency-guided local full-reference image quality assessment. Signals 3(3), 483–496 (2022)
Huang, X., Wang, Y., Zhang, L.: Multi-task learning for perceptual image quality assessment. IEEE Trans. Multimed. (2022)
Olson, K.E., O’Brien, M.A., Rogers, W.A., Charness, N.: Diffusion of technology: frequency of use for younger and older adults. Ageing Int. 36(1), 123–145 (2011)
Shadiev, R., Wu, T.-T., Huang, Y.-M.: Using image-to-text recognition technology to facilitate vocabulary acquisition in authentic contexts. ReCALL (2020)
Abdul-Latif, S.-A., Abdul-Talib, A.-N.: Consumer racism: a scale modification. Asia Pac. J. Mark. Logist. 29(3), 616–633 (2017)
Hair, J., Alamer, A.: Partial least squares structural equation modeling (pls-sem) in second language and education research: Guidelines using an applied example. Res. Methods Appl. Linguist. 1(3), 100027 (2022)
Henseler, J., Ringle, C.M., Sarstedt, M.: A new criterion for assessing discriminant validity in variance-based structural equation modeling. J. Acad. Mark. Sci. 43(1), 115–135 (2015)
Lin, X., Ma, Y.-l., Ma, L., Zhang, R.-l.: A survey for image resizing. J. Zhejiang Univ. Sci. C 15(9), 697–716 (2014)
Prolific, Prolific - online participant recruitment for surveys and market research (2024). https://www.prolific.co
Qualtrics, Qualtrics - online survey platform (2024). https://uwo.eu.qualtrics.com/homepage/ui
World Population Review, World Population Review Statistics (2024). https://worldpopulationreview.com/state-rankings/transgender-population-by-state
World Population Review, World Population Review Statistics (2024). https://worldpopulationreview.com/country-rankings/phd-percentage-by-country
Human Centric Computing Group. Human Centric Computing Group (2024). https://thehccg.com/
Danish, M.U., Grolinger, K.: Kolmogorov–Arnold recurrent network for short term load forecasting across diverse consumers. Available at SSRN 4974855
Danish, M.U., Grolinger, K.: Leveraging hypernetworks and learnable kernels for consumer energy forecasting across diverse consumer types. IEEE Trans. Power Deliv. (2024). https://doi.org/10.1109/TPWRD.2024.3486010
Funding
Funding was provided by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants program (Grant No. RGPIN-2024-05191).
Author information
Authors and Affiliations
Contributions
M.A. and U.R. conceptualized the study and designed the research methodology. M.A. developed the Visual Verity questionnaire and coordinated the data collection process. M.U.D. played a central role in the comparative evaluations of AI models, prepared Figures 1–5 and Tables 1–18, and contributed extensively to drafting the results and discussion sections. S.A.S. conducted the literature review and provided insights for the comparative evaluation. A.Z.A. contributed to the development and statistical validation of the Visual Verity questionnaire. M.A. and M.U.D. wrote the initial draft of the manuscript. U.R. reviewed and revised the manuscript for intellectual content, ensuring methodological rigor and clarity. All authors reviewed the manuscript, provided critical feedback, and approved the final version for submission.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Communicated by Bing-kun Bao.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Aziz, M., Rehman, U., Danish, M.U. et al. Towards a unified evaluation framework: integrating human perception and metrics for AI-generated images. Multimedia Systems 31, 180 (2025). https://doi.org/10.1007/s00530-025-01769-7