Abstract
The rapid growth of AI-generated images in fields such as entertainment, e-commerce, and media has heightened the demand for robust evaluation methods that ensure high-quality, photorealistic outputs. However, current computational metrics often misalign with human perception, creating a gap in accurately assessing the quality of AI-generated visuals. This study introduces Visual Verity, a set of subjective human assessments, alongside objective computational metrics, to evaluate photorealism, image quality, and text-image alignment in AI-generated images. We designed a comprehensive questionnaire and benchmarked these assessments against human judgments. Experiments were conducted with state-of-the-art models, including DALL·E 2, DALL·E 3, GLIDE, and Stable Diffusion, comparing their outputs with camera-generated images. Our findings show that while AI models excel in image quality, camera-generated images surpass them in photorealism and text-image alignment. Further analysis benchmarks traditional metrics, such as SSIM and PSNR, against human judgments and highlights the Interpolative Binning Scale as a more interpretable approach to metric scores. The framework provides a structured pathway for advancing the evaluation of AI-generated images and informing future developments in AI-driven visual media.
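As a brief illustration of the traditional full-reference metrics mentioned above, the following minimal sketch computes SSIM and PSNR between an AI-generated image and a camera-generated reference using scikit-image; the file names, image size, and resizing step are assumptions for the example, not part of the study's pipeline.

```python
# Minimal sketch: compare an AI-generated image with a camera-generated
# reference using SSIM and PSNR (scikit-image). File paths are placeholders.
import numpy as np
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def load_rgb(path, size=(512, 512)):
    """Load an image as an RGB uint8 array, resized so both inputs match."""
    return np.asarray(Image.open(path).convert("RGB").resize(size))

reference = load_rgb("camera_image.png")  # camera-generated reference (assumed path)
generated = load_rgb("ai_image.png")      # AI-generated output (assumed path)

# SSIM over the colour channels; PSNR on 8-bit intensities.
ssim = structural_similarity(reference, generated, channel_axis=-1, data_range=255)
psnr = peak_signal_noise_ratio(reference, generated, data_range=255)
print(f"SSIM: {ssim:.4f}  PSNR: {psnr:.2f} dB")
```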







Data availability
Data and code supporting the research presented in this manuscript are available at: https://github.com/udanish50/VisualVerity.
References
Everypixel Journal, AI Image Statistics Report (2024). https://journal.everypixel.com/ai-image-statistics
Tech Report, AI Image Generator Market Statistics (2024). https://techreport.com/statistics/ai-image-generator-market-statistics/
Talebi, H., Milanfar, P.: Nima: neural image assessment. IEEE Trans. Image Process. 21, 3339–3352 (2018)
Caramiaux, B., Fdili Alaoui, S.: "Explorers of unknown planets": practices and politics of artificial intelligence in visual arts. In: Proceedings of the ACM on Human-Computer Interaction (2022)
Li, C., Li, K., Liu, J., Lin, X., Luo, J.: A diverse dataset for automated visual aesthetics assessment. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4977–4986 (2019)
Shi, Y., Gao, T., Jiao, X., Cao, N.: Understanding design collaboration between designers and artificial intelligence: a systematic literature review. In: Proceedings of the ACM on Human-Computer Interaction (2023)
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vision 115, 211–252 (2014)
Hinz, T., Heinrich, S., Wermter, S.: Semantic object accuracy for generative text-to-image synthesis. IEEE Trans. Pattern Anal. Mach. Intell. 44, 1552–1565 (2020)
Hore, A., Ziou, D.: Image quality metrics: Psnr vs. ssim. In: 2010 20th International Conference on Pattern Recognition, pp. 2366–2369. IEEE (2010)
Wang, K., Doneva, M., Meineke, J., Amthor, T., Karasan, E., Tan, F., Tamir, J.I., Yu, S.X., Lustig, M.: High-fidelity direct contrast synthesis from magnetic resonance fingerprinting. Magn. Reson. Med. 90(5), 2116–2129 (2023)
Talebi, H., Milanfar, P.: Learned perceptual image enhancement. In: 2018 IEEE International Conference on Computational Photography. IEEE (2018)
Lin, T., Maire, M., Belongie, S.J., Bourdev, L.D., Girshick, R.B., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. CoRR abs/1405.0312 (2014). arXiv:1405.0312
Jiang, Y., Yang, S., Qiu, H., Wu, W., Loy, C.C., Liu, Z.: Text2human: text-driven controllable human image generation. ACM Trans. Graph. (TOG) 44, 1–11 (2022)
Rokh, B., Azarpeyvand, A., Khanteymoori, A.: A comprehensive survey on model quantization for deep neural networks in image classification. ACM Trans. Intell. Syst. Technol. 22, 182–191 (2023)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30 (2017)
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training gans. In: Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS 2016) (2016)
Dayma, B., Patil, S., Cuenca, P., Saifullah, K., Abraham, T., Le Khac, P., Melas, L., Ghosh, R.: Dall·e mini, HuggingFace.com. https://huggingface.co/spaces/dallemini/dalle-mini. Accessed 29 Sept 2022 (2021)
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding (2018). arXiv:1810.04805
Ko, H., Lee, D.Y., Cho, S., Bovik, A.C.: Quality prediction on deep generative images. IEEE Trans. Image Process. 29, 5964–5979 (2020)
Lu, L.-F., Zimmerman, E.: A descriptive study of preservice art teacher responses to computer-generated and noncomputer-generated art images, Ph.D. thesis (2000)
Lu, Z., Huang, D., Bai, L., Qu, J., Wu, C., Liu, X., Ouyang, W.: Seeing is not always believing: Benchmarking human and model perception of AI-generated images. Advances in Neural Information Processing Systems 36 (2024)
Ragot, M., Martin, N., Cojean, S.: AI-generated vs. human artworks: a perception bias towards artificial intelligence? In: Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–10 (2020)
Zhou, Y., Kawabata, H.: Eyes can tell: Assessment of implicit attitudes toward AI art. i-Perception 14(5), 20416695231209850 (2023)
Treder, M.S., Codrai, R., Tsvetanov, K.A.: Quality assessment of anatomical mri images from generative adversarial networks: human assessment and image quality metrics. J. Neurosci. Methods 374, 109579 (2022)
Aziz, M., Rehman, U., Danish, M.U., Grolinger, K.: Global-local image perceptual score (glips): evaluating photorealistic quality of AI-generated images (2024). arXiv:2405.09426
Aziz, M., Rehman, U., Danish, M.U., Grolinger, K.: Global-local image perceptual score (glips): evaluating photorealistic quality of AI-generated images. IEEE Trans. Hum. Mach. Syst. 55, 223–233 (2025)
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., Sutskever, I.: Zero-shot text-to-image generation. In: International Conference on Machine Learning, pp. 8821–8831. PMLR (2021)
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents (2022). arXiv:2204.06125
Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y., et al.: Improving image generation with better captions. Comput. Sci. (2023). https://cdn.openai.com/papers/dall-e-3.pdf
Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M.: Glide: Towards photorealistic image generation and editing with text-guided diffusion models (2021). arXiv:2112.10741
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695 (2022)
Sheikh, H.R., Bovik, A.C., de Veciana, G.: Image information and visual quality. IEEE Trans. Image Process. 15, 430–444 (2006)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition (2015). arXiv:1512.03385
Bińkowski, M., Sutherland, D.J., Arbel, M., Gretton, A.: Demystifying mmd gans (2018). arXiv:1801.01401
Tang, Z., Wang, Z., Peng, B., Dong, J.: Clip-agiqa: boosting the performance of AI-generated image quality assessment with clip. In: International Conference on Pattern Recognition, pp. 48–61. Springer (2025)
Antonelli, G., Libanio, D., De Groof, A.J., van der Sommen, F., Mascagni, P., Sinonquel, P., Abdelrahim, M., Ahmad, O., Berzin, T., Bhandari, P., et al.: Quaide: quality assessment of AI preclinical studies in diagnostic endoscopy. Gut 74, 153–161 (2025)
Zhu, H., Terashi, G., Farheen, F., Nakamura, T., Kihara, D.: AI-based quality assessment methods for protein structure models from cryo-em. Curr. Res. Struct. Biol. 9, 100164 (2025)
Zhang, J., Zhao, D., Zhang, D., Lv, C., Song, M., Peng, Q., Wang, N., Xu, C.: No-reference image quality assessment based on information entropy vision transformer. Imaging Sci. J. (2025). https://doi.org/10.1080/13682199.2025.2456431
Li, C., Zhang, Z., Wu, H., Sun, W., Min, X., Liu, X., Zhai, G., Lin, W.: Agiqa-3k: an open database for AI-generated image quality assessment. IEEE Trans. Circuits Syst. Video Technol. 34(8), 6833–6846 (2023)
Qiao, S., Eglin, R.: Accurate behaviour and believability of computer generated images of human head. In: Proceedings of the 10th International Conference on Virtual Reality Continuum and Its Applications in Industry, pp. 545–548. ACM (2011)
Sarkar, K., Liu, L., Golyanik, V., Theobalt, C.: Humangan: a generative model of human images. In: 2021 International Conference on 3D Vision (3DV), pp. 258–267. IEEE (2021)
Ha, A.Y.J., Passananti, J., Bhaskar, R., Shan, S., Southen, R., Zheng, H., Zhao, B.Y.: Organic or diffused: can we distinguish human art from AI-generated images? (2024). arXiv:2402.03214
Rassin, R., Ravfogel, S., Goldberg, Y.: Dalle-2 is seeing double: flaws in word-to-concept mapping in text2image models (2022). arXiv:2210.10606
Hulzebosch, N., Ibrahimi, S., Worring, M.: Detecting cnn-generated facial images in real-world scenarios. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2020)
Degardin, B., Lopes, V., Proença, H.: Fake it till you recognize it: quality assessment for human action generative models. IEEE Trans. Biom. Behav. Identity Sci. 6, 261–271 (2024)
Sun, L., Wei, M., Sun, Y., Suh, Y.J., Shen, L., Yang, S.: Smiling women pitching down: auditing representational and presentational gender biases in image-generative AI. J. Comput. Mediat. Commun. 29(1), 1–15 (2024)
Xu, S., Hou, D., Pang, L., Deng, J., Xu, J., Shen, H., Cheng, X.: Ai-generated images introduce invisible relevance bias to text-image retrieval (2023). arXiv:2311.14084
Zhou, T., Tan, S., Zhou, W., Luo, Y., Wang, Y.-G., Yue, G.: Adaptive mixed-scale feature fusion network for blind AI-generated image quality assessment. IEEE Trans. Broadcast. 70(3), 833–843 (2024)
Yang, L., Duan, H., Teng, L., Zhu, Y., Liu, X., Hu, M., Min, X., Zhai, G., Le Callet, P.: Aigcoiqa2024: perceptual quality assessment of AI-generated omnidirectional images. In: 2024 IEEE International Conference on Image Processing (ICIP), pp. 1239–1245. IEEE (2024)
Wang, J., Duan, H., Zhai, G., Min, X.: Understanding and evaluating human preferences for AI generated images with instruction tuning (2024). arXiv:2405.07346
Rehman, A., Zeng, K., Wang, Z.: Display device-adapted video quality-of-experience assessment. In: Human vision and electronic imaging XX, SPIE (2015)
Luna, R., Zabaleta, I., Bertalmío, M.: State-of-the-art image and video quality assessment with a metric based on an intrinsically non-linear neural summation model. Front. Neurosci. 17, 1222815 (2023)
Varga, D.: Saliency-guided local full-reference image quality assessment. Signals 3(3), 483–496 (2022)
Huang, X., Wang, Y., Zhang, L.: Multi-task learning for perceptual image quality assessment. IEEE Trans. Multimed. (2022)
Olson, K.E., O’Brien, M.A., Rogers, W.A., Charness, N.: Diffusion of technology: frequency of use for younger and older adults. Ageing Int. 36(1), 123–145 (2011)
Shadiev, R., Wu, T.-T., Huang, Y.-M.: Using image-to-text recognition technology to facilitate vocabulary acquisition in authentic contexts. ReCALL (2020)
Abdul-Latif, S.-A., Abdul-Talib, A.-N.: Consumer racism: a scale modification. Asia Pac. J. Mark. Logist. 29(3), 616–633 (2017)
Hair, J., Alamer, A.: Partial least squares structural equation modeling (pls-sem) in second language and education research: Guidelines using an applied example. Res. Methods Appl. Linguist. 1(3), 100027 (2022)
Henseler, J., Ringle, C.M., Sarstedt, M.: A new criterion for assessing discriminant validity in variance-based structural equation modeling. J. Acad. Mark. Sci. 43(1), 115–135 (2015)
Lin, X., Ma, Y.-l., Ma, L., Zhang, R.-l.: A survey for image resizing. J. Zhejiang Univ. Sci. C 15(9), 697–716 (2014)
Prolific, Prolific - online participant recruitment for surveys and market research (2024). https://www.prolific.co
Qualtrics, Qualtrics - online survey platform (2024). https://uwo.eu.qualtrics.com/homepage/ui
World Population Review, World Population Review Statistics (2024). https://worldpopulationreview.com/state-rankings/transgender-population-by-state
World Population Review, World Population Review Statistics (2024). https://worldpopulationreview.com/country-rankings/phd-percentage-by-country
Human Centric Computing Group. Human Centric Computing Group (2024). https://thehccg.com/
Danish, M.U., Grolinger, K.: Kolmogorov–Arnold recurrent network for short term load forecasting across diverse consumers. Available at SSRN 4974855
Danish, M.U., Grolinger, K.: Leveraging hypernetworks and learnable kernels for consumer energy forecasting across diverse consumer types. IEEE Trans. Power Deliv. (2024). https://doi.org/10.1109/TPWRD.2024.3486010
Funding
Funding was provided by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants program (Grant No. RGPIN-2024-05191).
Author information
Authors and Affiliations
Contributions
M.A. and U.R. conceptualized the study and designed the research methodology. M.A. developed the Visual Verity questionnaire and coordinated the data collection process. M.U.D. played a central role in the comparative evaluations of AI models, prepared Figures 1–5 and Tables 1–18, and contributed extensively to drafting the results and discussion sections. S.A.S. conducted the literature review and provided insights for the comparative evaluation. A.Z.A. contributed to the development and statistical validation of the Visual Verity questionnaire. M.A. and M.U.D. wrote the initial draft of the manuscript. U.R. reviewed and revised the manuscript for intellectual content, ensuring methodological rigor and clarity. All authors reviewed the manuscript, provided critical feedback, and approved the final version for submission.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Communicated by Bing-kun Bao.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Aziz, M., Rehman, U., Danish, M.U. et al. Towards a unified evaluation framework: integrating human perception and metrics for AI-generated images. Multimedia Systems 31, 180 (2025). https://doi.org/10.1007/s00530-025-01769-7