
A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

While Multimodal Large Language Models (MLLMs) have experienced significant advancement in visual understanding and reasoning, their potential to serve as powerful, flexible, interpretable, and text-driven models for Image Quality Assessment (IQA) remains largely unexplored. In this paper, we conduct a comprehensive and systematic study of prompting MLLMs for IQA. We first investigate nine prompting systems for MLLMs as the combinations of three standardized testing procedures in psychophysics (i.e., the single-stimulus, double-stimulus, and multiple-stimulus methods) and three popular prompting strategies in natural language processing (i.e., standard, in-context, and chain-of-thought prompting). We then present a difficult-sample selection procedure, which accounts for sample diversity and uncertainty, to further challenge MLLMs equipped with their respective optimal prompting systems. We assess three open-source MLLMs and one closed-source MLLM on several visual attributes of image quality (e.g., structural and textural distortions, geometric transformations, and color differences) in both full-reference and no-reference scenarios. Experimental results show that only the closed-source GPT-4V provides a reasonable account of human perception of image quality, but it remains weak at discriminating fine-grained quality variations (e.g., color differences) and at comparing the visual quality of multiple images, tasks that humans perform effortlessly.
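The two methodological ingredients above are easiest to see in code. Below is a minimal, hypothetical Python sketch (not the authors' released implementation) of how the nine prompting systems arise as 3 × 3 combinations of psychophysical procedures and prompting strategies, plus a greedy difficult-sample selection that trades off uncertainty against diversity. All prompt wordings, function names, and the selection rule are illustrative assumptions.

```python
from itertools import product

import numpy as np

# Three standardized psychophysical testing procedures, phrased as task
# prompts (wording is an illustrative assumption, not the paper's exact text).
STIMULUS_PROCEDURES = {
    "single-stimulus": "Rate the quality of the image on a scale of 1 to 5.",
    "double-stimulus": "Which of the two images, A or B, has better quality?",
    "multiple-stimulus": "Rank the given images from best to worst quality.",
}

# Three popular NLP prompting strategies applied on top of each task prompt.
def standard_prompt(task: str) -> str:
    return task

def in_context_prompt(task: str) -> str:
    # Prepend a solved exemplar (few-shot prompting); the exemplar is made up.
    example = "Example: an image with heavy JPEG blocking is rated 2 out of 5."
    return example + "\n\n" + task

def chain_of_thought_prompt(task: str) -> str:
    # Ask the model to reason about the distortions before answering.
    return task + "\nDescribe the visible distortions step by step, then answer."

STRATEGIES = {
    "standard": standard_prompt,
    "in-context": in_context_prompt,
    "chain-of-thought": chain_of_thought_prompt,
}

# The nine prompting systems are simply the 3 x 3 combinations.
PROMPTING_SYSTEMS = {
    (proc, strat): STRATEGIES[strat](task)
    for (proc, task), strat in product(STIMULUS_PROCEDURES.items(), STRATEGIES)
}

def select_difficult_samples(features: np.ndarray, uncertainty: np.ndarray, k: int):
    """Greedily pick k samples that are both uncertain and mutually diverse.

    `features` is an (N, D) array of image embeddings and `uncertainty` an
    (N,) array of per-sample scores; both inputs are assumptions here.
    """
    chosen = [int(np.argmax(uncertainty))]          # seed: most uncertain sample
    while len(chosen) < k:
        # Distance from every candidate to its nearest already-chosen sample.
        dists = np.linalg.norm(features[:, None] - features[chosen][None], axis=-1)
        nearest = dists.min(axis=1)
        score = uncertainty * nearest               # favor uncertain AND far-away
        score[chosen] = -np.inf                     # never re-pick a sample
        chosen.append(int(np.argmax(score)))
    return chosen
```

Under this framing, `PROMPTING_SYSTEMS[("double-stimulus", "chain-of-thought")]` would yield a full-reference comparison prompt augmented with reasoning instructions; the paper evaluates each MLLM with its best-performing combination on the selected difficult samples.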


Notes

  1. This is also known as the double-stimulus impairment rating, in which the reference image is treated as a second visual stimulus.

  2. We follow [11] to introduce mild geometric transformations.

References

  1. Achananuparp, P., Hu, X., Shen, X.: The evaluation of sentence similarity measures. In: Data Warehousing and Knowledge Discovery, pp. 305–316 (2008)

  2. Achiam, J., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  3. Alayrac, J.B., et al.: Flamingo: a visual language model for few-shot learning. In: Advances in Neural Information Processing Systems, vol. 35, pp. 23716–23736 (2022)

  4. Bai, J., et al.: Qwen-VL: a versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966 (2023)

  5. Bracci, S., Mraz, J., Zeman, A., Leys, G., Op de Beeck, H.: The representational hierarchy in human and artificial visual systems in the presence of object-scene regularities. PLOS Comput. Biol. 19(4), 1–5 (2023)

  6. Brown, T., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901 (2020)

  7. Cao, P., Li, D., Ma, K.: Image quality assessment: integrating model-centric and data-centric approaches. In: Conference on Parsimony and Learning, pp. 529–541 (2024)

  8. Chen, C., et al.: TOPIQ: a top-down approach from semantics to distortions for image quality assessment. arXiv preprint arXiv:2308.03060 (2023)

  9. Chen, H., Wang, Z., Yang, Y., Sun, Q., Ma, K.: Learning a deep color difference metric for photographic images. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22242–22251 (2023)

  10. Chubarau, A., Akhavan, T., Yoo, H., Mantiuk, R.K., Clark, J.: Perceptual image quality assessment for various viewing conditions and display systems. In: Image Quality and System Performance, pp. 1–9 (2020)

  11. Ding, K., Ma, K., Wang, S., Simoncelli, E.P.: Image quality assessment: unifying structure and texture similarity. IEEE Trans. Pattern Anal. Mach. Intell. 44(5), 2567–2581 (2020)

  12. Dong, Q., et al.: A survey on in-context learning. arXiv preprint arXiv:2301.00234 (2022)

  13. Dong, X., et al.: InternLM-XComposer2: mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420 (2024)

  14. Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2020)

  15. Fang, Y., Zhu, H., Zeng, Y., Ma, K., Wang, Z.: Perceptual quality assessment of smartphone photography. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3677–3686 (2020)

  16. Gao, P., et al.: LLaMA-Adapter V2: parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010 (2023)

  17. Guo, Q., et al.: Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. In: International Conference on Learning Representations (2024)

  18. Hu, E.J., et al.: LoRA: low-rank adaptation of large language models. In: International Conference on Learning Representations (2022)

  19. Kaplan, J., et al.: Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)

  20. Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: MUSIQ: multi-scale image quality transformer. In: IEEE/CVF International Conference on Computer Vision, pp. 5148–5157 (2021)

  21. Kewenig, V., et al.: Multimodality and attention increase alignment in natural language prediction between humans and computational models. arXiv preprint arXiv:2308.06035 (2024)

  22. Lao, S., et al.: Attentions help CNNs see better: attention-based hybrid image quality assessment network. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop, pp. 1140–1149 (2022)

  23. Li, C., et al.: AGIQA-3K: an open database for AI-generated image quality assessment. arXiv preprint arXiv:2306.04717 (2023)

  24. Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: International Conference on Machine Learning, pp. 19730–19742 (2023)

  25. Li, X.L., Liang, P.: Prefix-tuning: optimizing continuous prompts for generation. In: Association for Computational Linguistics and International Joint Conference on Natural Language Processing, pp. 4582–4597 (2021)

  26. Li, Y., McLean, D., Bandar, Z.A., O'Shea, J.D., Crockett, K.: Sentence similarity based on semantic nets and corpus statistics. IEEE Trans. Knowl. Data Eng. 18(8), 1138–1150 (2006)

  27. Liang, Z., Li, C., Zhou, S., Feng, R., Loy, C.C.: Iterative prompt learning for unsupervised backlit image enhancement. In: IEEE/CVF International Conference on Computer Vision, pp. 8094–8103 (2023)

  28. Lin, H., Hosu, V., Saupe, D.: KADID-10k: a large-scale artificially distorted IQA database. In: International Conference on Quality of Multimedia Experience, pp. 1–3 (2019)

  29. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Advances in Neural Information Processing Systems, vol. 36, pp. 1–25 (2024)

  30. Liu, Z., et al.: Swin Transformer: hierarchical vision transformer using shifted windows. In: IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)

  31. Ma, K., Duanmu, Z., Wang, Z.: Geometric transformation invariant image quality assessment using convolutional neural networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6732–6736 (2018)

  32. Ma, K., Liu, W., Zhang, K., Duanmu, Z., Wang, Z., Zuo, W.: End-to-end blind image quality assessment using deep neural networks. IEEE Trans. Image Process. 27(3), 1202–1213 (2017)

  33. Ma, K., et al.: Group MAD competition – a new methodology to compare objective image quality models. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1664–1673 (2016)

  34. Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image quality analyzer. IEEE Signal Process. Lett. 20(3), 209–212 (2012)

  35. Ngo, R., Chan, L., Mindermann, S.: The alignment problem from a deep learning perspective. In: International Conference on Learning Representations (2022)

  36. Ouyang, L., et al.: Training language models to follow instructions with human feedback. In: Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744 (2022)

  37. Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z.B., Swami, A.: Practical black-box attacks against machine learning. In: ACM Asia Conference on Computer and Communications Security, pp. 506–519 (2017)

  38. Peng, Z., et al.: KOSMOS-2: grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824 (2023)

  39. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021)

  40. Sheikh, H.R., Bovik, A.C.: Image information and visual quality. IEEE Trans. Image Process. 15(2), 430–444 (2006)

  41. Shin, S., et al.: On the effect of pretraining corpora on in-context learning by a large-scale language model. In: North American Chapter of the Association for Computational Linguistics, pp. 5168–5186 (2022)

  42. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2014)

  43. Team, G., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

  44. Thurstone, L.L.: A law of comparative judgment. Psychol. Rev. 34, 273–286 (1927)

  45. Tong, S., et al.: Cambrian-1: a fully open, vision-centric exploration of multimodal LLMs. arXiv preprint arXiv:2406.16860 (2024)

  46. Topiwala, P., Dai, W., Pian, J., Biondi, K., Krovvidi, A.: VMAF and variants: towards a unified VQA. In: Applications of Digital Image Processing, vol. 11842, pp. 96–104 (2021)

  47. Touvron, H., et al.: LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  48. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30, pp. 5998–6008 (2017)

  49. Wang, Z., et al.: Measuring perceptual color differences of smartphone photographs. IEEE Trans. Pattern Anal. Mach. Intell. 45(8), 10114–10128 (2023)

  50. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)

  51. Wei, J., et al.: Finetuned language models are zero-shot learners. In: International Conference on Learning Representations (2022)

  52. Wei, J., et al.: Chain-of-thought prompting elicits reasoning in large language models. In: Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837 (2022)

  53. Wu, H., et al.: Q-Bench: a benchmark for general-purpose foundation models on low-level vision. In: International Conference on Learning Representations (2024)

  54. Wu, H., et al.: Q-Instruct: improving low-level visual abilities for multi-modality foundation models. arXiv preprint arXiv:2311.06783 (2023)

  55. Wu, H., et al.: Q-Align: teaching LMMs for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090 (2023)

  56. Wu, H., et al.: Towards open-ended visual quality comparison. arXiv preprint arXiv:2402.16641 (2024)

  57. Wu, T., et al.: Assessor360: multi-sequence network for blind omnidirectional image quality assessment. In: Advances in Neural Information Processing Systems, vol. 36, pp. 1–14 (2024)

  58. Yang, S., et al.: MANIQA: multi-dimension attention network for no-reference image quality assessment. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop, pp. 1191–1200 (2022)

  59. Yang, Z., et al.: The dawn of LMMs: preliminary explorations with GPT-4V(ision). arXiv preprint arXiv:2309.17421 (2023)

  60. Ye, P., Doermann, D.: Active sampling for subjective image quality assessment. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4249–4256 (2014)

  61. Ye, P., Kumar, J., Kang, L., Doermann, D.: Unsupervised feature learning framework for no-reference image quality assessment. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1098–1105 (2012)

  62. Ye, Q., et al.: mPLUG-Owl2: revolutionizing multi-modal large language model with modality collaboration. arXiv preprint arXiv:2311.04257 (2023)

  63. Yin, S., et al.: A survey on multimodal large language models. arXiv preprint arXiv:2306.13549 (2023)

  64. Ying, Z., Niu, H., Gupta, P., Mahajan, D., Ghadiyaram, D., Bovik, A.: From patches to pictures (PaQ-2-PiQ): mapping the perceptual space of picture quality. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3585 (2020)

  65. You, Z., et al.: Descriptive image quality assessment in the wild. arXiv preprint arXiv:2405.18842 (2024)

  66. You, Z., Li, Z., Gu, J., Yin, Z., Xue, T., Dong, C.: Depicting beyond scores: advancing image quality assessment through multi-modal language models. arXiv preprint arXiv:2312.08962 (2023)

  67. Zhang, L., Zhang, L., Mou, X., Zhang, D.: FSIM: a feature similarity index for image quality assessment. IEEE Trans. Image Process. 20(8), 2378–2386 (2011)

  68. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 586–595 (2018)

  69. Zhang, W., Ma, K., Zhai, G., Yang, X.: Uncertainty-aware blind image quality assessment in the laboratory and wild. IEEE Trans. Image Process. 30, 3474–3486 (2021)

  70. Zhang, W., Zhai, G., Wei, Y., Yang, X., Ma, K.: Blind image quality assessment via vision-language correspondence: a multitask learning perspective. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14071–14081 (2023)

  71. Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)

  72. Zhu, H., et al.: 2AFC prompting of large multimodal models for image quality assessment. arXiv preprint arXiv:2402.01162 (2024)

  73. Zhuang, S., Hadfield-Menell, D.: Consequences of misaligned AI. In: Advances in Neural Information Processing Systems, vol. 33, pp. 15763–15773 (2020)


Acknowledgements

The authors would like to thank OPPO for its generous support. This work was supported in part by the National Natural Science Foundation of China (62071407 and 61991451), the Hong Kong ITC Innovation and Technology Fund (9440379 and 9440390), and the Shenzhen Science and Technology Program (JCYJ20220818101001004).

Author information


Corresponding authors

Correspondence to Kede Ma or Yujiu Yang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 260 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wu, T., Ma, K., Liang, J., Yang, Y., Zhang, L. (2025). A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15132. Springer, Cham. https://doi.org/10.1007/978-3-031-72904-1_9


  • DOI: https://doi.org/10.1007/978-3-031-72904-1_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72903-4

  • Online ISBN: 978-3-031-72904-1

  • eBook Packages: Computer Science, Computer Science (R0)
