Abstract
While Multimodal Large Language Models (MLLMs) have experienced significant advancement in visual understanding and reasoning, their potential to serve as powerful, flexible, interpretable, and text-driven models for Image Quality Assessment (IQA) remains largely unexplored. In this paper, we conduct a comprehensive and systematic study of prompting MLLMs for IQA. We first investigate nine prompting systems for MLLMs as the combinations of three standardized testing procedures in psychophysics (i.e., the single-stimulus, double-stimulus, and multiple-stimulus methods) and three popular prompting strategies in natural language processing (i.e., standard, in-context, and chain-of-thought prompting). We then present a difficult sample selection procedure, taking into account sample diversity and uncertainty, to further challenge MLLMs equipped with their respective optimal prompting systems. We assess three open-source MLLMs and one closed-source MLLM on several visual attributes of image quality (e.g., structural and textural distortions, geometric transformations, and color differences) in both full-reference and no-reference scenarios. Experimental results show that only the closed-source GPT-4V provides a reasonable account of human perception of image quality, but it is weak at discriminating fine-grained quality variations (e.g., color differences) and at comparing the visual quality of multiple images, tasks that humans perform effortlessly.
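To make the nine prompting systems concrete, the following is a minimal Python sketch of how the 3 × 3 grid of psychophysical testing procedures and prompting strategies could be composed into text prompts. The template wordings and the `build_prompt` helper are illustrative assumptions, not the authors' exact prompts.

```python
# Illustrative sketch (not the authors' exact prompts): the nine prompting
# systems arise as the cross product of three psychophysical testing
# procedures and three prompting strategies.
from itertools import product

TESTING_PROCEDURES = {
    "single_stimulus": "Rate the quality of the image on a scale from 1 (bad) to 5 (excellent).",
    "double_stimulus": "Given the reference image and the test image, rate the quality of the test image relative to the reference.",
    "multiple_stimulus": "Given these images, rank them from best to worst visual quality.",
}

PROMPTING_STRATEGIES = ["standard", "in_context", "chain_of_thought"]

def build_prompt(procedure: str, strategy: str, examples: str = "") -> str:
    """Compose one of the nine prompting systems as a text prompt."""
    task = TESTING_PROCEDURES[procedure]
    if strategy == "standard":
        return task
    if strategy == "in_context":
        # Prepend a few solved demonstrations (image-rating pairs).
        return f"Here are a few examples of rated images:\n{examples}\n{task}"
    # Chain-of-thought: ask the model to reason about distortions first.
    return f"{task} First describe the visible distortions step by step, then give your final answer."

# Enumerate all nine combinations.
for proc, strat in product(TESTING_PROCEDURES, PROMPTING_STRATEGIES):
    print(f"--- {proc} + {strat} ---")
    print(build_prompt(proc, strat, examples="[demonstration pairs here]"))
```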
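The difficult sample selection can likewise be sketched as a greedy procedure that trades off model uncertainty against sample diversity. The scoring rule below (uncertainty weighted by the distance to already-selected samples) is one plausible instantiation, not the paper's exact algorithm; `uncertainty` and `distance` are assumed user-supplied callables.

```python
def select_difficult_samples(candidates, uncertainty, distance, k):
    """Greedily pick k samples, favoring high model uncertainty while
    staying diverse (far, in feature space, from already-chosen samples).
    A plausible sketch of diversity-and-uncertainty-aware selection,
    not the authors' exact procedure."""
    selected = [max(candidates, key=uncertainty)]
    while len(selected) < k:
        remaining = [c for c in candidates if c not in selected]
        best = max(
            remaining,
            key=lambda c: uncertainty(c) * min(distance(c, s) for s in selected),
        )
        selected.append(best)
    return selected

# Toy usage with scalar "features": samples near 0.5 are most uncertain,
# and diversity is the absolute difference between feature values.
pool = [0.1, 0.45, 0.5, 0.52, 0.9]
picked = select_difficult_samples(
    pool,
    uncertainty=lambda x: 1 - abs(x - 0.5),
    distance=lambda a, b: abs(a - b),
    k=3,
)
print(picked)  # [0.5, 0.1, 0.9]: uncertain yet mutually distant samples
```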
Notes
1. This is also known as the double-stimulus impairment rating by treating the reference image as a second visual stimulus.
2. We follow [11] to introduce mild geometric transformations.
References
Achananuparp, P., Hu, X., Shen, X.: The evaluation of sentence similarity measures. In: Data Warehousing and Knowledge Discovery, pp. 305–316 (2008)
Achiam, J., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
Alayrac, J.B., et al.: Flamingo: a visual language model for few-shot learning. In: Advances in Neural Information Processing Systems, vol. 35, pp. 23716–23736 (2022)
Bai, J., et al.: Qwen-VL: a versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966 (2023)
Bracci, S., Mraz, J., Zeman, A., Leys, G., Op de Beeck, H.: The representational hierarchy in human and artificial visual systems in the presence of object-scene regularities. PLOS Comput. Biol. 19(4), 1–5 (2023)
Brown, T., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901 (2020)
Cao, P., Li, D., Ma, K.: Image quality assessment: integrating model-centric and data-centric approaches. In: Conference on Parsimony and Learning, pp. 529–541 (2024)
Chen, C., et al.: TOPIQ: a top-down approach from semantics to distortions for image quality assessment. arXiv preprint arXiv:2308.03060 (2023)
Chen, H., Wang, Z., Yang, Y., Sun, Q., Ma, K.: Learning a deep color difference metric for photographic images. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22242–22251 (2023)
Chubarau, A., Akhavan, T., Yoo, H., Mantiuk, R.K., Clark, J.: Perceptual image quality assessment for various viewing conditions and display systems. In: Image Quality and System Performance, pp. 1–9 (2020)
Ding, K., Ma, K., Wang, S., Simoncelli, E.P.: Image quality assessment: unifying structure and texture similarity. IEEE Trans. Pattern Anal. Mach. Intell. 44(5), 2567–2581 (2022)
Dong, Q., et al.: A survey for in-context learning. arXiv preprint arXiv:2301.00234 (2022)
Dong, X., et al.: InternLM-XComposer2: mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420 (2024)
Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2021)
Fang, Y., Zhu, H., Zeng, Y., Ma, K., Wang, Z.: Perceptual quality assessment of smartphone photography. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3677–3686 (2020)
Gao, P., et al.: LLaMA-Adapter V2: parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010 (2023)
Guo, Q., et al.: Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. In: International Conference on Learning Representations (2024)
Hu, E.J., et al.: LoRA: low-rank adaptation of large language models. In: International Conference on Learning Representations (2022)
Kaplan, J., et al.: Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)
Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: MUSIQ: multi-scale image quality transformer. In: IEEE/CVF International Conference on Computer Vision, pp. 5148–5157 (2021)
Kewenig, V., et al.: Multimodality and attention increase alignment in natural language prediction between humans and computational models. arXiv preprint arXiv:2308.06035 (2024)
Lao, S., et al.: Attentions help CNNs see better: attention-based hybrid image quality assessment network. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 1140–1149 (2022)
Li, C., et al.: AGIQA-3K: an open database for AI-generated image quality assessment. arXiv preprint arXiv:2306.04717 (2023)
Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: International Conference on Machine Learning, pp. 19730–19742 (2023)
Li, X.L., Liang, P.: Prefix-tuning: optimizing continuous prompts for generation. In: Association for Computational Linguistics and International Joint Conference on Natural Language Processing, pp. 4582–4597 (2021)
Li, Y., McLean, D., Bandar, Z.A., O’shea, J.D., Crockett, K.: Sentence similarity based on semantic nets and corpus statistics. IEEE Trans. Knowl. Data Eng. 18(8), 1138–1150 (2006)
Liang, Z., Li, C., Zhou, S., Feng, R., Loy, C.C.: Iterative prompt learning for unsupervised backlit image enhancement. In: IEEE/CVF International Conference on Computer Vision, pp. 8094–8103 (2023)
Lin, H., Hosu, V., Saupe, D.: KADID-10k: a large-scale artificially distorted IQA database. In: International Conference on Quality of Multimedia Experience, pp. 1–3 (2019)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Advances in Neural Information Processing Systems, vol. 36, pp. 1–25 (2024)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Ma, K., Duanmu, Z., Wang, Z.: Geometric transformation invariant image quality assessment using convolutional neural networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6732–6736 (2018)
Ma, K., Liu, W., Zhang, K., Duanmu, Z., Wang, Z., Zuo, W.: End-to-end blind image quality assessment using deep neural networks. IEEE Trans. Image Process. 27(3), 1202–1213 (2018)
Ma, K., et al.: Group MAD competition – a new methodology to compare objective image quality models. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1664–1673 (2016)
Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image quality analyzer. IEEE Signal Process. Lett. 20(3), 209–212 (2013)
Ngo, R., Chan, L., Mindermann, S.: The alignment problem from a deep learning perspective. In: International Conference on Learning Representations (2022)
Ouyang, L., et al.: Training language models to follow instructions with human feedback. In: Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744 (2022)
Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z.B., Swami, A.: Practical black-box attacks against machine learning. In: ACM Asia Conference on Computer and Communications Security, pp. 506–519 (2017)
Peng, Z., et al.: KOSMOS-2: grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824 (2023)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021)
Sheikh, H.R., Bovik, A.C.: Image information and visual quality. IEEE Trans. Image Process. 15(2), 430–444 (2006)
Shin, S., et al.: On the effect of pretraining corpora on in-context learning by a large-scale language model. In: North American Chapter of the Association for Computational Linguistics, pp. 5168–5186 (2022)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)
Team, G., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
Thurstone, L.L.: A law of comparative judgment. Psychol. Rev. 34, 273–286 (1927)
Tong, S., et al.: Cambrian-1: a fully open, vision-centric exploration of multimodal LLMs. arXiv preprint arXiv:2406.16860 (2024)
Topiwala, P., Dai, W., Pian, J., Biondi, K., Krovvidi, A.: VMAF and variants: towards a unified VQA. In: Applications of Digital Image Processing, vol. 11842, pp. 96–104 (2021)
Touvron, H., et al.: LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30, pp. 5998–6008 (2017)
Wang, Z., et al.: Measuring perceptual color differences of smartphone photographs. IEEE Trans. Pattern Anal. Mach. Intell. 45(8), 10114–10128 (2023)
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
Wei, J., et al.: Finetuned language models are zero-shot learners. In: International Conference on Learning Representations (2022)
Wei, J., et al.: Chain-of-thought prompting elicits reasoning in large language models. In: Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837 (2022)
Wu, H., et al.: Q-Bench: a benchmark for general-purpose foundation models on low-level vision. In: International Conference on Learning Representations (2024)
Wu, H., et al.: Q-Instruct: improving low-level visual abilities for multi-modality foundation models. arXiv preprint arXiv:2311.06783 (2023)
Wu, H., et al.: Q-Align: teaching LMMs for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090 (2023)
Wu, H., et al.: Towards open-ended visual quality comparison. arXiv preprint arXiv:2402.16641 (2024)
Wu, T., et al.: Assessor360: multi-sequence network for blind omnidirectional image quality assessment. In: Advances in Neural Information Processing Systems, vol. 36, pp. 1–14 (2024)
Yang, S., et al.: MANIQA: multi-dimension attention network for no-reference image quality assessment. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 1191–1200 (2022)
Yang, Z., et al.: The dawn of LMMs: preliminary explorations with GPT-4V(ision). arXiv preprint arXiv:2309.17421 (2023)
Ye, P., Doermann, D.: Active sampling for subjective image quality assessment. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4249–4256 (2014)
Ye, P., Kumar, J., Kang, L., Doermann, D.: Unsupervised feature learning framework for no-reference image quality assessment. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1098–1105 (2012)
Ye, Q., et al.: mPLUG-Owl2: revolutionizing multi-modal large language model with modality collaboration. arXiv preprint arXiv:2311.04257 (2023)
Yin, S., et al.: A survey on multimodal large language models. arXiv preprint arXiv:2306.13549 (2023)
Ying, Z., Niu, H., Gupta, P., Mahajan, D., Ghadiyaram, D., Bovik, A.: From patches to pictures (PaQ-2-PiQ): mapping the perceptual space of picture quality. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3585 (2020)
You, Z., et al.: Descriptive image quality assessment in the wild. arXiv preprint arXiv:2405.18842 (2024)
You, Z., Li, Z., Gu, J., Yin, Z., Xue, T., Dong, C.: Depicting beyond scores: advancing image quality assessment through multi-modal language models. arXiv preprint arXiv:2312.08962 (2023)
Zhang, L., Zhang, L., Mou, X., Zhang, D.: FSIM: a feature similarity index for image quality assessment. IEEE Trans. Image Process. 20(8), 2378–2386 (2011)
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 586–595 (2018)
Zhang, W., Ma, K., Zhai, G., Yang, X.: Uncertainty-aware blind image quality assessment in the laboratory and wild. IEEE Trans. Image Process. 30, 3474–3486 (2021)
Zhang, W., Zhai, G., Wei, Y., Yang, X., Ma, K.: Blind image quality assessment via vision-language correspondence: a multitask learning perspective. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14071–14081 (2023)
Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)
Zhu, H., et al.: 2AFC prompting of large multimodal models for image quality assessment. arXiv preprint arXiv:2402.01162 (2024)
Zhuang, S., Hadfield-Menell, D.: Consequences of misaligned AI. In: Advances in Neural Information Processing Systems, vol. 33, pp. 15763–15773 (2020)
Acknowledgements
The authors would like to thank OPPO for its generous support. This work was supported in part by the National Natural Science Foundation of China (62071407 and 61991451), the Hong Kong ITC Innovation and Technology Fund (9440379 and 9440390), and the Shenzhen Science and Technology Program (JCYJ20220818101001004).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Wu, T., Ma, K., Liang, J., Yang, Y., Zhang, L. (2025). A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15132. Springer, Cham. https://doi.org/10.1007/978-3-031-72904-1_9
DOI: https://doi.org/10.1007/978-3-031-72904-1_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72903-4
Online ISBN: 978-3-031-72904-1
eBook Packages: Computer Science, Computer Science (R0)