Abstract
While Multimodal Large Language Models (MLLMs) have experienced significant advancement in visual understanding and reasoning, their potential to serve as powerful, flexible, interpretable, and text-driven models for Image Quality Assessment (IQA) remains largely unexplored. In this paper, we conduct a comprehensive and systematic study of prompting MLLMs for IQA. We first investigate nine prompting systems for MLLMs as the combinations of three standardized testing procedures in psychophysics (i.e., the single-stimulus, double-stimulus, and multiple-stimulus methods) and three popular prompting strategies in natural language processing (i.e., standard, in-context, and chain-of-thought prompting). We then present a difficult sample selection procedure, taking into account sample diversity and uncertainty, to further challenge MLLMs equipped with their respective optimal prompting systems. We assess three open-source MLLMs and one closed-source MLLM on several visual attributes of image quality (e.g., structural and textural distortions, geometric transformations, and color differences) in both full-reference and no-reference scenarios. Experimental results show that only the closed-source GPT-4V provides a reasonable account of human perception of image quality, but it is weak at discriminating fine-grained quality variations (e.g., color differences) and at comparing the visual quality of multiple images, tasks that humans perform effortlessly.
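To make the nine prompting systems concrete, the following is a minimal Python sketch of how the 3 × 3 grid of psychophysical testing procedures and prompting strategies could be composed into text prompts. The template wordings and the `build_prompt` helper are illustrative assumptions, not the authors' exact prompts.

```python
# Illustrative sketch (not the authors' exact prompts): the nine prompting
# systems arise as the cross product of three psychophysical testing
# procedures and three prompting strategies.
from itertools import product

TESTING_PROCEDURES = {
    "single_stimulus": "Rate the quality of the image on a scale from 1 (bad) to 5 (excellent).",
    "double_stimulus": "Given the reference image and the test image, rate the quality of the test image relative to the reference.",
    "multiple_stimulus": "Given these images, rank them from best to worst visual quality.",
}

PROMPTING_STRATEGIES = ["standard", "in_context", "chain_of_thought"]

def build_prompt(procedure: str, strategy: str, examples: str = "") -> str:
    """Compose one of the nine prompting systems as a text prompt."""
    task = TESTING_PROCEDURES[procedure]
    if strategy == "standard":
        return task
    if strategy == "in_context":
        # Prepend a few solved demonstrations (image-rating pairs).
        return f"Here are a few examples of rated images:\n{examples}\n{task}"
    # Chain-of-thought: ask the model to reason about distortions first.
    return f"{task} First describe the visible distortions step by step, then give your final answer."

# Enumerate all nine combinations.
for proc, strat in product(TESTING_PROCEDURES, PROMPTING_STRATEGIES):
    print(f"--- {proc} + {strat} ---")
    print(build_prompt(proc, strat, examples="[demonstration pairs here]"))
```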
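The difficult sample selection can likewise be sketched as a greedy procedure that trades off model uncertainty against sample diversity. The scoring rule below (uncertainty weighted by the distance to already-selected samples) is one plausible instantiation, not the paper's exact algorithm; `uncertainty` and `distance` are assumed user-supplied callables.

```python
def select_difficult_samples(candidates, uncertainty, distance, k):
    """Greedily pick k samples, favoring high model uncertainty while
    staying diverse (far, in feature space, from already-chosen samples).
    A plausible sketch of diversity-and-uncertainty-aware selection,
    not the authors' exact procedure."""
    selected = [max(candidates, key=uncertainty)]
    while len(selected) < k:
        remaining = [c for c in candidates if c not in selected]
        best = max(
            remaining,
            key=lambda c: uncertainty(c) * min(distance(c, s) for s in selected),
        )
        selected.append(best)
    return selected

# Toy usage with scalar "features": samples near 0.5 are most uncertain,
# and diversity is the absolute difference between feature values.
pool = [0.1, 0.45, 0.5, 0.52, 0.9]
picked = select_difficult_samples(
    pool,
    uncertainty=lambda x: 1 - abs(x - 0.5),
    distance=lambda a, b: abs(a - b),
    k=3,
)
print(picked)  # [0.5, 0.1, 0.9]: uncertain yet mutually distant samples
```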
Notes
1. This is also known as the double-stimulus impairment rating by treating the reference image as a second visual stimulus.
2. We follow [11] to introduce mild geometric transformations.
References
Achananuparp, P., Hu, X., Shen, X.: The evaluation of sentence similarity measures. In: Data Warehousing and Knowledge Discovery, pp. 305–316 (2008)
Achiam, J., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
Alayrac, J.B., et al.: Flamingo: a visual language model for few-shot learning. In: Advances in Neural Information Processing Systems, vol. 35, pp. 23716–23736 (2022)
Bai, J., et al.: Qwen-VL: a versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966 (2023)
Bracci, S., Mraz, J., Zeman, A., Leys, G., Op de Beeck, H.: The representational hierarchy in human and artificial visual systems in the presence of object-scene regularities. PLOS Comput. Biol. 19(4), 1–5 (2023)
Brown, T., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901 (2020)
Cao, P., Li, D., Ma, K.: Image quality assessment: integrating model-centric and data-centric approaches. In: Conference on Parsimony and Learning, pp. 529–541 (2024)
Chen, C., et al.: TOPIQ: a top-down approach from semantics to distortions for image quality assessment. arXiv preprint arXiv:2308.03060 (2023)
Chen, H., Wang, Z., Yang, Y., Sun, Q., Ma, K.: Learning a deep color difference metric for photographic images. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22242–22251 (2023)
Chubarau, A., Akhavan, T., Yoo, H., Mantiuk, R.K., Clark, J.: Perceptual image quality assessment for various viewing conditions and display systems. In: Image Quality and System Performance, pp. 1–9 (2020)
Ding, K., Ma, K., Wang, S., Simoncelli, E.P.: Image quality assessment: unifying structure and texture similarity. IEEE Trans. Pattern Anal. Mach. Intell. 44(5), 2567–2581 (2022)
Dong, Q., et al.: A survey for in-context learning. arXiv preprint arXiv:2301.00234 (2022)
Dong, X., et al.: InternLM-XComposer2: mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420 (2024)
Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2021)
Fang, Y., Zhu, H., Zeng, Y., Ma, K., Wang, Z.: Perceptual quality assessment of smartphone photography. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3677–3686 (2020)
Gao, P., et al.: LLaMA-Adapter V2: parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010 (2023)
Guo, Q., et al.: Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. In: International Conference on Learning Representations (2024)
Hu, E.J., et al.: LoRA: low-rank adaptation of large language models. In: International Conference on Learning Representations (2022)
Kaplan, J., et al.: Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)
Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: MUSIQ: multi-scale image quality transformer. In: IEEE/CVF International Conference on Computer Vision, pp. 5148–5157 (2021)
Kewenig, V., et al.: Multimodality and attention increase alignment in natural language prediction between humans and computational models. arXiv preprint arXiv:2308.06035 (2024)
Lao, S., et al.: Attentions help CNNs see better: attention-based hybrid image quality assessment network. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 1140–1149 (2022)
Li, C., et al.: AGIQA-3K: an open database for AI-generated image quality assessment. arXiv preprint arXiv:2306.04717 (2023)
Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: International Conference on Machine Learning, pp. 19730–19742 (2023)
Li, X.L., Liang, P.: Prefix-tuning: optimizing continuous prompts for generation. In: Association for Computational Linguistics and International Joint Conference on Natural Language Processing, pp. 4582–4597 (2021)
Li, Y., McLean, D., Bandar, Z.A., O’shea, J.D., Crockett, K.: Sentence similarity based on semantic nets and corpus statistics. IEEE Trans. Knowl. Data Eng. 18(8), 1138–1150 (2006)
Liang, Z., Li, C., Zhou, S., Feng, R., Loy, C.C.: Iterative prompt learning for unsupervised backlit image enhancement. In: IEEE/CVF International Conference on Computer Vision, pp. 8094–8103 (2023)
Lin, H., Hosu, V., Saupe, D.: KADID-10k: a large-scale artificially distorted IQA database. In: International Conference on Quality of Multimedia Experience, pp. 1–3 (2019)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Advances in Neural Information Processing Systems, vol. 36, pp. 1–25 (2024)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Ma, K., Duanmu, Z., Wang, Z.: Geometric transformation invariant image quality assessment using convolutional neural networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6732–6736 (2018)
Ma, K., Liu, W., Zhang, K., Duanmu, Z., Wang, Z., Zuo, W.: End-to-end blind image quality assessment using deep neural networks. IEEE Trans. Image Process. 27(3), 1202–1213 (2018)
Ma, K., et al.: Group MAD competition – a new methodology to compare objective image quality models. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1664–1673 (2016)
Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image quality analyzer. IEEE Signal Process. Lett. 20(3), 209–212 (2013)
Ngo, R., Chan, L., Mindermann, S.: The alignment problem from a deep learning perspective. In: International Conference on Learning Representations (2022)
Ouyang, L., et al.: Training language models to follow instructions with human feedback. In: Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744 (2022)
Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z.B., Swami, A.: Practical black-box attacks against machine learning. In: ACM Asia Conference on Computer and Communications Security, pp. 506–519 (2017)
Peng, Z., et al.: KOSMOS-2: grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824 (2023)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021)
Sheikh, H.R., Bovik, A.C.: Image information and visual quality. IEEE Trans. Image Process. 15(2), 430–444 (2006)
Shin, S., et al.: On the effect of pretraining corpora on in-context learning by a large-scale language model. In: North American Chapter of the Association for Computational Linguistics, pp. 5168–5186 (2022)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)
Team, G., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
Thurstone, L.L.: A law of comparative judgment. Psychol. Rev. 34, 273–286 (1927)
Tong, S., et al.: Cambrian-1: a fully open, vision-centric exploration of multimodal LLMs. arXiv preprint arXiv:2406.16860 (2024)
Topiwala, P., Dai, W., Pian, J., Biondi, K., Krovvidi, A.: VMAF and variants: towards a unified VQA. In: Applications of Digital Image Processing, vol. 11842, pp. 96–104 (2021)
Touvron, H., et al.: LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30, pp. 5998–6008 (2017)
Wang, Z., et al.: Measuring perceptual color differences of smartphone photographs. IEEE Trans. Pattern Anal. Mach. Intell. 45(8), 10114–10128 (2023)
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
Wei, J., et al.: Finetuned language models are zero-shot learners. In: International Conference on Learning Representations (2022)
Wei, J., et al.: Chain-of-thought prompting elicits reasoning in large language models. In: Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837 (2022)
Wu, H., et al.: Q-Bench: a benchmark for general-purpose foundation models on low-level vision. In: International Conference on Learning Representations (2024)
Wu, H., et al.: Q-Instruct: improving low-level visual abilities for multi-modality foundation models. arXiv preprint arXiv:2311.06783 (2023)
Wu, H., et al.: Q-Align: teaching LMMs for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090 (2023)
Wu, H., et al.: Towards open-ended visual quality comparison. arXiv preprint arXiv:2402.16641 (2024)
Wu, T., et al.: Assessor360: multi-sequence network for blind omnidirectional image quality assessment. In: Advances in Neural Information Processing Systems, vol. 36, pp. 1–14 (2024)
Yang, S., et al.: MANIQA: multi-dimension attention network for no-reference image quality assessment. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 1191–1200 (2022)
Yang, Z., et al.: The dawn of LMMs: preliminary explorations with GPT-4V(ision). arXiv preprint arXiv:2309.17421 (2023)
Ye, P., Doermann, D.: Active sampling for subjective image quality assessment. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4249–4256 (2014)
Ye, P., Kumar, J., Kang, L., Doermann, D.: Unsupervised feature learning framework for no-reference image quality assessment. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1098–1105 (2012)
Ye, Q., et al.: mPLUG-Owl2: revolutionizing multi-modal large language model with modality collaboration. arXiv preprint arXiv:2311.04257 (2023)
Yin, S., et al.: A survey on multimodal large language models. arXiv preprint arXiv:2306.13549 (2023)
Ying, Z., Niu, H., Gupta, P., Mahajan, D., Ghadiyaram, D., Bovik, A.: From patches to pictures (PaQ-2-PiQ): mapping the perceptual space of picture quality. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3585 (2020)
You, Z., et al.: Descriptive image quality assessment in the wild. arXiv preprint arXiv:2405.18842 (2024)
You, Z., Li, Z., Gu, J., Yin, Z., Xue, T., Dong, C.: Depicting beyond scores: advancing image quality assessment through multi-modal language models. arXiv preprint arXiv:2312.08962 (2023)
Zhang, L., Zhang, L., Mou, X., Zhang, D.: FSIM: a feature similarity index for image quality assessment. IEEE Trans. Image Process. 20(8), 2378–2386 (2011)
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 586–595 (2018)
Zhang, W., Ma, K., Zhai, G., Yang, X.: Uncertainty-aware blind image quality assessment in the laboratory and wild. IEEE Trans. Image Process. 30, 3474–3486 (2021)
Zhang, W., Zhai, G., Wei, Y., Yang, X., Ma, K.: Blind image quality assessment via vision-language correspondence: a multitask learning perspective. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14071–14081 (2023)
Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)
Zhu, H., et al.: 2AFC prompting of large multimodal models for image quality assessment. arXiv preprint arXiv:2402.01162 (2024)
Zhuang, S., Hadfield-Menell, D.: Consequences of misaligned AI. In: Advances in Neural Information Processing Systems, vol. 33, pp. 15763–15773 (2020)
Acknowledgements
The authors would like to thank OPPO for its generous support. This work was supported in part by the National Natural Science Foundation of China (62071407 and 61991451), the Hong Kong ITC Innovation and Technology Fund (9440379 and 9440390), and the Shenzhen Science and Technology Program (JCYJ20220818101001004).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Wu, T., Ma, K., Liang, J., Yang, Y., Zhang, L. (2025). A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15132. Springer, Cham. https://doi.org/10.1007/978-3-031-72904-1_9
DOI: https://doi.org/10.1007/978-3-031-72904-1_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72903-4
Online ISBN: 978-3-031-72904-1
eBook Packages: Computer Science, Computer Science (R0)