Abstract
CLIP is a powerful and widely used tool for understanding images in the context of natural language descriptions, enabling nuanced tasks. However, due to its generic nature, it does not offer application-specific, fine-grained, and structured understanding. In this work, we adapt CLIP for fine-grained and structured (in the form of tabular data) visual understanding of museum exhibits. To facilitate such understanding, we (a) collect, curate, and benchmark a dataset of 200K+ image-table pairs, and (b) develop a method that predicts tabular outputs for input images. Our dataset is the first of its kind in the public domain. At the same time, the proposed method is novel in leveraging CLIP's powerful representations for fine-grained and tabular understanding. The proposed method (MUZE) learns to map CLIP's image embeddings to the tabular structure by means of a proposed transformer-based parsing network (parseNet). More specifically, parseNet predicts missing attribute values while integrating context from the known attribute-value pairs of an input image. We show that this leads to a significant improvement in accuracy. Through exhaustive experiments, we demonstrate the effectiveness of the proposed method for fine-grained and structured understanding of museum exhibits, achieving encouraging results on a newly established benchmark. Our dataset and source code can be found at: https://github.com/insait-institute/MUZE
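To make the abstract's description more concrete, the following is a minimal, hypothetical sketch (not the authors' released implementation; see the repository above for that) of how a parseNet-style module could combine a precomputed CLIP image embedding with known attribute-value pairs to predict the missing attribute values of a table. All names (ParseNetSketch, the vocabulary sizes, the masking scheme) are illustrative assumptions.

```python
# Illustrative sketch only: a transformer that conditions on a frozen CLIP
# image embedding plus known attribute-value pairs and scores candidate
# values for the unknown attributes. Hyperparameters are placeholders.
import torch
import torch.nn as nn

class ParseNetSketch(nn.Module):
    def __init__(self, clip_dim=512, d_model=512, n_heads=8, n_layers=4,
                 n_attributes=16, value_vocab_size=10_000):
        super().__init__()
        # Project the (frozen) CLIP image embedding into the transformer space.
        self.img_proj = nn.Linear(clip_dim, d_model)
        # Learned embeddings for attribute names and for known attribute values.
        self.attr_emb = nn.Embedding(n_attributes, d_model)
        self.value_emb = nn.Embedding(value_vocab_size, d_model)
        # Transformer encoder mixing the image token with the attribute tokens.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Head that scores candidate values for every attribute slot.
        self.value_head = nn.Linear(d_model, value_vocab_size)

    def forward(self, clip_img_emb, attr_ids, value_ids, known_mask):
        """
        clip_img_emb: (B, clip_dim)  precomputed CLIP image embeddings
        attr_ids:     (B, A)         indices of the table's attribute names
        value_ids:    (B, A)         indices of values (ignored where unknown)
        known_mask:   (B, A) bool    True where the value is already known
        """
        img_tok = self.img_proj(clip_img_emb).unsqueeze(1)        # (B, 1, D)
        attr_tok = self.attr_emb(attr_ids)                        # (B, A, D)
        # Inject known values as context; unknown slots keep only the attribute embedding.
        val_tok = self.value_emb(value_ids) * known_mask.unsqueeze(-1).float()
        tokens = torch.cat([img_tok, attr_tok + val_tok], dim=1)  # (B, 1+A, D)
        hidden = self.encoder(tokens)[:, 1:, :]                   # drop the image token
        return self.value_head(hidden)                            # (B, A, value_vocab_size)

# Training would use cross-entropy on the unknown slots only, e.g.:
# logits = model(clip_emb, attr_ids, value_ids, known_mask)
# loss = F.cross_entropy(logits[~known_mask], targets[~known_mask])
```

The key design choice this sketch mirrors is the one the abstract credits for the accuracy gain: by feeding the already-known attribute-value pairs into the transformer alongside the image embedding, the network can disambiguate visually similar exhibits using contextual metadata rather than relying on the image alone.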
Notes
- 1. Some other variants, such as [1], also exist; however, separate vision-only representation models are often preferred, primarily due to ease of development.
- 2. We refer to both the dataset and the method with the same name. In case of ambiguity, we use MUZE (dataset) or MUZE (method).
References
Alayrac, J.B., et al.: Flamingo: a visual language model for few-shot learning. Adv. Neural. Inf. Process. Syst. 35, 23716–23736 (2022)
Bai, Z., Nakashima, Y., Garcia, N.: Explain me the painting: multi-topic knowledgeable art description generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5422–5432 (2021)
Bangalath, H., Maaz, M., Khattak, M.U., Khan, S.H., Shahbaz Khan, F.: Bridging the gap between object and image-level representations for open-vocabulary detection. Adv. Neural. Inf. Process. Syst. 35, 33781–33794 (2022)
Bao, H., Dong, L., Piao, S., Wei, F.: BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021)
Barz, B., Denzler, J.: Deep learning on small datasets without pre-training using cosine loss. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1371–1380 (2020)
Becattini, F., et al.: VISCOUNTH: a large-scale multilingual visual question answering dataset for cultural heritage. ACM Trans. Multimedia Comput. Commun. Appl. (2023)
Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C., Jatowt, A.: YAKE! keyword extraction from single documents using multiple local features. Inf. Sci. 509, 257–289 (2020)
Chen, W., Zha, H., Chen, Z., Xiong, W., Wang, H., Wang, W.: HybridQA: a dataset of multi-hop question answering over tabular and textual data. arXiv preprint arXiv:2004.07347 (2020)
Chen, X., et al.: PaLI: a jointly-scaled multilingual language-image model. In: ICLR (2023)
Chen, Y.-C., et al.: UNITER: universal image-text representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX, pp. 104–120. Springer International Publishing, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_7
Cherti, M., et al.: Reproducible scaling laws for contrastive language-image learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2818–2829 (2023)
Conde, M.V., Turgutlu, K.: CLIP-Art: contrastive pre-training for fine-grained art classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3956–3960 (2021)
Cui, P., Zhang, D., Deng, Z., Dong, Y., Zhu, J.: Learning sample difficulty from pre-trained models for reliable prediction. In: Advances in Neural Information Processing Systems, vol. 36 (2024)
Khosla, A., Jayadevaprakash, N., Yao, B., Fei-Fei, L.: Novel dataset for fine-grained image categorization: Stanford Dogs. In: First Workshop on Fine-Grained Visual Categorization, CVPR (2011)
Ding, J., Xue, N., Xia, G., Dai, D.: Decoupling zero-shot semantic segmentation. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11573–11582 (2022)
Gao, H., Mao, J., Zhou, J., Huang, Z., Wang, L., Xu, W.: Are you talking to a machine? Dataset and methods for multilingual image question. In: Advances in Neural Information Processing Systems, vol. 28 (2015)
Garcia, N., et al.: A dataset and baselines for visual question answering on art. In: Bartoli, A., Fusiello, A. (eds.) Computer Vision – ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part II, pp. 92–108. Springer International Publishing, Cham (2020). https://doi.org/10.1007/978-3-030-66096-3_8
Gu, X., Lin, T.Y., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921 (2021)
Hannan, D., Jain, A., Bansal, M.: MANYMODALQA: modality disambiguation and QA over diverse inputs. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 7879–7886 (2020)
Hwang, W., et al.: Post-OCR parsing: building simple and robust parser via bio tagging. In: Workshop on Document Intelligence at NeurIPS 2019 (2019)
Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International Conference on Machine Learning, pp. 4904–4916. PMLR (2021)
Ju, C., Han, T., Zheng, K., Zhang, Y., Xie, W.: Prompting visual-language models for efficient video understanding. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 105–124. Springer Nature Switzerland, Cham (2022). https://doi.org/10.1007/978-3-031-19833-5_7
Kamath, A., Singh, M., LeCun, Y., Synnaeve, G., Misra, I., Carion, N.: MDETR-modulated detection for end-to-end multi-modal understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1780–1790 (2021)
Kim, G., et al.: OCR-free document understanding transformer. In: European Conference on Computer Vision (ECCV) (2022)
Kim, W., Son, B., Kim, I.: ViLT: vision-and-language transformer without convolution or region supervision. In: International Conference on Machine Learning, pp. 5583–5594. PMLR (2021)
Lee, C.Y., et al.: FormNet: structural encoding beyond sequential modeling in form document information extraction. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2022). https://doi.org/10.18653/v1/2022.acl-long.260, https://aclanthology.org/2022.acl-long.260
Lee, K., et al.: Pix2Struct: screenshot parsing as pretraining for visual language understanding. In: Proceedings of the 40th International Conference on Machine Learning (2023). https://proceedings.mlr.press/v202/lee23g.html
Li, B., Weinberger, K.Q., Belongie, S., Koltun, V., Ranftl, R.: Language-driven semantic segmentation. In: International Conference on Learning Representations (2022). https://openreview.net/forum?id=RriDjddCLN
Li, B., Weinberger, K.Q., Belongie, S.J., Koltun, V., Ranftl, R.: Language-driven semantic segmentation. CoRR abs/2201.03546 (2022). https://arxiv.org/abs/2201.03546
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., Chang, K.W.: VisualBERT: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)
Li, X., et al.: Oscar: object-semantics aligned pre-training for vision-language tasks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX, pp. 121–137. Springer International Publishing, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_8
Lin, W., Chen, J., Mei, J., Coca, A., Byrne, B.: Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering. In: Advances in Neural Information Processing Systems, vol. 36 (2024)
Liu, F., et al.: DePlot: one-shot visual language reasoning by plot-to-table translation. In: Findings of the Association for Computational Linguistics: ACL 2023 (2023). https://doi.org/10.18653/v1/2023.findings-acl.660, https://aclanthology.org/2023.findings-acl.660
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Lu, Y., Guo, C., Dai, X., Wang, F.Y.: Data-efficient image captioning of fine art paintings via virtual-real semantic alignment training. Neurocomputing 490, 163–180 (2022)
Luo, J., Li, Z., Wang, J., Lin, C.Y.: ChartOCR: data extraction from charts images via a deep hybrid framework. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1917–1925 (2021)
Maaz, M., Rasheed, H., Khan, S., Khan, F.S., Anwer, R.M., Yang, M.-H.: Class-agnostic object detection with multi-modal transformer. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part X, pp. 512–531. Springer Nature Switzerland, Cham (2022). https://doi.org/10.1007/978-3-031-20080-9_30
Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151 (2013)
Marty, P.F., Jones, K.B.: Museum Informatics: People, Information, and Technology in Museums, vol. 2. Taylor & Francis (2008)
Masry, A., Long, D.X., Tan, J.Q., Joty, S., Hoque, E.: ChartQA: a benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244 (2022)
Meng, F., et al.: Foundation model is efficient multimodal multitask model selector. arXiv preprint arXiv:2308.06262 (2023)
Ni, B., et al.: Expanding language-image pretrained models for general video recognition. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV, pp. 1–18. Springer Nature Switzerland, Cham (2022). https://doi.org/10.1007/978-3-031-19772-7_1
Nishanbaev, I., Champion, E., McMeekin, D.A.: A survey of geospatial semantic web for cultural heritage. Heritage 2(2), 1471–1498 (2019)
Pham, H., et al.: Combined scaling for zero-shot transfer learning. Neurocomputing 555, 126658 (2023)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Rasheed, H., Khattak, M.U., Maaz, M., Khan, S., Khan, F.S.: Fine-tuned clip models are efficient video learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6545–6554 (2023)
Sheng, S., Van Gool, L., Moens, M.F.: A dataset for multimodal question answering in the cultural heritage domain. In: Proceedings of the COLING 2016 Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH), pp. 10–17. ACL (2016)
Siegel, N., Horvitz, Z., Levin, R., Divvala, S., Farhadi, A.: FigureSeer: parsing result-figures in research papers. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VII, pp. 664–680. Springer International Publishing, Cham (2016). https://doi.org/10.1007/978-3-319-46478-7_41
Singh, A., Hu, R., Goswami, V., Couairon, G., Galuba, W., Rohrbach, M., Kiela, D.: FLAVA: a foundational language and vision alignment model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15638–15650 (2022)
Talmor, A., et al.: MULTIMODALQA: complex question answering over text, tables and images. arXiv preprint arXiv:2104.06039 (2021)
Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD birds-200–2011 dataset. Technical report California Institute of Technology (2011)
Wang, M., Xing, J., Liu, Y.: ActionCLIP: a new paradigm for video action recognition. arXiv preprint arXiv:2109.08472 (2021)
Wang, Z., Wu, Z., Agarwal, D., Sun, J.: MedCLIP: contrastive learning from unpaired medical images and text. arXiv preprint arXiv:2210.10163 (2022)
Wei, Y., et al.: Improving clip fine-tuning performance. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5439–5449 (2023)
Zhai, X., et al.: LiT: zero-shot transfer with locked-image text tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18123–18133 (2022)
Zhang, C., Kaeser-Chen, C., Vesom, G., Choi, J., Kessler, M., Belongie, S.: The iMet collection 2019 challenge dataset. arXiv preprint arXiv:1906.00901 (2019)
Zhang, R., et al.: Tip-Adapter: training-free clip-adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930 (2021)
Zhou, C., Loy, C.C., Dai, B.: Extract free dense labels from CLIP. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII, pp. 696–712. Springer Nature Switzerland, Cham (2022). https://doi.org/10.1007/978-3-031-19815-1_40
Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Conditional prompt learning for vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16816–16825 (2022)
Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language models. Int. J. Comput. Vis. 130(9), 2337–2348 (2022)
Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., Misra, I.: Detecting twenty-thousand classes using image-level supervision. In: European Conference on Computer Vision, pp. 350–368. Springer (2022). https://doi.org/10.1007/978-3-031-20077-9_21
Acknowledgements
This research was partially funded by the Ministry of Education and Science of Bulgaria (support for INSAIT, part of the Bulgarian National Roadmap for Research Infrastructure). We thank the Bulgarian National Archaeological Institute with Museum for their support and guidance, and the British Museum and the Victoria & Albert Museum for access to their data, which made this research possible. We also thank Google DeepMind, which provided vital support and resources for this research, and the anonymous reviewers for their efforts and valuable feedback to improve our work.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Balauca, AA., Paudel, D.P., Toutanova, K., Van Gool, L. (2025). Taming CLIP for Fine-Grained and Structured Visual Understanding of Museum Exhibits. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15134. Springer, Cham. https://doi.org/10.1007/978-3-031-73116-7_22
DOI: https://doi.org/10.1007/978-3-031-73116-7_22
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73115-0
Online ISBN: 978-3-031-73116-7
eBook Packages: Computer Science, Computer Science (R0)