An Implementation of the “Guess Who?” Game Using CLIP

Sarri, Arnau Martí; Rodriguez-Fernandez, Victor

doi:10.1007/978-3-030-91608-4_41

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 13113)

Included in the following conference series:

International Conference on Intelligent Data Engineering and Automated Learning


Abstract

CLIP (Contrastive Language-Image Pretraining) is an efficient method for learning computer vision tasks from natural language supervision that has powered a recent breakthrough in deep learning due to its zero-shot transfer capabilities. By training on image-text pairs available on the internet, the CLIP model transfers non-trivially to most tasks without the need for any dataset-specific training. In this work, we use CLIP to implement the engine of the popular game “Guess Who?”, so that the player interacts with the game using natural language prompts and CLIP automatically decides whether an image on the game board fulfills that prompt or not. We study the performance of this approach by benchmarking different ways of prompting the questions to CLIP, and show the limitations of its zero-shot capabilities.
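
The decision step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each question is posed as a pair of prompts (one affirming the attribute, one denying it), that the image and both prompts have already been embedded, and that the image fulfills the prompt when the softmax over scaled cosine similarities favours the affirming prompt. The function names, the toy three-dimensional vectors standing in for CLIP embeddings, and the `logit_scale` value of 100 are all assumptions made for illustration.

```python
import math

# Hypothetical stand-ins: in a real system these vectors would come from
# CLIP's image and text encoders; here they are hand-written toy values.


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fulfills_prompt(image_emb, positive_emb, negative_emb, logit_scale=100.0):
    """Return (decision, probability) for the affirming prompt.

    logit_scale is an assumed temperature applied to the similarities
    before the softmax.
    """
    logits = [
        logit_scale * cosine(image_emb, positive_emb),
        logit_scale * cosine(image_emb, negative_emb),
    ]
    p_yes = softmax(logits)[0]
    return p_yes > 0.5, p_yes


def matching_characters(board, positive_emb, negative_emb):
    """Names of the board images judged to fulfill the affirming prompt."""
    return [
        name
        for name, emb in board.items()
        if fulfills_prompt(emb, positive_emb, negative_emb)[0]
    ]


# Toy board: three characters with made-up image embeddings.
board = {
    "character_a": [0.9, 0.1, 0.0],
    "character_b": [0.1, 0.9, 0.0],
    "character_c": [0.7, 0.3, 0.1],
}
# Made-up text embeddings for an affirming and a denying prompt.
has_glasses = [1.0, 0.0, 0.0]
no_glasses = [0.0, 1.0, 0.0]

print(matching_characters(board, has_glasses, no_glasses))
```

Comparing an affirming prompt against a denying one turns an open-ended similarity score into a yes/no answer without a hand-tuned threshold; how the two prompts are worded is exactly the kind of choice the benchmarking in this work examines.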


Notes

1. The CLIP paper (Radford et al., 2021) was accompanied by a blog post: https://openai.com/blog/clip/.
2. https://github.com/openai/CLIP.
3. https://github.com/ArnauDIMAI/CLIP-GuessWho.

References

Conde, M.V., Turgutlu, K.: CLIP-Art: contrastive pre-training for fine-grained art classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 3956–3960 (2021)
Cornia, M., Stefanini, M., Baraldi, L., Cucchiara, R.: Meshed-memory transformer for image captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10578–10587 (2020)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Fang, H., Xiong, P., Xu, L., Chen, Y.: CLIP2Video: mastering video-text retrieval via image clip. arXiv:2106.11097 (2021)
Gao, J., Li, P., Chen, Z., Zhang, J.: A survey on deep learning for multimodal data fusion. Neural Comput. 32, 829–864 (2020)
Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738 (2015)
Nawaz, S., Calefati, A., Caraffini, M., Landro, N., Gallo, I.: Are these birds similar: learning branched networks for fine-grained representations. In: 2019 International Conference on Image and Vision Computing New Zealand (IVCNZ), pp. 1–5 (2019). https://doi.org/10.1109/IVCNZ48456.2019.8960960
Radford, A., et al.: Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020 (2021)
Ramesh, A., et al.: Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092 (2021)
Tan, H., Yu, L., Bansal, M.: Learning to navigate unseen environments: back translation with environmental dropout. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2610–2621 (2019)
Vaswani, A., et al.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)
Zhang, Y., Jiang, H., Miura, Y., Manning, C.D., Langlotz, C.P.: Contrastive learning of medical visual representations from paired images and text. arXiv preprint arXiv:2010.00747 (2020)


Acknowledgements

This work has been partially supported by the company Dimai S.L., the “Convenio Plurianual” with the Universidad Politécnica de Madrid in the line of action “Programa de Excelencia para el Profesorado Universitario”, and by the following research projects: FightDIS (PID2020-117263GB-100), IBERIFIER (2020-EU-IA-0252:29374659), and the CIVIC project (BBVA Foundation Grants for Scientific Research Teams SARS-CoV-2 and COVID-19).

Author information

Authors and Affiliations

Valencian International University, Calle Pintor Sorolla 21, 46002, Valencia, Spain
Arnau Martí Sarri
Dimai S.L., Camí de la Font Calda 10, 08270, Navarcles, Spain
Arnau Martí Sarri
School of Computer Systems Engineering, Universidad Politécnica de Madrid, Calle de Alan Turing, 28038, Madrid, Spain
Victor Rodriguez-Fernandez


Corresponding author

Correspondence to Victor Rodriguez-Fernandez.

Editor information

Editors and Affiliations

University of Manchester, Manchester, UK
Hujun Yin
Universidad Politecnica de Madrid, Madrid, Spain
David Camacho
University of Birmingham, Birmingham, UK
Peter Tino
University of Manchester, Manchester, UK
Richard Allmendinger
University of Huelva, Huelva, Spain
Antonio J. Tallón-Ballesteros
Southern University of Science and Technology, Shenzhen, China
Ke Tang
Yonsei University, Seoul, Korea (Republic of)
Sung-Bae Cho
University of Minho, Braga, Portugal
Paulo Novais
NOVA University of Lisbon, Lisbon, Portugal
Susana Nascimento


About this paper

Cite this paper

Sarri, A.M., Rodriguez-Fernandez, V. (2021). An Implementation of the “Guess Who?” Game Using CLIP. In: Yin, H., et al. Intelligent Data Engineering and Automated Learning – IDEAL 2021. IDEAL 2021. Lecture Notes in Computer Science, vol 13113. Springer, Cham. https://doi.org/10.1007/978-3-030-91608-4_41


DOI: https://doi.org/10.1007/978-3-030-91608-4_41
Published: 23 November 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91607-7
Online ISBN: 978-3-030-91608-4
eBook Packages: Computer Science, Computer Science (R0)
