Leveraging Large Image-Caption Datasets for Multimodal Taxon Classification

Chavez, Raynor Kirkson E.; Reynoso, Kyle Gabriel M.; Raquel, Carlo R.; Naval, Prospero C.

doi:10.1007/978-981-97-5934-7_2

Part of the book series: Communications in Computer and Information Science (CCIS, volume 2145)

Included in the following conference series:

Asian Conference on Intelligent Information and Database Systems

Abstract

Taxonomic classification is a fundamental aspect of biology and conservation that poses significant challenges due to the necessity for efficient cross-referencing across a vast taxonomic database. This study explores the efficacy of the CLIP model in enhancing classification across taxonomic ranks by assembling a comprehensive image-caption dataset and aggregating features related to taxonomic hierarchy. The Wikimedia Animals dataset, which consists of approximately 203,000 species image-caption pairs and an average of 1–3 images per species, was used to create representations relevant to animal taxonomy and fine-tune the model on a range of hyper-parameters. Our evaluation reveals divergent model performance along distinct taxonomic rank classifications, with the model trained on a compressed representation of classes demonstrating the highest generalization capability, particularly in the Phylum and Class ranks. Our results provide novel insights into the application of multimodal models in taxonomic classification and highlight potential directions for future research in this field. The development of the large image-caption dataset serves as a benchmark to design models that enhance generalizability for taxonomic classification tasks.
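The idea of scoring an image against captions and then reading the result at a coarser taxonomic rank can be illustrated with a small sketch. This is not the authors' implementation: the toy two-dimensional embeddings, the `rank_probabilities` helper, the `taxonomy` mapping, and the temperature value are all assumptions made for illustration. The similarity step only mimics the cosine-similarity-plus-softmax scoring commonly used with CLIP-style models, and the summation over species is one simple way to aggregate predictions up the hierarchy.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rank_probabilities(image_emb, caption_embs, taxonomy, rank, temperature=0.07):
    """Score an image against species captions, then sum the species
    probabilities that share the same label at the requested rank.
    (Hypothetical helper; the temperature is an assumed value.)"""
    species = list(caption_embs)
    logits = [cosine(image_emb, caption_embs[s]) / temperature for s in species]
    probs = softmax(logits)
    totals = defaultdict(float)
    for name, p in zip(species, probs):
        totals[taxonomy[name][rank]] += p
    return dict(totals)

# Toy embeddings standing in for encoder outputs (assumed values).
image_emb = [1.0, 0.0]
caption_embs = {
    "Panthera leo": [0.9, 0.1],
    "Felis catus": [0.8, 0.2],
    "Apis mellifera": [0.0, 1.0],
}
taxonomy = {
    "Panthera leo": {"phylum": "Chordata", "class": "Mammalia"},
    "Felis catus": {"phylum": "Chordata", "class": "Mammalia"},
    "Apis mellifera": {"phylum": "Arthropoda", "class": "Insecta"},
}

phylum_probs = rank_probabilities(image_emb, caption_embs, taxonomy, "phylum")
class_probs = rank_probabilities(image_emb, caption_embs, taxonomy, "class")
print(max(phylum_probs, key=phylum_probs.get))
print(max(class_probs, key=class_probs.get))
```

In this toy setup, two similar species pool their probability at the Phylum and Class ranks, so a coarse-rank prediction can be confident even when the species-level choice is close.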

R.K.E. Chavez and K.G.M. Reynoso contributed equally.

Data Availability Statement

The dataset that supports the findings of this study is available here.

Author information

Authors and Affiliations

Computer Vision and Machine Intelligence Group, Department of Computer Science, University of the Philippines Diliman, Quezon City, 1101, Philippines
Raynor Kirkson E. Chavez, Kyle Gabriel M. Reynoso, Carlo R. Raquel & Prospero C. Naval Jr.

Corresponding authors

Correspondence to Carlo R. Raquel or Prospero C. Naval Jr.

Editor information

Editors and Affiliations

Wrocław University of Science and Technology, Wrocław, Poland
Ngoc Thanh Nguyen
University of Pau and Adour Countries, Pau, France
Richard Chbeir
Open University of Cyprus, Latsia, Cyprus
Yannis Manolopoulos
Iwate Prefectural University, Takizawa, Japan
Hamido Fujita
National University of Kaohsiung, Kaohsiung, Taiwan
Tzung-Pei Hong
Japan Advanced Institute of Science and Technology, Nomi, Japan
Le Minh Nguyen
Wrocław University of Science and Technology, Wrocław, Poland
Krystian Wojtkiewicz

Ethics declarations

Competing Interests

The authors declare no competing interests.

About this paper

Cite this paper

Chavez, R.K.E., Reynoso, K.G.M., Raquel, C.R., Naval, P.C. (2024). Leveraging Large Image-Caption Datasets for Multimodal Taxon Classification. In: Nguyen, N.T., et al. Recent Challenges in Intelligent Information and Database Systems. ACIIDS 2024. Communications in Computer and Information Science, vol 2145. Springer, Singapore. https://doi.org/10.1007/978-981-97-5934-7_2

DOI: https://doi.org/10.1007/978-981-97-5934-7_2
Published: 13 August 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5933-0
Online ISBN: 978-981-97-5934-7
eBook Packages: Computer Science, Computer Science (R0)
