Leveraging Large Image-Caption Datasets for Multimodal Taxon Classification

  • Conference paper
Recent Challenges in Intelligent Information and Database Systems (ACIIDS 2024)

Abstract

Taxonomic classification is a fundamental aspect of biology and conservation that poses significant challenges due to the necessity for efficient cross-referencing across a vast taxonomic database. This study explores the efficacy of the CLIP model in enhancing classification across taxonomic ranks by assembling a comprehensive image-caption dataset and aggregating features related to taxonomic hierarchy. The Wikimedia Animals dataset, which consists of approximately 203,000 species image-caption pairs and an average of 1–3 images per species, was used to create representations relevant to animal taxonomy and fine-tune the model on a range of hyper-parameters. Our evaluation reveals divergent model performance along distinct taxonomic rank classifications, with the model trained on a compressed representation of classes demonstrating the highest generalization capability, particularly in the Phylum and Class ranks. Our results provide novel insights into the application of multimodal models in taxonomic classification and highlight potential directions for future research in this field. The development of the large image-caption dataset serves as a benchmark to design models that enhance generalizability for taxonomic classification tasks.
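The rank-wise evaluation described in the abstract can be illustrated with a minimal sketch. It assumes CLIP-style matching, where an image embedding is assigned to the label whose text embedding has the highest cosine similarity, and it scores accuracy separately for each taxonomic rank. The embeddings, labels, and function names below are hypothetical toy values chosen for illustration; they are not the authors' data, code, or results.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def classify(image_emb, label_embs):
    """Return the label whose text embedding is closest to the image embedding."""
    return max(label_embs, key=lambda label: cosine(image_emb, label_embs[label]))

def rank_accuracy(predicted, actual, ranks):
    """Fraction of samples classified correctly, computed per taxonomic rank."""
    return {
        rank: sum(p[rank] == a[rank] for p, a in zip(predicted, actual)) / len(actual)
        for rank in ranks
    }

# Hypothetical text embeddings for the candidate labels at each rank.
label_embs = {
    "phylum": {"Chordata": [1.0, 0.0, 0.0], "Arthropoda": [0.0, 1.0, 0.0]},
    "class": {
        "Mammalia": [0.9, 0.1, 0.4],
        "Aves": [0.9, 0.1, -0.4],
        "Insecta": [0.1, 0.9, 0.0],
    },
}

# Hypothetical image embeddings and their true lineages.
image_embs = [[0.8, 0.1, 0.5], [0.2, 0.9, 0.1], [0.9, 0.0, -0.1]]
actual = [
    {"phylum": "Chordata", "class": "Mammalia"},
    {"phylum": "Arthropoda", "class": "Insecta"},
    {"phylum": "Chordata", "class": "Mammalia"},
]

ranks = ["phylum", "class"]
predicted = [
    {rank: classify(emb, label_embs[rank]) for rank in ranks} for emb in image_embs
]
accuracy = rank_accuracy(predicted, actual, ranks)
print(accuracy)
```

In this toy setup the third image is matched to the wrong class but the correct phylum, so accuracy differs between the two ranks. A real pipeline would obtain the embeddings from a fine-tuned CLIP image and text encoder rather than from hand-written vectors.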

R.K.E. Chavez and K.G.M. Reynoso contributed equally.

Data Availability Statement

The dataset that supports the findings of this study is available here.

Notes

  1. https://www.wikidata.org/wiki/Wikidata:WikiProject_Taxonomy

  2. https://commons.wikimedia.org/wiki/Main_Page

Author information

Corresponding authors

Correspondence to Carlo R. Raquel or Prospero C. Naval Jr.

Ethics declarations

Competing Interests

The authors declare no competing interests.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Chavez, R.K.E., Reynoso, K.G.M., Raquel, C.R., Naval, P.C. (2024). Leveraging Large Image-Caption Datasets for Multimodal Taxon Classification. In: Nguyen, N.T., et al. Recent Challenges in Intelligent Information and Database Systems. ACIIDS 2024. Communications in Computer and Information Science, vol 2145. Springer, Singapore. https://doi.org/10.1007/978-981-97-5934-7_2

  • DOI: https://doi.org/10.1007/978-981-97-5934-7_2

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-5933-0

  • Online ISBN: 978-981-97-5934-7

  • eBook Packages: Computer Science (R0)
