Abstract
Taxonomic classification is fundamental to biology and conservation, but it is challenging because it requires efficient cross-referencing across a vast taxonomic database. This study examines how well the CLIP model supports classification across taxonomic ranks, using a comprehensive image-caption dataset and features aggregated along the taxonomic hierarchy. The Wikimedia Animals dataset, which contains approximately 203,000 species image-caption pairs with an average of 1–3 images per species, was used to create representations relevant to animal taxonomy and to fine-tune the model across a range of hyperparameters. Our evaluation shows that model performance diverges across taxonomic ranks: the model trained on a compressed representation of classes generalizes best, particularly at the Phylum and Class ranks. These results offer insight into the application of multimodal models to taxonomic classification and point to directions for future research. The large image-caption dataset can also serve as a benchmark for designing models that generalize better on taxonomic classification tasks.
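To make the idea of rank-level classification with a CLIP-style model concrete, the following is a minimal sketch, not the authors' implementation. It assumes image and caption embeddings have already been produced by some encoder (here they are hand-written toy vectors), scores each species caption against an image by cosine similarity, and sums the resulting species probabilities within each higher rank. The taxonomy table, the function names, and the temperature value are all hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict

# Hypothetical toy taxonomy (species -> labels at higher ranks); not from the paper.
TAXONOMY = {
    "Panthera leo": {"phylum": "Chordata", "class": "Mammalia"},
    "Felis catus": {"phylum": "Chordata", "class": "Mammalia"},
    "Apis mellifera": {"phylum": "Arthropoda", "class": "Insecta"},
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def species_probs(image_emb, caption_embs, temperature=0.07):
    """Softmax over temperature-scaled image-caption similarities.

    The temperature is an assumed illustrative value, not a reported setting.
    """
    logits = {s: cosine(image_emb, e) / temperature for s, e in caption_embs.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {s: math.exp(l - top) for s, l in logits.items()}
    total = sum(exps.values())
    return {s: x / total for s, x in exps.items()}


def aggregate_by_rank(probs, rank, taxonomy=TAXONOMY):
    """Sum species probabilities that share the same label at the given rank."""
    out = defaultdict(float)
    for species, p in probs.items():
        out[taxonomy[species][rank]] += p
    return dict(out)


def predict_rank(image_emb, caption_embs, rank):
    """Return the most probable label at the requested taxonomic rank."""
    rank_probs = aggregate_by_rank(species_probs(image_emb, caption_embs), rank)
    return max(rank_probs, key=rank_probs.get)


# Toy embeddings standing in for encoder outputs (assumed values).
caption_embs = {
    "Panthera leo": [0.9, 0.2, 0.0],
    "Felis catus": [0.8, 0.3, 0.1],
    "Apis mellifera": [0.0, 0.1, 1.0],
}
image_emb = [1.0, 0.1, 0.0]
predicted_class = predict_rank(image_emb, caption_embs, "class")
```

Summing probabilities upward is only one possible way to obtain predictions at coarser ranks; a model could instead be trained directly on rank-level labels or captions.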
R.K.E. Chavez and K.G.M. Reynoso—Equal contribution.
Data Availability Statement
The dataset that supports the findings of this study is available here.
Ethics declarations
Competing Interests
The authors declare no competing interests.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Chavez, R.K.E., Reynoso, K.G.M., Raquel, C.R., Naval, P.C. (2024). Leveraging Large Image-Caption Datasets for Multimodal Taxon Classification. In: Nguyen, N.T., et al. Recent Challenges in Intelligent Information and Database Systems. ACIIDS 2024. Communications in Computer and Information Science, vol 2145. Springer, Singapore. https://doi.org/10.1007/978-981-97-5934-7_2
DOI: https://doi.org/10.1007/978-981-97-5934-7_2
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5933-0
Online ISBN: 978-981-97-5934-7
eBook Packages: Computer Science, Computer Science (R0)