1 Introduction

Birds play a vital role in environmental balance and serve as an excellent indicator of biodiversity [6]. Accurate data on the bird and animal species present in a habitat is important for assessing the quality of the living environment. In ecological studies, monitoring animal populations is crucial, especially in light of the ongoing danger of climate change [24]. Birds are abundant and sensitive to environmental changes, so studying them helps us understand the natural world. Manual identification of birds, however, is tedious and time-consuming, and the scarcity of experts together with human limitations puts an upper limit on how many birds and species can be identified by hand. Many efforts have been made in the past towards environmental conservation and the rescue of endangered animals, and an automated approach to bird species identification fits naturally into this scenario. An automated bird identification system can simplify the analysis of bird diversity and abundance. With such a system, researchers no longer need to search through thick textbooks to organize and categorize their photographs. Combining a bird species detector with other forms of cultural knowledge, such as poetry and mythology, may also be engaging for a community. It may spark public interest in birds, which might have a positive effect on conservation programs.

Image categorization is one of the most important areas of research in machine learning and deep learning. Categorizing the many species of birds is a hard challenge for both human beings and computer programs. Variation in bird shape and size, surroundings, lighting conditions, and extreme poses all create obstacles for object detection algorithms that attempt to accomplish this task automatically.

Various bird species classification methods have been proposed in the past. The literature shows that the majority of the work on bird species categorization relies on one of two input modalities: image or sound. In recent years, most appearance-based research identifies species from a single image using the visual properties of birds. Broadly, there are two types of image-based classification methods: “non-part-based” methods, which use the entire picture for feature extraction, and “part-based” methods, which use the structural properties of each bird [24]. To carry out specific operations such as categorization and species identification, non-part-based approaches employ the colour and shape attributes of the complete bird [14]. The study in the paper [9] detected and categorized bird species of Bangladesh using the VGG-16 model.

The main focus of this work is to classify and identify birds using deep learning models. To evaluate the various models, this study uses the publicly available CUB-200-2011 dataset, and the results are compared on standard evaluation metrics. The contributions of the paper may be summarized as follows:

  • As discussed in the previous paragraphs, there is a lack of significant work specifically on bird species classification. This study uses the CUB-200-2011 dataset [27] to evaluate the proposed model.

  • We have considered 4 different deep learning models with CSPDarknet53 as feature extractor. The YOLOv4 model achieves an accuracy of 95.43% on the test set for 20 classes of birds from the above-mentioned publicly available dataset.

  • Owing to the comprehensive set of methods used in the experiments, the best performance reported here outperforms recent state-of-the-art methods for bird species classification.

The rest of the paper is organized as follows. Section 2 discusses recent contributions on bird species classification and identification in the literature, especially those using machine learning and deep learning methods. Section 3 describes the methodology used in the experiments. Section 4 reports and discusses the results obtained and compares them with recent state-of-the-art approaches. Section 5 concludes the paper with final thoughts and future directions.

2 Literature Survey

This section describes past work on bird species classification and identification. Birds have mainly been classified on the basis of sound and image features. A number of previous studies have identified birds automatically from aural rather than visual signals [10, 17, 29]. In the paper [17], the authors used convolutional neural networks to recognize bird species from audio. They used three CNN-based network structures and a basic ensemble model, with a mean average accuracy of 41.2%. In the paper [10], the authors use ResNet and Inception networks on the BirdCLEF2019 dataset for classification and achieve a mAP of 0.23% for the Inception model. In the paper [29], the authors present a comparative analysis of bird classification experiments. They concluded that experiments on a total of 43 bird species yielded an overall accuracy of 86.31%. Audio signals, however, are only relevant for species with unique calls and when there is no line of sight. If the audio stream is noisy, recognition and categorization become harder. Furthermore, auditory signals have some restrictions that make it difficult to differentiate species.

Due to the limitations of classification from audio signals, the authors of the paper [15] combined appearance features from the Caltech-UCSD Birds-200-2011 dataset [27] with acoustic signals taken from the Xeno-Canto dataset. In the paper [14], the authors compared the appearance features and achieved a higher classification rate, with improvements between 1.2% and 15.7% on machine learning models.

Several studies have classified birds by their appearance using other approaches. In the paper [4], the authors propose appearance features and follow a two-step prediction method to estimate the object. They use the object properties for classification with a cluster-based method and reach 84.5% accuracy. In the paper [11], the authors used an RNN model with Inception Net, trained it on the CVIP dataset, and achieved an F-1 score of 55.67%. The authors of the paper [16] worked on part-based categorization and created a framework using discriminative features. They experimented on the CUB-200-2011 dataset and attained 64.6% mAP.

The authors of the paper [5] used deep learning algorithms to identify and classify bird species on the Caltech-UCSD Birds-200-2011 dataset. They used a DCNN-like layered structure to extract features from the input images. To maximize classification accuracy, various alignments or features such as head, color, body, form, beak, and the whole bird image were extracted using a deep network, which achieved 90.93% classification accuracy. Another deep learning-based model was used in the paper [9] to identify Bangladeshi bird species. For bird species classification, they employed Random Forest, kNN, and SVM with VGG-16. They used a collection of 1600 images of 27 species from Bangladesh without any annotations. SVM was the most accurate of the algorithms tested in that study, with an accuracy of 89%. In the paper [30], the authors worked on a manually constructed dataset of 32,442 images taken by a camera. They employed Haar-like image features, HOG, AdaBoost, and CNN algorithms to categorize bird species around a wind farm. They defined three types of recognition tasks (bird detection, species filtering, and bird species classification) and tested them using images collected at the wind farm. The study concluded that LeNet correctly identified 83% of the hawks at an FPR of 0.1. In another paper [2], the authors used a video dataset of 13 bird species and classified them using Random Forests with 90% accuracy. In the paper [12], the authors compared SVM, K-Means clustering, and deep learning algorithms and observed that deep learning methods generally outperform classical machine learning methods. In the paper [18], the authors discuss classification of 12 bird species using machine learning techniques and obtain 96% accuracy. In the paper [1], the researchers employed regularized softmax with broad classes, together with SVM and transfer learning algorithms, and achieved 70% accuracy in classifying bird species.

The literature review shows that there is considerable scope for improvement in research on bird species classification and identification. Three major issues were identified and are addressed in this work. First, most studies ran their experiments on different datasets with limited data. Second, researchers used a limited number of classes to classify and identify bird species. Third, existing methods are seldom evaluated comprehensively; in this work, we evaluate the methods using all available performance metrics and compare them with recent state-of-the-art methods for 20 classes.

3 Methodology

This section describes the methodology used for bird species classification and identification, which addresses the issues identified in the literature. It covers the dataset, the proposed methodology, and the overall layout for bird species identification.

3.1 Dataset Description

This study uses one of the most popular publicly available image datasets, CUB-200-2011 [27]. The dataset contains 200 categories of North American bird species, each with 40 to 60 images, for a total of 11,788 images. An analysis of the dataset showed that the image collection has not been cleaned or filtered in any way and that the images were shot in natural surroundings. Leaves and branches in the background add to the natural feel of the images but also make identification more complex. Figure 1 depicts a few sample images from the dataset.

Fig. 1. Images of Dataset [27]

The dataset is labeled manually using the LabelImg annotation tool [26]. Figure 2 shows an example of drawing a bounding box and choosing the corresponding class. After labelling, the tool produces an XML/TXT annotation file containing the object description.

Fig. 2. Labelling of image

3.2 Model Architecture

This subsection describes the architecture of the proposed model used for bird species identification and classification. This study uses the YOLOv4 [3] object detection model, a real-time CNN-based system. In a single stage, the YOLOv4 network predicts both the bounding boxes and the classes of objects. Objects are detected directly by applying the model to the image [3, 19, 20, 21]. YOLOv4 comprises a backbone, a neck, and a head. The backbone extracts features, the neck gathers feature maps from different network stages, and the head makes the predictions. Figure 3 illustrates the YOLOv4 architecture.

Fig. 3. Basic Architecture of YOLOv4 [3]

In the backbone, the network first extracts image features using CSPDarknet53, an upgraded version of YOLOv3’s Darknet53. Darknet53 uses skip connections around its consecutive \(3 \times 3\) and \(1 \times 1\) convolutional layers [8]. YOLOv4 modifies Darknet53 with Cross Stage Partial (CSP) networks, giving CSPDarknet53. CSP improves gradient combination to reduce the computation cost of the model. It distributes the computation across the CNN layers to improve model computation and decrease memory cost. The neck contains the additional features of YOLOv4. It is mainly used to gather feature maps from different stages of the backbone. YOLOv4 uses PANet [13] here, an architecture that allows improved propagation of layer information.

A CNN generally requires all images in a dataset to be scaled to a fixed size. During scaling, the region of interest may be distorted or cropped away. To overcome this issue, YOLOv4 uses Spatial Pyramid Pooling (SPP) [7]. SPP produces a fixed-length representation regardless of the size of the input image. It uses max pooling to generate feature maps of fixed size that capture a variety of representations. A Spatial Attention Module (SAM) block [28] is used in this model to enhance the representation of a region of interest by considering only relevant features. In YOLOv4, the head acts as the object detector.
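The fixed-length property of SPP can be illustrated with a minimal sketch. It assumes a single feature-map channel stored as a list of rows and a hypothetical pyramid of 1, 2 and 4 bins per side. It follows the pooling idea of [7] rather than the exact block inside YOLOv4, and the function name `spatial_pyramid_pool` is illustrative only.

```python
def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool one 2-D channel (a list of rows) into a fixed-length list.

    The pyramid levels are an assumption made for illustration.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Row range of bin i; the max() guard avoids empty bins on tiny maps.
            r0 = (i * height) // n
            r1 = max(((i + 1) * height) // n, r0 + 1)
            for j in range(n):
                # Column range of bin j, built the same way.
                c0 = (j * width) // n
                c1 = max(((j + 1) * width) // n, c0 + 1)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


# Two maps of different sizes yield vectors of the same length.
small = [[r * 6 + c for c in range(6)] for r in range(4)]      # 4 x 6 map
large = [[r * 15 + c for c in range(15)] for r in range(9)]    # 9 x 15 map
print(len(spatial_pyramid_pool(small)), len(spatial_pyramid_pool(large)))
```

Both calls return 1 + 4 + 16 = 21 values per channel, which is what allows later layers to expect a fixed input size.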

Fig. 4. Anchor Box [20]

Bounding Box Prediction: The original YOLO has four parameters for each bounding box: x, y, w, and h, where the (x, y) coordinates denote the box centre. The width and height are expressed relative to the total picture size. The second version of YOLO uses anchors and predicted offsets. Predicting offsets instead of coordinates simplifies the model and makes it easier for the network to learn. The network predicts the coordinates \(t_x, t_y\) of each object relative to anchors, each of which is defined by two dimensions: height and width. Figure 4 shows the anchor box.
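As an illustration of offset-based prediction, the sketch below decodes raw network outputs into a box using the anchor-based formulation commonly associated with YOLOv2 and YOLOv3. It is an assumption that this form applies unchanged here, since YOLOv4 implementations may add further scaling. The names `decode_box`, `cell_x`, `cell_y`, `anchor_w`, and `anchor_h` are hypothetical.

```python
import math


def decode_box(t_x, t_y, t_w, t_h, cell_x, cell_y, anchor_w, anchor_h):
    """Convert predicted offsets into a box centre and size (grid-cell units).

    Assumed formulation:
        b_x = cell_x + sigmoid(t_x)      b_w = anchor_w * exp(t_w)
        b_y = cell_y + sigmoid(t_y)      b_h = anchor_h * exp(t_h)
    """
    def sigmoid(v):
        return 1.0 / (1.0 + math.exp(-v))

    b_x = cell_x + sigmoid(t_x)   # centre stays inside the predicting cell
    b_y = cell_y + sigmoid(t_y)
    b_w = anchor_w * math.exp(t_w)  # anchor size rescaled by the prediction
    b_h = anchor_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h


# Zero offsets place the centre in the middle of cell (3, 4) and keep the anchor size.
print(decode_box(0.0, 0.0, 0.0, 0.0, cell_x=3, cell_y=4, anchor_w=2.0, anchor_h=5.0))
```

Because the sigmoid bounds the centre offset to the interval (0, 1), each prediction stays within its own grid cell, which is one reason offsets are easier to learn than absolute coordinates.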

3.3 Other Models Used

Although YOLOv4 performed best overall, the following models were also used in the study and are described briefly.

ResNet [8] uses residual blocks to avoid the vanishing gradient problem in CNN-based models with a large number of layers. This makes it possible to build very deep CNN-based models without degrading the gradient. A residual block consists of several CNN layers connected in series together with a skip connection.
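The skip connection can be sketched in a few lines. This is a toy illustration on plain Python lists, not the convolutional block of [8]: `transform` is a hypothetical stand-in for the stacked layers, and the block returns ReLU(F(x) + x).

```python
def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]


def residual_block(x, transform):
    """Return relu(transform(x) + x), the basic residual computation.

    `transform` is a placeholder for the stacked layers F; it must keep
    the length of x so that the skip connection can be added.
    """
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("transform must preserve the input length")
    return relu([f + xi for f, xi in zip(fx, x)])


# If the stacked layers output zeros, the block passes the input through
# (up to the final ReLU), so the identity path always remains available.
print(residual_block([1.0, -2.0, 3.0], lambda v: [0.0] * len(v)))
```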

MobileNetV2 [23] is a deep learning model that uses depth-wise separable convolutions and residual connections. By integrating residual connections with depth-wise separable convolutions, MobileNetV2 improves operating speed over MobileNetV1.
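The saving from a depth-wise separable convolution can be checked with simple arithmetic. The sketch below counts weights only (biases ignored), and the layer sizes are hypothetical values chosen for illustration.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depth-wise convolution followed by a 1 x 1 point-wise one."""
    depthwise = kernel * kernel * c_in   # one spatial filter per input channel
    pointwise = c_in * c_out             # 1 x 1 convolution mixes the channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 32 input channels, 64 output channels.
print(standard_conv_params(3, 32, 64), depthwise_separable_params(3, 32, 64))
```

For this assumed layer the separable version needs 2,336 weights instead of 18,432.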

Faster R-CNN [22] contains two main components. The first is the Region Proposal Network (RPN): the convolutional feature map generated by the backbone is the input to the RPN, which applies convolutional operations over the entire feature map and outputs the coordinates of regions likely to contain objects. The second is the object detection network, which uses an RoI pooling layer to convert region proposals to a fixed size, followed by two output layers, a softmax classifier and a bounding-box regressor, to predict the objects and their coordinates.

The Inception model [25] uses \(1 \times 1\) convolutional layers to reduce the computation cost. Deep convolutional networks require heavy computation, and \(1 \times 1\) convolutions reduce it effectively. Inserting a \(1 \times 1\) convolution before the \(3 \times 3\) and \(5 \times 5\) convolutions limits the number of input channels they receive. Inception Net delivers high accuracy with less processing than earlier CNN models.
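The effect of the \(1 \times 1\) reduction can be estimated by counting multiply-accumulate operations. The feature-map size and channel counts below are hypothetical values chosen for illustration, not figures taken from [25], and padding is assumed to preserve the spatial size.

```python
def conv_macs(height, width, kernel, c_in, c_out):
    """Multiply-accumulates of a convolution that preserves the spatial size."""
    return height * width * kernel * kernel * c_in * c_out


# Assumed setting: 28 x 28 map, 192 input channels, 32 output channels.
direct = conv_macs(28, 28, 5, 192, 32)

# Same output, but a 1 x 1 convolution first reduces 192 channels to 16.
reduced = conv_macs(28, 28, 1, 192, 16) + conv_macs(28, 28, 5, 16, 32)

print(direct, reduced)
```

Under these assumptions the direct \(5 \times 5\) convolution costs about 120 million operations, while the reduced path costs about 12 million.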

Single Shot Multibox Detector (SSD) classifies and localizes objects in a single feed-forward pass, using feature maps from the feature extraction network. SSD uses VGG-16 with 6 additional layers that progressively reduce the feature map size so that both large and small objects can be detected. Merging the feature maps of these layers provides the required detections.

4 Results and Analysis

This section presents and discusses the experimental results, along with the experimental decisions made during the study.

4.1 Training

The default YOLOv4 configuration file was modified in several respects, including the dataset, number of classes, label map, maximum epochs, step size, and batch size. The entire dataset is divided into training and testing sets comprising 80% and 20% of the original images, respectively, which is the split most researchers use. Experiments are conducted on the 20-class dataset. The hyperparameters used in the experiments include a learning rate of 0.001, a decay of 0.0005, and a momentum of 0.95. Three activation types are used: most layers employ Mish activation to propagate the signal forward and backward, skip connections use linear activation, and deep layers use leaky activation.
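A minimal sketch of an 80/20 split is given below. The shuffling, the fixed seed, and the file names are assumptions made for reproducibility of the example; the text does not state how the images were assigned to each set.

```python
import random


def split_dataset(image_paths, train_fraction=0.8, seed=0):
    """Shuffle the paths deterministically and split them into train and test lists."""
    paths = sorted(image_paths)          # fixed starting order before shuffling
    random.Random(seed).shuffle(paths)   # seeded shuffle (assumed, for repeatability)
    cut = int(round(train_fraction * len(paths)))
    return paths[:cut], paths[cut:]


# Hypothetical file names standing in for the labelled images.
images = [f"bird_{i:04d}.jpg" for i in range(100)]
train, test = split_dataset(images)
print(len(train), len(test))
```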

4.2 Model Loss Trend

Fig. 5. YOLOv4 loss graph for 20 classes

Figure 5 shows the total loss graph of the model. The model was trained on Google Colab Pro for around 10 hours and 6000 iterations. The loss dropped significantly after the 400th iteration, and further iterations show a linear decline. Changes in the loss curve become negligible after step 4800, which indicates the completion of training.

4.3 Result

The results obtained using various models are shown in Table 1. The graphical comparison of accuracy is depicted in Fig. 6. As can be seen, the YOLOv4 model is the best performing among all the models used in the work with 95.43% accuracy and 94.27% F-1 score in bird species identification. Furthermore, SSD turned out to be the worst performing model for the current task of bird species classification.
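For reference, the F-1 score is the harmonic mean of precision and recall. The sketch below computes it from detection counts for a single class. How the scores are averaged across the 20 classes is not stated in the text, so no averaging is shown, and the counts used are hypothetical.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Precision, recall and F-1 from detection counts for one class."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 correct detections, 2 false alarms, 2 misses.
print(precision_recall_f1(8, 2, 2))
```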

Object detection models are also commonly evaluated using metrics based on mean average precision (mAP). Average Precision (AP) is the area under the precision-recall curve, where the precision p(r) is expressed as a function of the recall r, for recall values from 0 to 1.

$$\begin{aligned} AP = \int _{0}^{1}p(r)\,dr \end{aligned}$$
(1)

Mean Average Precision (mAP) is obtained by averaging the Average Precision over the M classes, where \(AP_j\) is the Average Precision of class j. Mathematically, mAP is defined as:

$$\begin{aligned} mAP=\frac{1}{M}\sum _{j=1}^{M}AP_{j} \end{aligned}$$
(2)
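Equations (1) and (2) can be approximated numerically. The sketch below uses a trapezoidal approximation of the area under sampled precision-recall points. This is an assumption for illustration: benchmark toolkits usually apply an interpolated precision, which is not shown here, and the sample points are hypothetical.

```python
def average_precision(recalls, precisions):
    """Trapezoidal approximation of the area under p(r), as in Eq. (1)."""
    points = sorted(zip(recalls, precisions))   # order the samples by recall
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


def mean_average_precision(ap_per_class):
    """Average of the per-class AP values, as in Eq. (2)."""
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical curves for two classes.
ap_a = average_precision([0.0, 1.0], [1.0, 1.0])
ap_b = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.0])
print(ap_a, ap_b, mean_average_precision([ap_a, ap_b]))
```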

The mAP score is calculated by comparing the ground-truth bounding boxes with the detected boxes. The higher the score, the better the model identifies the objects. Table 1 also reports the mAP comparison of the various models. Finally, a few sample outputs for classification and detection by the YOLOv4 and YOLOv5 models are shown in Fig. 7.
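The comparison between a ground-truth box and a detected box is usually expressed as intersection over union (IoU). The text does not name the overlap measure or the threshold used, so the sketch below is an assumption, with boxes given as hypothetical (x1, y1, x2, y2) corner coordinates.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Two 2 x 2 boxes that overlap in a 1 x 1 square.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

A detection is then typically counted as correct when its IoU with a ground-truth box of the same class exceeds a chosen threshold.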

Table 1. Results obtained using various deep learning methods utilized in this work

Fig. 6. Graph showing results of various methods used in this work

Fig. 7. Sample outputs of YOLOv4 and YOLOv5

4.4 Comparison with State-of-the-Art

The proposed methodology outperforms other recent research works on 20-class automatic bird species classification on similar datasets. One study [18] reports slightly better accuracy, but it considers only 12 classes and uses a different dataset, so the difference can be attributed to the smaller number of classes. The accuracy of a model tends to decline with each additional class, and the model in [18] may not scale well when the number of classes is increased to 20 or more. The comparison of the proposed work with the recent state of the art is shown in Table 2 and Fig. 8.

Table 2. Comparison of the proposed methodology with recent state-of-the-art

Fig. 8. Comparison of the proposed methodology with recent state-of-the-art

5 Conclusion and Future Work

In this paper, the authors have investigated the task of bird species classification and detection. The subset of the dataset [27] used in this work contains 20 bird species for classification purposes. Several deep learning models, including YOLOv4, ResNet101, YOLOv5 and SSD, are evaluated for classifying bird species from images. The YOLOv4 model achieves better performance than the other models and outperforms recent state-of-the-art models for 20-class bird species classification. There are several directions in which the effectiveness of the proposed methodology can be enhanced. Firstly, data augmentation methods can add more images per class to the dataset, which makes training more robust and has a positive influence on the model’s overall performance. Secondly, recent transformer-based models may be investigated in future to further improve performance. Finally, the developed models may be converted into a smartphone app that the public can use to identify birds in real time.