Audio parameterization with robust frame selection for improved bird identification
Introduction
Biodiversity monitoring is a prerequisite for sustainable conservation action and is particularly important in efforts to reduce the loss of species (Pereira et al., 2013). Traditionally, animal species distribution, diversity, and population density are assessed with a variety of survey methods that are costly and limited in space and time (e.g., Bibby et al., 2000, Jahn, 2011a, Jahn, 2011b).
Since many animals, such as grasshoppers, crickets, katydids, cicadas, anurans, birds, and certain mammals, are more often heard than seen, one promising non-intrusive method for monitoring their presence and activity is automated acoustic detection and identification. Remote and autonomous survey methods can provide continuous information on the presence/absence of rare and threatened species as well as on the general status of biodiversity in a cost-effective way (e.g., Aide et al., 2013, Ganchev et al., 2015, Potamitis et al., 2014, Sueur, Pavoine, Hamerlynck, Duvail, 2008). Thus, the use of new technologies is considered an opportunity for facilitating biodiversity monitoring efforts in remote and difficult-to-access areas, such as the vast Pantanal wetlands of Brazil (Schuchmann, Marques, Jahn, Ganchev, & Figueiredo, 2014).
Based on soundscapes, it is possible to identify the species present in an area. However, this is not a simple task, since the amount of data to be analyzed is very large, reaching the order of several terabytes per continuous annual cycle of recordings. Consequently, data processing is lengthy and computationally expensive (Oba, 2004). The principal prerequisites for large-scale application of soundscape analysis methods are increased species recognition accuracy and a reduction of the overall computational demands. For that purpose, improvements in both accuracy and speed are required in the audio parameterization and classification methods. In the present work we focus on the audio parameterization.
Nowadays, the statistical machine learning approach dominates the field of bioacoustics. The audio signal is first parameterized, and subsequently the statistical distribution of the audio parameters is modeled. The most widely used modeling techniques for acoustic animal identification are based on the Hidden Markov Model (HMM) (Bardeli et al., 2010, Chu and Blumstein, 2011, Potamitis et al., 2014, Trifa et al., 2008) or its single-state version known as the Gaussian Mixture Model (GMM) (Ganchev et al., 2015, Henríquez et al., 2014). The success of GMM- and HMM-based recognition methods depends on the appropriateness of the audio parameterization process, particularly the segmentation and selection of representative portions of the species-specific sound emissions.
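As an illustration of this modeling approach, the sketch below fits one GMM per species to feature matrices and identifies a test clip by the highest average log-likelihood. The species names, feature dimensions, and randomly generated "MFCC" data are purely hypothetical stand-ins for real field-recording features.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical training data: per-species MFCC feature matrices
# (n_frames x n_coeffs). Real features would come from field recordings.
train = {
    "species_a": rng.normal(0.0, 1.0, size=(500, 13)),
    "species_b": rng.normal(2.0, 1.0, size=(500, 13)),
}

# One GMM per species models the distribution of its audio features.
models = {
    name: GaussianMixture(n_components=4, random_state=0).fit(feats)
    for name, feats in train.items()
}

def identify(features):
    """Return the species whose model yields the highest average
    per-frame log-likelihood for the test feature frames."""
    scores = {name: m.score(features) for name, m in models.items()}
    return max(scores, key=scores.get)

test_clip = rng.normal(2.0, 1.0, size=(200, 13))  # resembles species_b
print(identify(test_clip))
```

An HMM-based recognizer follows the same train-then-score pattern, with per-state emission distributions replacing the single mixture.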
Various strategies for audio parameterization have been reported in the literature. Simple solutions, which incorporate energy-based frame selection methods for eliminating silent portions of the signal, do not depend on prior knowledge about the signal and are quite easy to implement (Zhang & Li, 2013). This is the main reason for their widespread use in environmental sound recognition. However, their accuracy in low signal-to-noise ratio (SNR) conditions is often unsatisfactory.
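A minimal sketch of such an energy-based frame selector, assuming a mono signal and illustrative frame-length and threshold values (not taken from any of the cited works):

```python
import numpy as np

def energy_frame_selection(signal, frame_len=512, hop=256, threshold_db=-30.0):
    """Keep frames whose short-time energy exceeds a threshold relative
    to the loudest frame; silent or noise-only frames are discarded."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    energy = np.sum(frames ** 2, axis=1)
    energy_db = 10.0 * np.log10(energy / energy.max() + 1e-12)
    keep = energy_db > threshold_db
    return frames[keep], keep

# Synthetic example: a tone burst surrounded by silence.
sr = 16000
t = np.arange(sr) / sr
sig = np.zeros(sr)
sig[6000:10000] = 0.5 * np.sin(2 * np.pi * 3000 * t[6000:10000])
selected, mask = energy_frame_selection(sig)
print(mask.sum(), "of", mask.size, "frames kept")
```

With a fixed relative threshold like this, broadband noise at low SNR easily crosses the threshold, which is exactly the weakness noted above.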
In a large-scale experiment on the acoustic identification of 501 bird species, Stowell and Plumbley (2014) applied unsupervised feature learning on raw audio, i.e., without prior segmentation, and reported a species identification accuracy of 42.9%.
Härmä (2003) proposed a method that extracts syllables from bird vocalizations. Huang, Yang, Yang, and Chen (2009) used this approach to classify frogs by determining three different features from the syllables: spectral centroid, signal bandwidth, and threshold-crossing rate. Lee, Han, and Chuang (2008) applied the same algorithm to identify bird sounds by generating Mel Frequency Cepstral Coefficients (MFCCs) from syllables, and Lee, Chou, Han, and Huang (2006) classified animal sounds on the basis of linear discriminant analysis. Other syllabification approaches were studied by Chou, Lee, and Ni (2007), who obtained syllables and clustered them with the fuzzy C-means method, whereas Chou and Liu (2009) used wavelet transformations to determine sections in the bird songs.
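A much-simplified, energy-threshold variant of syllable extraction might look like the sketch below; the original Härmä (2003) method tracks spectral peaks rather than frame energy, and the onset/offset thresholds here are illustrative.

```python
import numpy as np

def extract_syllables(energy_db, onset_db=-20.0, offset_db=-25.0):
    """Simplified syllable extraction: a syllable starts when the
    frame-level energy (in dB relative to the recording maximum)
    rises above `onset_db` and ends when it falls below `offset_db`.
    Returns (start, end) frame indices for each detected syllable."""
    syllables, start = [], None
    for i, e in enumerate(energy_db):
        if start is None and e > onset_db:
            start = i
        elif start is not None and e < offset_db:
            syllables.append((start, i))
            start = None
    if start is not None:
        syllables.append((start, len(energy_db)))
    return syllables

# Toy frame-energy contour (dB re max): two bursts separated by silence.
contour = np.array([-60, -60, -5, 0, -3, -60, -60, -8, -2, -60], dtype=float)
print(extract_syllables(contour))  # → [(2, 5), (7, 9)]
```

Per-syllable features such as the spectral centroid or MFCCs are then computed over each returned index range.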
Juang and Chen (2007) proposed an energy-based method for audio segmentation and subsequent selection of segments with bird song activity. In a related work Acevedo, Corrada-Bravo, Corrada-Bravo, Villanueva-Rivera, and Aide (2009) manually selected portions of interest in the spectrogram, and then compared various machine learning techniques for audio data from frog and bird species. Neal, Briggs, Raich, and Fern (2011) used a Random Forest classifier to implement supervised time and frequency audio segmentation and Evangelista, Priolli, Silla, Angelico, and Kaestner (2014) experimented with sound representation in the frequency domain, energy of the signal, and its spectral centroid to carry out an automatic segmentation of audio.
A more recent approach, based on the idea of treating the sound spectrogram as an image, selects regions of interest in the spectrogram and then extracts their statistical characteristics. The features computed from these regions of interest are used to train machine learning algorithms (Aide et al., 2013, Briggs et al., 2012, Kaewtip et al., 2013, Potamitis, 2014). Likewise, Bardeli (2009) proposed a method in which the sound spectrogram is processed as an image and subsequently used similarity-search techniques to classify a set of animal sounds. In de Oliveira et al. (2015), morphological filtering was employed for bird acoustic activity detection as part of a species-specific recognizer for automated acoustic recognition of Vanellus chilensis vocalizations.
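A minimal sketch of this spectrogram-as-image idea, using thresholding followed by morphological opening to isolate contiguous regions of acoustic activity; the threshold and structuring-element size are illustrative assumptions, not parameters from the cited works.

```python
import numpy as np
from scipy import ndimage

def roi_mask(spectrogram_db, threshold_db=-20.0, structure_size=3):
    """Treat the spectrogram as an image: threshold it relative to its
    peak, then apply morphological opening to erase isolated noise
    pixels while keeping contiguous regions of vocal activity."""
    binary = spectrogram_db > (spectrogram_db.max() + threshold_db)
    structure = np.ones((structure_size, structure_size), dtype=bool)
    return ndimage.binary_opening(binary, structure=structure)

# Synthetic dB spectrogram: background noise plus one bright blob.
rng = np.random.default_rng(1)
spec = rng.uniform(-80.0, -40.0, size=(64, 100))
spec[20:30, 40:60] = -5.0  # a bird-call-like region
mask = roi_mask(spec)
print(mask.sum())  # → 200 (the 10 x 20 blob survives the opening)
```

Statistical descriptors of each connected region in the mask (e.g., via `ndimage.label`) can then serve as classifier inputs.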
Motivated by previous related work, in Section 2 we present an improved audio parameterization method that incorporates robust audio segmentation based on morphological processing of the sound spectrogram considered as an image. Our work differs from previous related work (Aide et al., 2013, Briggs et al., 2012, de Oliveira et al., 2015, Kaewtip et al., 2013, Potamitis, 2014), where morphological filtering of the spectrogram is only part of noise suppression or acoustic activity detection. By contrast, in the current work it is used as part of the robust frame selection that is integrated in the MFCC feature extraction process. As a result, the audio parameterization computes MFCCs only for the selected audio segments, which speeds up the operation. In Section 3 we describe the experimental setup, which involves the classification of short audio recordings of 40 bird species from Mato Grosso, Brazil. The results of a comparative evaluation of the proposed method with three other frame selection approaches (Briggs et al., 2012, Härmä, 2003, Sahidullah and Saha, 2012) are presented in Section 4. Finally, in Section 5 we discuss in detail the advantages and shortcomings of the proposed method and its application area.
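The idea of computing MFCCs only for the frames retained by a selection mask can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the filterbank construction is the standard textbook one, and all frame counts and parameters are assumed.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel filterbank (standard construction)."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def selective_mfcc(frames, keep, sr=16000, n_filters=26, n_ceps=13):
    """Compute MFCCs only for the frames flagged by the selection
    mask `keep` -- skipping rejected frames saves the FFT, filterbank,
    and DCT work for noise-only portions of the recording."""
    fb = mel_filterbank(n_filters, frames.shape[1], sr)
    spectra = np.abs(np.fft.rfft(frames[keep])) ** 2
    logmel = np.log(spectra @ fb.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm="ortho")[:, :n_ceps]

# Toy usage: 61 frames of 512 samples, of which the mask keeps 18.
rng = np.random.default_rng(2)
frames = rng.normal(size=(61, 512))
keep = np.zeros(61, dtype=bool)
keep[22:40] = True
mfcc = selective_mfcc(frames, keep)
print(mfcc.shape)  # → (18, 13)
```

Because the selection mask and the MFCC pipeline share the same framing and spectral analysis, the extra cost of the selection step stays small relative to the savings from the skipped frames.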
Method
Parameterization transforms the audio signals so that useful information is presented in a compact way and irrelevant information is eliminated. The audio features computed during parameterization are next fed to the classification stage (Fig. 1). The latter makes the final decision on the category to which each input audio recording belongs, based on the scores computed from the individual species-specific models.
An effective parameterization is crucial for achieving high recognition
Experimental setup
In the following subsections we briefly outline the common experimental protocol used in the comparative evaluation of the proposed audio parameterization method with other related traditional and recent methods.
Results
In the following subsections we analyze the experimental results of the bird-identification performance evaluation, which involves different frame selection and audio parameterization methods. These methods are compared in terms of identification accuracy, time needed for training the HMM-based species-specific models, and operational speed.
Processing time
We estimated the processing time for the two main processing steps: the audio parameterization and the classification stages (Tables 3–5).
The proposed and the GMM-based methods need nearly the same time for the audio parameterization, with a slight advantage for the proposed method. When compared with the GMM approach, the proposed method is faster by 4.3%. Such a speed-up of computations would make a significant difference only when large quantities of audio recordings are processed, which is
Conclusion
Aiming to improve the audio parameterization process in bird identification tasks, we propose an approach that incorporates robust frame selection based on morphological filtering of the spectrogram treated as an image. The robust frame selection shares common processing steps with the MFCC parameter computation so the two algorithms integrate well with only a small overhead. This approach and the fact that MFCC parameters are computed only for a subset of selected frames speeds up the overall
Acknowledgments
The authors acknowledge the financial support of the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) and the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), the financial and logistic project support by the National Institute for Science and Technology in Wetlands (INAU/UFMT), the Brehm Foundation for International Bird Conservation, Germany, the project OP "Competitiveness" BG161PO003-1.2.04-0044-C0001 financed by the Structural Funds of the European
References (40)
- Automated classification of bird and amphibian calls using machine learning: a comparison of methods. Ecological Informatics (2009)
- Detecting bird sounds in a complex acoustic environment and application to bioacoustic monitoring. Pattern Recognition Letters (2010)
- Automated acoustic detection of Vanellus chilensis lampronotus. Expert Systems with Applications (2015)
- An automatic acoustic bat identification system based on the audible spectrum. Expert Systems with Applications (2014)
- Frog classification using machine learning techniques. Expert Systems with Applications (2009)
- Birdsong recognition using prediction-based recurrent neural fuzzy networks. Neurocomputing (2007)
- Automatic recognition of animal vocalizations using averaged MFCC and linear discriminant analysis. Pattern Recognition Letters (2006)
- Bird acoustic activity detection based on morphological filtering of the spectrogram. Applied Acoustics (2015)
- Automatic bird sound detection in long real-field recordings: applications and tools. Applied Acoustics (2014)
- Environmental sound recognition using double-level energy detection. Journal of Signal and Information Processing (2013)
- Real-time bioacoustics monitoring and automated species identification. PeerJ
- Supervised/unsupervised voice activity detectors for text-dependent speaker recognition on the RSR2015 corpus
- Similarity search in animal sound databases. IEEE Transactions on Multimedia
- Bird Census Techniques
- A tutorial on text-independent speaker verification. EURASIP Journal on Advances in Signal Processing
- Handbook of Image and Video Processing
- Acoustic classification of multiple simultaneous bird species: a multi-instance multi-label approach. The Journal of the Acoustical Society of America
- Morphological processing of spectrograms for speech enhancement. Advances in Nonlinear Speech Processing
- Bird species recognition by comparing the HMMs of syllables
- Bird species recognition by wavelet transformation of a section of birdsong