Audio parameterization with robust frame selection for improved bird identification
Introduction
Biodiversity monitoring is a prerequisite for sustainable conservation action and is particularly important in efforts to reduce the loss of species (Pereira et al., 2013). Traditionally, animal species distribution, diversity, and population density are assessed with a variety of survey methods that are costly and limited in space and time (e.g., Bibby et al., 2000, Jahn, 2011a, Jahn, 2011b).
Since many animals, such as grasshoppers, crickets, katydids, cicadas, anurans, birds, and certain mammals, are more often heard than seen, one promising non-intrusive method for monitoring their presence and activity is automated acoustic detection and identification. Remote and autonomous survey methods can provide continuous information on the presence/absence of rare and threatened species as well as on the general status of biodiversity in a cost-effective way (e.g., Aide et al., 2013, Ganchev et al., 2015, Potamitis et al., 2014, Sueur, Pavoine, Hamerlynck, Duvail, 2008). Thus, the use of new technologies is considered an opportunity for facilitating biodiversity monitoring efforts in remote and difficult-to-access areas, such as the vast Pantanal wetlands of Brazil (Schuchmann, Marques, Jahn, Ganchev, & Figueiredo, 2014).
Based on soundscapes, it is possible to identify the species present in an area. However, this is not a simple task, since the amount of data to be analyzed is very large, reaching the order of several terabytes per continuous annual cycle of recordings. Consequently, data processing is lengthy and computationally expensive (Oba, 2004). The principal prerequisites for large-scale application of soundscape analysis methods are increased species recognition accuracy and a reduction of the overall computational demands. For that purpose, improvements in both accuracy and speed are required in the audio parameterization and classification methods. In the present work we focus on the audio parameterization.
Nowadays, the statistical machine learning approach dominates the field of bioacoustics. The audio signal is first parameterized, and subsequently the statistical distribution of the audio parameters is modeled. The most widely used modeling techniques for acoustic animal identification are based on the Hidden Markov Model (HMM) (Bardeli et al., 2010, Chu and Blumstein, 2011, Potamitis et al., 2014, Trifa et al., 2008) or its single-state version known as the Gaussian Mixture Model (GMM) (Ganchev et al., 2015, Henríquez et al., 2014). The success of GMM- and HMM-based recognition methods depends on the appropriateness of the audio parameterization process, particularly the segmentation and selection of representative portions of the species-specific sound emissions.
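As an illustration of this modeling approach, the sketch below fits one GMM per species to feature matrices and identifies a test clip by the highest average log-likelihood. The species names, feature dimensions, and randomly generated "MFCC" data are purely hypothetical stand-ins for real field-recording features.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical training data: per-species MFCC feature matrices
# (n_frames x n_coeffs). Real features would come from field recordings.
train = {
    "species_a": rng.normal(0.0, 1.0, size=(500, 13)),
    "species_b": rng.normal(2.0, 1.0, size=(500, 13)),
}

# One GMM per species models the distribution of its audio features.
models = {
    name: GaussianMixture(n_components=4, random_state=0).fit(feats)
    for name, feats in train.items()
}

def identify(features):
    """Return the species whose model yields the highest average
    per-frame log-likelihood for the test feature frames."""
    scores = {name: m.score(features) for name, m in models.items()}
    return max(scores, key=scores.get)

test_clip = rng.normal(2.0, 1.0, size=(200, 13))  # resembles species_b
print(identify(test_clip))
```

An HMM-based recognizer follows the same train-then-score pattern, with per-state emission distributions replacing the single mixture.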
Various strategies for audio parameterization have been reported in the literature. Simple solutions, which incorporate energy-based frame selection methods for eliminating silent portions of the signal, do not depend on prior knowledge about the signal and are quite easy to implement (Zhang & Li, 2013). This is the main reason for their widespread use in environmental sound recognition. However, their accuracy in low signal-to-noise ratio (SNR) conditions is often unsatisfactory.
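A minimal sketch of such an energy-based frame selector, assuming a mono signal and illustrative frame-length and threshold values (not taken from any of the cited works):

```python
import numpy as np

def energy_frame_selection(signal, frame_len=512, hop=256, threshold_db=-30.0):
    """Keep frames whose short-time energy exceeds a threshold relative
    to the loudest frame; silent or noise-only frames are discarded."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    energy = np.sum(frames ** 2, axis=1)
    energy_db = 10.0 * np.log10(energy / energy.max() + 1e-12)
    keep = energy_db > threshold_db
    return frames[keep], keep

# Synthetic example: a tone burst surrounded by silence.
sr = 16000
t = np.arange(sr) / sr
sig = np.zeros(sr)
sig[6000:10000] = 0.5 * np.sin(2 * np.pi * 3000 * t[6000:10000])
selected, mask = energy_frame_selection(sig)
print(mask.sum(), "of", mask.size, "frames kept")
```

With a fixed relative threshold like this, broadband noise at low SNR easily crosses the threshold, which is exactly the weakness noted above.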
In a large-scale experiment on the acoustic identification of 501 bird species, Stowell and Plumbley (2014) applied unsupervised feature learning on raw audio, i.e., without prior segmentation, and reported a species identification accuracy of 42.9%.
Härmä (2003) proposed a method that extracts syllables from bird vocalizations. Huang, Yang, Yang, and Chen (2009) used this approach to classify frogs by determining three different features from the syllables: spectral centroid, signal bandwidth, and threshold-crossing rate. Lee, Han, and Chuang (2008) applied the same algorithm to identify bird sounds by generating Mel Frequency Cepstral Coefficients (MFCCs) from syllables, and Lee, Chou, Han, and Huang (2006) classified animal sounds on the basis of linear discriminant analysis. Other syllabification approaches were studied by Chou, Lee, and Ni (2007), who obtained syllables and clustered them with the fuzzy C-means method, whereas Chou and Liu (2009) used wavelet transformations to determine sections in the bird songs.
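A much-simplified, energy-threshold variant of syllable extraction might look like the sketch below; the original Härmä (2003) method tracks spectral peaks rather than frame energy, and the onset/offset thresholds here are illustrative.

```python
import numpy as np

def extract_syllables(energy_db, onset_db=-20.0, offset_db=-25.0):
    """Simplified syllable extraction: a syllable starts when the
    frame-level energy (in dB relative to the recording maximum)
    rises above `onset_db` and ends when it falls below `offset_db`.
    Returns (start, end) frame indices for each detected syllable."""
    syllables, start = [], None
    for i, e in enumerate(energy_db):
        if start is None and e > onset_db:
            start = i
        elif start is not None and e < offset_db:
            syllables.append((start, i))
            start = None
    if start is not None:
        syllables.append((start, len(energy_db)))
    return syllables

# Toy frame-energy contour (dB re max): two bursts separated by silence.
contour = np.array([-60, -60, -5, 0, -3, -60, -60, -8, -2, -60], dtype=float)
print(extract_syllables(contour))  # → [(2, 5), (7, 9)]
```

Per-syllable features such as the spectral centroid or MFCCs are then computed over each returned index range.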
Juang and Chen (2007) proposed an energy-based method for audio segmentation and subsequent selection of segments with bird song activity. In a related work Acevedo, Corrada-Bravo, Corrada-Bravo, Villanueva-Rivera, and Aide (2009) manually selected portions of interest in the spectrogram, and then compared various machine learning techniques for audio data from frog and bird species. Neal, Briggs, Raich, and Fern (2011) used a Random Forest classifier to implement supervised time and frequency audio segmentation and Evangelista, Priolli, Silla, Angelico, and Kaestner (2014) experimented with sound representation in the frequency domain, energy of the signal, and its spectral centroid to carry out an automatic segmentation of audio.
A more recent approach, based on the idea of treating the sound spectrogram as an image, selects regions of interest in the spectrogram and then extracts their statistical characteristics. The features computed from these regions of interest are used to train machine learning algorithms (Aide et al., 2013, Briggs et al., 2012, Kaewtip et al., 2013, Potamitis, 2014). Likewise, Bardeli (2009) proposed a method in which the sound spectrogram is processed as an image and subsequently used similarity-search techniques to classify a set of animal sounds. In de Oliveira et al. (2015), morphological filtering was employed for bird acoustic activity detection as part of a species-specific recognizer for automated acoustic recognition of Vanellus chilensis vocalizations.
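A minimal sketch of this spectrogram-as-image idea, using thresholding followed by morphological opening to isolate contiguous regions of acoustic activity; the threshold and structuring-element size are illustrative assumptions, not parameters from the cited works.

```python
import numpy as np
from scipy import ndimage

def roi_mask(spectrogram_db, threshold_db=-20.0, structure_size=3):
    """Treat the spectrogram as an image: threshold it relative to its
    peak, then apply morphological opening to erase isolated noise
    pixels while keeping contiguous regions of vocal activity."""
    binary = spectrogram_db > (spectrogram_db.max() + threshold_db)
    structure = np.ones((structure_size, structure_size), dtype=bool)
    return ndimage.binary_opening(binary, structure=structure)

# Synthetic dB spectrogram: background noise plus one bright blob.
rng = np.random.default_rng(1)
spec = rng.uniform(-80.0, -40.0, size=(64, 100))
spec[20:30, 40:60] = -5.0  # a bird-call-like region
mask = roi_mask(spec)
print(mask.sum())  # → 200 (the 10 x 20 blob survives the opening)
```

Statistical descriptors of each connected region in the mask (e.g., via `ndimage.label`) can then serve as classifier inputs.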
Motivated by previous related work, in Section 2 we present an improved audio parameterization method that incorporates robust audio segmentation based on morphological processing of the sound spectrogram considered as an image. Our work differs from previous related work (Aide et al., 2013, Briggs et al., 2012, de Oliveira et al., 2015, Kaewtip et al., 2013, Potamitis, 2014), where morphological filtering of the spectrogram is only part of noise suppression or acoustic activity detection. By contrast, in the current work it is used as part of the robust frame selection that is integrated in the MFCC feature extraction process. As a result, the audio parameterization computes MFCCs only for the selected audio segments, which speeds up the operation. In Section 3 we describe the experimental setup, which involves the classification of short audio recordings of 40 bird species from Mato Grosso, Brazil. The results of a comparative evaluation of the proposed method with three other frame selection approaches (Briggs et al., 2012, Härmä, 2003, Sahidullah and Saha, 2012) are presented in Section 4. Finally, in Section 5 we discuss in detail the advantages and shortcomings of the proposed method and its application area.
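The idea of computing MFCCs only for the frames retained by a selection mask can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the filterbank construction is the standard textbook one, and all frame counts and parameters are assumed.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel filterbank (standard construction)."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def selective_mfcc(frames, keep, sr=16000, n_filters=26, n_ceps=13):
    """Compute MFCCs only for the frames flagged by the selection
    mask `keep` -- skipping rejected frames saves the FFT, filterbank,
    and DCT work for noise-only portions of the recording."""
    fb = mel_filterbank(n_filters, frames.shape[1], sr)
    spectra = np.abs(np.fft.rfft(frames[keep])) ** 2
    logmel = np.log(spectra @ fb.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm="ortho")[:, :n_ceps]

# Toy usage: 61 frames of 512 samples, of which the mask keeps 18.
rng = np.random.default_rng(2)
frames = rng.normal(size=(61, 512))
keep = np.zeros(61, dtype=bool)
keep[22:40] = True
mfcc = selective_mfcc(frames, keep)
print(mfcc.shape)  # → (18, 13)
```

Because the selection mask and the MFCC pipeline share the same framing and spectral analysis, the extra cost of the selection step stays small relative to the savings from the skipped frames.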
Method
Parameterization transforms the audio signals so that useful information is presented in a compact way and irrelevant information is eliminated. The audio features computed during parameterization are next fed to the classification stage (Fig. 1). The latter makes the final decision on the category to which each input audio recording belongs, based on the scores computed from the individual species-specific models.
An effective parameterization is crucial for achieving high recognition
Experimental setup
In the following subsections we briefly outline the common experimental protocol used in the comparative evaluation of the proposed audio parameterization method with other related traditional and recent methods.
Results
In the following subsections we analyze the experimental results of the bird-identification performance evaluation, which involves different frame selection and audio parameterization methods. These methods are compared in terms of identification accuracy, time needed for training the HMM-based species-specific models, and operational speed.
Processing time
We estimated the processing time for the two main processing steps: the audio parameterization and the classification stages (Tables 3–5).
The proposed and the GMM-based methods need nearly the same time for the audio parameterization, with a slight advantage for the proposed method. When compared with the GMM approach, the proposed method is faster by 4.3%. Such a speed-up of computations would make a significant difference only when large quantities of audio recordings are processed, which is
Conclusion
Aiming to improve the audio parameterization process in bird identification tasks, we propose an approach that incorporates robust frame selection based on morphological filtering of the spectrogram treated as an image. The robust frame selection shares common processing steps with the MFCC parameter computation so the two algorithms integrate well with only a small overhead. This approach and the fact that MFCC parameters are computed only for a subset of selected frames speeds up the overall
Acknowledgments
The authors acknowledge the financial support of the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) and the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), the financial and logistic project support by the National Institute for Science and Technology in Wetlands (INAU/UFMT), the Brehm Foundation for International Bird Conservation, Germany, the project OP "Competitiveness" BG161PO003-1.2.04-0044-C0001 financed by the Structural Funds of the European
References (40)
- Automated classification of bird and amphibian calls using machine learning: a comparison of methods. Ecological Informatics (2009)
- Detecting bird sounds in a complex acoustic environment and application to bioacoustic monitoring. Pattern Recognition Letters (2010)
- Automated acoustic detection of Vanellus chilensis lampronotus. Expert Systems with Applications (2015)
- An automatic acoustic bat identification system based on the audible spectrum. Expert Systems with Applications (2014)
- Frog classification using machine learning techniques. Expert Systems with Applications (2009)
- Birdsong recognition using prediction-based recurrent neural fuzzy networks. Neurocomputing (2007)
- Automatic recognition of animal vocalizations using averaged MFCC and linear discriminant analysis. Pattern Recognition Letters (2006)
- Bird acoustic activity detection based on morphological filtering of the spectrogram. Applied Acoustics (2015)
- Automatic bird sound detection in long real-field recordings: applications and tools. Applied Acoustics (2014)
- Environmental sound recognition using double-level energy detection. Journal of Signal and Information Processing (2013)
- Real-time bioacoustics monitoring and automated species identification. PeerJ
- Supervised/unsupervised voice activity detectors for text-dependent speaker recognition on the RSR2015 corpus
- Similarity search in animal sound databases. IEEE Transactions on Multimedia
- Bird Census Techniques
- A tutorial on text-independent speaker verification. EURASIP Journal on Advances in Signal Processing
- Handbook of Image and Video Processing
- Acoustic classification of multiple simultaneous bird species: a multi-instance multi-label approach. The Journal of the Acoustical Society of America
- Morphological processing of spectrograms for speech enhancement. Advances in Nonlinear Speech Processing
- Bird species recognition by comparing the HMMs of syllables
- Bird species recognition by wavelet transformation of a section of birdsong