Automatic recognition of animal vocalizations using averaged MFCC and linear discriminant analysis
Introduction
Many animals generate sounds, either for communication or as a by-product of living activities such as eating, moving, or flying. Automatic recognition of bioacoustic sounds is valuable for applications such as biological research and environmental monitoring, particularly for detecting and locating animals. In daily life we often hear animal vocalizations without seeing the animals themselves. Because animals generally vocalize to communicate with members of their own species, their vocalizations have evolved to be species-specific. Identifying animal species from their vocalizations is therefore valuable for ecological censusing.
In general, an acoustic signal representing animal vocalizations can be regarded as a sequence of syllables, so a natural way to identify animals from their vocalizations is to use the syllable as the basic acoustic unit. The syllables must therefore be segmented before the recognition process. Segmentation of speech or audio signals is often based on energy (Lamel et al., 1981, Li et al., 2001, Lu, 2001, Wold et al., 1996, Zhang and Kuo, 2001) and/or zero-crossing rate (Li et al., 2001, Lu, 2001, Tian et al., 2002, Wold et al., 1996, Zhang and Kuo, 2001). A disadvantage of these methods for animal vocalizations is that they often fail to extract the full extent of a syllable. To overcome this problem, we exploit frequency-domain information to segment the syllables of animal vocalizations (Harma, 2003).
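As a rough sketch of the frequency-based idea in Harma (2003): repeatedly find the frame with the strongest spectral peak, grow the syllable outward until the peak level drops a fixed number of dB below that maximum, then remove the syllable and repeat. The numpy code below is a minimal illustration, not the paper's exact algorithm; the `drop_db` and `stop_db` thresholds are assumed values chosen for demonstration.

```python
import numpy as np

def segment_syllables(spec, drop_db=20.0, stop_db=35.0):
    """Greedy spectral-peak syllable segmentation (sketch after Harma, 2003).

    spec: magnitude spectrogram, shape (n_freq_bins, n_frames).
    Returns (start_frame, end_frame) pairs, loudest syllable first.
    """
    db = 20.0 * np.log10(spec.max(axis=0) + 1e-12)   # per-frame peak level, dB
    global_max = db.max()
    bounds = []
    while True:
        t0 = int(np.argmax(db))                      # loudest remaining frame
        if db[t0] < global_max - stop_db:            # nothing loud enough left
            break
        thr = db[t0] - drop_db                       # syllable edge threshold
        s = t0
        while s > 0 and db[s - 1] > thr:             # grow leftwards
            s -= 1
        e = t0
        while e < len(db) - 1 and db[e + 1] > thr:   # grow rightwards
            e += 1
        bounds.append((s, e))
        db[s:e + 1] = -np.inf                        # remove syllable, repeat
    return bounds
```

On a spectrogram containing two loud bursts over a quiet background, this returns exactly those two frame ranges; real recordings need the thresholds tuned to the noise floor.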
Once the syllables have been segmented, a set of features is computed to represent each syllable. The best-known features for speech/speaker recognition are linear predictive coefficients (LPCs) (Rabiner and Juang, 1993) and Mel-frequency cepstral coefficients (MFCCs) (Picone, 1993, Rabiner and Juang, 1993, Vergin et al., 1999). In this paper, we use the MFCCs averaged over a syllable to identify animals from their sounds, because MFCCs represent the spectrum of animal sounds in a compact form. The next section describes the proposed recognition method for animal vocalizations.
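The averaged-MFCC idea can be sketched as: frame the syllable, compute MFCCs per frame, and take the mean over all frames to obtain one compact feature vector per syllable. The numpy sketch below uses common textbook choices (Hamming window, 20 mel filters, 12 coefficients, 512-sample frames); the paper's exact parameters are not given in this snippet, so these values are assumptions.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr, fmin=0.0, fmax=None):
    """Triangular mel filterbank, shape (n_filters, n_fft // 2 + 1)."""
    fmax = fmax or sr / 2.0
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv(np.linspace(mel(fmin), mel(fmax), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def averaged_mfcc(syllable, sr, n_mfcc=12, frame_len=512, hop=256, n_filters=20):
    """Frame the syllable, take MFCCs per frame, average over frames."""
    window = np.hamming(frame_len)
    fb = mel_filterbank(n_filters, frame_len, sr)
    frames = [syllable[i:i + frame_len] * window
              for i in range(0, len(syllable) - frame_len + 1, hop)]
    mags = np.abs(np.fft.rfft(frames, axis=1))       # (n_frames, n_fft//2+1)
    logmel = np.log(mags @ fb.T + 1e-10)             # log mel-band energies
    # type-II DCT of log-mel energies -> cepstral coefficients (c1..c12)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(1, n_mfcc + 1), n + 0.5) / n_filters)
    mfcc = logmel @ dct.T                            # (n_frames, n_mfcc)
    return mfcc.mean(axis=0)                         # averaged MFCC vector
```

Averaging over frames is what gives each syllable a single fixed-length feature vector, regardless of syllable duration, and tends to smooth out transient background noise.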
Section snippets
The proposed recognition method for animal vocalizations
The recognition system consists of two parts: training and recognition. The training part comprises three main modules: syllable segmentation, averaged-MFCC extraction, and linear discriminant analysis (LDA). The recognition part comprises four modules: syllable segmentation, averaged-MFCC extraction, LDA transformation, and classification. Each module is described in detail below.
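As a sketch of the last two recognition modules, Fisher LDA can be fitted on the training feature vectors and a test syllable assigned to the class whose projected mean is nearest. The snippet does not state which classifier the paper uses after the LDA transform, so the nearest-class-mean rule below is an assumption for illustration.

```python
import numpy as np

def lda_fit(X, y, n_dims):
    """Fisher LDA: projection maximizing between-/within-class scatter ratio.

    X: (n_samples, n_features) training features; y: integer class labels.
    Returns a (n_features, n_dims) projection matrix.
    """
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))                             # within-class scatter
    Sb = np.zeros((d, d))                             # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # eigenvectors of Sw^-1 Sb; small ridge keeps Sw invertible
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(evals.real)[::-1]
    return evecs.real[:, order[:n_dims]]

def nearest_mean_classify(W, X_train, y_train, x):
    """Project with W, then pick the class whose projected mean is closest."""
    z = x @ W
    classes = np.unique(y_train)
    means = np.array([X_train[y_train == c].mean(axis=0) @ W for c in classes])
    return classes[np.argmin(np.linalg.norm(means - z, axis=1))]
```

Reducing the dimension with LDA before classification both lowers the cost of the distance computation and, as the paper argues, can improve accuracy by discarding directions that do not separate the classes.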
Experimental results
Two audio databases, of 30 frog calls and 19 cricket calls, derived from compact discs are used for the experiments (see Table 3, Table 4). The sampling frequency is 44,100 Hz and each sample is digitized with 16 bits. Most of the calls are field recordings with additional background sounds, and some contain multiple individuals vocalizing simultaneously. Each acoustic signal is first segmented into a set of syllables, half of which are used for training and half for testing.
Conclusions
In this paper we propose a method that automatically identifies frogs and crickets from the sounds they generate. Each syllable, corresponding to a piece of vocalization, is first segmented. The MFCCs averaged over all frames within a syllable (AMFCC) are used as vocalization features, which attenuates the effect of background noise. Linear discriminant analysis (LDA) is then used to reduce the feature dimension and increase the classification accuracy. Experimental results have shown…
Acknowledgments
The authors would like to thank the anonymous referees for their valuable comments, which improved the presentation and quality of this paper. This research was supported in part by Chung Hua University under contract CHU-94-TR-02 and the National Science Council of ROC under contract NSC-92-2213-E-216-020.
References (14)
- Lamel et al., 1981. An improved endpoint detector for isolated word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing.
- Li et al., 2001. Classification of general audio data for content-based retrieval. Pattern Recognition Letters.
- Lu, 2001. Indexing and retrieval of audio: A survey. Multimedia Tools and Applications.
- Harma, 2003. Automatic identification of bird species based on sinusoidal modeling of syllables. Internat. Conf. on Acoust. Speech Signal Process.
- 1998. Automated recognition of bird song elements from continuous recordings using DTW and HMMs. Journal of the Acoustical Society of America.
- 2000. Pattern Classification.
- 2004. The chorus song of cooperatively breeding laughing kookaburras: characterization and comparison among groups. Ethology.