Pattern Recognition Letters

Volume 27, Issue 2, 15 January 2006, Pages 93-101

Automatic recognition of animal vocalizations using averaged MFCC and linear discriminant analysis

https://doi.org/10.1016/j.patrec.2005.07.004

Abstract

In this paper we propose a method that uses averaged Mel-frequency cepstral coefficients (MFCCs) and linear discriminant analysis (LDA) to automatically identify animals from their sounds. First, each syllable, corresponding to a piece of vocalization, is segmented. The MFCCs averaged over all frames in a syllable are calculated as the vocalization features. LDA, which finds a transformation matrix that minimizes the within-class distance and maximizes the between-class distance, is then applied to reduce the dimensionality of the feature vectors while increasing the classification accuracy. In our experiments, the average classification accuracy is 96.8% for 30 kinds of frog calls and 98.1% for 19 kinds of cricket calls.

Introduction

Many animals generate sounds, either for communication or as a by-product of living activities such as eating, moving, or flying. Automatic recognition of these bioacoustic signals is valuable for applications such as biological research and environmental monitoring, and particularly for detecting and locating animals. In daily life we often hear animal vocalizations without seeing the animals themselves. Since animals generally vocalize to communicate with members of their own species, their vocalizations have evolved to be species-specific. Identifying animal species from their vocalizations is therefore valuable for ecological censusing.

In general, an acoustic signal representing animal vocalizations can be regarded as a sequence of syllables, so a natural way to identify animals from their vocalizations is to use the syllable as the basic acoustic unit. This requires segmenting the vocalization into syllables before the recognition process. Segmentation of speech or audio signals is often based on energy (Lamel et al., 1981, Li et al., 2001, Lu, 2001, Wold et al., 1996, Zhang and Kuo, 2001) and/or zero-crossing rate (Li et al., 2001, Lu, 2001, Tian et al., 2002, Wold et al., 1996, Zhang and Kuo, 2001). The disadvantage of these methods for animal vocalizations is that they cannot extract the full syllable exactly. To overcome this problem, we exploit frequency information to segment the syllables (Harma, 2003), as sketched below.
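To make the frequency-based idea concrete, the following is a minimal Python sketch of a Harma-style segmenter: it repeatedly finds the strongest spectrogram peak and traces it in both directions along the time axis until the per-frame peak amplitude falls a fixed number of decibels below the syllable maximum. The use of SciPy, the window size, and the two thresholds (drop_db, stop_db) are illustrative assumptions on our part, not values taken from the paper.

    import numpy as np
    from scipy.signal import stft

    def segment_syllables(x, fs, drop_db=20.0, stop_db=40.0):
        """Sketch of frequency-based syllable segmentation (Harma-style)."""
        f, t, Z = stft(x, fs=fs, nperseg=512, noverlap=256)
        amp_db = 20.0 * np.log10(np.abs(Z) + 1e-12)  # log-amplitude spectrogram
        peak = amp_db.max(axis=0)                    # strongest bin in each frame
        floor = peak.max() - stop_db                 # stop once peaks are this weak
        syllables = []
        while peak.max() > floor:
            n0 = int(peak.argmax())                  # loudest remaining frame
            thr = peak[n0] - drop_db                 # trace until drop_db below peak
            lo = n0
            while lo > 0 and peak[lo - 1] > thr:
                lo -= 1
            hi = n0
            while hi < len(peak) - 1 and peak[hi + 1] > thr:
                hi += 1
            syllables.append((t[lo], t[hi]))         # syllable start/end in seconds
            peak[lo:hi + 1] = -np.inf                # remove it and find the next
        return sorted(syllables)

Because the tracing follows the dominant spectral peak rather than broadband energy, weak syllable tails are less likely to be cut off than with energy or zero-crossing thresholds.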

Once the syllables have been properly segmented, a set of features is calculated to represent each syllable. The best-known features for speech and speaker recognition are linear predictive coefficients (LPCs) (Rabiner and Juang, 1993) and Mel-frequency cepstral coefficients (MFCCs) (Picone, 1993, Rabiner and Juang, 1993, Vergin et al., 1999). In this paper we use the MFCCs averaged over a syllable to identify animals from their sounds, since MFCCs represent the spectrum of animal sounds in a compact form. The next section describes the proposed recognition method for animal vocalizations.
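A minimal sketch of the averaged-MFCC feature, here using librosa's standard MFCC routine (the library choice and the frame/hop sizes are assumptions for illustration, not the paper's implementation): MFCCs are computed for every frame of a segmented syllable and then averaged over frames, so each syllable becomes a single fixed-length vector regardless of its duration.

    import librosa

    def averaged_mfcc(y, sr, n_mfcc=13):
        """Represent one segmented syllable by its frame-averaged MFCCs."""
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=512, hop_length=256)
        return mfcc.mean(axis=1)  # average over frames -> (n_mfcc,) vector

Averaging over frames also lends the feature some robustness to background noise, since uncorrelated frame-level fluctuations tend to cancel out.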

The proposed recognition method for animal vocalizations

The recognition system consists of two parts: training and recognition. The training part is composed of three main modules: syllable segmentation, averaged-MFCC extraction, and linear discriminant analysis (LDA). The recognition part consists of four modules: syllable segmentation, averaged-MFCC extraction, LDA transformation, and classification. Each module is described in detail below.
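As a sketch of the LDA module, using scikit-learn and placeholder data (all dimensions here are illustrative assumptions): LDA fits a projection that minimizes within-class scatter relative to between-class scatter, and maps each averaged-MFCC vector into a space of at most one fewer dimensions than the number of classes.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    # Placeholder training set: 150 syllables x 13 averaged MFCCs, 5 species.
    X_train = rng.normal(size=(150, 13))
    y_train = rng.integers(0, 5, size=150)

    # Fit the discriminant transform and reduce the feature dimension.
    lda = LinearDiscriminantAnalysis(n_components=4)
    Z_train = lda.fit_transform(X_train, y_train)
    print(Z_train.shape)  # (150, 4): 13-D features projected to 4-D

At recognition time, the same fitted transform is applied to each test syllable's feature vector before classification.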

Experimental results

Two audio databases, containing 30 frog calls and 19 cricket calls derived from compact discs, are used in the experiments (see Table 3, Table 4). The sampling frequency is 44,100 Hz and each sample is digitized to 16 bits. Most of the calls are field recordings with additional sounds in the background, and some are generated by multiple individuals vocalizing simultaneously. Each acoustic signal is first segmented into a set of syllables, half of which are used for training and half for testing.
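The half/half evaluation protocol can be sketched as follows, again with placeholder features; the nearest-centroid classifier in the LDA-transformed space is our assumption standing in for whatever classifier the full text pairs with the transform.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import NearestCentroid
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(600, 13))     # placeholder averaged-MFCC vectors
    y = rng.integers(0, 30, size=600)  # placeholder labels for 30 species

    # Half of the syllables train the model and half test it; stratifying
    # keeps every species represented on both sides of the split.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=0)
    model = make_pipeline(LinearDiscriminantAnalysis(), NearestCentroid())
    model.fit(X_tr, y_tr)
    print("accuracy:", accuracy_score(y_te, model.predict(X_te)))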

Conclusions

In this paper we propose a method for automatically identifying frogs and crickets from the sounds they generate. Each syllable, corresponding to a piece of vocalization, is first segmented. The MFCCs averaged over all frames within a syllable (AMFCC) are used as the vocalization features, which attenuates the effect of background noise. Linear discriminant analysis (LDA) is used to reduce the feature dimension and increase the classification accuracy. Experimental results show an average classification accuracy of 96.8% for 30 kinds of frog calls and 98.1% for 19 kinds of cricket calls.

Acknowledgments

The authors would like to thank the anonymous referees for their valuable comments, which improved the presentation and quality of this paper. This research was supported in part by Chung Hua University under contract CHU-94-TR-02 and the National Science Council of ROC under contract NSC-92-2213-E-216-020.

References (14)

  • D. Li et al., Classification of general audio data for content-based retrieval, Pattern Recognition Letters (2001).
  • M.C. Baker, The chorus song of cooperatively breeding laughing kookaburras: characterization and comparison among groups, Ethology (2004).
  • R. Duda et al., Pattern Classification (2000).
  • A. Harma, Automatic identification of bird species based on sinusoidal modeling of syllables, Internat. Conf. on Acoust. Speech Signal Process. (2003).
  • J.A. Kogan et al., Automated recognition of bird song elements from continuous recordings using DTW and HMMs, Journal of the Acoustical Society of America (1998).
  • L.F. Lamel et al., An improved endpoint detector for isolated word recognition, IEEE Transactions on Acoustics, Speech, and Signal Processing (1981).
  • G.J. Lu, Indexing and retrieval of audio: A survey, Multimedia Tools and Applications (2001).
