Abstract
Frog call classification has received increasing attention due to its importance for ecosystem monitoring. Traditionally, the classification of frog calls has been addressed with single-instance single-label classifiers. However, since different frog species tend to call simultaneously, classifying frog calls becomes a multiple-instance multiple-label (MIML) learning problem. In this paper, we propose a novel method for the classification of frog species using MIML classifiers. To be specific, continuous recordings are first segmented into 10-s audio clips. For each audio clip, acoustic event detection is used to segment frog syllables. Then, three feature sets are extracted from each syllable: the mask descriptor, profile statistics, and the combination of the two. Next, a bag generator is applied to the extracted features. Finally, three MIML classifiers, MIML-SVM, MIML-RBF, and MIML-kNN, are employed to tag each audio clip with the frog species present. Experimental results show that our proposed method can achieve high accuracy (81.8 % true positives/negatives) for frog call classification.
1 Introduction
Recently, human activity and climate change have had a negative effect on frog biodiversity, which makes frog monitoring ever more important. Compared with traditional monitoring methods such as field observation, acoustic sensors have greatly extended acoustic monitoring to larger spatio-temporal scales [1]. Correspondingly, large volumes of acoustic data are generated, which makes it essential to develop automatic analysis methods.
Several papers have already described automated methods for the classification of frog calls. Han et al. combined spectral centroid, Shannon entropy, and Rényi entropy with a k-nearest neighbour classifier for frog call recognition [2]. Gingras et al. proposed a method based on the mean dominant frequency, the coefficient of variation of root-mean-square energy, and spectral flux for anuran classification [3]. Bedoya et al. used Mel-frequency cepstral coefficients (MFCCs) with a fuzzy classifier for the recognition of anuran species [4]. Xie et al. proposed a method based on track duration, dominant frequency, oscillation rate, frequency modulation, and energy modulation for frog call classification [5]. All those previous methods achieve high recognition and classification accuracy, but the recordings used in those studies are assumed to contain only a single frog species each.
Unfortunately, all the recordings used in this study have a low signal-to-noise ratio and contain many overlapping vocal activities, including frogs, birds, and crickets. To address this problem, the classification task is formulated as supervised multiple-instance multiple-label (MIML) learning [6]. In a previous study, Briggs et al. introduced MIML classifiers for the acoustic classification of multiple simultaneous bird species [7]. In their method, a supervised learning classifier was employed to segment acoustic events, which required a large amount of annotation.
In this study, we introduce the MIML algorithm for frog call classification. Rather than using a supervised learning method for syllable segmentation, acoustic event detection is first employed to separate frog syllables. Then, three feature sets, the mask descriptor, profile statistics, and the combination of the two, are calculated from each syllable. After applying a bag generator to the extracted feature sets, three classifiers, MIML-SVM [6], MIML-RBF [8], and MIML-kNN [9], are used for the recognition of multiple simultaneous frog species. Experimental results show that our proposed method can achieve high classification accuracy.
2 Materials and Methods
2.1 Materials
Digital recordings in this study were obtained with a battery-powered, weatherproof Song Meter (SM2) box. Recordings were two-channel, sampled at 22.05 kHz, and saved in WAC4 format. Here, a representative sample of 342 10-s recordings was selected to train and evaluate our proposed algorithm for predicting which frog species are present in a recording. All those recordings were collected between February and March 2014, because this is the frog breeding season with high calling activity. All the species present in each 10-s recording were manually labelled by an ecologist who studies frog calls. There are eight frog species in total: Cane toad (CAD) (\(F_{0}=560\) Hz), Cyclorana novaehollandiae (CNE) (\(F_{0}=610\) Hz), Limnodynastes terraereginae (LTE) (\(F_{0}=610\) Hz), Litoria fallax (LFX) (\(F_{0}=4000\) Hz), Litoria nasuta (LNA) (\(F_{0}=2800\) Hz), Litoria rothii (LRI) (\(F_{0}=1800\) Hz), Litoria rubella (LRA) (\(F_{0}=2300\) Hz), and Uperoleia mimula (UMA) (\(F_{0}=2400\) Hz). Here, \(F_{0}\) is the mean dominant frequency of each frog species. Each recording contains between one and five species. Following prior work [7], we assume that recordings without any frog calls can be detected by acoustic event detection.
2.2 Signal Processing
All the recordings were re-sampled at 16 kHz and mixed to mono. A spectrogram was then generated by applying the short-time Fourier transform to each recording. Specifically, each recording was divided into frames of 512 samples with 50 % frame overlap. A fast Fourier transform with a Hamming window was then performed on each frame, yielding amplitude values for 256 frequency bins, each spanning 31.25 Hz. The final decibel values (S) were computed as \(S_{tf} = 20\log _{10}(A_{tf})\), where \(A_{tf}\) is the amplitude value, \(t=0,\ldots ,T-1\) and \(f=0,\ldots ,F-1\) are the time and frequency indices, and \(T=625\) frames and \(F=256\) frequency bins, respectively.
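As an illustration, the following Python sketch reproduces these settings with SciPy; the file name is hypothetical, and note that SciPy's one-sided STFT returns 257 bins rather than the 256 kept in the paper.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

# Hypothetical input file; assume it has already been re-sampled to 16 kHz.
fs, audio = wavfile.read("recording_10s.wav")
if audio.ndim == 2:
    audio = audio.mean(axis=1)        # mix the two channels to mono

# 512-sample frames, Hamming window, 50 % overlap (hop of 256 samples)
f, t, Z = stft(audio, fs=fs, window="hamming", nperseg=512, noverlap=256)
A = np.abs(Z)                         # amplitude spectrogram (freq x time)
S = 20 * np.log10(A + 1e-10)          # decibel values; eps avoids log(0)
```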
2.3 Acoustic Event Detection for Syllable Segmentation
Acoustic event detection (AED) aims to detect specified acoustic events in an audio stream. In this study, we use AED to segment frog syllables. Since all the recordings are collected from the field, there is much overlapping vocal activity. Traditional methods for audio segmentation are based on time-domain information [10, 11] and cannot handle such recordings. Here, we modified the AED method developed by Towsey et al. [12] to segment recordings with overlapping vocal activity. The details of our AED method are as follows:
Step 1: Wiener filter
To de-noise and smooth the spectrogram, a 2-D Wiener filter is applied to the spectrogram image over a \(5 \times 5\) time-frequency grid, where the filter size is selected as a trade-off between removing the background graininess and blurring the acoustic events:
\(\hat{S}_{tf} = \mu + \frac{\sigma ^{2} - \nu ^{2}}{\sigma ^{2}}\left( S_{tf} - \mu \right) \)
where \(\mu \) and \(\sigma ^{2}\) are the local mean and variance, respectively, and \(\nu ^{2}\) is the noise variance, estimated by averaging all local variances.
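A minimal sketch of this step, assuming SciPy's built-in 2-D Wiener filter, which implements the formula above and estimates \(\nu ^{2}\) from the average of the local variances when no noise power is supplied:

```python
from scipy.signal import wiener

# Apply the 2-D Wiener filter over a 5x5 time-frequency grid;
# S is the decibel spectrogram from the previous step.
S_hat = wiener(S, mysize=(5, 5))
```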
Step 2: Spectral subtraction
After Wiener filtering, the graininess has been removed. However, noise such as wind, insects, and motor engines, which covers the whole recording, cannot be removed this way. Here, a modified spectral subtraction is used to deal with such noise [13].
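The exact modified spectral subtraction of [13] is not reproduced here; the following is a minimal per-frequency-bin sketch, under the assumption that a low percentile of each bin's intensities over time approximates the stationary noise profile:

```python
import numpy as np

def spectral_subtraction(S_hat, percentile=20):
    """Simplified per-bin spectral subtraction (an approximation of the
    modified method in [13]; the percentile value is an assumption).

    A noise profile is estimated for each frequency bin from a low
    percentile of its intensities over time, then subtracted."""
    noise_profile = np.percentile(S_hat, percentile, axis=1, keepdims=True)
    S_sub = S_hat - noise_profile
    return np.clip(S_sub, 0.0, None)   # truncate negative values to zero

S_sub = spectral_subtraction(S_hat)
```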

Step 3: Adaptive thresholding
After noise reduction, the next step is to convert the noise-reduced spectrogram \(\hat{S}^{'}_{tf}\) into the binary spectrogram \(S^{b}_{tf}\) for event detection. Different from the hard threshold in Towsey's work, an adaptive thresholding method, Otsu thresholding, is used to convert the smoothed spectrogram into a binary spectrogram. Otsu's method assumes that the spectrogram is composed of two classes, acoustic events and background noise, and finds the optimal threshold value separating them. After thresholding, each group of contiguous positive pixels is regarded as a candidate event (Fig. 1).
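A sketch of this step using scikit-image, assuming the noise-reduced spectrogram S_sub from the previous step:

```python
from skimage.filters import threshold_otsu
from skimage.measure import label

# Otsu's method picks the threshold that best separates the two classes
# (acoustic events vs. background noise) in the noise-reduced spectrogram.
thresh = threshold_otsu(S_sub)
S_bin = S_sub > thresh

# Each group of contiguous positive pixels becomes a candidate event.
candidates = label(S_bin)   # integer labels, one per connected component
```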
Step 4: Events filtering using dominant frequency and event area
After the aforementioned processing, not all detected events correspond to frog vocalizations. To remove events that do not belong to the frog species listed in Sect. 2.1, the dominant frequency (\(F_{0}\)) and the area within the event boundary (Ar) are used for filtering.
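A hedged sketch of the filtering step; the parameter names (area_min, area_max, f_low, f_high) are illustrative stand-ins for the empirically chosen thresholds of Sect. 4.1, and the centroid frequency is used as a rough stand-in for \(F_{0}\):

```python
from skimage.measure import regionprops

def filter_events(labelled, freq_axis, area_min, area_max, f_low, f_high):
    """Keep candidate events whose area and dominant frequency fall inside
    the given bounds (illustrative names, not the paper's exact values)."""
    kept = []
    for region in regionprops(labelled):
        if not (area_min <= region.area <= area_max):
            continue
        # frequency of the event centroid as a rough dominant-frequency estimate
        f0 = freq_axis[int(round(region.centroid[0]))]
        if f_low <= f0 <= f_high:
            kept.append(region)
    return kept
```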
Step 5: Region growing
A region growing algorithm is utilized to obtain the contour of each acoustic event [14]. To obtain an accurate boundary for each acoustic event and improve the discriminative power of the extracted features, a 2-D region growing algorithm is applied within each segmented event. First, the pixel with the maximal intensity within each segmented event is selected as the seed point. Then, if the difference between a neighbouring pixel and the seed is smaller than the threshold, that pixel is added to the output image. Next, the newly added pixels are used as seeds for further processing, until all pixels that satisfy the criterion have been added to the output image. The final results after region growing are shown in Fig. 2. Here, the threshold value is empirically set to 5 dB.
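A minimal sketch of this step using scikit-image's flood fill, which grows from the seed while pixels stay within the tolerance of the seed value; this approximates, rather than exactly reproduces, the iterative re-seeding described above:

```python
import numpy as np
from skimage.segmentation import flood

def grow_event(S_hat, event_slice, tol_db=5.0):
    """Seeded region growing within one segmented event.

    `event_slice` is the (freq, time) bounding box of a candidate event;
    the 5 dB tolerance follows the paper's empirical threshold."""
    patch = S_hat[event_slice]
    seed = np.unravel_index(np.argmax(patch), patch.shape)  # max-intensity seed
    # grow the region: pixels within tol_db of the seed value are included
    return flood(patch, seed, tolerance=tol_db)
```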

2.4 Feature Extraction
Based on the acoustic event detection results, two feature sets are first calculated to describe each event (syllable): the mask descriptor and profile statistics [7]. Here, we exclude the histogram of orientations from our feature sets, because a previous study has already demonstrated its poor classification performance [7]. The mask descriptor describes the syllable shape and consists of minimum frequency, maximum frequency, bandwidth, duration, area, perimeter, non-compactness, and rectangularity. The profile statistics consist of time-Gini, frequency-Gini, frequency-mean, frequency-variance, frequency-skewness, frequency-kurtosis, frequency-max, time-max, mask-mean, and mask standard deviation. The third feature set is the combination of all features.
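As an illustration, part of the mask descriptor can be computed from a syllable's binary mask with scikit-image; the resolutions (31.25 Hz per bin, 16 ms per frame) follow Sect. 2.2, and non-compactness is omitted for brevity:

```python
from skimage.measure import label, regionprops

def mask_descriptor(mask, freq_res=31.25, time_res=0.016):
    """Sketch of part of the mask descriptor [7] for one syllable mask
    (a 2-D boolean array with frequency as rows and time as columns)."""
    props = regionprops(label(mask.astype(int)))[0]
    f_min_r, t_min, f_max_r, t_max = props.bbox
    min_freq = f_min_r * freq_res
    max_freq = f_max_r * freq_res
    bbox_area = (f_max_r - f_min_r) * (t_max - t_min)
    return {
        "min_freq": min_freq,
        "max_freq": max_freq,
        "bandwidth": max_freq - min_freq,
        "duration": (t_max - t_min) * time_res,
        "area": props.area,
        "perimeter": props.perimeter,
        "rectangularity": props.area / bbox_area,  # fill ratio of bounding box
    }
```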
3 Multiple-Instance Multiple-Label Classifiers
After feature extraction, three MIML algorithms are evaluated for the classification of multiple simultaneous frog calls: MIML-SVM, MIML-RBF, and MIML-kNN. Given some form of bag-level distance measure, the MIML problem is reduced to a single-instance multiple-label problem by associating each bag with a bag-level feature [7]. Here, the maximal and average Hausdorff distances between two bags of syllables are used by MIML-SVM and MIML-RBF, respectively. For MIML-kNN, the nearest neighbours are used to construct the bag-level features.
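For concreteness, the two bag-level distances can be written as follows, a sketch assuming each bag is a NumPy array with one instance (syllable feature vector) per row:

```python
import numpy as np

def max_hausdorff(bag_a, bag_b):
    """Maximal Hausdorff distance between two bags: the largest distance
    from any instance to its nearest neighbour in the other bag."""
    d = np.linalg.norm(bag_a[:, None, :] - bag_b[None, :, :], axis=2)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def avg_hausdorff(bag_a, bag_b):
    """Average Hausdorff distance: mean of every instance's distance to
    its nearest neighbour in the other bag."""
    d = np.linalg.norm(bag_a[:, None, :] - bag_b[None, :, :], axis=2)
    return (d.min(axis=1).sum() + d.min(axis=0).sum()) / (len(bag_a) + len(bag_b))
```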
4 Experiment Results
4.1 Parameter Tuning
There are three modules whose parameters need to be discussed: signal processing, acoustic event detection, and classification. For signal processing, the window size and overlap are 512 samples and 50 %, respectively. For acoustic event detection, four thresholds for event filtering need to be determined: the small and large area thresholds, and the lower and upper frequency boundaries. All those thresholds were determined empirically by applying various combinations of thresholds to a small number of randomly selected 10-s clips. For the MIML-SVM classifier, the parameters (\(C,\gamma ,r\)) are experimentally set to (0.1, 0.6, 0.2). For MIML-RBF, the parameters (\(r, \mu \)) are set to (0.1, 0.6). For MIML-kNN, the number of references (k) and citers (\(k^{'}\)) are 10 and 20, respectively.
4.2 Classification
In this study, all the algorithms were programmed in Matlab 2014b. Each MIML algorithm is evaluated with five-fold cross-validation on the collection of 342 species-labelled recordings. Five measures, Hamming loss, rank loss, one-error, coverage, and micro-AUC, are used to characterize the accuracy of each algorithm [15, 16]; the definition of each measure can be found in [7]. The true positive/negative rate is defined as \(1-\)Hamming loss, and it is 0.818 for MIML-RBF with MD. The mask descriptor (MD), profile statistics (PS), and all features (AF) are fed into the three classifiers, respectively. The performance of each MIML classifier is shown in Table 1. Here, the best classification accuracy is achieved by MIML-RBF using MD. For each classifier, the classification accuracy of MD is higher than that of PS and AF, which shows that the event shape has higher discriminative power than the event content. To give a concrete view of the predictions, the results for five randomly selected recordings using MIML-RBF are shown in Table 2. Recordings No. 1 and No. 3 are accurately predicted.
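As a toy illustration of the accuracy measure, assuming binary label matrices of shape (recordings × 8 species), with hypothetical values:

```python
import numpy as np
from sklearn.metrics import hamming_loss

# Hypothetical label matrices: rows are recordings, columns the 8 species.
# y_true from the ecologist's annotations, y_pred from a MIML classifier.
y_true = np.array([[1, 0, 1, 0, 0, 0, 0, 0],
                   [0, 1, 0, 0, 1, 0, 1, 0]])
y_pred = np.array([[1, 0, 1, 0, 0, 0, 0, 0],
                   [0, 1, 0, 0, 0, 0, 1, 0]])

accuracy = 1 - hamming_loss(y_true, y_pred)  # true positive/negative rate
print(accuracy)  # 0.9375: one wrong label out of 16 in this toy example
```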
5 Conclusion
In this study, we propose a novel method for the classification of multiple simultaneous frog species in environmental recordings. To the best of our knowledge, this is the first study that applies the MIML algorithm to frog calls. Since frogs tend to call simultaneously, the MIML algorithm is more suitable for dealing with such recordings than single-instance single-label classification. After applying the acoustic event detection algorithm to each 10-s recording, frog syllables are segmented. Then, three feature sets are calculated based on the segmented syllables. Finally, three MIML classifiers are used for the classification of frog calls, with the best accuracy being 81.8 % true positives/negatives. Future work will focus on the study of novel features and MIML classifiers for further improving the classification performance.
References
Wimmer, J., Towsey, M., Planitz, B., Williamson, I., Roe, P.: Analysing environmental acoustic data through collaboration and automation. Future Gener. Comput. Syst. 29(2), 560–568 (2013)
Han, N.C., Muniandy, S.V., Dayou, J.: Acoustic classification of Australian Anurans based on hybrid spectral-entropy approach. Appl. Acoust. 72(9), 639–645 (2011)
Gingras, B., Fitch, W.T.: A three-parameter model for classifying Anurans into four genera based on advertisement calls. J. Acoust. Soc. Am. 133(1), 547–559 (2013)
Bedoya, C., Isaza, C., Daza, J.M., López, J.D.: Automatic recognition of Anuran species based on syllable identification. Ecol. Inf. 24, 200–209 (2014)
Xie, J., Towsey, M., Truskinger, A., Eichinski, P., Zhang, J., Roe, P.: Acoustic classification of Australian Anurans using syllable features. In: 2015 IEEE Tenth International Conference on Intelligent Sensors, Sensor Networks and Information Processing (IEEE ISSNIP 2015), Singapore, April 2015
Zhou, Z.-H., Zhang, M.-L.: Multi-instance multi-label learning with application to scene classification. In: Advances in Neural Information Processing Systems, pp. 1609–1616 (2007)
Briggs, F., Lakshminarayanan, B., Neal, L., Fern, X.Z., Raich, R., Hadley, S.J., Hadley, A.S., Betts, M.G.: Acoustic classification of multiple simultaneous bird species: a multi-instance multi-label approach. J. Acoust. Soc. Am. 131(6), 4640–4650 (2012)
Zhang, M.-L., Wang, Z.-J.: MIMLRBF: RBF neural networks for multi-instance multi-label learning. Neurocomputing 72(16), 3951–3956 (2009)
Zhang, M.-L.: A k-nearest neighbor based multi-instance multi-label learning algorithm. In: 22nd IEEE International Conference on Tools with Artificial Intelligence (ICTAI), vol. 2, pp. 207–212. IEEE (2010)
Somervuo, P., et al.: Classification of the harmonic structure in bird vocalization. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004), vol. 5, pp. V–701. IEEE (2004)
Huang, C.-J., Yang, Y.-J., Yang, D.-X., Chen, Y.-J.: Frog classification using machine learning techniques. Expert Syst. Appl. 36(2), 3737–3743 (2009)
Towsey, M., Planitz, B., Nantes, A., Wimmer, J., Roe, P.: A toolbox for animal call recognition. Bioacoustics 21(2), 107–125 (2012)
Xie, J., Towsey, M., Zhang, J., Roe, P.: Image processing and classification procedure for the analysis of Australian frog vocalisations. In: Proceedings of the 2nd International Workshop on Environmental Multimedia Retrieval, ser. EMR 2015, New York, NY, USA, pp. 15–20. ACM (2015)
Mallawaarachchi, A., Ong, S., Chitre, M., Taylor, E.: Spectrogram denoising and automated extraction of the fundamental frequency variation of dolphin whistles. J. Acoust. Soc. Am. 124(2), 1159–1170 (2008)
Zhou, Z.-H., Zhang, M.-L., Huang, S.-J., Li, Y.-F.: MIML: a framework for learning with ambiguous objects. CoRR abs/0808.3231 (2008)
Dimou, A., Tsoumakas, G., Mezaris, V., Kompatsiaris, I., Vlahavas, I.: An empirical study of multi-label learning methods for video annotation. In: Seventh International Workshop on Content-Based Multimedia Indexing, CBMI 2009, pp. 19–24. IEEE (2009)