1 Introduction

Detection and recognition of animal species are among the most fundamental challenges in biodiversity research and conservation monitoring, in particular for the assessment of bioindicator species that provide valuable information on the health of their ecosystem. The highly diverse bats (order Chiroptera) are such bioindicators, and therefore the determination of bat activity and bat species abundance is an essential task in conservation and ecology projects [7]. Acoustic identification of bat species relies on the observation and recognition of species-specific call patterns in sonograms of ultrasonic recordings [6]. However, this “manual” identification is very time-consuming and requires a high degree of training and skill. Current bat monitoring needs, particularly the monitoring of animals around wind energy turbines, produce huge datasets that can profit greatly from computer-assisted identification methods. To identify the bat call components in the ultrasonic frequency range, most current techniques work with a time-frequency representation of the audio signal. However, frequent background noises (insects, footsteps, other electronic devices and power lines) and signal distortions (over-gained microphones) may hamper signal detection in field recordings. Moreover, other signals such as insect vocalizations or rain drops may have a frequency distribution in the ultrasonic range. For this reason, methods based on the calculation of the instantaneous dominant frequency, such as zero-crossing, are not suitable for detection. Some studies have shown that detecting bat calls from spectrograms, e.g., with a spectral peak detector, is more effective [12]. Nevertheless, at low signal-to-noise ratios, model-based detection has been shown to produce fewer false positive detections than spectral peak detection [11].
Recently, Aodha et al. [8] proposed a detection method based on deep neural networks in which representation learning is performed automatically. Their results showed that deep learning methods can perform better than other existing detection methods. Nevertheless, we consider that the metric applied is sensitive to an imbalanced ratio of false and true positive detections.

In this study, we propose a method for the detection of bat echolocation calls that uses a statistical model-based Voice Activity Detector combined with an ensemble of tree classifiers using the Random Forests algorithm. We use global evaluation curves to visualize the detector performance over the full range of possible distributions of positive and negative detections. Our approach aims to accurately detect a more extensive and diverse set of bat species within complex and noisy soundscapes.

2 Methods

The objective of the following techniques is to detect and locate ultrasound signals of bat echolocation calls within broadband recordings from natural environments. In the following, we describe the methods that are later compared in a performance test.

2.1 Energy-Based Detector

This detection method has been used in many studies as the baseline detector [11, 12] and consists of a short-term broadband energy signal. Usually, this signal is calculated from a Time-Frequency (TF) representation X of the audio signal \(x\left( t\right) \). Using the Short Time Fourier Transform (STFT), we estimate a TF representation \(X\left( t_m,f_k\right) \) with discrete variables \( t_m=m\varDelta t\) and \(f_k=k \varDelta f\) with \( m=1,2,\ldots ,M\) and \(k=1,2,\ldots ,K\). Thereafter, the mean energy signal is defined as

$$\begin{aligned} EBD_{mean}\left( t_m\right) = \frac{1}{K}\sum _{f_k}|X \left( t_m,f_k\right) |^2 \end{aligned}$$
(1)

Another alternative is to calculate the peak energy as follows

$$\begin{aligned} EBD_{max}\left( t_m\right) ={\mathop {\hbox {max}}\limits _{f_k}}{|X \left( t_m,f_k\right) |^2} \end{aligned}$$
(2)

To exclude irrelevant noise in the sonic range, the TF representation is filtered using a band-pass filter with cut-off frequencies of 12.5 kHz and 250 kHz. Thereafter, a time frame \(t_m\) is said to contain a bat call if the energy signal \(EBD(t_m)\) is above some predefined threshold \(\theta \).
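The energy-based detector of Eqs. 1 and 2 can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in the study: the function names `stft_power` and `energy_detector` are hypothetical, the window length and overlap are illustrative defaults, and the band-pass is approximated by simply discarding STFT bins outside the pass band.

```python
import numpy as np

def stft_power(x, fs, win_ms=2.5, overlap=0.75):
    """Power spectrogram |X(t_m, f_k)|^2 from Hamming-windowed frames.

    win_ms and overlap are illustrative defaults (assumption).
    """
    n_win = int(round(fs * win_ms / 1000.0))
    hop = max(1, int(round(n_win * (1.0 - overlap))))
    window = np.hamming(n_win)
    n_frames = 1 + (len(x) - n_win) // hop
    frames = np.stack([x[i * hop:i * hop + n_win] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # shape (M, K)
    freqs = np.fft.rfftfreq(n_win, d=1.0 / fs)         # f_k in Hz
    times = (np.arange(n_frames) * hop + n_win / 2) / fs  # frame centres t_m
    return times, freqs, power

def energy_detector(x, fs, f_lo=12.5e3, f_hi=250e3, mode="mean"):
    """Short-term broadband energy: Eq. 1 (mode='mean') or Eq. 2 (mode='max')."""
    times, freqs, power = stft_power(x, fs)
    # Band-pass approximated by keeping only bins inside [f_lo, f_hi].
    band = (freqs >= f_lo) & (freqs <= f_hi)
    in_band = power[:, band]
    if mode == "mean":
        ebd = in_band.mean(axis=1)
    elif mode == "max":
        ebd = in_band.max(axis=1)
    else:
        raise ValueError("mode must be 'mean' or 'max'")
    return times, ebd
```

A frame \(t_m\) would then be flagged as containing a call wherever `ebd > theta` for a chosen threshold `theta`.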

2.2 Voice Activity Detector

Here we propose a detection method based on a Voice Activity Detector (VAD) technique [10, 13]. This model-based approach is often used in speech processing for detecting the presence or absence of speech. The VAD uses a Time-Frequency (TF) representation X of the audio signal \(x\left( t\right) \) as explained in Sect. 2.1. The following definitions are evaluated at each time frame \(t_m\). First, we define a coefficient vector \(\mathbf {X}=\left( X_1,\dots ,X_k,\dots ,X_K\right) \) where \(X_k=X\left( t_m,f_k\right) \) is the kth component, corresponding to the frequency bin \(f_k\). Assuming that the target signal s(t) is degraded by uncorrelated additive noise n(t), two hypotheses \(H_0\) and \(H_1\) are considered for each frame \(t_m\), for absent and present signal respectively, i.e.

$$\begin{aligned}&H_0 :\ \mathbf {X}=\mathbf {N}\\&H_1 :\ \mathbf {X}=\mathbf {N}+\mathbf {S} \end{aligned}$$

where \(\mathbf {X}\), \(\mathbf {N}\) and \(\mathbf {S}\) are the coefficient vectors of x(t), n(t), and s(t) with their respective components \(X_k\), \(N_k\), and \(S_k\). A Gaussian statistical model is adopted such that the coefficients of each process are asymptotically independent Gaussian random variables [5, 13]. The probability density functions conditioned on \(H_0\) and \(H_1\) are given by \(p\left( \mathbf {X}|H_0\right) =\prod ^K_{k=1}{p\left( X_k|H_0\right) }\) and \(p\left( \mathbf {X}|H_1\right) =\prod ^K_{k=1}{p\left( X_k|H_1\right) }\) respectively. For each frequency bin \(f_k\) a likelihood ratio \(\varLambda _k\) is defined as the ratio of the probability density function \(p\left( X_k|H_1\right) \) to \(p\left( X_k|H_0\right) \). According to [5] the likelihood ratio \(\varLambda _k\) has the form

$$\begin{aligned} \varLambda _k\triangleq \frac{p\left( X_k|H_1\right) }{p\left( X_k|H_0\right) }=\frac{1}{1+{\xi }_k}\exp \left( \frac{{\gamma }_k{\xi }_k}{1+{\xi }_k}\right) \end{aligned}$$
(3)

where \({\xi }_k={\lambda }_S\left( k\right) /{\lambda }_N\left( k\right) \) and \(\gamma _k={\left| X_k\right| }^2/{\lambda }_N\left( k\right) \) are called the a priori and a posteriori signal-to-noise ratios respectively. These ratios are defined by \(\lambda _N\left( k\right) \) and \(\lambda _S\left( k\right) \), which denote the variances of \(N_k\) and \(S_k\), respectively. To estimate the parameters \(\xi _k\) and \(\gamma _k\) it is important to estimate the noise variance \(\lambda _N\left( k\right) \). According to [13], we applied a noise statistic estimation procedure based on a decision-directed method.

The decision rule for deciding whether a signal is absent or present is based on the log-likelihood ratio \(LR(t_m)\), which is the logarithm of the geometric mean of \(\varLambda _k(t_m)\) over all frequencies \(f_k\) at time \(t_m\), that is

$$\begin{aligned} LR(t_m)\equiv \log {\varLambda (t_m)}=\frac{1}{K}\sum ^K_{k=1}{\log {\varLambda _k(t_m)}}\underset{H_0}{\overset{H_1}{\gtrless }}\theta _d \end{aligned}$$
(4)

where the parameter \(\theta _d \) is a given threshold value that separates hypotheses \(H_0\) and \(H_1\). The next step is to locate all the detections of the target signal (i.e. the bat call signal). This means estimating a time \(t_d\) and a frequency \(f_d\) for each detection event \(d=1,2,\dots \) such that hypothesis \(H_1\) holds in Eq. 4. A detection event is delimited by the compact interval \(\varDelta T_d=[T_{onset},T_{offset}]\) such that \(LR\left( t_m\right) >\theta _d\) for all \(t_m\in \varDelta T_d\). The detection point \(\mathbf {x}_d=\left( t_d,f_d\right) \) is obtained by

$$\begin{aligned} t_d&={\mathop {\mathrm {argmax}}_{t_m\in \varDelta T_d} \ LR\left( t_m\right) } \end{aligned}$$
(5a)
$$\begin{aligned} f_d&={\mathop {\mathrm {argmax}}_{f_k} p\left( X(t_d,f_k)|H_1\right) } \end{aligned}$$
(5b)

For this study, we computed the STFT with a Hamming window of 2.5 ms length and an overlap of 75%. The STFT was filtered using a band-pass filter with cut-off frequencies of 12.5 kHz and 250 kHz. We based the VAD detector on [3], and the parameters were adapted for ultrasonic recordings and adjusted to target bat call signals. The adjustment was based on general values for call duration and pulse interval.
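The decision statistic of Eqs. 3 and 4 and the localisation of detection points can be sketched as follows. This is a simplified illustration under stated assumptions, not the detector of [3]: the noise variance \(\lambda _N(k)\) is taken as given, the a priori SNR is replaced by the maximum-likelihood estimate \(\xi _k=\max (\gamma _k-1,0)\) instead of the decision-directed estimate, and \(f_d\) is taken as the bin with the largest per-bin likelihood ratio rather than evaluating \(p(X|H_1)\) directly. The function names are hypothetical.

```python
import numpy as np

def log_likelihood_ratio(power, noise_var):
    """Return LR(t_m) of Eq. 4 and log Lambda_k(t_m) of Eq. 3.

    power:     (M, K) array of |X(t_m, f_k)|^2
    noise_var: (K,) array of lambda_N(k), assumed known here
    """
    gamma = power / noise_var                  # a posteriori SNR
    xi = np.maximum(gamma - 1.0, 0.0)          # ML a priori SNR (assumption)
    log_lambda = gamma * xi / (1.0 + xi) - np.log1p(xi)   # log of Eq. 3
    return log_lambda.mean(axis=1), log_lambda

def detection_points(lr, log_lambda, times, freqs, theta_d):
    """Locate one (t_d, f_d) per contiguous run of frames with LR > theta_d."""
    above = lr > theta_d
    padded = np.concatenate(([False], above, [False])).astype(np.int8)
    edges = np.diff(padded)
    onsets = np.flatnonzero(edges == 1)        # first frame of each event
    offsets = np.flatnonzero(edges == -1)      # one past the last frame
    points = []
    for on, off in zip(onsets, offsets):
        m = on + int(np.argmax(lr[on:off]))    # Eq. 5a within the event
        k = int(np.argmax(log_lambda[m]))      # stand-in for Eq. 5b (assumption)
        points.append((times[m], freqs[k]))
    return points
```

With unit noise variance, a frame whose bins all satisfy \(\gamma _k\le 1\) yields \(LR=0\), so any positive \(\theta _d\) leaves pure-noise frames undetected in this simplified model.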

Table 1. Call features used for classification with Random Forests algorithm

2.3 Voice Activity Detector with Random Forests Classification

This method (VAD+RF) combines the VAD detector described in Sect. 2.2 with an ensemble of tree classifiers based on the Random Forests [2] algorithm. We implemented a feature extraction task for every detection point \(\mathbf {x}_d=\ \left( t_d,f_d\right) \) (see Eq. 5) based on a heuristic image segmentation technique applied in a previous study [10]. This technique separates background components from the call detection and keeps only the spectral components connected to the detection point \(\mathbf {x}_d\). For the call segmentation we used a cut-off threshold \({\theta }_S=12\,\mathrm{dB}\) and a call area limit from \(2\ \mathrm{sHz}\) to \(200\ \mathrm{sHz}\). Thereafter, we extracted a set of features of different types (see Table 1) from the filtered detection sonogram \(S_{\theta }\). Using these features, a binary Random Forests classifier predicts whether the detection is a positive or a negative call detection. As an output of the RF classifier we obtain a posterior probability \(P\left( +|S_{\theta }\left( \mathbf {x}_d\right) \right) \) of a positive call detection (\(+\)) given the detection sample \(S_{\theta }\) located at \(\mathbf {x}_d\). Unlike the detectors described in the previous sections, this method is a supervised learning algorithm and requires a training set with positive and negative detection examples.
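The classification stage can be illustrated with scikit-learn's `RandomForestClassifier`. The sketch below is an assumption-laden example rather than the configuration used in the study: the feature matrix is a placeholder for the call features of Table 1, the number of trees and the random seed are arbitrary, and the helper names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_call_classifier(features, labels, n_trees=100, seed=0):
    """Fit a binary Random Forest on per-detection feature vectors.

    features: (n_detections, n_features) array, placeholder for Table 1 features
    labels:   (n_detections,) array with 1 = bat call, 0 = non-call
    n_trees and seed are illustrative choices (assumption).
    """
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    clf.fit(features, labels)
    return clf

def call_posterior(clf, features):
    """Posterior probability P(+ | features) for each detection."""
    positive_column = list(clf.classes_).index(1)
    return clf.predict_proba(features)[:, positive_column]
```

The returned posterior plays the role of the detection score: sweeping a threshold over it trades false positives against missed calls.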

2.4 Evaluation Metrics

The receiver operating characteristic (ROC) curve is commonly used to evaluate the performance of a detection method. This curve is created by plotting the true positive rate (\(tpr\)) against the false positive rate (\(fpr\)) at various threshold settings. In our case, these standard metrics are not suitable because noisy soundscapes contain a significantly greater proportion of negative detections (non-bat calls) than positive detections (bat calls). Instead, we applied global evaluation curves, namely the Cost Curves [4] and the \(F_1\)-measure Curves [14]. Aodha et al. [8] plotted their results using a PR-curve of precision P against recall R; however, this PR-curve is sensitive to class imbalance, given its dependence on precision [14]. Different operating conditions (skew levels) lead to different PR-curves, which makes classifier comparison difficult. Cost Curves [4] are an alternative that visualizes the expected cost (EC) of classification over a range of misclassification costs and/or skew levels. Denoting by \(C\left( -|+\right) \) and \(C\left( +|-\right) \) the misclassification costs of positive and negative samples, respectively (the cost of correct classifications is usually zero), the expected cost is \(EC = fnr \cdot p\left( +\right) \cdot C\left( -|+\right) + fpr \cdot p\left( -\right) \cdot C\left( +|-\right) \), where \(fnr=1-tpr\) is the false negative rate, and \(p\left( +\right) \) and \(p\left( -\right) =1-p\left( +\right) \) are the prior probabilities of the positive and negative classes. The Cost Curves depict the Normalized Expected Cost (\(NEC \in [0,1]\)) as a function of \(pc\left( +\right) \in [0,1]\) as follows

$$\begin{aligned} NEC = \left( 1-tpr-fpr\right) pc\left( +\right) + fpr \end{aligned}$$
(6)

The term \(pc\left( +\right) \) is the normalized value of the product \(p\left( +\right) \cdot C\left( -|+\right) \) and is defined as

$$\begin{aligned} pc\left( +\right) = \frac{\left( 1/m-1\right) \cdot p\left( +\right) }{\left( 1/m-2\right) \cdot p\left( +\right) +1} \end{aligned}$$
(7)

where \(m=C\left( +|-\right) /\left( C\left( +|-\right) + C\left( -|+\right) \right) \). For this metric, a lower curve indicates a better classification performance. Soleymani et al. [14] recently proposed a metric analogous to Cost Curves based on the \(F_{\alpha }\)-measure. In this metric, a classifier is represented as a curve that shows its performance over all of its decision thresholds and a range of imbalance levels. Here we give equal weights to recall and precision, i.e. \(\alpha =1\). The \(F_{1}\)-measure Curve is defined as

$$\begin{aligned} F_{1} =\frac{2\,tpr}{\left( 1/ {p\left( +\right) }-1\right) fpr + tpr + 1} \end{aligned}$$
(8)

For this metric a higher curve implies a better classification performance.
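Both evaluation curves can be computed from the \(tpr\) and \(fpr\) of a detector at each threshold. The following sketch uses only the Python standard library; the function names are hypothetical and the cost values are illustrative. The \(F_1\) value is computed as the harmonic mean of precision and recall, with precision expressed through \(tpr\), \(fpr\) and \(p(+)\). For a detector with several operating points, the cost curve is taken here as the lower envelope of the individual cost lines, which is the usual construction for Cost Curves and is assumed rather than stated in the text.

```python
def pc_plus(p_pos, cost_fn=1.0, cost_fp=1.0):
    """Eq. 7: normalized p(+) * C(-|+), with m = C(+|-) / (C(+|-) + C(-|+)).

    cost_fn is C(-|+), cost_fp is C(+|-); equal costs are an illustrative default.
    """
    m = cost_fp / (cost_fp + cost_fn)
    r = 1.0 / m - 1.0                      # equals C(-|+) / C(+|-)
    return r * p_pos / ((r - 1.0) * p_pos + 1.0)

def normalized_expected_cost(tpr, fpr, pc):
    """Eq. 6: NEC as a linear function of pc(+) for one operating point."""
    return (1.0 - tpr - fpr) * pc + fpr

def cost_curve(operating_points, pc_grid):
    """Lower envelope of the NEC lines of all (tpr, fpr) operating points."""
    return [min(normalized_expected_cost(tpr, fpr, pc)
                for tpr, fpr in operating_points)
            for pc in pc_grid]

def f1_from_rates(tpr, fpr, p_pos):
    """F1 as the harmonic mean of precision and recall for prior p(+) in (0, 1]."""
    if tpr == 0.0:
        return 0.0
    precision = tpr / (tpr + (1.0 / p_pos - 1.0) * fpr)
    recall = tpr
    return 2.0 * precision * recall / (precision + recall)
```

Evaluating `f1_from_rates` over a grid of `p_pos` values for the best threshold at each prior gives one curve per detector, in the same way that `cost_curve` does for the normalized expected cost.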

The detection methods described in the previous sections produce as output a detection signal D(t) whose value indicates the likelihood of a detection at time t. To obtain the Cost and \(F_1\)-measure Curves it is necessary to sweep the detection threshold \(\theta \) and obtain the \(fpr\) and the \(tpr\) as functions of \(\theta \). This task can be made more efficient by measuring the prominence of each peak d of D(t). We applied the function findpeaks in MATLAB [9] to estimate the prominence h(d) of each peak d. Next, we calculated the peak width at prominence height \(\beta \cdot h(d)\) with \(\beta =7/8\) and used it to determine the time interval \(\varDelta t_d\) of the detection d. We considered a detection d a true positive if any true detection \(t_{+}\) was found within the interval \(\varDelta t_d\), i.e. \(tp(d)=1 \) if any \(t_{+}\in \varDelta t_d\), otherwise \(tp(d)=0\).
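A comparable peak-based evaluation can be sketched in Python with SciPy's `find_peaks` and `peak_widths`, which are used here as stand-ins for MATLAB's findpeaks. Two assumptions are made: the width is measured at a height \(\beta \cdot h(d)\) above the base of the prominence, which corresponds to `rel_height = 1 - beta` in SciPy, and the threshold is swept over the peak prominences. The function names are hypothetical.

```python
import numpy as np
from scipy.signal import find_peaks, peak_widths

def peak_detections(d, times, beta=7.0 / 8.0):
    """Prominence h(d) and time interval [t_left, t_right] of each peak of D(t)."""
    peaks, props = find_peaks(d, prominence=0.0)
    # SciPy measures the width at peak_height - rel_height * prominence, so
    # rel_height = 1 - beta gives a height of beta * h(d) above the base (assumption).
    _, _, left, right = peak_widths(d, peaks, rel_height=1.0 - beta)
    index = np.arange(len(d))
    t_left = np.interp(left, index, times)     # fractional samples to seconds
    t_right = np.interp(right, index, times)
    return props["prominences"], t_left, t_right

def match_detections(t_left, t_right, true_times):
    """tp(d) = True if any annotated call time t_+ lies inside the interval of d."""
    true_times = np.asarray(true_times, dtype=float)
    return np.array([bool(np.any((true_times >= a) & (true_times <= b)))
                     for a, b in zip(t_left, t_right)], dtype=bool)

def sweep_counts(prominences, tp_flags, thresholds):
    """For each threshold, count true and false positives among kept peaks."""
    rows = []
    for theta in thresholds:
        keep = prominences >= theta
        rows.append((theta,
                     int(np.sum(tp_flags & keep)),
                     int(np.sum(~tp_flags & keep))))
    return rows
```

Because the prominences and intervals are computed once, each additional threshold only requires a comparison and two sums, which is what makes the sweep cheap.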

3 Experiments

We used an open-access library labeled by citizen scientists [8] that comprises ultrasonic audio collected along road transects across Europe. The audio library includes three datasets: iBats (E. Europe), iBats (UK) and NBP (Norfolk). These datasets were chosen to represent three different realistic use cases common in bat surveys and monitoring programmes. Each dataset is divided into a training set and a test set. We added a fourth dataset comprising all datasets together. We trained and tested our method and compared its performance to that of other existing detection methods. We included the performance test results of the BatDetective tool [8], which is based on deep neural networks. We also included the performance test data provided in [8] for three existing closed-source commercial detection systems. The tested detection methods are: VAD+RF, VAD, \(EBD_{mean}\), \(EBD_{max}\), BatDetective (Aodha et al. [8]), SonoBat (version 3.1.7p [15]), SCAN’R (version 1.7.7 [16]) and Kaleidoscope (version 4.2.0 alpha4 [1]).

The Cost Curves and \(F_1\)-measure Curves from the performance tests are depicted in Figs. 1 and 2, and their respective areas under the curve are shown in Table 2. For Cost Curves, a lower value indicates a superior detection performance, and vice versa for \(F_1\)-measure Curves. We observe that BatDetective and VAD+RF outperform the other detection methods on all datasets. Except on iBats (E. Europe), the performance of VAD+RF is comparable to that of the BatDetective method. Especially for iBats (UK) and for all datasets combined, the detection performance of VAD+RF is the best over a large range of \(p\left( +\right) \) and \(pc\left( +\right) \), which implies a powerful detection under a large range of detection operating conditions (from quiet to noisy soundscapes). In other words, VAD+RF can perform accurately under a larger range of possible distributions of positive and negative detections. In contrast to deep neural networks, our approach uses a model-based signal processing technique and feature engineering to detect bat calls. One of the advantages of deep learning is that hidden representations can be learned from the data. Based on the results of this study, we suggest that combining VAD signal processing with deep neural networks may enhance the detection performance.

Fig. 1. Comparison of the detection methods using Cost Curves

Fig. 2. Comparison of the detection methods using \(F_1\)-measure Curves

Table 2. Area under curve of the Cost Curves (\(A_C\)) and area under curve of the \({F_1}\)-measure Curves (\(A_{F_1}\))

4 Conclusions

In this study, we proposed a method for the detection of bat echolocation calls that uses a statistical model-based Voice Activity Detector combined with an ensemble of tree classifiers based on the Random Forests algorithm. We used global evaluation curves to visualize the detector performance over the full range of detection operating conditions. The results show that the detection performance of VAD+RF is comparable to that of methods based on deep learning. Based on the results, we give recommendations to improve future designs of bat call detectors.