Abstract
Detecting bird calls in audio is an important task for automatic wildlife monitoring, as well as for citizen science and audio library management. This paper presents front-end acoustic enhancement techniques to handle the acoustic domain mismatch problem in bird audio detection. A time-domain cross-condition data augmentation (TCDA) method is first proposed to enhance the domain coverage of a fixed training dataset. Then, to eliminate the distortion of stationary noise and enhance transient events, we investigate per-channel energy normalization (PCEN), which automatically controls the gain of every subband in the mel-frequency spectrogram. Furthermore, harmonic-percussive source separation is investigated to extract robust percussive representations of bird calls and thereby alleviate the acoustic mismatch. Our experiments are performed on the Bird Audio Detection task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2018. Extensive results show that the proposed TCDA yields a relative AUC improvement of 5.02% under mismatched conditions. On the cross-domain test set, the proposed percussive features (RPFs) and the RPFs combined with PCEN significantly improve on the baseline with conventional log mel-spectrogram features, raising the AUC from 81.79% to 84.46% and 88.68%, respectively. Moreover, we find that combining different front-end features can further improve system performance.
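To make the PCEN step concrete, the following is a minimal sketch of the commonly published PCEN formulation, written in plain Python. It is not the authors' implementation: the function name `pcen`, the frames-by-bands list layout, and the parameter values (`s`, `alpha`, `delta`, `r`, `eps`) are illustrative assumptions, since the abstract does not state the settings used. Each subband is divided by a smoothed estimate of its own recent energy, so a stationary background is mapped to roughly the same value regardless of its level, while a sudden transient stands out.

```python
from typing import List

# Energy "spectrogram" indexed as [frame][band]; values must be non-negative.
Spectrogram = List[List[float]]


def pcen(energy: Spectrogram, s: float = 0.025, alpha: float = 0.98,
         delta: float = 2.0, r: float = 0.5, eps: float = 1e-6) -> Spectrogram:
    """Per-channel energy normalization (parameter values are assumptions).

    M[t][f]    = (1 - s) * M[t-1][f] + s * E[t][f]      (per-band smoother)
    PCEN[t][f] = (E[t][f] / (eps + M[t][f])**alpha + delta)**r - delta**r
    """
    if not energy:
        return []
    # Initialise the smoother with the first frame of every band.
    smoothed = list(energy[0])
    out: Spectrogram = []
    for frame in energy:
        # First-order IIR low-pass filter, applied independently per band.
        smoothed = [(1.0 - s) * m + s * e for m, e in zip(smoothed, frame)]
        # Adaptive gain control followed by root compression.
        out.append([(e / (eps + m) ** alpha + delta) ** r - delta ** r
                    for e, m in zip(frame, smoothed)])
    return out


if __name__ == "__main__":
    # Hypothetical single-band input: steady background with one transient.
    band = [[1.0] for _ in range(50)]
    band[25] = [10.0]
    normalized = pcen(band)
    print("background:", normalized[24][0], "transient:", normalized[25][0])
```

In practice the input would be a mel-frequency energy spectrogram computed from the recording, and the smoothing coefficient `s` would be chosen relative to the frame rate; both are outside the scope of this sketch.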





Acknowledgements
This work is supported by the National Natural Science Foundation of China (Grant No. 62071302).
About this article
Cite this article
Tang, T., Long, Y., Li, Y. et al. Acoustic domain mismatch compensation in bird audio detection. Int J Speech Technol 25, 251–260 (2022). https://doi.org/10.1007/s10772-022-09957-w
DOI: https://doi.org/10.1007/s10772-022-09957-w