A Survey: Neural Network-Based Deep Learning for Acoustic Event Detection

Xia, Xianjun; Togneri, Roberto; Sohel, Ferdous; Zhao, Yuanjun; Huang, Defeng

doi:10.1007/s00034-019-01094-1

Published: 21 March 2019

Volume 38, pages 3433–3453, (2019)

Circuits, Systems, and Signal Processing

Xianjun Xia ORCID: orcid.org/0000-0001-5277-6634¹,
Roberto Togneri¹,
Ferdous Sohel²,
Yuanjun Zhao¹ &
Defeng Huang¹

Abstract

Recently, neural network-based deep learning methods have been widely applied to computer vision, speech signal processing and other pattern recognition areas, with remarkable success. The purpose of this article is to provide a comprehensive survey of neural network-based deep learning approaches to acoustic event detection. Different deep learning-based acoustic event detection approaches are investigated, with an emphasis on both strongly labeled and weakly labeled acoustic event detection systems. This paper also discusses how deep learning methods benefit the acoustic event detection task and the potential issues that need to be addressed for prospective real-world scenarios.

Notes

http://www.cs.tut.fi/~heittolt/datasets.

Acknowledgements

This work was supported by the International Postgraduate Research Scholarship (IPRS) from the University of Western Australia.

Author information

Authors and Affiliations

School of Electrical, Electronic and Computer Engineering, University of Western Australia, 35 Stirling Hwy, Perth, WA, 6009, Australia
Xianjun Xia, Roberto Togneri, Yuanjun Zhao & Defeng Huang
School of Engineering and Information Technology, Murdoch University, 90 South St, Murdoch, WA, 6150, Australia
Ferdous Sohel

Corresponding author

Correspondence to Xianjun Xia.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Xia, X., Togneri, R., Sohel, F. et al. A Survey: Neural Network-Based Deep Learning for Acoustic Event Detection. Circuits Syst Signal Process 38, 3433–3453 (2019). https://doi.org/10.1007/s00034-019-01094-1

Received: 03 September 2018
Revised: 12 March 2019
Accepted: 15 March 2019
Published: 21 March 2019
Issue Date: 15 August 2019
DOI: https://doi.org/10.1007/s00034-019-01094-1
