Abstract
Neural network-based deep learning methods have recently been widely applied to computer vision, speech signal processing, and other pattern recognition areas, where they have demonstrated remarkable success. This article provides a comprehensive survey of neural network-based deep learning approaches to acoustic event detection. Different deep learning-based acoustic event detection approaches are examined, with an emphasis on both strongly labeled and weakly labeled systems. The paper also discusses how deep learning methods benefit the acoustic event detection task and the potential issues that need to be addressed for real-world scenarios.
Acknowledgements
This work was supported by the International Postgraduate Research Scholarship (IPRS) from the University of Western Australia.
Cite this article
Xia, X., Togneri, R., Sohel, F. et al. A Survey: Neural Network-Based Deep Learning for Acoustic Event Detection. Circuits Syst Signal Process 38, 3433–3453 (2019). https://doi.org/10.1007/s00034-019-01094-1
DOI: https://doi.org/10.1007/s00034-019-01094-1