A Survey: Neural Network-Based Deep Learning for Acoustic Event Detection

Circuits, Systems, and Signal Processing

Abstract

Neural network-based deep learning methods have recently been widely applied to computer vision, speech signal processing, and other pattern recognition areas, with remarkable success. This article provides a comprehensive survey of neural network-based deep learning approaches to acoustic event detection. It examines different deep learning-based acoustic event detection approaches, with an emphasis on both strongly labeled and weakly labeled systems. It also discusses how deep learning methods benefit the acoustic event detection task and the potential issues that need to be addressed for prospective real-world use.

Notes

  1. http://www.cs.tut.fi/~heittolt/datasets.

Acknowledgements

This work was supported by the International Postgraduate Research Scholarship (IPRS) from the University of Western Australia.

Author information

Corresponding author

Correspondence to Xianjun Xia.

About this article

Cite this article

Xia, X., Togneri, R., Sohel, F. et al. A Survey: Neural Network-Based Deep Learning for Acoustic Event Detection. Circuits Syst Signal Process 38, 3433–3453 (2019). https://doi.org/10.1007/s00034-019-01094-1

  • DOI: https://doi.org/10.1007/s00034-019-01094-1
