Improved Convolutional Neural Networks for Acoustic Event Classification

Multimedia Tools and Applications

Abstract

To further exploit the potential of convolutional neural networks in acoustic event classification, an improved convolutional neural network called AecNet (Acoustic event classification net) is proposed. Because traditional convolutional neural networks lack a representation of low-level features, the proposed model includes more feature layers to preserve both the low-level and high-level information of the input. To extract features at different levels effectively, 1 × 1 convolutions are used to compress the feature maps of all convolutional layers except the top convolutional layer. The condensed features are then concatenated into one layer, which contains the features of all levels. In this way, feature learning is enhanced and a multi-scale convolutional neural network is constructed. To better capture the dynamic features of a sound clip, multi-channel spectrogram features are adopted, comprising the mel-spectrogram, its first-order deltas along frequency and time, and its second-order deltas along frequency and time. In the experiments, the number of FFT points, the number of mel bands, and the type of mel-spectrogram deltas are discussed in detail, and reasonable choices are suggested for practical use. Experimental results on the ESC-10, ESC-50 and DCASE datasets show that the proposed method improves recognition accuracy to varying degrees compared with some state-of-the-art results on standard benchmarks.
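The multi-channel input described in the abstract can be illustrated with a minimal sketch. It assumes the mel-spectrogram is already available as a 2-D grid (rows are mel bands, columns are time frames) and uses a simple central difference with edge replication as the delta operator. The abstract does not specify the delta formula, so that choice, the function names `delta` and `multi_channel_features`, and the resulting channel order are illustrative assumptions rather than the authors' implementation. Plain Python lists are used only to keep the sketch dependency-free.

```python
def delta(spec, axis):
    """Central-difference delta of a 2-D spectrogram given as a list of rows.

    axis=0 differentiates along frequency (rows),
    axis=1 differentiates along time (columns).
    Edge values are replicated at the borders.
    NOTE: central difference is an assumption; the source does not
    state which delta formula is used.
    """
    n_rows, n_cols = len(spec), len(spec[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            if axis == 0:
                nxt = spec[min(i + 1, n_rows - 1)][j]
                prv = spec[max(i - 1, 0)][j]
            else:
                nxt = spec[i][min(j + 1, n_cols - 1)]
                prv = spec[i][max(j - 1, 0)]
            out[i][j] = (nxt - prv) / 2.0
    return out


def multi_channel_features(mel):
    """Stack the mel-spectrogram with its first- and second-order deltas.

    Returns five channels (hypothetical ordering):
    [mel, delta_freq, delta_time, delta2_freq, delta2_time].
    """
    d_freq = delta(mel, axis=0)
    d_time = delta(mel, axis=1)
    dd_freq = delta(d_freq, axis=0)  # second-order delta along frequency
    dd_time = delta(d_time, axis=1)  # second-order delta along time
    return [mel, d_freq, d_time, dd_freq, dd_time]


# Toy 3-band x 3-frame "spectrogram" for illustration only.
mel = [[0.0, 1.0, 2.0],
       [2.0, 3.0, 4.0],
       [4.0, 5.0, 6.0]]
features = multi_channel_features(mel)
print(len(features), "channels of shape", len(features[0]), "x", len(features[0][0]))
```

Each channel keeps the shape of the original spectrogram, so the five grids can be stacked along a channel axis and passed to a convolutional network in the same way as a multi-channel image.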

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant No. 61871213, the Six Talent Peaks Project in Jiangsu Province under Grant No. 2016-DZXX-023, the China Postdoctoral Science Foundation under Grant No. 2016M601696, the Qing Lan Project of Jiangsu Province, and the Jiangsu Planned Projects for Postdoctoral Research Funds under Grant No. 1601011B.

Author information

Corresponding author

Correspondence to Ruiyu Liang.

Ethics declarations

Conflicts of Interest

The authors declare no conflict of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Tang, G., Liang, R., Xie, Y. et al. Improved Convolutional Neural Networks for Acoustic Event Classification. Multimed Tools Appl 78, 15801–15816 (2019). https://doi.org/10.1007/s11042-018-6991-4

  • DOI: https://doi.org/10.1007/s11042-018-6991-4

Keywords