Improved Convolutional Neural Networks for Acoustic Event Classification

Tang, Guichen; Liang, Ruiyu; Xie, Yue; Bao, Yongqiang; Wang, Shijia

doi:10.1007/s11042-018-6991-4

Published: 08 December 2018

Volume 78, pages 15801–15816, (2019)

Multimedia Tools and Applications

Guichen Tang¹, Ruiyu Liang¹,² (ORCID: orcid.org/0000-0002-6813-4203), Yue Xie², Yongqiang Bao¹ & Shijia Wang²

Abstract

To further exploit the potential of convolutional neural networks (CNNs) in acoustic event classification, an improved CNN called AecNet (Acoustic event classification net) is proposed. Because a traditional CNN lacks a representation of low-level features, the proposed model includes more feature layers to preserve both low-level and high-level information about the input. To extract features at different levels effectively, 1 × 1 convolutions are adopted to compress the feature maps of all convolutional layers except the top convolutional layer. The condensed features are then concatenated into one layer that contains the features of all levels. Feature learning is thereby enhanced, and a multi-scale convolutional neural network is constructed. To better capture the dynamic features of a sound clip, multi-channel spectrogram features are adopted, comprising the mel-spectrogram, its first-order deltas along frequency and time, and its second-order deltas along frequency and time. In the experiments, the number of FFT points, the number of mel bands, and the type of mel-spectrogram deltas are discussed in detail, and reasonable choices for practice are suggested. Experimental results on the ESC-10, ESC-50, and DCASE datasets show that the proposed method improves recognition accuracy to varying degrees compared with several state-of-the-art results on standard benchmarks.
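
The two ideas in the abstract, the five-channel input and the concatenation of 1 × 1-compressed feature maps, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names are hypothetical, `np.gradient` is used as a stand-in delta estimator (the abstract does not say how the deltas are computed), and flattening each compressed map before concatenation is an assumption made so that maps of different sizes can be joined.

```python
import numpy as np


def delta(x, axis):
    """Finite-difference delta along `axis` (assumed stand-in estimator)."""
    return np.gradient(x, axis=axis)


def multichannel_features(mel):
    """Stack a (n_mels, n_frames) mel-spectrogram with its deltas.

    Returns an array of shape (5, n_mels, n_frames) holding:
    mel, delta-frequency, delta-time, delta-delta-frequency, delta-delta-time.
    """
    d_freq = delta(mel, axis=0)      # first-order delta along frequency
    d_time = delta(mel, axis=1)      # first-order delta along time
    dd_freq = delta(d_freq, axis=0)  # second-order delta along frequency
    dd_time = delta(d_time, axis=1)  # second-order delta along time
    return np.stack([mel, d_freq, d_time, dd_freq, dd_time], axis=0)


def conv1x1(fmap, weight):
    """1 x 1 convolution: a per-position linear mix of channels.

    fmap has shape (C_in, H, W); weight has shape (C_out, C_in).
    """
    return np.einsum("oc,chw->ohw", weight, fmap)


def multiscale_concat(lower_maps, top_map, weights):
    """Compress every lower-level map with a 1 x 1 convolution, leave the
    top map untouched, then flatten and concatenate everything into one
    feature vector (flattening is an assumption of this sketch)."""
    compressed = [conv1x1(f, w).ravel() for f, w in zip(lower_maps, weights)]
    return np.concatenate(compressed + [top_map.ravel()])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mel = rng.random((60, 41))  # hypothetical 60 mel bands x 41 frames
    print(multichannel_features(mel).shape)
```

Choosing fewer output channels than input channels in each `weight` is what makes the 1 × 1 convolution act as compression: the spatial layout of a map is kept while its channel count shrinks, so the low-level maps do not dominate the concatenated vector.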

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant No. 61871213), the Six Talent Peaks Project in Jiangsu Province (Grant No. 2016-DZXX-023), the China Postdoctoral Science Foundation (Grant No. 2016M601696), the Qing Lan Project of Jiangsu Province, and the Jiangsu Planned Projects for Postdoctoral Research Funds (Grant No. 1601011B).

Author information

Authors and Affiliations

School of Communication Engineering, Nanjing Institute of Technology, Nanjing, 211167, China
Guichen Tang, Ruiyu Liang & Yongqiang Bao
School of Information Science and Engineering, Southeast University, Nanjing, 210096, China
Ruiyu Liang, Yue Xie & Shijia Wang

Corresponding author

Correspondence to Ruiyu Liang.

Ethics declarations

Conflicts of Interest

The authors declare no conflict of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Tang, G., Liang, R., Xie, Y. et al. Improved Convolutional Neural Networks for Acoustic Event Classification. Multimed Tools Appl 78, 15801–15816 (2019). https://doi.org/10.1007/s11042-018-6991-4

Received: 13 April 2018
Revised: 16 October 2018
Accepted: 28 November 2018
Published: 08 December 2018
Issue Date: 30 June 2019
DOI: https://doi.org/10.1007/s11042-018-6991-4
