Abstract
To further exploit the potential performance of convolutional neural networks in acoustic event classification, an improved convolutional neural network called AecNet (Acoustic event classification net) is proposed. For traditional convolutional neural network lacks the representation of low-level features, the proposed model includes more feature layers to reserve the information of low-level and high-level features of the input. In order to extract the features of different level effectively, 1 × 1 convolutions are adopted to compress the feature maps of all convolutional layers except the top convolutional layer. Then the condensed features are concatenated into one layer, which contains all features in different levels. So, the feature learning is enhanced and multi-scale convolutional neural network is constructed. In order to extract the dynamic features of the sound clip better, multi-channels spectrogram features comprised of mel-spectrogram, its first order delta along frequency and time, second order delta along frequency and time are adopted. In experiment, point of FFT, number of mel-bands and type of mel-spectrogram deltas are detailedly discussed and reasonable choice are suggested in practice. Experiments results on datasets ESC-10, ESC-50 and DCASE show that the proposed method yields improvements of recognition accuracy in various degrees compared with some state-of-art results on standard benchmark.




Similar content being viewed by others
References
Aytar Y, Vondrick C, Torralba A (2016) SoundNet: Learning Sound Representations from Unlabeled Video. arXiv preprint arXiv:1610.09001
Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization. J Mach Learn Res 13(1):281–305
Chu S, Narayanan S, Kuo CCJ (2009) Environmental Sound Recognition With Time–Frequency Audio Features. IEEE Trans Audio Speech Lang Process 17(6):1142–1158
Gemmeke JF, Ellis DPW, Freedman D, Jansen A, Lawrence W, Moore RC, Plakal M, Ritter M (2017) Audio Set: An ontology and human-labeled dataset for audio events. in 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017, March 5, 2017 - March 9, 2017. New Orleans, LA, United states: Institute of Electrical and Electronics Engineers Inc.
Gencoglu O, Virtanen T, Huttunen H (2014) Recognition of acoustic events using deep neural networks. in 22nd European Signal Processing Conference, EUSIPCO 2014, September 1, 2014 - September 5, 2014. Lisbon, Portugal: European Signal Processing Conference, EUSIPCO
Han Y, Lee K (2016) Acoustic scene classification using convolutional neural network and multiple-width frequency-delta data augmentation. arXiv preprint arXiv:1607.02383
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. in 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, June 26, 2016 - July 1, 2016. Las Vegas, NV, United states: IEEE Computer Society
Hertel L, Barth E, Kaster T, Martinetz T (2015) Deep convolutional neural networks as generic feature extractors. in International Joint Conference on Neural Networks, IJCNN 2015, July 12, 2015 - July 17, 2015. Killarney, Ireland: Institute of Electrical and Electronics Engineers Inc.
Jarrett K, Kavukcuoglu K, Ranzato M A (2009) Lecun Y. What is the best multi-stage architecture for object recognition? in 12th International Conference on Computer Vision, ICCV 2009, September 29, 2009 - October 2, 2009. Kyoto, Japan: Institute of Electrical and Electronics Engineers Inc.
Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: Convolutional architecture for fast feature embedding. in 2014 ACM Conference on Multimedia, MM 2014, November 3, 2014 - November 7, 2014. Orlando, FL, United states: Association for Computing Machinery, Inc.
Kim HG, Jin YK (2017) Acoustic Event Detection in Multichannel Audio Using Gated Recurrent Neural Networks with High-Resolution Spectral Features. ETRI J 39(6):832–840
Kingma DP, Ba J (2014) Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980
Kumar A, Raj B (2016) Audio event detection using weakly labeled data. in 24th ACM Multimedia Conference, MM 2016, October 15, 2016 - October 19, 2016. Amsterdam, United kingdom: Association for Computing Machinery, Inc.
Lin M, Chen Q, Yan S (2013) Network In Network. arXiv preprint arXiv:1312.4400
Marques GA (2016) Langlois T. tut acoustic scene classification submission. in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016
Mcloughlin I, Zhang H, Xie Z, Song Y, Xiao W (2015) Robust sound event classification using deep neural networks. IEEE-ACM T Audio Spe 23(3):540–552
Mesaros A, Heittola T, Benetos E, Foster P, Lagrange M, Virtanen T, Plumbley MD (2017) Detection and classification of acoustic scenes and events: outcome of the DCASE 2016 challenge. IEEE-ACM T Audio Spe 26(2):379–393
Mikolov T, Joulin A, Chopra S, Mathieu M, Ranzato M A (2014) Learning Longer Memory in Recurrent Neural Networks. arXiv preprint arXiv:1412.7753
Pascanu R, Mikolov T, Bengio Y (2013) On the difficulty of training recurrent neural networks. in 30th International Conference on Machine Learning, ICML 2013, June 16, 2013 - June 21, 2013. Atlanta, GA, United states: International Machine Learning Society (IMLS)
Phan H, Maaß M, Mazur R, Mertins A (2015) Random regression forests for acoustic event detection and classification. IEEE-ACM T Audio Spe 23(1):20–31
Piczak KJ (2015) Environmental sound classification with convolutional neural networks. in 25th IEEE International Workshop on Machine Learning for Signal Processing, MLSP 2015, September 17, 2015 - September 20, 2015. Boston, MA, United states: IEEE Computer Society
Piczak KJ (2015) ESC: Dataset for environmental sound classification. in 23rd ACM International Conference on Multimedia, MM 2015, October 26, 2015 - October 30, 2015. Brisbane, QLD, Australia: Association for Computing Machinery, Inc.
Povey D, Zhang X, Khudanpur S (2014) Parallel training of Deep Neural Networks with Natural Gradient and Parameter Averaging. arXiv preprint arXiv:1410.7455v3
Radford A, Metz L, Chintala S (2015) Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv preprint arXiv:1511.06434
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC, Fei-Fei L (2015) ImageNet Large Scale Visual Recognition Challenge. Int J Comput Vis 115(3):211–252
Sermanet P, Lecun Y (2011) Traffic sign recognition with multi-scale convolutional networks. in 2011 International Joint Conference on Neural Network, IJCNN 2011, July 31, 2011 - August 5, 2011. San Jose, CA, United states: Institute of Electrical and Electronics Engineers Inc.
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
Sun Y, Wang X, Tang X (2015) Deeply learned face representations are sparse, selective, and robust. in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, June 7, 2015 - June 12, 2015. Boston, MA, United states: IEEE Computer Society
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V (2015) Rabinovich A. Going deeper with convolutions. in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, June 7, 2015 - June 12, 2015. Boston, MA, United states: IEEE Computer Society
Takahashi N, Gygli M, Pfister B, Van Gool L (2016) Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Detection. arXiv preprint arXiv:1604.07160
Valenti M, Diment A, Parascandolo G, Squartini S, Virtanen T (2016) DCASE 2016 acoustic scene classification using convolutional neural networks, in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016. 95–99
Vu TH, Wang JC (2016) Acoustic scene and event recognition using recurrent neural networks. in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016
Xu Y, Huang Q, Wang W, Foster P, Sigtia S, Jackson PJB, Plumbley MD (2017) Unsupervised feature learning based on deep models for environmental audio tagging. IEEE-ACM T Audio Spe 25(6):1230–1241
Yun S, Kim S, Moon S, Cho J, Kim T (2016) Discriminative training of GMM parameters for audio scene classification and audio tagging. in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016
Zhang H, Mcloughlin I, Song Y (2015) Robust sound event recognition using convolutional neural networks. in 40th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2015, April 19, 2014 - April 24, 2014. Brisbane, QLD, Australia: Institute of Electrical and Electronics Engineers Inc.
Zieger C, Omologo M (2008) Acoustic event classification using a distributed microphone network with a GMM/SVM combined algorithm. in INTERSPEECH 2008 - 9th Annual Conference of the International Speech Communication Association, September 22, 2008 - September 26, 2008. Brisbane, QLD, Australia: International Speech Communication Association
Acknowledgments
The work was supported by the National Natural Science Foundation of China under Grant No. 61871213, Six Talent Peaks Project in Jiangsu Province under Grant No. 2016-DZXX-023, China Postdoctoral Science Foundation funded project under Grant No. 2016 M601696, Qing Lan Project of Jiangsu Province, Jiangsu Planned Projects for Postdoctoral Research Funds under Grant No. 1601011B.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflicts of Interest
The authors declare no conflict of interest.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Tang, G., Liang, R., Xie, Y. et al. Improved Convolutional Neural Networks for Acoustic Event Classification. Multimed Tools Appl 78, 15801–15816 (2019). https://doi.org/10.1007/s11042-018-6991-4
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-018-6991-4