Abstract
Learning acoustic models directly from the raw waveform is an effective approach to Environmental Sound Classification (ESC), where sound events exhibit vast diversity in temporal scales. ESC methods based on convolutional neural networks (CNNs) have achieved state-of-the-art results. However, their performance depends heavily on the number of convolutional layers used and on the kernel size chosen for the first convolutional layer. In addition, most existing studies ignore the ability of CNNs to learn hierarchical features from environmental sounds. Motivated by these findings, in this paper we design parallel convolutional filters of different sizes in the first convolutional layer to extract multi-time-resolution features, aiming to enhance the feature representation. Inspired by VGG networks, we build our deep CNN by stacking 1-D convolutional layers that use very small filters, except in the first layer. Finally, we extend the model with a multi-level feature aggregation technique to further boost classification performance. Experimental results on UrbanSound8K, ESC-50, and ESC-10 show that the proposed method outperforms state-of-the-art end-to-end methods for environmental sound classification in terms of classification accuracy.
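The multi-time-resolution front end described above can be illustrated with a minimal numpy sketch: several parallel 1-D filter banks with different kernel sizes are applied to the same raw waveform, and their feature maps are fused by concatenation along the channel axis. The kernel sizes, channel count, stride, random filter weights, and concatenation-based fusion here are illustrative assumptions for exposition, not the paper's exact configuration.

```python
import numpy as np

def conv1d(x, weight, stride=1):
    """Naive valid-mode strided 1-D convolution (cross-correlation,
    as in the deep-learning convention).
    x: (length,) raw waveform; weight: (out_ch, k) filter bank.
    Returns a feature map of shape (out_ch, n_frames)."""
    out_ch, k = weight.shape
    n_frames = (len(x) - k) // stride + 1
    out = np.empty((out_ch, n_frames))
    for i in range(n_frames):
        segment = x[i * stride:i * stride + k]
        out[:, i] = weight @ segment  # all filters applied to one window
    return out

def multi_resolution_frontend(x, kernel_sizes=(11, 51, 101),
                              out_ch=4, stride=10, seed=0):
    """Parallel first-layer branches with different kernel sizes
    (hypothetical values), each followed by a ReLU; branches are
    cropped to a common frame count and concatenated channel-wise."""
    rng = np.random.default_rng(seed)
    branches = []
    for k in kernel_sizes:
        w = rng.standard_normal((out_ch, k)) / np.sqrt(k)  # random demo filters
        fmap = np.maximum(conv1d(x, w, stride), 0.0)       # ReLU nonlinearity
        branches.append(fmap)
    # Larger kernels yield fewer frames; align branches on the shortest.
    n = min(b.shape[1] for b in branches)
    return np.concatenate([b[:, :n] for b in branches], axis=0)
```

In a full model, the concatenated multi-resolution feature map would feed the VGG-style stack of small-filter 1-D convolutions; here the branches are shown in isolation to make the time-resolution trade-off of the first layer explicit.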
Acknowledgment
This project was partially supported by the Shenzhen Science & Technology Fundamental Research Programs (Nos. JCYJ20170817160058246 and JCYJ20170306165153653) and the Shenzhen Key Laboratory for Intelligent Multimedia and Virtual Reality (ZDSYS201703031405467). Special acknowledgements are given to the Aoto-PKUSZ Joint Research Center of Artificial Intelligence on Scene Cognition & Technology Innovation for its support.
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Chong, D., Zou, Y., Wang, W. (2019). Multi-channel Convolutional Neural Networks with Multi-level Feature Fusion for Environmental Sound Classification. In: Kompatsiaris, I., Huet, B., Mezaris, V., Gurrin, C., Cheng, WH., Vrochidis, S. (eds) MultiMedia Modeling. MMM 2019. Lecture Notes in Computer Science(), vol 11296. Springer, Cham. https://doi.org/10.1007/978-3-030-05716-9_13
Print ISBN: 978-3-030-05715-2
Online ISBN: 978-3-030-05716-9