Abstract
With the widespread application of deep learning models to classification problems, a growing number of researchers have applied these models to environmental sound classification (ESC) in recent years. However, the performance of existing models that train deep neural networks on acoustic features such as the log-scaled mel spectrogram (Log mel) and mel-frequency cepstral coefficients, or on raw waveforms, remains unsatisfactory. In this paper, we first propose a fusion of multiple features, namely Log mel, the log-scaled cochleagram, and the log-scaled constant-Q transform, which are combined into a feature set called LMCC. We then present CNN-GRUNN, a network consisting of a convolutional neural network and a gated recurrent unit neural network in parallel, to improve ESC performance with the proposed aggregated features. Experiments were conducted on the ESC-10, ESC-50, and UrbanSound8K datasets. The results indicate that feeding LMCC into CNN-GRUNN is well suited to ESC problems, and the model achieves good classification accuracy on all three datasets: ESC-10 (92.30%), ESC-50 (87.43%), and UrbanSound8K (96.10%).
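As a rough illustration of the LMCC fusion described above, the sketch below extracts the three log-scaled time-frequency maps and stacks them as channels. It assumes librosa for the Log mel and constant-Q features and the third-party gammatone package for the cochleagram; all parameter values (sample rate, hop size, number of bands) are illustrative, not the settings used in the paper.

```python
# Minimal sketch of LMCC-style feature fusion (illustrative parameters).
import numpy as np
import librosa
from gammatone.gtgram import gtgram  # third-party gammatone filterbank

def lmcc_features(path, sr=22050, n_bands=128, hop=512):
    y, sr = librosa.load(path, sr=sr)

    # Log-scaled mel spectrogram (Log mel).
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bands,
                                         hop_length=hop)
    log_mel = librosa.power_to_db(mel)

    # Log-scaled constant-Q transform; 24 bins per octave keeps the
    # 128-bin range below the Nyquist frequency.
    cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=hop, n_bins=n_bands,
                             bins_per_octave=24))
    log_cqt = librosa.amplitude_to_db(cqt)

    # Log-scaled cochleagram from a gammatone filterbank; any gammatone
    # implementation with similar output could be substituted here.
    coch = gtgram(y, sr, window_time=hop / sr, hop_time=hop / sr,
                  channels=n_bands, f_min=50)
    log_coch = np.log(coch + 1e-8)

    # Trim the three maps to a common number of frames and stack them
    # as channels: shape (3, n_bands, frames).
    t = min(log_mel.shape[1], log_cqt.shape[1], log_coch.shape[1])
    return np.stack([log_mel[:, :t], log_cqt[:, :t], log_coch[:, :t]],
                    axis=0)
```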
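Similarly, a minimal PyTorch sketch of a parallel CNN/GRU network is shown below. It captures the two-branch idea behind CNN-GRUNN, where a convolutional branch and a recurrent branch process the fused features side by side before a joint classifier, but the layer sizes and fusion layer are placeholders, not the authors' exact topology.

```python
# Sketch of a parallel CNN + GRU classifier (illustrative layer sizes).
import torch
import torch.nn as nn

class CNNGRUNN(nn.Module):
    def __init__(self, n_classes, n_bands=128, gru_hidden=128):
        super().__init__()
        # CNN branch: treats the fused features as a 3-channel image.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        # GRU branch: reads the same input as a sequence of spectral frames.
        self.gru = nn.GRU(input_size=3 * n_bands, hidden_size=gru_hidden,
                          batch_first=True)
        # Joint classifier over the concatenated branch outputs.
        self.fc = nn.Linear(64 + gru_hidden, n_classes)

    def forward(self, x):                 # x: (batch, 3, bands, frames)
        c = self.cnn(x).flatten(1)        # (batch, 64)
        b, ch, f, t = x.shape
        seq = x.permute(0, 3, 1, 2).reshape(b, t, ch * f)
        _, h = self.gru(seq)              # h: (layers, batch, gru_hidden)
        g = h[-1]                         # final hidden state per clip
        return self.fc(torch.cat([c, g], dim=1))
```

For instance, `CNNGRUNN(n_classes=10)` applied to a batch of LMCC tensors of shape `(batch, 3, 128, frames)` yields class logits; because the CNN branch pools adaptively and the GRU consumes sequences, clips of different lengths can be handled per batch.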

Ethics declarations
The authors declare no conflict of interest.
Cite this article
Zhang, Y., Zeng, J., Li, Y., et al., Convolutional Neural Network-Gated Recurrent Unit Neural Network with Feature Fusion for Environmental Sound Classification, Aut. Control Comp. Sci., 2021, vol. 55, pp. 311–318. https://doi.org/10.3103/S0146411621040106