Abstract
Environmental Sound Classification (ESC) has attracted increasing interest in recent years; it is a challenging non-speech audio event classification problem owing to the complexity of real-world acoustic environments. The accuracy of conventional methods depends heavily on the robustness of the extracted features and the effectiveness of the constructed model, which limits the adaptability of current approaches. To address this, a novel ESC scheme based on stacked Deep Neural Networks with multi-dimensional aggregated features is proposed. First, aggregated features combining time-domain and time–frequency (TF) domain features are used to capture a more comprehensive representation of sounds. Next, feature reduction based on Principal Component Analysis (PCA) is applied to select the most discriminative representations. Finally, a Stacked Deep Neural Network built on ensemble learning and data augmentation is presented to improve the scheme's generalization capability. The experimental results demonstrate that the proposed method is well suited to ESC, achieving accuracies of 96.1% and 98.1% on the ESC-10 and UrbanSound8K datasets, respectively, and outperforming most state-of-the-art ESC methods in both accuracy and computational cost.
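The three-stage pipeline described above (feature aggregation, PCA reduction, ensemble classification) can be sketched end to end. The following is a minimal, self-contained NumPy illustration, not the authors' implementation: toy time-domain and spectral statistics stand in for the paper's aggregated features, and a bootstrap nearest-centroid ensemble stands in for the stacked DNNs; all function names and the synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def aggregate_features(signal, n_fft=512):
    """Toy stand-in for aggregated features: a few time-domain
    statistics plus magnitude-spectrum (TF-domain) statistics."""
    zcr = np.mean(np.abs(np.diff(np.sign(signal)))) / 2  # zero-crossing rate proxy
    time_feats = np.array([signal.mean(), signal.std(), zcr])
    spec = np.abs(np.fft.rfft(signal, n=n_fft))
    tf_feats = np.array([spec.mean(), spec.std(), spec.max()])
    return np.concatenate([time_feats, tf_feats])

def pca_reduce(X, k):
    """Standardize, then project onto the top-k principal components (via SVD)."""
    Xs = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    return Xs @ Vt[:k].T

# Toy dataset: 40 synthetic "clips" from two classes that differ in energy.
X = np.stack([aggregate_features(rng.standard_normal(2048) * (1 + c))
              for c in (0, 1) for _ in range(20)])
y = np.repeat([0, 1], 20)

Z = pca_reduce(X, k=2)  # keep the two most discriminative directions

def ensemble_predict(Z_train, y_train, Z_test, n_models=5):
    """Average the class scores of several nearest-centroid models, each
    fit on a bootstrap resample -- a crude stand-in for the stacked DNNs."""
    scores = np.zeros((len(Z_test), 2))
    for _ in range(n_models):
        idx = rng.integers(0, len(Z_train), len(Z_train))  # bootstrap sample
        for c in (0, 1):
            centroid = Z_train[idx][y_train[idx] == c].mean(axis=0)
            scores[:, c] -= np.linalg.norm(Z_test - centroid, axis=1)
    return scores.argmax(axis=1)  # closer centroid -> higher score

pred = ensemble_predict(Z, y, Z)
print("training accuracy:", (pred == y).mean())
```

In practice the feature stage would use a library such as librosa (as cited in the paper) and the classifier stage trained DNNs, but the structure of the computation is the same.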
Acknowledgements
The work reported herein was funded jointly by the National Natural Science Foundation of China for Young Scholars (Grant No. 61801471), the Youth Innovation Promotion Association CAS (Grant No. 2021022), the Development Fund for Shanghai Talents (Grant No. 2020011), and the Jiading Youth Talents Program.
Author information
Contributions
Conceptualization, C.L.; methodology, C.L.; validation, C.L., F.H.; investigation, C.L.; writing—original draft preparation, F.H. and C.L.; visualization, C.L., Y.C., and Y.Z.; project administration, F.H. and H.F.; funding acquisition, F.H. and H.F. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Competing Interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Liu, C., Hong, F., Feng, H. et al. Environmental Sound Classification Based on Stacked Concatenated DNN using Aggregated Features. J Sign Process Syst 93, 1287–1299 (2021). https://doi.org/10.1007/s11265-021-01702-x