Abstract
The sound event recognition (SER) task is gaining importance in emerging applications such as machine audition, audio surveillance, and environmental audio scene recognition. Recognizing sound events under noisy conditions in real-time surveillance applications is a difficult task. In this paper, we focus on learning patterns from multiple forms (views) of the given sound events. We propose two variants of a Multi-View Representation (MVR)-based approach for the SER task. The first variant combines auditory image-based features with cepstral features of the sound signal. The second variant combines statistical features extracted from the auditory images with the cepstral features of the sound signal. In addition to these variants, Constant-Q transform and Variable-Q transform image-based features are explored to study other effective forms of multi-view representation. A discriminative model-based classifier is then used to recognize these representations as environmental sound events. The performance of the proposed MVR approaches is evaluated on three benchmark sound event datasets, namely ESC-50, DCASE2016 Task 2, and DCASE2018 Task 2. The recognition accuracy of the proposed MVR approach is significantly better than that of other approaches reported in the recent literature.
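To make the pipeline concrete, the sketch below shows one way such a multi-view representation could be assembled. It is an illustration only, assuming librosa and scikit-learn: MFCCs stand in for the cepstral view, per-band statistics of a Constant-Q image stand in for the auditory-image view, and an SVM stands in for the discriminative classifier. None of these choices is claimed to match the paper's exact implementation.

    # Illustrative multi-view feature sketch (assumed libraries: librosa, scikit-learn).
    import numpy as np
    import librosa
    from sklearn.svm import SVC

    def multi_view_features(path, sr=22050, n_mfcc=13):
        # Load the sound event clip at a fixed sampling rate.
        y, _ = librosa.load(path, sr=sr)

        # View 1: cepstral features (MFCCs), pooled over time.
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        cepstral_view = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

        # View 2: per-band statistics of the log-magnitude Constant-Q image
        # (librosa.vqt could be substituted here for a Variable-Q view).
        cqt_db = librosa.amplitude_to_db(np.abs(librosa.cqt(y, sr=sr)))
        image_view = np.concatenate([cqt_db.mean(axis=1), cqt_db.std(axis=1)])

        # Fuse the two views into a single representation.
        return np.concatenate([cepstral_view, image_view])

    # Discriminative classification over the fused representation;
    # train_paths and train_labels are hypothetical placeholders for a labeled dataset.
    # X = np.stack([multi_view_features(p) for p in train_paths])
    # clf = SVC(kernel="rbf").fit(X, train_labels)

Pooling each view to fixed-length statistics before concatenation lets clips of different durations share one feature space, which is what allows a single discriminative classifier to operate on the fused representation.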




Availability of data and material
The datasets used in our studies, namely ESC-50, DCASE2016 Task 2, and DCASE2018 Task 2, are publicly available.
Code availability
The code is available from the corresponding author upon request.
Acknowledgements
The authors would like to acknowledge the financial support of the Department of Science and Technology, Government of India, under the 'Cognitive Science Research Initiative (CSRI)' (Project No. DST/CSRI/2017/131(G)) to carry out this work.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Chandrakala, S., M, V., N, S. et al. Multi-view representation for sound event recognition. SIViP 15, 1211–1219 (2021). https://doi.org/10.1007/s11760-020-01851-9