Abstract
Sound Event Localization and Detection (SELD) is the task of spatially and temporally localizing various sound events and classifying them. Commonly, multitask models are used to perform SELD. In this work, a deep learning model named channel-wise ‘FusionNet’ is designed to perform the SELD task. A novel fusion layer is introduced into the regular Deep Neural Network (DNN): the input is fed channel-wise, and the outputs of all channels are fused to form a new feature representation. The key contribution of this work is a neural network model that retains the channel-wise information of the multichannel input along with the spatial and temporal information. The proposed network uses separable convolution blocks in its convolutional layers, keeping the model's complexity low in terms of both time and space. The input features are Mel-band energies for Sound Event Detection (SED) and intensity vectors for Direction-of-Arrival (DOA) estimation. The proposed network's fusion layer provides a better feature representation for both the SED and DOA estimation tasks. Experiments are performed on recordings in the First-order Ambisonic (FOA) array format of the TAU-NIGENS Spatial Sound Events 2020 dataset. Improved performance in terms of Error Rate (ER), DOA error, and Frame Recall (FR) is observed in comparison with state-of-the-art SELD systems.
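As an illustrative sketch only: the channel-wise processing and fusion idea described in the abstract can be expressed in plain NumPy. The kernel sizes, the shared per-branch weights, and concatenation as the fusion operator below are assumptions for illustration, not the paper's exact FusionNet configuration.

```python
import numpy as np

def depthwise_separable_conv(x, depth_kernels, point_weights):
    """Depthwise separable 2-D convolution, valid padding.
    x: (H, W, C) feature map; depth_kernels: (k, k, C), one spatial
    filter per channel; point_weights: (C, F) pointwise (1x1) mixing."""
    k = depth_kernels.shape[0]
    H, W, C = x.shape
    out_h, out_w = H - k + 1, W - k + 1
    depth_out = np.zeros((out_h, out_w, C))
    for c in range(C):          # depthwise step: filter each channel alone
        for i in range(out_h):
            for j in range(out_w):
                depth_out[i, j, c] = np.sum(
                    x[i:i + k, j:j + k, c] * depth_kernels[:, :, c])
    return depth_out @ point_weights  # pointwise step mixes channels

def channelwise_fusion(channels, depth_kernels, point_weights):
    """Run each input channel through its own branch (shared weights here
    for brevity), then fuse the branch outputs by concatenation."""
    feats = [depthwise_separable_conv(ch[:, :, None], depth_kernels,
                                      point_weights) for ch in channels]
    return np.concatenate(feats, axis=-1)

# Toy input: 4 ambisonic channels of an 8x8 time-frequency patch.
rng = np.random.default_rng(0)
fused = channelwise_fusion([rng.standard_normal((8, 8)) for _ in range(4)],
                           rng.standard_normal((3, 3, 1)),
                           rng.standard_normal((1, 2)))
print(fused.shape)  # fused representation keeps per-channel features
```

The separable factorization also illustrates the complexity claim: a k x k depthwise filter plus a 1 x 1 pointwise projection costs k^2 C + C F weights, versus k^2 C F for a standard convolution with the same output width.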





Data Availability
The datasets discussed in the manuscript are publicly available for research purposes. (URL: https://dcase.community/workshop2020/proceedings.) [28]
References
Adavanne S, Parascandolo G, Pertilä P, et al (2016) Sound event detection in multichannel audio using spatial and harmonic features. In: Workshop on detection and classification of acoustic scenes and events, pp 6–10
Adavanne S, Politis A, Nikunen J et al (2018) Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE J Sel Topics Signal Process 13(1):34–48
Akbacak M, Hansen JH (2007) Environmental sniffing: noise knowledge estimation for robust speech systems. IEEE Transactions on Audio, Speech, and Language Processing 15(2):465–477
Aletta F, Kang J, Astolfi A et al (2016) Differences in soundscape appreciation of walking sounds from different footpath materials in urban parks. Sustain Cities Soc 27:367–376
Benesty J, Chen J, Huang Y (2004) Time-delay estimation via linear interpolation and cross correlation. IEEE Trans on Speech Audio Process 12(5):509–519
Cakir E, Heittola T, Huttunen H, et al (2015) Polyphonic sound event detection using multi label deep neural networks. In: The international joint conference on neural networks (IJCNN). IEEE, pp 1–7
Cakır E, Parascandolo G, Heittola T et al (2017) Convolutional recurrent neural networks for polyphonic sound event detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25(6):1291–1303
Cao Y, Iqbal T, Kong Q, et al (2020) Event-independent network for polyphonic sound event localization and detection. Tech. rep., DCASE2020 Challenge
Carletti V, Foggia P, Percannella G, et al (2013) Audio surveillance using a bag of aural words classifier. In: the 10th International conference on advanced video and signal based surveillance. IEEE, pp 81–86
Chakrabarty S, Habets E (2017) Multi-speaker localization using convolutional neural network trained with noise. In: Workshop on machine learning for audio processing, pp 1–5
Chollet F (2017) Xception: Deep learning with depthwise separable convolutions. In: The IEEE conference on computer vision and pattern recognition (CVPR), pp 1800–1807
Chu S, Narayanan S, Kuo CJ, et al (2006) Where am I? Scene recognition for mobile robots using audio features. In: The International conference on multimedia and expo. IEEE, pp 885–888
DiBiase JH (2000) A High-accuracy, Low-latency Technique for Talker Localization in Reverberant Environments Using Microphone Arrays. Brown University Providence, RI
DiBiase JH, Silverman HF, Brandstein MS (2001) Robust localization in reverberant rooms. In: Microphone arrays. Springer, p 157–180
Hayashi T, Watanabe S, Toda T et al (2017) Duration-controlled LSTM for polyphonic sound event detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25(11):2059–2070
Hirvonen T (2015) Classification of spatial audio location and content using convolutional neural networks. In: Audio Engineering Society Convention 138
Huang Y, Benesty J, Elko GW et al (2001) Real-time passive source localization: A practical linear-correction least-squares approach. IEEE Trans Speech Audio Process 9(8):943–956
Huang Z, Liu C, Fei H et al (2020) Urban sound classification based on 2-order dense convolutional network using dual features. Appl Acoust 164:107243
Jayalakshmi S, Chandrakala S, Nedunchelian R (2018) Global statistical features-based approach for acoustic event detection. Appl Acoust 139:113–118
Jeong IY, Lee S, Han Y, et al (2017) Audio event detection using multiple-input convolutional neural network. Detection and Classification of Acoustic Scenes and Events (DCASE) pp 51–54
Kapka S, Lewandowski M (2019) Sound source detection, localization and classification using consecutive ensemble of CRNN models. Tech. rep., Detection Classification Acoustic Scenes Events Workshop
LiHong P, Xue Z, Ping C, et al (2019) Polyphonic sound event detection and localization using a two-stage strategy. Tech. rep., Detection Classification Acoustic Scenes Events Workshop
Lopatka K, Kotus J, Czyzewski A (2016) Detection, classification and localization of acoustic events in the presence of background noise for acoustic surveillance of hazardous situations. Multimed Tools Appl 75(17):10407–10439
Mesaros A, Heittola T, Eronen A, et al (2010) Acoustic event detection in real life recordings. In: The 18th european signal processing conference. IEEE, pp 1267–1271
Mesaros A, Adavanne S, Politis A, et al (2019) Joint measurement of localization and detection of sound events. In: IEEE Workshop on applications of signal processing to audio and acoustics (WASPAA)
Phan H, Hertel L, Maass M, et al (2016) Robust audio event recognition with 1-max pooling convolutional neural networks. arXiv
Phan H, Pham L, Koch P, et al (2020) Audio event detection and localization with multitask regression network. Tech. rep., DCASE2020 Challenge
Politis A, Adavanne S, Virtanen T (2020) A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection. In: Proceedings of the workshop on detection and classification of acoustic scenes and events
Politis A, Mesaros A, Adavanne S et al (2020) Overview and evaluation of sound event localization and detection in DCASE 2019. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29:684–698
Spoorthy V, Koolagudi SG (2023) Polyphonic sound event detection using Mel-Pseudo constant Q-Transform and deep neural network. IETE Journal of Research pp 1–13
Spoorthy V, Koolagudi SG (2023) A transpose-SELDNet for polyphonic sound event localization and detection. In: 2023 IEEE 8th international conference for convergence in technology (I2CT). IEEE, pp 1–6
Wang Q, Wu H, Jing Z, et al (2020) The USTC-IFLYTEK system for sound event localization and detection of DCASE2020 challenge. Tech. rep., DCASE2020 Challenge
Weiping Z, Jiantao Y, Xiaotao X, et al (2017) Acoustic scene classification using deep convolutional neural network and multiple spectrograms fusion. Detection and Classification of Acoustic Scenes and Events (DCASE)
Zhang H, McLoughlin I, Song Y (2015) Robust sound event recognition using convolutional neural networks. In: 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 559–563
Zöhrer M, Pernkopf F (2017) Virtual adversarial training and data augmentation for acoustic event detection with gated recurrent neural networks. In: Interspeech, pp 493–497
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Author information
Contributions
Conceptualization: Spoorthy. V, Shashidhar G. Koolagudi; Methodology: Spoorthy. V; Formal analysis and investigation: Spoorthy. V; Writing - original draft preparation: Spoorthy. V; Writing - review and editing: Shashidhar G. Koolagudi; Supervision: Shashidhar G. Koolagudi
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
V., S., Koolagudi, S.G. Polyphonic sound event localization and detection using channel-wise FusionNet. Appl Intell 54, 5015–5026 (2024). https://doi.org/10.1007/s10489-024-05438-6