
Polyphonic sound event localization and detection using channel-wise FusionNet

Published in: Applied Intelligence

Abstract

Sound Event Localization and Detection (SELD) is the task of spatially and temporally localizing various sound events and classifying them. Commonly, multitask models are used to perform SELD. In this work, a deep learning network model named channel-wise ‘FusionNet’ is designed for the SELD task. A novel fusion layer is introduced into a regular Deep Neural Network (DNN): the input is fed channel-wise, and the outputs of all channels are fused to form a new feature representation. The key contribution of this work is a neural network model that retains the channel-wise information of the multichannel input along with the spatial and temporal information. The proposed network uses separable convolution blocks in its convolutional layers, so the model has low complexity in both time and space. The input features are Mel-band energies for Sound Event Detection (SED) and intensity vectors for Direction-of-Arrival (DOA) estimation. The fusion layer of the proposed network provides a better feature representation for both the SED and DOA estimation tasks. Experiments are performed on the First-order Ambisonic (FOA) array recordings of the TAU-NIGENS Spatial Sound Events 2020 dataset. Improved performance in terms of Error Rate (ER), DOA error, and Frame Recall (FR) is observed in comparison to state-of-the-art SELD systems.
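The channel-wise fusion idea described above can be illustrated with a small NumPy sketch. This is a hypothetical toy version, not the paper's actual architecture: the per-channel transforms, layer sizes, and the concatenate-then-project fusion step are all illustrative assumptions. Each input channel is processed by its own branch, and the branch outputs are fused into a new feature representation; a simple parameter count also illustrates why depthwise-separable convolutions reduce model complexity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy multichannel input: 4 FOA channels, T time frames, F mel bands
C, T, F = 4, 100, 64
x = rng.standard_normal((C, T, F))

# Channel-wise processing: one small linear transform per channel
# (a stand-in for the per-channel branches of a fusion layer)
H = 32
per_channel_w = [rng.standard_normal((F, H)) / np.sqrt(F) for _ in range(C)]
branch_out = [np.tanh(x[c] @ per_channel_w[c]) for c in range(C)]  # C arrays of (T, H)

# Fusion: concatenate branch outputs along the feature axis and project
fused = np.concatenate(branch_out, axis=-1)          # (T, C*H)
w_fuse = rng.standard_normal((C * H, H)) / np.sqrt(C * H)
fused_repr = np.tanh(fused @ w_fuse)                 # (T, H) fused representation

print(fused.shape, fused_repr.shape)  # (100, 128) (100, 32)

# Parameter counts for a k x k convolution mapping c_in -> c_out channels:
# a depthwise conv plus a 1x1 pointwise conv is far cheaper than a full conv,
# which is the complexity saving separable convolution blocks provide.
def standard_params(c_in, c_out, k=3):
    return c_in * c_out * k * k

def separable_params(c_in, c_out, k=3):
    return c_in * k * k + c_in * c_out  # depthwise + pointwise

print(standard_params(64, 128), separable_params(64, 128))  # 73728 8768
```

For a 3x3 kernel with 64 input and 128 output channels, the separable variant needs roughly an eighth of the parameters, which is the sense in which the model is "low in terms of both time and space".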




Data Availability

The datasets discussed in the manuscript are publicly available for research purposes [28] (URL: https://dcase.community/workshop2020/proceedings).

References

  1. Adavanne S, Parascandolo G, Pertilä P, et al (2016) Sound event detection in multichannel audio using spatial and harmonic features. In: Workshop on detection and classification of acoustic scenes and events, pp 6–10

  2. Adavanne S, Politis A, Nikunen J et al (2018) Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE J Sel Topics Signal Process 13(1):34–48


  3. Akbacak M, Hansen JH (2007) Environmental sniffing: noise knowledge estimation for robust speech systems. IEEE Transactions on Audio, Speech, and Language Processing 15(2):465–477


  4. Aletta F, Kang J, Astolfi A et al (2016) Differences in soundscape appreciation of walking sounds from different footpath materials in urban parks. Sustain Cities Soc 27:367–376


  5. Benesty J, Chen J, Huang Y (2004) Time-delay estimation via linear interpolation and cross correlation. IEEE Trans on Speech Audio Process 12(5):509–519


  6. Cakir E, Heittola T, Huttunen H, et al (2015) Polyphonic sound event detection using multi label deep neural networks. In: The international joint conference on neural networks (IJCNN). IEEE, pp 1–7

  7. Cakır E, Parascandolo G, Heittola T et al (2017) Convolutional recurrent neural networks for polyphonic sound event detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25(6):1291–1303


  8. Cao Y, Iqbal T, Kong Q, et al (2020) Event-independent network for polyphonic sound event localization and detection. Tech. rep., DCASE2020 Challenge

  9. Carletti V, Foggia P, Percannella G, et al (2013) Audio surveillance using a bag of aural words classifier. In: the 10th International conference on advanced video and signal based surveillance. IEEE, pp 81–86

  10. Chakrabarty S, Habets E (2017) Multi-speaker localization using convolutional neural network trained with noise. In: Workshop on machine learning for audio processing, pp 1–5

  11. Chollet F (2017) Xception: Deep learning with depthwise separable convolutions. In: The IEEE conference on computer vision and pattern recognition (CVPR), pp 1800–1807

  12. Chu S, Narayanan S, Kuo CJ, et al (2006) Where am I? Scene recognition for mobile robots using audio features. In: The International conference on multimedia and expo. IEEE, pp 885–888

  13. DiBiase JH (2000) A high-accuracy, low-latency technique for talker localization in reverberant environments using microphone arrays. PhD thesis, Brown University, Providence, RI

  14. DiBiase JH, Silverman HF, Brandstein MS (2001) Robust localization in reverberant rooms. In: Microphone arrays. Springer, pp 157–180

  15. Hayashi T, Watanabe S, Toda T et al (2017) Duration-controlled LSTM for polyphonic sound event detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25(11):2059–2070


  16. Hirvonen T (2015) Classification of spatial audio location and content using convolutional neural networks. In: Audio Engineering Society Convention 138

  17. Huang Y, Benesty J, Elko GW et al (2001) Real-time passive source localization: A practical linear-correction least-squares approach. IEEE Trans Speech Audio Process 9(8):943–956


  18. Huang Z, Liu C, Fei H et al (2020) Urban sound classification based on 2-order dense convolutional network using dual features. Appl Acoust 164:107243


  19. Jayalakshmi S, Chandrakala S, Nedunchelian R (2018) Global statistical features-based approach for acoustic event detection. Appl Acoust 139:113–118


  20. Jeong IY, Lee S, Han Y, et al (2017) Audio event detection using multiple-input convolutional neural network. Detection and Classification of Acoustic Scenes and Events (DCASE) pp 51–54

  21. Kapka S, Lewandowski M (2019) Sound source detection, localization and classification using consecutive ensemble of CRNN models. Tech. rep., Detection Classification Acoustic Scenes Events Workshop

  22. LiHong P, Xue Z, Ping C, et al (2019) Polyphonic sound event detection and localization using a two-stage strategy. Tech. rep., Detection Classification Acoustic Scenes Events Workshop

  23. Lopatka K, Kotus J, Czyzewski A (2016) Detection, classification and localization of acoustic events in the presence of background noise for acoustic surveillance of hazardous situations. Multimed Tools Appl 75(17):10407–10439

  24. Mesaros A, Heittola T, Eronen A, et al (2010) Acoustic event detection in real life recordings. In: The 18th european signal processing conference. IEEE, pp 1267–1271

  25. Mesaros A, Adavanne S, Politis A, et al (2019) Joint measurement of localization and detection of sound events. In: IEEE Workshop on applications of signal processing to audio and acoustics (WASPAA)

  26. Phan H, Hertel L, Maass M, et al (2016) Robust audio event recognition with 1-max pooling convolutional neural networks. arXiv

  27. Phan H, Pham L, Koch P, et al (2020) Audio event detection and localization with multitask regression network. Tech. rep., DCASE2020 Challenge

  28. Politis A, Adavanne S, Virtanen T (2020) A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection. In: Proceedings of the workshop on detection and classification of acoustic scenes and events

  29. Politis A, Mesaros A, Adavanne S et al (2020) Overview and evaluation of sound event localization and detection in DCASE 2019. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29:684–698


  30. Spoorthy V, Koolagudi SG (2023) Polyphonic sound event detection using Mel-Pseudo constant Q-Transform and deep neural network. IETE Journal of Research pp 1–13

  31. Spoorthy V, Koolagudi SG (2023) A transpose-SELDNet for polyphonic sound event localization and detection. In: 2023 IEEE 8th international conference for convergence in technology (I2CT). IEEE, pp 1–6

  32. Wang Q, Wu H, Jing Z, et al (2020) The USTC-IFLYTEK system for sound event localization and detection of DCASE2020 challenge. Tech. rep., DCASE2020 Challenge

  33. Weiping Z, Jiantao Y, Xiaotao X, et al (2017) Acoustic scene classification using deep convolutional neural network and multiple spectrograms fusion. Detection and Classification of Acoustic Scenes and Events (DCASE)

  34. Zhang H, McLoughlin I, Song Y (2015) Robust sound event recognition using convolutional neural networks. In: 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 559–563

  35. Zöhrer M, Pernkopf F (2017) Virtual adversarial training and data augmentation for acoustic event detection with gated recurrent neural networks. In: Interspeech, pp 493–497


Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information


Contributions

Conceptualization: Spoorthy. V, Shashidhar G. Koolagudi; Methodology: Spoorthy. V; Formal analysis and investigation: Spoorthy. V; Writing - original draft preparation: Spoorthy. V; Writing - review and editing: Shashidhar G. Koolagudi; Supervision: Shashidhar G. Koolagudi

Corresponding author

Correspondence to Spoorthy V.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

V., S., Koolagudi, S.G. Polyphonic sound event localization and detection using channel-wise FusionNet. Appl Intell 54, 5015–5026 (2024). https://doi.org/10.1007/s10489-024-05438-6

