Sound Event Localization and Detection Using Parallel Multi-attention Enhancement

Circuits, Systems, and Signal Processing

Abstract

Sound event localization and detection (SELD), which combines sound event detection with direction-of-arrival estimation, is an emerging audio signal processing task with wide application. A popular convolutional recurrent neural network (CRNN)-based method uses a convolutional neural network (CNN) to extract high-level spatial features from manually designed input features and a recurrent neural network to model temporal context. Some studies have shown that a plain CNN is not robust in challenging acoustic environments, such as those with overlapping, moving and discontinuous sources. To improve SELD performance in such complex acoustic scenes, this paper proposes parallel multi-attention enhancement (PMAE), a convolution enhancement method that boosts the representation ability of the CNN. PMAE consists of attention feature enhancement (AFE) and a parallel multi-attention (PMA) block. PMA, embedded in AFE, extracts enhanced global–local features using efficient attention modules applied along different dimensions. AFE, as a feature fusion strategy, fuses multi-scale enhanced features to improve the feature representation and performs well on overlapping sources. PMA extracts the characteristic information of different sound events and, when combined with AFE, performs better on moving and discontinuous sources. With this framework, the SELD system remains robust when the target sources are moving and overlap with interference from unknown classes. Simulations show that the proposed PMAE substantially improves SELD performance without additional techniques such as data augmentation or post-processing.
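
The abstract does not specify the internals of the PMA block, so the following is only a minimal NumPy sketch of the general idea of applying lightweight attention gates along different dimensions in parallel and then fusing the results. Everything in it is an assumption made for illustration: the (channels, time, frequency) layout, the ECA-style pooling-plus-1-D-convolution gate, the fixed kernel standing in for learned weights, the averaging fusion with a residual connection, and the function names `axis_attention` and `parallel_multi_attention`. It is not the authors' implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def axis_attention(x, axis, kernel):
    """ECA-style gate along one axis of a (channels, time, freq) feature map."""
    other_axes = tuple(a for a in range(x.ndim) if a != axis)
    descriptor = x.mean(axis=other_axes)                  # global average pooling -> 1-D vector
    mixed = np.convolve(descriptor, kernel, mode="same")  # local interaction between neighbours
    gate = sigmoid(mixed)                                 # attention weights in (0, 1)
    shape = [1] * x.ndim
    shape[axis] = x.shape[axis]
    return x * gate.reshape(shape)                        # broadcast the gate over the other axes


def parallel_multi_attention(x, kernel):
    """Apply one gate per axis in parallel, average the branches, add a residual."""
    branches = [axis_attention(x, axis, kernel) for axis in range(x.ndim)]
    return x + np.mean(branches, axis=0)


rng = np.random.default_rng(0)
features = rng.standard_normal((4, 10, 8))  # hypothetical CNN output: (channels, time, freq)
kernel = np.array([0.25, 0.5, 0.25])        # fixed stand-in for a learned 1-D kernel
enhanced = parallel_multi_attention(features, kernel)
print(enhanced.shape)                       # (4, 10, 8)
```

Because each gate lies strictly between 0 and 1, every branch only rescales the input, and the residual connection keeps the original features intact. In a trained network the kernel would be learned and the fusion step would likely be more elaborate than a plain average.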

Acknowledgements

The authors would like to thank the editor and anonymous reviewers for their valuable comments. This work was supported by the National Natural Science Foundation of China (61571279).

Author information

Corresponding author

Correspondence to Qinghua Huang.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, Z., Huang, Q. Sound Event Localization and Detection Using Parallel Multi-attention Enhancement. Circuits Syst Signal Process 43, 545–567 (2024). https://doi.org/10.1007/s00034-023-02489-x

  • DOI: https://doi.org/10.1007/s00034-023-02489-x
