Sound Event Localization and Detection Using Parallel Multi-attention Enhancement

Chen, Zhengyu; Huang, Qinghua

doi:10.1007/s00034-023-02489-x

Published: 05 September 2023

Volume 43, pages 545–567, (2024)

Circuits, Systems, and Signal Processing

Abstract

Sound event localization and detection (SELD) combines sound event detection with direction-of-arrival estimation. It is an emerging audio signal processing task that is widely applied in many areas. A popular convolutional recurrent neural network (CRNN)-based method uses a convolutional neural network (CNN) to extract high-level spatial features from manually designed features and a recurrent neural network to model sequence context information. Some studies have shown that a standard CNN is not robust in challenging acoustic environments such as those with overlapping, moving and discontinuous sources. To improve the performance of SELD in more complex acoustic scenes, this paper proposes parallel multi-attention enhancement (PMAE), a convolution enhancement method that boosts the representation ability of the CNN. PMAE consists of an attention feature enhancement (AFE) block and a parallel multi-attention (PMA) block. PMA, embedded into AFE, extracts enhanced global–local features using efficient attention modules along different dimensions. AFE, as a feature fusion strategy, fuses multi-scale enhanced features to improve feature representation. AFE performs well on overlapping sources. PMA adequately extracts characteristic information of different sound events and, when combined with AFE, performs better on moving and discontinuous sources. With this framework, the SELD system remains robust when the target sources are moving and overlap with interference from unknown classes. Simulations show that the proposed PMAE substantially improves SELD performance without relying on other techniques such as data augmentation and post-processing.
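The general idea of gating a feature map along several dimensions in parallel and then fusing two feature maps with an attention-derived mask can be sketched as follows. This is only an illustrative sketch and not the authors' implementation: the abstract does not specify the attention modules or the fusion rule, so the function names, the sigmoid gating over mean-pooled descriptors, the averaging of the three branches, and the soft-mask fusion are all assumptions chosen for brevity.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def axis_attention(x, axis):
    """Gate a (channels, time, freq) feature map along one axis.

    Hypothetical helper: averages over the other two axes to get one
    descriptor per index of `axis`, squashes it with a sigmoid, and
    rescales the input with the resulting weights.
    """
    other = tuple(a for a in range(x.ndim) if a != axis)
    weights = sigmoid(x.mean(axis=other, keepdims=True))
    return x * weights


def parallel_multi_attention(x):
    """Assumed stand-in for a PMA-like block: gate along the channel,
    time and frequency axes in parallel, then average the branches."""
    branches = [axis_attention(x, axis) for axis in range(x.ndim)]
    return np.mean(branches, axis=0)


def attention_fusion(a, b):
    """Assumed stand-in for an AFE-like fusion: blend two same-shaped
    feature maps with a soft mask computed from their sum."""
    mask = sigmoid(parallel_multi_attention(a + b))
    return mask * a + (1.0 - mask) * b


rng = np.random.default_rng(0)
features = rng.standard_normal((4, 10, 8))  # (channels, time frames, frequency bins)
enhanced = rng.standard_normal((4, 10, 8))  # stand-in for a second-scale feature map
fused = attention_fusion(features, enhanced)
print(fused.shape)  # (4, 10, 8)
```

Because the mask lies strictly between 0 and 1, each fused value is a convex combination of the two inputs at that position. In a trained network the gating weights would come from learned layers rather than from plain mean pooling.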

Acknowledgements

The authors would like to thank the editor and anonymous reviewers for their valuable comments. This work was supported by the National Natural Science Foundation of China (61571279).

Author information

Authors and Affiliations

School of Communication and Information Engineering, Shanghai University, Shanghai, China
Zhengyu Chen & Qinghua Huang

Corresponding author

Correspondence to Qinghua Huang.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, Z., Huang, Q. Sound Event Localization and Detection Using Parallel Multi-attention Enhancement. Circuits Syst Signal Process 43, 545–567 (2024). https://doi.org/10.1007/s00034-023-02489-x

Received: 15 February 2023
Revised: 08 August 2023
Accepted: 09 August 2023
Published: 05 September 2023
Issue Date: January 2024
DOI: https://doi.org/10.1007/s00034-023-02489-x
