Abstract
Sound event localization and detection (SELD), which combines sound event detection with direction-of-arrival estimation, is an emerging audio signal processing task that is widely applied in many areas. A popular convolutional recurrent neural network (CRNN)-based method uses a convolutional neural network (CNN) to extract high-level spatial features from manually designed features and a recurrent neural network to model sequence context information. Some studies have shown that a standard CNN is not robust in challenging acoustic environments such as those with overlapping, moving and discontinuous sources. To improve the performance of SELD in more complex acoustic scenes, this paper proposes parallel multi-attention enhancement (PMAE), a convolution enhancement method that boosts the representation ability of the CNN. PMAE consists of an attention feature enhancement (AFE) block and a parallel multi-attention (PMA) block. PMA, which is embedded into AFE, extracts enhanced global–local features using efficient attention modules applied along different dimensions. AFE, as a feature fusion strategy, fuses multi-scale enhanced features to improve feature representation. AFE performs well on overlapping sources. PMA adequately extracts the characteristic information of different sound events and, when combined with AFE, performs better on moving and discontinuous sources. With this framework, the SELD system remains robust even when the target sources are moving and overlap with interference from unknown classes. Simulations show that the proposed PMAE substantially improves SELD performance without other techniques such as data augmentation and post-processing.
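To make the idea of attention applied in parallel along different dimensions more concrete, the following is a minimal NumPy sketch. It is an illustration under stated assumptions, not the PMA or AFE block of the paper: the `(channel, time, frequency)` layout, the parameter-free sigmoid gate, the averaging of branches and the residual connection are all assumptions, and the names `axis_gate` and `parallel_attention` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def axis_gate(x, axis):
    """Return a gate in (0, 1) for each index along `axis`.

    Assumption: a parameter-free gate built from a global average
    descriptor; a trained model would use learned weights instead.
    """
    other = tuple(a for a in range(x.ndim) if a != axis)
    desc = x.mean(axis=other, keepdims=True)           # global context per index
    desc = (desc - desc.mean()) / (desc.std() + 1e-8)  # standardize the descriptor
    return sigmoid(desc)


def parallel_attention(x):
    """Gate a (channel, time, frequency) map along each axis in parallel.

    Assumption: branches are fused by simple averaging and added back
    to the input as a residual.
    """
    branches = [x * axis_gate(x, axis) for axis in range(3)]
    return x + np.mean(branches, axis=0)


rng = np.random.default_rng(0)
features = rng.standard_normal((8, 20, 16))  # toy CNN feature map
enhanced = parallel_attention(features)
print(enhanced.shape)
```

Each branch rescales the same feature map using context pooled over the other two axes, so the channel, time and frequency views are computed independently and only combined at the fusion step.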
Acknowledgements
The authors would like to thank the editor and the anonymous reviewers for their valuable comments. This work was supported by the National Natural Science Foundation of China (61571279).
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
About this article
Cite this article
Chen, Z., Huang, Q. Sound Event Localization and Detection Using Parallel Multi-attention Enhancement. Circuits Syst Signal Process 43, 545–567 (2024). https://doi.org/10.1007/s00034-023-02489-x
DOI: https://doi.org/10.1007/s00034-023-02489-x