Sound event localization and detection (SELD) systems provide intelligent sound processing and analysis for a wide range of devices. However, most existing deep learning-based networks simply concatenate convolutional neural networks (CNNs) and recurrent neural networks, which loses key feature information in the audio and makes accurate localization and detection more difficult. In this paper, we propose a local–global adaptive fusion and temporal importance network model. First, a CNN block and a multi-scale enhanced axial cross attention Transformer block learn the local and global features, respectively. Then, an adaptive fusion module fuses the local and global features. Finally, a positional attention temporal context module explores the positional information in the sound temporal sequence to capture the important features. Experimental results on the Sony-TAu Realistic Spatial Soundscapes 2022 dataset and the synthetic dataset show that the proposed model reduces \(ER_{20^{\circ }}\) and \(LE_{CD}\) to 0.65 and 22.3\(^{\circ }\), respectively, and increases \(F_{20^{\circ }}\) and \(LR_{CD}\) to 31.1% and 54.8%, respectively. The comprehensive evaluation metric, the \(SELD\ score\), is reduced to 0.48, which is better than the other models compared.
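
The abstract does not state how the four metrics combine into the SELD score. As a minimal sketch, assuming the aggregation commonly used in DCASE Task 3 (the mean of the error rate, one minus the F-score, the localization error divided by 180°, and one minus the localization recall), the reported values can be checked as follows. The function name `seld_score` and its parameter names are hypothetical and chosen only for illustration.

```python
def seld_score(er, f_score, le_deg, lr):
    """Average four error-like terms into one score (lower is better).

    Assumed formula (DCASE Task 3 convention, not stated in the abstract):
        SELD = (ER + (1 - F) + LE / 180 + (1 - LR)) / 4
    """
    terms = [
        er,               # location-dependent error rate
        1.0 - f_score,    # F-score converted to an error
        le_deg / 180.0,   # localization error in degrees, normalized by 180
        1.0 - lr,         # localization recall converted to an error
    ]
    return sum(terms) / len(terms)


# Values reported in the abstract: ER = 0.65, F = 31.1%, LE = 22.3 deg, LR = 54.8%
score = seld_score(er=0.65, f_score=0.311, le_deg=22.3, lr=0.548)
print(round(score, 2))  # prints 0.48
```

Under this assumed formula, the four reported metrics reproduce the SELD score of 0.48 given in the abstract.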

Data availability
The data used in this article are openly available from DCASE 2022 Task 3 at https://dcase.community/challenge2022 (reference [27]).
References
Adavanne, S., Politis, A., Nikunen, J., Virtanen, T.: Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing 13(1), 34–48 (2018). https://doi.org/10.1109/JSTSP.2018.2885636
Dabran, I., Elmakias, O., Shmelkin, R., Zusman, Y.: An intelligent sound alarm recognition system for smart cars and smart homes. In: NOMS 2018-2018 IEEE/IFIP Network Operations and Management Symposium, pp. 1–4 (2018). https://doi.org/10.1109/NOMS.2018.8406181
Foggia, P., Petkov, N., Saggese, A., Strisciuglio, N., Vento, M.: Reliable detection of audio events in highly noisy environments. Pattern Recognition Letters 65, 22–28 (2015). https://doi.org/10.1016/j.patrec.2015.06.026
Mao, Y., Zeng, Y., Liu, H., Zhu, W., Zhou, Y.: ICASSP 2022 L3DAS22 Challenge: Ensemble of Resnet-Conformers with ambisonics data augmentation for sound event localization and detection. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 9191–9195 (2022). https://doi.org/10.1109/ICASSP43922.2022.9746673
Humayun, A.I., Ghaffarzadegan, S., Feng, Z., Hasan, T.: Learning front-end filter-bank parameters using convolutional neural networks for abnormal heart sound detection. In: 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 1408–1411 (2018). https://doi.org/10.1109/EMBC.2018.8512578
Valenzise, G., Gerosa, L., Tagliasacchi, M., Antonacci, F., Sarti, A.: Scream and gunshot detection and localization for audio-surveillance systems. In: 2007 IEEE Conference on Advanced Video and Signal Based Surveillance, pp. 21–26 (2007). https://doi.org/10.1109/AVSS.2007.4425280
Butko, T., Pla, F.G., Segura, C., Nadeu, C., Hernando, J.: Two-source acoustic event detection and localization: Online implementation in a smart-room. In: 2011 19th European Signal Processing Conference, pp. 1317–1321 (2011)
Lopatka, K., Kotus, J., Czyzewski, A.: Detection, classification and localization of acoustic events in the presence of background noise for acoustic surveillance of hazardous situations. Multimedia Tools and Applications 75, 10407–10439 (2016). https://doi.org/10.1007/s11042-015-3105-4
Huang, Y., Benesty, J., Elko, G.W., Mersereau, R.M.: Real-time passive source localization: A practical linear-correction least-squares approach. IEEE Transactions on Speech and Audio Processing 9(8), 943–956 (2001). https://doi.org/10.1109/89.966097
Roy, R., Kailath, T.: Esprit-estimation of signal parameters via rotational invariance techniques. IEEE Trans. Acoust. Speech Signal Process. 37(7), 984–995 (1989). https://doi.org/10.1109/29.32276
Schmidt, R.: Multiple emitter location and signal parameter estimation. IEEE Trans. Antennas Propag. 34(3), 276–280 (1986). https://doi.org/10.1109/TAP.1986.1143830
Zinemanas, P., Cancela, P., Rocamora, M.: End-to-end convolutional neural networks for sound event detection in urban environments. In: 2019 24th Conference of Open Innovations Association (FRUCT), pp. 533–539 (2019). https://doi.org/10.23919/FRUCT.2019.8711906
Wang, Y., Zhao, G., Xiong, K., Shi, G., Zhang, Y.: Multi-scale and single-scale fully convolutional networks for sound event detection. Neurocomputing 421, 51–65 (2021). https://doi.org/10.1016/j.neucom.2020.09.038
Hayashi, T., Watanabe, S., Toda, T., Hori, T., Le Roux, J., Takeda, K.: Duration-controlled LSTM for polyphonic sound event detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25(11), 2059–2070 (2017). https://doi.org/10.1109/TASLP.2017.2740002
Zöhrer, M., Pernkopf, F.: Virtual adversarial training and data augmentation for acoustic event detection with gated recurrent neural networks. In: Interspeech, pp. 493–497 (2017). https://doi.org/10.21437/Interspeech.2017-1238
Guirguis, K., Schorn, C., Guntoro, A., Abdulatif, S., Yang, B.: SELD-TCN: Sound event localization & detection via temporal convolutional networks. In: 2020 28th European Signal Processing Conference (EUSIPCO), pp. 16–20 (2021). https://doi.org/10.23919/Eusipco47968.2020.9287716
Shimada, K., Koyama, Y., Takahashi, S., Takahashi, N., Tsunoo, E., Mitsufuji, Y.: Multi-accdoa: Localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 316–320 (2022). https://doi.org/10.1109/ICASSP43922.2022.9746384
Kim, J.S., Park, H.J., Shin, W., Han, S.W.: A robust framework for sound event localization and detection on real recordings. Technical report, DCASE2022 Challenge (2022)
Wang, Q., Chai, L., Wu, H., Nian, Z., Niu, S., Zheng, S., Wang, Y., Sun, L., Fang, Y., Pan, J., et al.: The NERC-SLIP system for sound event localization and detection of dcase2022 challenge. Technical report, DCASE2022 Challenge (2022)
Gu, J., Kwon, H., Wang, D., Ye, W., Li, M., Chen, Y.-H., Lai, L., Chandra, V., Pan, D.Z.: Multi-scale high-resolution vision transformer for semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12094–12103 (2022). https://doi.org/10.1109/CVPR52688.2022.01178
Qi, X., Wang, J., Chen, Y., Shi, Y., Zhang, L.: Lipsformer: Introducing lipschitz continuity to vision transformers. Preprint at https://arxiv.org/abs/2304.09856 (2023)
Wan, Q., Huang, Z., Lu, J., Gang, Y., Zhang, L.: Seaformer: Squeeze-enhanced axial transformer for mobile semantic segmentation. In: The Eleventh International Conference on Learning Representations (2023)
Chen, X., Li, H., Li, M., Pan, J.: Learning a sparse transformer network for effective image deraining. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5896–5905 (2023). https://doi.org/10.1109/CVPR52729.2023.00571
Li, X., Wang, W., Hu, X., Yang, J.: Selective kernel networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 510–519 (2019). https://doi.org/10.1109/CVPR.2019.00060
Psomas, B., Kakogeorgiou, I., Karantzalos, K., Avrithis, Y.: Keep it SimPool: Who said supervised transformers suffer from attention deficit? In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5350–5360 (2023). https://doi.org/10.1109/ICCV51070.2023.00493
Li, Y., Si, S., Li, G., Hsieh, C.-J., Bengio, S.: Learnable fourier features for multi-dimensional spatial positional encoding. Adv. Neural. Inf. Process. Syst. 34, 15816–15829 (2021)
Politis, A., Shimada, K., Sudarsanam, P., Adavanne, S., Krause, D., Koyama, Y., Takahashi, N., Takahashi, S., Mitsufuji, Y., Virtanen, T.: STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. Preprint at https://arxiv.org/abs/2206.01948 (2022)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q.V., Salakhutdinov, R.: Transformer-xl: Attentive language models beyond a fixed-length context. Preprint at https://arxiv.org/abs/1901.02860 (2019)
Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Soplin, N.E.Y., Heymann, J., Wiesner, M., Chen, N., Renduchintala, A., Ochiai, T.: Espnet: End-to-end speech processing toolkit. Preprint at https://arxiv.org/abs/1804.00015 (2018)
Li, N., Liu, S., Liu, Y., Zhao, S., Liu, M.: Neural speech synthesis with transformer network. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 6706–6713 (2019). https://doi.org/10.1609/aaai.v33i01.33016706
Chu, X., Tian, Z., Zhang, B., Wang, X., Shen, C.: Conditional positional encodings for vision transformers. In: The Eleventh International Conference on Learning Representations (2023)
Wang, W., Chen, W., Qiu, Q., Chen, L., Wu, B., Lin, B., He, X., Liu, W.: Crossformer++: A versatile vision transformer hinging on cross-scale attention. IEEE Trans. Pattern Anal. Mach. Intell. 46(5), 3123–3136 (2023). https://doi.org/10.1109/TPAMI.2023.3341806
Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 568 (2024). https://doi.org/10.1016/j.neucom.2023.127063
Chen, Z., Huang, Q.: GLFE: Global-Local fusion enhancement for sound event localization and detection. Technical report, DCASE2022 Challenge (2022)
Kapka, S., Tkaczuk, J.: CoLoC: Conditioned localizer and classifier for sound event localization and detection. Technical report, DCASE2022 Challenge (2022)
Wu, S., Huang, S., Liu, Z., Liu, J.: Mlp-mixer enhanced crnn for sound event localization and detection in dcase 2022 task 3. Technical report, DCASE2022 Challenge (2022)
Wang, Q., Du, J., Nian, Z., Niu, S., Chai, L., Wu, H., Pan, J., Lee, C.-H.: Loss function design for DNN-based sound event localization and detection on low-resource realistic data. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2023). https://doi.org/10.1109/ICASSP49357.2023.10095144
Kim, J.S., Park, H.J., Shin, W., Han, S.W.: AD-YOLO: You look only once in training multiple sound event localization and detection. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2023). https://doi.org/10.1109/ICASSP49357.2023.10096460
Dao, T., Guo, M., Ma, M.: Sound event localization and detection using a spatial omni-dimensional dynamic interactions network. SIViP 18(2), 1911–1917 (2024). https://doi.org/10.1007/s11760-023-02901-8
This work was supported by the National Natural Science Foundation of China (Grant No. 62377031), the Fundamental Research Funds for the Central Universities (Grant No. GK202105006), and the Key Research and Development Program in Shaanxi Province (Grant No. 2023-YBGY241).
Author information
Contributions
Di Shi contributed to conceptualization, methodology, and writing of the main manuscript. Min Guo provided experimental guidance and participated in revising and reviewing the manuscript. Miao Ma participated in revising and reviewing the manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
About this article
Cite this article
Shi, D., Guo, M. & Ma, M. A research for sound event localization and detection based on local–global adaptive fusion and temporal importance network. Multimedia Systems 30, 367 (2024). https://doi.org/10.1007/s00530-024-01582-8