A research for sound event localization and detection based on local–global adaptive fusion and temporal importance network

  • Regular Paper
  • Published in: Multimedia Systems

Abstract

Sound event localization and detection (SELD) systems can provide intelligent sound processing and analysis for a wide range of application devices. However, most existing deep learning-based networks rely on a simple concatenation of convolutional neural networks (CNNs) and recurrent neural networks, which loses key feature information in the audio and makes accurate localization and detection more difficult. In this paper, we propose a local–global adaptive fusion and temporal importance network. First, a CNN block and a multi-scale enhanced axial cross-attention Transformer block learn the local and global features, respectively. Then, an adaptive fusion module effectively fuses the local and global features. Finally, a positional attention temporal context module exploits the positional information in the sound temporal sequence to capture the important features. Experimental results on the Sony-TAu Realistic Spatial Soundscapes 2022 dataset and a synthetic dataset show that the proposed model reduces \(ER_{20^{\circ }}\) to 0.65 and \(LE_{CD}\) to 22.3\(^{\circ }\), raises \(F_{20^{\circ }}\) to 31.1% and \(LR_{CD}\) to 54.8%, and lowers the comprehensive evaluation metric, the \(SELD\ score\), to 0.48, outperforming the compared models.
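
To make the fusion step concrete, the sketch below shows one plausible realization of the adaptive fusion module described above: a learned channel-wise gate that weighs the CNN branch's local features against the Transformer branch's global features before summing them. This is a minimal PyTorch sketch under our own assumptions; the class name AdaptiveFusion, the sigmoid gating scheme, and the tensor shapes are illustrative and are not taken from the paper.

```python
import torch
import torch.nn as nn


class AdaptiveFusion(nn.Module):
    """Hypothetical adaptive fusion of local (CNN) and global (Transformer) features.

    Both inputs are assumed to have shape (batch, channels, time, freq).
    The gating design is an illustrative assumption, not the published module.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Pool the concatenated branches to channel descriptors, then
        # predict a per-channel mixing weight in [0, 1].
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                           # (B, 2C, 1, 1)
            nn.Conv2d(2 * channels, channels, kernel_size=1),  # (B, C, 1, 1)
            nn.Sigmoid(),
        )

    def forward(self, local_feat: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
        w = self.gate(torch.cat([local_feat, global_feat], dim=1))
        # Convex combination: w favors the local branch, (1 - w) the global branch.
        return w * local_feat + (1.0 - w) * global_feat


# Toy usage: 64-channel features over 100 time frames and 64 frequency bins.
fusion = AdaptiveFusion(channels=64)
local_feat = torch.randn(2, 64, 100, 64)
global_feat = torch.randn(2, 64, 100, 64)
print(fusion(local_feat, global_feat).shape)  # torch.Size([2, 64, 100, 64])
```

For context, the comprehensive metric quoted above is consistent with the standard DCASE aggregation of the four individual metrics, \(SELD = \frac{1}{4}\left(ER_{20^{\circ }} + (1 - F_{20^{\circ }}) + \frac{LE_{CD}}{180^{\circ }} + (1 - LR_{CD})\right)\): plugging in the reported values gives \(\frac{1}{4}(0.65 + 0.689 + 0.124 + 0.452) \approx 0.48\).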

Data availability

The data used in this article are openly available from DCASE 2022 Task 3 at https://dcase.community/challenge2022, reference number [27].

References

  1. Adavanne, S., Politis, A., Nikunen, J., Virtanen, T.: Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing 13(1), 34–48 (2018). https://doi.org/10.1109/JSTSP.2018.2885636

  2. Dabran, I., Elmakias, O., Shmelkin, R., Zusman, Y.: An intelligent sound alarm recognition system for smart cars and smart homes. In: NOMS 2018-2018 IEEE/IFIP Network Operations and Management Symposium, pp. 1–4 (2018). https://doi.org/10.1109/NOMS.2018.8406181

  3. Foggia, P., Petkov, N., Saggese, A., Strisciuglio, N., Vento, M.: Reliable detection of audio events in highly noisy environments. Pattern Recognition Letters 65, 22–28 (2015). https://doi.org/10.1016/j.patrec.2015.06.026

  4. Mao, Y., Zeng, Y., Liu, H., Zhu, W., Zhou, Y.: ICASSP 2022 L3DAS22 Challenge: Ensemble of Resnet-Conformers with ambisonics data augmentation for sound event localization and detection. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 9191–9195 (2022). https://doi.org/10.1109/ICASSP43922.2022.9746673

  5. Humayun, A.I., Ghaffarzadegan, S., Feng, Z., Hasan, T.: Learning front-end filter-bank parameters using convolutional neural networks for abnormal heart sound detection. In: 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 1408–1411 (2018). https://doi.org/10.1109/EMBC.2018.8512578

  6. Valenzise, G., Gerosa, L., Tagliasacchi, M., Antonacci, F., Sarti, A.: Scream and gunshot detection and localization for audio-surveillance systems. In: 2007 IEEE Conference on Advanced Video and Signal Based Surveillance, pp. 21–26 (2007). https://doi.org/10.1109/AVSS.2007.4425280

  7. Butko, T., Pla, F.G., Segura, C., Nadeu, C., Hernando, J.: Two-source acoustic event detection and localization: Online implementation in a smart-room. In: 2011 19th European Signal Processing Conference, pp. 1317–1321 (2011)

  8. Lopatka, K., Kotus, J., Czyzewski, A.: Detection, classification and localization of acoustic events in the presence of background noise for acoustic surveillance of hazardous situations. Multimedia Tools and Applications 75, 10407–10439 (2016). https://doi.org/10.1007/s11042-015-3105-4

  9. Huang, Y., Benesty, J., Elko, G.W., Mersereati, R.M.: Real-time passive source localization: A practical linear-correction least-squares approach. IEEE Transactions on Speech and Audio Processing 9(8), 943–956 (2001). https://doi.org/10.1109/89.966097

  10. Roy, R., Kailath, T.: ESPRIT-estimation of signal parameters via rotational invariance techniques. IEEE Trans. Acoust. Speech Signal Process. 37(7), 984–995 (1989). https://doi.org/10.1109/29.32276

  11. Schmidt, R.: Multiple emitter location and signal parameter estimation. IEEE Trans. Antennas Propag. 34(3), 276–280 (1986). https://doi.org/10.1109/TAP.1986.1143830

  12. Zinemanas, P., Cancela, P., Rocamora, M.: End-to-end convolutional neural networks for sound event detection in urban environments. In: 2019 24th Conference of Open Innovations Association (FRUCT), pp. 533–539 (2019). https://doi.org/10.23919/FRUCT.2019.8711906

  13. Wang, Y., Zhao, G., Xiong, K., Shi, G., Zhang, Y.: Multi-scale and single-scale fully convolutional networks for sound event detection. Neurocomputing 421, 51–65 (2021). https://doi.org/10.1016/j.neucom.2020.09.038

  14. Hayashi, T., Watanabe, S., Toda, T., Hori, T., Le Roux, J., Takeda, K.: Duration-controlled LSTM for polyphonic sound event detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25(11), 2059–2070 (2017). https://doi.org/10.1109/TASLP.2017.2740002

  15. Zöhrer, M., Pernkopf, F.: Virtual adversarial training and data augmentation for acoustic event detection with gated recurrent neural networks. In: Interspeech, pp. 493–497 (2017). https://doi.org/10.21437/Interspeech.2017-1238

  16. Guirguis, K., Schorn, C., Guntoro, A., Abdulatif, S., Yang, B.: SELD-TCN: Sound event localization & detection via temporal convolutional networks. In: 2020 28th European Signal Processing Conference (EUSIPCO), pp. 16–20 (2021). https://doi.org/10.23919/Eusipco47968.2020.9287716

  17. Shimada, K., Koyama, Y., Takahashi, S., Takahashi, N., Tsunoo, E., Mitsufuji, Y.: Multi-ACCDOA: Localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 316–320 (2022). https://doi.org/10.1109/ICASSP43922.2022.9746384

  18. Kim, J.S., Park, H.J., Shin, W., Han, S.W.: A robust framework for sound event localization and detection on real recordings. Technical report, DCASE2022 Challenge (2022)

  19. Wang, Q., Chai, L., Wu, H., Nian, Z., Niu, S., Zheng, S., Wang, Y., Sun, L., Fang, Y., Pan, J., et al.: The NERC-SLIP system for sound event localization and detection of DCASE2022 challenge. Technical report, DCASE2022 Challenge (2022)

  20. Gu, J., Kwon, H., Wang, D., Ye, W., Li, M., Chen, Y.-H., Lai, L., Chandra, V., Pan, D.Z.: Multi-scale high-resolution vision transformer for semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12094–12103 (2022). https://doi.org/10.1109/CVPR52688.2022.01178

  21. Qi, X., Wang, J., Chen, Y., Shi, Y., Zhang, L.: LipsFormer: Introducing Lipschitz continuity to vision transformers. Preprint at https://arxiv.org/abs/2304.09856 (2023)

  22. Wan, Q., Huang, Z., Lu, J., Gang, Y., Zhang, L.: SeaFormer: Squeeze-enhanced axial transformer for mobile semantic segmentation. In: The Eleventh International Conference on Learning Representations (2023)

  23. Chen, X., Li, H., Li, M., Pan, J.: Learning a sparse transformer network for effective image deraining. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5896–5905 (2023). https://doi.org/10.1109/CVPR52729.2023.00571

  24. Li, X., Wang, W., Hu, X., Yang, J.: Selective kernel networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 510–519 (2019). https://doi.org/10.1109/CVPR.2019.00060

  25. Psomas, B., Kakogeorgiou, I., Karantzalos, K., Avrithis, Y.: Keep it SimPool: Who said supervised transformers suffer from attention deficit? In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5350–5360 (2023). https://doi.org/10.1109/ICCV51070.2023.00493

  26. Li, Y., Si, S., Li, G., Hsieh, C.-J., Bengio, S.: Learnable fourier features for multi-dimensional spatial positional encoding. Adv. Neural. Inf. Process. Syst. 34, 15816–15829 (2021)

  27. Politis, A., Shimada, K., Sudarsanam, P., Adavanne, S., Krause, D., Koyama, Y., Takahashi, N., Takahashi, S., Mitsufuji, Y., Virtanen, T.: STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. Preprint at https://arxiv.org/abs/2206.01948 (2022)

  28. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)

  29. Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q.V., Salakhutdinov, R.: Transformer-XL: Attentive language models beyond a fixed-length context. Preprint at https://arxiv.org/abs/1901.02860 (2019)

  30. Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Soplin, N.E.Y., Heymann, J., Wiesner, M., Chen, N., Renduchintala, A., Ochiai, T.: ESPnet: End-to-end speech processing toolkit. Preprint at https://arxiv.org/abs/1804.00015 (2018)

  31. Li, N., Liu, S., Liu, Y., Zhao, S., Liu, M.: Neural speech synthesis with transformer network. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 6706–6713 (2019). https://doi.org/10.1609/aaai.v33i01.33016706

  32. Chu, X., Tian, Z., Zhang, B., Wang, X., Shen, C.: Conditional positional encodings for vision transformers. In: The Eleventh International Conference on Learning Representations (2023)

  33. Wang, W., Chen, W., Qiu, Q., Chen, L., Wu, B., Lin, B., He, X., Liu, W.: Crossformer++: A versatile vision transformer hinging on cross-scale attention. IEEE Trans. Pattern Anal. Mach. Intell. 46(5), 3123–3136 (2023). https://doi.org/10.1109/TPAMI.2023.3341806

  34. Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing 568, 127063 (2024). https://doi.org/10.1016/j.neucom.2023.127063

  35. Chen, Z., Huang, Q.: GLFE: Global-Local fusion enhancement for sound event localization and detection. Technical report, DCASE2022 Challenge (2022)

  36. Kapka, S., Tkaczuk, J.: CoLoC: Conditioned localizer and classifier for sound event localization and detection. Technical report, DCASE2022 Challenge (2022)

  37. Wu, S., Huang, S., Liu, Z., Liu, J.: MLP-Mixer enhanced CRNN for sound event localization and detection in DCASE 2022 Task 3. Technical report, DCASE2022 Challenge (2022)

  38. Wang, Q., Du, J., Nian, Z., Niu, S., Chai, L., Wu, H., Pan, J., Lee, C.-H.: Loss function design for DNN-based sound event localization and detection on low-resource realistic data. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2023). https://doi.org/10.1109/ICASSP49357.2023.10095144

  39. Kim, J.S., Park, H.J., Shin, W., Han, S.W.: AD-YOLO: You look only once in training multiple sound event localization and detection. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2023). https://doi.org/10.1109/ICASSP49357.2023.10096460

  40. Dao, T., Guo, M., Ma, M.: Sound event localization and detection using a spatial omni-dimensional dynamic interactions network. SIViP 18(2), 1911–1917 (2024). https://doi.org/10.1007/s11760-023-02901-8


Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 62377031), the Fundamental Research Funds for the Central Universities (Grant No. GK202105006), and the Key Research and Development Program of Shaanxi Province (Grant No. 2023-YBGY241).

Author information

Authors and Affiliations

Authors

Contributions

Di Shi contributed to conceptualization, methodology, and writing of the main manuscript. Min Guo provided experimental guidance and participated in revising and reviewing the manuscript. Miao Ma participated in revising and reviewing the manuscript.

Corresponding author

Correspondence to Min Guo.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Shi, D., Guo, M. & Ma, M. A research for sound event localization and detection based on local–global adaptive fusion and temporal importance network. Multimedia Systems 30, 367 (2024). https://doi.org/10.1007/s00530-024-01582-8
