Abstract
Sound event localization and detection (SELD) systems can provide intelligent sound processing and analysis functions for a variety of application devices. However, most existing deep learning-based networks rely on a simple concatenation of convolutional neural networks (CNNs) and recurrent neural networks, which loses key feature information in the audio and makes accurate localization and detection more difficult. In this paper, we propose a local–global adaptive fusion and temporal importance network. First, a CNN block and a multi-scale enhanced axial cross attention Transformer block learn local and global features, respectively. The local and global features are then fused effectively by an adaptive fusion module. Finally, a positional attention temporal context module exploits the positional information in the sound sequence to capture the important temporal features. Experimental results on the Sony-TAu Reality Spatial Soundscapes 2022 dataset and a synthetic dataset show that the proposed model reduces \(ER_{20^{\circ }}\) and \(LE_{CD}\) to 0.65 and 22.3\(^{\circ }\), raises \(F_{20^{\circ }}\) and \(LR_{CD}\) to 31.1% and 54.8%, and lowers the comprehensive evaluation metric, the \(SELD\ score\), to 0.48, achieving better performance than competing models.
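The abstract's aggregate can be checked against its four individual metrics. A minimal sketch, assuming the standard DCASE SELD aggregation (the mean of the four error terms: \(ER\), the miss of the F-score, the localization error normalized by 180°, and the miss of the localization recall); the function name `seld_score` is illustrative, not from the paper:

```python
def seld_score(er, f, le_deg, lr):
    """Aggregate SELD score under the standard DCASE formulation (assumed):
    mean of ER, (1 - F), LE/180, and (1 - LR). Lower is better."""
    return (er + (1.0 - f) + le_deg / 180.0 + (1.0 - lr)) / 4.0

# Metrics reported in the abstract:
score = seld_score(er=0.65, f=0.311, le_deg=22.3, lr=0.548)
print(round(score, 2))  # → 0.48, matching the reported SELD score
```

The reported 0.48 is consistent with the four individual metrics under this formulation.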

Data availability
The data used in this article are openly available from DCASE 2022 Task 3 at https://dcase.community/challenge2022, reference number [27].
References
Adavanne, S., Politis, A., Nikunen, J., Virtanen, T.: Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing 13(1), 34–48 (2018). https://doi.org/10.1109/JSTSP.2018.2885636
Dabran, I., Elmakias, O., Shmelkin, R., Zusman, Y.: An intelligent sound alam recognition system for smart cars and smart homes. In: NOMS 2018-2018 IEEE/IFIP Network Operations and Management Symposium, pp. 1–4 (2018). https://doi.org/10.1109/NOMS.2018.8406181
Foggia, P., Petkov, N., Saggese, A., Strisciuglio, N., Vento, M.: Reliable detection of audio events in highly noisy environments. Pattern Recognition Letters 65, 22–28 (2015) https://doi.org/10.1016/j.patrec.2015.06.026
Mao, Y., Zeng, Y., Liu, H., Zhu, W., Zhou, Y.: ICASSP 2022 L3DAS22 Challenge: Ensemble of Resnet-Conformers with ambisonics data augmentation for sound event localization and detection. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 9191–9195 (2022). https://doi.org/10.1109/ICASSP43922.2022.9746673
Humayun, A.I., Ghaffarzadegan, S., Feng, Z., Hasan, T.: Learning front-end filter-bank parameters using convolutional neural networks for abnormal heart sound detection. In: 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 1408–1411 (2018). https://doi.org/10.1109/EMBC.2018.8512578
Valenzise, G., Gerosa, L., Tagliasacchi, M., Antonacci, F., Sarti, A.: Scream and gunshot detection and localization for audio-surveillance systems. In: 2007 IEEE Conference on Advanced Video and Signal Based Surveillance, pp. 21–26 (2007). https://doi.org/10.1109/AVSS.2007.4425280
Butko, T., Pla, F.G., Segura, C., Nadeu, C., Hernando, J.: Two-source acoustic event detection and localization: Online implementation in a smart-room. In: 2011 19th European Signal Processing Conference, pp. 1317–1321 (2011)
Lopatka, K., Kotus, J., Czyzewski, A.: Detection, classification and localization of acoustic events in the presence of background noise for acoustic surveillance of hazardous situations. Multimedia Tools and Applications 75, 10407–10439 (2016) https://doi.org/10.1007/s11042-015-3105-4
Huang, Y., Benesty, J., Elko, G.W., Mersereau, R.M.: Real-time passive source localization: A practical linear-correction least-squares approach. IEEE Transactions on Speech and Audio Processing 9(8), 943–956 (2001). https://doi.org/10.1109/89.966097
Roy, R., Kailath, T.: ESPRIT: Estimation of signal parameters via rotational invariance techniques. IEEE Trans. Acoust. Speech Signal Process. 37(7), 984–995 (1989). https://doi.org/10.1109/29.32276
Schmidt, R.: Multiple emitter location and signal parameter estimation. IEEE Trans. Antennas Propag. 34(3), 276–280 (1986). https://doi.org/10.1109/TAP.1986.1143830
Zinemanas, P., Cancela, P., Rocamora, M.: End-to-end convolutional neural networks for sound event detection in urban environments. In: 2019 24th Conference of Open Innovations Association (FRUCT), pp. 533–539 (2019). https://doi.org/10.23919/FRUCT.2019.8711906
Wang, Y., Zhao, G., Xiong, K., Shi, G., Zhang, Y.: Multi-scale and single-scale fully convolutional networks for sound event detection. Neurocomputing 421, 51–65 (2021) https://doi.org/10.1016/j.neucom.2020.09.038
Hayashi, T., Watanabe, S., Toda, T., Hori, T., Le Roux, J., Takeda, K.: Duration-controlled LSTM for polyphonic sound event detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25(11), 2059–2070 (2017). https://doi.org/10.1109/TASLP.2017.2740002
Zöhrer, M., Pernkopf, F.: Virtual adversarial training and data augmentation for acoustic event detection with gated recurrent neural networks. In: Interspeech, pp. 493–497 (2017). https://doi.org/10.21437/Interspeech.2017-1238
Guirguis, K., Schorn, C., Guntoro, A., Abdulatif, S., Yang, B.: SELD-TCN: Sound event localization & detection via temporal convolutional networks. In: 2020 28th European Signal Processing Conference (EUSIPCO), pp. 16–20 (2021). https://doi.org/10.23919/Eusipco47968.2020.9287716
Shimada, K., Koyama, Y., Takahashi, S., Takahashi, N., Tsunoo, E., Mitsufuji, Y.: Multi-ACCDOA: Localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 316–320 (2022). https://doi.org/10.1109/ICASSP43922.2022.9746384
Kim, J.S., Park, H.J., Shin, W., Han, S.W.: A robust framework for sound event localization and detection on real recordings. Technical report, DCASE2022 Challenge (2022)
Wang, Q., Chai, L., Wu, H., Nian, Z., Niu, S., Zheng, S., Wang, Y., Sun, L., Fang, Y., Pan, J., et al.: The NERC-SLIP system for sound event localization and detection of DCASE2022 challenge. Technical report, DCASE2022 Challenge (2022)
Gu, J., Kwon, H., Wang, D., Ye, W., Li, M., Chen, Y.-H., Lai, L., Chandra, V., Pan, D.Z.: Multi-scale high-resolution vision transformer for semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12094–12103 (2022). https://doi.org/10.1109/CVPR52688.2022.01178
Qi, X., Wang, J., Chen, Y., Shi, Y., Zhang, L.: LipsFormer: Introducing Lipschitz continuity to vision transformers. Preprint at https://arxiv.org/abs/2304.09856 (2023)
Wan, Q., Huang, Z., Lu, J., Yu, G., Zhang, L.: SeaFormer: Squeeze-enhanced axial transformer for mobile semantic segmentation. In: The Eleventh International Conference on Learning Representations (2023)
Chen, X., Li, H., Li, M., Pan, J.: Learning a sparse transformer network for effective image deraining. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5896–5905 (2023). https://doi.org/10.1109/CVPR52729.2023.00571
Li, X., Wang, W., Hu, X., Yang, J.: Selective kernel networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 510–519 (2019). https://doi.org/10.1109/CVPR.2019.00060
Psomas, B., Kakogeorgiou, I., Karantzalos, K., Avrithis, Y.: Keep it SimPool: Who said supervised transformers suffer from attention deficit? In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5350–5360 (2023). https://doi.org/10.1109/ICCV51070.2023.00493
Li, Y., Si, S., Li, G., Hsieh, C.-J., Bengio, S.: Learnable fourier features for multi-dimensional spatial positional encoding. Adv. Neural. Inf. Process. Syst. 34, 15816–15829 (2021)
Politis, A., Shimada, K., Sudarsanam, P., Adavanne, S., Krause, D., Koyama, Y., Takahashi, N., Takahashi, S., Mitsufuji, Y., Virtanen, T.: STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. Preprint at https://arxiv.org/abs/2206.01948 (2022)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q.V., Salakhutdinov, R.: Transformer-XL: Attentive language models beyond a fixed-length context. Preprint at arXiv:1901.02860 (2019)
Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Soplin, N.E.Y., Heymann, J., Wiesner, M., Chen, N., Renduchintala, A., Ochiai, T.: ESPnet: End-to-end speech processing toolkit. Preprint at arXiv:1804.00015 (2018)
Li, N., Liu, S., Liu, Y., Zhao, S., Liu, M.: Neural speech synthesis with transformer network. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 6706–6713 (2019). https://doi.org/10.1609/aaai.v33i01.33016706
Chu, X., Tian, Z., Zhang, B., Wang, X., Shen, C.: Conditional positional encodings for vision transformers. In: The Eleventh International Conference on Learning Representations (2023)
Wang, W., Chen, W., Qiu, Q., Chen, L., Wu, B., Lin, B., He, X., Liu, W.: CrossFormer++: A versatile vision transformer hinging on cross-scale attention. IEEE Trans. Pattern Anal. Mach. Intell. 46(5), 3123–3136 (2023). https://doi.org/10.1109/TPAMI.2023.3341806
Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing 568, 127063 (2024). https://doi.org/10.1016/j.neucom.2023.127063
Chen, Z., Huang, Q.: GLFE: Global-Local fusion enhancement for sound event localization and detection. Technical report, DCASE2022 Challenge (2022)
Kapka, S., Tkaczuk, J.: CoLoC: Conditioned localizer and classifier for sound event localization and detection. Technical report, DCASE2022 Challenge (2022)
Wu, S., Huang, S., Liu, Z., Liu, J.: MLP-Mixer enhanced CRNN for sound event localization and detection in DCASE 2022 Task 3. Technical report, DCASE2022 Challenge (2022)
Wang, Q., Du, J., Nian, Z., Niu, S., Chai, L., Wu, H., Pan, J., Lee, C.-H.: Loss function design for DNN-based sound event localization and detection on low-resource realistic data. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2023). https://doi.org/10.1109/ICASSP49357.2023.10095144
Kim, J.S., Park, H.J., Shin, W., Han, S.W.: AD-YOLO: You look only once in training multiple sound event localization and detection. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2023). https://doi.org/10.1109/ICASSP49357.2023.10096460
Dao, T., Guo, M., Ma, M.: Sound event localization and detection using a spatial omni-dimensional dynamic interactions network. SIViP 18(2), 1911–1917 (2024). https://doi.org/10.1007/s11760-023-02901-8
Funding
This work was supported by the National Natural Science Foundation of China (Grant No. 62377031), the Fundamental Research Funds for the Central Universities (Grant No. GK202105006), and the Key Research and Development Program of Shaanxi Province (Grant No. 2023-YBGY241).
Author information
Authors and Affiliations
Contributions
Di Shi contributed to conceptualization, methodology, and writing of the main manuscript. Min Guo provided experimental guidance and participated in revising and reviewing the manuscript. Miao Ma participated in revising and reviewing the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Shi, D., Guo, M. & Ma, M. A research for sound event localization and detection based on local–global adaptive fusion and temporal importance network. Multimedia Systems 30, 367 (2024). https://doi.org/10.1007/s00530-024-01582-8
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s00530-024-01582-8