A hybrid offline-online method for sound event localization and detection

Zhang, Wenjie; Yu, Peng; Wang, Zhan; Wang, Zhenhe; Xu, Mingliang

doi:10.1007/s10489-024-05702-9

Published: 20 August 2024

Applied Intelligence, Volume 54, pages 11357–11372 (2024)

Wenjie Zhang^1,2,3 (ORCID: orcid.org/0000-0002-8527-9826), Peng Yu^1, Zhan Wang^1, Zhenhe Wang^1 & Mingliang Xu^1,2,3

Abstract

Sound event localization and detection (SELD) aims to identify when a sound event of a given category occurs and where it is located in space. Mainstream offline methods may unintentionally introduce unfavorable future feature information during training, which can hinder system performance. Online methods can improve localization accuracy to some extent, but they may reduce the ability to detect sound events. In this paper, we propose a hybrid offline-online method (HOOM) that extracts comprehensive audio information with offline network layers while filtering out irrelevant future information with online network layers. Based on this method, we design two simple sub-network architectures. The first, the convolution and causal convolution alternating network (CCAN), uses regular convolutions and causal convolutions to obtain offline and online convolutional features, respectively. The second, the bidirectional and unidirectional alternating network (BUAN), combines bidirectional and unidirectional recurrent neural networks to capture offline and online contextual sequence information, respectively. The proposed method achieves a 6% improvement in localization recall on the Sony-TAU Realistic Spatial Soundscapes 2023 (STARSS23) dataset and, compared with offline or online methods, a 4% improvement in overall performance. On the Detection and Classification of Acoustic Scenes and Events 2022 (DCASE2022) synthetic dataset, the overall performance improvement is 5%. These results indicate a significant advantage and provide a novel and robust solution for the SELD task.
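The difference between the regular ("offline") and causal ("online") convolutions named in the abstract can be illustrated with a minimal sketch. It is not the authors' CCAN implementation: the function names, the 3-tap kernel, the toy input, and the `ccan_like_stack` helper are all hypothetical and chosen only for illustration. The sketch perturbs one input frame and reports which output frames change under each padding scheme.

```python
from typing import List, Sequence


def conv1d_offline(x: Sequence[float], w: Sequence[float]) -> List[float]:
    """Regular convolution with symmetric zero padding (odd kernel length assumed).

    Output frame t reads input frames t - k//2 .. t + k//2, so it sees the future.
    Implemented as cross-correlation, as is usual in deep-learning layers.
    """
    k = len(w)
    half = k // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    return [sum(w[j] * padded[t + j] for j in range(k)) for t in range(len(x))]


def conv1d_causal(x: Sequence[float], w: Sequence[float]) -> List[float]:
    """Causal convolution: left padding only, so output frame t reads t - k + 1 .. t."""
    k = len(w)
    padded = [0.0] * (k - 1) + list(x)
    return [sum(w[j] * padded[t + j] for j in range(k)) for t in range(len(x))]


def ccan_like_stack(x: Sequence[float], w: Sequence[float], n_pairs: int = 2) -> List[float]:
    """Hypothetical alternation of offline and causal layers (illustration only)."""
    h = list(x)
    for _ in range(n_pairs):
        h = conv1d_causal(conv1d_offline(h, w), w)
    return h


def changed_indices(a: Sequence[float], b: Sequence[float]) -> List[int]:
    """Indices at which two equally long sequences differ."""
    return [i for i, (u, v) in enumerate(zip(a, b)) if abs(u - v) > 1e-12]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x_perturbed = list(x)
x_perturbed[4] += 10.0          # change a single frame (index 4)
w = [0.25, 0.5, 0.25]           # assumed 3-tap kernel

offline_changed = changed_indices(conv1d_offline(x, w), conv1d_offline(x_perturbed, w))
online_changed = changed_indices(conv1d_causal(x, w), conv1d_causal(x_perturbed, w))
stack_output = ccan_like_stack(x, w)

print("offline layer, frames affected:", offline_changed)
print("causal layer, frames affected: ", online_changed)
```

With the offline layer, frame 3 changes even though only the later frame 4 was perturbed; with the causal layer, only frames at or after the perturbed one change. How the paper combines the two kinds of layers in detail is not specified in the abstract.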

Graphical abstract

A hybrid offline-online method for SELD is proposed. The extracted hybrid features or temporal sequences capture a more comprehensive range of audio information.
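For the temporal-sequence side of the method, the contrast between bidirectional ("offline") and unidirectional ("online") recurrence can be sketched with a toy scalar RNN. This is an assumption-laden illustration, not the authors' BUAN: the weights `w_in` and `w_rec`, the averaging of the two directions (real bidirectional layers typically concatenate them), and the `buan_like_stack` helper are all hypothetical.

```python
import math
from typing import List, Sequence


def rnn_unidirectional(x: Sequence[float], w_in: float = 0.8, w_rec: float = 0.5) -> List[float]:
    """Toy scalar RNN run left to right: state h_t depends on x_0 .. x_t only."""
    h = 0.0
    out = []
    for x_t in x:
        h = math.tanh(w_in * x_t + w_rec * h)
        out.append(h)
    return out


def rnn_bidirectional(x: Sequence[float], w_in: float = 0.8, w_rec: float = 0.5) -> List[float]:
    """Forward pass plus a pass over the reversed sequence, averaged per frame.

    Averaging is a simplification for this sketch; each frame sees the whole sequence.
    """
    fwd = rnn_unidirectional(x, w_in, w_rec)
    bwd = rnn_unidirectional(list(reversed(x)), w_in, w_rec)[::-1]
    return [0.5 * (f + b) for f, b in zip(fwd, bwd)]


def buan_like_stack(x: Sequence[float]) -> List[float]:
    """Hypothetical bidirectional layer followed by a unidirectional layer."""
    return rnn_unidirectional(rnn_bidirectional(x))


def changed_indices(a: Sequence[float], b: Sequence[float]) -> List[int]:
    """Indices at which two equally long sequences differ."""
    return [i for i, (u, v) in enumerate(zip(a, b)) if abs(u - v) > 1e-9]


x = [0.1, 0.4, -0.2, 0.3, 0.0]
x_perturbed = list(x)
x_perturbed[4] = 1.0            # change only the last frame

uni_changed = changed_indices(rnn_unidirectional(x), rnn_unidirectional(x_perturbed))
bi_changed = changed_indices(rnn_bidirectional(x), rnn_bidirectional(x_perturbed))
stack_output = buan_like_stack(x)

print("unidirectional, frames affected:", uni_changed)
print("bidirectional, frames affected: ", bi_changed)
```

Changing the final frame alters every output of the bidirectional layer but only the final output of the unidirectional one, which is the sense in which the former carries future context and the latter does not.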

Data Availability

The datasets used in our experiments are available at https://zenodo.org/records/7880637 and https://zenodo.org/records/6406873.

Funding

This work was supported by the National Natural Science Foundation of China under Grant 62003308.

Author information

Wenjie Zhang and Peng Yu contributed equally to this work.

Authors and Affiliations

School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, 450001, China
Wenjie Zhang, Peng Yu, Zhan Wang, Zhenhe Wang & Mingliang Xu
Engineering Research Center of Intelligent Swarm Systems, Ministry of Education, Zhengzhou, 450001, China
Wenjie Zhang & Mingliang Xu
National Supercomputing Center in Zhengzhou, Zhengzhou, 450001, China
Wenjie Zhang & Mingliang Xu

Contributions

Conceptualization: Wenjie Zhang, Peng Yu, Mingliang Xu; Methodology: Wenjie Zhang, Peng Yu; Formal analysis and investigation: Wenjie Zhang, Peng Yu; Writing - original draft preparation: Peng Yu; Article review and editing: Wenjie Zhang, Peng Yu, Zhan Wang, Zhenhe Wang; Funding acquisition: Wenjie Zhang; Supervision: Wenjie Zhang, Mingliang Xu. All authors have read and agreed to the published version of the manuscript.

Corresponding authors

Correspondence to Wenjie Zhang or Mingliang Xu.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, W., Yu, P., Wang, Z. et al. A hybrid offline-online method for sound event localization and detection. Appl Intell 54, 11357–11372 (2024). https://doi.org/10.1007/s10489-024-05702-9

Accepted: 23 July 2024
Published: 20 August 2024
Issue Date: November 2024
DOI: https://doi.org/10.1007/s10489-024-05702-9
