Abstract
The objective of sound event localization and detection (SELD) is to identify when sound events of each target class occur and where they are located in space. Mainstream offline methods can unintentionally introduce unhelpful future feature information during training, which may hinder system performance. Online methods can improve localization accuracy to some extent, but they may also weaken sound event detection. In this paper, we propose a hybrid offline-online method (HOOM) that extracts comprehensive audio information with offline network layers while filtering out irrelevant future information with online network layers. Based on this method, we design two simple sub-network architectures. The first, the convolution and causal convolution alternating network (CCAN), uses regular convolutions and causal convolutions to obtain offline and online convolutional features, respectively. The second, the bidirectional and unidirectional alternating network (BUAN), combines bidirectional and unidirectional recurrent neural networks to capture offline and online contextual sequence information, respectively. On the Sony-TAU Realistic Spatial Soundscapes 2023 (STARSS23) dataset, the proposed method achieves a 6% improvement in localization recall and a 4% improvement in overall performance compared with purely offline or purely online methods. On the Detection and Classification of Acoustic Scenes and Events 2022 (DCASE2022) synthetic dataset, the overall performance improvement is 5%. These results indicate a clear advantage and offer a novel and robust solution for the SELD task.
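The difference between the regular (offline) and causal (online) convolutions that CCAN alternates can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `conv1d_same`, `conv1d_causal` and `hybrid_block`, the scalar 1-D signals, and the odd kernel length are assumptions made purely for illustration.

```python
def conv1d_same(x, w):
    """Regular ("offline") 1-D convolution with a centred kernel.

    Assumes an odd kernel length. The output at frame t depends on
    frames t - k//2 .. t + k//2, so it sees past AND future frames.
    (As in most deep-learning code, the kernel is not flipped.)
    """
    k = len(w)
    half = k // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    return [sum(w[j] * padded[t + j] for j in range(k)) for t in range(len(x))]


def conv1d_causal(x, w):
    """Causal ("online") 1-D convolution: zero padding on the left only.

    The output at frame t depends on frames t - k + 1 .. t,
    so no future frame can influence it.
    """
    k = len(w)
    padded = [0.0] * (k - 1) + list(x)
    return [sum(w[j] * padded[t + j] for j in range(k)) for t in range(len(x))]


def hybrid_block(x, w_offline, w_online):
    """Hypothetical alternation: one offline layer followed by one online layer."""
    return conv1d_causal(conv1d_same(x, w_offline), w_online)


if __name__ == "__main__":
    w = [1, 1, 1]
    x = [1, 2, 3, 4, 5]
    x_future_changed = [1, 2, 3, 4, 50]  # only the last frame differs

    # Causal outputs for earlier frames are unaffected by the changed last frame.
    print(conv1d_causal(x, w)[:4] == conv1d_causal(x_future_changed, w)[:4])
    # The regular convolution at frame 3 already looks one frame ahead.
    print(conv1d_same(x, w)[3] == conv1d_same(x_future_changed, w)[3])
```

Changing only the final frame leaves every earlier causal output untouched, whereas the regular convolution changes at the preceding frame as well. A stack that alternates the two, as in `hybrid_block`, still has access to some future context through its offline layer.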
Graphical abstract
A hybrid offline-online method is proposed for SELD. The extracted hybrid features or temporal sequences capture a more comprehensive range of audio information.
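For the temporal-sequence side (the BUAN idea of alternating bidirectional and unidirectional recurrent layers), a toy scalar recurrence shows which frames each layer type can see. This is a sketch under stated assumptions, not the paper's network: the recurrence `h_t = tanh(a * x_t + b * h_{t-1})`, the fixed weights `a` and `b`, the function names, and summing the two directions (real bidirectional layers usually concatenate them and use separate weights) are all illustrative choices.

```python
import math


def rnn_forward(x, a=0.5, b=0.5):
    """Unidirectional ("online") recurrence: h_t depends on x_0 .. x_t only."""
    h, out = 0.0, []
    for x_t in x:
        h = math.tanh(a * x_t + b * h)
        out.append(h)
    return out


def rnn_bidirectional(x, a=0.5, b=0.5):
    """Bidirectional ("offline") recurrence.

    Runs the same toy recurrence forward and backward in time and sums
    the two states per frame, so every output depends on the whole sequence.
    """
    fwd = rnn_forward(x, a, b)
    bwd = rnn_forward(x[::-1], a, b)[::-1]
    return [f + g for f, g in zip(fwd, bwd)]


def alternating_stack(x):
    """Hypothetical alternation: a bidirectional layer, then a unidirectional one."""
    return rnn_forward(rnn_bidirectional(x))


if __name__ == "__main__":
    x = [0.1, 0.2, 0.3, 0.4]
    x_future_changed = [0.1, 0.2, 0.3, 4.0]  # only the last frame differs

    # The forward-only layer ignores the changed future frame at earlier steps.
    print(rnn_forward(x)[:3] == rnn_forward(x_future_changed)[:3])
    # The bidirectional layer reacts to it even at the first frame.
    print(rnn_bidirectional(x)[0] == rnn_bidirectional(x_future_changed)[0])
```

The unidirectional layer yields identical states for all frames before the modified one, while the bidirectional layer propagates the change back to the first frame.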







Data Availability
The datasets used in our experiments are available at https://zenodo.org/records/7880637 and https://zenodo.org/records/6406873.
Funding
This work was supported by the National Natural Science Foundation of China under Grant 62003308.
Author information
Contributions
Conceptualization: Wenjie Zhang, Peng Yu, Mingliang Xu; Methodology: Wenjie Zhang, Peng Yu; Formal analysis and investigation: Wenjie Zhang, Peng Yu; Writing - original draft preparation: Peng Yu; Article review and editing: Wenjie Zhang, Peng Yu, Zhan Wang, Zhenhe Wang; Funding acquisition: Wenjie Zhang; Supervision: Wenjie Zhang, Mingliang Xu. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Conflicts of interest
The authors declare that they have no conflicts of interest.
About this article
Cite this article
Zhang, W., Yu, P., Wang, Z. et al. A hybrid offline-online method for sound event localization and detection. Appl Intell 54, 11357–11372 (2024). https://doi.org/10.1007/s10489-024-05702-9
DOI: https://doi.org/10.1007/s10489-024-05702-9