
A new speaker-diarization technology with denoising spectral-LSTM for online automatic multi-dialogue recording

Published in: Multimedia Tools and Applications

Abstract

In pandemic-era AI applications, an online automatic AI recording apparatus for official councils such as court trials, business conferences, and commercial meetings is becoming imperative, because it makes the identified opinions and consensus of participants synchronously available and thereby implicitly reduces social costs such as follow-up disputes and controversies. In this study, we therefore construct a complete automatic online multi-dialogue recording system, in which the unbounded interleaved-state recurrent neural network (UIS-RNN), enhanced with the proposed crux improvements, achieves confident speaker diarization. To keep the system robust, a denoising spectral-LSTM, carefully modified from the dual-signal transformation LSTM network (DTLN), strengthens the subsequent crux-improved UIS-RNN and automatic speech recognition (ASR). Finally, a MacBERT model rectifies possibly incorrect words in the transcribed sentences according to the learned rational context. To make the system a practical software apparatus for unmarked multi-person councils, we have also completed convenient interfaces for the ASR and speaker-diarization operations, which display the online denoising efficacy and speaker-diarization results and allow common users to make real-time manual corrections. In extensive experiments, the proposed recording system attains high accuracy in online speaker diarization and speech-separated ASR. The system was also examined by cooperating law-court staff, who supplied noise-embedded speech from practical court settings for testing. Because the heavy recording burden in their legal proceedings was noticeably alleviated, the court staff endorsed the proposed system as a friendly, labor-saving AI apparatus for online automatic multi-dialogue recording.
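The abstract describes a four-stage processing chain: denoising, speaker diarization, per-segment ASR, and contextual text correction. The following is a minimal conceptual sketch of that chain, not the authors' implementation; every stage function here is a hypothetical placeholder (the real system uses a DTLN-derived denoising spectral-LSTM, the crux-improved UIS-RNN, an ASR engine, and MacBERT), and all names and types are assumptions for illustration only.

```python
# Conceptual sketch of the recording pipeline described in the abstract.
# All stage functions below are hypothetical stubs standing in for the
# real models (denoising spectral-LSTM, crux-improved UIS-RNN, ASR, MacBERT).

from dataclasses import dataclass
from typing import List, Tuple

Audio = List[float]                # mono waveform samples (placeholder type)
Segment = Tuple[int, int, str]     # (start_sample, end_sample, speaker_id)


def denoise(audio: Audio) -> Audio:
    """Stub for the DTLN-based denoising spectral-LSTM."""
    return audio


def diarize(audio: Audio) -> List[Segment]:
    """Stub for UIS-RNN speaker diarization; returns one dummy segment."""
    return [(0, len(audio), "spk_0")]


def transcribe(audio: Audio, seg: Segment) -> str:
    """Stub for per-segment automatic speech recognition."""
    return "<transcript>"


def correct(text: str) -> str:
    """Stub for MacBERT-style contextual word rectification."""
    return text


@dataclass
class DialogueRecord:
    speaker: str
    text: str


def run_pipeline(audio: Audio) -> List[DialogueRecord]:
    """Chain the stages: denoise -> diarize -> ASR -> correction."""
    clean = denoise(audio)
    records = []
    for seg in diarize(clean):
        raw = transcribe(clean, seg)
        records.append(DialogueRecord(speaker=seg[2], text=correct(raw)))
    return records


if __name__ == "__main__":
    print(run_pipeline([0.0] * 16000))
```

The key design point mirrored from the paper is ordering: denoising runs first so that both diarization and ASR operate on enhanced audio, and text correction runs last over the transcripts.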


Data availability

All datasets and materials for training, validation, and testing are available from the resources provided in [31, 32] and [33], as well as from our self-collected datasets, which are temporarily and partially opened at https://drive.google.com/drive/folders/17tGk99_iywuSDT0H_dMBqc_xUcG4Zkui?usp=drive_link.

References

  1. Dehak N et al. (2011) Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing. https://ieeexplore.ieee.org/document/5545402

  2. Zhu W, Pelecanos J (2016) Online speaker diarization using adapted i-vector transforms. In: Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://doi.org/10.1109/ICASSP.2016.7472638

  3. Variani E et al. (2014) Deep neural networks for small footprint text-dependent speaker verification. In: Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://ieeexplore.ieee.org/document/6854363

  4. Snyder D et al. (2018) X-vectors: Robust DNN embeddings for speaker recognition. In: Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://ieeexplore.ieee.org/document/8461375

  5. Zajíc Z et al. (2017) Speaker diarization using convolutional neural network for statistics accumulation refinement. In: INTERSPEECH. https://www.kky.zcu.cz/cs/publications/1/ZajicZbynek_2017_SpeakerDiarization.pdf

  6. Wang Q et al. (2018) Speaker diarization with LSTM. In: Proc. of IEEE International conference on acoustics, speech and signal processing (ICASSP). https://ieeexplore.ieee.org/document/8462628

  7. Garcia-Romero D et al. (2017) Speaker diarization using deep neural network embeddings. In: Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://ieeexplore.ieee.org/document/7953094

  8. Westhausen N-L et al. (2020) Dual-signal transformation LSTM network for real-time noise suppression. In: Proc. of INTERSPEECH. https://arxiv.org/abs/2005.07551

  9. Zhang A et al. (2019) Fully supervised speaker diarization. In: Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://ieeexplore.ieee.org/document/8683892

  10. Higuchi Y et al. (2020) Speaker embeddings incorporating acoustic conditions for diarization. In: Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://doi.org/10.1109/ICASSP40776.2020.9054273

  11. Li Z et al. (2021) Compositional embedding models for speaker identification and diarization with simultaneous speech from 2+ speakers. In: Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://ieeexplore.ieee.org/document/9413752

  12. Gan Z et al. (2022) End-to-end speaker diarization of Tibetan based on BLSTM. In: Global Conference on Robotics, Artificial Intelligence and Information Technology (GCRAIT). https://ieeexplore.ieee.org/document/9898377

  13. Kanda N et al. (2022) Transcribe-to-diarize: Neural speaker diarization for unlimited number of speakers using end-to-end speaker-attributed ASR. In: Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://ieeexplore.ieee.org/document/9746225

  14. Ayasi A et al. (2022) Speaker diarization using BiLSTM and BiGRU with self-attention. In: Proc. ofSecond International Conference on Next Generation Intelligent Systems (ICNGIS). https://doi.org/10.1109/ICNGIS54955.2022.10079831

  15. Cheng S-W et al. (2022) An attention-based neural network on multiple speaker diarization. In: Proc. of IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS). https://doi.org/10.1109/AICAS54282.2022.9870007

  16. Pang B et al. (2022) TSUP speaker diarization system for conversational short-phrase speaker diarization challenge. In: Proc. of 13th International Symposium on Chinese Spoken Language Processing (ISCSLP). https://doi.org/10.1109/ISCSLP57327.2022.10037846

  17. Ravanelli M et al (2018) Light gated recurrent units for speech recognition. IEEE Trans Emerg Top Comput Intell 2(2):92–102. https://doi.org/10.1109/TETCI.2017.2762739


  18. Li J et al. (2019) Improving RNN transducer modeling for end-to-end speech recognition. In: Proc. of IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). https://doi.org/10.1109/ASRU46091.2019.9003906

  19. Hou J, Zhao S (2021) A real-time speech enhancement algorithm based on convolutional recurrent network and Wiener filter. In: Proc. of IEEE 6th International Conference on Computer and Communication Systems (ICCCS). https://doi.org/10.1109/ICCCS52626.2021.9449307

  20. Hu Y et al. (2020) Deep complex convolution recurrent network for phase-aware speech enhancement. In: INTERSPEECH. https://arxiv.org/abs/2008.00264

  21. Hung J-W et al. (2021) Exploiting the non-uniform frequency-resolution spectrograms to improve the deep denoising auto-encoder for speech enhancement. In: Proc. of 7th International Conference on Applied System Innovation (ICASI). https://doi.org/10.1109/ICASI52993.2021.9568478

  22. Jannu C, Vanambathina S-D (2023) An attention based densely connected U-NET with convolutional GRU for speech enhancement. In: Proc. of 3rd International conference on Artificial Intelligence and Signal Processing (AISP). https://doi.org/10.1109/AISP57993.2023.10134933

  23. Rethage D et al. (2018) A wavenet for speech denoising. In: Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://ieeexplore.ieee.org/document/8462417

  24. Hao X et al. (2021) Fullsubnet: A fullband and sub-band fusion model for real-time single-channel speech enhancement. In: Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://arxiv.org/abs/2010.15508

  25. Zhang C et al. (2021) Denoispeech: denoising text to speech with frame-level noise modeling. In: Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://doi.org/10.1109/ICASSP39728.2021.9413934

  26. Yang G et al. (2022) Speech signal denoising algorithm and simulation based on wavelet threshold. In: Proc. of 4th International Conference on Natural Language Processing (ICNLP). https://doi.org/10.1109/ICNLP55136.2022.00055

  27. Kong Z et al. (2022) Speech denoising in the waveform domain with self-attention. In: Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://doi.org/10.1109/ICASSP43922.2022.9746169

  28. Zhao S et al. (2021) Monaural speech enhancement with complex convolutional block attention module and joint time frequency losses. In: Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://doi.org/10.1109/ICASSP39728.2021.9414569

  29. Cui Y et al. (2020) Revisiting pre-trained models for Chinese natural language processing. In: Findings of the Association for Computational Linguistics: EMNLP 2020. arXiv preprint arXiv:2004.13922

  30. Rix A-W et al. (2001) Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In: Proc. of IEEE international conference on acoustics, speech, and signal processing. https://ieeexplore.ieee.org/document/941023

  31. Dataset of Human Voice Speech Denoising Competition hosted by Industrial Technology Research Institute (ITRI), Taiwan, https://aidea-web.tw/topic/8d381596-ee9d-45d5-b779-188909ccb0c8

  32. Formosa Language Understanding public dataset provided by National Central University, Taiwan, https://scidm.nchc.org.tw/dataset/grandchallenge

  33. Audio-visual dataset VoxCeleb, https://www.robots.ox.ac.uk/~vgg/data/voxceleb/

  34. Rombach R et al. (2022) High-resolution image synthesis with latent diffusion models. In: Proc. of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://arxiv.org/abs/2112.10752


Acknowledgements

This work was supported by the Ministry of Science and Technology of Taiwan under Grant MOST 111-2221-E006-177-MY2. We would also like to thank the AIGO competition for inspiring us to develop this system and for its high approval of it. In particular, we are grateful to the Tainan District Court, the main cooperating institution, for examining the proposed system and continually feeding key advice back from their practical applications.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Din Yuen Chan.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Chan, D.Y., Wang, J.F. & Chin, H.T. A new speaker-diarization technology with denoising spectral-LSTM for online automatic multi-dialogue recording. Multimed Tools Appl 83, 45407–45422 (2024). https://doi.org/10.1007/s11042-023-17283-9

