
A Region Based Non-overlapping Reference Speech Estimation Method for Speaker Extraction

  • Conference paper
  • Included in the conference series: MultiMedia Modeling (MMM 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14556)


Abstract

Speaker extraction is a technique that separates the target speech from a multi-talker mixture using a priori information about the target speaker, such as pre-enrolled reference speech. In real-world scenarios, however, the mixture is long, continuous speech that is only partially overlapped, and obtaining such a priori information is often challenging. We therefore propose a framework that estimates the reference speech of the participating speakers from the non-overlapping regions of the input and uses it to extract the target speech. To locate these regions accurately, we adopt the idea of region proposal and generate multiple speech segment proposals over the non-overlapping regions of the input mixture. We then cluster the proposed segments to obtain the best reference speech for each speaker. We conduct experiments on simulated meeting-style test sets with different overlap ratios built from LibriSpeech. The results show that the region proposal method achieves the best speech extraction performance compared with other reference speech estimation methods.
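
In short, the pipeline proposes candidate single-speaker segments from the non-overlapping regions, embeds each segment with a speaker encoder, clusters the embeddings by speaker identity, and takes the segment nearest each cluster centroid as that speaker's reference. The sketch below illustrates only the clustering and reference-selection step; the function names, the agglomerative clustering setup, and the assumption of a known speaker count are our illustrative choices, not the authors' implementation (their official code is linked under Notes).

```python
# Minimal sketch (our assumptions, not the paper's code): cluster proposed
# non-overlapping segments by speaker and select one reference per speaker.
import numpy as np
from sklearn.cluster import AgglomerativeClustering  # scikit-learn >= 1.2


def select_references(embeddings, waveforms, n_speakers):
    """Pick one reference waveform per speaker.

    embeddings: (n_segments, dim) speaker embeddings of the proposed
        segments, e.g. from a pre-trained speaker encoder (assumed given).
    waveforms: list of n_segments 1-D sample arrays, aligned with embeddings.
    n_speakers: number of participating speakers (assumed known here).
    """
    # Group the segment proposals by speaker identity.
    labels = AgglomerativeClustering(
        n_clusters=n_speakers, metric="cosine", linkage="average"
    ).fit_predict(embeddings)

    references = {}
    for spk in range(n_speakers):
        idx = np.flatnonzero(labels == spk)
        centroid = embeddings[idx].mean(axis=0)
        # Cosine similarity of each cluster member to the centroid; the
        # closest segment serves as that speaker's "best" reference speech.
        sims = embeddings[idx] @ centroid / (
            np.linalg.norm(embeddings[idx], axis=1)
            * np.linalg.norm(centroid)
            + 1e-8
        )
        references[spk] = waveforms[idx[np.argmax(sims)]]
    return references
```

Each estimated reference could then condition a standard speaker extraction network (for example, a SpEx-style model) to pull that speaker out of the full mixture.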

Y. Zhang and Z. Li contributed equally to this research.


Notes

  1. Our code is available online at https://github.com/Ease-3600/TSE-with-ref-selection.


Author information

Correspondence to Qun Yang.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zhang, Y., Li, Z., Liu, B., Fan, H., Yang, Y., Yang, Q. (2024). A Region Based Non-overlapping Reference Speech Estimation Method for Speaker Extraction. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14556. Springer, Cham. https://doi.org/10.1007/978-3-031-53311-2_32

  • DOI: https://doi.org/10.1007/978-3-031-53311-2_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-53310-5

  • Online ISBN: 978-3-031-53311-2

  • eBook Packages: Computer Science, Computer Science (R0)
