
A Region Based Non-overlapping Reference Speech Estimation Method for Speaker Extraction

  • Conference paper
  • Included in the conference series: MultiMedia Modeling (MMM 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14556)


Abstract

Speaker extraction is a technique that separates the target speech from a multi-talker mixture using a priori information about the target speaker, such as pre-enrolled reference speech. In real-world scenarios, however, the mixture is long, continuous speech that is only partially overlapped, and obtaining such a priori information is often challenging. We therefore propose a framework that estimates the reference speech of the participating speakers from the non-overlapping regions of the input and uses it to extract the target speech. To locate these regions accurately, we adopt the idea of region proposal and generate multiple speech segment proposals over the non-overlapping regions of the input mixture. We then cluster the proposed segments to obtain the best reference speech for each speaker. We conduct experiments on simulated meeting-style test sets with different overlap ratios built from LibriSpeech. The results show that the region proposal method achieves the best speech extraction performance compared with other reference speech estimation methods.
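
In short, the pipeline proposes candidate single-speaker segments from the non-overlapping regions, embeds each segment with a speaker encoder, clusters the embeddings by speaker identity, and takes the segment nearest each cluster centroid as that speaker's reference. The sketch below illustrates only the clustering and reference-selection step; the function names, the agglomerative clustering setup, and the assumption of a known speaker count are our illustrative choices, not the authors' implementation (their official code is linked under Notes).

```python
# Minimal sketch (our assumptions, not the paper's code): cluster proposed
# non-overlapping segments by speaker and select one reference per speaker.
import numpy as np
from sklearn.cluster import AgglomerativeClustering  # scikit-learn >= 1.2


def select_references(embeddings, waveforms, n_speakers):
    """Pick one reference waveform per speaker.

    embeddings: (n_segments, dim) speaker embeddings of the proposed
        segments, e.g. from a pre-trained speaker encoder (assumed given).
    waveforms: list of n_segments 1-D sample arrays, aligned with embeddings.
    n_speakers: number of participating speakers (assumed known here).
    """
    # Group the segment proposals by speaker identity.
    labels = AgglomerativeClustering(
        n_clusters=n_speakers, metric="cosine", linkage="average"
    ).fit_predict(embeddings)

    references = {}
    for spk in range(n_speakers):
        idx = np.flatnonzero(labels == spk)
        centroid = embeddings[idx].mean(axis=0)
        # Cosine similarity of each cluster member to the centroid; the
        # closest segment serves as that speaker's "best" reference speech.
        sims = embeddings[idx] @ centroid / (
            np.linalg.norm(embeddings[idx], axis=1)
            * np.linalg.norm(centroid)
            + 1e-8
        )
        references[spk] = waveforms[idx[np.argmax(sims)]]
    return references
```

Each estimated reference could then condition a standard speaker extraction network (for example, a SpEx-style model) to pull that speaker out of the full mixture.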

Y. Zhang and Z. Li contributed equally to this research.


Notes

  1. Our code is available online at https://github.com/Ease-3600/TSE-with-ref-selection.


Author information

Correspondence to Qun Yang.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zhang, Y., Li, Z., Liu, B., Fan, H., Yang, Y., Yang, Q. (2024). A Region Based Non-overlapping Reference Speech Estimation Method for Speaker Extraction. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14556. Springer, Cham. https://doi.org/10.1007/978-3-031-53311-2_32

  • DOI: https://doi.org/10.1007/978-3-031-53311-2_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-53310-5

  • Online ISBN: 978-3-031-53311-2

  • eBook Packages: Computer Science, Computer Science (R0)
