Neural Network Speaker Descriptor in Speaker Diarization of Telephone Speech

  • Conference paper
Speech and Computer (SPECOM 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10458)

Abstract

In this paper, we investigate an approach to speaker representation for a diarization system that groups together short telephone conversation segments produced by the same speaker. The proposed approach uses a neural-network-based descriptor in place of the i-vector descriptor commonly used in state-of-the-art diarization systems. The two techniques were compared on the English part of the CallHome corpus. The results indicate that the i-vector approach is superior on its own, although the proposed descriptor contributes complementary information. The combined descriptor therefore represents a speaker in a segment with a lower diarization error (almost 20% relative improvement over using the i-vector alone).
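
As an illustration only, the idea of a combined descriptor and of a relative improvement in diarization error can be sketched as follows. The abstract does not state how the two descriptors are combined, so the length normalization followed by concatenation used here is an assumption, and all function names and numbers are hypothetical rather than taken from the paper.

```python
import math


def length_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def combine_descriptors(ivector, nn_descriptor):
    """Assumed combination: normalize each descriptor, then concatenate them."""
    return length_normalize(ivector) + length_normalize(nn_descriptor)


def cosine_similarity(a, b):
    """Cosine similarity between two segment descriptors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def relative_improvement(baseline_error, new_error):
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical toy descriptors for two segments (not real i-vectors).
seg_a = combine_descriptors([3.0, 4.0], [0.0, 2.0])
seg_b = combine_descriptors([4.0, 3.0], [0.0, 1.0])
similarity = cosine_similarity(seg_a, seg_b)

# Hypothetical error rates: a drop from 10.0% to 8.0% is a 20% relative improvement.
gain = relative_improvement(10.0, 8.0)
```

Normalizing each part before concatenation keeps either descriptor from dominating the similarity score purely because of its scale; a clustering step would then operate on these pairwise similarities.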

Acknowledgments

This research was supported by the Ministry of Culture of the Czech Republic, project No. DG16P02B048.

Author information

Corresponding author

Correspondence to Zbyněk Zajíc.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Zajíc, Z., Zelinka, J., Müller, L. (2017). Neural Network Speaker Descriptor in Speaker Diarization of Telephone Speech. In: Karpov, A., Potapova, R., Mporas, I. (eds) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science, vol 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_55

  • DOI: https://doi.org/10.1007/978-3-319-66429-3_55

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-66428-6

  • Online ISBN: 978-3-319-66429-3

  • eBook Packages: Computer Science, Computer Science (R0)
