Abstract
In this paper, we investigate a speaker representation for a diarization system that clusters short telephone conversation segments produced by the same speaker. The proposed approach uses a neural-network-based descriptor in place of the i-vector descriptor that is usual in state-of-the-art diarization systems. The two techniques were compared on the English part of the CallHome corpus. The results indicate that the i-vector approach is superior on its own, although the proposed descriptor contributes additional information. The combined descriptor therefore represents a speaker in a segment with a lower diarization error (almost 20% relative improvement over the i-vector alone).
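To make the two quantities in the abstract concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say how the two descriptors are combined, so concatenation of length-normalized vectors is an assumption made here for illustration only. The function names (`combine_descriptors`, `relative_improvement`) and any input values are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def combine_descriptors(ivector, nn_descriptor):
    """Fuse two segment descriptors by concatenating their normalized forms.

    ASSUMPTION: concatenation is only one plausible fusion scheme; the
    abstract does not specify how the combined descriptor is built.
    """
    return l2_normalize(ivector) + l2_normalize(nn_descriptor)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def relative_improvement(baseline_der, new_der):
    """Relative reduction of the diarization error rate (DER).

    A result of 0.2 means the new DER is 20% lower than the baseline,
    relative to the baseline value (not 20 percentage points lower).
    """
    return (baseline_der - new_der) / baseline_der
```

Under this reading, "almost 20% relative improvement" means that `relative_improvement(der_ivector, der_combined)` is slightly below 0.2, where both arguments are diarization error rates measured on the same test data.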
Acknowledgments
This research was supported by the Ministry of Culture of the Czech Republic, project No. DG16P02B048.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Zajíc, Z., Zelinka, J., Müller, L. (2017). Neural Network Speaker Descriptor in Speaker Diarization of Telephone Speech. In: Karpov, A., Potapova, R., Mporas, I. (eds) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science, vol 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_55
DOI: https://doi.org/10.1007/978-3-319-66429-3_55
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66428-6
Online ISBN: 978-3-319-66429-3
eBook Packages: Computer Science, Computer Science (R0)