Abstract
In this paper we analyze different labeling strategies and their impact on speaker change detection rates. We explore binary, linear fuzzy, quadratic, and Gaussian labeling functions. We conclude that the choice of labeling function matters and that the linear variant outperforms the rest. We also add phase information from the spectrum to the input of our convolutional neural network. Experiments show that although the phase is informative, its benefit is negligible and it may be omitted. In the experiments we use a coverage-purity measure, which is independent of tolerance parameters.
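To make the four labeling strategies concrete, the following is a minimal sketch of how a training target might be assigned to a frame based on its distance from the nearest speaker change. The abstract does not give the exact definitions, so the function name `change_label`, the `width` parameter, its default of 0.6 s, and the choice of `sigma = width / 3` for the Gaussian are all illustrative assumptions and not the formulas used in the paper.

```python
import math


def change_label(distance, kind="linear", width=0.6):
    """Return a target in [0, 1] for a frame `distance` seconds from a change.

    `width` (seconds) is a hypothetical half-width of the labeled region;
    the paper's actual parameterization is not stated in the abstract.
    """
    d = abs(distance)
    if kind == "binary":
        # Hard label: 1 inside the window, 0 outside.
        return 1.0 if d <= width else 0.0
    if kind == "linear":
        # Triangular ("fuzzy") label decaying linearly to 0 at `width`.
        return max(0.0, 1.0 - d / width)
    if kind == "quadratic":
        # Parabolic label: flatter near the change, 0 at `width`.
        return max(0.0, 1.0 - (d / width) ** 2)
    if kind == "gaussian":
        # Bell-shaped label; sigma = width / 3 is an assumed choice.
        sigma = width / 3.0
        return math.exp(-0.5 * (d / sigma) ** 2)
    raise ValueError(f"unknown labeling function: {kind}")


if __name__ == "__main__":
    for kind in ("binary", "linear", "quadratic", "gaussian"):
        values = [round(change_label(d / 10, kind), 3) for d in range(0, 8, 2)]
        print(kind, values)
```

All four variants agree at the change point itself (target 1) and differ only in how quickly the target decays with distance, which is the property the comparison in the paper is concerned with.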
Acknowledgment
This research was supported by the Grant Agency of the Czech Republic, project No. P103/12/G084, and by a grant of the University of West Bohemia, project No. SGS-2016-039. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Hrúz, M., Salajka, P. (2017). Phase Analysis and Labeling Strategies in a CNN-Based Speaker Change Detection System. In: Karpov, A., Potapova, R., Mporas, I. (eds) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science, vol 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_61
DOI: https://doi.org/10.1007/978-3-319-66429-3_61
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66428-6
Online ISBN: 978-3-319-66429-3
eBook Packages: Computer Science, Computer Science (R0)