Phase Analysis and Labeling Strategies in a CNN-Based Speaker Change Detection System

Hrúz, Marek; Salajka, Petr

doi:10.1007/978-3-319-66429-3_61

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10458)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

In this paper we analyze different labeling strategies and their impact on speaker change detection rates. We explore binary, linear fuzzy, quadratic, and Gaussian labeling functions. We conclude that the choice of labeling function matters considerably and that the linear variant outperforms the rest. We also add phase information from the spectrum to the input of our convolutional neural network. Experiments show that although the phase is informative, its benefit is negligible and it may be omitted. In the experiments we use a coverage-purity measure, which is independent of tolerance parameters.
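As an illustration only, the four labeling strategies named in the abstract could be sketched as functions of the distance between a frame and its nearest annotated speaker change. The abstract does not give the exact formulas, so everything below is an assumption: the function names, the tolerance half-width `tau`, the choice of `sigma = tau / 3` for the Gaussian variant, and the specific shapes of the linear and quadratic ramps are hypothetical and may differ from the definitions used in the paper.

```python
import numpy as np


def distance_to_nearest_change(times, change_points):
    """Absolute distance in seconds from each time to its nearest change point."""
    times = np.asarray(times, dtype=float)
    change_points = np.asarray(change_points, dtype=float)
    if change_points.size == 0:
        # No annotated change: every frame is infinitely far from one.
        return np.full(times.shape, np.inf)
    # Broadcast to a (num_times, num_changes) matrix, then reduce over changes.
    return np.min(np.abs(times[:, None] - change_points[None, :]), axis=1)


def make_labels(times, change_points, kind="linear", tau=0.6):
    """Hypothetical labeling functions; tau is an assumed half-width in seconds."""
    d = distance_to_nearest_change(times, change_points)
    if kind == "binary":
        # 1 inside the window around a change, 0 elsewhere.
        return (d <= tau).astype(float)
    if kind == "linear":
        # Triangular ("linear fuzzy") ramp: 1 at the change, 0 at distance tau.
        return np.clip(1.0 - d / tau, 0.0, 1.0)
    if kind == "quadratic":
        # Parabolic ramp: flatter near the change, steeper near the window edge.
        return np.clip(1.0 - (d / tau) ** 2, 0.0, 1.0)
    if kind == "gaussian":
        # Assumed width: the window edge lies three standard deviations out.
        sigma = tau / 3.0
        return np.exp(-0.5 * (d / sigma) ** 2)
    raise ValueError(f"unknown labeling kind: {kind!r}")


if __name__ == "__main__":
    frame_times = np.array([0.0, 0.3, 0.6, 2.0])
    changes = [0.6]
    for kind in ("binary", "linear", "quadratic", "gaussian"):
        print(kind, make_labels(frame_times, changes, kind=kind))
```

Under these assumptions, the binary variant yields a hard target, while the other three yield soft targets that peak at the annotated change and decay with distance, which is the kind of difference the abstract's comparison is about.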

Acknowledgment

This research was supported by the Grant Agency of the Czech Republic, project No. P103/12/G084, and by the grant of the University of West Bohemia, project No. SGS-2016-039. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.

Author information

Authors and Affiliations

Faculty of Applied Sciences, NTIS - New Technologies for the Information Society, University of West Bohemia in Pilsen, Univerzitní 22, 306 14, Pilsen, Czech Republic
Marek Hrúz & Petr Salajka

Corresponding author

Correspondence to Marek Hrúz.

Editor information

Editors and Affiliations

SPIIRAS, Saint Petersburg, Russia
Alexey Karpov
Moscow State Linguistic University, Moscow, Russia
Rodmonga Potapova
University of Hertfordshire, Hatfield, United Kingdom
Iosif Mporas

About this paper

Cite this paper

Hrúz, M., Salajka, P. (2017). Phase Analysis and Labeling Strategies in a CNN-Based Speaker Change Detection System. In: Karpov, A., Potapova, R., Mporas, I. (eds) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science, vol 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_61

DOI: https://doi.org/10.1007/978-3-319-66429-3_61
Published: 13 August 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66428-6
Online ISBN: 978-3-319-66429-3
eBook Packages: Computer Science, Computer Science (R0)
