Lightweight Embeddings for Speaker Verification

Tkachenko, Maxim; Yamshinin, Alexander; Kotov, Mikhail; Nastasenko, Marina

doi:10.1007/978-3-319-99579-3_70

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11096)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

This paper presents a speaker verification (SV) system that uses deep neural networks with hash representations (binarization) of embeddings. Training is performed on the NIST SRE training set, and verification is evaluated on the test set of the same corpus. The system architecture is based on deep recurrent layers with an attention mechanism. Semi-hard triplet selection is used during training. The final layer of the network is a tanh function, which makes end-to-end training of the hash representation possible. As a consequence, the system reduces the embedding memory size by a factor of 32 and improves system evaluation performance. The equal error rate (EER) achieved is reported with regard to embeddings without binarization.
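The factor of 32 in the abstract is consistent with replacing each 32-bit floating-point component of an embedding with a single bit. The following is a minimal sketch of that idea, not the authors' implementation: it assumes that binarization takes the sign of each tanh output and that binary codes are compared by Hamming distance. The function names, the 8-dimensional size, and the input values are all hypothetical and chosen only for illustration.

```python
import math
import struct


def binarize(embedding):
    """Pack the signs of a tanh-activated embedding into one integer bit code."""
    bits = 0
    for i, value in enumerate(embedding):
        if value > 0:  # assumption: threshold at zero, positive maps to bit 1
            bits |= 1 << i
    return bits


def hamming_distance(a, b):
    """Count the bit positions in which two packed codes differ."""
    return bin(a ^ b).count("1")


# Hypothetical pre-activation network outputs (illustrative values only).
raw_a = [0.9, -1.2, 0.3, -0.4, 2.0, -0.1, 0.7, -2.5]
raw_b = [1.1, -0.8, 0.2, -0.6, 1.5, 0.2, 0.9, -1.9]   # assumed: same speaker as raw_a
raw_c = [-0.7, 1.0, -0.5, 0.8, -1.3, 0.4, -0.2, 1.6]  # assumed: different speaker

# A tanh output layer squashes every component into (-1, 1).
emb_a = [math.tanh(x) for x in raw_a]
emb_b = [math.tanh(x) for x in raw_b]
emb_c = [math.tanh(x) for x in raw_c]

code_a, code_b, code_c = binarize(emb_a), binarize(emb_b), binarize(emb_c)

# Memory comparison: float32 storage versus one bit per dimension.
dim = len(emb_a)
float_bytes = len(struct.pack(f"{dim}f", *emb_a))  # 4 bytes per component
binary_bytes = (dim + 7) // 8                      # 1 bit per component, rounded up
compression = float_bytes / binary_bytes

print("same-speaker distance:", hamming_distance(code_a, code_b))
print("different-speaker distance:", hamming_distance(code_a, code_c))
print("compression factor:", compression)
```

In this toy setup a verification decision would compare the Hamming distance against a threshold; how the paper's system scores binarized embeddings is not stated in the abstract.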

Author information

Authors and Affiliations

ASM Solutions LLC, Moscow, Russia
Maxim Tkachenko, Alexander Yamshinin & Mikhail Kotov
Master Synthesis LLC, Moscow, Russia
Marina Nastasenko

Corresponding author

Correspondence to Maxim Tkachenko.

Editor information

Editors and Affiliations

SPIIRAS, St. Petersburg, Russia
Alexey Karpov
Leipzig University of Telecommunications, Leipzig, Germany
Oliver Jokisch
Moscow State Linguistic University, Moscow, Russia
Rodmonga Potapova

About this paper

Cite this paper

Tkachenko, M., Yamshinin, A., Kotov, M., Nastasenko, M. (2018). Lightweight Embeddings for Speaker Verification. In: Karpov, A., Jokisch, O., Potapova, R. (eds) Speech and Computer. SPECOM 2018. Lecture Notes in Computer Science, vol 11096. Springer, Cham. https://doi.org/10.1007/978-3-319-99579-3_70

DOI: https://doi.org/10.1007/978-3-319-99579-3_70
Published: 25 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99578-6
Online ISBN: 978-3-319-99579-3
eBook Packages: Computer Science, Computer Science (R0)
