Abstract
The d-vector approach has achieved impressive results in speaker verification. An utterance-level representation is obtained by averaging the frame-level outputs of a hidden layer of a deep neural network (DNN). Although this mean-based speaker identity representation performs well, it ignores the variability of frames across the whole utterance, which leads to information loss. This is particularly serious for text-dependent speaker verification, where within-utterance feature variability reflects text variability better than the mean does. To address this issue, this paper proposes a new covariance-based speaker representation: the covariance of the frame-level outputs is calculated and incorporated into the speaker identity representation. The proposed approach is investigated within a joint multi-task learning framework for text-dependent speaker verification. Experiments on RSR2015 and RedDots show that the covariance-based deep feature significantly improves performance compared with traditional mean-based deep features.
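The difference between the two utterance-level representations can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how the covariance is incorporated, so flattening the upper triangle of the covariance matrix and concatenating it with the mean is an assumption made here for illustration. The function names and the toy dimensions are likewise hypothetical.

```python
import numpy as np


def mean_representation(frames: np.ndarray) -> np.ndarray:
    """Mean-based utterance vector: average the frame-level outputs over time."""
    return frames.mean(axis=0)


def covariance_representation(frames: np.ndarray) -> np.ndarray:
    """Covariance-based utterance vector (illustrative layout, an assumption).

    The covariance matrix is symmetric, so only its upper triangle
    (including the diagonal) is kept and flattened into a vector.
    """
    # rowvar=False: each row is one frame, each column one hidden unit.
    cov = np.atleast_2d(np.cov(frames, rowvar=False))
    rows, cols = np.triu_indices(cov.shape[0])
    return cov[rows, cols]


def combined_representation(frames: np.ndarray) -> np.ndarray:
    """Concatenate mean and covariance statistics (assumed combination)."""
    return np.concatenate(
        [mean_representation(frames), covariance_representation(frames)]
    )


# Toy input: 200 frames of 4-dimensional hidden-layer outputs (random stand-ins).
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 4))
vec = combined_representation(frames)
print(vec.shape)  # (14,) = 4 mean values + 10 upper-triangle covariance values
```

For a hidden layer of dimension d, this layout yields d + d(d+1)/2 values per utterance, so the covariance part grows quadratically with the layer size; a real system would likely need a narrow layer or some dimensionality reduction.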
This work has been supported by the National Key Research and Development Program of China under Grant No. 2017YFB1002102 and the China NSFC projects (No. U1736202 and No. 61603252). Experiments have been carried out on the PI supercomputer at Shanghai Jiao Tong University.
Notes
1. To obtain a good estimate of the within-class covariance, the product of this parameter and the between-class covariance is added to the within-class covariance.
2. Speaker errors occur when an impostor speaker who utters the correct text is accepted, while text errors occur when an enrolled speaker who utters the wrong text is accepted.
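The smoothing described in the first note can be written as a single equation. The symbols are assumptions, since none of them appear in this excerpt: $S_w$ is the within-class covariance, $S_b$ the between-class covariance, $\alpha$ the parameter the note refers to, and $\hat{S}_w$ the smoothed estimate.

```latex
\hat{S}_w = S_w + \alpha \, S_b
```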
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, S., Dinkel, H., Qian, Y., Yu, K. (2018). Covariance Based Deep Feature for Text-Dependent Speaker Verification. In: Peng, Y., Yu, K., Lu, J., Jiang, X. (eds) Intelligence Science and Big Data Engineering. IScIDE 2018. Lecture Notes in Computer Science(), vol 11266. Springer, Cham. https://doi.org/10.1007/978-3-030-02698-1_20
DOI: https://doi.org/10.1007/978-3-030-02698-1_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02697-4
Online ISBN: 978-3-030-02698-1
eBook Packages: Computer Science, Computer Science (R0)