Covariance Based Deep Feature for Text-Dependent Speaker Verification

  • Conference paper
Intelligence Science and Big Data Engineering (IScIDE 2018)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11266)

Abstract

The d-vector approach has achieved impressive results in speaker verification. The representation is obtained at the utterance level by averaging the frame-level outputs of a hidden layer of the DNN. Although this mean-based speaker identity representation performs well, it ignores the variability of frames across the whole utterance, which leads to information loss. This is particularly serious for text-dependent speaker verification, where within-utterance feature variability reflects text variability better than the mean does. To address this issue, a new covariance-based speaker representation is proposed in this paper: the covariance of the frame-level outputs is calculated and incorporated into the speaker identity representation. The proposed approach is investigated within a joint multi-task learning framework for text-dependent speaker verification. Experiments on RSR2015 and RedDots show that the covariance-based deep feature can significantly improve performance compared to the traditional mean-based deep feature.
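The difference between the two utterance-level representations can be illustrated with a minimal NumPy sketch. The abstract does not say how the covariance is incorporated, so the choices here are assumptions made only for illustration: the covariance matrix is flattened to its upper triangle and concatenated with the mean, and the function names and toy dimensions are hypothetical.

```python
import numpy as np


def mean_feature(frames):
    """Mean-based representation: average the frame-level outputs over time."""
    return frames.mean(axis=0)


def covariance_feature(frames):
    """Covariance of the frame-level outputs, flattened to its upper triangle.

    Assumption: the matrix is symmetric, so only the upper triangle
    (including the diagonal) is kept. The paper may vectorise it differently.
    """
    cov = np.atleast_2d(np.cov(frames, rowvar=False))  # shape (dim, dim)
    rows, cols = np.triu_indices(cov.shape[0])
    return cov[rows, cols]


def utterance_representation(frames):
    """Assumed combination: concatenate mean and covariance features."""
    return np.concatenate([mean_feature(frames), covariance_feature(frames)])


# Toy stand-in for hidden-layer outputs: 200 frames, 4 dimensions.
rng = np.random.default_rng(0)
hidden_outputs = rng.standard_normal((200, 4))

representation = utterance_representation(hidden_outputs)
print(representation.shape)  # (14,) = 4 mean values + 4 * 5 / 2 covariance values
```

For a hidden layer of dimension d the covariance part has d(d+1)/2 entries, so in practice a low-dimensional layer would likely be needed to keep the representation manageable.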

This work has been supported by the National Key Research and Development Program of China under Grant No. 2017YFB1002102 and the China NSFC projects (No. U1736202 and No. 61603252). Experiments have been carried out on the PI supercomputer at Shanghai Jiao Tong University.

Notes

  1. In order to get a good estimate of the within-class covariance, the product of this parameter and the between-class covariance is added to the within-class covariance.

  2. Speaker errors occur when an impostor speaker who utters the correct text is accepted, while text errors occur when an enrolled speaker who utters the wrong text is accepted.
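The smoothing described in the first note can be sketched as follows. The note does not name the parameter, so `alpha` is a hypothetical name, and the scatter-matrix definitions used here (sums normalised by the total number of samples) are a common convention rather than necessarily the one used in the paper.

```python
import numpy as np


def smoothed_within_class_cov(features, labels, alpha):
    """Return (S_w + alpha * S_b, S_w, S_b) for labelled feature vectors.

    alpha is a hypothetical name for the smoothing parameter in the note.
    """
    dim = features.shape[1]
    global_mean = features.mean(axis=0)
    s_w = np.zeros((dim, dim))  # within-class scatter
    s_b = np.zeros((dim, dim))  # between-class scatter
    for c in np.unique(labels):
        x = features[labels == c]
        class_mean = x.mean(axis=0)
        centered = x - class_mean
        s_w += centered.T @ centered
        diff = (class_mean - global_mean)[:, None]
        s_b += len(x) * (diff @ diff.T)
    s_w /= len(features)
    s_b /= len(features)
    return s_w + alpha * s_b, s_w, s_b


# Toy data: 6 classes (e.g. speakers), 10 vectors each, 3 dimensions.
rng = np.random.default_rng(0)
toy_features = rng.standard_normal((60, 3))
toy_labels = np.repeat(np.arange(6), 10)

smoothed, s_w, s_b = smoothed_within_class_cov(toy_features, toy_labels, alpha=0.1)
```

With few samples per class the within-class estimate can be poorly conditioned, and adding a scaled between-class term is one way to regularise it.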

Corresponding author

Correspondence to Kai Yu.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, S., Dinkel, H., Qian, Y., Yu, K. (2018). Covariance Based Deep Feature for Text-Dependent Speaker Verification. In: Peng, Y., Yu, K., Lu, J., Jiang, X. (eds) Intelligence Science and Big Data Engineering. IScIDE 2018. Lecture Notes in Computer Science, vol 11266. Springer, Cham. https://doi.org/10.1007/978-3-030-02698-1_20

  • DOI: https://doi.org/10.1007/978-3-030-02698-1_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-02697-4

  • Online ISBN: 978-3-030-02698-1

  • eBook Packages: Computer Science, Computer Science (R0)
