Rapid Speaker Adaptation Based on Combination of KPCA and Latent Variable Model

Abstract

Speaker adaptation shifts a speaker-independent acoustic model toward the speech characteristics of a new speaker in order to improve recognition performance. Kernel eigenspace-based speaker adaptation methods achieve satisfactory performance with only a small amount of adaptation data. In these methods, kernel principal component analysis (KPCA) is applied to the training-speaker space to construct a kernel eigenspace, and the acoustic model adapted to the new user is then computed in that space. A limitation of KPCA is that the model adapted in the kernel eigenspace has no exact pre-image back in the speaker space, so adaptation becomes computationally expensive. Previously developed approximations of this pre-image do not necessarily yield optimal solutions. In this paper, we therefore propose an efficient method for constructing a more reliable pre-image of the adapted model in the speaker space. To this end, we employ a latent variable model to define a probabilistic description of the mapping between the kernel eigenspace and the speaker space. Experiments were conducted on two speech databases: FARSDAT (Persian) and TIMIT (English). Using a typical HMM-based automatic speech recognition system, the proposed method, with about three seconds of adaptation data, achieves up to 4.4% and 7.6% relative improvement in phoneme recognition accuracy over the speaker-independent model on FARSDAT and TIMIT, respectively. Moreover, it outperforms other kernel eigenspace-based adaptation methods.
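
To make the pipeline described above concrete, the sketch below illustrates the three generic steps of kernel eigenspace adaptation: fitting KPCA on training-speaker representations, projecting a new speaker into the kernel eigenspace, and mapping the adapted point back to the speaker space. This is a minimal illustration under stated assumptions, not the authors' implementation: the speaker supervectors, dimensions, and random data are hypothetical placeholders, and the inverse mapping shown is scikit-learn's generic approximate pre-image, whereas the paper proposes a latent-variable-model mapping instead.

```python
# Minimal sketch of the kernel eigenspace adaptation workflow described above,
# NOT the authors' implementation. Speaker "supervectors", the dimensions, and
# the random data are hypothetical placeholders. The pre-image step here uses
# scikit-learn's generic learned inverse map; the paper instead proposes a
# latent-variable-model mapping from the kernel eigenspace back to the
# speaker space.
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
n_speakers, dim = 100, 512                     # hypothetical corpus size / supervector length
X_train = rng.normal(size=(n_speakers, dim))   # one supervector per training speaker

# 1. Apply KPCA to the training-speaker space to build the kernel eigenspace.
kpca = KernelPCA(n_components=10, kernel="rbf", gamma=1.0 / dim,
                 fit_inverse_transform=True, alpha=1e-3)
kpca.fit(X_train)

# 2. Project a rough supervector estimated from a few seconds of adaptation
#    data into the kernel eigenspace and (in a real system) refine it there.
x_new = rng.normal(size=(1, dim))              # placeholder for the new speaker's estimate
z_adapted = kpca.transform(x_new)              # coordinates in the kernel eigenspace

# 3. Pre-image problem: map the adapted point back to the speaker space to
#    obtain usable acoustic-model parameters.
x_adapted = kpca.inverse_transform(z_adapted)
print(x_adapted.shape)                         # (1, 512) -> adapted supervector
```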

Data Availability

The datasets analyzed during the current study are not freely available; both must be purchased via the following links: FARSDAT: http://catalog.elra.info/en-us/repository/browse/ELRA-S0112/; TIMIT: https://www.ldc.upenn.edu/language-resources/data/obtaining.

Code Availability

The code will be made available upon reasonable request.

Author information

Corresponding author

Correspondence to Zohreh Ansari.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ansari, Z., Almasganj, F. & Kabudian, S.J. Rapid Speaker Adaptation Based on Combination of KPCA and Latent Variable Model. Circuits Syst Signal Process 40, 3996–4017 (2021). https://doi.org/10.1007/s00034-021-01660-6
