Abstract
Speaker adaptation shifts a speaker-independent acoustic model toward the speech characteristics of a new speaker in order to improve speech recognition performance. Kernel eigenspace-based speaker adaptation methods perform satisfactorily with only a small amount of adaptation data. In these methods, kernel principal component analysis (KPCA) is applied to the training speaker space to construct a kernel eigenspace, and the acoustic model adapted to the new speaker is then estimated in that space. One limitation of KPCA is that it cannot define an exact pre-image of the adapted model, that is, a mapping from the kernel eigenspace back to the speaker space. As a result, adaptation requires a large amount of computation, and previously developed methods for computing an approximate pre-image of the adapted model do not necessarily yield an optimal solution. In this paper, we propose an efficient solution to this problem that constructs a more reliable pre-image of the adapted model in the speaker space. To this end, we use a latent variable model to define a probabilistic description of the mapping between the kernel eigenspace and the speaker space. Experiments were conducted on two speech databases: FARSDAT (Persian) and TIMIT (English). With a typical HMM-based automatic speech recognition system, the proposed method, using about three seconds of adaptation data, achieves relative improvements in phoneme recognition accuracy of up to 4.4% and 7.6% over the speaker-independent model on FARSDAT and TIMIT, respectively. Moreover, the proposed approach outperforms the other kernel eigenspace-based adaptation methods considered.
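To make the pre-image problem concrete, the following is a minimal sketch of KPCA with a Gaussian kernel together with a classical fixed-point approximation of the pre-image. It is not the latent variable model proposed in the paper; it only illustrates why a point in the kernel eigenspace has no exact counterpart in the input space and has to be approximated iteratively. The function names, the kernel width `gamma`, and the random matrix `X` (standing in for training-speaker model vectors) are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    # Pairwise Gaussian kernel exp(-gamma * ||a - b||^2) between rows of A and B.
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_kpca(X, gamma, n_components):
    # Eigendecompose the centred kernel matrix of the training points.
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    H = np.eye(n) - np.ones((n, n)) / n
    eigvals, eigvecs = np.linalg.eigh(H @ K @ H)
    idx = np.argsort(eigvals)[::-1][:n_components]
    # Scale so that each feature-space eigenvector has unit norm.
    alphas = eigvecs[:, idx] / np.sqrt(eigvals[idx])
    return K, alphas

def project(x, X, K, alphas, gamma):
    # Coordinates of x in the kernel eigenspace (centred kernel vector times alphas).
    k = rbf_kernel(x[None, :], X, gamma)[0]
    k_c = k - k.mean() - K.mean(axis=0) + K.mean()
    return k_c @ alphas

def approx_preimage(beta, X, alphas, gamma, n_iter=100):
    # Fixed-point iteration for a Gaussian kernel: an approximation only,
    # since an exact pre-image generally does not exist.
    g = alphas @ beta
    coef = g - g.mean() + 1.0 / X.shape[0]  # expansion weights incl. the feature-space mean
    z = coef @ X                            # initial guess: weighted combination of training points
    for _ in range(n_iter):
        w = coef * rbf_kernel(z[None, :], X, gamma)[0]
        denom = w.sum()
        if abs(denom) < 1e-12:              # guard against a degenerate update
            break
        z = (w @ X) / denom
    return z

# Toy data: 20 hypothetical "speaker vectors" of dimension 5.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
gamma = 0.05
K, alphas = fit_kpca(X, gamma, n_components=3)
beta = project(X[0], X, K, alphas, gamma)
z = approx_preimage(beta, X, alphas, gamma)
```

Here `beta` plays the role of a model estimated in the kernel eigenspace, and `z` is the approximate point in the original space whose feature-space image is close to it. The iteration can stall or converge to a poor local solution, which is the kind of unreliability the abstract refers to.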
Data Availability
The datasets analyzed during the current study are not freely available; access to each requires payment through the following links. FARSDAT: http://catalog.elra.info/en-us/repository/browse/ELRA-S0112/. TIMIT: https://www.ldc.upenn.edu/language-resources/data/obtaining.
Code Availability
The code will be made available on reasonable request.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Ansari, Z., Almasganj, F. & Kabudian, S.J. Rapid Speaker Adaptation Based on Combination of KPCA and Latent Variable Model. Circuits Syst Signal Process 40, 3996–4017 (2021). https://doi.org/10.1007/s00034-021-01660-6
DOI: https://doi.org/10.1007/s00034-021-01660-6