Abstract
This study explores issues in automatic speech recognition (ASR) of children’s speech using acoustic models trained on adults’ speech. For acoustic modeling in ASR, the front-end features employed capture the characteristics of the vocal tract filter while smoothing out those of the excitation source. Adults’ and children’s speech differ significantly owing to large deviations in acoustic correlates such as pitch, formants, and speaking rate. In the context of children’s speech recognition on mismatched acoustic models, recognition rates remain highly degraded despite the use of vocal tract length normalization (VTLN) to address the formant mismatch. For the commonly used mel-filterbank-based cepstral features, earlier studies have shown that the acoustic mismatch is exacerbated by insufficient smoothing of the pitch harmonics of child speakers. To address this problem, a structured low-rank projection of the test features, as well as of the mean and covariance parameters of the acoustic models, was explored in an earlier work. In this paper, a low-latency adaptation scheme is presented for children’s mismatched ASR. The presented fast adaptation approach exploits the earlier reported low-rank projection technique to reduce the computational cost. In the proposed approach, development data from the children’s domain is partitioned into separate groups on the basis of the estimated VTLN warp factors. A set of adapted acoustic models is then created by combining the low-rank projection with a model-space adaptation technique for each of the warp factors. Given a children’s test utterance, an appropriate pre-adapted model mean supervector is first chosen based on the utterance’s estimated warp factor. The chosen supervector is then optimally scaled. Consequently, only two parameters need to be estimated: a warp factor and a model mean scaling factor. Even with such stringent constraints, the proposed adaptation technique yields a relative improvement of about \(44\%\) over the VTLN-included baseline.
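The two-parameter decoding step described in the abstract can be illustrated with a minimal sketch. This is not the authors’ implementation: the function name, the nearest-neighbour selection over a discrete warp-factor grid, and the least-squares choice of the scaling factor are all assumptions made for illustration only.

```python
import numpy as np

def select_and_scale(warp_factor, warp_grid, mean_supervectors, adapt_stats):
    """Pick the pre-adapted model mean supervector whose VTLN warp factor
    is closest to the utterance's estimate, then scale it by a single
    scalar chosen in the least-squares sense toward the utterance-level
    adaptation statistics. Returns (scaled supervector, index, scale)."""
    warp_grid = np.asarray(warp_grid, dtype=float)
    idx = int(np.argmin(np.abs(warp_grid - warp_factor)))
    mu = np.asarray(mean_supervectors[idx], dtype=float)
    s = np.asarray(adapt_stats, dtype=float)
    # Optimal scalar alpha minimizing ||s - alpha * mu||^2
    alpha = float(mu @ s) / float(mu @ mu)
    return alpha * mu, idx, alpha
```

Only the warp factor (via the nearest grid point) and the single scalar `alpha` are estimated per utterance, which is what keeps the per-utterance adaptation cost low in the scheme described above.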
Notes
An initial version of this work was presented at Interspeech 2015 [34].
References
O. Abdel-Hamid, H. Jiang, Rapid and Effective Speaker Adaptation of Convolutional Neural Network Based Models for Speech Recognition. In: Proceedings of INTERSPEECH, pp. 1248–1252 (2013)
A. Batliner, M. Blomberg, S. D’Arcy, D. Elenius, D. Giuliani, M. Gerosa, C. Hacker, M. Russell, M. Wong: The PF_STAR Children’s Speech Corpus. In: Proceedings of INTERSPEECH, pp. 2761–2764 (2005)
L. Bell, J. Gustafson, Children’s Convergence in Referring Expressions to Graphical Objects in a Speech-Enabled Computer Game. In: Proceedings of INTERSPEECH, pp. 2209–2212 (2007)
D. Burnett, M. Fanty, Rapid unsupervised adaptation to children’s speech on a connected-digit task. Proc. ICSLP 2, 1145–1148 (1996)
S. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28(4), 357–366 (1980). doi:10.1109/TASSP.1980.1163420
V. Digalakis, D. Rtischev, L. Neumeyer, Speaker adaptation using constrained estimation of Gaussian mixtures. IEEE Trans. Speech Audio Process. 3, 357–366 (1995)
S. Evans, N. Neave, D. Wakelin, Relationships between vocal characteristics and body size and shape in human males: an evolutionary explanation for a deep male voice. Biol. Psychol. 72(2), 160–163 (2006)
J. Fainberg, Improving Children’s Speech Recognition Through Out of Domain Data Augmentation. Master’s thesis, School of Informatics, University of Edinburgh (2015)
W.M. Fisher, G.R. Doddington, K.M. Goudie-Marshall, The DARPA Speech Recognition Research Database: Specifications and Status. In: Proceedings of DARPA Workshop on Speech Recognition, pp. 93–99 (1986)
M.J.F. Gales, Cluster adaptive training of hidden Markov models. IEEE Trans. Speech Audio Process. 8(4), 417–428 (2000)
J.L. Gauvain, C.H. Lee, Maximum a-posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Speech Audio Process. 2, 291–298 (1994)
S. Ghai, Addressing Pitch Mismatch for Children’s Automatic Speech Recognition. Ph.D. thesis, Department of EEE, Indian Institute of Technology Guwahati, India (2011)
S. Ghai, R. Sinha, Exploring the Role of Spectral Smoothing in Context of Children’s Speech Recognition. In: Proceedings of INTERSPEECH, pp. 1607–1610 (2009)
S. Ghai, R. Sinha, Exploring the effect of differences in the acoustic correlates of adults’ and children’s speech in the context of automatic speech recognition. EURASIP J. Audio Speech Music Process. 2010, 7:1–7:15 (2010)
J. González, Formant frequencies and body size of speaker: a weak relationship in adult humans. J. Phon. 32(2), 277–287 (2004)
S.S. Gray, D. Willett, J. Pinto, J. Lu, P. Maergner, N. Bodenstab, Child Automatic Speech Recognition for US English: Child Interaction with Living-Room-Electronic-Devices. In: Proceedings of INTERSPEECH, Workshop on Child, Computer and Interaction (2014)
J. Gustafson, K. Sjolander, Voice Transformations for Improving Children’s Speech Recognition in a Publicly Available Dialogue System. In: Proceedings of ICSLP, pp. 297–300 (2002)
A. Hagen, B. Pellom, R. Cole, Children’s Speech Recognition with Application to Interactive Books and Tutors. In: Proceedings of ASRU Workshop, pp. 186–191 (2003)
A. Hagen, B. Pellom, R. Cole, Highly accurate children’s speech recognition for interactive reading tutors using subword units. Speech Commun. 49(12), 861–873 (2007)
T.J. Hazen, J.R. Glass, A Comparison of Novel Techniques for Instantaneous Speaker Adaptation. In: Proceedings of European Conference on Speech Communication and Technology, pp. 2047–2050 (1997)
I.T. Jolliffe, Principal Component Analysis (Springer, Berlin, 1986)
H.K. Kathania, S. Shahnawazuddin, R. Sinha, Exploring HLDA Based Transformation for Reducing Acoustic Mismatch in Context of Children Speech Recognition. In: Proceedings of International Conference on Signal Processing and Communications (SPCOM), pp. 1–5 (2014)
R. Kuhn, J.C. Junqua, P. Nguyen, N. Niedzielski, Rapid speaker adaptation in eigenvoice space. IEEE Trans. Speech Audio Process. 8(6), 695–707 (2000)
N. Kumar, A.G. Andreou, Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. Speech Commun. 26(4), 283–297 (1998)
L. Lee, R. Rose, A frequency warping approach to speaker normalization. IEEE Trans. Speech Audio Process. 6(1), 49–60 (1998)
C.J. Leggetter, P.C. Woodland, Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput. Speech Lang. 9, 171–185 (1995)
H. Liao, Speaker Adaptation of Context Dependent Deep Neural Networks. In: Proceedings of ICASSP, pp. 7947–7951 (2013)
R. Nisimura, A. Lee, H. Saruwatari, K. Shikano, Public speech-oriented guidance system with adult and child discrimination capability. Proc. ICASSP 1, 433–436 (2004)
A. Potamianos, S. Narayanan, Robust recognition of children’s speech. IEEE Trans. Speech Audio Process. 11(6), 603–616 (2003)
T. Robinson, J. Fransen, D. Pye, J. Foote, S. Renals, WSJCAM0: A British English Speech Corpus for Large Vocabulary Continuous Speech Recognition. In: Proceedings of ICASSP, pp. 81–85 (1995)
J. Schalkwyk, D. Beeferman, F. Beaufays, B. Byrne, C. Chelba, M. Cohen, M. Kamvar, B. Strope, Your word is my command: Google search by voice: a case study. In: Advances in Speech Recognition: Mobile Environments, Call Centers and Clinics, chap. 4, pp. 61–90 (2010)
S. Shahnawazuddin, H. Kathania, R. Sinha, Enhancing the Recognition of Children’s Speech on Acoustically Mismatched ASR System. In: Proceedings of IEEE TENCON (2015)
S. Shahnawazuddin, R. Sinha, Improved bases selection in acoustic model interpolation for fast on-line adaptation. IEEE Signal Process. Lett. 21(4), 493–497 (2014)
S. Shahnawazuddin, R. Sinha, Low-Memory Fast On-Line Adaptation for Acoustically Mismatched Children’s Speech Recognition. In: Proceedings of INTERSPEECH, pp. 1630–1634 (2015)
X. Shao, B. Milner, Pitch Prediction from MFCC Vectors for Speech Reconstruction. In: Proceedings of ICASSP, pp. 97–100 (2004)
H. Singer, S. Sagayama, Pitch Dependent Phone Modelling for HMM Based Speech Recognition. In: Proceedings of ICASSP, pp. 273–276 (1992)
R. Sinha, S. Ghai, On the Use of Pitch Normalization for Improving Children’s Speech Recognition. In: Proceedings of INTERSPEECH, pp. 568–571 (2009)
T. Tan, Y. Qian, K. Yu, Cluster adaptive training for deep neural network based acoustic model. IEEE/ACM Trans. Audio Speech Lang. Process. 24(3), 459–468 (2016)
P.C. Woodland, Speaker Adaptation for Continuous Density HMMs: A Review. In: Proceedings of ISCA Tutorial and Research Workshop on Adaptation Methods for Speech Recognition, pp. 11–19 (2001)
S. Xue, O. Abdel-Hamid, H. Jiang, L. Dai, Q. Liu, Fast adaptation of deep neural network based on discriminant codes for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 22(12), 1713–1725 (2014)
K. Yao, D. Yu, F. Seide, H. Su, L. Deng, Y. Gong, Adaptation of Context-Dependent Deep Neural Networks for Automatic Speech Recognition. In: Proceedings of SLT, pp. 366–369 (2012)
S. Young, G. Evermann, M.J.F. Gales, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, P. Woodland, The HTK Book (Version 3.4). Cambridge University Engineering Department, Cambridge, U.K. (2006)
Cite this article
Shahnawazuddin, S., Sinha, R. A Fast Adaptation Approach for Enhanced Automatic Recognition of Children’s Speech with Mismatched Acoustic Models. Circuits Syst Signal Process 37, 1098–1115 (2018). https://doi.org/10.1007/s00034-017-0586-6