Abstract
Recognizing children’s speech with automatic speech recognition (ASR) systems trained on adults’ speech is a very challenging task. Several earlier works report severely degraded recognition performance in such tasks. This is mainly due to the gross mismatch in the acoustic and linguistic attributes of the two groups of speakers. One identified source of mismatch is that the vocal organs of adult and child speakers differ significantly in their dimensions. Feature-space normalization techniques are known to address the ill-effects arising from those differences effectively. The two most commonly used approaches are vocal-tract length normalization and feature-space maximum-likelihood linear regression. Another important mismatch factor is the large difference in average pitch between adult and child speakers. Addressing the ill-effects introduced by these pitch differences is the primary focus of the presented study. To this end, we explore the feasibility of explicitly changing the pitch of children’s speech so that the observed pitch differences between the two groups of speakers are reduced. In general, children’s speech is higher-pitched than adults’ speech. Consequently, in this study, the pitch of the adults’ speech used for training the ASR system is kept unchanged, while the pitch of the children’s test speech is reduced. This explicit reduction of pitch yields a significant improvement in recognition performance. To conserve the critical spectral information and to avoid introducing perceptual artifacts, we exploit time-scale modification techniques for explicit pitch mapping. Furthermore, we present two schemes to automatically determine the factor by which the pitch of the given test data should be varied.
Automatically determining the compensation factor is critical since an ASR system is expected to be accessed by both adult and child speakers. The effectiveness of the proposed techniques is evaluated on ASR systems trained on adult data and employing different acoustic modeling approaches, viz. Gaussian mixture modeling (GMM), subspace GMM and deep neural networks (DNN). The proposed techniques are found to be highly effective in all the explored modeling paradigms. To further study their effectiveness, another DNN-based ASR system is developed on a mix of speech data from adult and child speakers. Pitch reduction is observed to be effective even in this case.
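The pitch-mapping idea summarized above can be illustrated with a minimal sketch. It assumes one common way of building a pitch shifter from time-scale modification: the signal is first shortened with waveform-similarity overlap-add (WSOLA), which keeps the pitch, and is then read back at fractional steps so that the original duration is restored and every frequency is multiplied by the chosen factor. Note that this resampling step scales all frequencies, formants included. The function names (`wsola`, `lower_pitch`, `estimate_f0`, `choose_factor`), the frame and search sizes, the autocorrelation pitch estimate, the 200 Hz target and the clipping rule are illustrative assumptions only; they are not the two factor-selection schemes proposed in this study.

```python
import numpy as np


def wsola(x, stretch, frame=512, tol=128):
    """Time-scale x so that len(output) ~= stretch * len(x), keeping pitch."""
    if not 0.5 <= stretch <= 2.0:
        raise ValueError("this sketch assumes 0.5 <= stretch <= 2.0")
    hop_s = frame // 2                     # synthesis hop (50% overlap)
    hop_a = hop_s / stretch                # analysis hop in the input
    win = np.hanning(frame + 1)[:frame]    # periodic Hann, sums to 1 at 50% overlap
    n_out = int(round(len(x) * stretch))
    n_frames = n_out // hop_s + 1
    # Zero-pad so that every slice taken below stays inside the array.
    xp = np.concatenate([np.zeros(tol), x, np.zeros(3 * frame + 2 * tol)])
    y = np.zeros(n_frames * hop_s + frame)
    delta = 0                              # alignment offset of the current frame
    for k in range(n_frames):
        pos = int(round(k * hop_a)) + tol + delta
        y[k * hop_s:k * hop_s + frame] += win * xp[pos:pos + frame]
        # Segment that would naturally follow the frame just copied.
        natural = xp[pos + hop_s:pos + hop_s + frame]
        # Search +/- tol samples around the next nominal position for the
        # segment most similar to that natural continuation.
        nxt = int(round((k + 1) * hop_a)) + tol
        region = xp[nxt - tol:nxt + tol + frame]
        delta = int(np.argmax(np.correlate(region, natural, mode="valid"))) - tol
    return y[:n_out]


def lower_pitch(x, alpha):
    """Scale the pitch of x by alpha (alpha < 1 lowers it), keeping duration."""
    x = np.asarray(x, dtype=float)
    y = wsola(x, alpha)                    # duration * alpha, pitch unchanged
    # Reading y at steps of alpha restores the original length and multiplies
    # every frequency by alpha (plain linear interpolation).
    return np.interp(np.arange(len(x)) * alpha, np.arange(len(y)), y)


def estimate_f0(x, fs, fmin=80.0, fmax=500.0, n=4096):
    """Rough average pitch from the autocorrelation peak (assumed estimator)."""
    seg = np.asarray(x[:n], dtype=float)
    seg = seg - seg.mean()
    ac = np.correlate(seg, seg, mode="full")[len(seg) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    return fs / (lo + int(np.argmax(ac[lo:hi + 1])))


def choose_factor(f0, target_f0=200.0):
    """Hypothetical rule: pull f0 toward target_f0, never raising the pitch."""
    return float(np.clip(target_f0 / f0, 0.5, 1.0))


if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    child_like = np.sin(2 * np.pi * 300.0 * t)    # synthetic 300 Hz tone
    alpha = choose_factor(estimate_f0(child_like, fs))
    mapped = lower_pitch(child_like, alpha)
    print(alpha, estimate_f0(mapped, fs))
```

Under this rule, an utterance whose estimated pitch is already at or below the target gets a factor of 1 and should pass through nearly unchanged, which reflects the requirement that the same front end serve both adult and child speakers.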
Notes
Note that neither cepstral mean and variance normalization (CMVN) nor fMLLR is applied to the MFCC features used for the analyses reported here, even though both are applied before the model parameters are trained.
Acknowledgements
The authors would like to express sincere gratitude to the anonymous reviewers for their thoughtful comments and suggestions which greatly helped in improving the quality of the paper.
About this article
Cite this article
Kathania, H.K., Ahmad, W., Shahnawazuddin, S. et al. Explicit Pitch Mapping for Improved Children’s Speech Recognition. Circuits Syst Signal Process 37, 2021–2044 (2018). https://doi.org/10.1007/s00034-017-0652-0
DOI: https://doi.org/10.1007/s00034-017-0652-0