Abstract
It is well known that the recognition performance of an automatic speech recognition (ASR) system is affected by intra-speaker as well as inter-speaker variability. Differences among speakers in the geometry of the vocal organs, pitch, and speaking rate are some of the inter-speaker variabilities that affect recognition performance. A mismatch between the training and test data with respect to any of these factors leads to increased error rates. An example of acoustically mismatched ASR is the task of transcribing children’s speech with a system trained on adult data. A large number of earlier studies present techniques for addressing the acoustic mismatch arising from differences in pitch and in the dimensions of the vocal organs. In contrast, only a few works on speaking-rate adaptation employing timescale modification have been reported, and those studies were performed on ASR systems developed using Gaussian mixture models. Motivated by these facts, this work explores speaking-rate adaptation in the context of a children’s ASR system employing deep neural network-based acoustic modeling. Speaking-rate adaptation is performed by changing the frame-length and frame-overlap during front-end feature extraction. Significant reductions in errors are obtained through speaking-rate adaptation. We have also studied the effect of combining speaking-rate adaptation with vocal-tract length normalization and with explicit pitch modification. In both cases, additive improvements are obtained. To summarize, relative improvements of 15–20% over the baselines are obtained by varying the frame-length and frame-overlap.
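To make the role of frame-length and frame-overlap concrete, the following is a minimal sketch of the framing step that precedes front-end feature extraction. It is not the authors' implementation: the function name `frame_signal`, the 16 kHz sampling rate, the synthetic tone, and the frame settings (20 ms and 15 ms frames with a 10 ms overlap) are illustrative assumptions, since the abstract does not state the values used. The sketch only shows how the two parameters change the number of analysis frames, and hence the effective time resolution, for the same signal.

```python
import math


def frame_signal(samples, sample_rate, frame_ms, overlap_ms):
    """Split a signal into overlapping frames (hypothetical helper).

    Frame length and overlap are given in milliseconds. A trailing
    partial frame is dropped rather than zero-padded.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    overlap = int(round(sample_rate * overlap_ms / 1000.0))
    if frame_len <= 0:
        raise ValueError("frame length must be positive")
    if not 0 <= overlap < frame_len:
        raise ValueError("overlap must be non-negative and shorter than the frame")
    hop = frame_len - overlap  # shift between consecutive frame starts
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]


# One second of a synthetic 220 Hz tone at an assumed 16 kHz sampling rate.
sample_rate = 16000
signal = [math.sin(2 * math.pi * 220 * n / sample_rate) for n in range(sample_rate)]

# Assumed reference setting: 20 ms frames with 10 ms overlap.
default_frames = frame_signal(signal, sample_rate, frame_ms=20, overlap_ms=10)

# Assumed adapted setting: shorter 15 ms frames with the same 10 ms overlap,
# which shrinks the frame shift and yields more frames for the same signal.
short_frames = frame_signal(signal, sample_rate, frame_ms=15, overlap_ms=10)

print(len(default_frames), len(short_frames))
```

In this sketch, shortening the frame while keeping the overlap fixed reduces the frame shift, so the same utterance is represented by more feature vectors. Which direction and magnitude of change benefits a given test set is an empirical question that the study addresses.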
Acknowledgements
The authors express sincere gratitude to the anonymous reviewers for their thoughtful comments and suggestions.
Cite this article
Shahnawazuddin, S., Singh, C., Kathania, H.K. et al. An Experimental Study on the Significance of Variable Frame-Length and Overlap in the Context of Children’s Speech Recognition. Circuits Syst Signal Process 37, 5540–5553 (2018). https://doi.org/10.1007/s00034-018-0828-2
DOI: https://doi.org/10.1007/s00034-018-0828-2