An Experimental Study on the Significance of Variable Frame-Length and Overlap in the Context of Children’s Speech Recognition

Shahnawazuddin, S.; Singh, Chaman; Kathania, Hemant Kumar; Ahmad, Waquar; Pradhan, Gayadhar

doi:10.1007/s00034-018-0828-2

Published: 05 May 2018

Circuits, Systems, and Signal Processing, volume 37, pages 5540–5553 (2018)

S. Shahnawazuddin ORCID: orcid.org/0000-0002-3916-9693¹,
Chaman Singh¹,
Hemant Kumar Kathania²,
Waquar Ahmad² &
Gayadhar Pradhan¹

Abstract

It is well known that the recognition performance of an automatic speech recognition (ASR) system is affected by intra-speaker as well as inter-speaker variability. Differences among speakers in the geometry of the vocal organs, pitch, and speaking rate are some of the inter-speaker variabilities that affect recognition performance. A mismatch between the training and test data with respect to any of these factors leads to increased error rates. One example of acoustically mismatched ASR is the task of transcribing children’s speech on a system trained on adult data. A large number of earlier studies present a myriad of techniques for addressing the acoustic mismatch arising from differences in pitch and in the dimensions of the vocal organs. By contrast, only a few works on speaking-rate adaptation employing timescale modification have been reported. Furthermore, those studies were performed on ASR systems developed using Gaussian mixture models. Motivated by these facts, this work explores speaking-rate adaptation in the context of a children’s ASR system employing deep neural network-based acoustic modeling. Speaking-rate adaptation is performed by changing the frame-length and overlap during the front-end feature extraction process, and it yields significant reductions in errors. In addition, we have studied the effect of combining speaking-rate adaptation with vocal-tract length normalization and explicit pitch modification. In both cases, additive improvements are obtained. To summarize, relative improvements of 15–20% over the baselines are obtained by varying the frame-length and frame-overlap.
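
To make the idea of changing the frame-length and overlap concrete, the following is a minimal sketch of the framing step that precedes feature extraction. It is not the authors' implementation: the function name `frame_signal`, the 16 kHz sampling rate, the Hamming window, and both parameter settings (25 ms frames with 15 ms overlap, and 15 ms frames with 10 ms overlap) are assumptions chosen only for illustration, since the abstract does not state the values used.

```python
import numpy as np


def frame_signal(signal, sample_rate, frame_length_ms, overlap_ms):
    """Split a 1-D signal into overlapping, Hamming-windowed frames.

    Hypothetical helper for illustration; all parameter values used
    below are assumptions, not settings reported in the paper.
    """
    frame_len = int(round(sample_rate * frame_length_ms / 1000.0))
    overlap = int(round(sample_rate * overlap_ms / 1000.0))
    hop = frame_len - overlap  # frame shift in samples
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame length must be positive and exceed the overlap")
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    # Only complete frames are kept; trailing samples are dropped.
    num_frames = 1 + (len(signal) - frame_len) // hop
    # Row i of the index matrix selects samples [i*hop, i*hop + frame_len).
    idx = np.arange(frame_len)[None, :] + hop * np.arange(num_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)


sample_rate = 16000  # assumed sampling rate in Hz
signal = np.random.default_rng(0).standard_normal(sample_rate)  # 1 s of noise

# Assumed baseline: 25 ms frames, 15 ms overlap (10 ms shift).
baseline = frame_signal(signal, sample_rate, frame_length_ms=25, overlap_ms=15)
# Assumed alternative: 15 ms frames, 10 ms overlap (5 ms shift).
adapted = frame_signal(signal, sample_rate, frame_length_ms=15, overlap_ms=10)

print(baseline.shape, adapted.shape)
```

Shortening the frame shift produces more frames for the same utterance, which changes the effective rate at which the acoustic model sees the speech. In a full front-end, each row would then be passed to a spectral feature extractor such as MFCC computation.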

Acknowledgements

The authors express sincere gratitude to the anonymous reviewers for their thoughtful comments and suggestions.

Author information

Authors and Affiliations

Department of ECE, NIT Patna, Patna, India
S. Shahnawazuddin, Chaman Singh & Gayadhar Pradhan
Department of ECE, NIT Sikkim, Ravangla, India
Hemant Kumar Kathania & Waquar Ahmad

Corresponding author

Correspondence to S. Shahnawazuddin.

About this article

Cite this article

Shahnawazuddin, S., Singh, C., Kathania, H.K. et al. An Experimental Study on the Significance of Variable Frame-Length and Overlap in the Context of Children’s Speech Recognition. Circuits Syst Signal Process 37, 5540–5553 (2018). https://doi.org/10.1007/s00034-018-0828-2

Received: 21 July 2017
Revised: 24 April 2018
Accepted: 26 April 2018
Published: 05 May 2018
Issue Date: December 2018
DOI: https://doi.org/10.1007/s00034-018-0828-2
