Abstract
Successful speech recognition for children requires large training data with sufficient speaker variability. Collecting such a training database of children's voices is challenging and very expensive for a zero/low-resource language like Punjabi. In this paper, the data-scarcity problem of the low-resource language Punjabi is addressed through two levels of augmentation. First, the original training corpus is augmented by modifying the prosody parameters pitch and speaking rate; our results show that this augmentation improves system performance over the baseline. Second, the prosody-augmented data is combined with the original data to train a text-to-speech (TTS) system, and the extended dataset is further augmented by synthesizing children's utterances with the TTS system and by sampling the language model using methods that increase acoustic and lexical diversity. The final speech recognition performance shows relative improvements of 50.10% with acoustic-diversity-based and 57.40% with language-diversity-based augmentation over the baseline system, respectively.
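The two augmentation levels can be illustrated with a minimal sketch. This is not the authors' implementation: the speed-perturbation function below is the classic resampling trick (scaling both pitch and speaking rate by a single factor), and the bigram sampler is a toy stand-in for the language-model sampling used to diversify TTS prompts. All function names and parameters are illustrative assumptions.

```python
import random
from collections import defaultdict

import numpy as np


def speed_perturb(y, factor):
    """Speed perturbation by linear-interpolation resampling.

    factor > 1 plays the waveform faster (higher pitch, shorter
    duration); factor < 1 plays it slower and lower. Factors such
    as 0.9 / 1.0 / 1.1 are commonly used to grow a training corpus.
    """
    n_out = int(round(len(y) / factor))
    idx = np.linspace(0.0, len(y) - 1, n_out)
    return np.interp(idx, np.arange(len(y)), y)


def sample_bigram_sentences(corpus_sentences, n=5, seed=0):
    """Toy bigram language model sampler.

    Sentences drawn from bigram statistics recombine the corpus
    vocabulary into new word sequences, adding lexical diversity
    to the text fed into a TTS system.
    """
    rng = random.Random(seed)
    bigrams = defaultdict(list)
    for sent in corpus_sentences:
        toks = ["<s>"] + sent.split() + ["</s>"]
        for a, b in zip(toks, toks[1:]):
            bigrams[a].append(b)
    out = []
    for _ in range(n):
        w, words = "<s>", []
        while True:
            w = rng.choice(bigrams[w])
            if w == "</s>":
                break
            words.append(w)
        out.append(" ".join(words))
    return out
```

In practice, prosody modification that changes pitch and speaking rate independently (as in this paper) requires a phase-vocoder or TD-PSOLA style method rather than plain resampling, and the sampled text would be synthesized with a trained TTS model before being added to the training pool.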
Notes
1. LDC-IL, Punjabi Raw Speech Corpus, https://data.ldcil.org/punjabi-raw-speech-corpus, last accessed 2021/05/10.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Kadyan, V., Kathania, H., Govil, P., Kurimo, M. (2021). Synthesis Speech Based Data Augmentation for Low Resource Children ASR. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science(), vol 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87801-6
Online ISBN: 978-3-030-87802-3