
Synthesis Speech Based Data Augmentation for Low Resource Children ASR

  • Conference paper
  • Speech and Computer (SPECOM 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12997)

Abstract

Successful speech recognition for children requires large training data with sufficient speaker variability, but collecting such a database of children’s voices is challenging and very expensive for a zero/low-resource language like Punjabi. In this paper, the data scarcity issue of low-resource Punjabi is addressed through two levels of augmentation. First, the original training corpus is augmented by modifying the prosody parameters of pitch and speaking rate; our results show that this augmentation improves system performance over the baseline. Second, the augmented data is combined with the original data to train a text-to-speech (TTS) system, and the extended dataset is further augmented by generating synthetic children’s utterances with the TTS system and by sampling the language model with methods that increase acoustic and lexical diversity. The final speech recognition performance indicates a relative improvement of 50.10% with acoustic-diversity-based and 57.40% with language-diversity-based augmentation over the baseline system.
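The abstract's first augmentation level modifies pitch and speaking rate of the original recordings. The paper's exact tooling is not specified here, so the following is a minimal sketch of one common realization, speed perturbation by resampling, which shifts pitch and speaking rate together; the function name, resampling factors, and use of plain NumPy interpolation are my assumptions, not the authors' method.

```python
import numpy as np

def speed_perturb(signal: np.ndarray, factor: float) -> np.ndarray:
    """Resample the waveform so it plays `factor` times faster.

    Resampling changes pitch and speaking rate jointly -- the classic
    speed-perturbation trick used to augment ASR training data.
    """
    n_out = int(round(len(signal) / factor))
    # Fractional positions in the original signal to sample from.
    src_positions = np.linspace(0, len(signal) - 1, num=n_out)
    return np.interp(src_positions, np.arange(len(signal)), signal)

# Example: a 1 s, 16 kHz sine tone; factors 0.9 and 1.1 mimic
# slower and faster speaking rates (hypothetical values).
sr = 16000
t = np.arange(sr) / sr
utterance = np.sin(2 * np.pi * 220.0 * t)
augmented = {f: speed_perturb(utterance, f) for f in (0.9, 1.0, 1.1)}
```

Each perturbed copy is treated as a new training utterance, multiplying the effective speaker variability of a small corpus at negligible cost. Techniques that modify pitch independently of duration (as the paper's prosody modification implies) require a time-frequency method such as phase-vocoder pitch shifting rather than plain resampling.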





Author information

Corresponding author: Hemant Kathania.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Kadyan, V., Kathania, H., Govil, P., Kurimo, M. (2021). Synthesis Speech Based Data Augmentation for Low Resource Children ASR. In: Karpov, A., Potapova, R. (eds.) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol. 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_29


  • DOI: https://doi.org/10.1007/978-3-030-87802-3_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87801-6

  • Online ISBN: 978-3-030-87802-3

  • eBook Packages: Computer Science (R0)
