Abstract
Building a conventional automatic speech recognition (ASR) system based on a hidden Markov model (HMM)/deep neural network (DNN) makes the system complex, as it requires various modules such as acoustic models, a lexicon, linguistic resources, and language models, which is particularly difficult for low-resource languages. In contrast, an End-to-End architecture greatly simplifies model building by representing these complex modules with a single deep network and by replacing linguistic resources with data-driven learning techniques. In this paper, we present our work on exploring the End-to-End (E2E) framework for a Khasi speech recognition system, together with a novel extension towards the development of a speech corpus for the standard Khasi dialect. We implemented the proposed E2E model using the Nabu ASR toolkit. Additionally, three other models (monophone, triphone, and hybrid DNN) were built. Comparing the results, a significant improvement was achieved with the proposed method, particularly with connectionist temporal classification (CTC), which obtained a character error rate (CER) of 5.04%.
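For reference, the character error rate (CER) reported above is the Levenshtein (edit) distance between the recognized and reference character sequences, normalized by the reference length. The following is a minimal, self-contained Python sketch of that metric, not code from the paper; the function name `cer` and the example strings are purely illustrative.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance between the two character
    sequences, divided by the number of reference characters."""
    ref, hyp = list(reference), list(hypothesis)
    # Dynamic-programming table for the Levenshtein distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


if __name__ == "__main__":
    # Hypothetical strings for illustration only (one character substituted).
    print(cer("khublei shibun", "khublei shibon"))
```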