DOI:10.1145/3502814.3502822

End-to-end Speech Recognition Based on BGRU-CTC

Published: 11 April 2022

Abstract

In recent years, end-to-end models have gradually become the development trend in large-vocabulary continuous speech recognition because they are simple to build and easy to train. In this paper, we exploit the strong performance of the bidirectional gated recurrent unit (BGRU), a simplified variant of the long short-term memory (LSTM) network, in the field of speech recognition. We train the model with the connectionist temporal classification (CTC) algorithm to build an end-to-end speech recognition system, and carry out recognition experiments on the TIMIT corpus. The results show that, compared with a traditional recognition model, the improved end-to-end model raises accuracy by about 2.4%.
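
The approach can be pictured with a short sketch. The following is a minimal, hypothetical PyTorch implementation of a bidirectional-GRU acoustic model trained with CTC loss, in the spirit of the system the abstract describes; the layer sizes, 39-dimensional input features, and 61-phoneme TIMIT label set (plus one CTC blank) are illustrative assumptions, not values reported by the authors.

import torch
import torch.nn as nn

class BGRUCTCModel(nn.Module):
    # Hypothetical BGRU-CTC acoustic model; all sizes are illustrative only.
    def __init__(self, num_features=39, hidden_size=256,
                 num_layers=3, num_classes=62):  # 61 phonemes + CTC blank (index 0)
        super().__init__()
        self.bgru = nn.GRU(input_size=num_features, hidden_size=hidden_size,
                           num_layers=num_layers, batch_first=True,
                           bidirectional=True)
        # Project concatenated forward/backward states to per-frame class scores.
        self.fc = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, feats):
        # feats: (batch, time, num_features), e.g. MFCC frames
        out, _ = self.bgru(feats)
        logits = self.fc(out)
        # nn.CTCLoss expects log-probabilities shaped (time, batch, classes).
        return logits.log_softmax(dim=-1).transpose(0, 1)

if __name__ == "__main__":
    model = BGRUCTCModel()
    ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

    feats = torch.randn(2, 100, 39)                    # 2 utterances, 100 frames each
    targets = torch.randint(1, 62, (2, 20))            # dummy phoneme label sequences
    input_lengths = torch.full((2,), 100, dtype=torch.long)
    target_lengths = torch.full((2,), 20, dtype=torch.long)

    log_probs = model(feats)
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    loss.backward()
    print(f"CTC loss: {loss.item():.3f}")

At inference time, a greedy CTC decode (taking the per-frame argmax, collapsing repeats, and removing blanks) would turn the frame-level outputs into a phoneme sequence; the abstract does not specify the decoding setup, so this is likewise only an assumption.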

Published In

SSIP '21: Proceedings of the 2021 4th International Conference on Sensors, Signal and Image Processing
October 2021
81 pages
ISBN:9781450385725
DOI:10.1145/3502814

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. End-to-end model
  2. Connectionist temporal classification
  3. Long short-term memory network
  4. Speech recognition

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SSIP 2021
