DOI:10.1145/3502814.3502822

End-to-end Speech Recognition Based on BGRU-CTC

Published: 11 April 2022

Abstract

In recent years, end-to-end models have gradually become the development trend in large-vocabulary continuous speech recognition because they are simple to build and easy to train. In this paper, we exploit the strong performance of the bidirectional gated recurrent unit (BGRU), a simplified variant of the long short-term memory (LSTM) network, in the field of speech recognition. We train the model with the connectionist temporal classification (CTC) algorithm to build an end-to-end speech recognition system, and carry out recognition experiments on the TIMIT corpus. The results show that, compared with a traditional recognition model, the improved end-to-end model raises accuracy by about 2.4%.
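
The approach can be pictured with a short sketch. The following is a minimal, hypothetical PyTorch implementation of a bidirectional-GRU acoustic model trained with CTC loss, in the spirit of the system the abstract describes; the layer sizes, 39-dimensional input features, and 61-phoneme TIMIT label set (plus one CTC blank) are illustrative assumptions, not values reported by the authors.

import torch
import torch.nn as nn

class BGRUCTCModel(nn.Module):
    # Hypothetical BGRU-CTC acoustic model; all sizes are illustrative only.
    def __init__(self, num_features=39, hidden_size=256,
                 num_layers=3, num_classes=62):  # 61 phonemes + CTC blank (index 0)
        super().__init__()
        self.bgru = nn.GRU(input_size=num_features, hidden_size=hidden_size,
                           num_layers=num_layers, batch_first=True,
                           bidirectional=True)
        # Project concatenated forward/backward states to per-frame class scores.
        self.fc = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, feats):
        # feats: (batch, time, num_features), e.g. MFCC frames
        out, _ = self.bgru(feats)
        logits = self.fc(out)
        # nn.CTCLoss expects log-probabilities shaped (time, batch, classes).
        return logits.log_softmax(dim=-1).transpose(0, 1)

if __name__ == "__main__":
    model = BGRUCTCModel()
    ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

    feats = torch.randn(2, 100, 39)                    # 2 utterances, 100 frames each
    targets = torch.randint(1, 62, (2, 20))            # dummy phoneme label sequences
    input_lengths = torch.full((2,), 100, dtype=torch.long)
    target_lengths = torch.full((2,), 20, dtype=torch.long)

    log_probs = model(feats)
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    loss.backward()
    print(f"CTC loss: {loss.item():.3f}")

At inference time, a greedy CTC decode (taking the per-frame argmax, collapsing repeats, and removing blanks) would turn the frame-level outputs into a phoneme sequence; the abstract does not specify the decoding setup, so this is likewise only an assumption.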

Published In

SSIP '21: Proceedings of the 2021 4th International Conference on Sensors, Signal and Image Processing
October 2021
81 pages
ISBN:9781450385725
DOI:10.1145/3502814

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. End-to-end model
  2. Connectionist temporal classification
  3. Long short-term memory network
  4. Speech recognition

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SSIP 2021
