Convolutional Grid Long Short-Term Memory Recurrent Neural Network for Automatic Speech Recognition

Xue, Jiabin; Zheng, Tieran; Han, Jiqing

doi:10.1007/978-3-030-36802-9_76

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1143)

Included in the following conference series:

International Conference on Neural Information Processing

Abstract

The Grid Long Short-Term Memory (Grid-LSTM), which consists of three steps (two-dimensional grid splitting, local feature projection, and grid sequence modeling), has been widely used in Automatic Speech Recognition (ASR) tasks because of its strong time-frequency modeling ability. However, the network is computationally expensive. An analysis of its process shows that the cost arises in the last step, where two cross-working LSTMs are employed to model time-frequency features in the grid. We therefore speed up the Grid-LSTM by using a smaller grid and propose two enhanced Grid-LSTM models, Convolutional Grid-LSTM (ConvGrid-LSTM) and Multichannel ConvGrid-LSTM (MCConvGrid-LSTM), which reduce the grid size along the two dimensions of the Grid-LSTM, respectively. Along the frequency axis, we use a large frequency stride and prevent the resulting performance loss by embedding a CNN in the Grid-LSTM. Along the time axis, we model several adjacent frames together using the multichannel processing ability of the CNN. Our method achieves a 54% relative reduction in training time and a 19% relative reduction in Word Error Rate (WER) on a character-level End-to-End ASR task.
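To make the grid-size argument concrete, the following is a minimal sketch of how the number of grid cells that the two cross-working LSTMs must visit could shrink when the frequency stride grows and adjacent frames are grouped as channels. The function names and all numbers (1000 frames, 80 frequency bins, block size 8, strides 2 and 8, 2 frames per step) are illustrative assumptions, not values taken from the paper, and the cell count is only a rough proxy for sequential cost, not a model of the reported training time.

```python
def num_frequency_blocks(num_bins: int, block_size: int, stride: int) -> int:
    """Count the frequency blocks produced by sliding a window over the bins."""
    if block_size > num_bins or stride < 1:
        raise ValueError("need block_size <= num_bins and stride >= 1")
    return (num_bins - block_size) // stride + 1


def num_time_steps(num_frames: int, frames_per_step: int) -> int:
    """Count time steps when adjacent frames are grouped (ceiling division)."""
    if frames_per_step < 1:
        raise ValueError("frames_per_step must be >= 1")
    return -(-num_frames // frames_per_step)


def grid_cells(num_frames: int, num_bins: int, block_size: int,
               stride: int, frames_per_step: int = 1) -> int:
    """Total grid cells: time steps multiplied by frequency blocks."""
    return (num_time_steps(num_frames, frames_per_step)
            * num_frequency_blocks(num_bins, block_size, stride))


# Hypothetical settings, chosen only for illustration.
baseline = grid_cells(1000, 80, block_size=8, stride=2)
large_stride = grid_cells(1000, 80, block_size=8, stride=8)
large_stride_grouped = grid_cells(1000, 80, block_size=8, stride=8,
                                  frames_per_step=2)

print(baseline, large_stride, large_stride_grouped)
```

Under these assumed settings the grid shrinks along the frequency axis first and then along the time axis. The sketch does not model the embedded CNN that the abstract describes as the way to recover accuracy lost to the larger stride.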

Acknowledgements

This research was supported by the National Key Research and Development Program of China under Grant 2017YFB1002102 and the National Natural Science Foundation of China under Grant U1736210.

Author information

Authors and Affiliations

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
Jiabin Xue, Tieran Zheng & Jiqing Han

Corresponding author

Correspondence to Jiqing Han.

Editor information

Editors and Affiliations

Australian National University, Canberra, ACT, Australia
Tom Gedeon
Murdoch University, Murdoch, WA, Australia
Kok Wai Wong
Kyungpook National University, Daegu, Korea (Republic of)
Minho Lee

About this paper

Cite this paper

Xue, J., Zheng, T., Han, J. (2019). Convolutional Grid Long Short-Term Memory Recurrent Neural Network for Automatic Speech Recognition. In: Gedeon, T., Wong, K., Lee, M. (eds) Neural Information Processing. ICONIP 2019. Communications in Computer and Information Science, vol 1143. Springer, Cham. https://doi.org/10.1007/978-3-030-36802-9_76

DOI: https://doi.org/10.1007/978-3-030-36802-9_76
Published: 05 December 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-36801-2
Online ISBN: 978-3-030-36802-9
eBook Packages: Computer Science, Computer Science (R0)
