Abstract
Deep learning approaches have been widely applied to Automatic Speech Recognition (ASR), where they achieve high accuracy; Convolutional Neural Networks (CNNs) in particular have recently been investigated for ASR. However, because a CNN increases network depth along a single branch, it may not be wide enough to capture adequate features from human speech signals. We therefore propose a deep and wide CNN architecture, the Multipath Convolutional Neural Network (MCNN). MCNN-CTC combines three additional paths with the Connectionist Temporal Classification (CTC) objective function, forming an end-to-end system that can exploit the spectral and temporal structures of speech signals simultaneously. Experimental results show that the proposed MCNN-CTC structure reduces the error rate of the resulting end-to-end acoustic model. In the absence of a Language Model (LM), our MCNN-CTC acoustic model achieves a relative error-rate reduction of 1.10%–12.08% compared with traditional HMM-based or DCNN-CTC-based models, with strong generalization performance.
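The CTC objective mentioned in the abstract works by marginalizing over all frame-level alignments that collapse to the target label sequence. As a rough illustration of this idea (not the paper's implementation, and ignoring the log-space arithmetic a real trainer would use), the CTC forward probability can be sketched in plain Python:

```python
def ctc_forward_probability(probs, target, blank=0):
    """Total probability of `target` given per-frame distributions `probs`,
    where probs[t][k] = P(symbol k at frame t), summed over all CTC
    alignments (blanks and repeats) that collapse to `target`."""
    # Extended label sequence: a blank before, between, and after labels.
    ext = [blank]
    for s in target:
        ext += [s, blank]
    T, S = len(probs), len(ext)

    # alpha[s] = probability of emitting ext[0..s] using frames 0..t.
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            a = alpha[s]                       # stay on the same state
            if s > 0:
                a += alpha[s - 1]              # advance by one state
            # Skip a blank only between two *different* non-blank labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[s - 2]
            new[s] = a * probs[t][ext[s]]
        alpha = new

    # Valid paths end on the last label or the final blank.
    return alpha[S - 1] + (alpha[S - 2] if S > 1 else 0.0)
```

For example, with two frames over an alphabet {blank, a} and uniform 0.5 probabilities, the target "a" is produced by the alignments "aa", "a-", and "-a", so its total probability is 3 × 0.25 = 0.75. Training minimizes the negative log of this quantity, with gradients flowing back into the convolutional paths.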
Acknowledgements
The work reported here was supported by the National Natural Science Foundation of China (Grant No. 51375209), the 111 Project (Grant No. B18027), the Six Talent Peaks Project in Jiangsu Province (Grant No. ZBZZ-012), and the Research and Innovation Project for College Graduates of Jiangsu Province (Grant Nos. SJCX18-0630 and KYCX18-1846). Finally, the authors would like to thank the providers of the Thchs30 and ST-CMDS datasets for their support.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, W., Zhai, M., Huang, Z., Liu, C., Li, W., Cao, Y. (2019). Towards End-to-End Speech Recognition with Deep Multipath Convolutional Neural Networks. In: Yu, H., Liu, J., Liu, L., Ju, Z., Liu, Y., Zhou, D. (eds) Intelligent Robotics and Applications. ICIRA 2019. Lecture Notes in Computer Science(), vol 11745. Springer, Cham. https://doi.org/10.1007/978-3-030-27529-7_29
DOI: https://doi.org/10.1007/978-3-030-27529-7_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27528-0
Online ISBN: 978-3-030-27529-7
eBook Packages: Computer Science (R0)