CTC Regularized Model Adaptation for Improving LSTM RNN Based Multi-Accent Mandarin Speech Recognition

Yi, Jiangyan; Wen, Zhengqi; Tao, Jianhua; Ni, Hao; Liu, Bin

doi:10.1007/s11265-017-1291-1

Published: 23 September 2017

Volume 90, pages 985–997, (2018)

Journal of Signal Processing Systems

Jiangyan Yi^1,2, Zhengqi Wen^1, Jianhua Tao^1,2,3, Hao Ni^1,2 & Bin Liu^1

Abstract

This paper proposes a novel regularized adaptation method to improve multi-accent Mandarin speech recognition. The acoustic model is a long short-term memory recurrent neural network trained with a connectionist temporal classification loss function (LSTM-RNN-CTC). In general, directly adjusting the network parameters with a small adaptation set may lead to over-fitting. To avoid this problem, a regularization term is added to the original training criterion. It forces the conditional probability distribution estimated by the adapted model to stay close to that of the accent-independent model. Moreover, only the accent-specific output layer needs to be fine-tuned with this adaptation method. Experiments are conducted on the RASC863 and CASIA regional accented speech corpora. The results show that the proposed method clearly improves on the LSTM-RNN-CTC baseline model and also outperforms other adaptation methods.
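
The abstract does not give the exact form of the regularization term, so the following is only a minimal sketch of the general idea, assuming a KL-divergence-style penalty that pulls the adapted model's per-frame label posteriors toward those of the accent-independent model. The function names, the interpolation weight `rho`, and the scalar `task_loss` (a stand-in for the CTC loss computed on the adaptation data) are all hypothetical and not taken from the paper.

```python
import math


def softmax(logits):
    """Convert one frame of output-layer logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same label set."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def regularized_loss(task_loss, adapted_logits, independent_logits, rho):
    """Interpolate the task loss with a penalty on drifting from the reference model.

    task_loss          -- assumed: the CTC loss value on the adaptation data
    adapted_logits     -- per-frame logits from the model being adapted
    independent_logits -- per-frame logits from the accent-independent model
    rho                -- assumed regularization weight in [0, 1]
    """
    kl_terms = [
        kl_divergence(softmax(ind), softmax(ada))
        for ind, ada in zip(independent_logits, adapted_logits)
    ]
    mean_kl = sum(kl_terms) / len(kl_terms)
    return (1.0 - rho) * task_loss + rho * mean_kl


# Toy example: two frames, three output labels (values are made up).
independent = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
adapted = [[1.0, 1.0, -0.5], [0.0, 2.0, 0.0]]
loss = regularized_loss(task_loss=2.0, adapted_logits=adapted,
                        independent_logits=independent, rho=0.5)
```

With `rho = 0` this reduces to unregularized fine-tuning, while larger values keep the adapted posteriors closer to the accent-independent ones. In a full system, only the accent-specific output layer would receive gradient updates from this loss, with the remaining layers held fixed; that part is not shown here.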

Acknowledgements

This work is supported by the National High-Tech Research and Development Program of China (863 Program) (No. 2015AA016305), the National Natural Science Foundation of China (NSFC) (No. 61425017, No. 61403386, No. 61305003), the Strategic Priority Research Program of the CAS (Grant XDB02080006), and the Major Program for the National Social Science Fund of China (13&ZD189).

Author information

Authors and Affiliations

National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Jiangyan Yi, Zhengqi Wen, Jianhua Tao, Hao Ni & Bin Liu
School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, China
Jiangyan Yi, Jianhua Tao & Hao Ni
CAS Center for Excellence in Brain Science and Intelligence Technology, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Jianhua Tao

Corresponding author

Correspondence to Zhengqi Wen.

About this article

Cite this article

Yi, J., Wen, Z., Tao, J. et al. CTC Regularized Model Adaptation for Improving LSTM RNN Based Multi-Accent Mandarin Speech Recognition. J Sign Process Syst 90, 985–997 (2018). https://doi.org/10.1007/s11265-017-1291-1

Received: 15 January 2017
Revised: 17 August 2017
Accepted: 18 September 2017
Published: 23 September 2017
Issue Date: July 2018
DOI: https://doi.org/10.1007/s11265-017-1291-1
