Evaluation of English Speech Recognition for Japanese Learners Using DNN-Based Acoustic Models

  • Conference paper
Recent Advances in Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP 2018)

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 110)

Abstract

For computer-assisted language learning (CALL) systems to make foreign language learning easier, they must recognize learners' utterances with high accuracy. The quality of a CALL system depends mainly on the accuracy of its automatic speech recognition (ASR). However, because the pronunciation of non-native speakers differs greatly from that of native speakers, existing ASR systems cannot recognize non-native speech accurately. To solve this problem, this research proposes an acoustic model based on deep neural networks (DNN), trained on the ERJ (English Read by Japanese) database collected from 202 Japanese learners. Compared with traditional ASR systems, the new system significantly improves speech recognition accuracy.
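
The abstract does not state which metric underlies "recognition accuracy". A common choice in ASR evaluation is word error rate (WER), the word-level edit distance between a reference transcript and the recognizer output, divided by the reference length. The sketch below assumes that metric; the function name `word_error_rate` and the example sentences are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length.

    Assumption: whitespace tokenization and exact string match per word.
    This is a generic illustration, not the paper's evaluation script.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of three reference words.
wer = word_error_rate("i like rice", "i like lice")
```

Word accuracy is often reported as 1 minus WER. Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.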

Acknowledgements

This work was supported by JSPS KAKENHI Grant Number JP17H00823.

Author information

Corresponding author

Correspondence to Akinori Ito.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Fu, J., Chiba, Y., Nose, T., Ito, A. (2019). Evaluation of English Speech Recognition for Japanese Learners Using DNN-Based Acoustic Models. In: Pan, JS., Ito, A., Tsai, PW., Jain, L. (eds) Recent Advances in Intelligent Information Hiding and Multimedia Signal Processing. IIH-MSP 2018. Smart Innovation, Systems and Technologies, vol 110. Springer, Cham. https://doi.org/10.1007/978-3-030-03748-2_11
