DOI: 10.1145/3428757.3429971

research-article

Support software for Automatic Speech Recognition systems targeted for non-native speech

Published: 27 January 2021

ABSTRACT

Modern automatic speech recognition (ASR) systems achieve increasingly high accuracy rates, depending on the methodology applied and the datasets used. Accuracy drops significantly, however, when an ASR system is used by a non-native speaker of the language to be recognized. The main reason is pronunciation and accent features carried over from the speaker's mother tongue. At the same time, the extremely limited volume of labeled non-native speech data makes it difficult to train sufficiently accurate ASR systems for non-native speakers from the ground up.

In this research we address this problem and its influence on ASR accuracy using a style-transfer methodology. We designed a pipeline that modifies the speech of a non-native speaker so that it more closely resembles native speech. The paper covers accent-modification experiments with different setups and approaches, including neural style transfer and an autoencoder, conducted on English pronounced by Japanese speakers (the UME-ERJ dataset). The results show a significant relative improvement in speech recognition accuracy. Our methodology reduces the need to train new algorithms for non-native speech (thus overcoming the data-scarcity obstacle) and can be used as a wrapper for any existing ASR system, since the modification can be performed in real time before a sample is passed to the speech recognizer itself.
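The abstract does not give implementation details, but the autoencoder variant of such an accent-modification wrapper can be sketched at the frame level: train an autoencoder on (stand-in) native-speech feature frames, then run incoming frames through it before they reach the recognizer. The sketch below is a minimal illustration under assumed choices (frame-wise spectrogram-like features, a single tanh hidden layer, plain gradient descent); the paper's actual architecture, features, and training setup may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(frames, hidden=16, epochs=300, lr=0.05):
    """Fit a one-hidden-layer autoencoder to feature frames; returns params and loss curve."""
    d = frames.shape[1]
    W1 = rng.normal(0.0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d)); b2 = np.zeros(d)
    n = len(frames)
    losses = []
    for _ in range(epochs):
        h = np.tanh(frames @ W1 + b1)        # encoder
        out = h @ W2 + b2                    # linear decoder
        err = out - frames                   # reconstruction error
        losses.append(float(np.mean(err ** 2)))
        dh = (err @ W2.T) * (1.0 - h ** 2)   # backprop through tanh
        W2 -= lr * (h.T @ err) / n;     b2 -= lr * err.mean(axis=0)
        W1 -= lr * (frames.T @ dh) / n; b1 -= lr * dh.mean(axis=0)
    return (W1, b1, W2, b2), losses

def accent_filter(frames, params):
    """Wrapper step: project frames through the native-trained autoencoder before ASR."""
    W1, b1, W2, b2 = params
    return np.tanh(frames @ W1 + b1) @ W2 + b2

# Toy stand-in for native-speech frames: 20 feature bins driven by 8 latent factors.
native = (rng.normal(size=(500, 8)) @ rng.normal(size=(8, 20))) * 0.1
params, losses = train_autoencoder(native)
print(f"reconstruction MSE: {losses[0]:.4f} -> {losses[-1]:.4f}")
filtered = accent_filter(native[:3], params)   # frames an ASR system would then consume
```

Because the autoencoder only ever sees native-style frames, passing a non-native frame through `accent_filter` pulls it toward the native feature distribution, which is the intuition behind using the pipeline as a real-time preprocessing wrapper; the dataset and layer sizes here are placeholders, not values from the paper.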


Published in

iiWAS '20: Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services
November 2020, 492 pages

Copyright © 2020 ACM

Publisher

Association for Computing Machinery, New York, NY, United States
