ABSTRACT
Modern automatic speech recognition (ASR) systems achieve increasingly high accuracy, depending on the methodology applied and the datasets used. Accuracy drops significantly, however, when the system is used by a non-native speaker of the language to be recognized. The main reason is the pronunciation and accent features carried over from the speaker's mother tongue. At the same time, the extremely limited volume of labeled non-native speech data makes it difficult to train sufficiently accurate ASR systems for non-native speakers from scratch.
In this research we address this problem and its impact on ASR accuracy using a style transfer methodology. We designed a pipeline that modifies the speech of a non-native speaker so that it more closely resembles native speech. The paper covers accent-modification experiments with different setups and approaches, including neural style transfer and an autoencoder. The experiments were conducted on English spoken by Japanese speakers (the UME-ERJ dataset). The results show a significant relative improvement in speech recognition accuracy. Our methodology reduces the need to train new models for non-native speech (thus overcoming the data-scarcity obstacle) and can be used as a wrapper around any existing ASR system: the modification can be performed in real time, before a sample is passed to the speech recognizer itself.
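The wrapper idea can be made concrete with a minimal sketch: the non-native utterance is converted to a spectrogram, mapped by an accent-modification network, resynthesized with the original phase, and only then handed to an unmodified ASR system. The paper's actual network architecture and trained weights are not reproduced here; the illustrative convolutional autoencoder, file names, sampling rate, and the `recognize` callable below are assumptions.

```python
# Minimal sketch of the accent-modification wrapper pipeline (assumptions noted above).
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras import layers, models

SR = 16000      # assumed sampling rate for the recordings
N_FFT = 512
HOP = 128

def build_accent_autoencoder(n_bins: int) -> tf.keras.Model:
    """Illustrative (untrained) spectrogram-to-spectrogram autoencoder."""
    inp = layers.Input(shape=(None, n_bins, 1))            # (time, freq, channel)
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(x)
    out = layers.Conv2D(1, 3, padding="same")(x)            # predicted "native-like" log-magnitude
    return models.Model(inp, out)

def modify_accent(wav_path: str, model: tf.keras.Model) -> np.ndarray:
    """Map the magnitude spectrogram with the model, keep the original phase."""
    y, _ = librosa.load(wav_path, sr=SR)
    spec = librosa.stft(y, n_fft=N_FFT, hop_length=HOP)
    mag, phase = np.abs(spec), np.angle(spec)
    log_mag = np.log1p(mag).T[np.newaxis, ..., np.newaxis]  # (1, time, freq, 1)
    log_mag_mod = model.predict(log_mag, verbose=0)[0, ..., 0].T
    mag_mod = np.expm1(np.maximum(log_mag_mod, 0.0))
    return librosa.istft(mag_mod * np.exp(1j * phase), hop_length=HOP)

if __name__ == "__main__":
    model = build_accent_autoencoder(n_bins=N_FFT // 2 + 1)
    # model.load_weights("accent_ae.h5")   # hypothetical weights trained on non-native/native pairs
    modified = modify_accent("japanese_english_sample.wav", model)  # hypothetical input file
    # `recognize` stands in for any off-the-shelf ASR system; because the
    # modification happens before recognition, the ASR model needs no retraining.
    # text = recognize(modified, sample_rate=SR)
```

Because the transformation operates frame by frame on short spectrogram windows, it can in principle run in a streaming fashion, which is what allows the modification to be applied in real time before recognition.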
Index Terms
- Support software for Automatic Speech Recognition systems targeted for non-native speech