ABSTRACT
Modern automatic speech recognition (ASR) systems achieve increasingly high accuracy, depending on the methodology applied and the datasets used. Accuracy drops significantly, however, when the system is used by a non-native speaker of the language to be recognized. The main reason is the pronunciation and accent features carried over from the speaker's mother tongue. At the same time, the extremely limited volume of labeled non-native speech data makes it difficult to train sufficiently accurate ASR systems for non-native speakers from scratch.
In this research we address this problem and its impact on ASR accuracy using a style transfer methodology. We designed a pipeline that modifies the speech of a non-native speaker so that it more closely resembles native speech. The paper covers accent-modification experiments with different setups and approaches, including neural style transfer and an autoencoder. The experiments were conducted on English spoken by Japanese speakers (the UME-ERJ dataset). The results show a significant relative improvement in speech recognition accuracy. Our methodology reduces the need to train new models for non-native speech (thus overcoming the data-scarcity obstacle) and can be used as a wrapper around any existing ASR system: the modification can be performed in real time, before a sample is passed to the speech recognizer itself.
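The wrapper idea can be made concrete with a minimal sketch: the non-native utterance is converted to a spectrogram, mapped by an accent-modification network, resynthesized with the original phase, and only then handed to an unmodified ASR system. The paper's actual network architecture and trained weights are not reproduced here; the illustrative convolutional autoencoder, file names, sampling rate, and the `recognize` callable below are assumptions.

```python
# Minimal sketch of the accent-modification wrapper pipeline (assumptions noted above).
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras import layers, models

SR = 16000      # assumed sampling rate for the recordings
N_FFT = 512
HOP = 128

def build_accent_autoencoder(n_bins: int) -> tf.keras.Model:
    """Illustrative (untrained) spectrogram-to-spectrogram autoencoder."""
    inp = layers.Input(shape=(None, n_bins, 1))            # (time, freq, channel)
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(x)
    out = layers.Conv2D(1, 3, padding="same")(x)            # predicted "native-like" log-magnitude
    return models.Model(inp, out)

def modify_accent(wav_path: str, model: tf.keras.Model) -> np.ndarray:
    """Map the magnitude spectrogram with the model, keep the original phase."""
    y, _ = librosa.load(wav_path, sr=SR)
    spec = librosa.stft(y, n_fft=N_FFT, hop_length=HOP)
    mag, phase = np.abs(spec), np.angle(spec)
    log_mag = np.log1p(mag).T[np.newaxis, ..., np.newaxis]  # (1, time, freq, 1)
    log_mag_mod = model.predict(log_mag, verbose=0)[0, ..., 0].T
    mag_mod = np.expm1(np.maximum(log_mag_mod, 0.0))
    return librosa.istft(mag_mod * np.exp(1j * phase), hop_length=HOP)

if __name__ == "__main__":
    model = build_accent_autoencoder(n_bins=N_FFT // 2 + 1)
    # model.load_weights("accent_ae.h5")   # hypothetical weights trained on non-native/native pairs
    modified = modify_accent("japanese_english_sample.wav", model)  # hypothetical input file
    # `recognize` stands in for any off-the-shelf ASR system; because the
    # modification happens before recognition, the ASR model needs no retraining.
    # text = recognize(modified, sample_rate=SR)
```

Because the transformation operates frame by frame on short spectrogram windows, it can in principle run in a streaming fashion, which is what allows the modification to be applied in real time before recognition.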
Index Terms
- Support software for Automatic Speech Recognition systems targeted for non-native speech