Abstract
Accent variation is a challenging issue for both traditional hybrid and current end-to-end (E2E) automatic speech recognition (ASR) systems. Building an accent-invariant, high-quality ASR system is important for many real-world applications. In this study, we propose a Conformer-based architecture with accent-discriminative encoders to leverage the accent attributes of input speech for enhancing an accent-invariant E2E ASR system. In this architecture, the encoder bank is composed of one universal encoder and two dominant-accent-specific encoders. These encoders are first pre-trained and then jointly adapted with a single attention-based decoder in an end-to-end manner. Furthermore, different weighting methods and a multi-encoder-decoder architecture are also investigated and compared. Our experiments are performed on the public Common Voice corpus with five different English accents; results show that the proposed architecture outperforms a strong baseline on both in-domain and out-of-domain accented ASR tasks, with a relative word error rate reduction of 2.9–3.8%.
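The abstract's core idea, combining one universal encoder with several accent-specific encoders behind a single decoder, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the module names, dimensions, and the use of simple linear layers as stand-ins for Conformer blocks are all assumptions made for illustration, and the learned-softmax weighting is just one of the weighting methods such a system might use.

```python
import torch
import torch.nn as nn

class MultiEncoderASRFront(nn.Module):
    """Sketch of a universal + accent-specific encoder bank.

    Linear layers stand in for Conformer encoder stacks; a learned
    softmax over three scalar logits mixes the encoder outputs into
    one representation for a shared attention decoder (omitted here).
    """

    def __init__(self, feat_dim=80, d_model=256):
        super().__init__()
        self.universal = nn.Linear(feat_dim, d_model)   # universal encoder
        self.accent_a = nn.Linear(feat_dim, d_model)    # accent-specific encoder 1
        self.accent_b = nn.Linear(feat_dim, d_model)    # accent-specific encoder 2
        # One learnable mixing logit per encoder stream.
        self.mix_logits = nn.Parameter(torch.zeros(3))

    def forward(self, x):
        # x: (batch, time, feat_dim) acoustic features
        streams = torch.stack(
            [self.universal(x), self.accent_a(x), self.accent_b(x)], dim=0
        )  # (3, batch, time, d_model)
        w = torch.softmax(self.mix_logits, dim=0)  # weights sum to 1
        # Weighted sum over the three encoder streams.
        return (w.view(3, 1, 1, 1) * streams).sum(dim=0)

model = MultiEncoderASRFront()
out = model(torch.randn(2, 50, 80))
print(out.shape)  # torch.Size([2, 50, 256])
```

In a full system the fused representation would feed the single attention-based decoder described in the abstract, and the whole stack would be pre-trained per encoder and then jointly adapted end-to-end.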
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Grant No. 62071302).
Cite this article
Wang, X., Long, Y. & Xu, D. Universal and accent-discriminative encoders for conformer-based accent-invariant speech recognition. Int J Speech Technol 25, 987–995 (2022). https://doi.org/10.1007/s10772-022-10010-z