
Universal and accent-discriminative encoders for conformer-based accent-invariant speech recognition

International Journal of Speech Technology

Abstract

Accent variation is a challenging issue for both traditional hybrid and current end-to-end (E2E) automatic speech recognition (ASR) systems. Building an accent-invariant, high-quality ASR system is important for many real-world applications. In this study, we propose a Conformer-based architecture with accent-discriminative encoders that leverages the accent attributes of input speech to enhance an accent-invariant E2E ASR system. In this architecture, the encoders comprise one universal encoder and two dominant-accent-specific encoders. These encoders are first pre-trained and then jointly adapted with a single attention-based decoder in an end-to-end manner. Furthermore, different weighting methods and a multi-encoder-decoder architecture are also investigated and compared. Our experiments are performed on the public Common Voice corpus with five different English accents; the results show that the proposed architecture outperforms a strong baseline on both in-domain and out-of-domain accented-ASR tasks, with a relative word error rate reduction of 2.9–3.8%.
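The weighted multi-encoder combination described in the abstract can be pictured with a short sketch. The following is a minimal, hypothetical PyTorch illustration, not the authors' implementation: GRU layers stand in for the Conformer encoder blocks, the weighting scheme (soft weights predicted from the universal encoding) is just one plausible instantiation of the "different weighting methods" mentioned above, and all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class MultiEncoderFrontEnd(nn.Module):
    """Hypothetical fusion of one universal and several accent-specific encoders."""

    def __init__(self, feat_dim=80, d_model=256, n_accents=2):
        super().__init__()
        # GRUs stand in for the Conformer encoder blocks used in the paper.
        self.universal = nn.GRU(feat_dim, d_model, batch_first=True)
        self.accent_specific = nn.ModuleList(
            [nn.GRU(feat_dim, d_model, batch_first=True) for _ in range(n_accents)]
        )
        # Predicts one soft weight per encoder (n_accents + 1 of them in total).
        self.scorer = nn.Linear(d_model, n_accents + 1)

    def forward(self, feats):
        # feats: (batch, time, feat_dim) acoustic features.
        outs = [self.universal(feats)[0]]
        outs += [enc(feats)[0] for enc in self.accent_specific]
        stacked = torch.stack(outs, dim=1)                         # (B, n_enc, T, D)
        pooled = stacked[:, 0].mean(dim=1)                         # utterance summary, (B, D)
        weights = self.scorer(pooled).softmax(dim=-1)              # (B, n_enc)
        fused = (weights[:, :, None, None] * stacked).sum(dim=1)  # (B, T, D)
        # The fused frame sequence would then feed a single shared attention decoder.
        return fused, weights

frontend = MultiEncoderFrontEnd()
x = torch.randn(4, 120, 80)      # 4 utterances, 120 frames of 80-dim features
fused, w = frontend(x)
print(fused.shape, w.shape)      # torch.Size([4, 120, 256]) torch.Size([4, 3])
```

Replacing the predicted soft weights with fixed or one-hot weights would correspond to the harder accent-selection variants that such weighting comparisons typically include.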




Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 62071302).

Author information


Corresponding author

Correspondence to Yanhua Long.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, X., Long, Y. & Xu, D. Universal and accent-discriminative encoders for conformer-based accent-invariant speech recognition. Int J Speech Technol 25, 987–995 (2022). https://doi.org/10.1007/s10772-022-10010-z

