Abstract
Accent variation is a challenging issue for both traditional hybrid and current end-to-end (E2E) automatic speech recognition (ASR) systems. Building an accent-invariant, high-quality ASR system is important for many real-world applications. In this study, we propose a Conformer-based architecture with accent-discriminative encoders to leverage the accent attributes of input speech for enhancing an accent-invariant E2E ASR system. In this architecture, the encoder bank is composed of one universal encoder and two dominant-accent-specific encoders. These encoders are first pre-trained and then jointly adapted with a single attention-based decoder in an end-to-end manner. Furthermore, different weighting methods and a multi-encoder-decoder architecture are also investigated and compared. Our experiments are performed on the public Common Voice corpus with five different English accents; results show that the proposed architecture outperforms a strong baseline on both in-domain and out-of-domain accented ASR tasks, with a relative word error rate reduction of 2.9–3.8%.
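The abstract's core idea, combining one universal encoder with several accent-specific encoders behind a single decoder, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the module names, dimensions, and the use of simple linear layers as stand-ins for Conformer blocks are all assumptions made for illustration, and the learned-softmax weighting is just one of the weighting methods such a system might use.

```python
import torch
import torch.nn as nn

class MultiEncoderASRFront(nn.Module):
    """Sketch of a universal + accent-specific encoder bank.

    Linear layers stand in for Conformer encoder stacks; a learned
    softmax over three scalar logits mixes the encoder outputs into
    one representation for a shared attention decoder (omitted here).
    """

    def __init__(self, feat_dim=80, d_model=256):
        super().__init__()
        self.universal = nn.Linear(feat_dim, d_model)   # universal encoder
        self.accent_a = nn.Linear(feat_dim, d_model)    # accent-specific encoder 1
        self.accent_b = nn.Linear(feat_dim, d_model)    # accent-specific encoder 2
        # One learnable mixing logit per encoder stream.
        self.mix_logits = nn.Parameter(torch.zeros(3))

    def forward(self, x):
        # x: (batch, time, feat_dim) acoustic features
        streams = torch.stack(
            [self.universal(x), self.accent_a(x), self.accent_b(x)], dim=0
        )  # (3, batch, time, d_model)
        w = torch.softmax(self.mix_logits, dim=0)  # weights sum to 1
        # Weighted sum over the three encoder streams.
        return (w.view(3, 1, 1, 1) * streams).sum(dim=0)

model = MultiEncoderASRFront()
out = model(torch.randn(2, 50, 80))
print(out.shape)  # torch.Size([2, 50, 256])
```

In a full system the fused representation would feed the single attention-based decoder described in the abstract, and the whole stack would be pre-trained per encoder and then jointly adapted end-to-end.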
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Grant No. 62071302).
Cite this article
Wang, X., Long, Y. & Xu, D. Universal and accent-discriminative encoders for conformer-based accent-invariant speech recognition. Int J Speech Technol 25, 987–995 (2022). https://doi.org/10.1007/s10772-022-10010-z