
Correction while Recognition: Combining Pretrained Language Model for Taiwan-Accented Speech Recognition

  • Conference paper
  • First Online:
Artificial Neural Networks and Machine Learning – ICANN 2023 (ICANN 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14260)


Abstract

Taiwan-accented speech bears similarities to the Min dialect of Mandarin but differs substantially in vocabulary, which significantly degrades speech recognition results. This paper focuses on integrating pre-trained language models (PLMs) with state-of-the-art self-supervised learning (SSL)-based speech recognition systems for Taiwan-accented speech recognition. We propose a progressive error-correction process that runs in tandem with recognition to fully exploit the autoregressive nature of PLMs. Experimental results demonstrate that our method effectively corrects recognition errors stemming from misspelled vocabulary in accented speech. The proposed progressive approach achieves roughly a 0.5% improvement over the conventional method. Furthermore, we show that fine-tuning PLMs solely on the text of the accented dataset can improve recognition performance, despite the scarcity of accented speech resources.
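The progressive correction idea can be illustrated with a toy sketch: an autoregressive language model rescans the recognition hypothesis left to right and replaces any token whose conditional probability, given the already-corrected prefix, falls below a threshold. The hand-written bigram table below stands in for a real pretrained autoregressive PLM; the tokens, probabilities, and threshold are illustrative assumptions, not the paper's actual model or method.

```python
# Toy sketch of "correction while recognition": a tiny bigram table stands in
# for an autoregressive PLM. Tokens are corrected left to right, and each later
# token is scored against the already-corrected prefix.

BIGRAM = {
    ("<s>", "台"): 0.9,
    ("台", "灣"): 0.9, ("台", "完"): 0.05,  # 完 is a homophone error for 灣
    ("灣", "腔"): 0.8,
}

def lm_prob(prev, tok):
    """Conditional probability of `tok` given the previous token."""
    return BIGRAM.get((prev, tok), 0.01)

def best_next(prev):
    """The LM's most likely continuation after `prev`, or None if unknown."""
    cands = [(p, t) for (pv, t), p in BIGRAM.items() if pv == prev]
    return max(cands)[1] if cands else None

def progressive_correct(tokens, threshold=0.1):
    """Replace low-probability tokens one by one, left to right."""
    out, prev = [], "<s>"
    for tok in tokens:
        if lm_prob(prev, tok) < threshold:
            cand = best_next(prev)
            if cand is not None:
                tok = cand  # substitute the LM's preferred token
        out.append(tok)
        prev = tok  # later tokens condition on the corrected prefix
    return out

hyp = ["台", "完", "腔"]  # ASR hypothesis with a homophone substitution
print(progressive_correct(hyp))  # ['台', '灣', '腔']
```

The key property this sketch shares with a progressive approach is that each correction immediately updates the conditioning context, so subsequent tokens are scored against the corrected prefix rather than the raw hypothesis.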



Notes

  1. http://old-site.clsp.jhu.edu/ws04/groups/ws04casr
  2. https://commonvoice.mozilla.org
  3. https://huggingface.co/facebook/wav2vec2-large-xlsr-53
  4. https://commonvoice.mozilla.org/zh-TW
  5. https://github.com/ckiplab/ckip-transformers


Acknowledgements

This work was partially supported by JSPS KAKENHI Grant Number 23K11227 and 23H03402, and NICT tenure-track startup funding.

Author information

Corresponding author: Jiyi Li.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Li, S., Li, J. (2023). Correction while Recognition: Combining Pretrained Language Model for Taiwan-Accented Speech Recognition. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14260. Springer, Cham. https://doi.org/10.1007/978-3-031-44195-0_32

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-44195-0_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44194-3

  • Online ISBN: 978-3-031-44195-0

  • eBook Packages: Computer Science, Computer Science (R0)
