Abstract
Taiwan-accented speech shares similarities with the Min dialect of Mandarin but differs substantially in vocabulary, which significantly affects spoken language recognition results. This paper focuses on integrating pre-trained language models (PLMs) with state-of-the-art self-supervised learning (SSL)-based speech recognition systems for Taiwan-accented speech recognition. We propose a progressive error correction process that runs in tandem with recognition to fully exploit the autoregressive nature of PLMs. Experimental results demonstrate that our method effectively addresses recognition errors stemming from misspelled vocabulary in accented speech, and the proposed progressive approach achieves roughly a 0.5% improvement over the conventional method. Furthermore, we show that fine-tuning PLMs solely on the text of the accented dataset can enhance recognition performance despite the scarcity of accented speech resources.
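The progressive correction idea can be illustrated with a minimal sketch: an autoregressive PLM rescores candidate characters position by position, conditioning each decision on the already-corrected prefix rather than only on the raw hypothesis. The sketch below assumes a Hugging Face causal language model (the public uer/gpt2-chinese-cluecorpussmall checkpoint is used purely as an example) and per-position candidate characters taken from the recognizer's n-best output; the function progressive_correct and the candidate format are illustrative, not the authors' implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Any Chinese causal LM works for this sketch; this checkpoint is only an example.
MODEL_NAME = "uer/gpt2-chinese-cluecorpussmall"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def score(text):
    """Total log-likelihood of a character string under the causal LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood over the predicted tokens.
    return -out.loss.item() * (ids.size(1) - 1)

def progressive_correct(hypothesis, candidates):
    """Correct an ASR character hypothesis from left to right.

    hypothesis: 1-best character list from the SSL-based recognizer.
    candidates: one candidate-character list per position (e.g. from the
                n-best hypotheses or CTC posteriors); format is illustrative.
    """
    corrected = []
    for pos, cands in enumerate(candidates):
        prefix = "".join(corrected)
        # Keep the recognizer's own character in the candidate pool.
        pool = set(cands) | {hypothesis[pos]}
        best = max(pool, key=lambda ch: score(prefix + ch))
        corrected.append(best)
    return "".join(corrected)

# Toy usage: the second position has two acoustically confusable candidates.
print(progressive_correct(list("台湾"), [["台"], ["弯", "湾"]]))
```

Because each decision conditions on the prefix that has already been corrected, earlier fixes propagate forward, which is the property of autoregressive PLMs that the progressive scheme is intended to exploit.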
Acknowledgements
This work was partially supported by JSPS KAKENHI Grant Numbers 23K11227 and 23H03402, and by NICT tenure-track startup funding.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, S., Li, J. (2023). Correction while Recognition: Combining Pretrained Language Model for Taiwan-Accented Speech Recognition. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14260. Springer, Cham. https://doi.org/10.1007/978-3-031-44195-0_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44194-3
Online ISBN: 978-3-031-44195-0