Abstract
Taiwan-accented speech shares similarities with the Min dialect of Mandarin but differs substantially in vocabulary, which significantly affects spoken language recognition results. This paper focuses on integrating pre-trained language models (PLMs) with state-of-the-art self-supervised learning (SSL)-based speech recognition systems for Taiwan-accented speech recognition. We propose a progressive error correction process that runs in tandem with recognition to fully exploit the autoregressive nature of PLMs. Experimental results demonstrate that our method effectively addresses recognition errors stemming from misspelled vocabulary in accented speech, and the proposed progressive approach achieves roughly a 0.5% improvement over the conventional method. Furthermore, we show that fine-tuning PLMs solely on the text of the accented dataset can enhance recognition performance despite the scarcity of accented speech resources.
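The progressive correction idea can be illustrated with a minimal sketch: an autoregressive PLM rescores candidate characters position by position, conditioning each decision on the already-corrected prefix rather than only on the raw hypothesis. The sketch below assumes a Hugging Face causal language model (the public uer/gpt2-chinese-cluecorpussmall checkpoint is used purely as an example) and per-position candidate characters taken from the recognizer's n-best output; the function progressive_correct and the candidate format are illustrative, not the authors' implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Any Chinese causal LM works for this sketch; this checkpoint is only an example.
MODEL_NAME = "uer/gpt2-chinese-cluecorpussmall"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def score(text):
    """Total log-likelihood of a character string under the causal LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood over the predicted tokens.
    return -out.loss.item() * (ids.size(1) - 1)

def progressive_correct(hypothesis, candidates):
    """Correct an ASR character hypothesis from left to right.

    hypothesis: 1-best character list from the SSL-based recognizer.
    candidates: one candidate-character list per position (e.g. from the
                n-best hypotheses or CTC posteriors); format is illustrative.
    """
    corrected = []
    for pos, cands in enumerate(candidates):
        prefix = "".join(corrected)
        # Keep the recognizer's own character in the candidate pool.
        pool = set(cands) | {hypothesis[pos]}
        best = max(pool, key=lambda ch: score(prefix + ch))
        corrected.append(best)
    return "".join(corrected)

# Toy usage: the second position has two acoustically confusable candidates.
print(progressive_correct(list("台湾"), [["台"], ["弯", "湾"]]))
```

Because each decision conditions on the prefix that has already been corrected, earlier fixes propagate forward, which is the property of autoregressive PLMs that the progressive scheme is intended to exploit.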
Acknowledgements
This work was partially supported by JSPS KAKENHI Grant Numbers 23K11227 and 23H03402, and by NICT tenure-track startup funding.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, S., Li, J. (2023). Correction while Recognition: Combining Pretrained Language Model for Taiwan-Accented Speech Recognition. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14260. Springer, Cham. https://doi.org/10.1007/978-3-031-44195-0_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44194-3
Online ISBN: 978-3-031-44195-0