Abstract
Lip-reading technology has many potential applications and is an important research topic. In recent years, sentence-level lip-reading has attracted growing attention. However, most published datasets consist of English-speaking scenes, and datasets for other languages are scarce. We therefore study Japanese sentence-level lip-reading. In this paper, we construct ITA and ROHAN4600, two datasets of Japanese sentence-utterance scenes, and propose a Conformer-based lip-reading method. Recognition experiments were conducted using a Transformer model as a conventional baseline. The results confirm that the Conformer model achieves high recognition accuracy at both the phoneme and mora levels.
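For readers unfamiliar with the architecture, the sketch below shows one Conformer encoder block in the style of Gulati et al.: two half-step feed-forward modules sandwiching a self-attention module and a depthwise-convolution module. This is a minimal PyTorch illustration of the general block structure only, not the model described in the paper; the visual front-end, decoder, relative positional encoding, and all hyperparameter values shown here are illustrative assumptions.

```python
# Minimal sketch of one Conformer encoder block (after Gulati et al., 2020).
# All sizes are illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, conv_kernel=31, ff_mult=4, dropout=0.1):
        super().__init__()
        # First half-step ("macaron") feed-forward module.
        self.ff1 = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, ff_mult * d_model), nn.SiLU(), nn.Dropout(dropout),
            nn.Linear(ff_mult * d_model, d_model), nn.Dropout(dropout),
        )
        # Multi-head self-attention module.
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        # Convolution module: pointwise -> GLU -> depthwise -> pointwise.
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, 2 * d_model, kernel_size=1),
            nn.GLU(dim=1),
            nn.Conv1d(d_model, d_model, kernel_size=conv_kernel,
                      padding=conv_kernel // 2, groups=d_model),  # depthwise
            nn.BatchNorm1d(d_model), nn.SiLU(),
            nn.Conv1d(d_model, d_model, kernel_size=1),
            nn.Dropout(dropout),
        )
        # Second half-step feed-forward module and final normalization.
        self.ff2 = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, ff_mult * d_model), nn.SiLU(), nn.Dropout(dropout),
            nn.Linear(ff_mult * d_model, d_model), nn.Dropout(dropout),
        )
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (batch, time, d_model)
        x = x + 0.5 * self.ff1(x)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        c = self.conv_norm(x).transpose(1, 2)      # (batch, d_model, time)
        x = x + self.conv(c).transpose(1, 2)
        x = x + 0.5 * self.ff2(x)
        return self.final_norm(x)

# Example: a batch of 2 sequences of 75 per-frame lip features, 256-dim each.
feats = torch.randn(2, 75, 256)
print(ConformerBlock()(feats).shape)  # torch.Size([2, 75, 256])
```

Compared with a plain Transformer encoder layer, the added depthwise-convolution module models local temporal patterns in the mouth region, which is the usual motivation for preferring the Conformer in speech and lip-reading tasks.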
References
Afouras, T., Chung, J.S., Zisserman, A.: Deep lip reading: a comparison of models and an online application. In: Interspeech (2018)
Afouras, T., Chung, J.S., Zisserman, A.: LRS3-TED: a large-scale dataset for visual speech recognition. arXiv:1809.00496 (2018). https://doi.org/10.48550/arXiv.1809.00496
Alghamdi, N., Maddock, S., Marxer, R., Barker, J., Brown, G.J.: A corpus of audio-visual Lombard speech with frontal and profile views. J. Acoust. Soc. Am. 143(6), EL523–EL529 (2018)
Assael, Y.M., Shillingford, B., Whiteson, S., de Freitas, N.: LipNet: end-to-end sentence-level lipreading. arXiv:1611.01599 (2016). https://doi.org/10.48550/arXiv.1611.01599
Chung, J.S., Senior, A., Vinyals, O., Zisserman, A.: Lip reading sentences in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6447–6456 (2017). https://doi.org/10.1109/CVPR.2017.367
Chung, J.S., Zisserman, A.: Lip reading in the wild. In: Asian Conference on Computer Vision (ACCV) (2016)
Cooke, M., Barker, J., Cunningham, S., Shao, X.: An audio-visual corpus for speech perception and automatic speech recognition. J. Acoust. Soc. Am. 120(5), 2421–2424 (2006). https://doi.org/10.1121/1.2229005
Gulati, A., et al.: Conformer: convolution-augmented transformer for speech recognition. In: Interspeech (2020)
Kodama, M., Saitoh, T.: Replacing speaker-independent recognition task with speaker-dependent task for lip-reading using first order motion model. In: 13th International Conference on Graphics and Image Processing (ICGIP) (2021). https://doi.org/10.1117/12.2623640
Nakamura, Y., Saitoh, T., Itoh, K.: 3DCNN-based mouth shape recognition for patient with intractable neurological diseases. In: 13th International Conference on Graphics and Image Processing (ICGIP) (2021). https://doi.org/10.1117/12.2623642
Noda, K., Yamaguchi, Y., Nakadai, K., Okuno, H.G., Ogata, T.: Lipreading using convolutional neural network. In: Interspeech, pp. 1149–1153 (2014)
Patterson, E.K., Gurbuz, S., Tufekci, Z., Gowdy, J.N.: Moving-talker, speaker-independent feature study, and baseline results using the CUAVE multimodal speech corpus. EURASIP J. Adv. Signal Process. 2002(11), 1–13 (2002). https://doi.org/10.1155/S1110865702206101
Saitoh, T., Kubokawa, M.: SSSD: speech scene database by smart device for visual speech recognition. In: 24th International Conference on Pattern Recognition (ICPR), pp. 3228–3232 (2018). https://doi.org/10.1109/ICPR.2018.8545664
Shirakata, T., Saitoh, T.: Japanese sentence dataset for lip-reading. In: IAPR Conference on Machine Vision Applications (MVA) (2021). https://doi.org/10.23919/MVA51890.2021.9511353
Tamura, S., et al.: CENSREC-1-AV: an audio-visual corpus for noisy bimodal speech recognition. In: International Conference on Auditory-Visual Speech Processing (AVSP) (2010)
Vaswani, A., et al.: Attention is all you need. arXiv:1706.03762 (2017). https://doi.org/10.48550/arXiv.1706.03762
Zhang, X., Cheng, F., Wang, S.: Spatio-temporal fusion based convolutional sequence learning for lip reading. In: International Conference on Computer Vision (ICCV), pp. 713–722 (2019). https://doi.org/10.1109/ICCV.2019.00080
Acknowledgments
This work was supported by JSPS KAKENHI Grant Number 19KT0029.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Arakane, T., Saitoh, T., Chiba, R., Morise, M., Oda, Y. (2023). Conformer-Based Lip-Reading for Japanese Sentence. In: Yan, W.Q., Nguyen, M., Stommel, M. (eds) Image and Vision Computing. IVCNZ 2022. Lecture Notes in Computer Science, vol 13836. Springer, Cham. https://doi.org/10.1007/978-3-031-25825-1_34
DOI: https://doi.org/10.1007/978-3-031-25825-1_34
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-25824-4
Online ISBN: 978-3-031-25825-1
eBook Packages: Computer Science, Computer Science (R0)