Abstract
Lip-reading technology has many potential applications and is an important research topic. In recent years, sentence-level lip-reading has attracted growing attention. However, most published datasets consist of English-speaking scenes, and datasets for other languages are scarce. We therefore study Japanese sentence-level lip-reading. In this paper, we construct ITA and ROHAN4600, two datasets of Japanese sentence-utterance scenes, and propose a Conformer-based lip-reading method. Recognition experiments were conducted using a Transformer model as a conventional baseline. The results confirm that the Conformer model achieves high recognition accuracy at both the phoneme and mora levels.
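For readers unfamiliar with the architecture, the sketch below shows one Conformer encoder block in the style of Gulati et al.: two half-step feed-forward modules sandwiching a self-attention module and a depthwise-convolution module. This is a minimal PyTorch illustration of the general block structure only, not the model described in the paper; the visual front-end, decoder, relative positional encoding, and all hyperparameter values shown here are illustrative assumptions.

```python
# Minimal sketch of one Conformer encoder block (after Gulati et al., 2020).
# All sizes are illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, conv_kernel=31, ff_mult=4, dropout=0.1):
        super().__init__()
        # First half-step ("macaron") feed-forward module.
        self.ff1 = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, ff_mult * d_model), nn.SiLU(), nn.Dropout(dropout),
            nn.Linear(ff_mult * d_model, d_model), nn.Dropout(dropout),
        )
        # Multi-head self-attention module.
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        # Convolution module: pointwise -> GLU -> depthwise -> pointwise.
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, 2 * d_model, kernel_size=1),
            nn.GLU(dim=1),
            nn.Conv1d(d_model, d_model, kernel_size=conv_kernel,
                      padding=conv_kernel // 2, groups=d_model),  # depthwise
            nn.BatchNorm1d(d_model), nn.SiLU(),
            nn.Conv1d(d_model, d_model, kernel_size=1),
            nn.Dropout(dropout),
        )
        # Second half-step feed-forward module and final normalization.
        self.ff2 = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, ff_mult * d_model), nn.SiLU(), nn.Dropout(dropout),
            nn.Linear(ff_mult * d_model, d_model), nn.Dropout(dropout),
        )
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (batch, time, d_model)
        x = x + 0.5 * self.ff1(x)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        c = self.conv_norm(x).transpose(1, 2)      # (batch, d_model, time)
        x = x + self.conv(c).transpose(1, 2)
        x = x + 0.5 * self.ff2(x)
        return self.final_norm(x)

# Example: a batch of 2 sequences of 75 per-frame lip features, 256-dim each.
feats = torch.randn(2, 75, 256)
print(ConformerBlock()(feats).shape)  # torch.Size([2, 75, 256])
```

Compared with a plain Transformer encoder layer, the added depthwise-convolution module models local temporal patterns in the mouth region, which is the usual motivation for preferring the Conformer in speech and lip-reading tasks.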
References
Afouras, T., Chung, J.S., Zisserman, A.: Deep lip reading: a comparison of models and an online application. In: Interspeech (2018)
Afouras, T., Chung, J.S., Zisserman, A.: LRS3-TED: a large-scale dataset for visual speech recognition. arXiv:1809.00496 (2018). https://doi.org/10.48550/arXiv.1809.00496
Alghamdi, N., Maddock, S., Marxer, R., Barker, J., Brown, G.J.: A corpus of audio-visual Lombard speech with frontal and profile views. J. Acoust. Soc. Am. 143(6), EL523–EL529 (2018)
Assael, Y.M., Shillingford, B., Whiteson, S., de Freitas, N.: LipNet: end-to-end sentence-level lipreading. arXiv:1611.01599 (2016). https://doi.org/10.48550/arXiv.1611.01599
Chung, J.S., Senior, A., Vinyals, O., Zisserman, A.: Lip reading sentences in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6447–6456 (2017). https://doi.org/10.1109/CVPR.2017.367
Chung, J.S., Zisserman, A.: Lip reading in the wild. In: Asian Conference on Computer Vision (ACCV) (2016)
Cooke, M., Barker, J., Cunningham, S., Shao, X.: An audio-visual corpus for speech perception and automatic speech recognition. J. Acoust. Soc. Am. 120(5), 2421–2424 (2006). https://doi.org/10.1121/1.2229005
Gulati, A., et al.: Conformer: convolution-augmented transformer for speech recognition. In: Interspeech (2020)
Kodama, M., Saitoh, T.: Replacing speaker-independent recognition task with speaker-dependent task for lip-reading using first order motion model. In: 13th International Conference on Graphics and Image Processing (ICGIP) (2021). https://doi.org/10.1117/12.2623640
Nakamura, Y., Saitoh, T., Itoh, K.: 3DCNN-based mouth shape recognition for patient with intractable neurological diseases. In: 13th International Conference on Graphics and Image Processing (ICGIP) (2021). https://doi.org/10.1117/12.2623642
Noda, K., Yamaguchi, Y., Nakadai, K., Okuno, H.G., Ogata, T.: Lipreading using convolutional neural network. In: Interspeech, pp. 1149–1153 (2014)
Patterson, E.K., Gurbuz, S., Tufekci, Z., Gowdy, J.N.: Moving-talker, speaker-independent feature study, and baseline results using the CUAVE multimodal speech corpus. EURASIP J. Adv. Signal Process. 2002(11), 1–13 (2002). https://doi.org/10.1155/S1110865702206101
Saitoh, T., Kubokawa, M.: SSSD: speech scene database by smart device for visual speech recognition. In: 24th International Conference on Pattern Recognition (ICPR), pp. 3228–3232 (2018). https://doi.org/10.1109/ICPR.2018.8545664
Shirakata, T., Saitoh, T.: Japanese sentence dataset for lip-reading. In: IAPR Conference on Machine Vision Applications (MVA) (2021). https://doi.org/10.23919/MVA51890.2021.9511353
Tamura, S., et al.: CENSREC-1-AV: an audio-visual corpus for noisy bimodal speech recognition. In: International Conference on Auditory-Visual Speech Processing (AVSP) (2010)
Vaswani, A., et al.: Attention is all you need. arXiv:1706.03762 (2017). https://doi.org/10.48550/arXiv.1706.03762
Zhang, X., Cheng, F., Wang, S.: Spatio-temporal fusion based convolutional sequence learning for lip reading. In: International Conference on Computer Vision (ICCV), pp. 713–722 (2019). https://doi.org/10.1109/ICCV.2019.00080
Acknowledgments
This work was supported by JSPS KAKENHI Grant Number 19KT0029.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Arakane, T., Saitoh, T., Chiba, R., Morise, M., Oda, Y. (2023). Conformer-Based Lip-Reading for Japanese Sentence. In: Yan, W.Q., Nguyen, M., Stommel, M. (eds) Image and Vision Computing. IVCNZ 2022. Lecture Notes in Computer Science, vol 13836. Springer, Cham. https://doi.org/10.1007/978-3-031-25825-1_34
DOI: https://doi.org/10.1007/978-3-031-25825-1_34
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-25824-4
Online ISBN: 978-3-031-25825-1
eBook Packages: Computer Science, Computer Science (R0)