Conformer-Based Lip-Reading for Japanese Sentence

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13836)

Abstract

Lip-reading technology has many potential applications and is an important research topic. In recent years, sentence-level lip-reading research has attracted attention. However, most published datasets consist of English-speaking scenes, and datasets for other languages are scarce. We therefore study Japanese sentence-level lip-reading. In this paper, we construct two Japanese sentence utterance scene datasets, ITA and ROHAN4600, and propose a Conformer-based lip-reading method. Recognition experiments were conducted using a Transformer model as the conventional baseline. The results confirmed that the Conformer model achieved high recognition accuracy at both the phoneme and mora levels.
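For context, the Conformer (Gulati et al., Interspeech 2020) augments a Transformer encoder block with a depthwise-convolution module and macaron-style half-step feed-forward layers, letting it model both local mouth-shape dynamics and long-range context. The sketch below is a minimal PyTorch rendering of one such encoder block; it is an illustration only, not the authors' implementation. All hyperparameters (256-d features, 4 heads, kernel size 31) and the use of standard rather than relative-position self-attention are simplifying assumptions.

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """One Conformer encoder block (after Gulati et al., 2020):
    two half-step feed-forward modules sandwiching self-attention
    and a convolution module. Hyperparameters are illustrative."""

    def __init__(self, d_model=256, n_heads=4, conv_kernel=31, ff_mult=4, dropout=0.1):
        super().__init__()
        # Macaron-style half-step feed-forward modules
        self.ff1 = self._feed_forward(d_model, ff_mult, dropout)
        self.ff2 = self._feed_forward(d_model, ff_mult, dropout)
        # Multi-head self-attention over the frame sequence
        # (standard attention here; the paper's Conformer uses relative positions)
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        # Convolution module: pointwise -> GLU -> depthwise -> norm -> SiLU -> pointwise
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, 2 * d_model, kernel_size=1),
            nn.GLU(dim=1),
            nn.Conv1d(d_model, d_model, kernel_size=conv_kernel,
                      padding=conv_kernel // 2, groups=d_model),
            nn.BatchNorm1d(d_model),
            nn.SiLU(),
            nn.Conv1d(d_model, d_model, kernel_size=1),
            nn.Dropout(dropout),
        )
        self.final_norm = nn.LayerNorm(d_model)

    @staticmethod
    def _feed_forward(d_model, mult, dropout):
        return nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, mult * d_model),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(mult * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):  # x: (batch, frames, d_model) visual features
        x = x + 0.5 * self.ff1(x)                   # first half-step FFN
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.conv_norm(x).transpose(1, 2)       # (batch, d_model, frames)
        x = x + self.conv(h).transpose(1, 2)        # convolution module
        x = x + 0.5 * self.ff2(x)                   # second half-step FFN
        return self.final_norm(x)

# Example: encode a batch of 2 clips, 75 frames of 256-d lip features each
features = torch.randn(2, 75, 256)
encoded = ConformerBlock()(features)
print(encoded.shape)  # torch.Size([2, 75, 256])
```

In a lip-reading pipeline like the one the abstract describes, the per-frame features fed to such a block would typically come from a visual front-end over mouth-region crops, and a decoder would map the encoded sequence to phoneme or mora labels; those surrounding components are not shown here.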


Notes

  1. https://spandh.dcs.shef.ac.uk/gridcorpus/.
  2. https://spandh.dcs.shef.ac.uk//avlombard/.
  3. https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html.
  4. https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html.
  5. https://github.com/mmorise/rohan4600.
  6. https://www.saitoh-lab.com/SSSD/index_ja.html.
  7. https://zunko.jp/multimodal_dev/login.php.


Acknowledgments

This work was supported by JSPS KAKENHI Grant Number 19KT0029.

Author information


Corresponding author

Correspondence to Takeshi Saitoh.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Arakane, T., Saitoh, T., Chiba, R., Morise, M., Oda, Y. (2023). Conformer-Based Lip-Reading for Japanese Sentence. In: Yan, W.Q., Nguyen, M., Stommel, M. (eds) Image and Vision Computing. IVCNZ 2022. Lecture Notes in Computer Science, vol 13836. Springer, Cham. https://doi.org/10.1007/978-3-031-25825-1_34


  • DOI: https://doi.org/10.1007/978-3-031-25825-1_34


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-25824-4

  • Online ISBN: 978-3-031-25825-1

  • eBook Packages: Computer Science, Computer Science (R0)
