
Lipreading with LipsID

  • Conference paper

Speech and Computer (SPECOM 2020)

Abstract

This paper presents an approach for adapting current visual speech recognition systems. The adaptation technique is based on LipsID features, which represent a processed lip region (ROI) of the speaker's face. The features are extracted, in a classification task, by a neural network pre-trained on the dataset specific to the lip-reading system used for visual speech recognition. The LipsID training procedure employs the ArcFace loss to separate individual speakers in the dataset and thus provide distinctive features for each of them. The network uses convolutional layers to extract features from input sequences of speaker images and is designed to take the same input as the lipreading system. The input sequence is processed in parallel by the LipsID network and the lipreading network; the two feature sets are then combined and decoded by a Connectionist Temporal Classification (CTC) mechanism. This paper presents results from experiments with the LipNet network, re-implementing the system and comparing it with and without LipsID features. The results show a promising path for future experiments and other systems. The neural networks in this work were trained and tested with TensorFlow/Keras [4].
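The parallel-branch fusion described in the abstract can be sketched in Keras. This is a minimal illustration under assumptions, not the authors' implementation: the input dimensions, layer sizes, and vocabulary size are invented for the example, and the LipsID branch is shown untrained, whereas in the paper it is pre-trained as a speaker classifier with ArcFace loss before being combined with the lipreading network.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Assumed dimensions (illustrative, not from the paper):
T, H, W, C = 75, 50, 100, 3   # frames, ROI height, ROI width, channels
VOCAB = 28                     # output symbols incl. the CTC blank

# Both branches consume the same mouth-ROI image sequence.
inputs = layers.Input(shape=(T, H, W, C))

# LipsID branch: 3D convolutions over the ROI sequence. In the paper this
# branch is pre-trained on speaker classification with ArcFace loss and
# then used as a per-frame feature extractor.
x = layers.Conv3D(32, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)
x = layers.TimeDistributed(layers.Flatten())(x)
lipsid = layers.TimeDistributed(layers.Dense(64, activation="relu"))(x)

# Lipreading branch (LipNet-style): spatiotemporal convolutions followed
# by a recurrent layer over the frame sequence.
y = layers.Conv3D(64, 3, padding="same", activation="relu")(inputs)
y = layers.MaxPooling3D(pool_size=(1, 2, 2))(y)
y = layers.TimeDistributed(layers.Flatten())(y)
y = layers.Bidirectional(layers.GRU(128, return_sequences=True))(y)

# Combine both per-frame feature sets, then produce per-frame character
# posteriors; training would minimize a CTC loss over these outputs,
# e.g. via tf.keras.backend.ctc_batch_cost.
z = layers.Concatenate()([y, lipsid])
z = layers.Bidirectional(layers.GRU(128, return_sequences=True))(z)
logits = layers.Dense(VOCAB, activation="softmax")(z)

model = Model(inputs, logits)
model.summary()
```

Concatenating the LipsID features with the lipreading features before the final recurrent layer mirrors the combination step named in the abstract; where exactly the fusion happens in the authors' network is detailed in the paper itself.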


References

  1. Afouras, T., Chung, J.S., Zisserman, A.: LRS3-TED: a large-scale dataset for visual speech recognition (2018). arXiv preprint arXiv:1809.00496

  2. Afouras, T., Chung, J.S., Senior, A., Vinyals, O., Zisserman, A.: Deep audio-visual speech recognition. IEEE Trans. Pattern Anal. Mach. Intell. (2018)


  3. Assael, Y.M., Shillingford, B., Whiteson, S., De Freitas, N.: LipNet: end-to-end sentence-level lipreading (2016). arXiv preprint arXiv:1611.01599

  4. Chollet, F.: Keras. GitHub repository (2015)


  5. Chung, J.S., Senior, A., Vinyals, O., Zisserman, A.: Lip reading sentences in the wild. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3444–3453. IEEE (2017)


  6. Cooke, M., Barker, J., Cunningham, S., Shao, X.: An audio-visual corpus for speech perception and automatic speech recognition. J. Acoust. Soc. Am. 120(5), 2421–2424 (2006)


  7. Cootes, T.F., Taylor, C.J.: Statistical models of appearance for computer vision (2004)


  8. Deng, J., Guo, J., Xue, N., Zafeiriou, S.: ArcFace: additive angular margin loss for deep face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4690–4699 (2019)


  9. Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine learning, pp. 369–376 (2006)


  10. Hlaváč, M., Gruber, I., Železný, M., Karpov, A.: LipsID using 3D convolutional neural networks. In: Karpov, A., Jokisch, O., Potapova, R. (eds.) SPECOM 2018. LNCS (LNAI), vol. 11096, pp. 209–214. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99579-3_22


  11. Karafiát, M., Burget, L., Matějka, P., Glembek, O., Černocký, J.: iVector-based discriminative adaptation for automatic speech recognition. In: 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, Waikoloa, HI, pp. 152–157 (2011). https://doi.org/10.1109/ASRU.2011.6163922

  12. Sterpu, G., Saam, C., Harte, N.: How to teach DNNs to pay attention to the visual modality in speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. (2020). https://doi.org/10.1109/TASLP.2020.2980436

  13. Hlaváč, M.: Automated lipreading with LipsID features. PhD thesis, University of West Bohemia (2019)


  14. Sterpu, G., Saam, C., Harte, N.: Attention-based audio-visual fusion for robust automatic speech recognition. In: 2018 International Conference on Multimodal Interaction (ICMI 2018). https://doi.org/10.1145/3242969.3243014

  15. Harte, N., Gillen, E.: TCD-TIMIT: an audio-visual corpus of continuous speech. IEEE Trans. Multimedia 17(5), 603–615 (2015). https://doi.org/10.1109/TMM.2015.2407694


Acknowledgments

This work was supported by the Ministry of Education of the Czech Republic, project No. LTARF18017, and the Ministry of Science and Higher Education of the Russian Federation, agreement No. 14.616.21.0095 (reference RFMEFI61618X0095). Moreover, access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated.

Author information

Correspondence to Miroslav Hlaváč.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Hlaváč, M., Gruber, I., Železný, M., Karpov, A. (2020). Lipreading with LipsID. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2020. Lecture Notes in Computer Science, vol. 12335. Springer, Cham. https://doi.org/10.1007/978-3-030-60276-5_18


  • DOI: https://doi.org/10.1007/978-3-030-60276-5_18


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-60275-8

  • Online ISBN: 978-3-030-60276-5

  • eBook Packages: Computer Science, Computer Science (R0)
