
Lipreading with LipsID

  • Conference paper

Speech and Computer (SPECOM 2020)

Abstract

This paper presents an approach for adapting current visual speech recognition systems. The adaptation technique is based on LipsID features, which represent a processed lip region (ROI) of the speaker's face. The features are extracted, in a classification task, by a neural network pre-trained on the dataset specific to the lip-reading system used for visual speech recognition. The LipsID training procedure employs the ArcFace loss to separate individual speakers in the dataset and thus provide distinctive features for each of them. The network uses convolutional layers to extract features from input sequences of speaker images and is designed to take the same input as the lipreading system. The input sequence is processed in parallel by the LipsID network and the lipreading network; the two feature sets are then combined and decoded by a Connectionist Temporal Classification (CTC) mechanism. This paper presents results from experiments with the LipNet network, re-implementing the system and comparing it with and without LipsID features. The results show a promising path for future experiments and other systems. The neural networks in this work were trained and tested with TensorFlow/Keras [4].
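The parallel-branch fusion described in the abstract can be sketched in Keras. This is a minimal illustration under assumptions, not the authors' implementation: the input dimensions, layer sizes, and vocabulary size are invented for the example, and the LipsID branch is shown untrained, whereas in the paper it is pre-trained as a speaker classifier with ArcFace loss before being combined with the lipreading network.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Assumed dimensions (illustrative, not from the paper):
T, H, W, C = 75, 50, 100, 3   # frames, ROI height, ROI width, channels
VOCAB = 28                     # output symbols incl. the CTC blank

# Both branches consume the same mouth-ROI image sequence.
inputs = layers.Input(shape=(T, H, W, C))

# LipsID branch: 3D convolutions over the ROI sequence. In the paper this
# branch is pre-trained on speaker classification with ArcFace loss and
# then used as a per-frame feature extractor.
x = layers.Conv3D(32, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)
x = layers.TimeDistributed(layers.Flatten())(x)
lipsid = layers.TimeDistributed(layers.Dense(64, activation="relu"))(x)

# Lipreading branch (LipNet-style): spatiotemporal convolutions followed
# by a recurrent layer over the frame sequence.
y = layers.Conv3D(64, 3, padding="same", activation="relu")(inputs)
y = layers.MaxPooling3D(pool_size=(1, 2, 2))(y)
y = layers.TimeDistributed(layers.Flatten())(y)
y = layers.Bidirectional(layers.GRU(128, return_sequences=True))(y)

# Combine both per-frame feature sets, then produce per-frame character
# posteriors; training would minimize a CTC loss over these outputs,
# e.g. via tf.keras.backend.ctc_batch_cost.
z = layers.Concatenate()([y, lipsid])
z = layers.Bidirectional(layers.GRU(128, return_sequences=True))(z)
logits = layers.Dense(VOCAB, activation="softmax")(z)

model = Model(inputs, logits)
model.summary()
```

Concatenating the LipsID features with the lipreading features before the final recurrent layer mirrors the combination step named in the abstract; where exactly the fusion happens in the authors' network is detailed in the paper itself.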


References

  1. Afouras, T., Chung, J.S., Zisserman, A.: LRS3-TED: a large-scale dataset for visual speech recognition (2018). arXiv preprint arXiv:1809.00496

  2. Afouras, T., Chung, J.S., Senior, A., Vinyals, O., Zisserman, A.: Deep audio-visual speech recognition. IEEE Trans. Pattern Anal. Mach. Intell. (2018)


  3. Assael, Y.M., Shillingford, B., Whiteson, S., De Freitas, N.: LipNet: end-to-end sentence-level lipreading (2016). arXiv preprint arXiv:1611.01599

  4. Chollet, F.: Keras. GitHub repository (2015)


  5. Chung, J.S., Senior, A., Vinyals, O., Zisserman, A.: Lip reading sentences in the wild. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3444–3453. IEEE (2017)


  6. Cooke, M., Barker, J., Cunningham, S., Shao, X.: An audio-visual corpus for speech perception and automatic speech recognition. J. Acoust. Soc. Am. 120(5), 2421–2424 (2006)


  7. Cootes, T.F., Taylor, C.J.: Statistical models of appearance for computer vision (2004)


  8. Deng, J., Guo, J., Xue, N., Zafeiriou, S.: ArcFace: additive angular margin loss for deep face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4690–4699 (2019)


  9. Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine learning, pp. 369–376 (2006)


  10. Hlaváč, M., Gruber, I., Železný, M., Karpov, A.: LipsID using 3D convolutional neural networks. In: Karpov, A., Jokisch, O., Potapova, R. (eds.) SPECOM 2018. LNCS (LNAI), vol. 11096, pp. 209–214. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99579-3_22


  11. Karafiát, M., Burget, L., Matějka, P., Glembek, O., Černocký, J.: iVector-based discriminative adaptation for automatic speech recognition. In: 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, Waikoloa, HI, pp. 152–157 (2011). https://doi.org/10.1109/ASRU.2011.6163922

  12. Sterpu, G., Saam, C., Harte, N.: How to teach DNNs to pay attention to the visual modality in speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. (2020). https://doi.org/10.1109/TASLP.2020.2980436

  13. Hlaváč, M.: Automated lipreading with LipsID features. PhD thesis, University of West Bohemia (2019)


  14. Sterpu, G., Saam, C., Harte, N.: Attention-based audio-visual fusion for robust automatic speech recognition. In: 2018 International Conference on Multimodal Interaction (ICMI 2018). https://doi.org/10.1145/3242969.3243014

  15. Harte, N., Gillen, E.: TCD-TIMIT: an audio-visual corpus of continuous speech. IEEE Trans. Multimedia 17(5), 603–615 (2015). https://doi.org/10.1109/TMM.2015.2407694


Acknowledgments

This work was supported by the Ministry of Education of the Czech Republic, project No. LTARF18017, and the Ministry of Science and Higher Education of the Russian Federation, agreement No. 14.616.21.0095 (reference RFMEFI61618X0095). Moreover, access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated.

Author information

Correspondence to Miroslav Hlaváč.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Hlaváč, M., Gruber, I., Železný, M., Karpov, A. (2020). Lipreading with LipsID. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2020. Lecture Notes in Computer Science, vol. 12335. Springer, Cham. https://doi.org/10.1007/978-3-030-60276-5_18


  • DOI: https://doi.org/10.1007/978-3-030-60276-5_18


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-60275-8

  • Online ISBN: 978-3-030-60276-5

  • eBook Packages: Computer Science, Computer Science (R0)
