LipsID Using 3D Convolutional Neural Networks

Hlaváč, Miroslav; Gruber, Ivan; Železný, Miloš; Karpov, Alexey

doi:10.1007/978-3-319-99579-3_22

Conference paper
First Online: 25 August 2018

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11096)

Abstract

This paper proposes a method, inspired by iVectors, for improving visual speech recognition in the same way that iVectors are used to improve the recognition rate of audio speech recognition. A neural network for feature extraction is presented, together with its training parameters and evaluation. The network is trained as a classifier on a closed set of 64 speakers from the UWB-HSCAVC dataset; the final fully connected softmax layer is then removed to obtain a feature vector of size 256. The network takes sequences of 15 frames as input and outputs a softmax classification over 64 classes. The training data consist of approximately 20000 sequences of grayscale images from the first 50 sentences, which are common to all speakers. The network is then evaluated on 60000 sequences created from 150 sentences per speaker; these test sentences differ from speaker to speaker.
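The pipeline described in the abstract (fixed-length frame sequences, a 64-way softmax classifier, and a 256-dimensional feature vector obtained by dropping the softmax layer) can be illustrated with a minimal pure-Python sketch. Only the three constants come from the abstract. The window stride, the names `make_sequences`, `placeholder_backbone` and `SpeakerNet`, and the random untrained weights are assumptions made for illustration. In particular, `placeholder_backbone` is a stand-in that merely returns 256 numbers; it does not implement the 3D convolutional layers of the paper.

```python
import math
import random

SEQ_LEN = 15        # frames per input sequence (from the abstract)
FEATURE_DIM = 256   # size of the extracted feature vector (from the abstract)
NUM_SPEAKERS = 64   # closed set of speakers (from the abstract)


def make_sequences(frames, seq_len=SEQ_LEN, stride=1):
    """Cut a list of frames into fixed-length windows.

    The stride is an assumption; the abstract does not state how
    sequences are cut from the recordings.
    """
    return [frames[i:i + seq_len]
            for i in range(0, len(frames) - seq_len + 1, stride)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def placeholder_backbone(sequence):
    """Hypothetical stand-in for the 3D convolutional layers.

    Deterministically maps a sequence of frames (lists of pixel values)
    to FEATURE_DIM numbers. It is NOT the network from the paper.
    """
    rng = random.Random(sum(sum(frame) for frame in sequence))
    return [rng.uniform(-1.0, 1.0) for _ in range(FEATURE_DIM)]


class SpeakerNet:
    """Backbone plus a final fully connected softmax layer."""

    def __init__(self, backbone, seed=0):
        rng = random.Random(seed)
        self.backbone = backbone
        # Final layer: FEATURE_DIM -> NUM_SPEAKERS (random, untrained here).
        self.head = [[rng.gauss(0.0, 0.05) for _ in range(FEATURE_DIM)]
                     for _ in range(NUM_SPEAKERS)]

    def classify(self, sequence):
        """Full network: speaker probabilities, as used during training."""
        feats = self.backbone(sequence)
        logits = [sum(w * f for w, f in zip(row, feats)) for row in self.head]
        return softmax(logits)

    def extract_features(self, sequence):
        """Network with the softmax layer removed: the feature vector."""
        return self.backbone(sequence)


# Toy input: 20 "frames" of 4 pixel values each.
frames = [[float(t + p) for p in range(4)] for t in range(20)]
sequences = make_sequences(frames)
net = SpeakerNet(placeholder_backbone)
probs = net.classify(sequences[0])
feats = net.extract_features(sequences[0])
print(len(sequences), len(probs), len(feats))  # 6 64 256
```

In a real setup the backbone would be trained jointly with the classification layer, and only `extract_features` would be used afterwards to supply speaker features to a lipreading system.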

Acknowledgments

This work was supported by the Ministry of Education of the Czech Republic, project No. LTARF18017, and by the grant of the University of West Bohemia, project No. SGS-2016-039. It was also supported by the Government of the Russian Federation (grant No. 08-08) and the Russian Foundation for Basic Research (project No. 18-07-01407). Moreover, access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.

Author information

Authors and Affiliations

Faculty of Applied Sciences, Department of Cybernetics, UWB, Pilsen, Czech Republic
Miroslav Hlaváč, Ivan Gruber & Miloš Železný
Faculty of Applied Sciences, NTIS, UWB, Pilsen, Czech Republic
Miroslav Hlaváč & Ivan Gruber
ITMO University, St. Petersburg, Russia
Miroslav Hlaváč, Ivan Gruber & Alexey Karpov
SPIIRAS, St. Petersburg, Russia
Alexey Karpov

Corresponding author

Correspondence to Miroslav Hlaváč.

Editor information

Editors and Affiliations

SPIIRAS, St. Petersburg, Russia
Alexey Karpov
Leipzig University of Telecommunications, Leipzig, Germany
Oliver Jokisch
Moscow State Linguistic University, Moscow, Russia
Rodmonga Potapova

Cite this paper

Hlaváč, M., Gruber, I., Železný, M., Karpov, A. (2018). LipsID Using 3D Convolutional Neural Networks. In: Karpov, A., Jokisch, O., Potapova, R. (eds) Speech and Computer. SPECOM 2018. Lecture Notes in Computer Science, vol 11096. Springer, Cham. https://doi.org/10.1007/978-3-319-99579-3_22

DOI: https://doi.org/10.1007/978-3-319-99579-3_22
Published: 25 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99578-6
Online ISBN: 978-3-319-99579-3
eBook Packages: Computer Science, Computer Science (R0)
