Abstract
Visual speech recognition, also known as Automatic Lip Reading (ALR), is essential for understanding speech in several real-world applications, such as surveillance systems and assistive technology for the differently abled, and these applications have proliferated research in the field. In recent years, Deep Learning (DL) methods have been utilised for developing ALR systems. DL models, however, tend to be vulnerable to adversarial attacks, and studying these attacks opens new research directions for designing robust DL systems. Existing attacks on image and video classification models are not directly applicable to ALR systems. Since ALR systems encompass temporal information, attacking them is more challenging than attacking image classification models. Similarly, compared to other video classification tasks, the region of interest is smaller in the case of ALR systems. Despite these factors, our proposed method, Fooling AuTomAtic Lip Reading (FATALRead), can successfully perform adversarial attacks on state-of-the-art ALR systems. To the best of our knowledge, we are the first to successfully fool ALR systems on the word recognition task. We further demonstrate that the success of the attack increases when the loss function is computed on logits instead of probabilities. Our extensive experiments on a publicly available dataset show that our attack successfully circumvents well-known transformation-based defences.
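The logits-versus-probabilities observation can be made concrete with a short sketch. Below is a minimal, hypothetical example of a Carlini-Wagner style margin loss computed on the raw logits of a word classifier; working on logits avoids the vanishing gradients that a softmax (probability) based loss suffers from once the model becomes confident. The function name, tensor shapes, and target index are illustrative assumptions, not the paper's exact objective.

```python
import torch

def logit_margin_loss(logits, target_class):
    # Margin between the target-class logit and the best non-target
    # logit, computed on raw logits rather than softmax probabilities,
    # so the gradient does not vanish when the model is confident.
    target_logit = logits[:, target_class]
    others = logits.clone()
    others[:, target_class] = float("-inf")  # exclude the target class
    max_other_logit = others.max(dim=1).values
    # Loss reaches zero once the target logit dominates all others.
    return torch.clamp(max_other_logit - target_logit, min=0).mean()

# Toy usage: a batch of 2 clips scored over a 500-word vocabulary
# (the LRW dataset has 500 word classes). In an actual attack the
# gradient would be taken w.r.t. the perturbed input video rather
# than the logits themselves.
logits = torch.randn(2, 500, requires_grad=True)
loss = logit_margin_loss(logits, target_class=42)
loss.backward()
```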
Notes
Dataset Link: www.robots.ox.ac.uk/~vgg/data/lip_reading/lrw1.html (For non-commercial individual research and private study use only. BBC content included courtesy of the BBC.)
Available at: https://github.com/mpc001/end-to-end-lipreading
Acknowledgements
We would like to thank the respective authors for providing code and pretrained models. We would also like to thank the BBC for providing the Lip Reading in the Wild (LRW) dataset. We are also thankful to the anonymous reviewers for their valuable suggestions to improve the quality of the paper.
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Ethics declarations
Conflicts of interest/Competing interests
The authors declare that they have no conflict of interest.
Additional information
Availability of data and material (data transparency)
Dataset is publicly available at: www.robots.ox.ac.uk/~vgg/data/lip_reading/lrw1.html
Code availability (software application or custom code)
Code is publicly available at: https://github.com/AnupKumarGupta/FATALRead-Fooling-Visual-Speech-Recogntion
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Gupta, A.K., Gupta, P. & Rahtu, E. FATALRead - Fooling visual speech recognition models. Appl Intell 52, 9001–9016 (2022). https://doi.org/10.1007/s10489-021-02846-w