
FATALRead - Fooling visual speech recognition models

Put words on lips


Abstract

Visual speech recognition is essential for understanding speech in several real-world applications, such as surveillance systems and aids for the differently-abled. This importance has spurred research in visual speech recognition, also known as Automatic Lip Reading (ALR). In recent years, Deep Learning (DL) methods have been utilised to develop ALR systems. DL models, however, tend to be vulnerable to adversarial attacks, and studying these attacks opens new research directions for designing robust DL systems. Existing attacks on image and video classification models are not directly applicable to ALR systems. Since ALR systems encompass temporal information, attacking them is more challenging than attacking image classification models. Moreover, compared to other video classification tasks, the region of interest in ALR systems is smaller. Despite these factors, our proposed method, Fooling AuTomAtic Lip Reading (FATALRead), successfully performs adversarial attacks on state-of-the-art ALR systems. To the best of our knowledge, we are the first to successfully fool ALR systems on the word recognition task. We further demonstrate that the success of the attack increases when logits are used instead of probabilities in the loss function. Our extensive experiments on a publicly available dataset show that our attack successfully circumvents well-known transformation-based defences.
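
To make the abstract's point about logits concrete, the following is a minimal sketch of a targeted, iterative attack on a video classifier using a Carlini–Wagner-style margin on the logits. It is an illustration only, not the authors' implementation: the model interface, the tensor shape (a sequence of T mouth-region frames), and the hyper-parameters (steps, eps, alpha) are assumptions.

    import torch

    # Hypothetical sketch: `model` maps a video tensor of shape (1, T, H, W)
    # (a sequence of mouth-region frames) to word logits of shape (1, num_words).
    def logit_margin_attack(model, video, target, steps=40, eps=4/255, alpha=1/255):
        delta = torch.zeros_like(video, requires_grad=True)  # additive perturbation
        for _ in range(steps):
            logits = model(video + delta)  # raw scores, before the softmax
            # CW-style margin on logits: push the target-word logit above the
            # strongest competing logit. Losses built on softmax probabilities
            # saturate when the model is confident, starving the attack of
            # gradient; the logit margin does not.
            competing = logits.clone()
            competing[0, target] = float('-inf')
            loss = competing.max() - logits[0, target]
            loss.backward()
            with torch.no_grad():
                delta -= alpha * delta.grad.sign()  # descend on the margin loss
                delta.clamp_(-eps, eps)             # keep the perturbation imperceptible
            delta.grad.zero_()
        return (video + delta).detach()

A probability-based variant would instead maximise the softmax probability of the target word; its gradient vanishes once the softmax saturates, which is consistent with the abstract's finding that logits yield a stronger attack. Note that the sketch perturbs every frame, since ALR models consume the whole frame sequence.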



Notes

  1. Dataset Link: www.robots.ox.ac.uk/~vgg/data/lip_reading/lrw1.html (For non-commercial individual research and private study use only. BBC content included courtesy of the BBC.)

  2. Available at: https://github.com/mpc001/end-to-end-lipreading

  3. Available at: https://github.com/mpc001/Lipreading_using_Temporal_Convolutional_Networks


Acknowledgements

We would like to thank the respective authors for providing code and pretrained models. We would also like to thank the BBC for providing the Lip Reading Words in the Wild (LRW) dataset. We are also thankful to the anonymous reviewers for their valuable suggestions to improve the quality of the paper.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information


Corresponding author

Correspondence to Anup Kumar Gupta.

Ethics declarations

Conflicts of interest/Competing interests

The authors declare that they have no conflict of interest.

Additional information

Availability of data and material

The dataset is publicly available at: www.robots.ox.ac.uk/~vgg/data/lip_reading/lrw1.html

Code availability

The code is publicly available at: https://github.com/AnupKumarGupta/FATALRead-Fooling-Visual-Speech-Recogntion

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below are the links to the electronic supplementary material.

Supplementary video (MP4 37.4 MB)

Supplementary document (PDF 120 KB)


About this article


Cite this article

Gupta, A.K., Gupta, P. & Rahtu, E. FATALRead - Fooling visual speech recognition models. Appl Intell 52, 9001–9016 (2022). https://doi.org/10.1007/s10489-021-02846-w

