Abstract
Visual speech recognition, also known as Automatic Lip Reading (ALR), is essential for understanding speech in several real-world applications, such as surveillance systems and assistive technology for the differently abled, and these applications have proliferated research in the field. In recent years, Deep Learning (DL) methods have been utilised for developing ALR systems. DL models, however, tend to be vulnerable to adversarial attacks, and studying these attacks opens new research directions for designing robust DL systems. Existing attacks on image and video classification models are not directly applicable to ALR systems. Since ALR systems encompass temporal information, attacking them is more challenging than attacking image classification models. Similarly, compared to other video classification tasks, the region of interest is smaller in the case of ALR systems. Despite these factors, our proposed method, Fooling AuTomAtic Lip Reading (FATALRead), can successfully perform adversarial attacks on state-of-the-art ALR systems. To the best of our knowledge, we are the first to successfully fool ALR systems on the word recognition task. We further demonstrate that the success of the attack increases when the loss function is computed on logits instead of probabilities. Our extensive experiments on a publicly available dataset show that our attack successfully circumvents well-known transformation-based defences.
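The logits-versus-probabilities observation can be made concrete with a short sketch. Below is a minimal, hypothetical example of a Carlini-Wagner style margin loss computed on the raw logits of a word classifier; working on logits avoids the vanishing gradients that a softmax (probability) based loss suffers from once the model becomes confident. The function name, tensor shapes, and target index are illustrative assumptions, not the paper's exact objective.

```python
import torch

def logit_margin_loss(logits, target_class):
    # Margin between the target-class logit and the best non-target
    # logit, computed on raw logits rather than softmax probabilities,
    # so the gradient does not vanish when the model is confident.
    target_logit = logits[:, target_class]
    others = logits.clone()
    others[:, target_class] = float("-inf")  # exclude the target class
    max_other_logit = others.max(dim=1).values
    # Loss reaches zero once the target logit dominates all others.
    return torch.clamp(max_other_logit - target_logit, min=0).mean()

# Toy usage: a batch of 2 clips scored over a 500-word vocabulary
# (the LRW dataset has 500 word classes). In an actual attack the
# gradient would be taken w.r.t. the perturbed input video rather
# than the logits themselves.
logits = torch.randn(2, 500, requires_grad=True)
loss = logit_margin_loss(logits, target_class=42)
loss.backward()
```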
Notes
Dataset Link: www.robots.ox.ac.uk/~vgg/data/lip_reading/lrw1.html (For non-commercial individual research and private study use only. BBC content included courtesy of the BBC.)
Available at: https://github.com/mpc001/end-to-end-lipreading
Acknowledgements
We would like to thank the respective authors for providing code and pretrained models. We would also like to thank the BBC for providing the Lip Reading in the Wild (LRW) dataset. We are also thankful to the anonymous reviewers for their valuable suggestions to improve the quality of the paper.
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Ethics declarations
Conflicts of interest/Competing interests
The authors declare that they have no conflict of interest.
Additional information
Availability of data and material (data transparency)
Dataset is publicly available at: www.robots.ox.ac.uk/~vgg/data/lip_reading/lrw1.html
Code availability (software application or custom code)
Code is publicly available at: https://github.com/AnupKumarGupta/FATALRead-Fooling-Visual-Speech-Recogntion
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Gupta, A.K., Gupta, P. & Rahtu, E. FATALRead - Fooling visual speech recognition models. Appl Intell 52, 9001–9016 (2022). https://doi.org/10.1007/s10489-021-02846-w