DOI: 10.1145/3394171.3414009

Attacking Image Captioning Towards Accuracy-Preserving Target Words Removal

Published: 12 October 2020

Abstract

In this paper, we investigate the fragility of deep image captioning models against adversarial attacks. Unlike existing works that generate common words and concepts, we focus on adversarial attacks towards controllable image captioning, i.e., removing target words from captions by imposing adversarial noise on images while maintaining captioning accuracy for the remaining visual content. We name this new task Masked Image Captioning (MIC), which is expected to be training- and labeling-free for end-to-end captioning models. Meanwhile, we propose a novel adversarial learning approach for this task, termed Show, Mask, and Tell (SMT), which crafts adversarial examples that mask the target concepts by minimizing an objective loss while training the noise generator. Concretely, three novel designs are introduced in this loss: word removal regularization, captioning accuracy regularization, and noise filtering regularization. For quantitative validation, we propose a benchmark dataset for MIC based on the MS COCO dataset, together with a new evaluation metric called Attack Quality. Experimental results show that the proposed approach achieves successful attacks, removing 93.8% and 91.9% of target words while maintaining 97.3% and 97.4% accuracy on two cutting-edge captioning models, respectively.
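The abstract describes SMT's objective as a composite loss with three regularization terms (word removal, captioning accuracy, noise filtering) minimized over an adversarial noise. As a rough illustration only — the paper's actual captioning model, loss terms, weights, and noise generator are not reproduced here — the toy sketch below minimizes a composite objective over an additive noise vector, with simple differentiable surrogates standing in for the three terms; every name and coefficient is a hypothetical placeholder.

```python
import numpy as np

# Toy stand-ins: a feature vector plays the role of the image, and two
# scoring directions play the roles of the target word and the remaining
# visual content. All of this is illustrative, not the paper's model.
rng = np.random.default_rng(0)
x = rng.normal(size=8)          # "image" feature
w_target = rng.normal(size=8)   # scoring direction for the target word
w_keep = rng.normal(size=8)     # scoring direction for remaining content

def grad(delta, lam_acc=1.0, lam_noise=0.1):
    # word-removal term: drive the target-word score towards zero
    g = 2.0 * (w_target @ (x + delta)) * w_target
    # accuracy term: keep the remaining-content score unchanged
    g += lam_acc * 2.0 * (w_keep @ delta) * w_keep
    # noise-filtering term: keep the perturbation small
    g += lam_noise * 2.0 * delta
    return g

# Plain gradient descent on the noise, mimicking "minimizing an objective
# loss" over adversarial perturbations.
delta = np.zeros(8)
for _ in range(1000):
    delta -= 0.02 * grad(delta)

print("target score before/after:",
      abs(w_target @ x), abs(w_target @ (x + delta)))
print("kept-content score drift:",
      abs(w_keep @ (x + delta) - w_keep @ x))
```

After optimization, the target-word score is strongly suppressed while the kept-content score barely moves and the noise stays small — the trade-off the three regularizers are meant to balance.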

Supplementary Material

MP4 File (3394171.3414009.mp4)
The video for "Attacking Image Captioning Towards Accuracy-Preserving Target Words Removal"




Published In

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN: 978-1-4503-7988-5
DOI: 10.1145/3394171
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. adversarial attack
  2. dataset
  3. image captioning
  4. target words removal

Qualifiers

  • Research-article

Funding Sources

  • Natural Science Foundation of Guangdong Province in China
  • National Natural Science Foundation of China
  • National Key R&D Program

Conference

MM '20

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Cited By

  • (2024) Multi-branch Collaborative Learning Network for 3D Visual Grounding. Computer Vision – ECCV 2024, 381–398. doi:10.1007/978-3-031-72952-2_22 (29 Sep 2024)
  • (2023) CAPatch. Proceedings of the 32nd USENIX Conference on Security Symposium, 679–696. doi:10.5555/3620237.3620276 (9 Aug 2023)
  • (2023) Black-Box Attacks on Image Activity Prediction and its Natural Language Explanations. 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 3688–3697. doi:10.1109/ICCVW60793.2023.00396 (2 Oct 2023)
  • (2023) Deep image captioning: A review of methods, trends and future challenges. Neurocomputing 546, 126287. doi:10.1016/j.neucom.2023.126287 (Aug 2023)
  • (2022) Knowing What to Learn: A Metric-Oriented Focal Mechanism for Image Captioning. IEEE Transactions on Image Processing 31, 4321–4335. doi:10.1109/TIP.2022.3183434 (2022)
  • (2022) Visuals to Text: A Comprehensive Review on Automatic Image Captioning. IEEE/CAA Journal of Automatica Sinica 9(8), 1339–1365. doi:10.1109/JAS.2022.105734 (Aug 2022)
  • (2022) Most and Least Retrievable Images in Visual-Language Query Systems. Computer Vision – ECCV 2022, 1–18. doi:10.1007/978-3-031-19836-6_1 (22 Oct 2022)
  • (2021) Latent Memory-augmented Graph Transformer for Visual Storytelling. Proceedings of the 29th ACM International Conference on Multimedia, 4892–4901. doi:10.1145/3474085.3475236 (17 Oct 2021)
  • (2021) Pick-Object-Attack: Type-specific adversarial attack for object detection. Computer Vision and Image Understanding, 103257. doi:10.1016/j.cviu.2021.103257 (Aug 2021)
