DOI: 10.1145/3394171.3414009

Attacking Image Captioning Towards Accuracy-Preserving Target Words Removal

Published: 12 October 2020

Abstract

In this paper, we investigate the fragility of deep image captioning models against adversarial attacks. Unlike existing works that generate common words and concepts, we focus on adversarial attacks towards controllable image captioning, i.e., removing target words from captions by imposing adversarial noise on images while maintaining captioning accuracy for the remaining visual content. We name this new task Masked Image Captioning (MIC), which is expected to be training- and labeling-free for end-to-end captioning models. Meanwhile, we propose a novel adversarial learning approach for this task, termed Show, Mask, and Tell (SMT), which crafts adversarial examples that mask the target concepts by minimizing an objective loss while training the noise generator. Concretely, three novel designs are introduced in this loss: word removal regularization, captioning accuracy regularization, and noise filtering regularization. For quantitative validation, we propose a benchmark dataset for MIC based on the MS COCO dataset, together with a new evaluation metric called Attack Quality. Experimental results show that the proposed approach achieves successful attacks, removing 93.8% and 91.9% of target words while maintaining 97.3% and 97.4% accuracy on two cutting-edge captioning models, respectively.
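The abstract describes SMT's objective as a composite loss with three regularization terms (word removal, captioning accuracy, noise filtering) minimized over an adversarial noise. As a rough illustration only — the paper's actual captioning model, loss terms, weights, and noise generator are not reproduced here — the toy sketch below minimizes a composite objective over an additive noise vector, with simple differentiable surrogates standing in for the three terms; every name and coefficient is a hypothetical placeholder.

```python
import numpy as np

# Toy stand-ins: a feature vector plays the role of the image, and two
# scoring directions play the roles of the target word and the remaining
# visual content. All of this is illustrative, not the paper's model.
rng = np.random.default_rng(0)
x = rng.normal(size=8)          # "image" feature
w_target = rng.normal(size=8)   # scoring direction for the target word
w_keep = rng.normal(size=8)     # scoring direction for remaining content

def grad(delta, lam_acc=1.0, lam_noise=0.1):
    # word-removal term: drive the target-word score towards zero
    g = 2.0 * (w_target @ (x + delta)) * w_target
    # accuracy term: keep the remaining-content score unchanged
    g += lam_acc * 2.0 * (w_keep @ delta) * w_keep
    # noise-filtering term: keep the perturbation small
    g += lam_noise * 2.0 * delta
    return g

# Plain gradient descent on the noise, mimicking "minimizing an objective
# loss" over adversarial perturbations.
delta = np.zeros(8)
for _ in range(1000):
    delta -= 0.02 * grad(delta)

print("target score before/after:",
      abs(w_target @ x), abs(w_target @ (x + delta)))
print("kept-content score drift:",
      abs(w_keep @ (x + delta) - w_keep @ x))
```

After optimization, the target-word score is strongly suppressed while the kept-content score barely moves and the noise stays small — the trade-off the three regularizers are meant to balance.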

Supplementary Material

MP4 File (3394171.3414009.mp4)
The video for "Attacking Image Captioning Towards Accuracy-Preserving Target Words Removal"




Published In

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN: 978-1-4503-7988-5
DOI: 10.1145/3394171
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. adversarial attack
  2. dataset
  3. image captioning
  4. target words removal

Qualifiers

  • Research-article

Funding Sources

  • Natural Science Foundation of Guangdong Province in China
  • National Natural Science Foundation of China
  • National Key R&D Program

Conference

MM '20

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Cited By

  • (2024) Multi-branch Collaborative Learning Network for 3D Visual Grounding. Computer Vision – ECCV 2024, 381–398. doi:10.1007/978-3-031-72952-2_22 (29 Sep 2024)
  • (2023) CAPatch. Proceedings of the 32nd USENIX Conference on Security Symposium, 679–696. doi:10.5555/3620237.3620276 (9 Aug 2023)
  • (2023) Black-Box Attacks on Image Activity Prediction and its Natural Language Explanations. 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 3688–3697. doi:10.1109/ICCVW60793.2023.00396 (2 Oct 2023)
  • (2023) Deep image captioning: A review of methods, trends and future challenges. Neurocomputing 546, 126287. doi:10.1016/j.neucom.2023.126287 (Aug 2023)
  • (2022) Knowing What to Learn: A Metric-Oriented Focal Mechanism for Image Captioning. IEEE Transactions on Image Processing 31, 4321–4335. doi:10.1109/TIP.2022.3183434 (2022)
  • (2022) Visuals to Text: A Comprehensive Review on Automatic Image Captioning. IEEE/CAA Journal of Automatica Sinica 9(8), 1339–1365. doi:10.1109/JAS.2022.105734 (Aug 2022)
  • (2022) Most and Least Retrievable Images in Visual-Language Query Systems. Computer Vision – ECCV 2022, 1–18. doi:10.1007/978-3-031-19836-6_1 (22 Oct 2022)
  • (2021) Latent Memory-augmented Graph Transformer for Visual Storytelling. Proceedings of the 29th ACM International Conference on Multimedia, 4892–4901. doi:10.1145/3474085.3475236 (17 Oct 2021)
  • (2021) Pick-Object-Attack: Type-specific adversarial attack for object detection. Computer Vision and Image Understanding, 103257. doi:10.1016/j.cviu.2021.103257 (Aug 2021)
