EnsembleFool: A method to generate adversarial examples based on model fusion strategy
Introduction
Deep neural networks enable intelligent systems to achieve high accuracy on tasks of interest such as computer vision, natural language processing, speech recognition, and recommendation, considerably reducing human labour in real applications (Kurakin et al., 2017). However, attempts to defraud these models render such systems potentially unsafe. In particular, adversarial examples (Akhtar & Mian, 2018; Szegedy et al., 2014), generated by adding imperceptible perturbations to the original input, can mislead deep models into incorrect predictions (Déniz et al., 2019; Pedraza et al., 2020; Zhao et al., 2020). As Fig. 1 illustrates, a well-trained model predicts the unperturbed input as the true label with 95.62% confidence, but assigns a wrong label with 87.56% confidence when the image is corrupted by well-designed noise. The vulnerability of deep models remains a critical safety issue, and it is necessary to evaluate model robustness with respect to adversarial attacks. Hence, advanced attack techniques are expected to cooperate with the adversarial defense community to improve the safety of intelligent systems.
Existing adversarial attack methods can be roughly divided into two categories: (1) white-box attacks (Carlini & Wagner, 2017; Chen et al., 2020; Goodfellow et al., 2015; Kurakin et al., 2017) and (2) black-box attacks (Brendel et al., 2018; Chen et al., 2019; Chen et al., 2017; Ren et al., 2020; Zhao et al., 2018). The white-box setting assumes the attacker has full knowledge of the structure and parameters of the model to be attacked, which allows extremely deceptive adversarial examples to be generated. For instance, Goodfellow et al. (2015) proposed the fast gradient sign method (FGSM), which efficiently computes the perturbation from the sign of the model's gradients. Unlike the white-box setting, the black-box setting only exposes the model outputs for given adversarial inputs, while the model details are unavailable. This poses a more difficult problem than the white-box attack because of the limited information at hand. To handle it, recent research shows that adversarial examples are transferable (Liu et al., 2017), meaning that examples crafted for a given model can also fool other, unknown models. This property inspires several state-of-the-art black-box attacks (Brendel et al., 2018; Chen et al., 2017; Zhao et al., 2018). For example, MI-FGSM (Dong et al., 2018) applies iteration and momentum to FGSM to generate improved adversarial examples, achieving enhanced attack ability in both white-box and black-box cases.
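To make the FGSM idea concrete, the sketch below applies the one-step sign update to a toy logistic-regression "model" in NumPy, where the input gradient of the cross-entropy loss has a closed form. The model, weights, and values here are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def fgsm_perturb(x, y, w, b, eps):
    """One-step FGSM on a toy logistic-regression model.

    x : input vector, y : label in {0, 1},
    w, b : model parameters, eps : L-inf perturbation budget.
    Returns x_adv = x + eps * sign(grad_x loss).
    """
    # Forward pass: sigmoid probability of class 1.
    p = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
    # Gradient of the cross-entropy loss w.r.t. the input
    # simplifies to (p - y) * w for logistic regression.
    grad_x = (p - y) * w
    # FGSM: move each coordinate eps in the sign of the gradient,
    # which maximally increases the loss under an L-inf budget.
    return x + eps * np.sign(grad_x)

# Toy usage: the single sign step flips the model's prediction.
w = np.array([2.0, -3.0, 1.0])
b = 0.0
x = np.array([0.5, -0.5, 0.2])   # clean input, predicted as class 1
x_adv = fgsm_perturb(x, y=1, w=w, b=b, eps=0.8)
p_clean = 1 / (1 + np.exp(-(w @ x + b)))   # > 0.5: class 1
p_adv = 1 / (1 + np.exp(-(w @ x_adv + b)))  # < 0.5: prediction flipped
```

The same update applies to a deep network, except that the input gradient is obtained by backpropagation rather than in closed form.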
The transferability of adversarial examples endows the methods above with the ability to attack plain deep models. Unfortunately, when models are equipped with certain defense mechanisms, these methods are far less effective at fooling black-box models. For example, adversarial training (Song et al., 2020; Tramèr et al., 2018) and input modification (Cohen et al., 2019; Liu et al., 2019) are two typical ways to enhance the robustness of deep models in the black-box case. This reveals a disparity between different models, suggesting that a universal attack cannot be crafted from a single model. Instead, the characteristics of multiple models could be considered simultaneously when synthesizing the input perturbation, such that the generated example threatens all of the models.
To further investigate model diversity, we visualize the attention maps of deep models with varying architectures through CAM (Zhou et al., 2016). Fig. 2 shows that the main focuses of different models during prediction lie in different spatial regions; we illustrate this using a heatmap mask, with the focus differences marked by red boxes. This verifies that different models resist attacks in different regions of the input image and, more importantly, that the attack result of one model can be beneficial for attacking another. Hence, the input perturbation can be improved by exploiting the robustness expressed by multiple models.
Inspired by the above analyses, in this paper we propose a novel attack method named EnsembleFool. Specifically, considering that the attack map of a single model is limited and that different models possess different maps, we follow the MI-FGSM framework and integrate multiple models to enlarge the attack map, thereby enhancing attack capability. To implement a flexible integration process, we develop a series of fusion strategies based on the attack results of the previous iteration. This is motivated by the observation that the attack effect of an adversarial example is clearly expressed by the model's output, and hence the output can guide the attack in the next iteration. In this way, the adversarial examples crafted by the ensemble of multiple models exhibit enhanced attack ability in both white-box and black-box cases. The main contributions of this paper are summarized as follows:
- We investigate the transferability of adversarial examples crafted with a single model, showing that the primary focuses of different models lie in different receptive fields and that the attack map cannot be shared between models. This validates the necessity of integrating multiple models in a flexible way.
- Based on MI-FGSM, we develop a novel ensemble attack method called EnsembleFool, which produces adversarial examples by integrating the attack results of multiple models adaptively, such that the most informative attack dominates each attack iteration.
- Extensive experiments validate the state-of-the-art performance of the proposed method, even when attacking models equipped with defense mechanisms.
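The overall attack loop can be sketched as below: a momentum-iterative (MI-FGSM-style) update over an ensemble of toy logistic models, where each iteration converts the per-model losses into fusion weights. The softmax weighting rule, the toy models, and all values here are hypothetical illustrations of an adaptive fusion; the paper's actual fusion strategies are defined in its Methods section:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ensemble_attack(x, y, models, eps, steps=10, mu=1.0):
    """Momentum-iterative attack over an ensemble of toy logistic models.

    models : list of (w, b) pairs; y : label in {0, 1};
    eps : L-inf budget; steps : iterations; mu : momentum decay.
    At each step the per-model cross-entropy losses are turned into
    fusion weights via a softmax over negative losses, so a model that
    is still hard to fool (low loss) receives larger weight -- one
    plausible adaptive fusion rule, assumed here for illustration.
    """
    alpha = eps / steps           # per-step size
    g = np.zeros_like(x)          # accumulated momentum
    x_adv = x.copy()
    for _ in range(steps):
        losses, grads = [], []
        for w, b in models:
            p = np.clip(sigmoid(w @ x_adv + b), 1e-12, 1 - 1e-12)
            losses.append(-(y * np.log(p) + (1 - y) * np.log(1 - p)))
            grads.append((p - y) * w)          # d loss / d x
        # Adaptive fusion: favour the model the example fools least.
        wts = np.exp(-np.array(losses))
        wts /= wts.sum()
        grad = sum(wt * gr for wt, gr in zip(wts, grads))
        # MI-FGSM momentum update with L1-normalised gradient.
        g = mu * g + grad / (np.abs(grad).sum() + 1e-12)
        x_adv = x_adv + alpha * np.sign(g)
        x_adv = np.clip(x_adv, x - eps, x + eps)   # stay in the eps-ball
    return x_adv
```

A single adversarial input produced this way can flip the predictions of every model in the ensemble, which is the behaviour the transferability analysis above motivates.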
The remainder of this paper is organized as follows. The related works and the preliminary are briefly reviewed in Section 2 and Section 3, respectively. The proposed method is introduced in Section 4. The experiments are presented in Section 5, with the conclusion drawn in Section 6.
Related work
In this section, we briefly review methods relevant to the current work, including adversarial attacking, attacking using an ensemble of models, and adversarial defense. Adversarial attacking. Deep neural networks have been shown to be vulnerable to adversarial examples (Szegedy et al., 2014), creating the need for advanced attack methods to evaluate the robustness of the models. For example, Goodfellow et al. (2015) argued that the primary reason for this vulnerability is the linear nature of deep models in high-dimensional spaces.
Preliminary
Before presenting the proposed method, we first introduce the notation and preliminary knowledge for the current work. Let x denote an image, y the corresponding ground-truth label, θ the network parameters, and J the loss function. To generate the adversarial example, our goal is to maximize the loss function J(x, y; θ) under the constraint that the generated example should look visually similar to the original image, i.e., the perturbation magnitude is bounded by a small budget.
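In the usual formulation of this constraint (a sketch: the symbols x for the image, y for its label, θ for the parameters, J for the loss, δ for the perturbation, and an ℓ∞ budget ε are assumed here, since the snippet omits the original notation), the objective reads:

```latex
\max_{\delta}\; J(x + \delta,\, y;\, \theta)
\quad \text{subject to} \quad \|\delta\|_{\infty} \le \epsilon
```

FGSM solves a one-step linearization of this problem, which yields the sign-of-gradient update, and iterative variants such as MI-FGSM refine it over multiple steps while projecting back into the ε-ball.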
Methods
In this section, we introduce the proposed adaptive ensemble model in detail, including its variants with manual and dynamic adaptivity.
Experiments
In this section, we conduct a series of experiments on public datasets to validate the effectiveness of the proposed method by comparing it with state-of-the-art methods.
Conclusions
Different deep models exhibit different characteristics of vulnerability when attacked, and exploiting this property helps to generate more powerful adversarial examples. Starting from this observation, in this paper we propose a novel adaptive ensemble-based adversarial attack method that employs an adaptive strategy to fuse the information of multiple models. Specifically, the model prediction reveals how correct and confident the model is with respect to the input, which informs us to use the model outputs to guide the fusion at each attack iteration.
Declaration of Competing Interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
CRediT authorship contribution statement
Wenyu Peng: Conceptualization, Methodology. Renyang Liu: Writing - original draft. Ruxin Wang: Writing - review & editing. Taining Cheng: Data curation. Zifeng Wu: Validation. Li Cai: Visualization, Investigation. Wei Zhou: Supervision.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grants 61762089, 61663047, 61863036, 61762092, 71972165, and 61763048, by the Yunnan Applied Basic Research Projects under Grant No. 202001BB050034, and by the Yunnan Province Science Foundation for Youths under Grant No. 202005AC160007.
References (34)
- et al. POBA-GA: perturbation optimized black-box adversarial attacks via genetic algorithm. Comput. Secur. (2019)
- et al. RCA-SOC: a novel adversarial defense by refocusing on critical areas and strengthening object contours. Comput. Secur. (2020)
- et al. Query-efficient label-only attacks against black-box machine learning models. Comput. Secur. (2020)
- et al. Threat of adversarial attacks on deep learning in computer vision: a survey. IEEE Access (2018)
- et al. Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. ICML (2018)
- et al. Decision-based adversarial attacks: reliable attacks against black-box machine learning models. ICLR (2018)
- et al. Towards evaluating the robustness of neural networks. SP (2017)
- et al. ZOO: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (2017)
- et al. Certified adversarial robustness via randomized smoothing. ICML (2019)
- et al. Adversarial examples are a manifestation of the fitting-generalization trade-off. IWANN (2019)
- Robustness to adversarial examples can be improved with overfitting. Int. J. Mach. Learn. Cybern.
- Boosting adversarial attacks with momentum. CVPR
- Evading defenses to transferable adversarial examples by translation-invariant attacks. CVPR
- Explaining and harnessing adversarial examples. ICLR
- Countering adversarial images using input transformations. ICLR
- Adversarial examples in the physical world. ICLR
- Defense against adversarial attacks using high-level representation guided denoiser. CVPR
Wenyu Peng received the BE degree in network engineering from Yunnan University in 2019. She is a master degree candidate at the School of Software, Yunnan University. Her research interests include deep learning, adversarial attack and bio-informatics.
Renyang Liu received the BE degree in Computer Science from Northwest Normal University in 2017. He is currently a Ph.D. candidate in the School of Information Science and Engineering at Yunnan University, Kunming, China. His current research interests include deep learning, adversarial attack, and graph learning.
Ruxin Wang is currently an associate professor with the National Pilot School of Software, Yunnan University, Kunming, China. He received his BEng from Xidian University, his MSc from Huazhong University of Science and Technology, and his PhD degree from the University of Technology Sydney. His research interests include image restoration, deep learning, and computer vision. He focuses on the topic of image synthesis using both discriminative and generative models. He has authored and co-authored 20+ research papers in venues including IEEE T-NNLS, T-IP, T-Cyb, ICCV, and AAAI. He has received "the 1000 Talents Plan for Young Talents of Yunnan Province" award.
Taining Cheng received the bachelor's degree from Yunnan University, Yunnan, China, in 2019. He is currently working toward the master's degree in the School of Software at Yunnan University. His current research interests include learned indexes and distributed computing.
Zifeng Wu received the BE degree in network engineering from Yunnan University in 2019. He is a master's degree candidate at the School of Software, Yunnan University. His current research interests include machine learning, model compression, and acceleration.
Li Cai was born in Kunming, China, in 1975. She received the M.S. degree in computer application from Yunnan University, China, in 2007. Then, she received the Ph.D. degree with the School of Computer Science, Fudan University, China, in 2020. From 1997 to 2002, she was a Research Assistant with Network Center. Since 2010, she has been an Associate Professor with the School of Software, Yunnan University. Her research interests include intelligent transportation, machine learning, visualization, and data quality.
Wei Zhou received the Ph.D. degree from the University of Chinese Academy of Sciences. He is currently a Full Professor with the Software School, Yunnan University. His current research interests include distributed data-intensive computing and bio-informatics. He is currently a Fellow of the China Communications Society, a member of the Yunnan Communications Institute, and a member of the Bioinformatics Group of the Chinese Computer Society. He won the Wu Daguan Outstanding Teacher Award of Yunnan University in 2016, and was selected into the Youth Talent Program of Yunnan University in 2017. He has hosted a number of National Natural Science Foundation projects.