Abstract
Machine learning has been tremendously successful in fields ranging from image classification to natural language processing. Despite its ubiquity, its application in high-risk domains has been hindered by the opacity of its decision-making: users do not understand why a given prediction was made. To address this limitation, explainable artificial intelligence (XAI) is being developed from multiple perspectives and at multiple levels. However, while the auxiliary information provided by XAI helps build a bridge of trust between users and models, it inevitably increases the risk of the model being attacked. In this paper, we show that explanation information poses a concrete attack risk to the model, and we explore how an adversary can exploit it to reduce the attack dimension. Our proposed attack confines the perturbation to a very small region, preserving distortion and success rate while reducing the perturbation amplitude, and produces adversarial examples that are imperceptible to the human eye. Extensive evaluation shows that the explanation information provided by XAI hands the adversary a set of sensitive features. On the CIFAR-10 dataset, the scope of our attack is 90% smaller than that of the C&W attack, while maintaining a similar success rate and distortion. We also verify that our method remains effective even in the black-box setting.
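Although the abstract does not give the paper's exact optimization, the core idea, confining the perturbation search to the pixels an explanation method flags as sensitive, can be illustrated with a short sketch. Below is a minimal PyTorch illustration under stated assumptions: the saliency map (e.g., from Grad-CAM) is supplied externally, and the function name explanation_guided_attack, the PGD-style inner loop, and all parameter values are illustrative choices on our part, not the authors' implementation.

import torch
import torch.nn.functional as F

def explanation_guided_attack(model, x, label, saliency,
                              top_frac=0.1, eps=8 / 255,
                              alpha=1 / 255, steps=50):
    """Restrict a PGD-style perturbation to the pixels an explanation
    method marks as most sensitive (illustrative sketch only).

    model:    an nn.Module classifier, assumed in eval mode
    x:        input batch, shape (N, C, H, W), values in [0, 1]
    label:    ground-truth labels, shape (N,)
    saliency: (H, W) importance map, e.g. from Grad-CAM (assumed given)
    top_frac: fraction of pixels the adversary may perturb; a small
              value mirrors the reduced attack scope in the abstract
    """
    # Binary mask over the top `top_frac` most salient pixels.
    k = max(1, int(top_frac * saliency.numel()))
    thresh = saliency.flatten().topk(k).values.min()
    mask = (saliency >= thresh).float().view(1, 1, *saliency.shape)

    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        # Only the masked region of delta influences the loss.
        loss = F.cross_entropy(model(x + delta * mask), label)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # untargeted ascent step
            delta.clamp_(-eps, eps)             # L_inf budget
            delta.grad.zero_()
        model.zero_grad(set_to_none=True)       # discard parameter grads
    return (x + delta.detach() * mask).clamp(0.0, 1.0)

Since the abstract's comparison baseline is the C&W attack, replacing the cross-entropy ascent step with a C&W-style margin objective, and recomputing the saliency mask as the input changes, would likely bring this sketch closer to the described method, but both refinements are assumptions on our part.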
References
Molnar, C.: Interpretable Machine Learning (2020). https://www.lulu.com/
Tu, C.C., Ting, P., Chen, P.Y., et al.: AutoZOOM: autoencoder-based zeroth order optimization method for attacking black-box neural networks. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 742–749 (2019)
Aïvodji, U., Bolot, A., Gambs, S.: Model extraction from counterfactual explanations. arXiv preprint arXiv:2009.01884 (2020)
Amich, A., Eshete, B.: EG-Booster: explanation-guided booster of ML evasion attacks. arXiv preprint arXiv:2108.13930 (2021)
Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks. In: 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. IEEE (2017)
Elshawi, R., Al-Mallah, M.H., Sakr, S.: On the interpretability of machine learning-based model for predicting hypertension. BMC Med. Inform. Decis. Making 19(1), 1–32 (2019)
Shokri, R., Strobel, M., Zick, Y.: On the privacy risks of model explanations. In: AIES 2021: AAAI/ACM Conference on AI, Ethics, and Society. ACM (2021)
Garcia, W., Choi, J.I., Adari, S.K., et al.: Explainable black-box attacks against model-based authentication. arXiv preprint arXiv:1810.00024 (2018)
Moosavi-Dezfooli, S.M., Fawzi, A., Frossard, P.: DeepFool: a simple and accurate method to fool deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574–2582 (2016)
Milli, S., Schmidt, L., Dragan, A.D., et al.: Model reconstruction from model explanations. In: Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 1–9 (2019)
Ovadia, Y., Fertig, E., et al.: Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift (2019)
Papernot, N., McDaniel, P., Jha, S., et al.: The limitations of deep learning in adversarial settings. In: 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 372–387. IEEE (2016)
Ribeiro, M.T., Singh, S., Guestrin, C.: Why should I trust you?: explaining the predictions of any classifier. In: The 22nd ACM SIGKDD International Conference. ACM (2016)
Su, J., Vargas, D.V., Sakurai, K.: One pixel attack for fooling deep neural networks. IEEE Trans. Evol. Comput. 23(5), 828–841 (2019)
Severi, G., Meyer, J., Coull, S., et al.: Explanation-guided backdoor poisoning attacks against malware classifiers. In: 30th USENIX Security Symposium (USENIX Security 21), pp. 1487–1504 (2021)
Zhao, X., Zhang, W., Xiao, X., et al.: Exploiting explanations for model inversion attacks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 682–692 (2021)
Chen, P.Y., et al.: ZOO: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In: Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 15–26 (2017)
Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: a query-efficient black-box adversarial attack via random search. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12368, pp. 484–501. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58592-1_29
Du, Z., Liu, F., Yan, X.: Minimum adversarial examples. Entropy 24(3), 396 (2022)
Selvaraju, R.R., Cogswell, M., Das, A., et al.: Grad-CAM: visual explanations from deep networks via gradient-based localization. Int. J. Comput. Vis. 128(2), 336–359 (2020)
Wang, H., Wang, Z., Du, M., et al.: Score-CAM: score-weighted visual explanations for convolutional neural networks (2019)
Mothilal, R.K., Sharma, A., Tan, C.: Explaining machine learning classifiers through diverse counterfactual explanations. In: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 607–617 (2020)
Ilyas, A., Engstrom, L., Athalye, A., et al.: Query-efficient black-box adversarial examples (superceded). arXiv preprint arXiv:1712.07113 (2017)
Lee, H., Kim, S.T., Ro, Y.M.: Generation of multimodal justification using visual word constraint model for explainable computer-aided diagnosis. In: Suzuki, K., et al. (eds.) ML-CDS/IMIMIC -2019. LNCS, vol. 11797, pp. 21–29. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-33850-3_3
Meyes, R., de Puiseau, C.W., Posada-Moreno, A., Meisen, T.: Under the hood of neural networks: characterizing learned representations by functional neuron populations and network ablations. arXiv preprint arXiv:2004.01254 (2020)
Van Molle, P., De Strooper, M., Verbelen, T., Vankeirsbilck, B., Simoens, P., Dhoedt, B.: Visualizing convolutional neural networks to improve decision support for skin lesion classification. In: Stoyanov, D., et al. (eds.) MLCN/DLF/IMIMIC -2018. LNCS, vol. 11038, pp. 115–123. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-02628-8_13
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Grant No. 61966011) and the Hainan University Education and Teaching Reform Research Project (Grant No. HDJWJG01).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Liu, M., Liu, X., Yan, A., Qi, Y., Li, W. (2023). Explanation-Guided Minimum Adversarial Attack. In: Xu, Y., Yan, H., Teng, H., Cai, J., Li, J. (eds.) Machine Learning for Cyber Security. ML4CS 2022. Lecture Notes in Computer Science, vol. 13655. Springer, Cham. https://doi.org/10.1007/978-3-031-20096-0_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20095-3
Online ISBN: 978-3-031-20096-0