ABSTRACT
Adversarial images are specifically designed to fool neural networks into making wrong decisions about what they are looking at, which severely degrades neural network accuracy. Recent empirical and theoretical evidence suggests that robust neural network models tend to have more interpretable gradients. We therefore speculate that improving the interpretability of a model's gradients may in turn improve its robustness. We use two methods to add gradient-dependent penalty terms to the loss function of neural network models, and both improve robustness. The first method adds a fused lasso penalty on the saliency maps to the loss function, which encourages the saliency maps to be arranged in a natural (sparse and piecewise-smooth) way and thus improves their interpretability, and replaces ReLU with a gradient-enhanced ReLU to strengthen the effect of the regularization term on the saliency maps. The second method adds a cosine-similarity penalty between the input gradients and the image contours to the loss function, constraining the input gradients to approximate the image contours. This method has a biological motivation: the human visual system relies on contour information to recognize images. Both methods improve the interpretability of the model's gradients; the first method outperforms most regularization methods except adversarial training on MNIST, and the second even exceeds adversarial training under white-box attacks on CIFAR-10 and CIFAR-100.
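The paper's reference implementation is not reproduced here; the following is a minimal PyTorch sketch of how the two penalty terms described above might be attached to a standard cross-entropy loss. The helper names (saliency_map, fused_lasso_penalty, sobel_contour, total_loss), the loss weights, and the Sobel-based contour extractor are illustrative assumptions, and the sketch uses ordinary ReLU gradients rather than the gradient-enhanced ReLU mentioned above.

```python
import torch
import torch.nn.functional as F

def saliency_map(model, x, y):
    """Input-gradient saliency map: d(loss)/d(x) for the true labels."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    # create_graph=True keeps the graph so the penalties below are trainable
    grad, = torch.autograd.grad(loss, x, create_graph=True)
    return grad

def fused_lasso_penalty(sal):
    """Fused-lasso style term: sparsity plus smoothness of the saliency map."""
    sparsity = sal.abs().mean()
    dh = (sal[..., :, 1:] - sal[..., :, :-1]).abs().mean()  # horizontal differences
    dv = (sal[..., 1:, :] - sal[..., :-1, :]).abs().mean()  # vertical differences
    return sparsity + dh + dv

def sobel_contour(x):
    """Rough image contour via per-channel Sobel filtering (an assumption;
    the paper may use a different contour extractor)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]], device=x.device)
    ky = kx.t()
    c = x.shape[1]
    kx = kx.view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    ky = ky.view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    gx = F.conv2d(x, kx, padding=1, groups=c)
    gy = F.conv2d(x, ky, padding=1, groups=c)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)

def total_loss(model, x, y, lam_fused=0.01, lam_cos=1.0):
    """Cross-entropy plus the two gradient-dependent penalties
    (in the paper the two methods are used separately)."""
    ce = F.cross_entropy(model(x), y)
    sal = saliency_map(model, x, y)
    fused = fused_lasso_penalty(sal)                     # method 1
    contour = sobel_contour(x)
    cos = F.cosine_similarity(sal.flatten(1), contour.flatten(1), dim=1).mean()
    # method 2: encourage alignment between input gradients and contours,
    # so the cosine similarity is subtracted from the loss
    return ce + lam_fused * fused - lam_cos * cos
```

In training, total_loss would simply replace the plain cross-entropy in the optimization loop; gradients of the penalty terms flow back to the model parameters through the second-order graph created in saliency_map.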