Abstract
Explanations can increase the transparency of neural networks and make them more trustworthy. However, can we really trust the explanations generated by existing explanation methods? If an explanation method is not stable, the credibility of its explanations is greatly reduced. Previous studies have seldom considered this important issue. To this end, this paper proposes a new evaluation framework for assessing the stability of typical feature attribution explanation methods via textual adversarial attack. Our framework generates adversarial examples that preserve the textual semantics of the original inputs: the original model produces the same outputs on them, yet most current explanation methods derive completely different explanations. Under this framework, we test five classical explanation methods and report their performance on several stability-related metrics. Experimental results show that our evaluation is effective and reveals the stability of existing explanation methods.
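For intuition, the sketch below shows one way such a stability score could be computed once an adversarial example is available: the feature attributions of the original and the adversarial input are compared with Spearman's rank correlation, one common stability-related metric. The helpers `model` (returning a predicted label) and `explainer` (returning one importance score per token) are hypothetical placeholders for the classifier and attribution method under evaluation; this is an illustrative sketch, not the paper's implementation.

```python
from scipy.stats import spearmanr

def explanation_stability(model, explainer, original_tokens, adversarial_tokens):
    """Score how stable an explanation is under a label-preserving word substitution.

    `model` maps a token list to a predicted label; `explainer` maps a token list
    to one importance score per token. Both are hypothetical placeholders.
    """
    # The comparison is only meaningful if the adversarial example keeps the
    # model's output unchanged (the constraint described in the abstract).
    assert model(original_tokens) == model(adversarial_tokens)

    orig_scores = explainer(original_tokens)
    adv_scores = explainer(adversarial_tokens)

    # Spearman rank correlation between the two attribution vectors:
    # values near 1 mean the explanation barely changed (stable);
    # values near 0 or negative mean the token ranking was reordered (unstable).
    rho, _ = spearmanr(orig_scores, adv_scores)
    return rho
```

Because a word-level substitution keeps the token count fixed, the two attribution vectors align position by position, which is what makes a rank correlation between them meaningful.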
Notes
- 1.
Feature attribution based explanation methods show the importance of each token to the prediction. Therefore, paraphrase-based attack methods do not fit, because they would modify too many parts of the input at once.
- 2.
Black-box means that only the outputs of the model can be used during the attack. Some explanation methods, such as gradient-based methods, are not black-box; however, whether an explanation method is black-box is independent of our black-box attack method.
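To make the black-box constraint concrete, the following sketch performs a greedy word-level substitution attack that queries only the model's output probabilities, never its gradients or internal states. `predict_proba` and `get_synonyms` are hypothetical helpers, and the greedy confidence-lowering heuristic is a simplification rather than the exact attack used in the paper.

```python
def black_box_substitute(tokens, predict_proba, get_synonyms, label):
    """Greedily swap words for synonyms while keeping the predicted label fixed.

    Only `predict_proba` (the model's output distribution) is queried during
    the attack; model internals such as gradients are never accessed.
    """
    adversarial = list(tokens)
    for i, word in enumerate(tokens):
        best_word = word
        best_conf = predict_proba(adversarial)[label]
        for candidate in get_synonyms(word):
            trial = adversarial.copy()
            trial[i] = candidate
            probs = predict_proba(trial)
            predicted = max(range(len(probs)), key=probs.__getitem__)
            # Keep the substitution only if the predicted label is unchanged;
            # among such candidates, prefer the one that most lowers the model's
            # confidence, which tends to perturb the attribution scores the most.
            if predicted == label and probs[label] < best_conf:
                best_word, best_conf = candidate, probs[label]
        adversarial[i] = best_word
    return adversarial
```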
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Nos. 61922085, 61831022, 61906196), the Key Research Program of the Chinese Academy of Sciences (Grant No. ZDBS-SSW-JSC006), and the Youth Innovation Promotion Association CAS. This work was also supported by the Yunnan Provincial Major Science and Technology Special Plan Projects under Grant 202103AA080015.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yang, Z., Zhang, Y., Jiang, Z., Ju, Y., Zhao, J., Liu, K. (2022). Can We Really Trust Explanations? Evaluating the Stability of Feature Attribution Explanation Methods via Adversarial Attack. In: Sun, M., et al. Chinese Computational Linguistics. CCL 2022. Lecture Notes in Computer Science, vol. 13603. Springer, Cham. https://doi.org/10.1007/978-3-031-18315-7_18
DOI: https://doi.org/10.1007/978-3-031-18315-7_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18314-0
Online ISBN: 978-3-031-18315-7
eBook Packages: Computer Science, Computer Science (R0)