Abstract
Deep Neural Network (DNN) models are widely used in cutting-edge domains such as medical diagnostics and autonomous driving, making the need to test them thoroughly increasingly urgent. Recent research proposes various structural and non-structural coverage criteria to measure test adequacy. Structural coverage criteria quantify the degree to which a test suite covers the internal elements of a DNN model; however, they convey little information about individual inputs and correlate only weakly with defect detection. Existing non-structural coverage criteria, in turn, are unaware of how important individual neurons are to the model's decision-making. This paper addresses these limitations by proposing novel non-structural coverage criteria. By tracing each neuron's cumulative contribution to the final decision over the training set, it identifies the important neurons of a DNN model. A novel metric then quantifies how the behavior of these important neurons on a test input differs from their behavior on the training set, providing a measure at the granularity of individual test inputs. Building on this metric, two non-structural coverage criteria quantify test adequacy by examining differences in important-neuron behavior between the test suite and the training set. Empirical evaluation on image datasets demonstrates that the proposed metric outperforms existing non-structural adequacy metrics, improving accuracy in capturing error-revealing test inputs by up to 14.7%. Compared with state-of-the-art coverage criteria, the proposed criteria are also more sensitive to errors, including natural errors and adversarial examples.
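The pipeline the abstract describes, identifying important neurons from training-set behavior and then measuring how a test input's important-neuron activations deviate from the training set, can be illustrated with a minimal sketch. This is not the paper's exact algorithm (the paper traces cumulative contribution to the final decision, and the metric names and helper functions below are hypothetical); it only shows the general shape of the two steps, assuming activations have already been extracted as NumPy matrices.

```python
import numpy as np

def important_neurons(train_acts, k):
    """Rank neurons by cumulative activation magnitude over the training
    set and return the indices of the top-k. This is a stand-in proxy for
    the paper's contribution-tracing step.
    train_acts: (n_samples, n_neurons) activation matrix."""
    contribution = np.abs(train_acts).sum(axis=0)
    return np.argsort(contribution)[::-1][:k]

def behavior_difference(test_act, train_acts, idx):
    """Difference in important-neuron behavior between one test input and
    the training set: here, Euclidean distance from the test input's
    important-neuron activations to its nearest training-set neighbor.
    test_act: (n_neurons,) activation vector of a single test input."""
    dists = np.linalg.norm(train_acts[:, idx] - test_act[idx], axis=1)
    return dists.min()

# Toy usage: 2 training inputs, 3 neurons.
train_acts = np.array([[1.0, 0.0, 3.0],
                       [2.0, 0.0, 4.0]])
idx = important_neurons(train_acts, k=2)      # neurons 2 and 0 dominate
score = behavior_difference(np.array([1.0, 0.0, 3.0]), train_acts, idx)
```

A test input whose important-neuron activations lie far from anything seen during training would receive a large score, which is the intuition behind using such a metric to flag error-revealing inputs.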











Data Availability Statement
The datasets generated and analyzed during the current study are available in the following GitHub repository: https://github.com/TestingAIGroup/DeepLID.
Acknowledgements
I would like to declare on behalf of my co-authors that the work described was original research that has not been published previously, and is not under consideration for publication elsewhere, in whole or in part. All the authors listed have approved the manuscript that is enclosed. This work is supported by the Key Program of the National Natural Science Foundation of China (No. U224120044), the National Natural Science Foundation of China (No. 62202223), the Natural Science Foundation of Jiangsu Province (No. BK20220881), the Open Fund of the State Key Laboratory for Novel Software Technology (No. KFKT2024B27), and the Fundamental Research Funds for the Central Universities (No. NT2024020).
Ethics declarations
Conflicts of interest
No conflict of interest exists in the submission of this manuscript, and the manuscript has been approved by all authors for publication.
Additional information
Communicated by: Shin Yoo
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Guo, H., Tao, C. & Huang, Z. Neuron importance-aware coverage analysis for deep neural network testing. Empir Software Eng 29, 118 (2024). https://doi.org/10.1007/s10664-024-10524-x