Evaluating Surprise Adequacy for Deep Learning System Testing

Published: 29 March 2023

Abstract

The rapid adoption of Deep Learning (DL) systems in safety-critical domains such as medical imaging and autonomous driving urgently calls for ways to test their correctness and robustness. Borrowing from the concept of test adequacy in traditional software testing, existing work on testing DL systems initially investigated them from a structural point of view, leading to a number of coverage metrics. Our lack of understanding of the internal mechanism of Deep Neural Networks (DNNs), however, means that coverage metrics defined on the Boolean dichotomy of coverage are hard to interpret and understand intuitively. We propose the degree of out-of-distribution-ness of a given input as its adequacy for testing: the more surprising a given input is to the DNN under test, the more likely the system will show unexpected behavior for the input. We develop the concept of surprise into a test adequacy criterion, called Surprise Adequacy (SA). Intuitively, SA measures the difference between the behavior of the DNN for the given input and its behavior for the training data. We posit that a good test input should be sufficiently, but not overtly, surprising compared to the training dataset. This article evaluates SA using a range of DL systems from simple image classifiers to autonomous driving platforms, as well as both small and large data benchmarks ranging from MNIST to ImageNet. The results show that the SA value of an input can be a reliable predictor of the correctness of the model behavior. We also show that SA can be used to detect adversarial examples, and can be efficiently computed against large training datasets such as ImageNet using sampling.
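
As a concrete illustration of the metric, below is a minimal sketch of the distance-based variant of Surprise Adequacy (DSA), introduced in the authors' earlier ICSE 2019 work. It assumes that activation traces have already been extracted from a chosen layer of the DNN under test for both the training set and the new input; the function and variable names are illustrative, not taken from the authors' released implementation.

```python
import numpy as np

def dsa(train_ats, train_labels, input_at, predicted_class):
    """Distance-based Surprise Adequacy (DSA) for a single input.

    train_ats       : (N, D) activation traces of the training set (assumed precomputed)
    train_labels    : (N,) class labels of the training inputs
    input_at        : (D,) activation trace of the new input
    predicted_class : class predicted by the DNN for the new input
    """
    same = train_ats[train_labels == predicted_class]
    other = train_ats[train_labels != predicted_class]

    # Distance from the input to its nearest training trace of the predicted class.
    d_same = np.linalg.norm(same - input_at, axis=1)
    nearest_same = same[np.argmin(d_same)]
    dist_a = d_same.min()

    # Distance from that nearest same-class trace to the closest trace of any other class.
    dist_b = np.linalg.norm(other - nearest_same, axis=1).min()

    # Larger ratios indicate inputs closer to the class boundary, i.e., more surprising.
    return dist_a / dist_b
```

Inputs can then be ranked by their DSA scores, for instance to decide which unlabeled test inputs are most likely to reveal unexpected model behavior and should be inspected or labeled first.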

      Published In

      ACM Transactions on Software Engineering and Methodology, Volume 32, Issue 2
      March 2023
      946 pages
      ISSN: 1049-331X
      EISSN: 1557-7392
      DOI: 10.1145/3586025
      Editor: Mauro Pezzè

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 29 March 2023
      Online AM: 06 July 2022
      Accepted: 11 June 2022
      Revised: 17 March 2022
      Received: 22 November 2021
      Published in TOSEM Volume 32, Issue 2

      Author Tags

      1. Test adequacy
      2. deep learning systems

      Qualifiers

      • Research-article

      Funding Sources

      • National Research Foundation of Korea (NRF)
      • Institute of Information & communications Technology Planning & Evaluation
      • Swedish Scientific Council

      Article Metrics

      • Downloads (Last 12 months): 333
      • Downloads (Last 6 weeks): 19
      Reflects downloads up to 03 Mar 2025

      Cited By

      • (2025) Markov model based coverage testing of deep learning software systems. Information and Software Technology, 179, 107628. DOI: 10.1016/j.infsof.2024.107628. Online publication date: Mar-2025.
      • (2024) Bridging the Gap between Real-world and Synthetic Images for Testing Autonomous Driving Systems. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 732-744. DOI: 10.1145/3691620.3695067. Online publication date: 27-Oct-2024.
      • (2024) Neuron Semantic-Guided Test Generation for Deep Neural Networks Fuzzing. ACM Transactions on Software Engineering and Methodology, 34(1), 1-38. DOI: 10.1145/3688835. Online publication date: 14-Aug-2024.
      • (2024) Neuron Sensitivity-Guided Test Case Selection. ACM Transactions on Software Engineering and Methodology, 33(7), 1-32. DOI: 10.1145/3672454. Online publication date: 12-Jun-2024.
      • (2024) Test Selection for Deep Neural Networks using Meta-Models with Uncertainty Metrics. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 678-690. DOI: 10.1145/3650212.3680312. Online publication date: 11-Sep-2024.
      • (2024) Test Optimization in DNN Testing: A Survey. ACM Transactions on Software Engineering and Methodology, 33(4), 1-42. DOI: 10.1145/3643678. Online publication date: 27-Jan-2024.
      • (2024) TEASMA: A Practical Methodology for Test Adequacy Assessment of Deep Neural Networks. IEEE Transactions on Software Engineering, 1-23. DOI: 10.1109/TSE.2024.3482984. Online publication date: 2024.
      • (2024) Defect-based Testing for Safety-critical ML Components. 2024 IEEE 35th International Symposium on Software Reliability Engineering Workshops (ISSREW), 255-262. DOI: 10.1109/ISSREW63542.2024.00088. Online publication date: 28-Oct-2024.
      • (2024) DeepFeature: Guiding adversarial testing for deep neural network systems using robust features. Journal of Systems and Software, 112201. DOI: 10.1016/j.jss.2024.112201. Online publication date: Aug-2024.
      • (2024) Neuron importance-aware coverage analysis for deep neural network testing. Empirical Software Engineering, 29(5). DOI: 10.1007/s10664-024-10524-x. Online publication date: 25-Jul-2024.
