research-article
DOI: 10.1145/3609437.3609457

Practical Accuracy Evaluation for Deep Learning Systems via Latent Representation Discrepancy

Published: 05 October 2023

Abstract

As deep learning (DL) systems are widely deployed in safety-critical scenarios, their quality and reliability have raised growing concerns. Assuring the quality and evaluating the accuracy of DL models is challenging because, unlike traditional software, DL systems rely on large amounts of labeled data for training and evaluation. A DL model's behavior can vary across datasets with different distributions, and in practical applications the potential distribution shift between training and usage scenarios may degrade the model's performance and introduce extra vulnerability. Although several neuron-coverage testing criteria have been proposed to assist in testing DL systems, they remain limited by the amount of labeled data available, and manually labeling test data collected from real-world application scenarios is time-consuming and costly.
In this paper, we propose a novel testing metric, LRD (Latent Representation Discrepancy), to evaluate the practical accuracy of deep learning systems without requiring the ground truth of test data. LRD extracts latent representations from the model as it processes input data, constructs representation patterns based on the training dataset, and uses optimal transport theory to compare the model's behavior on real-world test data against its behavior on the training and out-of-distribution (OOD) sets. The paper further introduces two algorithms powered by these latent representations: OOD data detection and LRD-guided test selection for model retraining. Experimental results show that LRD's evaluation results have a significant positive correlation with the actual accuracy of the model, and that the proposed algorithms are more effective than related OOD detection and test prioritization techniques.
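The optimal-transport comparison the abstract describes can be illustrated with the 1-D Wasserstein distance. The sketch below is a minimal illustration, not the paper's actual LRD metric: the `lrd_score` function, the per-dimension averaging, and the synthetic Gaussian "latent features" are all assumptions made for the example. It shows the core intuition that a distribution shift in latent space yields a larger discrepancy score.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def lrd_score(train_feats, test_feats):
    """Hypothetical discrepancy score: 1-D Wasserstein distance between
    latent-feature distributions, computed per dimension and averaged."""
    dists = [wasserstein_distance(train_feats[:, d], test_feats[:, d])
             for d in range(train_feats.shape[1])]
    return float(np.mean(dists))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 8))   # stand-in for training-set latents
near  = rng.normal(0.0, 1.0, size=(500, 8))   # test data from the same distribution
far   = rng.normal(3.0, 1.0, size=(500, 8))   # test data with a mean shift (OOD-like)

# A shifted test distribution produces a larger discrepancy than an
# in-distribution one, which is the signal LRD-style metrics rely on.
print(lrd_score(train, near) < lrd_score(train, far))  # True
```

A real implementation would use representations extracted from the model's hidden layers rather than raw synthetic vectors, and the paper's actual construction of representation patterns differs from this per-dimension average.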



Published In

Internetware '23: Proceedings of the 14th Asia-Pacific Symposium on Internetware
August 2023, 332 pages
ISBN: 9798400708947
DOI: 10.1145/3609437

Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. deep learning systems
    2. deep neural network testing
    3. quality assurance
    4. test optimization

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    Internetware 2023

    Acceptance Rates

    Overall Acceptance Rate 55 of 111 submissions, 50%
