A study of learning likely data structure properties using machine learning models

Usman, Muhammad; Wang, Wenxi; Wang, Kaiyuan; Yelen, Cagdas; Dini, Nima; Khurshid, Sarfraz

doi:10.1007/s10009-020-00577-w

STTT
Special Issue: SPIN 2019
Published: 07 June 2020

Volume 22, pages 601–615, (2020)
International Journal on Software Tools for Technology Transfer

Abstract

Data structure properties are important for many testing and analysis tasks. For example, model checkers use these properties to find program faults. These properties are often written manually, which is error-prone and can lead to false alarms. This paper presents the results of controlled experiments performed using existing machine learning (ML) models on various data structures. These data structures are dynamic and reside on the program heap. We use ten data structure subjects and ten ML models to evaluate the learnability of data structure properties. The study reveals five key findings. One, most of the ML models perform well in learning data structure properties, but some, such as quadratic discriminant analysis and Gaussian naive Bayes, are not suitable for this task. Two, most of the ML models perform well even when trained on just 1% of the data samples. Three, the properties of certain data structures, such as binary heap and red-black tree, are more learnable than others. Four, there are no significant differences between the learnability of varied-size (i.e., up to a certain size) and fixed-size data structures. Five, there can be significant differences in performance depending on the encoding used. These findings show that using machine learning models to learn data structure properties is very promising. We believe that these properties, once learned, can be used as a run-time check of whether a program state at a particular point satisfies the learned property. Learned properties can also be employed in the future to automate static and dynamic analysis, which would enhance software testing and verification techniques.
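As a rough illustration of the setup described in the abstract, the sketch below frames a data structure property as a binary classification problem: each candidate structure is encoded as a feature vector, labeled valid or invalid by a ground-truth predicate, and a classifier is trained on a fraction of the samples. It is not the authors' implementation. The flat-array encoding, the structure size, the value range, the 25% training split, and all function names are assumptions chosen for illustration, and the hand-written k-nearest-neighbor classifier merely stands in for the existing ML models the study evaluates.

```python
import itertools
import random


def is_max_heap(a):
    """Ground-truth property: every parent is >= each of its children."""
    return all(a[(i - 1) // 2] >= a[i] for i in range(1, len(a)))


def make_dataset(size=5, max_val=3):
    """Enumerate every array of `size` values in 0..max_val and label it.

    The flat-array encoding is an assumption made for this sketch.
    """
    values = range(max_val + 1)
    return [(a, is_max_heap(a)) for a in itertools.product(values, repeat=size)]


def knn_predict(train, x, k=3):
    """Majority vote among the k training samples closest to x.

    Distance is squared Euclidean distance over the encoded vectors.
    """
    def dist(sample):
        return sum((p - q) ** 2 for p, q in zip(sample[0], x))

    nearest = sorted(train, key=dist)[:k]
    positives = sum(label for _, label in nearest)
    return positives * 2 > k


# Shuffle with a fixed seed so that the split is reproducible.
data = make_dataset()
rng = random.Random(0)
rng.shuffle(data)

# Train on a quarter of the samples and evaluate on the rest.
split = len(data) // 4
train, test = data[:split], data[split:]
accuracy = sum(knn_predict(train, x) == y for x, y in test) / len(test)
print(f"held-out accuracy: {accuracy:.3f}")
```

Once trained, the same classifier could serve as an approximate run-time check by calling, for example, `knn_predict(train, (3, 2, 3, 1, 0))` on an encoded program state. Because valid heaps are only a small share of all enumerated arrays, accuracy on its own can look better than the classifier really is, so a fuller experiment would also report measures such as precision and recall.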

Acknowledgements

We thank Rohan Garg, Emily Ginsburg, Michael Herrington, Tara Kuruvilla, Raghav Prakash and the anonymous reviewers for helpful feedback and comments. This research was partially supported by the US National Science Foundation under Grant Nos. CCF-1704790 and CCF-1718903.

Author information

Authors and Affiliations

University of Texas at Austin, Austin, TX, 78712, USA
Muhammad Usman, Wenxi Wang, Kaiyuan Wang, Cagdas Yelen, Nima Dini & Sarfraz Khurshid

Corresponding author

Correspondence to Muhammad Usman.

About this article

Cite this article

Usman, M., Wang, W., Wang, K. et al. A study of learning likely data structure properties using machine learning models. Int J Softw Tools Technol Transfer 22, 601–615 (2020). https://doi.org/10.1007/s10009-020-00577-w

Published: 07 June 2020
Issue Date: October 2020
DOI: https://doi.org/10.1007/s10009-020-00577-w
