Abstract
Data structure properties are important for many testing and analysis tasks. For example, model checkers use these properties to find program faults. These properties are often written manually which can be error prone and lead to false alarms. This paper presents the results of controlled experiments performed using existing machine learning (ML) models on various data structures. These data structures are dynamic and reside on the program heap. We use ten data structure subjects and ten ML models to evaluate the learnability of data structure properties. The study reveals five key findings. One, most of the ML models perform well in learning data structure properties, but some of the ML models such as quadratic discriminant analysis and Gaussian naive Bayes are not suitable for learning data structure properties. Two, most of the ML models have high performance even when trained on just 1% of data samples. Three, certain data structure properties such as binary heap and red black tree are more learnable than others. Four, there are no significant differences between the learnability of varied-size (i.e., up to a certain size) and fixed-size data structures. Five, there can be significant differences in performance based on the encoding used. These findings show that using machine learning models to learn data structure properties is very promising. We believe that these properties, once learned, can be used to provide a run-time check to see whether a program state at a particular point satisfies the learned property. Learned properties can also be employed in the future to automate static and dynamic analysis, which would enhance software testing and verification techniques.
Similar content being viewed by others
References
Altman, N.S.: An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat. 46(3), 175–185 (1992)
Bacaër, N.: Verhulst and the logistic equation 01, 1838 (2011)
Bodik, R.: Program synthesis: opportunities for the next decade. In: International Conference on Functional Programming, pp. 1–1 (2015)
Boyapati, C., Khurshid, S., Marinov, D.: Korat: automated testing based on Java predicates. In: International Symposium on Software Testing and Analysis, pp. 123–133 (2002)
Briand, L.C., Labiche, Y., Liu, X.: Using machine learning to support debugging with tarantula. In: International Symposium on Software Reliability, pp. 137–146 (2007)
Brouwer, A.E., Haemers, W.H.: Spectra of Graphs. Springer, New York (2012)
Çelik, A., Pai, S., Khurshid, S., Gligoric, M.: Bounded exhaustive test-input generation on GPUs. PACMPL 1(OOPSLA), 94:1–94:25 (2017)
Chen, Y.-F., Hong, C.-D., Lin, A.W., Rümmer, P.: Learning to prove safety over parameterised concurrent systems. In: Formal Methods in Computer Aided Design, pp. 76–83 (2017)
Clarke, E.M., Kroening, D., Yorav, K.: Behavioral consistency of C and verilog programs using bounded model checking. In: Design Automation Conference, pp. 368–371 (2003)
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
Csallner, C., Tillmann, N., Smaragdakis, Y.: DySy: Dynamic symbolic execution for invariant inference. In: International Conference on Software Engineering, pp. 281–290 (2008)
de Moura, L.M., Kong, S., Avigad, J., van Doorn, F., von Raumer, J.: The lean theorem prover (system description). In: International Conference on Automated Deduction, pp. 378–388 (2015)
Demsky, B., Rinard, M.C.: Automatic detection and repair of errors in data structures. In: Conference on Object-Oriented Programming Systems, Languages and Applications, pp. 78–95 (2003)
Dillig, I., Dillig, T., Li, B., McMillan, K.: Inductive invariant generation via abductive inference. In: International Conference on Object Oriented Programming Systems Languages and Applications, pp. 443–456 (2013)
Dini, N., Yelen, C., Alrmaih, Z., Kulkarni, A., Khurshid, S.: Korat-API: a framework to enhance korat to better support testing and reliability techniques. In: International Symposium on Applied Computing, pp. 1934–1943 (2018)
Dini, N., Yelen, C., Gligoric, M., Khurshid, S.: Extension-aware automated testing based on imperative predicates. In: Conference on Software Testing, Validation and Verification, pp. 25–36 (2019)
Dini, N., Yelen, C., Khurshid, S.: Optimizing parallel Korat using invalid ranges. In: International Symposium on Model Checking of Software, pp. 182–191 (2017)
Elkarablieh, B., Garcia, I., Suen, Y.L., Sarfraz, K.: Assertion-based repair of complex data structures. In: International Conference on Automated Software Engineering, pp. 64–73 (2007)
Ernst, M.D., Czeisler, A., Griswold, W.G., Notkin, D.: Quickly detecting relevant program invariants. In: International Conference on Software Engineering, pp. 449–458 (2000)
Ernst, M.D., Perkins, J.H., Guo, P.J., McCamant, S., Pacheco, C., Tschantz, M.S., Xiao, C.: The Daikon system for dynamic detection of likely invariants. Sci. Comput. Program. 69(1–3), 35–45 (2007)
Facundo, M., Degiovanni, R., Ponzio, P., Regis, G., Aguirre, N., Frias, M.F.: Training binary classifiers as data structure invariants. In: International Conference on Software Engineering, pp. 759–770 (2019)
Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55(1), 119–139 (1997)
Friedman, J.H.: Greedy function approximation: A gradient boosting machine. Ann. Statist. 29(5), 1189–1232 (2001)
Garg, P., Neider, D., Madhusudan, P., Roth, D.: Learning invariants using decision trees and implication counterexamples. In: Symposium on Principles of Programming Languages, pp. 499–512 (2016)
Godefroid, P.: Model checking for programming languages using verisoft. In: Symposium on Principles of Programming Languages, pp. 174–186 (1997)
Gomes, C.P., Sabharwal, A., Selman, B.: Model counting (2008)
Gulwani, S.: Dimensions in program synthesis. In: International Symposium on Principles and Practice of Declarative Programming, pp. 13–24 (2010)
Guo, C., Berkhahn, F.: Entity embeddings of categorical variables. CoRR (2016). arXiv:1604.06737
Hernandez, J., Carrasco-Ochoa, J.A., Martínez-Trinidad, J.F.: An empirical study of oversampling and undersampling for instance selection methods on imbalance datasets. In: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, pp. 262–269. Springer (2013)
Ho, T.K.: Random decision forests. In: International Conference on Document Analysis and Recognition (1995)
Hoder, K., Kovács, L., Voronkov, A.: Invariant generation in vampire. In: Tools and Algorithms for the Construction and Analysis of Systems, pp. 60–64. Springer (2011)
Jackson, D., Vaziri, M.: Finding bugs with a constraint solver. In: International Symposium on Software Testing and Analysis, pp. 14–25 (2000)
Jha, S., Gulwani, S., Seshia, S.A., Tiwari, A.: Oracle-guided component-based program synthesis. In: International Conference on Software Engineering, pp. 215–224 (2010)
Jump, M., McKinley, K.S.: Dynamic shape analysis via degree metrics. In: International Symposium on Memory Management, pp. 119–128 (2009)
Kazemi, S.M., Poole, D.: Relnn: A deep neural model for relational learning (2017)
Ke, Y., Stolee, K.T, Goues, C.L., Brun, Y.: Repairing programs with semantic code search (T). In: International Conference on Automated Software Engineering, pp. 295–306 (2015)
Korat GitHub repository. https://github.com/korattest/korat
Korel, B.: Automated software test data generation. Trans. Softw. Eng. 16(8), 870–879 (1990)
Liskov, B., Guttag, J.V.: Program Development in Java-Abstraction, Specification, and Object-Oriented Design. Addison-Wesley, Boston (2001)
Malik, M., Pervaiz, A., Uzuncaova, E., Khurshid, S.: Deryaft: A tool for generating representation invariants of structurally complex data. In: International Conference on Software Engineering, pp. 859–862 (2008)
Malik, M.Z.: Dynamic shape analysis of program heap using graph spectra: NIER track. In: International Conference on Software Engineering, pp. 952–955 (2011)
Manna, Z., Waldinger, R.: A deductive approach to program synthesis. ACM Trans. Program. Lang. Syst. 2(1), 90–121 (1980)
McMillan, K.L.: Quantified invariant generation using an interpolating saturation prover. In: Tools and Algorithms for the Construction and Analysis of Systems, pp. 413–427 (2008)
Mera, E., Lopez-García, P., Hermenegildo, M.: Integrating software testing and run-time checking in an assertion verification framework. In: Logic Programming, pp. 281–295. Springer (2009)
Meyer, B.: Class invariants: concepts, problems, solutions. CoRR (2016). arXiv:1608.07637
Misailovic, S., Milicevic, A., Petrovic, N., Khurshid, S., Marinov, D.: Parallel test generation and execution with Korat. In: Symposium on the Foundations of Software Engineering, pp. 135–144 (2007)
Møller, A., Schwartzbach, M.I.: The pointer assertion logic engine. In: Conference on Programming Language Design and Implementation, pp. 221–231 (2001)
Murtagh, F.: Multilayer perceptrons for classification and regression. Neurocomputing 2(5), 183–197 (1991)
Pacheco, C., Lahiri, S.K., Ernst, M.D., Ball, T.: Feedback-directed random test generation. In: International Conference on Software Engineering, pp. 75–84 (2007)
Provost, F.: Machine learning from imbalanced data sets 101. In: Proceedings of the AAAI Workshop on Imbalanced Data Sets, pp. 1–3 (2000)
Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986)
Reynolds, J.C.: Separation logic: a logic for shared mutable data structures. In: Symposium on Logic in Computer Science, pp. 55–74 (2002)
Rish, I.: An empirical study of the naive bayes classifier. In: IJCAI, pp. 3 (2001)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)
Sagiv, S., Reps, T.W., Wilhelm, R.: Parametric shape analysis via 3-valued logic. In: Symposium on Principles of Programming Languages, pp. 105–118 (1999)
Sankaranarayanan, S., Sipma, H.B., Manna, Z.: Non-linear loop invariant generation using gröbner bases. In: Symposium on Principles of Programming Languages, pp. 318–329 (2004)
Scikit-Learn Library. https://scikit-learn.org/stable/. Accessed 18 Apr 2019
Si, X., Dai, H., Raghothaman, M., Naik, M., Le, S.: Learning loop invariants for program verification. In: Conference on Neural Information Processing Systems, pp. 7762–7773 (2018)
Si, X., Dai, H., Raghothaman, M., Naik, M., Le, S.: Learning loop invariants for program verification. In: Advances in Neural Information Processing Systems, pp. 7751–7762 (2018)
Siddiqui, J.H., Khurshid, S.: PKorat: Parallel generation of structurally complex test inputs. In: International Conference on Software Testing Verification and Validation, pp. 250–259 (2009)
Singh, S., Zhang, M., Khurshid, S.: Learning guided enumerative synthesis for superoptimization. In: International Symposium on Model Checking of Software, p. 172–192 (2019)
Solar-Lezama, A.: Program Synthesis by Sketching. PhD thesis (2008)
Usman, M., Wang, W., Vasic, M., Wang, K., Vikalo, H., Khurshid, S.: A study of the learnability of relational properties. In: 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). To appear(2020)
Usman, M., Wang, W., Wang, K., Yelen, C., Dini, N., Khurshid, S.: A study of learning data structure invariants using off-the-shelf tools. In: International Symposium on Model Checking of Software, pp. 226–243 (2019)
Valiant, L.G.: A theory of the learnable. CACM 27(11) (1984)
Vapnik, V.N., Chervonenkis, A.Ya.: On the uniform convergence of relative frequencies of events to their probabilities. In: Measures of Complexity: Festschrift for Alexey Chervonenkis. Springer International Publishing, Cham (2015). https://doi.org/10.1007/978-3-319-21852-6_3
Visser, W., Havelund, K., Brat, G.P., Park, S.: Model checking programs. In: International Conference on Automated Software Engineering, pp. 3–12 (2000)
Wu, W., Mallet, Y., Walczak, B., Penninckx, W., Massart, D.L., Heuerding, S., Erni, F.: Comparison of regularized discriminant analysis linear discriminant analysis and quadratic discriminant analysis applied to nir data. Anal. Chim. Acta 329(3), 257–265 (1996)
Zee, K., Kuncak, V., Rinard, M.C.: Full functional verification of linked data structures. In: Conference on Programming Language Design and Implementation, pp. 349–361 (2008)
Acknowledgements
We thank Rohan Garg, Emily Ginsburg, Michael Herrington, Tara Kuruvilla, Raghav Prakash and the anonymous reviewers for helpful feedback and comments. This research was partially supported by the US National Science Foundation under Grant Nos. CCF-1704790 and CCF-1718903.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Usman, M., Wang, W., Wang, K. et al. A study of learning likely data structure properties using machine learning models. Int J Softw Tools Technol Transfer 22, 601–615 (2020). https://doi.org/10.1007/s10009-020-00577-w
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10009-020-00577-w