A study of learning likely data structure properties using machine learning models

  • STTT
  • Special Issue: SPIN 2019
International Journal on Software Tools for Technology Transfer

Abstract

Data structure properties are important for many testing and analysis tasks. For example, model checkers use these properties to find program faults. These properties are often written manually, which can be error-prone and lead to false alarms. This paper presents the results of controlled experiments performed using existing machine learning (ML) models on various data structures. These data structures are dynamic and reside on the program heap. We use ten data structure subjects and ten ML models to evaluate the learnability of data structure properties. The study reveals five key findings. First, most of the ML models perform well in learning data structure properties, but some, such as quadratic discriminant analysis and Gaussian naive Bayes, are not suitable for this task. Second, most of the ML models achieve high performance even when trained on just 1% of the data samples. Third, the properties of certain data structures, such as binary heap and red-black tree, are more learnable than others. Fourth, there are no significant differences between the learnability of varied-size (i.e., up to a certain size) and fixed-size data structures. Fifth, there can be significant differences in performance depending on the encoding used. These findings show that using machine learning models to learn data structure properties is very promising. We believe that these properties, once learned, can be used to provide a run-time check of whether a program state at a particular point satisfies the learned property. Learned properties can also be employed in the future to automate static and dynamic analysis, which would enhance software testing and verification techniques.
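To make the idea of learning a data structure property concrete, the following is a minimal sketch rather than the study's actual setup. It assumes a fixed-size, array-encoded binary max-heap as the subject, enumerates every candidate array as a labeled sample, and uses a small hand-written k-nearest-neighbour classifier chosen only to keep the example dependency-free. The function names, the encoding, the value range, and the training fraction are all illustrative assumptions.

```python
import itertools
import random


def is_max_heap(values):
    """Ground-truth property: in an array-encoded binary heap,
    every element is less than or equal to its parent."""
    return all(values[(i - 1) // 2] >= values[i] for i in range(1, len(values)))


def make_dataset(size=4, max_value=3):
    """Enumerate every candidate array of a fixed size (assumed encoding)
    and label it with whether the property holds."""
    return [(vec, is_max_heap(vec))
            for vec in itertools.product(range(max_value + 1), repeat=size)]


def predict_knn(train, vec, k=3):
    """Majority vote over the k training samples nearest to vec
    (squared Euclidean distance on the encoded vector)."""
    nearest = sorted(
        train, key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], vec))
    )[:k]
    votes = sum(1 for _, label in nearest if label)
    return votes * 2 > len(nearest)


def evaluate(train_fraction=0.25, seed=0):
    """Train on a fraction of the samples and return held-out accuracy."""
    data = make_dataset()
    random.Random(seed).shuffle(data)
    cut = max(1, int(len(data) * train_fraction))
    train, test = data[:cut], data[cut:]
    correct = sum(predict_knn(train, vec) == label for vec, label in test)
    return correct / len(test)


if __name__ == "__main__":
    print(f"held-out accuracy: {evaluate():.3f}")
```

Once trained, a classifier of this kind could play the role of the run-time check mentioned above: encode the program state at a given point and ask the model whether it looks like a valid structure. Lowering `train_fraction` is one way to explore how little training data such a model needs on this toy subject.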



Acknowledgements

We thank Rohan Garg, Emily Ginsburg, Michael Herrington, Tara Kuruvilla, Raghav Prakash and the anonymous reviewers for helpful feedback and comments. This research was partially supported by the US National Science Foundation under Grant Nos. CCF-1704790 and CCF-1718903.

Author information

Corresponding author

Correspondence to Muhammad Usman.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Usman, M., Wang, W., Wang, K. et al. A study of learning likely data structure properties using machine learning models. Int J Softw Tools Technol Transfer 22, 601–615 (2020). https://doi.org/10.1007/s10009-020-00577-w

  • DOI: https://doi.org/10.1007/s10009-020-00577-w
