Programming with “Big Code”

  • Conference paper
Programming Languages and Systems (APLAS 2015)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 9458)

Abstract

The vast amount of code available on the web is increasing on a daily basis. Open-source hosting sites such as GitHub contain billions of lines of code. Community question-answering sites provide millions of code snippets with corresponding text and metadata. The amount of code available in executable binaries is even greater. In this talk, I will cover recent research trends on leveraging such “big code” for program analysis, program synthesis and reverse engineering. We will consider a range of semantic representations based on symbolic automata [11, 15], tracelets [3], numerical abstractions [13, 14], and textual descriptions [1, 22], as well as different notions of code similarity based on these representations.

To leverage these semantic representations, we will consider a number of prediction techniques, including statistical language models [19, 20], variable order Markov models [2], and other distance-based and model-based sequence classification techniques.
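As a rough illustration of how such sequence models can drive code completion, the sketch below counts which API call follows each context in a set of call sequences and predicts the next call from the longest context it has seen, backing off to shorter contexts otherwise. This is a simplified stand-in, not the method of [2] or [19]: the corpus, the function names `train` and `predict`, and the `max_order` parameter are all hypothetical and chosen only for the example.

```python
from collections import Counter, defaultdict


def train(sequences, max_order=2):
    """Count which token follows each context of length 0..max_order."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i, token in enumerate(seq):
            for k in range(max_order + 1):
                if i - k < 0:
                    break  # not enough history for a context of length k
                context = tuple(seq[i - k:i])
                counts[context][token] += 1
    return counts


def predict(counts, prefix, max_order=2):
    """Return the most frequent next token after the longest seen context."""
    for k in range(min(max_order, len(prefix)), -1, -1):
        context = tuple(prefix[len(prefix) - k:])
        if context in counts:
            return counts[context].most_common(1)[0][0]
    return None  # only reached when the model was trained on no data


# Hypothetical corpus of API call sequences (not taken from the paper).
corpus = [
    ["open", "read", "close"],
    ["open", "read", "read", "close"],
    ["open", "write", "close"],
    ["connect", "send", "close"],
]
counts = train(corpus)

print(predict(counts, ["open"]))             # read
print(predict(counts, ["open", "write"]))    # close
print(predict(counts, ["connect", "read"]))  # close (backs off to the context "read")
```

Choosing the longest matching context is what makes the effective order vary per prediction. A practical model would also smooth the counts and return a ranked list of candidates rather than a single token.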

Finally, I will show applications of these techniques including semantic code search in both source code and stripped binaries, code completion and reverse engineering.

References

  1. http://like2drops.com

  2. Begleiter, R., El-Yaniv, R., Yona, G.: On prediction using variable order Markov models. J. Artif. Intell. Res. 22, 385–421 (2004)

  3. David, Y., Yahav, E.: Tracelet-based code search in executables. In: Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, pp. 349–360 (2014)

  4. Faktor, A., Irani, M.: Clustering by composition: unsupervised discovery of image categories. IEEE Trans. Pattern Anal. Mach. Intell. 36(6), 1092–1106 (2014)

  5. Halevy, A., Norvig, P., Pereira, F.: The unreasonable effectiveness of data. IEEE Intell. Syst. 24(2), 8–12 (2009)

  6. Hays, J., Efros, A.A.: Scene completion using millions of photographs. In: ACM SIGGRAPH 2007 Papers, SIGGRAPH ’07, New York, NY, USA (2007)

  7. Horwitz, S.: Identifying the semantic and textual differences between two versions of a program, vol. 25. ACM (1990)

  8. Jagadish, H.V., Gehrke, J., Labrinidis, A., Papakonstantinou, Y., Patel, J.M., Ramakrishnan, R., Shahabi, C.: Big data and its technical challenges. Commun. ACM 57(7), 86–94 (2014)

  9. Kang, H., Hebert, M., Efros, A.A., Kanade, T.: Data-driven objectness. IEEE Trans. Pattern Anal. Mach. Intell. 37(1), 189–195 (2015)

  10. Katz, O.: Type prediction using variable order Markov models. Master’s thesis, Technion (2015)

  11. Mishne, A., Shoham, S., Yahav, E.: Typestate-based semantic code search over partial programs. In: OOPSLA ’12 (2012)

  12. Necula, G.C.: Translation validation for an optimizing compiler. In: Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, PLDI ’00, pp. 83–94, New York, NY, USA (2000)

  13. Partush, N., Yahav, E.: Abstract semantic differencing for numerical programs. In: Logozzo, F., Fähndrich, M. (eds.) Static Analysis. LNCS, vol. 7935, pp. 238–258. Springer, Heidelberg (2013)

  14. Partush, N., Yahav, E.: Abstract semantic differencing via speculative correlation. In: Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications, OOPSLA’14 (2014)

  15. Peleg, H., Shoham, S., Yahav, E., Yang, H.: Symbolic automata for representing big code. In: International Journal on Software Tools for Technology Transfer, STTT’15 (2015)

  16. Pnueli, A., Siegel, M.D., Singerman, E.: Translation validation. In: Steffen, B. (ed.) TACAS 1998. LNCS, vol. 1384, pp. 151–166. Springer, Heidelberg (1998)

  17. Ramos, D.A., Engler, D.R.: Practical, low-effort equivalence verification of real code. In: Gopalakrishnan, G., Qadeer, S. (eds.) CAV 2011. LNCS, vol. 6806, pp. 669–685. Springer, Heidelberg (2011)

  18. Raychev, V., Vechev, M., Krause, A.: Predicting program properties from “big code”. In: Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’15, pp. 111–124 (2015)

  19. Raychev, V., Vechev, M., Yahav, E.: Code completion with statistical language models. In: Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI’14, p. 44 (2014)

  20. Rosenfeld, R.: Two decades of statistical language modeling: where do we go from here? Proc. IEEE 88, 1270–1278 (2000)

  21. Sabanal, P.V., Yason, M.V.: Reversing C++. https://www.blackhat.com/presentations/bh-dc-07/Sabanal_Yason/Paper/bh-dc-07-Sabanal_Yason-WP.pdf

  22. Sinai, M.B., Yahav, E.: Code similarity via natural language descriptions. In: POPL Off the Beaten Track, OBT’15 (2014)

Acknowledgement

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7) under grant agreement no. 615688 ERC-COG-PRIME.

Author information

Corresponding author

Correspondence to Eran Yahav.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Yahav, E. (2015). Programming with “Big Code”. In: Feng, X., Park, S. (eds) Programming Languages and Systems. APLAS 2015. Lecture Notes in Computer Science, vol 9458. Springer, Cham. https://doi.org/10.1007/978-3-319-26529-2_1

  • DOI: https://doi.org/10.1007/978-3-319-26529-2_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-26528-5

  • Online ISBN: 978-3-319-26529-2

  • eBook Packages: Computer Science, Computer Science (R0)
