Abstract
The vast amount of code available on the web is increasing on a daily basis. Open-source hosting sites such as GitHub contain billions of lines of code. Community question-answering sites provide millions of code snippets with corresponding text and metadata. The amount of code available in executable binaries is even greater. In this talk, I will cover recent research trends on leveraging such “big code” for program analysis, program synthesis and reverse engineering. We will consider a range of semantic representations based on symbolic automata [11, 15], tracelets [3], numerical abstractions [13, 14], and textual descriptions [1, 22], as well as different notions of code similarity based on these representations.
To leverage these semantic representations, we will consider a number of prediction techniques, including statistical language models [19, 20], variable order Markov models [2], and other distance-based and model-based sequence classification techniques.
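As a rough illustration of the statistical-language-model idea, the sketch below trains a first-order (bigram) model over sequences of API calls and uses it to rank likely next calls. It is a minimal sketch under stated assumptions, not the method of [19, 20]: the function names `train_bigram_model` and `suggest_next` and the toy corpus are hypothetical, and a variable order Markov model would condition on contexts of varying length rather than on a single previous call.

```python
from collections import Counter, defaultdict


def train_bigram_model(sequences):
    """Count how often each call directly follows another in the corpus."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def suggest_next(counts, prev, k=3):
    """Return up to k (call, probability) pairs for calls observed after `prev`."""
    followers = counts.get(prev)
    if not followers:
        return []  # `prev` was never seen with a successor
    total = sum(followers.values())
    return [(call, n / total) for call, n in followers.most_common(k)]


# Hypothetical toy corpus: each list is one observed sequence of API calls.
corpus = [
    ["open", "read", "close"],
    ["open", "read", "read", "close"],
    ["open", "write", "close"],
]
model = train_bigram_model(corpus)

# Rank candidate completions after a call to "open".
for call, prob in suggest_next(model, "open"):
    print(f"{call}: {prob:.2f}")
```

In this toy setting the ranking is just relative frequency. A realistic model would also need smoothing for unseen contexts and a richer program representation than raw call names.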
Finally, I will show applications of these techniques, including semantic code search in both source code and stripped binaries, code completion, and reverse engineering.
References
2. Begleiter, R., El-Yaniv, R., Yona, G.: On prediction using variable order Markov models. J. Artif. Intell. Res. 22, 385–421 (2004)
3. David, Y., Yahav, E.: Tracelet-based code search in executables. In: Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, pp. 349–360 (2014)
4. Faktor, A., Irani, M.: Clustering by composition: unsupervised discovery of image categories. IEEE Trans. Pattern Anal. Mach. Intell. 36(6), 1092–1106 (2014)
5. Halevy, A., Norvig, P., Pereira, F.: The unreasonable effectiveness of data. IEEE Intell. Syst. 24(2), 8–12 (2009)
6. Hays, J., Efros, A.A.: Scene completion using millions of photographs. In: ACM SIGGRAPH 2007 Papers, SIGGRAPH ’07, New York, NY, USA (2007)
7. Horwitz, S.: Identifying the semantic and textual differences between two versions of a program, vol. 25. ACM (1990)
8. Jagadish, H.V., Gehrke, J., Labrinidis, A., Papakonstantinou, Y., Patel, J.M., Ramakrishnan, R., Shahabi, C.: Big data and its technical challenges. Commun. ACM 57(7), 86–94 (2014)
9. Kang, H., Hebert, M., Efros, A.A., Kanade, T.: Data-driven objectness. IEEE Trans. Pattern Anal. Mach. Intell. 37(1), 189–195 (2015)
10. Katz, O.: Type prediction using variable order Markov models. Master’s thesis, Technion (2015)
11. Mishne, A., Shoham, S., Yahav, E.: Typestate-based semantic code search over partial programs. In: OOPSLA ’12 (2012)
12. Necula, G.C.: Translation validation for an optimizing compiler. In: Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, PLDI ’00, pp. 83–94, New York, NY, USA (2000)
13. Partush, N., Yahav, E.: Abstract semantic differencing for numerical programs. In: Logozzo, F., Fähndrich, M. (eds.) Static Analysis. LNCS, vol. 7935, pp. 238–258. Springer, Heidelberg (2013)
14. Partush, N., Yahav, E.: Abstract semantic differencing via speculative correlation. In: Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications, OOPSLA ’14 (2014)
15. Peleg, H., Shoham, S., Yahav, E., Yang, H.: Symbolic automata for representing big code. In: International Journal on Software Tools for Technology Transfer, STTT ’15 (2015)
16. Pnueli, A., Siegel, M.D., Singerman, E.: Translation validation. In: Steffen, B. (ed.) TACAS 1998. LNCS, vol. 1384, pp. 151–166. Springer, Heidelberg (1998)
17. Ramos, D.A., Engler, D.R.: Practical, low-effort equivalence verification of real code. In: Gopalakrishnan, G., Qadeer, S. (eds.) CAV 2011. LNCS, vol. 6806, pp. 669–685. Springer, Heidelberg (2011)
18. Raychev, V., Vechev, M., Krause, A.: Predicting program properties from “big code”. In: Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’15, pp. 111–124 (2015)
19. Raychev, V., Vechev, M., Yahav, E.: Code completion with statistical language models. In: Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, p. 44 (2014)
20. Rosenfeld, R.: Two decades of statistical language modeling: where do we go from here? Proc. IEEE 88, 1270–1278 (2000)
21. Sabanal, P.V., Yason, M.V.: Reversing C++. https://www.blackhat.com/presentations/bh-dc-07/Sabanal_Yason/Paper/bh-dc-07-Sabanal_Yason-WP.pdf
22. Sinai, M.B., Yahav, E.: Code similarity via natural language descriptions. In: POPL Off the Beaten Track, OBT ’15 (2014)
Acknowledgement
The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7) under grant agreement no. 615688 ERC-COG-PRIME.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Yahav, E. (2015). Programming with “Big Code”. In: Feng, X., Park, S. (eds) Programming Languages and Systems. APLAS 2015. Lecture Notes in Computer Science, vol 9458. Springer, Cham. https://doi.org/10.1007/978-3-319-26529-2_1
DOI: https://doi.org/10.1007/978-3-319-26529-2_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26528-5
Online ISBN: 978-3-319-26529-2
eBook Packages: Computer Science, Computer Science (R0)