Programming with “Big Code”

Yahav, Eran

doi:10.1007/978-3-319-26529-2_1

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 9458)

Included in the following conference series:

Asian Symposium on Programming Languages and Systems

Abstract

The vast amount of code available on the web is increasing on a daily basis. Open-source hosting sites such as GitHub contain billions of lines of code. Community question-answering sites provide millions of code snippets with corresponding text and metadata. The amount of code available in executable binaries is even greater. In this talk, I will cover recent research trends on leveraging such “big code” for program analysis, program synthesis and reverse engineering. We will consider a range of semantic representations based on symbolic automata [11, 15], tracelets [3], numerical abstractions [13, 14], and textual descriptions [1, 22], as well as different notions of code similarity based on these representations.

To leverage these semantic representations, we will consider a number of prediction techniques, including statistical language models [19, 20], variable order Markov models [2], and other distance-based and model-based sequence classification techniques.
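As a rough illustration of the simplest kind of statistical language model applied to code, the sketch below trains a bigram model over sequences of API calls and uses it to rank likely next calls. It is a minimal sketch under stated assumptions, not the models of [19, 20] or [2]: the toy corpus, the function names `train_bigram_model` and `complete`, and the fixed context length of one call are all hypothetical choices made for illustration. A variable order Markov model would instead adapt the context length to the data.

```python
from collections import Counter, defaultdict


def train_bigram_model(sequences):
    """Count how often each call follows each preceding call."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pad with start/end markers so first and last calls are modeled too.
        for prev, nxt in zip(["<s>"] + seq, seq + ["</s>"]):
            counts[prev][nxt] += 1
    return counts


def complete(counts, prev, k=3):
    """Return up to k most frequent successors of `prev` with relative frequencies."""
    followers = counts.get(prev)
    if not followers:
        return []  # unseen context: no suggestion (no smoothing in this sketch)
    total = sum(followers.values())
    return [(call, n / total) for call, n in followers.most_common(k)]


# Hypothetical toy corpus of API call sequences (illustrative only).
corpus = [
    ["open", "read", "close"],
    ["open", "read", "read", "close"],
    ["open", "write", "close"],
]

model = train_bigram_model(corpus)
suggestions = complete(model, "open")
print(suggestions)
```

In this toy corpus, `read` follows `open` in two of three sequences, so it is ranked first. A realistic model would need smoothing for unseen contexts and a far larger training corpus.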

Finally, I will show applications of these techniques including semantic code search in both source code and stripped binaries, code completion and reverse engineering.

References

1. http://like2drops.com
2. Begleiter, R., El-Yaniv, R., Yona, G.: On prediction using variable order Markov models. J. Artif. Intell. Res. 22, 385–421 (2004)
3. David, Y., Yahav, E.: Tracelet-based code search in executables. In: Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, pp. 349–360 (2014)
4. Faktor, A., Irani, M.: Clustering by composition: unsupervised discovery of image categories. IEEE Trans. Pattern Anal. Mach. Intell. 36(6), 1092–1106 (2014)
5. Halevy, A., Norvig, P., Pereira, F.: The unreasonable effectiveness of data. IEEE Intell. Syst. 24(2), 8–12 (2009)
6. Hays, J., Efros, A.A.: Scene completion using millions of photographs. In: ACM SIGGRAPH 2007 Papers, SIGGRAPH ’07, New York, NY, USA (2007)
7. Horwitz, S.: Identifying the semantic and textual differences between two versions of a program, vol. 25. ACM (1990)
8. Jagadish, H.V., Gehrke, J., Labrinidis, A., Papakonstantinou, Y., Patel, J.M., Ramakrishnan, R., Shahabi, C.: Big data and its technical challenges. Commun. ACM 57(7), 86–94 (2014)
9. Kang, H., Hebert, M., Efros, A.A., Kanade, T.: Data-driven objectness. IEEE Trans. Pattern Anal. Mach. Intell. 37(1), 189–195 (2015)
10. Katz, O.: Type prediction using variable order Markov models. Master’s thesis, Technion (2015)
11. Mishne, A., Shoham, S., Yahav, E.: Typestate-based semantic code search over partial programs. In: OOPSLA ’12 (2012)
12. Necula, G.C.: Translation validation for an optimizing compiler. In: Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, PLDI ’00, pp. 83–94, New York, NY, USA (2000)
13. Partush, N., Yahav, E.: Abstract semantic differencing for numerical programs. In: Logozzo, F., Fähndrich, M. (eds.) Static Analysis. LNCS, vol. 7935, pp. 238–258. Springer, Heidelberg (2013)
14. Partush, N., Yahav, E.: Abstract semantic differencing via speculative correlation. In: Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications, OOPSLA ’14 (2014)
15. Peleg, H., Shoham, S., Yahav, E., Yang, H.: Symbolic automata for representing big code. In: International Journal on Software Tools for Technology Transfer, STTT ’15 (2015)
16. Pnueli, A., Siegel, M.D., Singerman, E.: Translation validation. In: Steffen, B. (ed.) TACAS 1998. LNCS, vol. 1384, pp. 151–166. Springer, Heidelberg (1998)
17. Ramos, D.A., Engler, D.R.: Practical, low-effort equivalence verification of real code. In: Gopalakrishnan, G., Qadeer, S. (eds.) CAV 2011. LNCS, vol. 6806, pp. 669–685. Springer, Heidelberg (2011)
18. Raychev, V., Vechev, M., Krause, A.: Predicting program properties from “big code”. In: Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’15, pp. 111–124 (2015)
19. Raychev, V., Vechev, M., Yahav, E.: Code completion with statistical language models. In: Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, p. 44 (2014)
20. Rosenfeld, R.: Two decades of statistical language modeling: where do we go from here? Proc. IEEE 88, 1270–1278 (2000)
21. Sabanal, P.V., Yason, M.V.: Reversing C++. https://www.blackhat.com/presentations/bh-dc-07/Sabanal_Yason/Paper/bh-dc-07-Sabanal_Yason-WP.pdf
22. Sinai, M.B., Yahav, E.: Code similarity via natural language descriptions. In: POPL Off the Beaten Track, OBT ’15 (2014)

Acknowledgement

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7) under grant agreement no. 615688 ERC-COG-PRIME.

Author information

Authors and Affiliations

Technion, Haifa, Israel
Eran Yahav

Corresponding author

Correspondence to Eran Yahav.

Editor information

Editors and Affiliations

Univ. of Science and Technology of China, Hefei, Anhui, China
Xinyu Feng
Pohang Univ. of Science and Technology, Nam-Gu, Pohang, Korea (Republic of)
Sungwoo Park

About this paper

Cite this paper

Yahav, E. (2015). Programming with “Big Code”. In: Feng, X., Park, S. (eds) Programming Languages and Systems. APLAS 2015. Lecture Notes in Computer Science, vol 9458. Springer, Cham. https://doi.org/10.1007/978-3-319-26529-2_1

DOI: https://doi.org/10.1007/978-3-319-26529-2_1
Published: 09 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26528-5
Online ISBN: 978-3-319-26529-2
eBook Packages: Computer Science, Computer Science (R0)
