Abstract
This paper presents a generic method for transforming the source code of various algorithms into numerical embeddings, by dynamically analysing the behaviour of computer programs against different inputs and by fitting multiple generic complexity functions to the analysed metrics. The resulting algorithm embeddings are based on r-Complexity [7]. Using the proposed code embeddings, we present an implementation of the XGBoost algorithm that achieves an average \(90\%\) F1-score on a multi-label dataset with 11 classes, built from real-world code snippets submitted to programming competitions on the Codeforces platform.
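The core idea of fitting generic complexity functions to observed run-time behaviour can be illustrated with a minimal sketch. This is not the authors' implementation: the candidate functions, the one-parameter least-squares fit, and all names below are illustrative assumptions. A cost metric is measured for several input sizes, each candidate complexity function is fitted to the measurements, and the fitted coefficients (and residuals) form a numerical embedding of the program.

```python
import math

# Candidate generic complexity functions (illustrative choices).
CANDIDATES = {
    "n": lambda n: n,
    "n log n": lambda n: n * math.log2(n),
    "n^2": lambda n: n * n,
}

def fit_complexity(sizes, costs):
    """Fit each candidate g by one-parameter least squares,
    c* = sum(T * g) / sum(g^2), and return the best-fitting label
    together with the full {label: (coefficient, residual)} embedding."""
    embedding = {}
    for label, g in CANDIDATES.items():
        gs = [g(n) for n in sizes]
        c = sum(t * x for t, x in zip(costs, gs)) / sum(x * x for x in gs)
        residual = sum((t - c * x) ** 2 for t, x in zip(costs, gs))
        embedding[label] = (c, residual)
    best = min(embedding, key=lambda label: embedding[label][1])
    return best, embedding

# Synthetic measurements from a quadratic-time program: T(n) = 3 * n^2.
sizes = [10, 100, 1000, 10000]
costs = [3 * n * n for n in sizes]
best, emb = fit_complexity(sizes, costs)
print(best)  # → n^2
```

In this sketch the residuals identify the dominant growth term, while the coefficients play a role loosely analogous to the scaling constants of r-Complexity; the actual method in the paper derives richer features from several dynamically measured metrics.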
Notes
- 1.
We define dynamic analysis as the process of evaluating computer software based on data acquired from experiments carried out on a real computing system by executing programs against a range of different inputs.
- 2.
The model is generic and other values can be used as well; these are simply the most relevant values we used in our research.
- 3.
In our research, we have searched only a small discrete set of values for n, described earlier in this section.
- 4.
Frontend refers here to the part of the hardware responsible for fetching and decoding instructions.
- 5.
The InputsCodeforces dataset is publicly available:
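The \(90\%\) F1-score reported in the abstract is, in a multi-label setting, typically a macro-average over per-class F1 values. A minimal sketch of that computation follows; the indicator matrices below are hypothetical, and the convention of scoring an absent class as 1.0 is one possible choice (scikit-learn, for instance, defaults to 0 in that case).

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over binary indicator matrices
    (rows = samples, columns = labels)."""
    n_labels = len(y_true[0])
    f1_scores = []
    for j in range(n_labels):
        tp = sum(t[j] and p[j] for t, p in zip(y_true, y_pred))
        fp = sum((not t[j]) and p[j] for t, p in zip(y_true, y_pred))
        fn = sum(t[j] and (not p[j]) for t, p in zip(y_true, y_pred))
        # Per-class F1; a class absent from both truth and prediction scores 1.0.
        f1_scores.append(2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0)
    return sum(f1_scores) / n_labels

# Hypothetical predictions for 4 snippets tagged with 3 algorithm classes.
y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
print(round(macro_f1(y_true, y_pred), 3))  # → 0.889
```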
References
Alon, U., Zilberstein, M., Levy, O., Yahav, E.: code2vec: learning distributed representations of code. Proc. ACM Program. Lang. 3(POPL), 1–29 (2019)
Ben-Nun, T., Jakobovits, A.S., Hoefler, T.: Neural code comprehension: a learnable representation of code semantics. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3589–3601. NIPS 2018, Curran Associates Inc., Red Hook, NY, USA (2018)
Buratti, L., et al.: Exploring software naturalness through neural language models. CoRR abs/2006.12641 (2020). https://arxiv.org/abs/2006.12641
Calotoiu, A.: Automatic empirical performance modeling of parallel programs (2018)
Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794 (2016)
Chistyakov, A., Lobacheva, E., Kuznetsov, A., Romanenko, A.: Semantic embeddings for program behavior patterns. CoRR abs/1804.03635 (2018). http://arxiv.org/abs/1804.03635
Folea, R., Slusanschi, E.I.: A new metric for evaluating the performance and complexity of computer programs: a new approach to the traditional ways of measuring the complexity of algorithms and estimating running times. In: 2021 23rd International Conference on Control Systems and Computer Science (CSCS), pp. 157–164. IEEE (2021)
Iacob, R.C.A., Monea, V.C., Rădulescu, D., Ceapă, A.F., Rebedea, T., Trăusan-Matu, S.: AlgoLabel: a large dataset for multi-label classification of algorithmic challenges. Mathematics 8(11), 1995 (2020)
Koc, U., Saadatpanah, P., Foster, J.S., Porter, A.A.: Learning a classifier for false positive error reports emitted by static code analysis tools. In: Proceedings of the 1st ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pp. 35–42 (2017)
Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., Joulin, A.: Advances in pre-training distributed word representations. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan (2018). https://aclanthology.org/L18-1008
Redmond, K., Luo, L., Zeng, Q.: A cross-architecture instruction embedding model for natural language processing-inspired binary code analysis. CoRR abs/1812.09652 (2018). http://arxiv.org/abs/1812.09652
Svyatkovskiy, A., Lee, S., Hadjitofi, A., Riechert, M., Franco, J.V., Allamanis, M.: Fast and memory-efficient neural code completion. In: 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), pp. 329–340. IEEE (2021)
Wang, K.: Learning scalable and precise representation of program semantics. CoRR abs/1905.05251 (2019). http://arxiv.org/abs/1905.05251
Wang, K., Singh, R., Su, Z.: Dynamic neural program embeddings for program repair. In: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net (2018). https://openreview.net/forum?id=BJuWrGW0Z
Wang, K., Su, Z.: Blended, precise semantic program embeddings. In: Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 121–134. PLDI 2020, Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3385412.3385999
Yousefi-Azar, M., Hamey, L., Varadharajan, V., Chen, S.: Learning latent byte-level feature representation for malware detection. In: Cheng, L., Leung, A.C.S., Ozawa, S. (eds.) ICONIP 2018. LNCS, vol. 11304, pp. 568–578. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-04212-7_50
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Folea, R., Iacob, R., Slusanschi, E., Rebedea, T. (2023). Complexity-Based Code Embeddings. In: Nguyen, N.T., et al. (eds.) Computational Collective Intelligence. ICCCI 2023. Lecture Notes in Computer Science, vol. 14162. Springer, Cham. https://doi.org/10.1007/978-3-031-41456-5_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41455-8
Online ISBN: 978-3-031-41456-5