Abstract
This paper presents a generic method for transforming the source code of various algorithms into numerical embeddings, by dynamically analysing the behaviour of computer programs against different inputs and by fitting multiple generic complexity functions to the analysed metrics. The resulting algorithm embeddings are based on r-Complexity [7]. Using the proposed code embeddings, we present an implementation of the XGBoost algorithm that achieves an average \(90\%\) F1-score on a multi-label dataset with 11 classes, built from real-world code snippets submitted to programming competitions on the Codeforces platform.
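The core idea of fitting generic complexity functions to observed run-time behaviour can be illustrated with a minimal sketch. This is not the authors' implementation: the candidate functions, the one-parameter least-squares fit, and all names below are illustrative assumptions. A cost metric is measured for several input sizes, each candidate complexity function is fitted to the measurements, and the fitted coefficients (and residuals) form a numerical embedding of the program.

```python
import math

# Candidate generic complexity functions (illustrative choices).
CANDIDATES = {
    "n": lambda n: n,
    "n log n": lambda n: n * math.log2(n),
    "n^2": lambda n: n * n,
}

def fit_complexity(sizes, costs):
    """Fit each candidate g by one-parameter least squares,
    c* = sum(T * g) / sum(g^2), and return the best-fitting label
    together with the full {label: (coefficient, residual)} embedding."""
    embedding = {}
    for label, g in CANDIDATES.items():
        gs = [g(n) for n in sizes]
        c = sum(t * x for t, x in zip(costs, gs)) / sum(x * x for x in gs)
        residual = sum((t - c * x) ** 2 for t, x in zip(costs, gs))
        embedding[label] = (c, residual)
    best = min(embedding, key=lambda label: embedding[label][1])
    return best, embedding

# Synthetic measurements from a quadratic-time program: T(n) = 3 * n^2.
sizes = [10, 100, 1000, 10000]
costs = [3 * n * n for n in sizes]
best, emb = fit_complexity(sizes, costs)
print(best)  # → n^2
```

In this sketch the residuals identify the dominant growth term, while the coefficients play a role loosely analogous to the scaling constants of r-Complexity; the actual method in the paper derives richer features from several dynamically measured metrics.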
Notes
- 1.
We define dynamic analysis as the process of evaluating computer software based on data acquired from experiments carried out on a real computing system by executing programs against a range of different inputs.
- 2.
The model is generic and other values can be used as well; these are simply the most relevant values we used in our research.
- 3.
In our research, we have searched only a small discrete set of values for n, described earlier in this section.
- 4.
Frontend refers here to the part of the hardware responsible for fetching and decoding instructions.
- 5.
The InputsCodeforces dataset is publicly available:
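The \(90\%\) F1-score reported in the abstract is, in a multi-label setting, typically a macro-average over per-class F1 values. A minimal sketch of that computation follows; the indicator matrices below are hypothetical, and the convention of scoring an absent class as 1.0 is one possible choice (scikit-learn, for instance, defaults to 0 in that case).

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over binary indicator matrices
    (rows = samples, columns = labels)."""
    n_labels = len(y_true[0])
    f1_scores = []
    for j in range(n_labels):
        tp = sum(t[j] and p[j] for t, p in zip(y_true, y_pred))
        fp = sum((not t[j]) and p[j] for t, p in zip(y_true, y_pred))
        fn = sum(t[j] and (not p[j]) for t, p in zip(y_true, y_pred))
        # Per-class F1; a class absent from both truth and prediction scores 1.0.
        f1_scores.append(2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0)
    return sum(f1_scores) / n_labels

# Hypothetical predictions for 4 snippets tagged with 3 algorithm classes.
y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
print(round(macro_f1(y_true, y_pred), 3))  # → 0.889
```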
References
Alon, U., Zilberstein, M., Levy, O., Yahav, E.: code2vec: learning distributed representations of code. Proc. ACM Program. Lang. 3(POPL), 1–29 (2019)
Ben-Nun, T., Jakobovits, A.S., Hoefler, T.: Neural code comprehension: a learnable representation of code semantics. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3589–3601. NIPS 2018, Curran Associates Inc., Red Hook, NY, USA (2018)
Buratti, L., et al.: Exploring software naturalness through neural language models. CoRR abs/2006.12641 (2020). https://arxiv.org/abs/2006.12641
Calotoiu, A.: Automatic empirical performance modeling of parallel programs (2018)
Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794 (2016)
Chistyakov, A., Lobacheva, E., Kuznetsov, A., Romanenko, A.: Semantic embeddings for program behavior patterns. CoRR abs/1804.03635 (2018). http://arxiv.org/abs/1804.03635
Folea, R., Slusanschi, E.I.: A new metric for evaluating the performance and complexity of computer programs: a new approach to the traditional ways of measuring the complexity of algorithms and estimating running times. In: 2021 23rd International Conference on Control Systems and Computer Science (CSCS), pp. 157–164. IEEE (2021)
Iacob, R.C.A., Monea, V.C., Rădulescu, D., Ceapă, A.F., Rebedea, T., Trăusan-Matu, S.: AlgoLabel: a large dataset for multi-label classification of algorithmic challenges. Mathematics 8(11), 1995 (2020)
Koc, U., Saadatpanah, P., Foster, J.S., Porter, A.A.: Learning a classifier for false positive error reports emitted by static code analysis tools. In: Proceedings of the 1st ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pp. 35–42 (2017)
Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., Joulin, A.: Advances in pre-training distributed word representations. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan (2018). https://aclanthology.org/L18-1008
Redmond, K., Luo, L., Zeng, Q.: A cross-architecture instruction embedding model for natural language processing-inspired binary code analysis. CoRR abs/1812.09652 (2018). http://arxiv.org/abs/1812.09652
Svyatkovskiy, A., Lee, S., Hadjitofi, A., Riechert, M., Franco, J.V., Allamanis, M.: Fast and memory-efficient neural code completion. In: 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), pp. 329–340. IEEE (2021)
Wang, K.: Learning scalable and precise representation of program semantics. CoRR abs/1905.05251 (2019). http://arxiv.org/abs/1905.05251
Wang, K., Singh, R., Su, Z.: Dynamic neural program embeddings for program repair. In: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net (2018). https://openreview.net/forum?id=BJuWrGW0Z
Wang, K., Su, Z.: Blended, precise semantic program embeddings. In: Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 121–134. PLDI 2020, Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3385412.3385999
Yousefi-Azar, M., Hamey, L., Varadharajan, V., Chen, S.: Learning latent byte-level feature representation for malware detection. In: Cheng, L., Leung, A.C.S., Ozawa, S. (eds.) ICONIP 2018. LNCS, vol. 11304, pp. 568–578. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-04212-7_50
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Folea, R., Iacob, R., Slusanschi, E., Rebedea, T. (2023). Complexity-Based Code Embeddings. In: Nguyen, N.T., et al. (eds.) Computational Collective Intelligence. ICCCI 2023. Lecture Notes in Computer Science, vol. 14162. Springer, Cham. https://doi.org/10.1007/978-3-031-41456-5_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41455-8
Online ISBN: 978-3-031-41456-5