Complexity-Based Code Embeddings

  • Conference paper
Computational Collective Intelligence (ICCCI 2023)

Abstract

This paper presents a generic method for transforming the source code of various algorithms into numerical embeddings, by dynamically analysing the behaviour of computer programs against different inputs and by fitting multiple generic complexity functions to the analysed metrics. The algorithm embeddings are based on r-Complexity [7]. Using the proposed code embeddings, we present an implementation of the XGBoost algorithm that achieves an average F1-score of 90% on a multi-label dataset with 11 classes, built using real-world code snippets submitted for programming competitions on the Codeforces platform.
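
As a rough illustration of the idea summarised in the abstract, the following minimal sketch measures one metric of a toy program over several input sizes and fits a few candidate complexity functions to it. It is not the r-Complexity procedure from [7]: the candidate set, the single-coefficient least-squares fit, the counted metric, and the helper names (`CANDIDATES`, `fit_scale`, `embed`) are all assumptions made for this example.

```python
import math

# Hypothetical candidate growth functions g(n); the set used in the paper is not given here.
CANDIDATES = {
    "log n": lambda n: math.log2(n),
    "n": lambda n: float(n),
    "n log n": lambda n: n * math.log2(n),
    "n^2": lambda n: float(n) ** 2,
}


def count_comparisons_bubble_sort(n):
    """Toy dynamic analysis: run bubble sort on a reversed list and count comparisons."""
    data = list(range(n, 0, -1))
    comparisons = 0
    for i in range(n):
        for j in range(n - 1 - i):
            comparisons += 1
            if data[j] > data[j + 1]:
                data[j], data[j + 1] = data[j + 1], data[j]
    return comparisons


def fit_scale(sizes, values, g):
    """Least-squares fit of values ~ c * g(n); returns (c, relative residual)."""
    gs = [g(n) for n in sizes]
    c = sum(v * x for v, x in zip(values, gs)) / sum(x * x for x in gs)
    sse = sum((v - c * x) ** 2 for v, x in zip(values, gs))
    norm = sum(v * v for v in values)
    return c, math.sqrt(sse / norm)


def embed(sizes, values):
    """Concatenate (c, residual) for every candidate into a fixed-length vector."""
    vector = []
    for name in CANDIDATES:
        c, err = fit_scale(sizes, values, CANDIDATES[name])
        vector.extend([c, err])
    return vector


sizes = [16, 32, 64, 128, 256]
values = [count_comparisons_bubble_sort(n) for n in sizes]
embedding = embed(sizes, values)

# The candidate with the smallest relative residual is the best-fitting growth class.
best = min(CANDIDATES, key=lambda name: fit_scale(sizes, values, CANDIDATES[name])[1])
print(best, embedding)
```

A vector of this kind, computed per program and per metric, could then serve as the feature input to a classifier such as XGBoost; the sketch stops before that step.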

Notes

  1. We define dynamic analysis as the process of evaluating computer software based on data acquired from experiments carried out on a real computing system by executing programs against a range of different inputs.

  2. The model is generic and other values can be used as well, yet these are the most relevant values that we have used in our research.

  3. In our research, we have searched only a small discrete set of values for n, described earlier in this section.

  4. Frontend refers here to the part of the hardware responsible for fetching and decoding instructions.

  5. TheInputsCodeforces is a public dataset: https://github.com/raresraf/TheInputsCodeforces.

  6. https://www.github.com/raresraf/AlgoRAF.

  7. https://www.github.com/raresraf/AlgoRAF/tree/master/viz.

References

  1. Alon, U., Zilberstein, M., Levy, O., Yahav, E.: code2vec: learning distributed representations of code. Proc. ACM Program. Lang. 3(POPL), 1–29 (2019)

  2. Ben-Nun, T., Jakobovits, A.S., Hoefler, T.: Neural code comprehension: a learnable representation of code semantics. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3589–3601. NIPS 2018, Curran Associates Inc., Red Hook, NY, USA (2018)

  3. Buratti, L., et al.: Exploring software naturalness through neural language models. CoRR abs/2006.12641 (2020). https://arxiv.org/abs/2006.12641

  4. Calotoiu, A.: Automatic empirical performance modeling of parallel programs (2018)

  5. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794 (2016)

  6. Chistyakov, A., Lobacheva, E., Kuznetsov, A., Romanenko, A.: Semantic embeddings for program behavior patterns. CoRR abs/1804.03635 (2018). http://arxiv.org/abs/1804.03635

  7. Folea, R., Slusanschi, E.I.: A new metric for evaluating the performance and complexity of computer programs: a new approach to the traditional ways of measuring the complexity of algorithms and estimating running times. In: 2021 23rd International Conference on Control Systems and Computer Science (CSCS), pp. 157–164. IEEE (2021)

  8. Iacob, R.C.A., Monea, V.C., Rădulescu, D., Ceapă, A.F., Rebedea, T., Trăusan-Matu, S.: Algolabel: a large dataset for multi-label classification of algorithmic challenges. Mathematics 8(11), 1995 (2020)

  9. Koc, U., Saadatpanah, P., Foster, J.S., Porter, A.A.: Learning a classifier for false positive error reports emitted by static code analysis tools. In: Proceedings of the 1st ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pp. 35–42 (2017)

  10. Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., Joulin, A.: Advances in pre-training distributed word representations. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan (2018). https://aclanthology.org/L18-1008

  11. Redmond, K., Luo, L., Zeng, Q.: A cross-architecture instruction embedding model for natural language processing-inspired binary code analysis. CoRR abs/1812.09652 (2018). http://arxiv.org/abs/1812.09652

  12. Svyatkovskiy, A., Lee, S., Hadjitofi, A., Riechert, M., Franco, J.V., Allamanis, M.: Fast and memory-efficient neural code completion. In: 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), pp. 329–340. IEEE (2021)

  13. Wang, K.: Learning scalable and precise representation of program semantics. CoRR abs/1905.05251 (2019). http://arxiv.org/abs/1905.05251

  14. Wang, K., Singh, R., Su, Z.: Dynamic neural program embeddings for program repair. In: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net (2018). https://openreview.net/forum?id=BJuWrGW0Z

  15. Wang, K., Su, Z.: Blended, precise semantic program embeddings. In: Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 121–134. PLDI 2020, Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3385412.3385999

  16. Yousefi-Azar, M., Hamey, L., Varadharajan, V., Chen, S.: Learning latent byte-level feature representation for malware detection. In: Cheng, L., Leung, A.C.S., Ozawa, S. (eds.) ICONIP 2018. LNCS, vol. 11304, pp. 568–578. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-04212-7_50

Author information

Corresponding author

Correspondence to Rares Folea.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Folea, R., Iacob, R., Slusanschi, E., Rebedea, T. (2023). Complexity-Based Code Embeddings. In: Nguyen, N.T., et al. Computational Collective Intelligence. ICCCI 2023. Lecture Notes in Computer Science, vol 14162. Springer, Cham. https://doi.org/10.1007/978-3-031-41456-5_20

  • DOI: https://doi.org/10.1007/978-3-031-41456-5_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-41455-8

  • Online ISBN: 978-3-031-41456-5

  • eBook Packages: Computer Science (R0)