DOI: 10.1145/3446132.3446420
research-article

A Machine Learning Based Plagiarism Detection in Source Code

Published: 09 March 2021

Abstract

Converting source code to feature vectors is useful in programming-related tasks, such as plagiarism detection in ACM contests. We present a new method for feature extraction from C++ files that combines features describing syntactic and lexical properties of the abstract syntax tree (AST) with features characterizing the disassembly of the source code. We propose solving the plagiarism detection task as a classification problem. We demonstrate the effectiveness of our feature set by testing on a dataset of 50 ACM problems and ∼90k solutions to them. The trained XGBoost model achieves a relative binary F1-score of 0.745 on the test set.
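
The general idea of turning source files into feature vectors and scoring a binary classifier can be illustrated with a minimal sketch. It is not the authors' feature set: the abstract does not list the AST or disassembly features, so the sketch uses only a few lexical statistics. The keyword list `CPP_KEYWORDS`, the helper names, and the pairwise framing (describing two solutions by the difference of their vectors) are all assumptions made for illustration. The `binary_f1` function computes the standard binary F1-score; the "relative" variant reported in the abstract is not defined there and is not reproduced.

```python
import re
from collections import Counter
from typing import List

# Hypothetical keyword list chosen for illustration only.
CPP_KEYWORDS = ("for", "while", "if", "else", "return", "int", "long", "vector", "auto")


def lexical_features(source: str) -> List[float]:
    """Map C++ source text to a small fixed-length vector of lexical statistics."""
    tokens = re.findall(r"[A-Za-z_]\w*", source)  # identifiers and keywords
    counts = Counter(tokens)
    total = max(len(tokens), 1)  # avoid division by zero on empty input
    lines = source.splitlines() or [""]
    features = [counts[kw] / total for kw in CPP_KEYWORDS]  # keyword frequencies
    features.append(sum(len(line) for line in lines) / len(lines))  # mean line length
    features.append(len(set(tokens)) / total)  # share of distinct tokens
    return features


def pair_features(a: str, b: str) -> List[float]:
    """Describe a pair of solutions by the element-wise absolute difference."""
    fa, fb = lexical_features(a), lexical_features(b)
    return [abs(x - y) for x, y in zip(fa, fb)]


def binary_f1(y_true: List[int], y_pred: List[int]) -> float:
    """Standard F1-score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In such a setup, the vectors returned by `pair_features` would be passed to any binary classifier (the abstract reports using XGBoost), and `binary_f1` would be evaluated on held-out labels and predictions.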

Cited By

  • (2024) Whodunit: Classifying Code as Human Authored or GPT-4 Generated - A case study on CodeChef problems. Proceedings of the 21st International Conference on Mining Software Repositories, 394-406. DOI: 10.1145/3643991.3644926. Online publication date: 15-Apr-2024
  • (2024) Evaluating uniXcoder Embeddings for Automated Grading: A Study Across Varied Code Perspectives. 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), 1-5. DOI: 10.1109/ICCCNT61001.2024.10725702. Online publication date: 24-Jun-2024
  • (2022) AP-Coach: formative feedback generation for learning introductory programming concepts. 2022 IEEE International Conference on Teaching, Assessment and Learning for Engineering (TALE), 323-330. DOI: 10.1109/TALE54877.2022.00060. Online publication date: Dec-2022

Published In

ACM Other conferences
ACAI '20: Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence
December 2020
576 pages
ISBN:9781450388115
DOI:10.1145/3446132
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. feature selection
  2. plagiarism detection in source code
  3. random forest

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ACAI 2020

Acceptance Rates

Overall Acceptance Rate 173 of 395 submissions, 44%

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 33
  • Downloads (Last 6 weeks): 4
Reflects downloads up to 17 Jan 2025
