
A Machine Learning Based Plagiarism Detection in Source Code

Authors: Nickolay Viuginov, Andrey Filchenkov

ACAI '20: Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence

Article No.: 93, Pages 1 - 6

https://doi.org/10.1145/3446132.3446420

Published: 09 March 2021

Abstract

Converting source code to feature vectors is useful in programming-related tasks such as plagiarism detection in ACM contests. We present a new method for extracting features from C++ files. It combines features describing the syntactic and lexical properties of the abstract syntax tree (AST) with features characterizing the disassembly of the source code. We frame plagiarism detection as a classification problem. We demonstrate the effectiveness of our feature set on a dataset of 50 ACM problems and ∼90k solutions to them. A trained XGBoost model achieves a relative binary F1-score of 0.745 on the test set.
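
To make the pipeline in the abstract concrete, the following is a minimal sketch of turning C++ source text into a fixed-length feature vector and of computing a binary F1-score. It is an illustration under stated assumptions, not the authors' implementation: the keyword-frequency features stand in for the AST and disassembly features, which the abstract does not specify, and the pairwise framing in `pair_features` is only one possible way to pose the classification problem. All names here (`KEYWORDS`, `lexical_features`, `pair_features`, `binary_f1`) are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, simplified feature set. The paper uses AST and disassembly
# features; plain keyword frequencies are only a stand-in for illustration.
KEYWORDS = ("for", "while", "if", "else", "return", "int", "long", "vector")

# Identifiers/keywords, integer literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def lexical_features(source: str) -> list[float]:
    """Map C++ source text to relative frequencies of the chosen keywords."""
    tokens = TOKEN_RE.findall(source)
    counts = Counter(tokens)
    total = max(len(tokens), 1)  # avoid division by zero on empty input
    return [counts[k] / total for k in KEYWORDS]


def pair_features(a: str, b: str) -> list[float]:
    """Assumed pairwise framing: element-wise absolute feature difference."""
    return [abs(x - y) for x, y in zip(lexical_features(a), lexical_features(b))]


def binary_f1(y_true: list[int], y_pred: list[int]) -> float:
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    a = "int main() { for (int i = 0; i < 3; i++) { } return 0; }"
    b = "int main() { int i = 0; while (i < 3) { i++; } return 0; }"
    print(pair_features(a, b))
    # One true positive, one false positive, one false negative.
    print(binary_f1([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
```

In a fuller pipeline, vectors like these would be passed to a gradient-boosted classifier such as XGBoost, and the F1-score would be computed on held-out predictions.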


Cited By

Idialu O, Mathews N, Maipradit R, Atlee J, Nagappan M, Spinellis D, Constantinou E, Bacchelli A (2024) Whodunit: Classifying Code as Human Authored or GPT-4 Generated - A case study on CodeChef problems. Proceedings of the 21st International Conference on Mining Software Repositories, 394-406. Online publication date: 15-Apr-2024.
https://dl.acm.org/doi/10.1145/3643991.3644926

Narmada N, Pati P (2024) Evaluating uniXcoder Embeddings for Automated Grading: A Study Across Varied Code Perspectives. 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), 1-5. Online publication date: 24-Jun-2024.
https://doi.org/10.1109/ICCCNT61001.2024.10725702

Duong T, Shar L, Shankararaman V (2022) AP-Coach: formative feedback generation for learning introductory programming concepts. 2022 IEEE International Conference on Teaching, Assessment and Learning for Engineering (TALE), 323-330. Online publication date: Dec-2022.
https://doi.org/10.1109/TALE54877.2022.00060


Published In


ACAI '20: Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence

December 2020

576 pages

ISBN:9781450388115

DOI:10.1145/3446132

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article
Research
Refereed limited

Conference

ACAI 2020

ACAI 2020: 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence

December 24 - 26, 2020

Sanya, China

Acceptance Rates

Overall Acceptance Rate 173 of 395 submissions, 44%

Article Metrics

Total Citations: 3
Total Downloads: 164
Downloads (Last 12 months): 33
Downloads (Last 6 weeks): 4

Reflects downloads up to 17 Jan 2025

