research-article

Syntax Trees and Information Retrieval to Improve Code Similarity Detection

Authors:
Oscar Karnalim, University of Newcastle, New South Wales, Australia and Maranatha Christian University
Simon, University of Newcastle, New South Wales, Australia

ACE'20: Proceedings of the Twenty-Second Australasian Computing Education Conference, February 2020, Pages 48–55. https://doi.org/10.1145/3373165.3373171

Published: 03 February 2020

ABSTRACT

In dealing with source code plagiarism and collusion, automated code similarity detection can be used to filter student submissions and draw attention to pairs of programs that appear unduly similar. The effectiveness of the detection process can be improved by considering more structural information about each program, but the ensuing computation can increase the processing time. This paper proposes a similarity detection technique that uses richer structural information than normal while maintaining a reasonable execution time. The technique generates the syntax trees of program code files, extracts directly connected n-gram structure tokens from them, and performs the subsequent comparisons using an algorithm from information retrieval, cosine correlation in the vector space model. Evaluation of the approach shows that consideration of the program structure (i.e., syntax tree) increases the recall and f-score (measures of effectiveness) at the expense of execution time (a measure of efficiency). However, the use of an information retrieval comparison process goes some way to offsetting this loss of efficiency.
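
The pipeline described in the abstract (syntax tree, directly connected n-gram structure tokens, cosine comparison in the vector space model) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes Python's built-in `ast` module as the parser, interprets a "directly connected" n-gram as n node types along a single parent-to-child path, and uses raw occurrence counts as vector weights. The names `structure_ngrams` and `cosine_similarity` are hypothetical.

```python
import ast
import math
from collections import Counter


def structure_ngrams(source: str, n: int = 2) -> Counter:
    """Count n-grams of node-type names along parent-to-child paths.

    Assumption: "directly connected" is read here as consecutive nodes
    on one root-to-leaf path of the syntax tree.
    """
    tree = ast.parse(source)
    counts: Counter = Counter()

    def visit(node: ast.AST, path: tuple) -> None:
        # Keep only the last n node types on the path from the root.
        path = (path + (type(node).__name__,))[-n:]
        if len(path) == n:
            counts[path] += 1
        for child in ast.iter_child_nodes(node):
            visit(child, path)

    visit(tree, ())
    return counts


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty program shares nothing with anything
    return dot / (norm_a * norm_b)


# Hypothetical submissions: `renamed` differs from `original` only in identifiers.
original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
renamed = "def add_all(items):\n    acc = 0\n    for item in items:\n        acc += item\n    return acc\n"
different = "def greet(name):\n    print('hello', name)\n"

sim_renamed = cosine_similarity(structure_ngrams(original), structure_ngrams(renamed))
sim_different = cosine_similarity(structure_ngrams(original), structure_ngrams(different))
print(sim_renamed, sim_different)
```

Because the tokens are node types rather than identifiers, renaming variables leaves the vectors unchanged in this sketch, while a structurally different program scores lower. Comparing count vectors is a single pass over the shared tokens, which is one way to see why an information retrieval comparison can be cheaper than pairwise string or tree matching.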

Published in
ACE'20: Proceedings of the Twenty-Second Australasian Computing Education Conference
February 2020, 221 pages
ISBN: 9781450376860
DOI: 10.1145/3373165
Conference Chairs: Andrew Luxton-Reilly (University of Auckland), Claudia Szabo (The University of Adelaide)
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
computing education
information retrieval
plagiarism and collusion in programming
source code similarity detection
syntax tree
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
ACE'20 Paper Acceptance Rate: 23 of 51 submissions, 45%. Overall Acceptance Rate: 161 of 359 submissions, 45%.