DOI: 10.1145/3373165.3373171
research-article

Syntax Trees and Information Retrieval to Improve Code Similarity Detection

Published: 03 February 2020

ABSTRACT

In dealing with source code plagiarism and collusion, automated code similarity detection can be used to filter student submissions and draw attention to pairs of programs that appear unduly similar. The effectiveness of the detection process can be improved by considering more structural information about each program, but the ensuing computation can increase the processing time. This paper proposes a similarity detection technique that uses richer structural information than normal while maintaining a reasonable execution time. The technique generates the syntax trees of program code files, extracts directly connected n-gram structure tokens from them, and performs the subsequent comparisons using an algorithm from information retrieval, cosine correlation in the vector space model. Evaluation of the approach shows that consideration of the program structure (i.e., syntax tree) increases the recall and f-score (measures of effectiveness) at the expense of execution time (a measure of efficiency). However, the use of an information retrieval comparison process goes some way to offsetting this loss of efficiency.
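The pipeline described in the abstract (syntax tree, n-gram structure tokens, cosine comparison in the vector space model) can be illustrated with a minimal sketch. This is not the authors' implementation. It rests on several assumptions: Python's built-in ast module stands in for whatever parser the paper uses, a "directly connected n-gram" is interpreted as a downward path of n parent-child nodes, and vectors hold raw token counts with no further weighting. The function names structure_ngrams and cosine_similarity are hypothetical.

```python
import ast
import math
from collections import Counter


def structure_ngrams(source: str, n: int = 2) -> Counter:
    """Count downward paths of n directly connected syntax-tree nodes.

    Assumption: each token is the sequence of node type names along a
    parent-to-child path, joined with '/'.
    """
    tree = ast.parse(source)
    counts = Counter()

    def walk(node, path):
        # Keep only the last n node type names on the path from the root.
        path = (path + [type(node).__name__])[-n:]
        if len(path) == n:
            counts["/".join(path)] += 1
        for child in ast.iter_child_nodes(node):
            walk(child, path)

    walk(tree, [])
    return counts


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
renamed = "def add_all(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"
unrelated = "print('hello')\n"

sim_renamed = cosine_similarity(structure_ngrams(original), structure_ngrams(renamed))
sim_unrelated = cosine_similarity(structure_ngrams(original), structure_ngrams(unrelated))
```

Because the tokens are built from node types only, renaming identifiers leaves the vectors unchanged, so sim_renamed comes out at (or within rounding of) 1, while sim_unrelated is lower. Each program is reduced to one vector, so comparing a pair costs time proportional to the number of distinct tokens.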

  • Published in

    ACE'20: Proceedings of the Twenty-Second Australasian Computing Education Conference
    February 2020
    221 pages
    ISBN: 9781450376860
    DOI: 10.1145/3373165

    Copyright © 2020 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    ACE'20 Paper Acceptance Rate: 23 of 51 submissions, 45%
    Overall Acceptance Rate: 161 of 359 submissions, 45%
