DOI: 10.1145/3408877.3432436
Research article, SIGCSE Conference Proceedings

Common Code Segment Selection: Semi-Automated Approach and Evaluation

Published: 5 March 2021

ABSTRACT

When comparing student programs for evidence of plagiarism or collusion, the goal is to identify code segments that are common to two or more programs. However, some code segments are common for reasons other than plagiarism or collusion, and so should not be considered. A few code similarity detection tools automatically remove very common segments, but they are prone to false results because no human validation is involved. This paper proposes a semi-automated approach for excluding common segments, in which human validation is introduced before the segments are excluded. As existing selection techniques cannot be detached from their similarity detection tools, we propose a new tool (C2S2) that selects the segments independently, along with several adjustable selection constraints to keep the number of suggested segments manageable for manual inspection. To evaluate automated selection techniques independently, we propose and apply three metrics. The evaluation shows our selection technique to be more effective and efficient than the basis underlying existing selection techniques, and establishes the benefit of each of its selection features.
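The abstract does not specify how C2S2 selects candidate segments. As a minimal sketch of the general idea only (selecting segments that recur across many submissions and presenting them for human validation rather than excluding them automatically), the following Python heuristic flags token n-grams whose document frequency across a cohort of programs exceeds a threshold. The function name, parameters, and thresholds here are invented for illustration and are not the paper's actual algorithm.

```python
from collections import Counter

def candidate_common_segments(programs, n=3, min_share=0.5):
    """Flag token n-grams occurring in at least `min_share` of the programs.

    `programs` is a list of token lists, one per student submission.
    Returns candidates for *human validation*, not automatic exclusion,
    mirroring the semi-automated workflow described in the abstract.
    """
    doc_freq = Counter()
    for tokens in programs:
        # Count each distinct n-gram at most once per program (document frequency).
        grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        doc_freq.update(grams)
    threshold = min_share * len(programs)
    return [gram for gram, df in doc_freq.items() if df >= threshold]

# Toy token streams: a shared boilerplate prefix plus distinct bodies.
progs = [
    ["import", "java", "main", "a", "b"],
    ["import", "java", "main", "c", "d"],
    ["import", "java", "main", "e", "f"],
]
common = candidate_common_segments(progs, n=3, min_share=1.0)
# Only the shared boilerplate prefix survives the threshold.
```

Raising `min_share` keeps the suggested list short, which is the same trade-off as the paper's adjustable selection constraints: a reviewer can only manually inspect a bounded number of candidate segments.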


Published in

SIGCSE '21: Proceedings of the 52nd ACM Technical Symposium on Computer Science Education
March 2021, 1454 pages
ISBN: 9781450380621
DOI: 10.1145/3408877
Copyright © 2021 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate (SIGCSE): 1,595 of 4,542 submissions (35%)
