ABSTRACT
When student programs are compared to check for evidence of plagiarism or collusion, the goal is to identify code segments that are common to two or more programs. Yet some code segments are common for reasons other than plagiarism or collusion, and so should not be considered. A few code similarity detection tools automatically remove very common segments, but because no human validation is involved, they are prone to false results. This paper proposes a semi-automated approach in which a human validates the common segments before they are excluded. As existing selection techniques cannot be detached from their similarity detection tools, we propose a new tool (C2S2) that selects the segments independently, along with several adjustable selection constraints that keep the number of suggested segments reasonable for manual observation. To evaluate automated selection techniques independently, we propose and apply three metrics. The evaluation shows our selection technique to be more effective and efficient than the basis underlying existing selection techniques, and establishes the benefit of each of its selection features.
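The general idea of suggesting common segments for human review can be illustrated with a minimal Python sketch. This is not the C2S2 algorithm, which the abstract does not specify; it assumes that a segment is a token n-gram, that "common" means occurring in at least a given fraction of the programs, and that a cap on the number of suggestions stands in for the adjustable selection constraints. The function names and the parameters `n`, `min_fraction`, and `max_suggestions` are hypothetical.

```python
from collections import Counter


def token_ngrams(tokens, n):
    """Return the set of distinct token n-grams in one program."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def suggest_common_segments(programs, n=3, min_fraction=0.75, max_suggestions=10):
    """Suggest segments that occur in many programs (illustrative only).

    programs: list of token lists, one per student program.
    n, min_fraction, max_suggestions: hypothetical constraints that keep
    the suggestion list short enough for a human to check by hand.
    """
    counts = Counter()
    for tokens in programs:
        # Each program contributes at most once per segment, so the
        # count is the number of programs that contain the segment.
        counts.update(token_ngrams(tokens, n))

    threshold = min_fraction * len(programs)
    candidates = [(seg, c) for seg, c in counts.items() if c >= threshold]
    # Most widespread segments first; ties are broken by token order.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return candidates[:max_suggestions]


if __name__ == "__main__":
    submissions = [
        "public static void main ( String [ ] args )".split(),
        "public static void main ( String [ ] a )".split(),
        "int total = 0 ;".split(),
    ]
    for segment, count in suggest_common_segments(submissions, min_fraction=0.6):
        print(count, " ".join(segment))
```

The function only produces suggestions. In a semi-automated workflow, a person would confirm or reject each suggested segment before it is excluded from similarity detection.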
Index Terms
- Common Code Segment Selection: Semi-Automated Approach and Evaluation