Common Code Segment Selection: Semi-Automated Approach and Evaluation

Authors:
Oscar Karnalim
University of Newcastle & Maranatha Christian University, Ourimbah, Australia

Simon
University of Newcastle, Ourimbah, Australia

SIGCSE '21: Proceedings of the 52nd ACM Technical Symposium on Computer Science Education, March 2021, Pages 335–341
https://doi.org/10.1145/3408877.3432436

Published: 05 March 2021

ABSTRACT

When comparing student programs to check for evidence of plagiarism or collusion, the goal is to identify code segments that are common to two or more programs. Yet some code segments are common for reasons other than plagiarism or collusion, and so should not be considered. A few code similarity detection tools automatically remove very common segments, but they are prone to false results because no human validation is involved. This paper proposes a semi-automated approach for excluding common segments, in which human validation is introduced before the segments are excluded. As existing selection techniques cannot be detached from their similarity detection tools, we propose a new tool (C2S2) that selects the segments independently, along with several adjustable selection constraints that keep the number of suggested segments reasonable for manual observation. To evaluate automated selection techniques independently, we propose and apply three metrics. The evaluation shows our selection technique to be more effective and efficient than the basis underlying existing selection techniques, and establishes the benefit of each of its selection features.
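The abstract does not spell out the selection algorithm, so the following is only a rough illustration of the general idea: suggest segments shared by many programs, then let a human decide which ones to exclude. It is a minimal sketch, assuming token n-grams as segments (n-gram appears among the paper's author tags) and a simple share-of-programs threshold. The names `suggest_common_segments`, `min_share`, `max_suggestions`, and `confirm` are hypothetical and are not taken from C2S2.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the set of distinct n-grams (as tuples) in one token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def suggest_common_segments(programs, n=4, min_share=0.75, max_suggestions=10):
    """Suggest n-grams that occur in at least `min_share` of the programs.

    Hypothetical sketch: these thresholds only stand in for the idea of
    adjustable selection constraints; they are not C2S2's actual ones.
    """
    doc_freq = Counter()
    for tokens in programs:
        doc_freq.update(ngrams(tokens, n))  # count each n-gram once per program
    needed = min_share * len(programs)
    common = [(gram, count) for gram, count in doc_freq.items() if count >= needed]
    # Most widespread first; ties broken by the n-gram itself for a stable order.
    common.sort(key=lambda item: (-item[1], item[0]))
    return common[:max_suggestions]


def confirm(suggestions, approve):
    """Human validation step: keep only the suggestions the reviewer approves."""
    return [gram for gram, _ in suggestions if approve(gram)]


# Three toy "student programs", already tokenised on whitespace.
programs = [
    "public static void main ( String [ ] args ) { int x = 1 ; }".split(),
    "public static void main ( String [ ] args ) { foo ( ) ; }".split(),
    "public static void main ( String [ ] args ) { while ( true ) ; }".split(),
]
suggestions = suggest_common_segments(programs, n=4, min_share=1.0)
# The lambda stands in for an instructor's manual decision.
excluded = confirm(suggestions, approve=lambda gram: gram[0] == "public")
```

Capping the list with `max_suggestions` reflects the abstract's concern that the number of suggested segments must stay small enough for manual review; a real tool would likely tokenise with a proper lexer rather than whitespace splitting.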

Published in
SIGCSE '21: Proceedings of the 52nd ACM Technical Symposium on Computer Science Education
March 2021
1454 pages
ISBN: 9781450380621
DOI: 10.1145/3408877
General Chairs:
Mark Sherriff, University of Virginia, USA
Laurence D. Merkle, Air Force Institute of Technology, USA
Program Chairs:
Pamela Cutter, Kalamazoo College, USA
Alvaro Monge, California State University, Long Beach, USA
Judithe Sheard, Monash University, Australia
Copyright © 2021 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
code similarity
collusion
common code segment
n-gram
plagiarism
semi-automated approach
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 1,595 of 4,542 submissions, 35%