ABSTRACT
Automated code similarity detection can help address source code plagiarism and collusion by filtering student submissions and drawing attention to pairs of programs that appear unduly similar. The effectiveness of detection can be improved by considering more structural information about each program, but the additional computation can increase processing time. This paper proposes a similarity detection technique that uses richer structural information than is usual while maintaining a reasonable execution time. The technique generates the syntax tree of each program code file, extracts directly connected n-gram structure tokens from it, and performs the subsequent comparisons using a method from information retrieval: cosine correlation in the vector space model. Evaluation of the approach shows that considering program structure (i.e., the syntax tree) increases recall and F-score (measures of effectiveness) at the expense of execution time (a measure of efficiency). However, the use of an information retrieval comparison process goes some way toward offsetting this loss of efficiency.
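The pipeline described in the abstract (syntax tree, connected n-gram tokens, cosine comparison) can be illustrated with a minimal sketch. This is not the authors' implementation; it rests on several assumptions. Python's standard `ast` module stands in for whatever parser the paper uses, a "directly connected n-gram" is interpreted as a downward parent-to-child path of n node types, and vectors hold raw token counts without any weighting. The function names `path_ngrams` and `cosine_similarity` are hypothetical.

```python
import ast
import math
from collections import Counter


def path_ngrams(source: str, n: int = 2) -> Counter:
    """Count parent-to-child paths of n node types in the syntax tree.

    Assumption: this is one plausible reading of "directly connected
    n-gram structure tokens", not necessarily the paper's definition.
    """
    counts = Counter()

    def walk(node, ancestors):
        # Keep only the last n node types on the path from the root.
        path = (ancestors + (type(node).__name__,))[-n:]
        if len(path) == n:
            counts[path] += 1
        for child in ast.iter_child_nodes(node):
            walk(child, path)

    walk(ast.parse(source), ())
    return counts


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
renamed = "def add_all(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"
unrelated = "print('hello')\n"

sim_renamed = cosine_similarity(path_ngrams(original), path_ngrams(renamed))
sim_unrelated = cosine_similarity(path_ngrams(original), path_ngrams(unrelated))
```

Because the tokens are built from node types rather than identifier names, the renamed copy yields the same token counts as the original, while the unrelated program shares few tokens with it. A production system would more likely index the token vectors rather than compare every pair directly; that step is omitted here.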