An application for plagiarized source code detection based on a parse tree kernel

https://doi.org/10.1016/j.engappai.2013.06.007

Highlights

  • Program plagiarism detection method that relies on parse tree similarities.

  • Parse trees are compared in a kernel space.

  • A new source code parse tree kernel is proposed to improve detection performance.

  • Evaluation with real-world data showed an F1 score of up to 0.93.

Abstract

Program plagiarism detection is the task of detecting plagiarized code pairs among a set of source codes. In this paper, we propose a code plagiarism detection system that uses a parse tree kernel. Our parse tree kernel calculates a similarity value between two source codes in terms of their parse tree similarity. Since parse trees contain the essential syntactic structure of source codes, the system effectively handles structural information. The contributions of this paper are two-fold. First, we propose a parse tree kernel that is optimized for program source code. The evaluation shows that our system based on this kernel outperforms well-known baseline systems. Second, we collected a large number of real-world Java source codes from a university programming class. This test set was manually analyzed and tagged by two independent human annotators to mark plagiarized codes. It can be used to evaluate the performance of various detection systems in real-world environments. The experiments with the test set show that the performance of our plagiarism detection system reaches 93% of the level of human annotators.

Introduction

Plagiarism, defined as “using someone else's work as their own without reference to original sources” (Maurer et al., 2006), has received much attention in diverse fields. The attention to plagiarism is constantly increasing due to the growth of information technology. The Internet, digital documents, and file sharing systems have made it easy to access more and more information, including program source codes. Plagiarism is considered one of the most severe problems in education, since students can submit their course assignments without any understanding of the subject by plagiarizing someone else's work. According to Evans (2006), over 30% of students have copied some text or even an entire paper without reference. Therefore, there is much demand for methods to deal with students' plagiarism.

This paper focuses on detecting structured text plagiarism, especially in program source code. Parker and Hamblen (1989) defined program source code plagiarism as a program that has been produced from another program with a small number of routine changes. One key part of plagiarism detection systems is the similarity measure. According to White and Joy (2004), plagiarism detection tools can be regarded as programs that compare documents in order to identify similarities and discover submissions that might be plagiarized.

The similarity measure for plagiarism detection systems should reflect the characteristics of the program source code. Compared with normal text, source code has some unique characteristics:

  • Source code is composed of a large set of rarely occurring user-defined words and a small set of frequently occurring reserved words like for, while, and if.

  • Even though two pieces of source code can differ greatly in terms of string similarity, they can achieve the same functionality.

  • Source code has a structure that is determined by the reserved words.

The first and second characteristics imply that the character-level comparison often adopted in plagiarism detection for normal text is not suitable for program plagiarism detection. A clue to good program plagiarism detection can be found in the third characteristic. When plagiarizing source code, it is much harder to change the structure of the code than its vocabulary. Converting user-defined vocabulary is easy, and can be done without a proper understanding of the source code. However, redefining the structure of a program is very tricky and often as hard as writing the module from scratch. This means that the program structure is an important feature for plagiarism detection. Some previous program plagiarism detection systems attempted to design similarity measures that reflected structural information to some extent. However, since most of them defined their structural features on the lexical level, their ability to compare entire structures is fairly limited. The structural information of a source code can be represented by its parse tree. Thus, a software plagiarism detection system should be able to use parse trees to incorporate structural information.
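The contrast between string-level and structure-level comparison can be illustrated with a small sketch. It is not part of the proposed system: it uses Python snippets and Python's standard `ast` and `difflib` modules as stand-ins for Java sources and a Java parser, and the two sample programs are invented for illustration.

```python
import ast
import difflib

original = """
def total(values):
    result = 0
    for v in values:
        result = result + v
    return result
"""

# Hypothetical plagiarized copy: every user-defined identifier is renamed,
# the structure is left untouched.
renamed = """
def acc(items):
    s = 0
    for x in items:
        s = s + x
    return s
"""


def string_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def structure(source: str) -> list:
    """Sequence of syntax-tree node type names; identifiers are ignored."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


# The character-level score drops below 1.0 after renaming ...
print(string_similarity(original, renamed))
# ... while the sequence of syntax-tree node types is unchanged.
print(structure(original) == structure(renamed))
```

Renaming lowers the character-level score, yet the syntax trees still have identical node types, which is the property a structure-based similarity measure can exploit.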

Defining a metric for parse trees is not a trivial task, since it is generally harder to define a metric for structured data than for flat feature vectors. One of the prominent methods for comparing structured data is the use of kernel functions. Haussler (1999) first gave a mathematically well-founded way to define a kernel function for structured data, the so-called R-convolution kernel. An R-convolution kernel defines a kernel space with infinitely many dimensions, where each dimension corresponds to a possible substructure. By comparing two given structures in the kernel space (without explicitly generating the infinite dimensions, which is known as the kernel trick), a convolution kernel can make a structural comparison without manually selected structural features.
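For reference, the general form of an R-convolution kernel can be written as below. The notation is an assumption made for illustration and does not appear in this excerpt: a relation R decomposes a structure x into parts (x_1, ..., x_D), R^{-1}(x) is the set of all such decompositions, and K_d is a base kernel on the d-th part.

```latex
K(x, y) \;=\; \sum_{\vec{x} \in R^{-1}(x)} \; \sum_{\vec{y} \in R^{-1}(y)} \; \prod_{d=1}^{D} K_d(x_d, y_d)
```

For trees, the unweighted case reduces to an inner product over substructure counts, where h_i(T) is the number of times the i-th possible substructure occurs in tree T:

```latex
K(T_1, T_2) \;=\; \sum_{i} h_i(T_1)\, h_i(T_2)
```

The sum over i ranges over all possible substructures, which is why the feature space is never built explicitly.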

In this paper, an effective plagiarism detection system for the Java language is proposed. The proposed system adopts a parse tree kernel (Collins and Duffy, 2001), which is an R-convolution kernel for tree structures. The parse tree kernel computes the similarity value between a pair of parse trees. Thus, the structural information of the source code is fully reflected in the proposed system. Parse tree kernels have been successfully used in natural language processing (NLP). However, the parse tree kernel used in NLP does not perform well for program source code due to two issues. The first issue is the asymmetric influence of node changes. In previous tree kernels, changes near the root node have a larger influence than changes near leaf nodes. When the tree depth is small, this effect is not serious. However, the parse trees of program sources are much larger than those of natural language sentences, and this unwanted influence greatly affects tree comparison. The second issue is the order of subtrees. Previous tree kernels take the order of subtrees into account. Unlike in natural language sentences, the order of two substructures (such as the order of two methods in a Java class) carries little information in program source code. We identified these two issues and propose a new parse tree kernel for program source code.
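As background, the baseline NLP kernel of Collins and Duffy (2001) can be sketched as follows. This is the standard subtree-counting recursion with a decay factor, not the modified kernel proposed in this paper, whose definition is not included in this excerpt. The `Node` class, the function names, and the toy trees are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Node:
    """Parse tree node; a node without children is a terminal (leaf)."""
    label: str
    children: tuple = ()


def production(node):
    """Grammar production at a node: its label plus the ordered child labels."""
    return node.label, tuple(child.label for child in node.children)


def all_nodes(tree):
    """Yield every node of the tree in pre-order."""
    yield tree
    for child in tree.children:
        yield from all_nodes(child)


def common_subtrees(n1, n2, lam):
    """Decay-weighted count of common subtrees rooted at n1 and n2."""
    if not n1.children or not n2.children:
        return 0.0  # terminals do not root any subtree
    if production(n1) != production(n2):
        return 0.0  # different productions share nothing at this root
    score = lam
    # Children are aligned by position, so subtree order matters here.
    for c1, c2 in zip(n1.children, n2.children):
        score *= 1.0 + common_subtrees(c1, c2, lam)
    return score


def tree_kernel(t1, t2, lam=1.0):
    """Sum the common-subtree scores over all node pairs of the two trees."""
    pairs = product(all_nodes(t1), all_nodes(t2))
    return sum(common_subtrees(a, b, lam) for a, b in pairs)


# Toy trees: (S (A a) (B b)) and (S (A a) (B c)).
t1 = Node("S", (Node("A", (Node("a"),)), Node("B", (Node("b"),))))
t2 = Node("S", (Node("A", (Node("a"),)), Node("B", (Node("c"),))))

print(tree_kernel(t1, t1))  # 6.0: the six subtrees of t1
print(tree_kernel(t1, t2))  # 3.0: [A a], [S A B], [S (A a) B]
```

Because productions compare child labels in order and the scores multiply toward the root, this baseline is sensitive both to subtree order and to changes near the root, which are the two issues named in the paragraph above.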

The proposed plagiarism detection is performed in three steps: First, parse trees are generated from the source codes. Second, all the parse trees are compared via the proposed parse tree kernel, and similarity values between all pairs are obtained. Finally, the groups of source codes that are most likely to be plagiarized are selected according to the similarity values.
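The three steps can be sketched end to end as below. This is a minimal illustration under several assumptions: Python's `ast` module stands in for a Java syntactic analyzer, the similarity is a normalized baseline subtree kernel rather than the kernel proposed in the paper, and the submission names, the `detect` function, and the 0.9 threshold are hypothetical.

```python
import ast
import math
from itertools import combinations


def to_tree(node):
    """Step 1: turn a syntax tree into (label, children); identifiers are dropped."""
    children = tuple(to_tree(c) for c in ast.iter_child_nodes(node))
    return type(node).__name__, children


def all_nodes(tree):
    yield tree
    for child in tree[1]:
        yield from all_nodes(child)


def common(n1, n2, lam):
    """Decay-weighted count of common subtrees rooted at n1 and n2."""
    (l1, c1), (l2, c2) = n1, n2
    if not c1 or not c2:
        return 0.0
    if l1 != l2 or [c[0] for c in c1] != [c[0] for c in c2]:
        return 0.0
    score = lam
    for a, b in zip(c1, c2):
        score *= 1.0 + common(a, b, lam)
    return score


def kernel(t1, t2, lam):
    return sum(common(a, b, lam) for a in all_nodes(t1) for b in all_nodes(t2))


def similarity(t1, t2, lam=0.3):
    """Step 2: kernel value normalized to [0, 1] (assumes non-empty programs)."""
    return kernel(t1, t2, lam) / math.sqrt(kernel(t1, t1, lam) * kernel(t2, t2, lam))


def detect(sources, threshold=0.9):
    """Step 3: report every pair whose similarity reaches the threshold."""
    trees = {name: to_tree(ast.parse(code)) for name, code in sources.items()}
    flagged = []
    for a, b in combinations(trees, 2):
        score = similarity(trees[a], trees[b])
        if score >= threshold:
            flagged.append((a, b, score))
    return flagged


submissions = {
    "s1": "def total(values):\n    r = 0\n    for v in values:\n        r = r + v\n    return r\n",
    # s2 is s1 with renamed identifiers; s3 is an unrelated program.
    "s2": "def acc(items):\n    s = 0\n    for x in items:\n        s = s + x\n    return s\n",
    "s3": "def greet(name):\n    if name:\n        print('hello', name)\n    else:\n        print('nobody')\n",
}

for a, b, score in detect(submissions):
    print(a, b, round(score, 3))
```

In this toy run only the renamed pair is reported. Normalizing by the self-similarities keeps large programs from dominating the scores, a common choice for kernels that is assumed here rather than taken from the paper.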

The performance of the proposed system is evaluated with two evaluation sets. In the first experiment, our system is evaluated using a synthesized data set. In this data set, several well-known plagiarism methods (such as replacing variable names and inserting redundant code) are simulated to generate different types of plagiarized source codes. In the second experiment, a real-world Java source code collection from a college programming class is used as the data set. The experimental results show that the proposed system can successfully detect real-world program plagiarism at 93% of the level of human annotators.
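Agreement between detected pairs and annotator-marked pairs is commonly summarized with precision, recall, and the F1 score. The sketch below shows one way to compute them over unordered pairs. The exact evaluation protocol is not given in this excerpt, so the function name and the sample pairs are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Compare two collections of unordered pairs of submission names."""
    predicted = {frozenset(p) for p in predicted}
    gold = {frozenset(p) for p in gold}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: three annotated pairs, three detected pairs, two in common.
gold_pairs = [("s1", "s2"), ("s3", "s4"), ("s5", "s6")]
detected_pairs = [("s2", "s1"), ("s3", "s4"), ("s1", "s7")]

precision, recall, f1 = precision_recall_f1(detected_pairs, gold_pairs)
print(precision, recall, f1)
```

Pairs are stored as `frozenset` objects so that ("s2", "s1") and ("s1", "s2") count as the same pair.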

This paper is organized as follows. Section 2 presents related work and previous plagiarism detection systems. Section 3 describes the proposed system. Section 4 details the two experiments and the paper is concluded in Section 5.

Section snippets

Related work

The characteristics of intellectual properties are important in detecting partial copying of the properties. For written texts, various anti-plagiarism systems have been introduced including COPS (Brin et al., 1995), SCAM (Shivakumar and García-Molina, 1995), and CHECK (Si et al., 1997). Oberreuter and Velásquez (2013) recently proposed a text mining technique to detect plagiarized documents. In their work, a document is divided into segments by using a sliding window of a certain length over

Source code plagiarism detection

The proposed system detects plagiarism by using the parse trees of a piece of source code. Fig. 1 illustrates the proposed system that operates in three steps. In the first step, the system extracts parse trees from the source codes with a syntactic analyzer. Then, the pair-wise similarities between all pairs of the source code are calculated using the parse tree kernel. This results in an all-pair similarity matrix. Finally, in the third step, plagiarized pairs are detected by selecting pairs

Experiments

The proposed system is evaluated using two data sets: a synthesized data set and a real-world data set. The goal of the first experiment is to evaluate the effectiveness of the proposed system against specific “plagiarism attacks”. The second experiment aims to evaluate the proposed system in a real-world environment. In all experiments, the threshold for subtree depth Δ is set to 3 and the decay factor λ is set to 0.3; both values were chosen heuristically.

Conclusions

In this paper, we proposed an automatic program plagiarism detection system. The proposed system compares source codes with a specialized tree kernel for parse trees of source codes. Parse tree kernels used in the NLP domain are not appropriate for program source codes because of the characteristics of such codes. Compared to the parse trees of natural language sentences, the parse trees of program source codes are much larger and deeper, and the order of subtrees is not informative. We proposed a specialized

Role of funding source

This work was supported in part by the Industrial Strategic Technology Development Program (10035348, Development of a Cognitive Planning and Learning Model for Mobile Platforms) funded by the Ministry of Knowledge Economy (MKE, Korea). The funding sources did not influence the research direction and submission decision.

References (24)

  • Berghel, H., et al., 1984. Measurements of program similarity in identical task environments. SIGPLAN Not.
  • Bravo-Marquez, F., L'Huillier, G., Rios, S., Velásquez, J., 2011. A text similarity meta-search engine based on...
  • Brin, S., Davis, J., García-Molina, H., 1995. Copy detection mechanisms for digital documents. In: Proceedings of the...
  • Chen, X., et al., 2004. Shared information and program plagiarism detection. IEEE Trans. Inf. Theory.
  • Collins, M., Duffy, N., 2001. Convolution kernels for natural language. In: Proceedings of the NIPS 2001, pp....
  • Durić, Z., et al., 2013. A source code similarity system for plagiarism detection. Comput. J.
  • Evans, R., 2006. Evaluating an electronic plagiarism detection service. Active Learn. Higher Educ.
  • Gitchell, D., Tran, N., 1998. A utility for detecting similarity in computer programs. In: Proceedings of the 30th ACM...
  • Grier, S., 1981. A tool that detects plagiarism in Pascal programs. In: Twelfth SIGCSE Technical Symposium, vol. 13,...
  • Halstead, M., 1977. Elements of Software Science.
  • Haussler, D., 1999. Convolution kernels on discrete structures. Technical Report UCS-CRL-99-10 University of California...
  • Kamiya, T., et al., 2002. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Software Eng.