Elsevier

Knowledge-Based Systems

Volume 135, 1 November 2017, Pages 135-146
A linguistic treatment for automatic external plagiarism detection

https://doi.org/10.1016/j.knosys.2017.08.008

Highlights

  • We propose a specialized method for external plagiarism detection.

  • It integrates semantic and syntactic information to capture the meaning of passages.

  • It uses the SRL method to detect passive and active sentences.

  • It detects copied text, paraphrasing, transformation of sentences and changing of word structure.

  • Results show that it is preferable to the PAN-11 systems and other methods.

Abstract

Plagiarism is the unauthorized use of someone else's ideas, or the presentation of someone else's words or work as one's own. This paper presents an External Plagiarism Detection System (EPDS), which combines the Semantic Role Labeling (SRL) technique with semantic and syntactic information. Most available methods fail to capture the meaning when comparing a source document sentence with a suspicious document sentence that has the same surface text, which leads to incorrect or even unnecessary matching results. The proposed method, by contrast, avoids selecting a source sentence whose similarity with a suspicious sentence is high but whose meaning is different. An author may also change a sentence from active to passive voice or vice versa; the method employs the SRL technique to address this challenge. Furthermore, the method uses a content word expansion approach to bridge lexical gaps and to identify similar ideas that are expressed in different wording. The proposed method is able to detect different types of plagiarism, such as exact verbatim copying, paraphrasing, transformation of sentences, and changing of word structure. The experimental results show that the proposed method improves performance compared with the participating systems in PAN-PC-11 and other existing techniques.

Introduction

With the explosive growth of the internet, the massive amount of available information makes it easy to take someone else's work or idea and present it as one's own. Because this information is so abundant and so easily accessible, the act of plagiarism is rapidly increasing. Plagiarism is described in the thesaurus as the representation of someone else's words without acknowledgement [49]. Plagiarism happens when a new text purposely uses existing resources or materials [3], [11], [48]. This might include changing the thought, idea, and opinion. Text plagiarism can take the form of ‘verbatim copying’, ‘cutting sentences’, ‘combining sentences’, or ‘paraphrasing’ [31]. Nowadays, in various areas such as publishing, journalism, patent verification, and academia, plagiarism detection is essential to ensure the originality of texts, materials, and resources [35]. Plagiarism detection can be performed manually (by a human) or automatically (by a computer system).

Automatic plagiarism detection can be performed in two different ways. The first detects the similarity between a suspicious document and the original documents that exist within a reference text dataset. This approach is called external plagiarism detection [50]. The second detects plagiarism within a suspicious document without needing a reference text dataset. Because it relies only on the suspicious document itself, this approach is called intrinsic plagiarism detection [26].

Today, plagiarism is a serious crime. To defend the rights of the original owners of works, plagiarism detection is being investigated as a research field in computer science. Several approaches have been proposed to detect plagiarized material. Generally, these approaches compare two documents at the word or sentence level [42].

In this paper, our objective is two-fold. First, we propose a plagiarism detection method that combines several linguistic features: syntactic information (word-order), semantic information (SRL) and content word expansion. Although some previous systems used n-grams to consider syntactic information, and others used the SRL technique to consider semantic information, to the best of our knowledge none of them combined syntactic information (word-order), semantic information (SRL) and content word expansion. We incorporate these features in order to (a) capture the meaning of the sentences; (b) detect passive and active sentences; (c) bridge the lexical gaps. Second, we aim to compare the performance of the proposed method with the participating systems in PAN-PC-11 and other existing methods.

In the context of text relevance, the semantic relations between words and their syntactic structure play a key role in sentence comprehension. Syntactic information, such as word-order, can provide important information to distinguish the meaning of two sentences when they share a similar bag-of-words. For instance, “S1: Teacher helps student” and “S2: Student helps teacher” will be judged as similar sentences because they have the same surface text. However, their meaning is different. Therefore, to compare a source document with a suspicious document, the proposed method should incorporate syntactic information to determine the suspicious similarity between them; otherwise, it fails to capture the meaning in the comparison and often cannot reliably identify the suspicious similarity between documents. This leads to incorrect or even unnecessary matching results.
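The S1/S2 example can be illustrated with a minimal Python sketch. It is not the implementation used in the paper: the bag-of-words score is a plain Jaccard coefficient, and the word-order score uses one common formulation (one minus the ratio of the norm of the difference to the norm of the sum of two position vectors), which is an assumption here. The function names `tokens`, `jaccard`, `order_vector` and `word_order_similarity` are hypothetical.

```python
import math


def tokens(sentence):
    """Lower-case whitespace tokenisation (a deliberate simplification)."""
    return sentence.lower().split()


def jaccard(s1, s2):
    """Bag-of-words overlap: ignores where the words appear."""
    a, b = set(tokens(s1)), set(tokens(s2))
    return len(a & b) / len(a | b)


def order_vector(sentence, vocabulary):
    """1-based position of each vocabulary word in the sentence, 0 if absent."""
    words = tokens(sentence)
    return [words.index(w) + 1 if w in words else 0 for w in vocabulary]


def word_order_similarity(s1, s2):
    """Assumed formulation: 1 - ||r1 - r2|| / ||r1 + r2||."""
    vocabulary = sorted(set(tokens(s1)) | set(tokens(s2)))
    r1 = order_vector(s1, vocabulary)
    r2 = order_vector(s2, vocabulary)
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return 1.0 - diff / total


s1 = "Teacher helps student"
s2 = "Student helps teacher"

print(jaccard(s1, s2))                # identical bags of words
print(word_order_similarity(s1, s2))  # lower, because the word order differs
```

Under these assumptions the Jaccard score cannot separate S1 from S2, whereas the word-order score drops below 1, which is the kind of signal the paragraph above argues for.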

Another plagiarism scenario can be illustrated with the following example. Given two sentences (source sentence: “Father likes his child”; suspected sentence: “Child is liked by his father”), an author may have changed the structure of the source sentence. As the example shows, the structure of two sentences may differ when one uses the active voice and the other the passive voice. The proposed method uses the Semantic Role Labeling (SRL) approach to identify this type of plagiarism. The SRL approach is able to capture the arguments (subject, object, verb and indirect object) of a sentence even when their positions in the sentence change. The main task of the SRL approach is to detect the semantic arguments associated with the verb of a sentence and to classify them into their specific roles. The SRL approach is described in detail in Section 3.3.6.
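The idea can be sketched in Python with hand-labelled role frames. No SRL tool is invoked here: the frames below are written by hand for illustration, and the role names (`predicate`, `agent`, `patient`) and the function `role_similarity` are hypothetical, not the label set or scoring used in the paper.

```python
def role_similarity(frame_a, frame_b):
    """Fraction of roles that are filled with the same value in both frames."""
    roles = set(frame_a) | set(frame_b)
    if not roles:
        return 0.0
    matches = sum(
        1 for role in roles
        if role in frame_a and frame_a.get(role) == frame_b.get(role)
    )
    return matches / len(roles)


# Hand-written frames (assumed output of some SRL step, lemmatised).
# "Father likes his child" (active voice)
source_frame = {"predicate": "like", "agent": "father", "patient": "child"}
# "Child is liked by his father" (passive voice): same roles, different order
suspect_frame = {"predicate": "like", "agent": "father", "patient": "child"}
# "Child likes his father": same words, but the roles are swapped
swapped_frame = {"predicate": "like", "agent": "child", "patient": "father"}

print(role_similarity(source_frame, suspect_frame))  # roles agree
print(role_similarity(source_frame, swapped_frame))  # only the predicate agrees
```

Comparing role fillers rather than surface positions lets the active and passive variants match fully, while a sentence that reuses the same words with swapped roles scores lower.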

Furthermore, two sentences are considered similar if most of their words are the same or if they are paraphrases of each other. However, as sentences S3 and S4 show (S3: “Teacher helps the student”; S4: “Instructor aids student”), sentences with similar meaning do not necessarily share many words, because people can express the same meaning using different word content. Hence, semantic information, such as the semantic similarity between words and synonymous words, provides useful information when two sentences have similar meaning but use different words. A plagiarized sentence may be expressed with similar words rather than the original words of the source document sentence; the semantic information therefore helps to identify similar ideas when an author presents someone else's idea in his or her own words through text manipulation, paraphrasing or synonym substitution.
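A minimal Python sketch of content word expansion on S3 and S4 follows. The synonym table and stop-word list are tiny hand-made stand-ins for a lexical database such as WordNet, and the names `SYNONYMS`, `STOP_WORDS`, `content_words`, `expand` and `overlap` are hypothetical; the scoring is a simple coverage ratio chosen for illustration, not the measure used in the paper.

```python
# Toy stand-ins for a lexical database and a stop-word list (assumptions).
SYNONYMS = {
    "teacher": {"instructor", "tutor"},
    "helps": {"aids", "assists"},
}
STOP_WORDS = {"the", "a", "an"}


def content_words(sentence):
    """Lower-cased words of the sentence with stop words removed."""
    return {w for w in sentence.lower().split() if w not in STOP_WORDS}


def expand(words):
    """Add the known synonyms of every word to the set."""
    expanded = set(words)
    for word in words:
        expanded |= SYNONYMS.get(word, set())
    return expanded


def overlap(source, suspect, use_expansion):
    """Share of the suspect's content words covered by the source sentence."""
    src, sus = content_words(source), content_words(suspect)
    if not sus:
        return 0.0
    if use_expansion:
        src = expand(src)
    return len(src & sus) / len(sus)


s3 = "Teacher helps the student"
s4 = "Instructor aids student"

print(overlap(s3, s4, use_expansion=False))  # only "student" matches
print(overlap(s3, s4, use_expansion=True))   # synonyms bridge the lexical gap
```

Without expansion only one of the three content words of S4 is covered; with expansion all three are, which illustrates how synonym information can bridge the vocabulary mismatch described above.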

In summary, we make the following contributions in this research: (1) we present a robust external plagiarism detection method based on a combined set of linguistic features: semantic information, syntactic information and the SRL technique; (2) our proposed method is a comprehensive plagiarism detection technique that covers many types of plagiarism, such as ‘verbatim copying’, ‘paraphrasing’, ‘changing of word structure’ and ‘changing a sentence from active to passive and vice versa’; (3) the method combines semantic and syntactic information to distinguish the meaning of two sentences when they share a similar bag-of-words; (4) the method employs a semantic approach to plagiarism detection based on Semantic Role Labeling, so that it does not analyze the content of a text document as text syntax only, but also captures the underlying semantic meaning in terms of the relationships among its words; (5) the method employs a word semantic similarity measure to overcome the vocabulary mismatch problem in sentence comparison; (6) we show that our method, tested on the PAN-PC-11 dataset, provides results that are competitive with other methods.

The system can be applied to any language, provided that the lexical database (WordNet), the stop-word list, the shallow parser and the SRL approach are replaced with the lexical database, stop-word list, parser and SRL technique of the target language, respectively.

The rest of this paper is structured as follows. In Section 2 of this paper, we consider related work on plagiarism detection. In Section 3, we explain our proposed system. We then summarize the experimental results in Section 4. Finally, we conclude this paper in Section 5.

Brief review of literature

Plagiarism is defined as “the re-use of someone else's ideas, processes, results, or words without explicitly acknowledging the original author and source” [8]. A plagiarism case can be formally defined as a 4-tuple Plag case = {Ssus, dsus, Ssrc, dsrc}, in which a passage Ssus in a document dsus is the plagiarized version of some source passage Ssrc in dsrc. The main task of the plagiarism detector is to find Plag case. There are three main stages in plagiarism detection process [7,20

Proposed method

The overall system structure of IEPDM for plagiarism detection is shown in Fig. 1. Let Docsus be a suspicious document, Corsrc a large corpus of documents, and Docsrc a set of candidate or relevant documents. The IEPDM performs the following main steps:

  1. Pre-processing step: in this step different NLP approaches are applied to the source and suspicious documents.

  2. Candidate retrieval step: this step retrieves the Docsrc that is most similar to the Docsus.

  3. Detailed comparison step: in the current

Experiments

We conducted our experiment on the PAN-PC-11 dataset to evaluate our proposed method.

Conclusion and future work

Automatic plagiarism detection systems aim to provide experts with evidence for making decisions about potential cases of unauthorized text re-use. In this paper, we introduced a system to detect different types of plagiarism. A significant aspect of the proposed method is that it is able to capture the meaning when comparing two passages that have the same surface text or that employ different words or synonyms.

On the other hand, the method is able to

Acknowledgements

This work is supported by The Ministry of Higher Education (MOHE) under Q.J130000.21A2.03E53 - STATISTICAL MACHINE LEARNING METHODS TO TEXT SUMMARIZATIONS. The authors would like to thank Research Management Centre (RMC), Universiti Teknologi Malaysia (UTM) for the support in R & D, UTM Big Data Centre (BDC) for the inspiration in making this study a success. The authors would also like to thank the anonymous reviewers who have contributed enormously to this work.

References (50)

  • C.F. Baker et al.

    The Berkeley FrameNet project

  • A. Barrón-Cedeño et al.

    Plagiarism meets paraphrasing: insights for the next generation in automatic plagiarism detection

    Comput. Linguist.

    (2013)
  • S. Bird

    NLTK: the natural language toolkit

  • D. Bollegala, Y. Matsuo, M. Ishizuka, Measuring semantic similarity between words using web search engines. WWW 2007 /...
  • P. Clough et al.

    Developing a corpus of plagiarised short answers

    Lang. Resour. Eval.

    (2011)
  • N. Cooke et al.

    A high-performance plagiarism detection system-notebook for PAN at CLEF

  • N. Ehsan et al.

    A pairwise document analysis approach for monolingual plagiarism detection

  • A. Ekbal et al.

    Plagiarism detection in text using vector space model

  • D. Gildea et al.

    Automatic labeling of semantic roles

    Comput. Linguist.

    (2002)
  • J. Grman et al.

    Improved implementation for finding text similarities in large collections of data: notebook for PAN at CLEF

  • C. Grozea et al.

    ENCOPLOT: Pairwise sequence matching in linear time applied to plagiarism detection

  • S. Hiremath et al.

    Plagiarism detection-different methods and their analysis: review

    Int. J. Innov. Res. Adv. Eng.

    (2014)
  • P. Jaccard

    The distribution of the flora in the alpine zone

    New Phytolog.

    (1912)
  • J. Kasprzak et al.

    Finding plagiarism by evaluating document similarities

  • K. Kipper et al.

    Class-based construction of a verb lexicon
