A linguistic treatment for automatic external plagiarism detection
Introduction
The explosive growth of the internet has put massive amounts of information within easy reach, which makes it easy to take someone else's work or idea and present it as one's own. As a result, plagiarism is increasing rapidly. The thesaurus defines plagiarism as the representation of someone else's words without acknowledgement [49]. Plagiarism happens when a new text purposely uses existing resources or materials [3], [11], [48]; this may include altering the original thought, idea, or opinion. Text plagiarism can take the form of ‘verbatim copying’, ‘cutting sentences’, ‘combining sentences’, or ‘paraphrasing’ [31]. Nowadays, in areas such as publishing, journalism, patent verification, and academia, plagiarism detection is essential to ensure the originality of texts, materials, and resources [35]. Plagiarism detection can be performed manually (by a human) or automatically (by a computer system).
Automatic plagiarism detection can be performed in two ways. The first detects the similarity between a suspicious document and the original documents held in a reference text dataset; this approach is called external plagiarism detection [50]. The second detects plagiarism within a suspicious document without needing a reference text dataset; this approach is called intrinsic plagiarism detection [26].
Today, plagiarism is treated as a serious offence. To defend the rights of the original owners of works, plagiarism detection is being investigated as a research field in computer science, and several approaches have been proposed to detect plagiarized material. Generally, these approaches compare two documents at the word or sentence level [42].
In this paper, our objective is twofold. First, we propose a plagiarism detection method that combines several linguistic features: syntactic information (word order), semantic information (SRL), and content word expansion. Although some previous systems used n-grams to capture syntactic information, and others used the SRL technique to capture semantic information, to the best of our knowledge none of them combined syntactic information (word order), semantic information (SRL), and content word expansion. We incorporate these features in order to (a) capture the meaning of sentences; (b) detect passive and active sentences; and (c) bridge lexical gaps. Second, we compare the performance of the proposed method with the systems that participated in PAN-PC-11 and with other existing methods.
In the context of text relevance, the semantic relations between words and their syntactic structure play a key role in sentence comprehension. Syntactic information, such as word order, provides important evidence for distinguishing the meaning of two sentences that share a similar bag-of-words. For instance, “S1: Teacher helps student” and “S2: Student helps teacher” will be judged as similar sentences by a bag-of-words comparison because they have the same surface text; however, their meanings are different. Therefore, to compare a source document with a suspicious document, the proposed method should take syntactic information into account when determining the suspicious similarity between the two documents. Otherwise, it fails to capture the meaning in the comparison, the suspicious similarity between the documents is often identified inconsistently, and the outcome is incorrect or even unnecessary matching results.
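The S1/S2 example can be made concrete with a small sketch. It assumes one common formulation of word-order similarity (position vectors over the joint word set, compared by a normalized difference); the function names are hypothetical and the formula is an illustrative assumption, not necessarily the measure used by the proposed method.

```python
import math


def bag_of_words_overlap(s1, s2):
    """Jaccard overlap of the two word sets; ignores word order entirely."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    if not (a | b):
        return 1.0
    return len(a & b) / len(a | b)


def word_order_vector(tokens, joint):
    """1-based position of each joint-set word in the sentence (0 if absent)."""
    first_pos = {}
    for pos, tok in enumerate(tokens, start=1):
        first_pos.setdefault(tok, pos)
    return [first_pos.get(w, 0) for w in joint]


def word_order_similarity(s1, s2):
    """Assumed formulation: 1 - ||r1 - r2|| / ||r1 + r2||."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    joint = list(dict.fromkeys(t1 + t2))  # joint word set, first-seen order
    r1 = word_order_vector(t1, joint)
    r2 = word_order_vector(t2, joint)
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(r1, r2)))
    total = math.sqrt(sum((x + y) ** 2 for x, y in zip(r1, r2)))
    return 1.0 - diff / total if total else 1.0


s1 = "Teacher helps student"
s2 = "Student helps teacher"
print(bag_of_words_overlap(s1, s2))   # identical word sets
print(word_order_similarity(s1, s2))  # lower, because the order differs
```

Under these assumptions the bag-of-words overlap of S1 and S2 is maximal, while the word-order score drops below 1, which is the signal a purely lexical comparison misses.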
Another form of plagiarism is illustrated by the following example. Given two sentences (i.e., source sentence: “Father likes his child”; suspected sentence: “Child is liked by his father”), an author may change the structure of the source sentence. As the example shows, the structure of two sentences may differ when one uses the active voice and the other the passive voice. The proposed method uses the Semantic Role Labeling (SRL) approach to identify this type of plagiarism. The SRL approach is able to capture the arguments of a sentence (subject, object, verb, and indirect object) even when the positions of the labelled constituents change within the sentence. The main task of the SRL approach is to detect the semantic arguments associated with the verb of a sentence and to classify them into their specific roles. The SRL approach is described in detail in Section 3.3.6.
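The following sketch illustrates why role labels are robust to a change of voice. The role frames are written by hand for the example sentences; a real system would obtain them from an SRL labeler. The `RoleFrame` type, its three slots, and the `role_overlap` score are hypothetical simplifications and do not reproduce the approach of Section 3.3.6.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleFrame:
    """Hand-labelled stand-in for the output of an SRL system."""
    predicate: str  # lemma of the verb
    agent: str      # who performs the action
    patient: str    # who is affected by the action


def role_overlap(a, b):
    """Fraction of role slots on which two frames agree."""
    slots = ("predicate", "agent", "patient")
    matches = sum(getattr(a, s) == getattr(b, s) for s in slots)
    return matches / len(slots)


# "Father likes his child" (active) and "Child is liked by his father"
# (passive) differ in word order but share the same roles.
active = RoleFrame(predicate="like", agent="father", patient="child")
passive = RoleFrame(predicate="like", agent="father", patient="child")

# "Child likes his father" shares the words but swaps the roles.
swapped = RoleFrame(predicate="like", agent="child", patient="father")

print(role_overlap(active, passive))  # all three slots agree
print(role_overlap(active, swapped))  # only the predicate agrees
```

Comparing at the level of roles rather than positions lets the active and passive variants match fully, while a sentence that merely reuses the same words with swapped roles does not.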
Furthermore, when two sentences are compared, they are usually considered similar if most of their words are the same or if they are paraphrases of each other. However, as the sentences S3 and S4 show (i.e., S3: “Teacher helps the student”; S4: “Instructor aids student”), sentences with similar meaning do not necessarily share many words, because people can express the same meaning with different word content. Hence, semantic information, such as the semantic similarity between words and synonymous words, is useful when two sentences have similar meaning but use different words. A plagiarized sentence may be expressed with similar words rather than with the original words of the source sentence; semantic information therefore helps to identify the same ideas when an author presents someone else's idea in his or her own words through text manipulation, paraphrasing, or synonym substitution.
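A minimal sketch of content word expansion for S3 and S4 follows. The synonym groups and the stop-word list are tiny hand-made stand-ins for a lexical database such as WordNet and for a full stop-word list; all names and the overlap score are assumptions chosen for illustration.

```python
# Hypothetical stand-ins for a lexical database and a stop-word list.
SYNONYM_GROUPS = [{"teacher", "instructor"}, {"helps", "aids"}]
STOP_WORDS = {"the", "a", "an"}


def content_words(sentence):
    """Lower-case the sentence and drop stop words."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]


def expand(word):
    """Return the word together with every synonym listed for it."""
    forms = {word}
    for group in SYNONYM_GROUPS:
        if word in group:
            forms |= group
    return forms


def expanded_overlap(s1, s2, use_synonyms=True):
    """Share of content words in s1 that have a match in s2."""
    w1, w2 = content_words(s1), content_words(s2)
    if not w1 or not w2:
        return 0.0
    targets = set(w2)
    matched = 0
    for w in w1:
        forms = expand(w) if use_synonyms else {w}
        if forms & targets:
            matched += 1
    return matched / max(len(w1), len(w2))


s3 = "Teacher helps the student"
s4 = "Instructor aids student"
print(expanded_overlap(s3, s4, use_synonyms=False))  # only "student" matches
print(expanded_overlap(s3, s4))                      # every content word matches
```

Without expansion only one of the three content words matches; with expansion all three do, which is the lexical gap that the semantic information is meant to bridge.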
In summary, we make the following contributions in this research: (1) we present a robust external plagiarism detection method based on a combined set of linguistic features, namely semantic information, syntactic information, and the SRL technique; (2) the proposed method is a comprehensive plagiarism detection technique that covers several types of plagiarism, such as ‘verbatim copying’, ‘paraphrasing’, ‘changing of word structure’, and ‘changing a sentence from active to passive and vice versa’; (3) the method combines semantic and syntactic information to distinguish the meaning of two sentences that share a similar bag-of-words; (4) the method employs a semantic approach to plagiarism detection based on Semantic Role Labeling, so it does not analyze the content of a text document as syntax only but also captures the underlying semantic meaning in terms of the relationships among its words; (5) the method employs a word semantic similarity measure to overcome the vocabulary mismatch problem in sentence comparison; (6) we show that our method, tested on the PAN-PC-11 dataset, is competitive with other methods.
The system can be applied to another language if the lexical database (WordNet), the stop-word list, the shallow parser, and the SRL approach are replaced with the lexical database, stop-word list, parser, and SRL technique of the target language, respectively.
The rest of this paper is structured as follows. In Section 2 of this paper, we consider related work on plagiarism detection. In Section 3, we explain our proposed system. We then summarize the experimental results in Section 4. Finally, we conclude this paper in Section 5.
Brief review of literature
Plagiarism is defined as “the re-use of someone else's ideas, processes, results, or words without explicitly acknowledging the original author and source” [8]. Formally, a plagiarism case is a 4-tuple Plag case = {Ssus, dsus, Ssrc, dsrc}, in which a passage Ssus in a document dsus is the plagiarized version of some source passage Ssrc in dsrc. The main task of the plagiarism detector is to find Plag case. There are three main stages in the plagiarism detection process [7,20
Proposed method
The overall system structure of IEPDM for plagiarism detection is shown in Fig. 1. Let Docsus be a suspicious document, Corsrc a large corpus of documents, and Docsrc a set of candidate or relevant documents. IEPDM performs the following main steps:
1. Pre-processing step: in this step different NLP approaches are applied to the source and suspicious documents.
2. Candidate retrieval step: this step retrieves the Docsrc that is most similar to the Docsus.
3. Detailed comparison step: in the current
Experiments
We conducted our experiment on the PAN-PC-11 dataset to evaluate our proposed method.
Conclusion and future work
Automatic plagiarism detection systems aim to provide experts with evidence for making decisions about potential cases of unauthorized text re-use. In this paper, we introduced a system that detects different types of plagiarism. A significant aspect of the proposed method is that it captures meaning when comparing two passages, both when the passages have the same surface text and when different words or synonyms have been used in them.
On the other hand, the method is able to
Acknowledgements
This work is supported by The Ministry of Higher Education (MOHE) under Q.J130000.21A2.03E53 - STATISTICAL MACHINE LEARNING METHODS TO TEXT SUMMARIZATIONS. The authors would like to thank the Research Management Centre (RMC), Universiti Teknologi Malaysia (UTM), for the support in R & D, and the UTM Big Data Centre (BDC) for the inspiration in making this study a success. The authors would also like to thank the anonymous reviewers, who have contributed enormously to this work.
References (50)
- et al. Automatic summarization assessment through a combination of semantic and syntactic information for intelligent educational systems. Inf. Process. Manage. (2015)
- et al. PDLK: plagiarism detection using linguistic knowledge. Expert Syst. Appl. (2015)
- et al. Methods for cross-language plagiarism detection. Knowl. Based Syst. (2013)
- et al. Paraphrase extraction using fuzzy hierarchical clustering. Appl. Soft Comput. (2015)
- et al. Boosting paraphrase detection through textual similarity metrics with abductive networks. Appl. Soft Comput. (2015)
- et al. Text mining applied to plagiarism detection: the use of words for detecting deviations in the writing style. Expert Syst. Appl. (2013)
- et al. DOCODE 3.0 (DOcument COpy DEtector): a system for plagiarism detection by applying an information fusion process from multiple documental data sources. Inf. Fus. (2016)
- et al. Automated summarization assessment system: quality assessment without a reference summary
- et al. Query-based multi-documents summarization using linguistic knowledge and content word expansion. Soft Comput. (2015)
- S. Avram, D. Caragem, T. Borangiu, NLP applications in external plagiarism detection. scientific bulletin.upb.ro, 76...
- The berkeley framenet project
- Plagiarism meets paraphrasing: insights for the next generation in automatic plagiarism detection. Comput. Linguist.
- NLTK: the natural language toolkit
- Developing a corpus of plagiarised short answers. Lang. Resour. Eval.
- A high-performance plagiarism detection system-notebook for PAN at CLEF
- A pairwise document analysis approach for monolingual plagiarism detection
- Plagiarism detection in text using vector space model
- Automatic labeling of semantic roles. Comput. Linguist.
- Improved implementation for finding text similarities in large collections of data: notebook for PAN at CLEF
- ENCOPLOT: Pairwise sequence matching in linear time applied to plagiarism detection
- Plagiarism detection-different methods and their analysis: review. Int. J. Innov. Res. Adv. Eng.
- The distribution of the flora in the alpine zone. New Phytolog.
- Finding plagiarism by evaluating document similarities
- Class-based construction of a verb lexicon