1 Introduction

Plagiarism occurs when the content of a text is reused in an undesirable way, particularly when the source is not cited (Sattler et al. 2017; Leung and Cheng 2017). Plagiarism is an age-old issue, made all the more problematic by the ease with which text can be reused and disguised given the widespread use of computers and the ubiquity of the Internet. Because plagiarism infringes intellectual property rights, it is a serious problem today, yet detecting it effectively remains a challenging task. Research on plagiarism detection started in the 1990s, pioneered by studies of copy detection in digital documents (Brin et al. 1995). In recent years, research on plagiarism detection has evolved actively, taking advantage of developments in related fields such as information retrieval, natural language processing, computational linguistics, near-duplicate detection, and artificial intelligence (Clough 2000; Baeza-Yates and Ribeiro-Neto 2011; Barrón-Cedeño et al. 2009; Alzahrani and Salim 2010; Alzahrani et al. 2012; Meuschke et al. 2017; Schneider et al. 2018).

There are several types of plagiarism, committed in different ways (Alzahrani et al. 2012). Exact copying is the simplest type: specific passages of text are copied without any change from one document to another (Monostori et al. 2000). Sentence rewriting (Jadalla and Elnagar 2012) is another type: a sentence is rewritten by changing the order of words, adding or deleting some words, or substituting some words with semantically similar ones. The rewritten sentence looks different from the original, but the two actually have the same meaning. The most complex types are idea adoption and writing-style change (Meyer zu Eissen and Stein 2006). After reading and understanding the original document, the plagiarist rewrites sentences, paragraphs, or even the whole document in his or her own style, so that the plagiarized work itself appears original.

Plagiarism detection systems can be divided into two categories, extrinsic and intrinsic (Stein et al. 2007a). Extrinsic detection systems compare a suspicious document with a reference collection, a set of documents assumed to be genuine. Based on a chosen document model and predefined similarity criteria, the detection task is to retrieve all documents that contain the text plagiarized in the suspicious document (Stein et al. 2007b; Potthast et al. 2009). When the reference collection cannot be accessed or a high degree of obfuscation exists, plagiarism detection can be very difficult. In contrast, intrinsic detection systems analyze only the text under evaluation, without comparisons to external documents, aiming to recognize changes in the author's unique writing style as an indicator of potential plagiarism (Meyer zu Eissen and Stein 2006; Stein et al. 2011).

Plagiarism detection originated from research on document retrieval (Blair and Maron 1985) and near-duplicate detection (Henzinger 2006). In document retrieval, a user inputs a query and the documents matching the query are retrieved from a corpus. The query can be a set of words, one or more sentences, or even a paragraph; the corpus is usually a collection of unstructured texts such as news articles, e-mails, and web pages. The best-known examples of document retrieval are web search engines such as Google. Near-duplicate detection, as the name suggests, finds pairs of documents that are nearly the same; it does not require two documents to be identical, but allows insignificant differences between them. Near-duplicate detection is also useful for web search engines, since a user may be discouraged if many near-duplicate web pages appear in the search results.

Document retrieval and near-duplicate detection usually deal with text at the document level: the global context is considered and small details are ignored. Plagiarism, however, can occur at various scopes. A plagiarist may rewrite a whole document through idea adoption, committing plagiarism in the global context; alternatively, plagiarism can occur in the local context, e.g., by copying or rewriting a few sentences of a document. Therefore, plagiarism detection should be undertaken in both local and global contexts (Gipp 2014).

Most existing plagiarism detection methods employ the vector space model (VSM), or bag-of-words (BOW), to convert documents into vectors. Each element of a vector corresponds to the weight of a word in the dictionary, and the words in the dictionary are treated as independent of each other. However, in a plagiarized document, some words of the original may have been replaced with semantically similar words. The independence assumption is then invalid, and such plagiarized documents are difficult to detect because the semantics of words are not handled satisfactorily. The application of semantic networks and semantic compression has been demonstrated to be a valuable addition to existing plagiarism detection methods (Ceglarek 2013).

In this paper, we are concerned with building an extrinsic plagiarism detection system. Given a query document and a corpus of reference documents, the system retrieves all the reference documents from which text has been plagiarized in the query document. In addition, the original and plagiarized passages are located and presented to the user. In the system, Word2Vec is used to transform the words in the documents into word vectors that reveal the semantic relationships among different words. Spherical K-means is applied to cluster the words into semantic concepts, and documents and their paragraphs are then represented in terms of the obtained concepts. Finally, a two-phase matching strategy is developed: in the first phase, possible source documents involved in plagiarism are located; in the second phase, the plagiarized parts are identified and shown to the user. A number of experiments are conducted to demonstrate the effectiveness of our proposed method in plagiarism detection.

The contributions of this work are the following: (1) semantics are provided by Word2Vec word embeddings; (2) the embeddings are clustered into semantic concepts; (3) documents are represented at different levels of granularity using the semantic concepts; and (4) a two-stage approach, filtering then identifying, is introduced for effective plagiarism detection.

The remainder of this paper is organized as follows. In Sect. 2, related work is briefly reviewed. Section 3 gives an overview and a detailed description of our proposed method. Experimental results are presented in Sect. 4. Finally, a conclusion is given in Sect. 5.

2 Related work

As mentioned, plagiarism detection systems can be divided into two categories, extrinsic and intrinsic. Extrinsic systems use a collection of reference documents and can detect plagiarism committed through text copying, sentence rewriting, and idea adoption. In contrast, intrinsic systems detect plagiarism by analyzing and recognizing the writing style of the author. Table 1 summarizes the characteristics of the two categories. Many different plagiarism detection systems have been proposed; a brief survey is given below.

Table 1 Categories of plagiarism detection systems

2.1 Detection for text copying

Text-matching techniques were developed for detecting text copying, i.e., finding the text copied between two documents. In (Monostori et al. 2000), the MatchDetectReveal (MDR) system is proposed, which is capable of identifying overlapping and plagiarized documents; its matching-engine component uses a modified suffix tree representation able to identify exactly overlapping chunks. In (Campbell et al. 2000), a sentence-based system is proposed to produce the distribution of overlap between overlapping documents; it is resistant to inaccuracy caused by large variations in document size.

2.2 Detection for sentence rewriting

For detecting plagiarism by sentence rewriting, techniques were developed that calculate the commonality between two documents based on fingerprints, n-grams, or bags of words (Jadalla and Elnagar 2012; Muhr et al. 2009; Deepa et al. 2016). Question answering systems (QASs) are, in some sense, an example of plagiarism detection: a QAS retrieves from a collection of documents the text portions that contain answers to the user's questions (Waheeb and Babu 2016; Chacko 2018). In (Sarrouti and Alaoui 2017), an efficient passage retrieval method is proposed to retrieve relevant passages in biomedical QASs with high mean average precision. Chow et al. propose the MultiLayer Self-Organizing Map (MLSOM) (Chow and Rahman 2009) for document retrieval and plagiarism detection. They split a document into pages and each page into paragraphs, so that a document is represented as a three-level tree whose levels correspond to document, page, and paragraph, respectively; a three-layer SOM is built, and local matching techniques are developed for comparing text documents. Zhang and Chow propose a framework named MultiLevel Matching (MLM) for plagiarism detection (Zhang and Chow 2011). They use a two-level structure, document-paragraph, to represent each document; histogram vectors represent the information extracted at the document level and paragraph level, respectively, and a hybrid distance measures the difference between two documents. Ceglarek (Ceglarek 2013) demonstrates that applying semantic compression boosts the efficiency of the Sentence Hashing Algorithm for Plagiarism Detection 2 (SHAPD2) and the w-shingling algorithm.

2.3 Detection for idea adoption

Idea adoption occurs frequently in paraphrase plagiarism (Franco-Salvador et al. 2016; Alvarez-Carmona et al. 2018). Paraphrase plagiarism contains reused text that is intentionally hidden by rewording practices such as semantic equivalences, discursive changes, and morphological or lexical substitutions (P4PIN 2020). In (Marti et al. 2013), attention is paid to the paraphrase phenomena underlying acts of plagiarism. The experiments presented there show that more complex paraphrase phenomena and a high density of paraphrase mechanisms make plagiarism detection more difficult, that lexical substitutions are the most frequently used paraphrase mechanism when plagiarizing, and that paraphrase mechanisms tend to shorten the plagiarized text. In (Naawab et al. 2016), query expansion is performed using the UMLS Metathesaurus to deal with situations in which original documents are obfuscated. Various approaches to word sense disambiguation are investigated to handle cases where there are multiple Concept Unique Identifiers (CUIs) for a given term, i.e., replacements of words or phrases. In (Gonzalez-Agirre 2017), two computational models, semantic textual similarity (STS) and typed similarity, are developed for computing textual similarity: STS measures the degree of semantic equivalence between two sentences by assigning graded similarity values, while typed similarity identifies the type of relation that holds between a pair of similar items in a digital library.

2.4 Detection with word embedding

Various approaches have been proposed that use word embedding techniques, e.g., Word2Vec, to detect plagiarism by sentence rewriting or idea adoption. Word embedding techniques quantify and categorize semantic similarities between linguistic items based on their distributional properties, projecting the words contained in a set of training documents into a semantic space of specified dimensionality. In (Baba et al. 2017), the validity of using a distributed representation of words for defining document similarity is evaluated; the paper proposes a plagiarism detection method based on the local maximal value of the length of the longest common subsequence (LCS), with weights defined by a distributed representation. In (Mahmoud et al. 2017), Word2Vec is used to generate word vectors which are subsequently combined into a sentence-vector representation, and a convolutional neural network is proposed to measure the similarity between the representations of source and suspicious sentences. In (E et al. 2018), an embedding-based document representation for detecting plagiarism in documents is proposed: words are represented as multi-dimensional vectors, simple aggregation methods combine the word vectors into a sentence-vector representation, and the sentence pairs with the highest similarity scores are considered candidate plagiarism cases. In (Shahmohammadi et al. 2020), a single Bi-LSTM neural network is trained to encode the input document by leveraging its pretrained GloVe word vectors; three sets of handcrafted similarity features are combined with the output of the Bi-LSTM network to detect sentences or phrases that convey the same meaning but use different wording. Alotaibi and Joy (Alotaibi and Joy 2020) introduce a technique for English-Arabic cross-language plagiarism detection that combines word embedding, term weighting techniques, and universal sentence encoder models to improve the detection of sentence similarity.

2.5 Detection by writing style

Intrinsic plagiarism analysis identifies potential plagiarism by analyzing a document with respect to undeclared changes in writing style. In (Meyer zu Eissen and Stein 2006), stylometry, subsuming statistical methods for quantifying an author's unique writing style, is proposed: by constructing and comparing stylometric models of different text segments, passages that are stylistically different from the others can be detected. In (Sánchez-Vega et al. 2017), it is pointed out that the original author's stylistic fingerprint prevails in the plagiarized text even when paraphrasing occurs, and a text representation scheme is proposed that gathers both content and style characteristics of texts by means of character-level features. Kuznetsov et al. (Kuznetsov et al. 2016) develop a plagiarism detection method based on constructing an author style function from features of text sentences and detecting outliers; the method is also adapted to the diarization problem by segmenting author style statistics over text parts corresponding to different authors. In (Vysotska et al. 2018), the use of linguometry and stylometry technologies for detecting an author's style is discussed; statistical linguistic analysis of an author's text is used in stylometry to quantify the degree to which the analyzed text can be attributed to a specific author.

2.6 PAN for plagiarism detection

PAN is a series of scientific events and shared tasks on digital text forensics and stylometry (PAN 2020). The FIRE initiative (Forum for Information Retrieval Evaluation), organized by the Information Retrieval Society of India, has evolved continuously to meet new challenges in information access, including plagiarism detection. Considerable effort has been devoted to developing better models for plagiarism detection; probably the most interesting case is the PAN International Competition on Plagiarism Detection held in conjunction with CLEF (Potthast et al. 2009). To create simulated plagiarism cases, crowdsourcing has been employed (Potthast et al. 2010b), yielding obfuscation that closely resembles the way human plagiarists work. Two corpora, PAN-PC-10 (Potthast et al. 2010a) and P4PIN (Sánchez-Vega et al. 2017), were constructed, and based on these corpora various plagiarism detection techniques have been developed and published at the CLEF forums. For example, Alzahrani and Salim (Alzahrani and Salim 2010) propose a plagiarism detection method using a fuzzy semantic-based string similarity approach: a list of candidate documents for each suspicious document is retrieved using shingling and the Jaccard coefficient; suspicious documents are then compared sentence-wise with the associated candidate documents, which entails computing a fuzzy degree of similarity; and two sentences are marked as plagiarized if their fuzzy similarity score is above a certain threshold.

3 Proposed method

In BOW- or VSM-based methods, e.g., MLM, words are regarded as independent of each other, and two semantically similar words are treated as different words. In plagiarism, however, an author may substitute some words with semantically similar ones. Consider the following two sentences:

$$\begin{aligned} {\text {A small orange cat sits on the sofa and looks sleepy.}} \end{aligned}$$
(1)

and

$$\begin{aligned} {\text {An orange kitten sits on the settee and looks drowsy.}} \end{aligned}$$
(2)

After preprocessing, these two sentences may contain the following words:

$$\begin{aligned} s_1= & {} \{{\text {small, orange, cat, sit, sofa, look, sleepy}}\}, \end{aligned}$$
(3)
$$\begin{aligned} s_2= & {} \{{\text {orange, kitten, sit, settee, look, drowsy}}\}. \end{aligned}$$
(4)

Syntactically, there are 10 different words in total. By the VSM method, only 3 of the 10 words (orange, sit, and look) are common to both sentences, so the similarity between the sentences is low. By the n-gram method, e.g., 3-grams, we have 5 3-grams, \(\{{\text {small, orange, cat}}\}\), ..., \(\{{\text {sofa, look, sleepy}}\}\), for \(s_1\) and 4 3-grams, \(\{{\text {orange, kitten, sit}}\}\), ..., \(\{{\text {settee, look, drowsy}}\}\), for \(s_2\). Clearly, the two sets share no 3-gram, so the 3-gram method fails to detect the plagiarism between the two sentences. However, if semantics is considered and it is somehow learned that kitten and cat are semantically similar, sofa and settee are similar, and so are sleepy and drowsy, then there are only 7 semantically different words in total, of which 6 are common to the two sentences. One then concludes that the similarity between the sentences is high and that plagiarism occurs.
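The failure of these surface-level measures on this pair is easy to reproduce. The following sketch (ours, not part of the proposed system) computes the bag-of-words overlap and the 3-gram overlap for the token lists of Eq.(3) and Eq.(4):

```python
s1 = ["small", "orange", "cat", "sit", "sofa", "look", "sleepy"]
s2 = ["orange", "kitten", "sit", "settee", "look", "drowsy"]

# Bag-of-words: 3 words in common out of 10 distinct words in total.
print(len(set(s1) & set(s2)), len(set(s1) | set(s2)))  # 3 10

def ngrams(tokens, n=3):
    """All consecutive n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# 3-grams: 5 for s1, 4 for s2, and no overlap at all.
print(len(ngrams(s1)), len(ngrams(s2)), ngrams(s1) & ngrams(s2))  # 5 4 set()
```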

Fig. 1 Flow diagram of our method

Our proposed method performs plagiarism detection by taking the semantic relationships between words into account in the representation of the documents. The flow diagram of our method is shown in Fig. 1. We use Word2Vec to transform the words collected from the reference documents into word vectors that reveal the semantic relationships among different words. The word vectors are then clustered, and semantic concepts are developed; a concept is a representative standing for a group of words that are semantically similar to each other. Documents are then represented in terms of concepts, and plagiarism detection can be done effectively in two phases, filtering and identifying. In the filtering phase, the reference documents in the corpus are compared with the query document and suspicious source documents are selected; since the corpus is usually enormous, we filter out the source documents that are irrelevant to the query document. The suspicious documents are then passed to the identifying phase, in which the sentences involved in plagiarism in the source and query documents are identified and output to the user. A document is represented as a three-level structure, consisting of document, paragraph, and sentence levels. The upper two levels are used in the filtering phase, while the lower two levels are used in the identifying phase.

Para2vec and doc2vec (Mikolov et al. 2013b) look at text in the global context, ignoring small details in the local context. In contrast, our method is concerned with both the global context, examining the document and paragraph levels to decide whether the query plagiarizes some documents in the corpus, and the local context, examining the paragraph and sentence levels to locate the sentences involved in the plagiarism cases.

3.1 Computing word vectors

To find the semantic relationships among words, we convert the words into word vectors by Word2Vec (Mikolov et al. 2013a, b). FastText (Bojanowski et al. 2017), GloVe (Pennington et al. 2014), and BERT (Devlin et al. 2018) would also be suitable for our work; however, Word2Vec was more accessible and was therefore adopted. The following steps are taken:

  1. The reference documents are scanned and a vocabulary is built.

  2. Training patterns are extracted from the sentences of the reference documents.

  3. A Word2Vec neural network is built and trained with the training patterns. The word vectors for the words in the vocabulary are then obtained.

Note that if two words are semantically similar, their word vectors are close; otherwise, they are far apart. Therefore, whether two words are semantically similar can be decided by computing a measure, e.g., the cosine, between their corresponding word vectors.

3.1.1 Constructing vocabulary

Let the number of reference documents be N. First, as in MLM (Zhang and Chow 2011), preprocessing is applied to the N reference documents: upper-case letters are changed to lower case, punctuation marks are removed, stemming is applied, and stop words are deleted. Then all the words are collected, and for each word w the weight \({\text {wt}}(w)\) is calculated:

$$\begin{aligned} {\text {wt}}(w) = {{\text {tf}}(w)} \times \log _2 \left( \frac{N}{{\text {df}}(w)}\right) \end{aligned}$$
(5)

where \({\text {tf}}(w)\) is the term frequency of word w appearing in all the reference documents and \({\text {df}}(w)\) is the number of reference documents in which word w appears. The t words with the t highest weights are selected to form the vocabulary V. Let these t words be denoted as \(w_1\), \(w_2\), ..., \(w_t\).
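As an illustration of this step, a minimal sketch of Eq.(5) is given below. It assumes the reference documents are already preprocessed into token lists; the function name and plain-Python representation are ours.

```python
import math
from collections import Counter

def build_vocabulary(docs, t):
    """Select the t highest-weighted words over a list of preprocessed,
    tokenized reference documents, using Eq. (5):
    wt(w) = tf(w) * log2(N / df(w))."""
    N = len(docs)
    tf = Counter()          # term frequency over all reference documents
    df = Counter()          # number of documents in which a word appears
    for doc in docs:
        tf.update(doc)
        df.update(set(doc))
    wt = {w: tf[w] * math.log2(N / df[w]) for w in tf}
    return sorted(wt, key=wt.get, reverse=True)[:t]
```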

3.1.2 Extracting training patterns

Next, the training patterns are extracted from the reference documents. A training pattern is an input-output word pair (x, y). Each reference document is divided into a sequence of non-overlapping windows. From each window, the central word is taken as the input x and each of its context words is taken as an output y. Let a window contain \(2s+1\) words:

\(w^{'}_{r+1}{\ldots }w^{'}_{r+s}w^{'}_{r+s+1}w^{'}_{r+s+2}{\ldots }w^{'}_{r+2s+1}\) where \(w^{'}_{r+s+1}\) is the central word and the other words are context words. Then 2s training patterns:

\((w^{'}_{r+s+1},w^{'}_{r+1})\), ..., \((w^{'}_{r+s+1},w^{'}_{r+s})\), \((w^{'}_{r+s+1},w^{'}_{r+s+2})\), ..., \((w^{'}_{r+s+1},w^{'}_{r+2s+1})\)

are extracted from the window. For example, consider the following window of 7 consecutive words, with \(s=3\):

$$\begin{aligned} \{{\text {dad, look, very, upset, because, favor, basketball}}\}. \end{aligned}$$

Then the following 6 training patterns:

  • (upset, dad), (upset, look), (upset, very),

  • (upset, because), (upset, favor), (upset, basketball).

are extracted. Note that the first word in each pair is the input and the second is the output.
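A sketch of this extraction step follows; how words left over after the last complete window are handled is not specified in the text, so this sketch simply ignores them.

```python
def training_pairs(tokens, s=3):
    """Extract (input, output) pairs from non-overlapping windows of
    2s+1 words, the central word being the input and each context
    word an output."""
    width = 2 * s + 1
    pairs = []
    for r in range(0, len(tokens) - width + 1, width):
        window = tokens[r:r + width]
        center = window[s]
        pairs.extend((center, w) for i, w in enumerate(window) if i != s)
    return pairs

print(training_pairs(
    ["dad", "look", "very", "upset", "because", "favor", "basketball"]))
# [('upset', 'dad'), ('upset', 'look'), ('upset', 'very'),
#  ('upset', 'because'), ('upset', 'favor'), ('upset', 'basketball')]
```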

Fig. 2 A Word2Vec neural network

3.1.3 Getting word vectors

After the training patterns are collected, a Word2Vec network, as shown in Fig. 2, is built. Let the dimensionality of the vector space be H. The network has three layers, the input layer, the hidden layer, and the output layer, which contain t, H, and t neurons, respectively. Note that the number of neurons in the input layer and in the output layer equals the number of words in the vocabulary, while the number of neurons in the hidden layer equals the dimensionality of the resulting word vectors. The weights on the connections between the input layer and the hidden layer are named "syn0", while the weights on the connections between the hidden layer and the output layer are named "syn1".

The Word2Vec network is then trained with the training patterns. After training, the collection of the “syn0” weights between the ith input neuron and all the hidden neurons becomes the word vector of the word \(w_i\). Let \(v_1\), \(v_2\), ..., \(v_H\) be these weights, then

$$\begin{aligned} {\text {vec}}(w_i) = \begin{bmatrix} v_1&v_2&\ldots&v_H \end{bmatrix}^T \end{aligned}$$
(6)

is the word vector of the word \(w_i\).

In this work, we use a pre-existing set of embeddings (Google 2020): pre-trained vectors trained on part of the Google News dataset (about 100 billion words), containing 300-dimensional vectors for 3 million words and phrases, i.e., \(H=300\). By mapping the vocabulary to this set of embeddings, we obtain t H-dimensional word vectors, denoted \({\text {vec}}(w_1)\), \({\text {vec}}(w_2)\), ..., \({\text {vec}}(w_t)\) for words \(w_1\), \(w_2\), ..., \(w_t\), respectively.
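One way to load these embeddings in practice (our choice of tooling, not specified in the paper) is through the gensim library:

```python
from gensim.models import KeyedVectors

# Path to the pre-trained Google News vectors (H = 300); adjust as needed.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(vectors["cat"].shape)                     # (300,)
print(vectors.similarity("cat", "kitten"))      # high: semantically close
print(vectors.similarity("cat", "basketball"))  # much lower
```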

3.2 Construction of concepts

Next, we group the words in the vocabulary into semantic concepts by clustering. Before clustering, we reduce the dimensionality of the word vectors.

3.2.1 Dimensionality reduction by PCA

The word vectors obtained have dimensionality H, which can be too large for effective clustering. Principal component analysis (PCA) (Wold et al. 1987; Jolliffe 2002) is applied to reduce the dimensionality of the word vectors. Technically, a principal component is a linear combination of the original variables whose coefficients form an eigenvector of the covariance matrix of the word vectors. Assume there are f eigenvalues \(e_1,e_2,\ldots ,e_f\) associated with the covariance matrix, with \(e_1 \ge e_2 \ge \ldots \ge e_f\). We choose q principal components such that q is as small as possible while the cumulative energy is above a certain threshold \(\theta \), i.e.,

$$\begin{aligned} \frac{\sum _{i=1}^{q} e_i}{\sum _{i=1}^{f} e_i} \ge \theta \end{aligned}$$
(7)

As a result, there are q, instead of H, components in \({\text {vec}}(w_i)\), \(i=1,\ldots ,t\), after the application of PCA.
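In scikit-learn, this selection rule can be expressed directly: passing a float in (0, 1) as n_components keeps the smallest number of components whose cumulative explained-variance ratio exceeds the threshold, which matches the criterion of Eq.(7). A sketch with a random stand-in matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

t, H = 5000, 300
vecs = np.random.randn(t, H)   # stand-in for the t x H word-vector matrix

theta = 0.9
pca = PCA(n_components=theta, svd_solver="full")  # criterion of Eq. (7)
reduced = pca.fit_transform(vecs)                 # t x q matrix
print(pca.n_components_)                          # the chosen q
```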

3.2.2 Getting semantic concepts

Then we group the resulting t word vectors into K clusters by a clustering algorithm. Many types of clustering algorithms have been proposed, such as centroid-based clustering (Sarmiento et al. 2019), self-organizing maps (SOM), hierarchical clustering (Gagolewski et al. 2016), distribution-based clustering (Fellows et al. 2011), fuzzy C-means, GMM-EM, density-based algorithms (Wang et al. 2019), and subspace clustering (Luo et al. 2018). Clustering divides a set of objects into clusters such that objects in the same cluster are more similar to each other than to those in other clusters. We adopt spherical K-means (Dhillon and Modha 2012; Pratap et al. 2018; Hedar et al. 2018), a variant of K-means (Lloyd 1982), which is probably the best-known clustering algorithm in the AI community. Unlike K-means, spherical K-means uses the cosine instead of the Euclidean distance to measure the similarity between vectors; like K-means, it runs iteratively to divide a given set of vectors into a pre-specified number, K, of clusters. Spherical K-means operates as follows.

  1. All the involved vectors are normalized to unit length.

  2. The value of K is chosen by the user.

  3. K vectors are selected arbitrarily, by the user, as the initial centroids of the clusters.

  4. For each vector, the cosine between the vector and each centroid is computed, and the vector is assigned to the cluster with the nearest centroid.

  5. The centroids of the K clusters are recalculated and normalized.

  6. Steps 4 and 5 are repeated until the centroids no longer move.

When the algorithm terminates, we have K clusters and their centroids.

With Word2Vec, the word vectors of two semantically similar words lie in close proximity. Therefore, a cluster can be regarded as a concept consisting of a set of semantically similar words, while words in different clusters are semantically dissimilar. Let the K clusters obtained by spherical K-means have centroid vectors \({\mathbf {c}}_1\), \({\mathbf {c}}_2\), ..., \({\mathbf {c}}_K\), respectively. Then we have K concepts, each containing words semantically similar to each other; hereafter, the concepts are denoted \({\mathbf {c}}_1\), \({\mathbf {c}}_2\), ..., \({\mathbf {c}}_K\).
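A compact NumPy sketch of the algorithm follows; the random initialization and the empty-cluster handling are our own choices, as the description above leaves them open.

```python
import numpy as np

def spherical_kmeans(X, K, iters=100, seed=0):
    """Cluster the rows of X (t x q word vectors) into K concepts.
    Steps follow the list above: normalize, assign by cosine,
    recompute and renormalize centroids until convergence."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)         # step 1
    C = X[rng.choice(len(X), size=K, replace=False)]         # steps 2-3
    for _ in range(iters):
        labels = np.argmax(X @ C.T, axis=1)                  # step 4: cosine
        newC = np.vstack([X[labels == k].mean(axis=0)
                          if np.any(labels == k) else C[k]
                          for k in range(K)])
        newC /= np.linalg.norm(newC, axis=1, keepdims=True)  # step 5
        if np.allclose(newC, C):                             # step 6
            break
        C = newC
    return C, labels
```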

3.3 Representing documents in concepts

After obtaining the K concepts from the reference documents, we represent documents in terms of these concepts. A three-level representation, concerning the document, its paragraphs, and its sentences, is formed for each document. The top level is the document level. For a document d, the document vector \({\mathbf {D}}(d)\) is formed:

$$\begin{aligned} {\mathbf {D}}(d)= \begin{bmatrix} g_1&g_2&\ldots&g_K \end{bmatrix}^T \end{aligned}$$
(8)

where \(g_k\) is the strength of concept k in this document, defined as

$$\begin{aligned} g_k= \sum _{w \in d} {\text {wt}}_d(w){\times }{\text {SIM}}({\text {vec}}(w),{\mathbf {c}}_k) \end{aligned}$$
(9)

for \(1\le k\le K\). Note that \({\text {wt}}_d(w)\) is the weight of word w, defined as

$$\begin{aligned} {\text {wt}}_d(w) = {{\text {tf}}_d(w)} \times \log _2 \left( \frac{N}{{\text {df}}(w)}\right) \end{aligned}$$
(10)

where \({{\text {tf}}_d(w)}\) is the term frequency of word w appearing in d, and \({\text {SIM}}({\mathbf {x}},{\mathbf {y}})\) is the cosine similarity between vectors \({\mathbf {x}}\) and \({\mathbf {y}}\), defined as

$$\begin{aligned} {\text {SIM}}({\mathbf {x}},{\mathbf {y}})= & {} \frac{{\mathbf {x}}\cdot {\mathbf {y}}}{{\Vert {\mathbf {x}}\Vert }{\Vert {\mathbf {y}} \Vert }}. \end{aligned}$$
(11)

Thus, a K-dimensional vector denotes a document in a global manner.
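A sketch of Eq.(8)-(9) is given below; the same routine also yields the paragraph vectors introduced next, by passing a paragraph's words together with the weights of Eq.(14). The function signature is ours.

```python
import numpy as np

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def concept_vector(words, weight, vec, centroids):
    """Concept-strength vector of Eqs. (8)-(9): for each concept c_k,
    accumulate wt(w) * SIM(vec(w), c_k) over the given words.
    `weight` maps word -> tf-idf weight, `vec` maps word -> reduced
    word vector, `centroids` is the list of K concept centroids."""
    g = np.zeros(len(centroids))
    for w in words:
        for k, c in enumerate(centroids):
            g[k] += weight[w] * cosine(vec[w], c)
    return g
```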

The middle level is the paragraph level. For a document d with r paragraphs \(p_{1}\), \(p_{2}\), ..., \(p_{r}\), r paragraph vectors are formed. The paragraph vector \({\mathbf {P}}(p_a,d)\) of paragraph \(p_a\), \(1\le a\le r\), is defined as

$$\begin{aligned} {\mathbf {P}}(p_a,d)= \begin{bmatrix} e_1&e_2&\ldots&e_K \end{bmatrix}^T \end{aligned}$$
(12)

where \(e_k\) is the strength of concept k in paragraph \(p_a\):

$$\begin{aligned} e_k= \sum _{w \in p_a} {\text {wt}}_p(w){\times }{\text {SIM}}({\text {vec}}(w),{\mathbf {c}}_k) \end{aligned}$$
(13)

for \(1\le k\le K\). Note that \({\text {wt}}_p(w)\) indicates the weight of word w, defined as:

$$\begin{aligned} {\text {wt}}_p(w) = {{\text {tf}}_p(w)} \times \log _2 \left( \frac{r}{{\text {pf}}_p(w)}\right) \end{aligned}$$
(14)

where \({\text {tf}}_p(w)\) is the term frequency of word w in paragraph \(p_a\), and \({\text {pf}}_p(w)\) is the number of paragraphs of d in which word w appears.

The bottom level is the sentence level. For each sentence s in a paragraph p of a document d, the sentence vector \({\mathbf {S}}(s,p,d)\) is defined as

$$\begin{aligned} {\mathbf {S}}(s,p,d)= \begin{bmatrix} {\mathbf {f}}_1&{\mathbf {f}}_2&\ldots&{\mathbf {f}}_{\ell } \end{bmatrix}^T \end{aligned}$$
(15)

where \({\ell }\) is the number of words contained in sentence s and \({\mathbf {f}}_k\), \(1\le k\le \ell \), is the word vector of the kth word in s. Therefore, if paragraph p has h sentences, we have h sentence vectors for p.

3.4 Filtering phase

In this phase, the reference documents \(d_1\), \(d_2\), ..., \(d_N\) in the corpus are compared with the query document q, and those documents suspiciously plagiarized by the query are selected. This phase is basically similar to the candidate document selection phase of (Stein et al. 2007b).

Consider two documents \(d_1\) and \(d_2\) with document vectors \({\mathbf {D}}(d_1)\) and \({\mathbf {D}}(d_2)\), and paragraph vectors \({\mathbf {P}}(p_{1},d_{1})\), \({\mathbf {P}}(p_{2},d_{1})\), ..., \({\mathbf {P}}(p_{r_1},d_{1})\) and \({\mathbf {P}}(p_{1},d_{2})\), \({\mathbf {P}}(p_{2},d_{2})\), ..., \({\mathbf {P}}(p_{r_2},d_{2})\), respectively; that is, \(d_1\) has \(r_1\) paragraphs and \(d_2\) has \(r_2\) paragraphs. As in (Zhang and Chow 2011), we define a high-level dissimilarity of \(d_1\) from \(d_2\), \({\text {DIS}}_{{\text {high}}}(d_1,d_2)\), as

$$\begin{aligned} {\text {DIS}}_{{\text {high}}}(d_1,d_2) = \lambda {\times }{\text {DIS}}_{{\text {doc}}}(d_1,d_2) + (1-\lambda ){\times }{\text {DIS}}_{{\text {par}}}(d_1,d_2), \end{aligned}$$
(16)

where \(\lambda \in [0,1]\) is a pre-specified constant and

$$\begin{aligned} {\text {DIS}}_{{\text {doc}}}(d_1,d_2)= & {} d({\mathbf {D}}(d_1), {\mathbf {D}}(d_2)), \end{aligned}$$
(17)
$$\begin{aligned} {\text {DIS}}_{{\text {par}}}(d_1,d_2)= & {} \frac{\sum _{j=1}^{r_1} \min \{d({\mathbf {P}}({p_{j},d_1}),{\mathbf {P}}({p_{1},d_2)}), \ldots , d({\mathbf {P}}({p_{j},d_1}),{\mathbf {P}}({p_{r_2},d_2)})\}}{r_1} \end{aligned}$$
(18)

where \(d({\mathbf {x}}, {\mathbf {y}})\) is defined as (Zhang and Chow 2011)

$$\begin{aligned} d({\mathbf {x}}, {\mathbf {y}})= & {} 1 - e^{-(1-{\text {SIM}}({\mathbf {x}},{\mathbf {y}}))}. \end{aligned}$$
(19)

Note that a smaller \({\text {DIS}}_{{\text {high}}}(d_1,d_2)\) indicates that \(d_1\) is more similar to \(d_2\).

In (Zhang and Chow 2011), \({\text {DIS}}_{{\text {par}}}(d_1,d_2)\) is instead named \({\text {DIS}}_{{\text {local}}}(d_1,d_2)\) and is defined as

$$\begin{aligned} {\text {DIS}}_{{\text {local}}}(d_1,d_2) = \frac{\sum _{i=1}^{r_1}\sum _{j=1}^{r_2}d({\mathbf {P}}(p_{i},d_1),{\mathbf {P}}(p_{j},d_2))}{r_1r_2}. \end{aligned}$$
(20)

Note that Eq.(18) and Eq.(20) are different. In Eq.(20), all the pairwise distances between paragraphs are averaged, which may fail to detect plagiarism. Consider two documents \(d_1\) and \(d_2\), each having 2 paragraphs, with

$$\begin{aligned} {\mathbf {P}}(p_1,d_1)= & {} \begin{bmatrix} 3.9&-1.4 \end{bmatrix},\ {\mathbf {P}}(p_2,d_1)= \begin{bmatrix} 3.9&-1.4 \end{bmatrix}; \\ {\mathbf {P}}(p_1,d_2)= & {} \begin{bmatrix} 3.7&-1.5 \end{bmatrix},\ {\mathbf {P}}(p_2,d_2)= \begin{bmatrix} -1.7&6.5 \end{bmatrix}. \end{aligned}$$

Note that

$$\begin{aligned} d({\mathbf {P}}(p_{1},d_1),{\mathbf {P}}(p_1,d_2))= & {} 0.0005,\ d({\mathbf {P}}(p_{1},d_1),{\mathbf {P}}(p_2,d_2))=0.79, \\ d({\mathbf {P}}(p_{2},d_1),{\mathbf {P}}(p_1,d_2))= & {} 0.63,\ d({\mathbf {P}}(p_{2},d_1),{\mathbf {P}}(p_2,d_2))=0.19. \end{aligned}$$

By observation, we can see that paragraph \(p_1\) of \(d_1\) is very similar to paragraph \(p_1\) of \(d_2\), so plagiarism may have occurred between \(d_1\) and \(d_2\). But by Eq.(20),

$$\begin{aligned} {\text {DIS}}_{{\text {local}}}(d_1,d_2)= & {} \frac{\sum _{i=1}^{2}\sum _{j=1}^{2}d({\mathbf {P}}(p_{i},d_1),{\mathbf {P}}(p_j,d_2))}{2{\times }2} \\= & {} \frac{0.0005+0.79+0.63+0.19}{4} = 0.403 \end{aligned}$$

which suggests that \(d_1\) is fairly distant from \(d_2\), so the plagiarism may go undetected. However, by Eq.(18) we have

$$\begin{aligned}&{\text {DIS}}_{{\text {par}}}(d_1,d_2) \\&\quad =\frac{\sum _{j=1}^{2} \min \{d({\mathbf {P}}({p_{j},d_1}), {\mathbf {P}}({p_{1},d_2)}),d({\mathbf {P}}({p_{j},d_1}),{\mathbf {P}}({p_{2},d_2)})\}}{2} \\&\quad = \frac{\min \{0.0005,0.79\}+\min \{0.63,0.19\}}{2} \\&\quad = \frac{0.0005+0.19}{2} = 0.095. \end{aligned}$$

Our method gives a distance of 0.095 between \(d_1\) and \(d_2\), clearly indicating that plagiarism may have occurred between them.

The filtering process proceeds as follows. For each reference document \(d_i\), \(1\le i\le N\), we compute \({\text {DIS}}_{{\text {high}}}(q,d_i)\). If \({\text {DIS}}_{{\text {high}}}(q,d_i) < \tau \), where \(\tau \) is a pre-specified threshold, document \(d_i\) is regarded as a document suspiciously plagiarized by the query document q. All such documents are collected and passed to the identifying phase.
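The filtering computation can be summarized in a few lines (a sketch; D1 and D2 stand for document vectors, P1 and P2 for lists of paragraph vectors):

```python
import numpy as np

def d(x, y):
    """Distance of Eq. (19)."""
    cos = float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
    return 1.0 - np.exp(-(1.0 - cos))

def dis_par(P1, P2):
    """Eq. (18): average, over the paragraphs of d1, of the distance
    to the nearest paragraph of d2."""
    return float(np.mean([min(d(p1, p2) for p2 in P2) for p1 in P1]))

def dis_high(D1, D2, P1, P2, lam):
    """Eq. (16); a document passes the filter if this value < tau."""
    return lam * d(D1, D2) + (1.0 - lam) * dis_par(P1, P2)

# Min-then-average of Eq. (18) on the four pairwise distances of the
# worked example above:
dists = [[0.0005, 0.79], [0.63, 0.19]]
print(sum(min(row) for row in dists) / len(dists))   # 0.095
```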

3.5 Identifying phase

We now have the collection of reference documents selected in the filtering phase, and we want to identify all the sentences involved in plagiarism in the source and query documents.

The sentences involved in plagiarism in a reference document \(d_i\) and the query document q are found as follows. Let q have \(r_q\) paragraphs and \(d_i\) have \(r_i\) paragraphs. For each paragraph \(p_{k}\) in q, \(1\le k\le r_q\), we compute the similarity between this paragraph and every paragraph \(p_{j}\) in \(d_i\), i.e.,

$$\begin{aligned} {\text {SIM}}({\mathbf {P}}(p_{k},q),{\mathbf {P}}(p_{j},d_i)) \end{aligned}$$
(21)

for \(1\le j\le r_i\). Let paragraph \(p_{J}\) of \(d_i\) be the most similar to \(p_{k}\) of q. If

$$\begin{aligned} {\text {SIM}}({\mathbf {P}}(p_{k},q),{\mathbf {P}}(p_{J},d_i)) < \eta , \end{aligned}$$
(22)

where \(\eta \) is a threshold, then \(p_k\) is skipped; otherwise, we proceed. For each sentence \(s_1\) in \(p_{k}\) of q and each sentence \(s_2\) in \(p_{J}\) of \(d_i\), with the sentence vector of \(s_1\) being \({\mathbf {S}}(s_1,p_k,q)=\begin{bmatrix}{\mathbf {f}}_{1,1}&\ldots&{\mathbf {f}}_{\ell _1,1}\end{bmatrix}\) and the sentence vector of \(s_2\) being \({\mathbf {S}}(s_2,p_J,d_i)=\begin{bmatrix}{\mathbf {f}}_{1,2}&\ldots&{\mathbf {f}}_{\ell _2,2}\end{bmatrix}\), the similarity of \(s_1\) to \(s_2\) is calculated as

$$\begin{aligned} {\text {PLA}}(s_1,s_2) = \frac{\sum _{j=1}^{\ell _1} \max \{{\text {SIM}} ({\mathbf {f}}_{j,1},{\mathbf {f}}_{1,2}),\ldots ,{\text {SIM}}({\mathbf {f}}_{j,1}, {\mathbf {f}}_{\ell _2,2})\}}{\ell _1}. \end{aligned}$$
(23)

Let \(s^{\star }_2\) in \(p_J\) be the sentence most similar to \(s_1\). If \({\text {PLA}}(s_1, s^{\star }_2) \ge \epsilon \) for some pre-specified threshold \(\epsilon \), then sentence \(s_1\) in \(p_k\) of q is regarded as a plagiarism of sentence \(s^{\star }_2\) in \(p_{J}\) of \(d_i\).
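A direct transcription of Eq.(23) is shown below; sentence vectors here are lists of word vectors, as in Eq.(15).

```python
import numpy as np

def pla(S1, S2):
    """Eq. (23): for each word vector of the suspicious sentence s1,
    take its best cosine match among the word vectors of the candidate
    source sentence s2, and average over the words of s1."""
    def cosine(x, y):
        return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
    return float(np.mean([max(cosine(f1, f2) for f2 in S2) for f1 in S1]))

# s1 is flagged as a plagiarism of its best match s2* when
# pla(S1, S2_star) >= epsilon.
```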

Consider the two sentences \(s_1\) and \(s_2\) in Eq.(3) and Eq.(4), and assume that \(s_1\) appears in paragraph \(p_1\) of a document A while \(s_2\) appears in paragraph \(p_2\) of another document B. As mentioned, with the bag-of-words model there are 10 different words in total but only 3 words, orange, sit, and look, common to \(s_1\) and \(s_2\); the similarity between them is low and the plagiarism may not be detected. Suppose that, by considering semantics, the most similar words to orange, kitten, sit, settee, look, and drowsy in \(s_2\) are orange, cat, sit, sofa, look, and sleepy, respectively, in \(s_1\), and

  • SIM(vec(orange),vec(orange))=1.0, SIM(vec(kitten),vec(cat))=0.764,

  • SIM(vec(sit),vec(sit))=1.0, SIM(vec(settee),vec(sofa))=0.776,

  • SIM(vec(look),vec(look))=1.0, SIM(vec(drowsy),vec(sleepy))=0.52.

Then, by Eq.(23) we have

$$\begin{aligned} {\text {PLA}}(s_2,s_1) = \frac{1.0+ 0.764+ 1.0+ 0.776+ 1.0+ 0.52}{6} = 0.84. \end{aligned}$$
(24)

For \(\epsilon =0.6\), we conclude that \(s_2\) in paragraph 2 of document B is a plagiarism of \(s_1\) in paragraph 1 of document A.

4 Experimental results

In this section, experimental results are presented to demonstrate the effectiveness of our method, together with comparisons against other methods.

4.1 Datasets

Three datasets are used in the experiments. The first is the Html_CityU1 dataset used in (Zhang and Chow 2011). It has 26 categories, each containing 400 documents, for a total of 10,400 source documents; these documents were downloaded from the Internet and serve as reference documents. A pair of documents in the same category was randomly chosen, one as the plagiarized document and the other as the source document, and part of the content of the source document was copied into the plagiarized document. This process was repeated three times for each category, producing \(26{\times }3=78\) plagiarized documents in total, which form the testing set. The second dataset is the PAN Plagiarism Corpus 2010, PAN-PC-10 (Potthast et al. 2010a). The documents in the corpus are based on 22,000 English books, 520 German books, and 210 Spanish books. This corpus contains no real plagiarism cases; to create simulated ones, crowdsourcing, namely Amazon's Mechanical Turk, was employed. Text passages chosen at random from a source document were presented to a human whose task was to rewrite the passage so that the wording would be different but the semantics preserved, and the rewritten passages were then inserted into the suspicious documents. The obfuscation of the resulting plagiarism cases thus closely resembles the way human plagiarists work. The third dataset, P4PIN (Paraphrase for Plagiarism Including Negative examples) (Sánchez-Vega et al. 2017), contains a total of 3354 instances, 847 positive and 2507 negative. Each instance contains a pair of text fragments, the suspicious text and the possible source text, and a set of tags identifying the class of the instance (Plagiarism or NoPlagiarism). The positive instances are manually constructed paraphrase cases taken from the PAN-PC-10 corpus; the negative instances are formed from text fragments of PAN-PC-10 following a special selection strategy designed to yield difficult negative instances.

4.2 Performance measures

We use the same three measures, FDR (failed detection ratio), AR (average rank), and CR (composite rank), adopted in (Zhang and Chow 2011) for performance evaluation. The testing documents are tested against the source documents. For each testing document, the detection succeeds if its corresponding source document is among the 500 source documents most similar to the testing document. FDR is defined as

$$\begin{aligned} {\text {FDR}} = \frac{{\text {number of failed detections}}}{{\text {total number of plagiarized documents}}}. \end{aligned}$$
(25)

The 500 most similar source documents of a plagiarized document are sorted in descending order of similarity, with rankings 1, 2, ..., 500, respectively. Then AR is defined as

$$\begin{aligned} {\text {AR}} = \frac{\sum _{i=1}^{N_t} R_i}{{\text {total number of succeeded detections}}} \end{aligned}$$
(26)

where \(N_t\) is the total number of testing documents, and \(R_i\) is the ranking of the detected source document for the ith testing document, with \(R_i=0\) if the detection fails. CR is a composite metric defined as

$$\begin{aligned} {\text {CR}} = \frac{{\text {AR}}}{1-{\text {FDR}}}. \end{aligned}$$
(27)

Note that smaller values of these measures indicate better plagiarism detection performance. In addition, we use 4 other indicators (Potthast et al. 2010b) employed by the PAN-PC competitions in plagiarism detection: precision, recall, granularity, and the PlagDet score. Granularity is defined as the average number of reported detections per plagiarized text passage. Precision and recall are defined as

$$\begin{aligned} {\text {precision}} = \frac{r_s}{R},\ \ \ {\text {recall}} = \frac{r_s}{S} \end{aligned}$$
(28)

where \(r_s\) is the number of plagiarism cases correctly detected, R is the number of reported suspicious plagiarism cases, and S is the number of plagiarism cases. Finally, PlagDet is calculated as

$$\begin{aligned} {\text {PlagDet}} = \frac{2 \times {\text {precision}} \times {\text {recall}}}{({\text {precision}} + {\text {recall}}) \times \log _2 (1+{\text {granularity}})}. \end{aligned}$$
(29)
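For concreteness, a direct implementation of the measure of Eq.(29) follows; it uses the standard PAN discounting by \(\log _2(1+{\text {granularity}})\), under which a granularity of 1 leaves the F1 value unchanged.

```python
import math

def plagdet(precision, recall, granularity):
    """PlagDet of Eq. (29): the F1 of precision and recall, discounted
    by log2(1 + granularity)."""
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)

print(plagdet(0.9, 0.8, 1.0))   # ~0.847, equal to the plain F1
```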

Note that larger values of these measures indicate better plagiarism detection performance. In our method, we transform the words into numeric vectors, group the vectors into semantic concepts, and represent the documents as vectors. To detect possible plagiarism between two documents, we basically compute the cosine between the representation vectors of the two documents, instead of computing the distance between the two documents at the character level as explained in (Potthast et al. 2010b).

Table 2 Performance comparisons between our method and other methods, with the Html_CityU1 dataset

4.3 Comparison with other methods

Table 2 shows comparisons on FDR, AR, and CR between our method and other methods, with the Html_CityU1 dataset. In this table, the value obtained by the best method for each case is shown in boldface. For MLM, the vocabulary contains 5000 words and the dimensionality is set to 200 after PCA. MLMS-Hybrid, MLMS-Local, MLMS-Global, and MLMH are four MLM versions in (Zhang and Chow 2011); \(\lambda \) is set to 0.35 for MLMS-Hybrid and to 0 for MLMH. For our method, we use \(H=300\) with Word2Vec, \(\theta =0.9\) with PCA, and \(K=200\) with spherical K-means. Two n-gram methods (Sidorov et al. 2014), 3-gram and 5-gram, are also compared. As can be seen, our method performs best: it has the lowest FDR (2.56%), the lowest AR (72.91), and the lowest CR (74.83).

Table 3 Performance comparisons between our method and other methods, with the PAN-PC-10 dataset

Table 3 shows comparisons on precision, recall, granularity, and PlagDet score between our method and other methods, with the PAN-PC-10 dataset. K&M (Kasprzak and Brandejs 2010) find pairs of source and suspicious documents and their common chunk IDs; document pairs with fewer than 20 chunks in common are discarded, and the PlagDet value was not provided in the paper. PDLK (Abdi et al. 2015) computes semantic and syntactic similarity sentence-to-sentence, to avoid selecting a source sentence that is superficially similar to the suspicious sentence but different in meaning. The \({\text {detailed}}_{{\text {fuzjac}}}\) method (Kadhim and Mohammed 2019) integrates exact and fuzzy similarity to improve the detection of external textual plagiarism. SHAPD2 and w-shingling (Ceglarek 2013) are the semantic compression versions of the Sentence Hashing Algorithm for Plagiarism Detection 2 (SHAPD2) and the w-shingling algorithm, respectively. Note that K&M, PDLK, \({\text {detailed}}_{{\text {fuzjac}}}\), SHAPD2, and w-shingling are approaches published at the PAN workshops or evaluated against the PAN-PC-10 dataset. As can be seen, no method outperforms all the others in every case; however, our method performs well in recall, granularity, and PlagDet, and best in PlagDet. Table 4 shows comparisons on precision, recall, granularity, and PlagDet score between our method and MLMH, with the P4PIN dataset; our method performs better than MLMH. Applying a paired t-test to the results in Table 3 and Table 4, we conclude that our method is slightly better than MLMH in PlagDet and better than MLMH in recall, approaching significance at the 90% confidence level. The comparison would be more meaningful with more experiments; we hope to continue the research and experiment with more datasets in the future.

Table 4 Performance comparisons between our method and MLMH, with the P4PIN dataset

4.4 Comparison with LSI

Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique to identify patterns in the relationships between the words and concepts contained in an unstructured collection of text (Deerwester 1988). A matrix of word occurrences, with rows corresponding to words and columns corresponding to documents, is constructed from a large body of text, and singular value decomposition (SVD) is used to reduce the dimensionality of the matrix. Documents are then compared by taking the cosine of the angle between their column vectors. In LSI, the similarity of two words depends on the number of times the words appear in each document, not on the meaning of the words with respect to their surrounding context words. In this experiment, we compare our method and LSI on the different datasets; the results are shown in Tables 5, 6, and 7, respectively.

Table 5 Performance comparisons between our method and LSI, with the Html_CityU1 dataset
Table 6 Performance comparisons between our method and LSI, with the PAN-PC-10 dataset
Table 7 Performance comparisons between our method and LSI, with the P4PIN dataset

4.5 Comparison with BM25

In our work, tf-idf is adopted for term weighting. One alternative is BM25, a ranking function used by search engines to estimate the relevance of documents to a given search query (Robertson and Zaragoza 2009). Given a query q containing keywords \(q_1\), \(q_2\), ..., \(q_n\), the BM25 score of a document d is:

$$\begin{aligned} {\text {Score}}(q,d) = \sum _{i=1}^{n} {\text {IDF}}(q_i){\times }R(q_i,d) \end{aligned}$$
(30)

where \({\text {IDF}}(q_i)\) is the idf weight of the query term \(q_i\) and is usually computed as

$$\begin{aligned} {\text {IDF}}(q_i) = \log {\frac{N-{\text {df}}(q_i)+0.5}{{\text {df}}(q_i)+0.5}} \end{aligned}$$
(31)

and \(R(q_i,d)\) indicates the relevance between \(q_i\) and d, defined as

$$\begin{aligned} R(q_i,d) = \frac{{\text {tf}}_d(q_i){\times }(k_i+1)}{{\text {tf}}_d(q_i)+k_i} \end{aligned}$$
(32)

in which \(k_i\) is a constant, usually set to 2. In this experiment, we use BM25 in place of tf-idf in Eq.(10) and Eq.(14). The comparison results are shown in Tables 8 and 9, respectively.
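A sketch of this simplified BM25 variant (Eqs. (30)-(32) carry no document-length normalization, so none appears here either):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, N, df, k=2.0):
    """BM25 score of Eqs. (30)-(32). `doc` is a token list, N the
    number of reference documents, and `df` maps a term to the number
    of reference documents containing it."""
    tf = Counter(doc)
    score = 0.0
    for q in query_terms:
        if tf[q] == 0:
            continue
        idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5))   # Eq. (31)
        r = tf[q] * (k + 1) / (tf[q] + k)                   # Eq. (32)
        score += idf * r
    return score
```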

Table 8 Performance comparisons between our method and BM25, with the PAN-PC-10 dataset
Table 9 Performance comparisons between our method and BM25, with the P4PIN dataset

4.6 Comparison with K-means

We choose spherical K-means rather than K-means for clustering. Spherical K-means applies cosine similarity, while K-means uses Euclidean distance, in the clustering process. Cosine measures the difference in direction between two vectors: the more semantically similar two words are, the more their word vectors point in the same direction, the larger the cosine between them, and the more likely the two words are to be grouped into the same cluster. Therefore, the semantic similarity between two words is better measured by the cosine, rather than the Euclidean distance, between their word vectors. Table 10 shows the performance comparison between K-means and spherical K-means in our method, under the condition of \(\theta = 0.9\) and \(K=200\). As can be seen, spherical K-means is much better in FDR.
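The relationship between the two criteria can be made precise: on unit-length vectors, squared Euclidean distance is a monotone function of cosine similarity,

$$\begin{aligned} \Vert {\mathbf {x}}-{\mathbf {y}}\Vert ^2 = \Vert {\mathbf {x}}\Vert ^2 + \Vert {\mathbf {y}}\Vert ^2 - 2\,{\mathbf {x}}\cdot {\mathbf {y}} = 2\,(1-{\text {SIM}}({\mathbf {x}},{\mathbf {y}})) \quad {\text {if }} \Vert {\mathbf {x}}\Vert =\Vert {\mathbf {y}}\Vert =1, \end{aligned}$$

so the difference between the two algorithms stems from plain K-means operating on unnormalized vectors, whose lengths carry no clear semantic meaning, and from its centroids not being renormalized.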

Table 10 Comparisons of performance on Html_CityU1 between K-means and spherical K-means

4.7 Testing of parameter values

Our method requires a number of parameters to be set, e.g., \(\theta \), K, \(\epsilon \), \(\lambda \), and H. We use a pre-existing set of embeddings from Google, in which H is fixed at 300. The parameter \(\lambda \) balances the document-level and paragraph-level distances; as in (Zhang and Chow 2011), we observed that the document-level distance is less critical and thus set \(\lambda = 0\). The parameter \(\epsilon \) controls whether plagiarism is declared between a pair of sentences, a smaller \(\epsilon \) making plagiarism more likely to be reported. The parameters \(\theta \) and K control the degree of dimensionality reduction performed by PCA and the number of concepts obtained by spherical K-means, respectively. Table 11 shows the performance of our method on the Html_CityU1 dataset with different values of \(\theta \) and K. From this table, we can see that the performance is better with \(\theta = 0.9\) than with \(\theta = 0.8\): as \(\theta \) increases, more information and detail are kept, making plagiarism more likely to be detected. Likewise, as K increases, more concepts are produced, allowing more detailed descriptions of words and helping to improve the detection of plagiarism.

Table 11 Comparisons of performance on Html_CityU1 with different values of \(\theta \) and K

5 Conclusion

Plagiarism detection is a challenging task, even for human beings, especially when the source texts are obfuscated. A plagiarized work may contain reused text that is intentionally hidden by rewording practices such as semantic equivalences, discursive changes, and morphological or lexical substitutions, and detection is even harder when plagiarism occurs across languages. As a result, detecting plagiarism reliably needs to rely on the detection of similar semantic concepts.

Existing methods may have difficulties with plagiarized documents because they do not handle the semantics of words satisfactorily. We have presented a method that enhances extrinsic plagiarism detection by using the computational semantics of words. We use Word2Vec to transform the words into word vectors that reveal the semantic relationships among different words; spherical K-means is applied to cluster the words into semantic concepts; and documents and their paragraphs are then represented in terms of the concepts. A two-phase matching strategy is developed: in the first phase, possible source documents of plagiarism are located, while in the second phase, the plagiarized parts are identified and shown to the user.

Our approach has limitations. The terms used are single words, yet phrases, e.g., United Nations, are more descriptive than single words for plagiarism detection. Also, the semantic meaning of sentences or documents remains unknown; for example, the negative sense of the sentence "there is nothing on the menu that a gourmet would like" may be improperly interpreted and represented. In such cases, our system could make unsatisfactory decisions and produce a high false positive rate, and we will perform an in-depth analysis of the failed cases. Furthermore, homonyms, or multiple-meaning words, are words with the same spelling but different meanings; we will try other word embeddings, e.g., FastText, GloVe, or BERT, to look into this issue and other issues associated with ambiguity and misalignments between paragraphs. Since many parameters are involved in our system, we will also study the sensitivity of each parameter with respect to detection performance. We will further make use of passage retrieval and field weightings to maintain a hierarchy of sections in order to increase the efficiency of the detection process. Finally, thorough detection against a huge number of reference documents will require parallel or distributed processing facilities to increase the scalability of our system.