1 Introduction

Plagiarism occurs when the content of a text is reused in an undesirable way, particularly when the source is not cited (Sattler et al. 2017; Leung and Cheng 2017). Plagiarism is an age-old issue, made all the more problematic by the ease with which text can be reused and disguised given the widespread use of computers and the ubiquity of the Internet. Because plagiarism infringes intellectual property rights, it is a serious problem today, yet detecting it effectively remains a challenging task. Research on plagiarism detection started in the 1990s, pioneered by studies of copy detection in digital documents (Brin et al. 1995). In recent years, research on plagiarism detection has evolved actively, taking advantage of developments in related fields such as information retrieval, natural language processing, computational linguistics, near-duplicate detection, and artificial intelligence (Clough 2000; Baeza-Yates and Ribeiro-Neto 2011; Barrón-Cedeño et al. 2009; Alzahrani and Salim 2010; Alzahrani et al. 2012; Meuschke et al. 2017; Schneider et al. 2018).

There are several types of plagiarism, committed in different ways (Alzahrani et al. 2012). Exact copying is the simplest type: specific passages of text are copied without any change from one document to another (Monostori et al. 2000). Sentence rewriting (Jadalla and Elnagar 2012) is another type: a sentence is rewritten by changing the order of words, adding or deleting some words, or substituting some words with semantically similar ones. The rewritten sentence looks different from the original, but the two actually have the same meaning. The most complex types are idea adoption and writing-style change (Meyer zu Eissen and Stein 2006). After reading and understanding the original document, the plagiarist rewrites sentences, paragraphs, or even the whole document in his or her own style, so that the plagiarized work itself appears original.

Plagiarism detection systems can be divided into two categories, extrinsic and intrinsic (Stein et al. 2007a). Extrinsic detection systems compare a suspicious document with a reference collection, a set of documents assumed to be genuine. Based on a chosen document model and predefined similarity criteria, the detection task is to retrieve all documents that contain the text plagiarized in the suspicious document (Stein et al. 2007b; Potthast et al. 2009). When the reference collection cannot be accessed or a high degree of obfuscation exists, plagiarism detection can be very difficult. In contrast, intrinsic detection systems analyze only the text under evaluation, without comparisons to external documents, aiming to recognize changes in the author's unique writing style as an indicator of potential plagiarism (Meyer zu Eissen and Stein 2006; Stein et al. 2011).

Plagiarism detection originated from research on document retrieval (Blair and Maron 1985) and near-duplicate detection (Henzinger 2006). In document retrieval, a user inputs a query and the documents matching the query are retrieved from a corpus. The query can be a set of words, one or more sentences, or even a paragraph; the corpus is usually a collection of unstructured texts such as news articles, e-mails, and web pages. The best-known examples of document retrieval are web search engines such as Google. Near-duplicate detection, as the name suggests, finds pairs of documents that are nearly the same; it does not require two documents to be identical, but allows insignificant differences between them. Near-duplicate detection is also useful for web search engines, since a user may be discouraged if many near-duplicate web pages appear in the search results.

Document retrieval and near-duplicate detection usually deal with text at the document level: the global context is considered and small details are ignored. Plagiarism, however, can occur at various scopes. A plagiarist may rewrite a whole document through idea adoption, committing plagiarism in the global context; alternatively, plagiarism can occur in the local context, e.g., by copying or rewriting a few sentences of a document. Therefore, plagiarism detection should be undertaken in both local and global contexts (Gipp 2014).

Most existing plagiarism detection methods employ the vector space model (VSM), or bag-of-words (BOW), to convert documents into vectors. Each element of a vector corresponds to the weight of a word in the dictionary, and the words in the dictionary are treated as independent of each other. However, in a plagiarized document, some words of the original may have been replaced with semantically similar words. The independence assumption is then invalid, and such plagiarized documents are difficult to detect because the semantics of words are not handled satisfactorily. The application of semantic networks and semantic compression has been demonstrated to be a valuable addition to existing plagiarism detection methods (Ceglarek 2013).

In this paper, we are concerned with building an extrinsic plagiarism detection system. Given a query document and a corpus of reference documents, the system retrieves all the reference documents from which text has been plagiarized in the query document. In addition, the original and plagiarized passages are located and presented to the user. In the system, Word2Vec is used to transform the words in the documents into word vectors that reveal the semantic relationships among different words. Spherical K-means is applied to cluster the words into semantic concepts, and documents and their paragraphs are then represented in terms of the obtained concepts. Finally, a two-phase matching strategy is developed: in the first phase, possible source documents involved in plagiarism are located; in the second phase, the plagiarized parts are identified and shown to the user. A number of experiments are conducted to demonstrate the effectiveness of our proposed method in plagiarism detection.

The contributions of this work are the following: (1) semantics are provided by Word2Vec word embeddings; (2) the embeddings are clustered into semantic concepts; (3) documents are represented at different levels of granularity using the semantic concepts; and (4) a two-stage approach, filtering then identifying, is introduced for effective plagiarism detection.

The remainder of this paper is organized as follows. In Sect. 2, related work is briefly reviewed. Section 3 gives an overview and a detailed description of our proposed method. Experimental results are presented in Sect. 4. Finally, a conclusion is given in Sect. 5.

2 Related work

As mentioned, plagiarism detection systems can be divided into two categories, extrinsic and intrinsic. Extrinsic systems use a collection of reference documents and can detect plagiarism committed through text copying, sentence rewriting, and idea adoption. In contrast, intrinsic systems detect plagiarism by analyzing and recognizing the writing style of the author. Table 1 summarizes the characteristics of the two categories. Many different plagiarism detection systems have been proposed; a brief survey is given below.

Table 1 Categories of plagiarism detection systems

2.1 Detection for text copying

Text-matching techniques were developed for detecting text copying, i.e., finding the text copied between two documents. In (Monostori et al. 2000), the MatchDetectReveal (MDR) system is proposed, which is capable of identifying overlapping and plagiarized documents; its matching-engine component uses a modified suffix tree representation able to identify exactly overlapping chunks. In (Campbell et al. 2000), a sentence-based system is proposed to produce the distribution of overlap between overlapping documents; it is resistant to inaccuracy caused by large variations in document size.

2.2 Detection for sentence rewriting

For detecting plagiarism by sentence rewriting, techniques were developed that calculate the commonality between two documents based on fingerprints, n-grams, or bags of words (Jadalla and Elnagar 2012; Muhr et al. 2009; Deepa et al. 2016). Question answering systems (QASs) are, in some sense, an example of plagiarism detection: a QAS retrieves from a collection of documents the text portions that contain answers to the user's questions (Waheeb and Babu 2016; Chacko 2018). In (Sarrouti and Alaoui 2017), an efficient passage retrieval method is proposed to retrieve relevant passages in biomedical QASs with high mean average precision. Chow et al. propose the MultiLayer Self-Organizing Map (MLSOM) (Chow and Rahman 2009) for document retrieval and plagiarism detection. They split a document into pages and each page into paragraphs, so that a document is represented as a three-level tree whose levels correspond to document, page, and paragraph, respectively; a three-layer SOM is built, and local matching techniques are developed for comparing text documents. Zhang and Chow propose a framework named MultiLevel Matching (MLM) for plagiarism detection (Zhang and Chow 2011). They use a two-level structure, document-paragraph, to represent each document; histogram vectors represent the information extracted at the document level and paragraph level, respectively, and a hybrid distance measures the difference between two documents. Ceglarek (Ceglarek 2013) demonstrates that applying semantic compression boosts the efficiency of the Sentence Hashing Algorithm for Plagiarism Detection 2 (SHAPD2) and the w-shingling algorithm.

2.3 Detection for idea adoption

Idea adoption occurs frequently in paraphrase plagiarism (Franco-Salvador et al. 2016; Alvarez-Carmona et al. 2018). Paraphrase plagiarism contains reused text that is intentionally hidden by rewording practices such as semantic equivalences, discursive changes, and morphological or lexical substitutions (P4PIN 2020). In (Marti et al. 2013), attention is paid to the paraphrase phenomena underlying acts of plagiarism. The experiments presented there show that more complex paraphrase phenomena and a high density of paraphrase mechanisms make plagiarism detection more difficult, that lexical substitutions are the most frequently used paraphrase mechanism when plagiarizing, and that paraphrase mechanisms tend to shorten the plagiarized text. In (Naawab et al. 2016), query expansion is performed using the UMLS Metathesaurus to deal with situations in which original documents are obfuscated. Various approaches to word sense disambiguation are investigated to handle cases where there are multiple Concept Unique Identifiers (CUIs) for a given term, i.e., replacements of words or phrases. In (Gonzalez-Agirre 2017), two computational models, semantic textual similarity (STS) and typed similarity, are developed for computing textual similarity: STS measures the degree of semantic equivalence between two sentences by assigning graded similarity values, while typed similarity identifies the type of relation that holds between a pair of similar items in a digital library.

2.4 Detection with word embedding

Various approaches have been proposed that use word embedding techniques, e.g., Word2Vec, to detect plagiarism by sentence rewriting or idea adoption. Word embedding techniques quantify and categorize semantic similarities between linguistic items based on their distributional properties, projecting the words contained in a set of training documents into a semantic space of specified dimensionality. In (Baba et al. 2017), the validity of using a distributed representation of words for defining document similarity is evaluated; the paper proposes a plagiarism detection method based on the local maximal value of the length of the longest common subsequence (LCS), with weights defined by a distributed representation. In (Mahmoud et al. 2017), Word2Vec is used to generate word vectors which are subsequently combined into a sentence-vector representation, and a convolutional neural network is proposed to measure the similarity between the representations of source and suspicious sentences. In (E et al. 2018), an embedding-based document representation for detecting plagiarism in documents is proposed: words are represented as multi-dimensional vectors, simple aggregation methods combine the word vectors into a sentence-vector representation, and the sentence pairs with the highest similarity scores are considered candidate plagiarism cases. In (Shahmohammadi et al. 2020), a single Bi-LSTM neural network is trained to encode the input document by leveraging its pretrained GloVe word vectors; three sets of handcrafted similarity features are combined with the output of the Bi-LSTM network to detect sentences or phrases that convey the same meaning but use different wording. Alotaibi and Joy (Alotaibi and Joy 2020) introduce a technique for English-Arabic cross-language plagiarism detection that combines word embedding, term weighting techniques, and universal sentence encoder models to improve the detection of sentence similarity.

2.5 Detection by writing style

Intrinsic plagiarism analysis identifies potential plagiarism by analyzing a document with respect to undeclared changes in writing style. In (Meyer zu Eissen and Stein 2006), stylometry, subsuming statistical methods for quantifying an author's unique writing style, is proposed: by constructing and comparing stylometric models of different text segments, passages that are stylistically different from the others can be detected. In (Sánchez-Vega et al. 2017), it is pointed out that the original author's stylistic fingerprint prevails in the plagiarized text even when paraphrasing occurs, and a text representation scheme is proposed that gathers both content and style characteristics of texts by means of character-level features. Kuznetsov et al. (Kuznetsov et al. 2016) develop a plagiarism detection method based on constructing an author style function from features of text sentences and detecting outliers; the method is also adapted to the diarization problem by segmenting author style statistics over text parts corresponding to different authors. In (Vysotska et al. 2018), the use of linguometry and stylometry technologies for detecting an author's style is discussed; statistical linguistic analysis of an author's text is used in stylometry to quantify the degree to which the analyzed text can be attributed to a specific author.

2.6 PAN for plagiarism detection

PAN is a series of scientific events and shared tasks on digital text forensics and stylometry (PAN 2020). The FIRE initiative (Forum for Information Retrieval Evaluation), organized by the Information Retrieval Society of India, has evolved continuously to meet new challenges in information access, including plagiarism detection. Considerable effort has been devoted to developing better models for plagiarism detection; probably the most interesting case is the PAN International Competition on Plagiarism Detection held in conjunction with CLEF (Potthast et al. 2009). To create simulated plagiarism cases, crowdsourcing has been employed (Potthast et al. 2010b), yielding obfuscation that closely resembles the way human plagiarists work. Two corpora, PAN-PC-10 (Potthast et al. 2010a) and P4PIN (Sánchez-Vega et al. 2017), were constructed, and based on these corpora various plagiarism detection techniques have been developed and published at the CLEF forums. For example, Alzahrani and Salim (Alzahrani and Salim 2010) propose a plagiarism detection method using a fuzzy semantic-based string similarity approach: a list of candidate documents for each suspicious document is retrieved using shingling and the Jaccard coefficient; suspicious documents are then compared sentence-wise with the associated candidate documents, which entails computing a fuzzy degree of similarity; and two sentences are marked as plagiarized if their fuzzy similarity score is above a certain threshold.

3 Proposed method

In BOW- or VSM-based methods, e.g., MLM, words are regarded as independent of each other, and two semantically similar words are treated as different words. In plagiarism, however, an author may substitute some words with semantically similar ones. Consider the following two sentences:

$$\begin{aligned} {\text {A small orange cat sits on the sofa and looks sleepy.}} \end{aligned}$$
(1)

and

$$\begin{aligned} {\text {An orange kitten sits on the settee and looks drowsy.}} \end{aligned}$$
(2)

After preprocessing, these two sentences may contain the following words:

$$\begin{aligned} s_1= & {} \{{\text {small, orange, cat, sit, sofa, look, sleepy}}\}, \end{aligned}$$
(3)
$$\begin{aligned} s_2= & {} \{{\text {orange, kitten, sit, settee, look, drowsy}}\}. \end{aligned}$$
(4)

Syntactically, there are 10 different words in total. By the VSM method, only 3 of the 10 words (orange, sit, and look) are common to both sentences, so the similarity between the sentences is low. By the n-gram method, e.g., 3-grams, we have 5 3-grams, \(\{{\text {small, orange, cat}}\}\), ..., \(\{{\text {sofa, look, sleepy}}\}\), for \(s_1\) and 4 3-grams, \(\{{\text {orange, kitten, sit}}\}\), ..., \(\{{\text {settee, look, drowsy}}\}\), for \(s_2\). Clearly, the two sets share no 3-gram, so the 3-gram method fails to detect the plagiarism between the two sentences. However, if semantics is considered and it is somehow learned that kitten and cat are semantically similar, sofa and settee are similar, and so are sleepy and drowsy, then there are only 7 semantically different words in total, of which 6 are common to the two sentences. One then concludes that the similarity between the sentences is high and that plagiarism occurs.
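The failure of these surface-level measures on this pair is easy to reproduce. The following sketch (ours, not part of the proposed system) computes the bag-of-words overlap and the 3-gram overlap for the token lists of Eq.(3) and Eq.(4):

```python
s1 = ["small", "orange", "cat", "sit", "sofa", "look", "sleepy"]
s2 = ["orange", "kitten", "sit", "settee", "look", "drowsy"]

# Bag-of-words: 3 words in common out of 10 distinct words in total.
print(len(set(s1) & set(s2)), len(set(s1) | set(s2)))  # 3 10

def ngrams(tokens, n=3):
    """All consecutive n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# 3-grams: 5 for s1, 4 for s2, and no overlap at all.
print(len(ngrams(s1)), len(ngrams(s2)), ngrams(s1) & ngrams(s2))  # 5 4 set()
```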

Fig. 1 Flow diagram of our method

Our proposed method performs plagiarism detection by taking the semantic relationships between words into account in the representation of the documents. The flow diagram of our method is shown in Fig. 1. We use Word2Vec to transform the words collected from the reference documents into word vectors that reveal the semantic relationships among different words. The word vectors are then clustered, and semantic concepts are developed; a concept is a representative standing for a group of words that are semantically similar to each other. Documents are then represented in terms of concepts, and plagiarism detection can be done effectively in two phases, filtering and identifying. In the filtering phase, the reference documents in the corpus are compared with the query document and suspicious source documents are selected; since the corpus is usually enormous, we filter out the source documents that are irrelevant to the query document. The suspicious documents are then passed to the identifying phase, in which the sentences involved in plagiarism in the source and query documents are identified and output to the user. A document is represented as a three-level structure, consisting of document, paragraph, and sentence levels. The upper two levels are used in the filtering phase, while the lower two levels are used in the identifying phase.

Para2vec and doc2vec (Mikolov et al. 2013b) look at text in the global context, ignoring small details in the local context. In contrast, our method is concerned with both the global context, examining the document and paragraph levels to decide whether the query plagiarizes some documents in the corpus, and the local context, examining the paragraph and sentence levels to locate the sentences involved in the plagiarism cases.

3.1 Computing word vectors

To find the semantic relationships among words, we convert the words into word vectors by Word2Vec (Mikolov et al. 2013a, b). FastText (Bojanowski et al. 2017), GloVe (Pennington et al. 2014), and BERT (Devlin et al. 2018) would also be suitable for our work; however, Word2Vec was more accessible and was therefore adopted. The following steps are taken:

  1. The reference documents are scanned and a vocabulary is built.

  2. Training patterns are extracted from the sentences of the reference documents.

  3. A Word2Vec neural network is built and trained with the training patterns. The word vectors for the words in the vocabulary are then obtained.

Note that if two words are semantically similar, their word vectors are close; otherwise, they are far apart. Therefore, whether two words are semantically similar can be decided by computing a measure, e.g., the cosine, between their corresponding word vectors.

3.1.1 Constructing vocabulary

Let the number of reference documents be N. First, as in MLM (Zhang and Chow 2011), preprocessing is applied to the N reference documents: upper-case letters are changed to lower case, punctuation marks are removed, stemming is applied, and stop words are deleted. Then all the words are collected, and for each word w the weight \({\text {wt}}(w)\) is calculated:

$$\begin{aligned} {\text {wt}}(w) = {{\text {tf}}(w)} \times \log _2 \left( \frac{N}{{\text {df}}(w)}\right) \end{aligned}$$
(5)

where \({\text {tf}}(w)\) is the term frequency of word w appearing in all the reference documents and \({\text {df}}(w)\) is the number of reference documents in which word w appears. The t words with the t highest weights are selected to form the vocabulary V. Let these t words be denoted as \(w_1\), \(w_2\), ..., \(w_t\).
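As an illustration of this step, a minimal sketch of Eq.(5) is given below. It assumes the reference documents are already preprocessed into token lists; the function name and plain-Python representation are ours.

```python
import math
from collections import Counter

def build_vocabulary(docs, t):
    """Select the t highest-weighted words over a list of preprocessed,
    tokenized reference documents, using Eq. (5):
    wt(w) = tf(w) * log2(N / df(w))."""
    N = len(docs)
    tf = Counter()          # term frequency over all reference documents
    df = Counter()          # number of documents in which a word appears
    for doc in docs:
        tf.update(doc)
        df.update(set(doc))
    wt = {w: tf[w] * math.log2(N / df[w]) for w in tf}
    return sorted(wt, key=wt.get, reverse=True)[:t]
```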

3.1.2 Extracting training patterns

Next, the training patterns are extracted from the reference documents. A training pattern is an input-output word pair (x, y). Each reference document is divided into a sequence of non-overlapping windows. From each window, the central word is taken as the input x and each of its context words is taken as an output y. Let a window contain \(2s+1\) words:

\(w^{'}_{r+1}{\ldots }w^{'}_{r+s}w^{'}_{r+s+1}w^{'}_{r+s+2}{\ldots }w^{'}_{r+2s+1}\) where \(w^{'}_{r+s+1}\) is the central word and the other words are context words. Then 2s training patterns:

\((w^{'}_{r+s+1},w^{'}_{r+1})\), ..., \((w^{'}_{r+s+1},w^{'}_{r+s})\), \((w^{'}_{r+s+1},w^{'}_{r+s+2})\), ..., \((w^{'}_{r+s+1},w^{'}_{r+2s+1})\)

are extracted from the window. For example, consider the following window of 7 consecutive words, with \(s=3\):

$$\begin{aligned} \{{\text {dad, look, very, upset, because, favor, basketball}}\}. \end{aligned}$$

Then the following 6 training patterns:

  • (upset, dad), (upset, look), (upset, very),

  • (upset, because), (upset, favor), (upset, basketball).

are extracted. Note that the first word in each pair is the input and the second is the output.
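A sketch of this extraction step follows; how words left over after the last complete window are handled is not specified in the text, so this sketch simply ignores them.

```python
def training_pairs(tokens, s=3):
    """Extract (input, output) pairs from non-overlapping windows of
    2s+1 words, the central word being the input and each context
    word an output."""
    width = 2 * s + 1
    pairs = []
    for r in range(0, len(tokens) - width + 1, width):
        window = tokens[r:r + width]
        center = window[s]
        pairs.extend((center, w) for i, w in enumerate(window) if i != s)
    return pairs

print(training_pairs(
    ["dad", "look", "very", "upset", "because", "favor", "basketball"]))
# [('upset', 'dad'), ('upset', 'look'), ('upset', 'very'),
#  ('upset', 'because'), ('upset', 'favor'), ('upset', 'basketball')]
```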

Fig. 2 A Word2Vec neural network

3.1.3 Getting word vectors

After the training patterns are collected, a Word2Vec network, as shown in Fig. 2, is built. Let the dimensionality of the vector space be H. The network has three layers, the input layer, the hidden layer, and the output layer, which contain t, H, and t neurons, respectively. Note that the number of neurons in the input layer and in the output layer equals the number of words in the vocabulary, while the number of neurons in the hidden layer equals the dimensionality of the resulting word vectors. The weights on the connections between the input layer and the hidden layer are named "syn0", while the weights on the connections between the hidden layer and the output layer are named "syn1".

The Word2Vec network is then trained with the training patterns. After training, the collection of the “syn0” weights between the ith input neuron and all the hidden neurons becomes the word vector of the word \(w_i\). Let \(v_1\), \(v_2\), ..., \(v_H\) be these weights, then

$$\begin{aligned} {\text {vec}}(w_i) = \begin{bmatrix} v_1&v_2&\ldots&v_H \end{bmatrix}^T \end{aligned}$$
(6)

is the word vector of the word \(w_i\).

In this work, we use a pre-existing set of embeddings (Google 2020): pre-trained vectors trained on part of the Google News dataset (about 100 billion words), containing 300-dimensional vectors for 3 million words and phrases, i.e., \(H=300\). By mapping the vocabulary to this set of embeddings, we obtain t H-dimensional word vectors, denoted \({\text {vec}}(w_1)\), \({\text {vec}}(w_2)\), ..., \({\text {vec}}(w_t)\) for words \(w_1\), \(w_2\), ..., \(w_t\), respectively.
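One way to load these embeddings in practice (our choice of tooling, not specified in the paper) is through the gensim library:

```python
from gensim.models import KeyedVectors

# Path to the pre-trained Google News vectors (H = 300); adjust as needed.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(vectors["cat"].shape)                     # (300,)
print(vectors.similarity("cat", "kitten"))      # high: semantically close
print(vectors.similarity("cat", "basketball"))  # much lower
```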

3.2 Construction of concepts

Next, we group the words in the vocabulary into semantic concepts by clustering. Before clustering, we reduce the dimensionality of the word vectors.

3.2.1 Dimensionality reduction by PCA

The word vectors obtained have dimensionality H, which can be too large for effective clustering. Principal component analysis (PCA) (Wold et al. 1987; Jolliffe 2002) is applied to reduce the dimensionality of the word vectors. Technically, a principal component is a linear combination of the original variables whose coefficients form an eigenvector of the covariance matrix of the word vectors. Assume there are f eigenvalues \(e_1,e_2,\ldots ,e_f\) associated with the covariance matrix, with \(e_1 \ge e_2 \ge \ldots \ge e_f\). We choose q principal components such that q is as small as possible while the cumulative energy is above a certain threshold \(\theta \), i.e.,

$$\begin{aligned} \frac{\sum _{i=1}^{q} e_i}{\sum _{i=1}^{f} e_i} \ge \theta \end{aligned}$$
(7)

As a result, there are q, instead of H, components in \({\text {vec}}(w_i)\), \(i=1,\ldots ,t\), after the application of PCA.
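In scikit-learn, this selection rule can be expressed directly: passing a float in (0, 1) as n_components keeps the smallest number of components whose cumulative explained-variance ratio exceeds the threshold, which matches the criterion of Eq.(7). A sketch with a random stand-in matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

t, H = 5000, 300
vecs = np.random.randn(t, H)   # stand-in for the t x H word-vector matrix

theta = 0.9
pca = PCA(n_components=theta, svd_solver="full")  # criterion of Eq. (7)
reduced = pca.fit_transform(vecs)                 # t x q matrix
print(pca.n_components_)                          # the chosen q
```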

3.2.2 Getting semantic concepts

Then we group the resulting t word vectors into K clusters by a clustering algorithm. Many types of clustering algorithms have been proposed, such as centroid-based clustering (Sarmiento et al. 2019), self-organizing maps (SOM), hierarchical clustering (Gagolewski et al. 2016), distribution-based clustering (Fellows et al. 2011), fuzzy C-means, GMM-EM, density-based algorithms (Wang et al. 2019), and subspace clustering (Luo et al. 2018). Clustering divides a set of objects into clusters such that objects in the same cluster are more similar to each other than to those in other clusters. We adopt spherical K-means (Dhillon and Modha 2012; Pratap et al. 2018; Hedar et al. 2018), a variant of K-means (Lloyd 1982), which is probably the best-known clustering algorithm in the AI community. Unlike K-means, spherical K-means uses the cosine instead of the Euclidean distance to measure the similarity between vectors; like K-means, it runs iteratively to divide a given set of vectors into a pre-specified number, K, of clusters. Spherical K-means operates as follows.

  1. All the involved vectors are normalized to unit length.

  2. The value of K is chosen by the user.

  3. K vectors are selected arbitrarily, by the user, as the initial centroids of the clusters.

  4. For each vector, the cosine between the vector and each centroid is computed, and the vector is assigned to the cluster with the nearest centroid.

  5. The centroids of the K clusters are recalculated and normalized.

  6. Steps 4 and 5 are repeated until the centroids no longer move.

When the algorithm terminates, we have K clusters and their centroids.

With Word2Vec, the word vectors of two semantically similar words lie in close proximity. Therefore, a cluster can be regarded as a concept consisting of a set of semantically similar words, while words in different clusters are semantically dissimilar. Let the K clusters obtained by spherical K-means have centroid vectors \({\mathbf {c}}_1\), \({\mathbf {c}}_2\), ..., \({\mathbf {c}}_K\), respectively. Then we have K concepts, each containing words semantically similar to each other; hereafter, the concepts are denoted \({\mathbf {c}}_1\), \({\mathbf {c}}_2\), ..., \({\mathbf {c}}_K\).
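A compact NumPy sketch of the algorithm follows; the random initialization and the empty-cluster handling are our own choices, as the description above leaves them open.

```python
import numpy as np

def spherical_kmeans(X, K, iters=100, seed=0):
    """Cluster the rows of X (t x q word vectors) into K concepts.
    Steps follow the list above: normalize, assign by cosine,
    recompute and renormalize centroids until convergence."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)         # step 1
    C = X[rng.choice(len(X), size=K, replace=False)]         # steps 2-3
    for _ in range(iters):
        labels = np.argmax(X @ C.T, axis=1)                  # step 4: cosine
        newC = np.vstack([X[labels == k].mean(axis=0)
                          if np.any(labels == k) else C[k]
                          for k in range(K)])
        newC /= np.linalg.norm(newC, axis=1, keepdims=True)  # step 5
        if np.allclose(newC, C):                             # step 6
            break
        C = newC
    return C, labels
```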

3.3 Representing documents in concepts

After obtaining the K concepts from the reference documents, we represent documents in terms of these concepts. A three-level representation, concerning the document, its paragraphs, and its sentences, is formed for each document. The top level is the document level. For a document d, the document vector \({\mathbf {D}}(d)\) is formed:

$$\begin{aligned} {\mathbf {D}}(d)= \begin{bmatrix} g_1&g_2&\ldots&g_K \end{bmatrix}^T \end{aligned}$$
(8)

where \(g_k\) is the strength of concept k in this document, defined as

$$\begin{aligned} g_k= \sum _{w \in d} {\text {wt}}_d(w){\times }{\text {SIM}}({\text {vec}}(w),{\mathbf {c}}_k) \end{aligned}$$
(9)

for \(1\le k\le K\). Note that \({\text {wt}}_d(w)\) is the weight of word w, defined as

$$\begin{aligned} {\text {wt}}_d(w) = {{\text {tf}}_d(w)} \times \log _2 \left( \frac{N}{{\text {df}}(w)}\right) \end{aligned}$$
(10)

where \({{\text {tf}}_d(w)}\) is the term frequency of word w appearing in d, and \({\text {SIM}}({\mathbf {x}},{\mathbf {y}})\) is the cosine similarity between vectors \({\mathbf {x}}\) and \({\mathbf {y}}\), defined as

$$\begin{aligned} {\text {SIM}}({\mathbf {x}},{\mathbf {y}})= & {} \frac{{\mathbf {x}}\cdot {\mathbf {y}}}{{\Vert {\mathbf {x}}\Vert }{\Vert {\mathbf {y}} \Vert }}. \end{aligned}$$
(11)

Thus, a K-dimensional vector denotes a document in a global manner.
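A sketch of Eq.(8)-(9) is given below; the same routine also yields the paragraph vectors introduced next, by passing a paragraph's words together with the weights of Eq.(14). The function signature is ours.

```python
import numpy as np

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def concept_vector(words, weight, vec, centroids):
    """Concept-strength vector of Eqs. (8)-(9): for each concept c_k,
    accumulate wt(w) * SIM(vec(w), c_k) over the given words.
    `weight` maps word -> tf-idf weight, `vec` maps word -> reduced
    word vector, `centroids` is the list of K concept centroids."""
    g = np.zeros(len(centroids))
    for w in words:
        for k, c in enumerate(centroids):
            g[k] += weight[w] * cosine(vec[w], c)
    return g
```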

The middle level is the paragraph level. For a document d with r paragraphs \(p_{1}\), \(p_{2}\), ..., \(p_{r}\), r paragraph vectors are formed. The paragraph vector \({\mathbf {P}}(p_a,d)\) of paragraph \(p_a\), \(1\le a\le r\), is defined as

$$\begin{aligned} {\mathbf {P}}(p_a,d)= \begin{bmatrix} e_1&e_2&\ldots&e_K \end{bmatrix}^T \end{aligned}$$
(12)

where \(e_k\) is the strength of concept k in paragraph \(p_a\):

$$\begin{aligned} e_k= \sum _{w \in p_a} {\text {wt}}_p(w){\times }{\text {SIM}}({\text {vec}}(w),{\mathbf {c}}_k) \end{aligned}$$
(13)

for \(1\le k\le K\). Note that \({\text {wt}}_p(w)\) indicates the weight of word w, defined as:

$$\begin{aligned} {\text {wt}}_p(w) = {{\text {tf}}_p(w)} \times \log _2 \left( \frac{r}{{\text {pf}}_p(w)}\right) \end{aligned}$$
(14)

where \({\text {tf}}_p(w)\) is the term frequency of word w in paragraph \(p_a\), and \({\text {pf}}_p(w)\) is the number of paragraphs of d in which word w appears.

The bottom level is the sentence level. For each sentence s in a paragraph p of a document d, the sentence vector \({\mathbf {S}}(s,p,d)\) is defined as

$$\begin{aligned} {\mathbf {S}}(s,p,d)= \begin{bmatrix} {\mathbf {f}}_1&{\mathbf {f}}_2&\ldots&{\mathbf {f}}_{\ell } \end{bmatrix}^T \end{aligned}$$
(15)

where \({\ell }\) is the number of words contained in sentence s and \({\mathbf {f}}_k\), \(1\le k\le \ell \), is the word vector of the kth word in s. Therefore, if paragraph p has h sentences, we have h sentence vectors for p.

3.4 Filtering phase

In this phase, the reference documents \(d_1\), \(d_2\), ..., \(d_N\) in the corpus are compared with the query document q, and those documents suspiciously plagiarized by the query are selected. This phase is basically similar to the candidate document selection phase of (Stein et al. 2007b).

Consider two documents \(d_1\) and \(d_2\) with document vectors \({\mathbf {D}}(d_1)\) and \({\mathbf {D}}(d_2)\), and paragraph vectors \({\mathbf {P}}(p_{1},d_{1})\), \({\mathbf {P}}(p_{2},d_{1})\), ..., \({\mathbf {P}}(p_{r_1},d_{1})\) and \({\mathbf {P}}(p_{1},d_{2})\), \({\mathbf {P}}(p_{2},d_{2})\), ..., \({\mathbf {P}}(p_{r_2},d_{2})\), respectively; that is, \(d_1\) has \(r_1\) paragraphs and \(d_2\) has \(r_2\) paragraphs. As in (Zhang and Chow 2011), we define a high-level dissimilarity of \(d_1\) from \(d_2\), \({\text {DIS}}_{{\text {high}}}(d_1,d_2)\), as

$$\begin{aligned} {\text {DIS}}_{{\text {high}}}(d_1,d_2) = \lambda {\times }{\text {DIS}}_{{\text {doc}}}(d_1,d_2) + (1-\lambda ){\times }{\text {DIS}}_{{\text {par}}}(d_1,d_2), \end{aligned}$$
(16)

where \(\lambda \in [0,1]\) is a pre-specified constant and

$$\begin{aligned} {\text {DIS}}_{{\text {doc}}}(d_1,d_2)= & {} d({\mathbf {D}}(d_1), {\mathbf {D}}(d_2)), \end{aligned}$$
(17)
$$\begin{aligned} {\text {DIS}}_{{\text {par}}}(d_1,d_2)= & {} \frac{\sum _{j=1}^{r_1} \min \{d({\mathbf {P}}({p_{j},d_1}),{\mathbf {P}}({p_{1},d_2)}), \ldots , d({\mathbf {P}}({p_{j},d_1}),{\mathbf {P}}({p_{r_2},d_2)})\}}{r_1} \end{aligned}$$
(18)

where \(d({\mathbf {x}}, {\mathbf {y}})\) is defined as (Zhang and Chow 2011)

$$\begin{aligned} d({\mathbf {x}}, {\mathbf {y}})= & {} 1 - e^{-(1-{\text {SIM}}({\mathbf {x}},{\mathbf {y}}))}. \end{aligned}$$
(19)

Note that a smaller \({\text {DIS}}_{{\text {high}}}(d_1,d_2)\) indicates that \(d_1\) is more similar to \(d_2\).

In (Zhang and Chow 2011), \({\text {DIS}}_{{\text {par}}}(d_1,d_2)\) is instead named \({\text {DIS}}_{{\text {local}}}(d_1,d_2)\) and is defined as

$$\begin{aligned} {\text {DIS}}_{{\text {local}}}(d_1,d_2) = \frac{\sum _{i=1}^{r_1}\sum _{j=1}^{r_2}d({\mathbf {P}}(p_{i},d_1),{\mathbf {P}}(p_{j},d_2))}{r_1r_2}. \end{aligned}$$
(20)

Note that Eq.(18) and Eq.(20) are different. In Eq.(20), all the pairwise distances between paragraphs are averaged, which may fail to detect plagiarism. Consider two documents \(d_1\) and \(d_2\), each having 2 paragraphs, with

$$\begin{aligned} {\mathbf {P}}(p_1,d_1)= & {} \begin{bmatrix} 3.9&-1.4 \end{bmatrix},\ {\mathbf {P}}(p_2,d_1)= \begin{bmatrix} 3.9&-1.4 \end{bmatrix}; \\ {\mathbf {P}}(p_1,d_2)= & {} \begin{bmatrix} 3.7&-1.5 \end{bmatrix},\ {\mathbf {P}}(p_2,d_2)= \begin{bmatrix} -1.7&6.5 \end{bmatrix}. \end{aligned}$$

Note that

$$\begin{aligned} d({\mathbf {P}}(p_{1},d_1),{\mathbf {P}}(p_1,d_2))= & {} 0.0005,\ d({\mathbf {P}}(p_{1},d_1),{\mathbf {P}}(p_2,d_2))=0.79, \\ d({\mathbf {P}}(p_{2},d_1),{\mathbf {P}}(p_1,d_2))= & {} 0.63,\ d({\mathbf {P}}(p_{2},d_1),{\mathbf {P}}(p_2,d_2))=0.19. \end{aligned}$$

By observation, we can see that paragraph \(p_1\) of \(d_1\) is very similar to paragraph \(p_1\) of \(d_2\), so plagiarism may have occurred between \(d_1\) and \(d_2\). But by Eq.(20),

$$\begin{aligned} {\text {DIS}}_{{\text {local}}}(d_1,d_2)= & {} \frac{\sum _{i=1}^{2}\sum _{j=1}^{2}d({\mathbf {P}}(p_{i},d_1),{\mathbf {P}}(p_j,d_2))}{2{\times }2} \\= & {} \frac{0.0005+0.79+0.63+0.19}{4} = 0.403 \end{aligned}$$

which suggests that \(d_1\) is fairly distant from \(d_2\), so the plagiarism may go undetected. However, by Eq.(18) we have

$$\begin{aligned}&{\text {DIS}}_{{\text {par}}}(d_1,d_2) \\&\quad =\frac{\sum _{j=1}^{2} \min \{d({\mathbf {P}}({p_{j},d_1}), {\mathbf {P}}({p_{1},d_2)}),d({\mathbf {P}}({p_{j},d_1}),{\mathbf {P}}({p_{2},d_2)})\}}{2} \\&\quad = \frac{\min \{0.0005,0.79\}+\min \{0.63,0.19\}}{2} \\&\quad = \frac{0.0005+0.19}{2} = 0.095. \end{aligned}$$

Our method gives a distance of 0.095 between \(d_1\) and \(d_2\), clearly indicating that plagiarism may have occurred between them.

The filtering process proceeds as follows. For each reference document \(d_i\), \(1\le i\le N\), we compute \({\text {DIS}}_{{\text {high}}}(q,d_i)\). If \({\text {DIS}}_{{\text {high}}}(q,d_i) < \tau \), where \(\tau \) is a pre-specified threshold, document \(d_i\) is regarded as a document suspiciously plagiarized by the query document q. All such documents are collected and passed to the identifying phase.
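The filtering computation can be summarized in a few lines (a sketch; D1 and D2 stand for document vectors, P1 and P2 for lists of paragraph vectors):

```python
import numpy as np

def d(x, y):
    """Distance of Eq. (19)."""
    cos = float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
    return 1.0 - np.exp(-(1.0 - cos))

def dis_par(P1, P2):
    """Eq. (18): average, over the paragraphs of d1, of the distance
    to the nearest paragraph of d2."""
    return float(np.mean([min(d(p1, p2) for p2 in P2) for p1 in P1]))

def dis_high(D1, D2, P1, P2, lam):
    """Eq. (16); a document passes the filter if this value < tau."""
    return lam * d(D1, D2) + (1.0 - lam) * dis_par(P1, P2)

# Min-then-average of Eq. (18) on the four pairwise distances of the
# worked example above:
dists = [[0.0005, 0.79], [0.63, 0.19]]
print(sum(min(row) for row in dists) / len(dists))   # 0.095
```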

3.5 Identifying phase

We now have the collection of reference documents selected in the filtering phase, and we want to identify all the sentences involved in plagiarism in the source and query documents.

The sentences involved in plagiarism in a reference document \(d_i\) and the query document q are found as follows. Let q have \(r_q\) paragraphs and \(d_i\) have \(r_i\) paragraphs. For each paragraph \(p_{k}\) in q, \(1\le k\le r_q\), we compute the similarity between this paragraph and every paragraph \(p_{j}\) in \(d_i\), i.e.,

$$\begin{aligned} {\text {SIM}}({\mathbf {P}}(p_{k},q),{\mathbf {P}}(p_{j},d_i)) \end{aligned}$$
(21)

for \(1\le j\le r_i\). Let paragraph \(p_{J}\) of \(d_i\) be the most similar to \(p_{k}\) of q. If

$$\begin{aligned} {\text {SIM}}({\mathbf {P}}(p_{k},q),{\mathbf {P}}(p_{J},d_i)) < \eta , \end{aligned}$$
(22)

where \(\eta \) is a threshold, then \(p_k\) is skipped; otherwise, we proceed. For each sentence \(s_1\) in \(p_{k}\) of q and each sentence \(s_2\) in \(p_{J}\) of \(d_i\), with the sentence vector of \(s_1\) being \({\mathbf {S}}(s_1,p_k,q)=\begin{bmatrix}{\mathbf {f}}_{1,1}&\ldots&{\mathbf {f}}_{\ell _1,1}\end{bmatrix}\) and the sentence vector of \(s_2\) being \({\mathbf {S}}(s_2,p_J,d_i)=\begin{bmatrix}{\mathbf {f}}_{1,2}&\ldots&{\mathbf {f}}_{\ell _2,2}\end{bmatrix}\), the similarity of \(s_1\) to \(s_2\) is calculated as

$$\begin{aligned} {\text {PLA}}(s_1,s_2) = \frac{\sum _{j=1}^{\ell _1} \max \{{\text {SIM}} ({\mathbf {f}}_{j,1},{\mathbf {f}}_{1,2}),\ldots ,{\text {SIM}}({\mathbf {f}}_{j,1}, {\mathbf {f}}_{\ell _2,2})\}}{\ell _1}. \end{aligned}$$
(23)

Let \(s^{\star }_2\) in \(p_J\) be the sentence most similar to \(s_1\). If \({\text {PLA}}(s_1, s^{\star }_2) \ge \epsilon \) for some pre-specified threshold \(\epsilon \), then sentence \(s_1\) in \(p_k\) of q is regarded as a plagiarism of sentence \(s^{\star }_2\) in \(p_{J}\) of \(d_i\).
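A direct transcription of Eq.(23) is shown below; sentence vectors here are lists of word vectors, as in Eq.(15).

```python
import numpy as np

def pla(S1, S2):
    """Eq. (23): for each word vector of the suspicious sentence s1,
    take its best cosine match among the word vectors of the candidate
    source sentence s2, and average over the words of s1."""
    def cosine(x, y):
        return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
    return float(np.mean([max(cosine(f1, f2) for f2 in S2) for f1 in S1]))

# s1 is flagged as a plagiarism of its best match s2* when
# pla(S1, S2_star) >= epsilon.
```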

Consider the two sentences \(s_1\) and \(s_2\) in Eq.(3) and Eq.(4), and assume that \(s_1\) appears in paragraph \(p_1\) of a document A while \(s_2\) appears in paragraph \(p_2\) of another document B. As mentioned, with the bag-of-words model there are 10 different words in total but only 3 words, orange, sit, and look, common to \(s_1\) and \(s_2\); the similarity between them is low and the plagiarism may not be detected. Suppose that, by considering semantics, the most similar words to orange, kitten, sit, settee, look, and drowsy in \(s_2\) are orange, cat, sit, sofa, look, and sleepy, respectively, in \(s_1\), and

  • SIM(vec(orange),vec(orange))=1.0, SIM(vec(kitten),vec(cat))=0.764,

  • SIM(vec(sit),vec(sit))=1.0, SIM(vec(settee),vec(sofa))=0.776,

  • SIM(vec(look),vec(look))=1.0, SIM(vec(drowsy),vec(sleepy))=0.52.

Then, by Eq.(23) we have

$$\begin{aligned} {\text {PLA}}(s_2,s_1) = \frac{1.0+ 0.764+ 1.0+ 0.776+ 1.0+ 0.52}{6} = 0.84. \end{aligned}$$
(24)

For \(\epsilon =0.6\), we conclude that \(s_2\) in paragraph 2 of document B is a plagiarism of \(s_1\) in paragraph 1 of document A.

4 Experimental results

In this section, experimental results are presented to demonstrate the effectiveness of our method, together with comparisons against other methods.

4.1 Datasets

Three datasets are used in the experiments. The first is the Html_CityU1 dataset used in (Zhang and Chow 2011). It has 26 categories, each containing 400 documents, for a total of 10,400 source documents; these documents were downloaded from the Internet and serve as reference documents. A pair of documents in the same category was randomly chosen, one as the plagiarized document and the other as the source document, and part of the content of the source document was copied into the plagiarized document. This process was repeated three times for each category, producing \(26{\times }3=78\) plagiarized documents in total, which form the testing set. The second dataset is the PAN Plagiarism Corpus 2010, PAN-PC-10 (Potthast et al. 2010a). The documents in the corpus are based on 22,000 English books, 520 German books, and 210 Spanish books. This corpus contains no real plagiarism cases; to create simulated ones, crowdsourcing, namely Amazon's Mechanical Turk, was employed. Text passages chosen at random from a source document were presented to a human whose task was to rewrite the passage so that the wording would be different but the semantics preserved, and the rewritten passages were then inserted into the suspicious documents. The obfuscation of the resulting plagiarism cases thus closely resembles the way human plagiarists work. The third dataset, P4PIN (Paraphrase for Plagiarism Including Negative examples) (Sánchez-Vega et al. 2017), contains a total of 3354 instances, 847 positive and 2507 negative. Each instance contains a pair of text fragments, the suspicious text and the possible source text, and a set of tags identifying the class of the instance (Plagiarism or NoPlagiarism). The positive instances are manually constructed paraphrase cases taken from the PAN-PC-10 corpus; the negative instances are formed from text fragments of PAN-PC-10 following a special selection strategy designed to yield difficult negative instances.

4.2 Performance measures

We use the same three measures, FDR (failed detection ratio), AR (average rank), and CR (composite rank), adopted in (Zhang and Chow 2011) for performance evaluation. The testing documents are tested against the source documents. For each testing document, the detection succeeds if its corresponding source document is among the 500 source documents most similar to the testing document. FDR is defined as

$$\begin{aligned} {\text {FDR}} = \frac{{\text {number of failed detections}}}{{\text {total number of plagiarized documents}}}. \end{aligned}$$
(25)

The 500 most similar source documents of a plagiarized document are sorted in descending order of similarity, with rankings 1, 2, ..., 500, respectively. Then AR is defined as

$$\begin{aligned} {\text {AR}} = \frac{\sum _{i=1}^{N_t} R_i}{{\text {total number of succeeded detections}}} \end{aligned}$$
(26)

where \(N_t\) is the total number of testing documents, and \(R_i\) is the ranking of the detected source document for the ith testing document, with \(R_i=0\) if the detection fails. CR is a composite metric defined as

$$\begin{aligned} {\text {CR}} = \frac{{\text {AR}}}{1-{\text {FDR}}}. \end{aligned}$$
(27)

Note that smaller values of these measures indicate better plagiarism detection performance. In addition, we use 4 other indicators (Potthast et al. 2010b) employed by the PAN-PC competitions in plagiarism detection: precision, recall, granularity, and the PlagDet score. Granularity is defined as the average number of reported detections per plagiarized text passage. Precision and recall are defined as

$$\begin{aligned} {\text {precision}} = \frac{r_s}{R},\ \ \ {\text {recall}} = \frac{r_s}{S} \end{aligned}$$
(28)

where \(r_s\) is the number of plagiarism cases correctly detected, R is the number of reported suspicious plagiarism cases, and S is the number of plagiarism cases. Finally, PlagDet is calculated as

$$\begin{aligned} {\text {PlagDet}} = \frac{2 \times {\text {precision}} \times {\text {recall}}}{({\text {precision}} + {\text {recall}}) \times \log _2 (1+{\text {granularity}})}. \end{aligned}$$
(29)
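For concreteness, a direct implementation of the measure of Eq.(29) follows; it uses the standard PAN discounting by \(\log _2(1+{\text {granularity}})\), under which a granularity of 1 leaves the F1 value unchanged.

```python
import math

def plagdet(precision, recall, granularity):
    """PlagDet of Eq. (29): the F1 of precision and recall, discounted
    by log2(1 + granularity)."""
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)

print(plagdet(0.9, 0.8, 1.0))   # ~0.847, equal to the plain F1
```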

Note that larger values of these measures indicate better plagiarism detection performance. In our method, we transform the words into numeric vectors, group the vectors into semantic concepts, and represent the documents as vectors. To detect possible plagiarism between two documents, we basically compute the cosine between the representation vectors of the two documents, instead of computing the distance between the two documents at the character level as explained in (Potthast et al. 2010b).

Table 2 Performance comparisons between our method and other methods, with the Html_CityU1 dataset

4.3 Comparison with other methods

Table 2 shows comparisons on FDR, AR, and CR between our method and other methods, with the Html_CityU1 dataset. In this table, the value obtained by the best method for each case is shown in boldface. For MLM, the vocabulary contains 5000 words and the dimensionality is set to 200 after PCA. MLMS-Hybrid, MLMS-Local, MLMS-Global, and MLMH are four MLM versions in (Zhang and Chow 2011); \(\lambda \) is set to 0.35 for MLMS-Hybrid and to 0 for MLMH. For our method, we use \(H=300\) with Word2Vec, \(\theta =0.9\) with PCA, and \(K=200\) with spherical K-means. Two n-gram methods (Sidorov et al. 2014), 3-gram and 5-gram, are also compared. As can be seen, our method performs best: it has the lowest FDR (2.56%), the lowest AR (72.91), and the lowest CR (74.83).

Table 3 Performance comparisons between our method and other methods, with the PAN-PC-10 dataset

Table 3 shows comparisons on precision, recall, granularity, and PlagDet score between our method and other methods, with the PAN-PC-10 dataset. K&M (Kasprzak and Brandejs 2010) find pairs of source and suspicious documents and their common chunk IDs; document pairs with fewer than 20 chunks in common are discarded, and the PlagDet value was not provided in the paper. PDLK (Abdi et al. 2015) computes semantic and syntactic similarity sentence-to-sentence, to avoid selecting a source sentence that is superficially similar to the suspicious sentence but different in meaning. The \({\text {detailed}}_{{\text {fuzjac}}}\) method (Kadhim and Mohammed 2019) integrates exact and fuzzy similarity to improve the detection of external textual plagiarism. SHAPD2 and w-shingling (Ceglarek 2013) are the semantic compression versions of the Sentence Hashing Algorithm for Plagiarism Detection 2 (SHAPD2) and the w-shingling algorithm, respectively. Note that K&M, PDLK, \({\text {detailed}}_{{\text {fuzjac}}}\), SHAPD2, and w-shingling are approaches published at the PAN workshops or evaluated against the PAN-PC-10 dataset. As can be seen, no method outperforms all the others in every case; however, our method performs well in recall, granularity, and PlagDet, and best in PlagDet. Table 4 shows comparisons on precision, recall, granularity, and PlagDet score between our method and MLMH, with the P4PIN dataset; our method performs better than MLMH. Applying a paired t-test to the results in Table 3 and Table 4, we conclude that our method is slightly better than MLMH in PlagDet and better than MLMH in recall, approaching significance at the 90% confidence level. The comparison would be more meaningful with more experiments; we hope to continue the research and experiment with more datasets in the future.

Table 4 Performance comparisons between our method and MLMH, with the P4PIN dataset

4.4 Comparison with LSI

Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique to identify patterns in the relationships between the words and concepts contained in an unstructured collection of text (Deerwester 1988). A matrix of word occurrences, with rows corresponding to words and columns corresponding to documents, is constructed from a large body of text, and singular value decomposition (SVD) is used to reduce the dimensionality of the matrix. Documents are then compared by taking the cosine of the angle between their column vectors. In LSI, the similarity of two words depends on the number of times the words appear in each document, not on the meaning of the words with respect to their surrounding context words. In this experiment, we compare our method and LSI on the different datasets; the results are shown in Tables 5, 6, and 7, respectively.

Table 5 Performance comparisons between our method and LSI, with the Html_CityU1 dataset
Table 6 Performance comparisons between our method and LSI, with the PAN-PC-10 dataset
Table 7 Performance comparisons between our method and LSI, with the P4PIN dataset

4.5 Comparison with BM25

In our work, tf-idf is adopted for term weighting. One alternative is BM25, a ranking function used by search engines to estimate the relevance of documents to a given search query (Robertson and Zaragoza 2009). Given a query q containing keywords \(q_1\), \(q_2\), ..., \(q_n\), the BM25 score of a document d is:

$$\begin{aligned} {\text {Score}}(q,d) = \sum _{i=1}^{n} {\text {IDF}}(q_i){\times }R(q_i,d) \end{aligned}$$
(30)

where \({\text {IDF}}(q_i)\) is the idf weight of the query term \(q_i\) and is usually computed as

$$\begin{aligned} {\text {IDF}}(q_i) = \log {\frac{N-{\text {df}}(q_i)+0.5}{{\text {df}}(q_i)+0.5}} \end{aligned}$$
(31)

and \(R(q_i,d)\) indicates the relevance between \(q_i\) and d, defined as

$$\begin{aligned} R(q_i,d) = \frac{{\text {tf}}_d(q_i){\times }(k_i+1)}{{\text {tf}}_d(q_i)+k_i} \end{aligned}$$
(32)

in which \(k_i\) is a constant, usually set to 2. In this experiment, we use BM25 in place of tf-idf in Eq.(10) and Eq.(14). The comparison results are shown in Tables 8 and 9, respectively.
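A sketch of this simplified BM25 variant (Eqs. (30)-(32) carry no document-length normalization, so none appears here either):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, N, df, k=2.0):
    """BM25 score of Eqs. (30)-(32). `doc` is a token list, N the
    number of reference documents, and `df` maps a term to the number
    of reference documents containing it."""
    tf = Counter(doc)
    score = 0.0
    for q in query_terms:
        if tf[q] == 0:
            continue
        idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5))   # Eq. (31)
        r = tf[q] * (k + 1) / (tf[q] + k)                   # Eq. (32)
        score += idf * r
    return score
```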

Table 8 Performance comparisons between our method and BM25, with the PAN-PC-10 dataset
Table 9 Performance comparisons between our method and BM25, with the P4PIN dataset

4.6 Comparison with K-means

We choose spherical K-means rather than K-means for clustering. Spherical K-means applies cosine similarity, while K-means uses Euclidean distance, in the clustering process. Cosine measures the difference in direction between two vectors: the more semantically similar two words are, the more their word vectors point in the same direction, the larger the cosine between them, and the more likely the two words are to be grouped into the same cluster. Therefore, the semantic similarity between two words is better measured by the cosine, rather than the Euclidean distance, between their word vectors. Table 10 shows the performance comparison between K-means and spherical K-means in our method, under the condition of \(\theta = 0.9\) and \(K=200\). As can be seen, spherical K-means is much better in FDR.
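The relationship between the two criteria can be made precise: on unit-length vectors, squared Euclidean distance is a monotone function of cosine similarity,

$$\begin{aligned} \Vert {\mathbf {x}}-{\mathbf {y}}\Vert ^2 = \Vert {\mathbf {x}}\Vert ^2 + \Vert {\mathbf {y}}\Vert ^2 - 2\,{\mathbf {x}}\cdot {\mathbf {y}} = 2\,(1-{\text {SIM}}({\mathbf {x}},{\mathbf {y}})) \quad {\text {if }} \Vert {\mathbf {x}}\Vert =\Vert {\mathbf {y}}\Vert =1, \end{aligned}$$

so the difference between the two algorithms stems from plain K-means operating on unnormalized vectors, whose lengths carry no clear semantic meaning, and from its centroids not being renormalized.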

Table 10 Comparisons of performance on Html_CityU1 between K-means and spherical K-means

4.7 Testing of parameter values

Our method requires a number of parameters to be set, e.g., \(\theta \), K, \(\epsilon \), \(\lambda \), and H. We use a pre-existing set of embeddings from Google, in which H is fixed at 300. The parameter \(\lambda \) balances the document-level and paragraph-level distances; as in (Zhang and Chow 2011), we observed that the document-level distance is less critical and thus set \(\lambda = 0\). The parameter \(\epsilon \) controls whether plagiarism is declared between a pair of sentences, a smaller \(\epsilon \) making plagiarism more likely to be reported. The parameters \(\theta \) and K control the degree of dimensionality reduction performed by PCA and the number of concepts obtained by spherical K-means, respectively. Table 11 shows the performance of our method on the Html_CityU1 dataset with different values of \(\theta \) and K. From this table, we can see that the performance is better with \(\theta = 0.9\) than with \(\theta = 0.8\): as \(\theta \) increases, more information and detail are kept, making plagiarism more likely to be detected. Likewise, as K increases, more concepts are produced, allowing more detailed descriptions of words and helping to improve the detection of plagiarism.

Table 11 Comparisons of performance on Html_CityU1 with different values of \(\theta \) and K

5 Conclusion

Plagiarism detection is a challenging task, even for human beings, especially when the source texts are obfuscated. A plagiarized work may contain reused text that is intentionally hidden by rewording practices such as semantic equivalences, discursive changes, and morphological or lexical substitutions, and detection is even harder when plagiarism occurs across languages. As a result, detecting plagiarism reliably needs to rely on the detection of similar semantic concepts.

Existing methods may have difficulties with plagiarized documents because they do not handle the semantics of words satisfactorily. We have presented a method that enhances extrinsic plagiarism detection by using the computational semantics of words. We use Word2Vec to transform the words into word vectors that reveal the semantic relationships among different words; spherical K-means is applied to cluster the words into semantic concepts; and documents and their paragraphs are then represented in terms of the concepts. A two-phase matching strategy is developed: in the first phase, possible source documents of plagiarism are located, while in the second phase, the plagiarized parts are identified and shown to the user.

Our approach has limitations. The terms used are single words, yet phrases, e.g., United Nations, are more descriptive than single words for plagiarism detection. Also, the semantic meaning of sentences or documents remains unknown; for example, the negative sense of the sentence "there is nothing on the menu that a gourmet would like" may be improperly interpreted and represented. In such cases, our system could make unsatisfactory decisions and produce a high false positive rate, and we will perform an in-depth analysis of the failed cases. Furthermore, homonyms, or multiple-meaning words, are words with the same spelling but different meanings; we will try other word embeddings, e.g., FastText, GloVe, or BERT, to look into this issue and other issues associated with ambiguity and misalignments between paragraphs. Since many parameters are involved in our system, we will also study the sensitivity of each parameter with respect to detection performance. We will further make use of passage retrieval and field weightings to maintain a hierarchy of sections in order to increase the efficiency of the detection process. Finally, thorough detection against a huge number of reference documents will require parallel or distributed processing facilities to increase the scalability of our system.