
1 Introduction

Plagiarism detection over huge amounts of document data requires efficient methods. One approach to fast plagiarism detection is to find “similar” documents using statistics of word occurrences, such as the bag-of-words model [11]. Documents that are similar in this statistical sense can be found in a large dataset in practical time using suitable data structures, such as indices based on locality-sensitive hashing [9]. This kind of detection is expected to be effective against plagiarisms of ideas or of the rough structure of documents. The target of our study is plagiarisms of superficial descriptions, such as “copy and paste”, which can be detected more accurately using techniques based on string pattern matching [7].

A difficulty in applying string matching-based techniques to plagiarism detection for general documents lies in defining the similarity between words; we need to define a word similarity such that the document similarity based on it can be computed as fast as possible while keeping acceptable accuracy. The edit distance [17] and its weighted and local versions [15] are the basis of sequence alignment in bioinformatics [13], and the weight, that is, a kind of similarity between words, is often given as a substitution matrix [8] based on expert knowledge. There exist plagiarism detection methods based on the weighted edit distance [10, 16]. However, these methods compute the document similarity in O(mn) time for target documents of lengths m and n.

We proposed a plagiarism detection algorithm that runs in \(O(n\log {m})\) time with acceptable accuracy. The algorithm uses a document similarity based on the score vector [8] with a weight defined by a vector representation of words. For two documents, the ith element of the score vector is the number of matches between corresponding words when the documents are aligned with a gap of i between their start positions. The vector is computed in \(O(n\log {m})\) time using the convolution theorem [6] and a fast Fourier transform (FFT) [8]. We represented the weight for the score (that is, the similarity between two words) by the inner product of the vectors mapped from the words, and the document similarity based on this word similarity is also computed in \(O(n\log {m})\) time using the FFT-based computation.

The aim of our study is to clarify what kind of vector representation of words is suitable for plagiarism detection. In the experiments, we evaluated two types of vector representation of words for the proposed algorithm. One uses vectors generated randomly so that they approximate the match and mismatch of words with a small dimensionality; this idea corresponds to the randomization of the FFT-based algorithm for the score vector [4]. The other is a distributed representation generated by word2vec [12], a neural network-based method. We applied the plagiarism detection algorithm with these two vector representations, as well as a naive vector representation that corresponds to the score vector without weight, to the dataset of a plagiarism detection competition in PAN [14] in order to investigate the processing time and the accuracy of plagiarism detection.

This study also tried to find an application of the distributed representation of words, which is attracting attention as a key technology for statistical processing of document data. A distributed representation is a function that maps a word to a numerical vector of small dimensionality, such that the distance between vectors represents the similarity between the corresponding words. A simple distributed representation can be obtained by reducing the dimensionality of a straightforward vector representation based on word frequency [11]. Recent work on neural networks [12] has made it easy to obtain, from actual document data, a distributed representation that captures word similarity well, and a tool for generating such a representation is available on the Internet [3].

As a result of the evaluation, we obtained a tradeoff between the processing time and the accuracy of plagiarism detection that is controlled by the dimensionality of the vector representation of words. We found that the proposed algorithm based on the weighted score vector reduces the processing time drastically with only a slight decrease in accuracy compared with the algorithm based on the normal score vector, and that the randomized vector representation yields a better tradeoff than the distributed representation. For example, the proposed algorithm with the randomized vector representation reduced the processing time by about 90% compared with the algorithm using the normal score vector, with a decrease of only 1% in accuracy.

The rest of this paper is organized as follows. Section 2 introduces the plagiarism detection algorithm based on the weighted score vector and describes the experiments used to evaluate it. Section 3 reports the experimental results. Section 4 discusses the results and future directions of our study.

2 Methods

We proposed a plagiarism detection algorithm based on a document similarity, namely a weighted version of the score vector between documents. This section introduces the document similarity and the plagiarism detection algorithm, and describes the methods used to evaluate the proposed algorithm.

2.1 Preliminaries

Let W be a finite set of words and let \(x\notin W\) be the never-match word. Let \(\delta \) be the function from \((W\cup \{x\})\times (W\cup \{x\})\) to \(\{0,1\}\) such that \(\delta (v,w)\) is 1 if \(v,w\in W\) and \(v=w\), and 0 otherwise.

A document is a list of words. The length of a document is the size of the list. \(W^n\) for an integer \(n>0\) is the set of the documents of length n over W. For a document p of length n, \(p_i\) for \(1\le i\le n\) is the ith word of p. pq is the concatenation of documents p and q. \(w^n\) for a word w and an integer \(n>0\) is the document of n w’s.

2.2 Score Vector

The score vector between \(p\in W^m\) and \(q\in W^n\) is defined to be the \((m+n-1)\)-dimensional vector whose ith element is

$$\begin{aligned} \sum _{j=1}^{m}\delta (p_j,q^\prime _{i+j-1}), \end{aligned}$$
(1)

where \(q^\prime =x^{m-1}qx^{m-1}\).

Example 1

Let p and q be the documents “I have a pen I have an apple” and “I have a pineapple”, respectively. Then, \(q^\prime \) is “x x x x x x x I have a pineapple x x x x x x x” and the score vector between p and q is (0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0).
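As a minimal illustration (not part of the original formulation), the following Python sketch computes Eq. (1) directly in O(mn) time and reproduces the score vector of Example 1; the never-match word x is represented by None.

    def score_vector(p, q):
        """Direct O(mn) computation of Eq. (1); q is padded with m-1
        never-match words (None) on each side to form q'."""
        m, n = len(p), len(q)
        qp = [None] * (m - 1) + list(q) + [None] * (m - 1)
        return [sum(1 for j in range(m) if qp[i + j] is not None and p[j] == qp[i + j])
                for i in range(m + n - 1)]

    p = "I have a pen I have an apple".split()
    q = "I have a pineapple".split()
    print(score_vector(p, q))  # [0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0]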

Let \(\phi \) be a function from \(W\cup \{x\}\) to \(\mathbf{R}^d\) for the set \(\mathbf{R}\) of the real numbers. Then, the weighted score vector between \(p\in W^m\) and \(q\in W^n\) with \(\phi \) is defined to be the \((m+n-1)\)-dimensional vector whose ith element is

$$\begin{aligned} \sum _{j=1}^{m}{\langle \phi (p_j),\phi (q^\prime _{i+j-1})\rangle }. \end{aligned}$$
(2)

We call \(\phi \) a vector representation of words.

Example 2

Let \(\phi \) be a vector representation of words such that \(\langle \phi (v),\phi (w)\rangle \) is 1 if \(v,w\in W\) and \(v=w\), 0.9 if v and w are “an” and “a”, 0.5 if v and w are “apple” and “pineapple”, and 0 otherwise. Then, for the documents pq in Example 1, the weighted score vector between p and q with \(\phi \) is (0, 0, 0, 3.4, 0, 0, 0, 3, 0, 0, 0).

The processing time for computing the normal score vector between \(p\in W^m\) and \(q\in W^n\) is \(O(\vert W\vert n\log {m})\) using the algorithm for the match-count problem based on the convolution theorem and an FFT [8]. In practice, \(\vert W\vert \) can be reduced to the number of distinct words that occur in both documents. The weighted score vector with a vector representation of words of dimensionality d is computed in \(O(dn\log {m})\) time in the same way by the FFT-based algorithm. In other words, the alphabet size or the dimensionality of the vector representation equals the number of \(O(n\log {m})\) convolutions repeated in the algorithm, as sketched below.
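As an illustrative sketch of this FFT-based computation (assuming NumPy and SciPy; the dictionary phi and the function name are ours, not taken from the original implementation), the weighted score vector of Eq. (2) can be computed with one FFT-based correlation per coordinate of the vector representation:

    import numpy as np
    from scipy.signal import fftconvolve

    def weighted_score_vector(p, q, phi, d):
        """Weighted score vector (Eq. 2) via d FFT-based correlations,
        i.e., O(d n log m) time. phi maps a word to a length-d vector;
        words missing from phi and the padding word x act as zero vectors."""
        m, n = len(p), len(q)
        zero = np.zeros(d)
        q_pad = [None] * (m - 1) + list(q) + [None] * (m - 1)  # q' = x^{m-1} q x^{m-1}
        A = np.array([phi.get(w, zero) for w in p])                              # m x d
        B = np.array([zero if w is None else phi.get(w, zero) for w in q_pad])
        score = np.zeros(m + n - 1)
        for k in range(d):  # one correlation (FFT-based convolution) per coordinate
            score += fftconvolve(B[:, k], A[::-1, k], mode='valid')
        return score

With one-hot word vectors this should reproduce the normal score vector of Example 1 up to floating-point rounding.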

2.3 Vector Representation of Words

To implement a plagiarism detection algorithm, we defined the vector representation \(\phi \) of words in the following three ways:

  • Naive vector representation of words,

  • Randomized vector representation of words, and

  • Distributed representation of words.

The naive vector representation \(\phi _n\) is defined as follows. Let \(\varphi \) be a bijective function from \(W\cup \{x\}\) to \(\{0,1,\ldots , \vert W\vert \}\) with \(\varphi (x)=0\). Then, \(\phi _n\) is the function from \(W\cup \{x\}\) to \(\{0,1\}^{\vert W\vert }\) such that the ith element of \(\phi _n(w)\) for \(w\in W\cup \{x\}\) is 1 if \(i=\varphi (w)\), and 0 otherwise. It follows that \(\langle \phi _n(v),\phi _n(w)\rangle =\delta (v,w)\) for any \(v,w\in W\cup \{x\}\).

Example 3

W in Example 1 is regarded as \(\{\mathrm{I},\mathrm{have},\mathrm{a},\mathrm{pen},\mathrm{an},\mathrm{apple},\mathrm{pineapple}\}\). Then, an example of the naive vector representation is the function from \(W\cup \{x\}\) to \(\{0,1\}^7\) such that \(\phi _n(x)=(0,0,0,0,0,0,0)\), \(\phi _n(\mathrm{I})=(1,0,0,0,0,0,0)\), \(\phi _n(\mathrm{have})=(0,1,0,0,0,0,0)\), and so on. For computing the normal score vector, we can reduce W to \(\{\mathrm{I},\mathrm{have},\mathrm{a}\}\), and then \(\phi _n\) can be the function from \(W\cup \{x\}\) to \(\{0,1\}^3\) such that \(\phi _n(x)=(0,0,0)\), \(\phi _n(\mathrm{I})=(1,0,0)\), \(\phi _n(\mathrm{have})=(0,1,0)\), and \(\phi _n(\mathrm{a})=(0,0,1)\).
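A possible construction of this restricted naive representation, as a sketch (the function name is ours), which can be fed to the weighted score vector computation sketched in Subsect. 2.2:

    import numpy as np

    def naive_phi(p, q):
        """One-hot vectors over the words occurring in both documents.
        Words appearing in only one document can never match, so leaving
        them out (they then behave like the never-match word x) preserves
        the score vector."""
        common = sorted(set(p) & set(q))
        d = len(common)
        phi = {w: np.eye(d)[i] for i, w in enumerate(common)}
        return phi, d

For the documents of Example 1 this yields d = 3, with one-hot vectors for “I”, “have”, and “a”, as in the reduced representation above.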

The randomized vector representation \(\phi _r\) is defined to be the function from \(W\cup \{x\}\) to \(\{-1,0,1\}^d\) for an integer d such that \(\phi _r(x)\) is the d-dimensional zero vector, and \(\phi _r(w)\) for \(w\in W\) is a vector chosen randomly from \(\{-1,1\}^d\). Then, for any d and \(v,w\in W\), \(\langle \phi _r(v),\phi _r(w)\rangle \) is d if \(v=w\), and its expectation is 0 otherwise. The idea of this vector representation corresponds to the randomization of the FFT-based algorithm for the score vector proposed by Atallah et al. [4], and we used the function with integers proposed by Baba et al. [5].

Example 4

In the case of Example 3, an example of \(\phi _r\) with \(d=4\) is the function such that \(\phi _r(x)=(0,0,0,0)\), \(\phi _r(\mathrm{I})=(1,1,1,1)\), \(\phi _r(\mathrm{have})=(1,1,-1,-1)\), \(\phi _r(\mathrm{a})=(1,-1,1,-1)\), and so on.
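A minimal sketch of generating \(\phi _r\) with NumPy (the seed and function name are ours); the vectors in Example 4 are just one possible draw:

    import numpy as np

    def randomized_phi(vocabulary, d, seed=0):
        """Random vectors from {-1, 1}^d for each word; the inner product
        is exactly d for identical words and has expectation 0 otherwise.
        The never-match word x is handled separately as the zero vector."""
        rng = np.random.default_rng(seed)
        return {w: rng.choice([-1.0, 1.0], size=d) for w in vocabulary}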

The distributed representation \(\phi _d\) was implemented using word2vec [12]. We set the dimensionality d in the available tool and normalized the output vectors. Therefore, \(\phi _d\) is a function from \(W\cup \{x\}\) to \([-1,1]^d\) such that \(\langle \phi _d(v),\phi _d(w)\rangle \) for \(v,w\in W\cup \{x\}\) is 1 if \(v,w\in W\) and \(v=w\), 0 if \(v=x\) or \(w=x\), and a value in \([-1,1)\) otherwise.
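For reference, a sketch of how \(\phi _d\) could be produced; here we use the gensim implementation of word2vec as an assumption on our part, whereas the experiments used the original tool [3], and we L2-normalize the vectors so that \(\langle \phi _d(w),\phi _d(w)\rangle =1\):

    import numpy as np
    from gensim.models import Word2Vec  # assumption: gensim 4.x stands in for the tool of [3]

    def distributed_phi(tokenized_documents, d):
        """Train word2vec with dimensionality d and L2-normalize the output vectors."""
        model = Word2Vec(sentences=tokenized_documents, vector_size=d, min_count=1)
        return {w: model.wv[w] / np.linalg.norm(model.wv[w])
                for w in model.wv.key_to_index}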

2.4 Plagiarism Detection Algorithm

Plagiarism detection is, for a pair of documents, the task of predicting “positive” (that is, one document contains a plagiarism from the other) or “negative”.

The plagiarism detection algorithm in this paper proceeds as follows for two input documents:

  1. Calculate the weighted score vector between the documents with a vector representation of words, and

  2. Predict positive or negative using the obtained vector and a threshold.

In Process 1, we used the three vector representations defined in Subsect. 2.3. In Process 2, we determined the threshold from training data by applying a support vector machine with a linear kernel to pairs of the peak value of the obtained vector and the length of the shorter document, where the peak value of a vector v is the minimal element of its second difference \(v^{\prime \prime }\), with \(v^\prime _i=v_{i+1}-v_i\) for \(1\le i<\vert v\vert \) and \(v^{\prime \prime }_i=v^\prime _{i+1}-v^\prime _i\) for \(1\le i<\vert v\vert -1\).

Example 5

The peak value of the weighted score vector in Example 2 is \(-6.8\). In the proposed algorithm, the support vector machine is applied to the pair of the peak value \(-6.8\) and the length 4 of the shorter document.
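The following sketch illustrates Process 2 with NumPy and scikit-learn (an assumption on our part; the paper specifies only a linear-kernel support vector machine), and reproduces the peak value of Example 5:

    import numpy as np
    from sklearn.svm import SVC

    def peak_value(score):
        """Minimal element of the second difference v'' of the score vector."""
        return float(np.diff(np.asarray(score, dtype=float), n=2).min())

    print(peak_value([0, 0, 0, 3.4, 0, 0, 0, 3, 0, 0, 0]))  # -6.8 (Example 5)

    # Prediction: a linear SVM on (peak value, length of the shorter document) pairs.
    # X_train and y_train are assumed to be built from the labelled training pairs.
    # clf = SVC(kernel='linear').fit(X_train, y_train)
    # clf.predict([[-6.8, 4]])  # 4 = length of the shorter document in Example 5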

The processing time of the proposed algorithm is mainly due to the \(O(n\log {m})\) computation for the (weighted) score vector. Additionally, we need an \(O(m+n)\) computation for the detection of the peak value in the computed score vector.

2.5 Experiments

We applied the plagiarism detection algorithm defined in Subsect. 2.4 to a dataset to measure the accuracy for the three vector representations of words defined in Subsect. 2.3.

We used a dataset of a plagiarism detection competition in PAN 2013 [14], which is available on the Internet [1]. The dataset contains pairs of documents with a plagiarism of the “copy and paste” type (positive pairs) and pairs with no plagiarism (negative pairs). We picked 2,000 positive pairs and 2,000 negative pairs, and then divided the data equally into training and test data for validation of the algorithms. The average length of the documents was 1,432. We used the training data both for training word2vec to generate \(\phi _d\) and for fitting the support vector machine used in the prediction.

The accuracy of a plagiarism detection algorithm is defined to be the ratio of the number of correct predictions to the total number of predictions. The processing time of the algorithms is proportional to the dimensionality d of the vector representation of words, while the accuracy of the proposed algorithm is expected to improve for larger d. Therefore, this experiment clarifies the relation between the processing time and the accuracy of the proposed algorithm.

We also applied the algorithm to other data in the competition that include plagiarisms with several kinds of obfuscation. We had expected that the accuracy would be improved by

  • Using the weighted score vector generated by the distributed representation instead of the normal score vector,

  • Increasing the dimensionality of the distributed representation, and

  • Using larger training data for generating the distributed representation instead of the given data.

The new dataset for plagiarism detection contains 8,370 positive pairs and 2,000 negative pairs. We generated three types of distributed representation of words: two trained on the given PAN data with dimensionalities 100 and 200, and one of dimensionality 200 trained on an archived Wikipedia dataset [2]. The size of the extra Wikipedia data was 13.1 GB, while that of the PAN training data was 30 MB.

3 Results

Figure 1 shows the accuracy of the proposed algorithm for the three vector representations of words, that is, the naive, the randomized, and the distributed one. The dimensionality of the naive vector representation is fixed to the “restricted” alphabet (vocabulary) size. The alphabet size of the document data and the average number of distinct words occurring in both input documents were 143,600 and 96, respectively; the result for this vector representation is therefore the single point with accuracy 1 at dimensionality 96. For the other vector representations, the graph shows the accuracy against the dimensionality, where the accuracy is assumed to be 0.5 at dimensionality 0 and to be 1 at any dimensionality larger than 100.

Fig. 1. Accuracy of the proposed algorithm against the dimensionality of vectors for the three types of vector representation of words.

Figure 2 shows the relation between the processing time of the algorithm and the dimensionality of the vector representation of words. The results were obtained with the randomized vector representation; the computation with the other types of vector representation behaves identically because, by the definitions in Subsect. 2.3, only the values of the vectors differ. The processing time is estimated to be proportional to the dimensionality with an overhead of about 10 ms, which agrees with the analysis in Subsect. 2.4. Therefore, the dimensionality of the vector representation can be used as a proxy for the processing time of the algorithm.

Fig. 2. Processing time of the proposed algorithm with the randomized vector representation of words against the dimensionality of vectors.

These results show that the proposed algorithm with the randomized vector representation or the distributed representation can reduce the processing time drastically with only a slight decrease in accuracy compared with the algorithm using the naive vector representation, and that the tradeoff between the processing time and the accuracy obtained with the randomized vector representation is better than that obtained with the distributed representation.

There was no significant improvement from using the distributed representation on the dataset of plagiarisms with obfuscations. Table 1 shows the results, which were contrary to our expectations stated in Subsect. 2.5.

Table 1. Accuracy of the proposed algorithm with the distributed representation of words for the dataset that includes plagiarisms with obfuscations.

4 Discussion

From the experimental results in Sect. 3, we conclude that the processing time for detecting plagiarisms in documents can be reduced drastically by using the weighted score vector as the document similarity, although the accuracy decreases slightly. For example, with the randomized vector representation of dimensionality 4, the processing time and the accuracy of the proposed algorithm are about 10% and 99%, respectively, of those with the normal score vector, which is a likely setting in actual applications of plagiarism detection.

Additionally, we found that the randomized vector representation is more suitable for our plagiarism detection algorithm than the distributed representation generated by word2vec. The weighted score vector with the distributed representation contained noise introduced by assigning nonzero scores to mismatched words. We suppose that, against plagiarisms of superficial descriptions, correctly identifying completely different words is more effective than identifying similar words.

One direction for future work is to investigate the applicability of our idea, that is, combining string matching-based techniques with vector representations learned statistically from data, to other document-processing tasks. In this study, we obtained a tradeoff between the processing time and the accuracy of plagiarism detection, which only approximates the process based on simple string matching. Indeed, using the distributed representation generated by word2vec was not effective against plagiarisms with obfuscations. We expect that vector representations of words may clearly improve the accuracy in other tasks that involve the semantics, in addition to the syntax, of documents.

5 Conclusion

In this paper, we evaluated the validity of using vector representations of words for defining a document similarity. We proposed a plagiarism detection algorithm that uses a document similarity based on the score vector with a weight defined by a vector representation of words. We measured the processing time and the plagiarism detection accuracy of the proposed algorithm with three types of vector representation of words. The results show that the proposed algorithm based on the weighted score vector can detect plagiarisms in a much shorter time, with slightly lower accuracy, than the algorithm based on the normal score vector. Additionally, we found that the randomized vector representation is more suitable for the plagiarism detection algorithm than the distributed representation. As a concrete example, the proposed algorithm with the randomized vector representation reduces the processing time by about 90% compared with the algorithm using the normal score vector, with a decrease of only 1% in accuracy.