A coarse-to-fine framework to efficiently thwart plagiarism

doi:10.1016/j.patcog.2010.08.023

Pattern Recognition

Volume 44, Issue 2, February 2011, Pages 471-487

https://doi.org/10.1016/j.patcog.2010.08.023

Abstract

This paper presents a systematic framework that uses a multilevel matching approach for plagiarism detection (PD). A multilevel structure, i.e. document–paragraph–sentence, is used to represent each document. At the document and paragraph levels, we use a traditional dimensionality reduction technique to project high-dimensional histograms into a latent semantic space. The Earth Mover’s Distance (EMD), rather than exhaustive matching, is employed to retrieve relevant documents, which enables us to markedly shrink the search domain. Two PD algorithms are designed and implemented to efficiently flag the suspected sources of a plagiarized document. We conduct extensive experimental verification, including document retrieval, PD, a study of the effects of parameters, and an empirical study of the system response. The results corroborate that the proposed approach is accurate and computationally efficient for performing PD.

Introduction

The Internet has undoubtedly become an indispensable part of daily life, used for everything from restaurant booking to technology research. This online culture, however, poses a severe challenge to textual intellectual property, because the Internet and computer technology have made it easy to disseminate knowledge across the world. People can search, copy, save, and reuse online sources with ease. The most flagrant form of plagiarism is copying a document from another source without any modification, but this type of plagiarism is easy to identify using a plagiarism detection (PD) system. Less obvious cases occur when people integrate existing work into their own. They attempt to bypass the detection system by substituting words or sentences within an existing document, or by pasting phrases from an outside source into a new document. Detecting cut-and-paste plagiarism has become a growing concern in the education system. One difficulty in detecting plagiarism efficiently is finding the source with a fast query response, because people may copy from any one of millions of documents on the Internet, each of which typically contains thousands of words.

Existing techniques for anti-plagiarism include fingerprinting, a method developed specifically for detecting co-derivatives, and ranking, a method developed for document retrieval. Hoad and Zobel [1] investigated the performance of these techniques and demonstrated that the ranking method is superior to the fingerprinting method. Chow et al. [2] also reported promising results by using the ranking approach. Following this line, this paper presents a coarse-to-fine framework to detect plagiarism using multilevel matching (MLM). The proposed approach delivers a number of desirable features that include generality, robustness, and efficiency. Concretely, these features can be described as follows:

• Generality refers to the multilevel-structured document representation and its encoding features. We use a document–paragraph–sentence structure to form a coarse-to-fine representation of each document. At the document and paragraph levels, principal component analysis (PCA), a traditional dimensionality reduction tool, is used to capture the latent semantic topics. Any other latent semantic analysis or dimensionality reduction technique can be substituted for PCA in this scheme.
• The proposed system is robust because it uses signature matching. The signature at the document and paragraph levels is constructed from the length and the term histograms of each component. Each sentence is represented by the index numbers of its terms, which indicate the presence of the corresponding terms in the vocabulary. This signature encoding ignores the order of terms in a sentence, which is reasonable because plagiarists tend to substitute words or reorganize the sentence structure in order to bypass the PD system.
• Document modeling and its applications are notoriously computationally intensive because documents involve at least thousands of words. The proposed system performs depth matching with a coarse-to-fine strategy that filters out unpromising regions of the search domain. This pruning capability yields substantial computational savings, so the proposed approach can be applied to large datasets and practical online applications.
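
The histogram and signature encodings described in these bullets can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy vocabulary, the helper names (`tokenize`, `term_histogram`, `sentence_signature`, `sentence_overlap`), and the Jaccard-style overlap score are assumptions introduced for the example, and the PCA projection step is omitted.

```python
from collections import Counter

# Hypothetical toy vocabulary; a real system would build it from a corpus.
VOCAB = ["plagiarism", "detection", "document", "retrieval", "sentence", "paragraph"]
TERM_INDEX = {term: i for i, term in enumerate(VOCAB)}


def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    return [tok.strip(".,;:!?()\"'").lower() for tok in text.split()]


def term_histogram(text):
    """Term-frequency histogram over VOCAB (document or paragraph level)."""
    counts = Counter(tok for tok in tokenize(text) if tok in TERM_INDEX)
    return [counts.get(term, 0) for term in VOCAB]


def sentence_signature(sentence):
    """Set of vocabulary indices present in the sentence; word order is ignored."""
    return frozenset(TERM_INDEX[tok] for tok in tokenize(sentence) if tok in TERM_INDEX)


def sentence_overlap(sig_a, sig_b):
    """Jaccard overlap of two signatures (an assumed similarity, not the paper's)."""
    union = sig_a | sig_b
    return len(sig_a & sig_b) / len(union) if union else 0.0


original = "Plagiarism detection needs document retrieval."
reordered = "Document retrieval is needed for detection of plagiarism."

print(term_histogram(original))
print(sentence_overlap(sentence_signature(original), sentence_signature(reordered)))
```

Because the sentence signature is a set of term indices, the reordered sentence in the example yields the same signature as the original, which is the property the second bullet relies on.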

The main contributions of this paper are threefold. First, we propose a multilevel-structured document representation together with its encoding features. Second, we investigate MLM approaches, i.e. histogram based MLM (MLMH) and signature based MLM (MLMS), for relevant document retrieval (DR). Third, two detection algorithms are implemented by setting appropriate thresholds such that undesirable paths are pruned in advance during the multilevel matching process.
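
To make the idea of threshold-based pruning during multilevel matching concrete, here is a small sketch under stated assumptions. The nested dictionary layout, the function names, the normalized L1 histogram distance, the Jaccard sentence score, and all threshold values are hypothetical choices made for illustration; the paper itself works with PCA-projected features and EMD-based retrieval.

```python
def l1_distance(h1, h2):
    """L1 distance between two histograms after normalizing each to sum to 1."""
    s1, s2 = sum(h1) or 1, sum(h2) or 1
    return sum(abs(a / s1 - b / s2) for a, b in zip(h1, h2))


def jaccard(a, b):
    """Jaccard similarity between two sets of term indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def detect(query, corpus, doc_thr=0.8, par_thr=0.6, sent_thr=0.7):
    """Return (doc_id, query_par, source_par, matched_sentences) for surviving paths."""
    hits = []
    for doc_id, doc in corpus.items():
        # Coarse level: discard the whole document if it is too far away.
        if l1_distance(query["hist"], doc["hist"]) > doc_thr:
            continue
        for qi, qpar in enumerate(query["paragraphs"]):
            for si, spar in enumerate(doc["paragraphs"]):
                # Middle level: discard paragraph pairs that are too far apart.
                if l1_distance(qpar["hist"], spar["hist"]) > par_thr:
                    continue
                # Fine level: count query sentences with a close source sentence.
                matched = sum(
                    1
                    for qs in qpar["sentences"]
                    if any(jaccard(qs, ss) >= sent_thr for ss in spar["sentences"])
                )
                if matched:
                    hits.append((doc_id, qi, si, matched))
    return hits


query = {
    "hist": [2, 2, 0, 0],
    "paragraphs": [
        {"hist": [2, 2, 0, 0], "sentences": [frozenset({0, 1}), frozenset({0, 1, 2})]}
    ],
}
corpus = {
    "src": {
        "hist": [3, 2, 1, 0],
        "paragraphs": [
            {"hist": [3, 2, 1, 0], "sentences": [frozenset({0, 1}), frozenset({2})]}
        ],
    },
    "other": {
        "hist": [0, 0, 1, 5],
        "paragraphs": [{"hist": [0, 0, 1, 5], "sentences": [frozenset({3})]}],
    },
}

print(detect(query, corpus))
```

In this toy run the document `"other"` is rejected at the document level, so none of its paragraphs or sentences are ever compared; that early exit is where the computational saving of a coarse-to-fine scheme comes from.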

The remaining sections of this paper are organized as follows. A brief overview of document modeling and its applications is presented in Section 2, where the relationships of PD to document categorization (or classification and/or clustering) and to DR are also discussed. Section 3 introduces a multilevel-structured document representation together with document segmentation, dimensionality reduction, and feature encoding. Document segmentation is done using HTML tags. In Section 4, we discuss various document retrieval approaches based on histograms and signatures. Two detection algorithms are implemented in Section 5. We conduct extensive experimental verification in Section 6. Section 7 discusses the observed results and proposes a system framework from a practical viewpoint. Finally, Section 8 concludes the paper and suggests directions for future work.

Section snippets

Related work

This section briefly reviews previous work, as partially covered by Chow et al. [2] and our recent work [3], [4]. It covers document modeling and its applications (e.g. categorization, retrieval, and PD). We also discuss the relationship between PD and other applications.

Document representation

This section covers document preprocessing and the overall feature extraction procedure. It includes the detailed steps for partitioning an HTML-format document into paragraphs and further partitioning each paragraph into sentences (see Section 3.1), building vocabularies of two different sizes (see Section 3.2), and constructing the multilevel representation (see Section 3.3).
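
As a rough illustration of this partitioning step, the sketch below splits an HTML document into paragraphs and then into sentences using only the Python standard library. The choice of `<p>` as the paragraph delimiter, the naive punctuation-based sentence splitter, and the names `ParagraphExtractor`, `split_sentences`, and `segment` are assumptions for this example; the snippet above does not say which tags or rules the authors actually use.

```python
import re
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None means "not currently inside a <p>"

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()  # an unclosed previous <p> ends here
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def flush(self):
        """Store the buffered paragraph text with whitespace normalized."""
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
        self._buffer = None


def split_sentences(paragraph):
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]


def segment(html):
    """Return a list of paragraphs, each a list of sentence strings."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # handle a final paragraph that was never closed
    return [split_sentences(p) for p in parser.paragraphs]


html = (
    "<html><body><p>First sentence. Second one!</p>"
    "<p>Another <b>paragraph</b> here.</p></body></html>"
)
print(segment(html))
```

The nested list returned by `segment` mirrors the document–paragraph–sentence structure: the outer list is the document, each inner list is a paragraph, and each string is a sentence.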

Document retrieval

In this section, we present DR approaches that differ from most existing models in that they incorporate the local information of a document into the retrieval process by taking advantage of the multilevel representation. They also pave the way for the subsequent PD. Current document modeling methods (e.g. VSM [5], LSI [6], PLSI [7], LDA [8], EFH [9], and RAP [10]) consider only the global information of a document (i.e. term frequency). Two documents, however, containing similar

Plagiarism detection

After retrieving N_ret documents, which are regarded as the suspected plagiarized sources, we are now in a position to develop the PD algorithms. A straightforward approach is to sort the N_ret documents again in ascending order by further matching sentences and applying distance fusion techniques, and eventually return a short list to the users. This method is called ranking based PD (see Section 5.1). Another way to implement PD is to set an offset value to make the binary decision on the presence of

Experiments

In this section, we conduct detailed experiments to verify the efficiency of the proposed PD approach. The section covers the dataset description (see Section 6.1), the performance of relevant DR (see Section 6.2), the performance of PD (see Section 6.3), a study of the impact of parameters (see Section 6.4), and an empirical study of computational time (see Section 6.5).

Discussion and extension

Developing an efficient PD system is currently very demanding work because plagiarism occurs easily in the information age. Although we conducted our experiments on a simulation platform, many interesting results can be observed:

• The use of local information from sections or paragraphs significantly enhances the performance of DR because it exploits the spatial distribution of words.
• The histogram based DR approach with an appropriate distance fusion performs better in DR than in PD.

Conclusion

A coarse-to-fine framework to efficiently thwart plagiarism is proposed in this study. Each document is represented by a multilevel structure, i.e. document–paragraph–sentence. Different signatures are constructed to represent components at different levels. DR approaches that add, or rely solely on, local information to exploit the rich semantics of documents are introduced to retrieve the suspected sources. Two PD algorithms based on further sentence matching are designed and implemented to

Acknowledgments

The authors would like to thank the anonymous reviewers for their helpful comments, which improved the standard of this paper.

Haijun Zhang received his B.Eng. degree from the Department of Civil Engineering and his Master's degree from the Department of Control Theory and Engineering at Northeastern University, Shenyang, PR China, in 2004 and 2007, respectively. He worked as a research assistant at the City University of Hong Kong from April to September 2007. He is currently working towards his Ph.D. degree at the City University of Hong Kong, Hong Kong. His research interests are evolutionary computation, structure

References (45)

Tommy W.S. Chow et al.
A new document representation using term frequency and vectorized graph connectionists with application to document retrieval
Expert Systems with Applications
(2009)
Haijun Zhang et al.
A new dual wing harmonium model for document retrieval
Pattern Recognition
(2009)
A. Georgakis et al.
Marginal median SOM for document organization and retrieval
Neural Networks
(2004)
M.K.M. Rahman et al.
A flexible multi-layer self-organizing map for generic processing of tree-structured data
Pattern Recognition
(2007)
N. Rooney et al.
A scalable document clustering approach for large document corpora
Information Processing and Management
(2006)
M. Fuketa et al.
A document classification method by using field association words
Information Sciences
(2000)
C.M. Tan et al.
The use of bigrams to enhance text categorization
Information Processing and Management
(2002)
A. Selamat et al.
Web page feature selection and classification using neural networks
Information Sciences
(2004)
S. Singh et al.
A new customized document categorization scheme using rough membership
Applied Soft Computing
(2005)
X. Cui et al.
A flocking based algorithm for document clustering analysis
Journal of Systems Architecture
(2006)

S.L. Bang et al.
Hierarchical document categorization with k-NN and concept-based thesauri
Information Processing and Management
(2006)
Francesc Serratosa et al.
Signatures versus histograms: definitions, distances and algorithms
Pattern Recognition
(2006)
X. Gao et al.
Image categorization: graph edit distance+edge direction histogram
Pattern Recognition
(2008)
T.C. Hoad et al.
Methods for identifying versioned and plagiarized documents
Journal of the American Society for Information Science and Technology
(2003)
Tommy W.S. Chow et al.
Multi-layer SOM with tree structured data for efficient document retrieval and plagiarism detection
IEEE Transactions on Neural Networks
(2009)
S. Deerwester et al.
Indexing by latent semantic analysis
Journal of the American Society of Information Science
(1990)
T. Hofmann, Probabilistic latent semantic indexing, in: Proceedings of the Twenty-Second Annual International SIGIR...
D. Blei et al.
Latent Dirichlet allocation
Journal of Machine Learning Research
(2003)
M. Welling, M. Rosen-Zvi, G. Hinton, Exponential family harmoniums with an application to information retrieval. In:...
P. Gehler, A. Holub, M. Welling, The rate adapting Poisson model for information retrieval and object recognition, in:...
R.B. Yates et al.
Modern Information Retrieval
(1999)


Tommy W.S. Chow (IEEE M’93–SM’03) received his B.Sc. (First Hons.) and Ph.D. degrees from the University of Sunderland, Sunderland, U.K. He joined the City University of Hong Kong, Hong Kong, as a Lecturer in 1988. He is currently a Professor in the Electronic Engineering Department. His research interests include machine fault diagnosis, HOS analysis, system identification, and neural network learning algorithms and applications.
