Elsevier

Pattern Recognition

Volume 44, Issue 2, February 2011, Pages 471-487
A coarse-to-fine framework to efficiently thwart plagiarism

https://doi.org/10.1016/j.patcog.2010.08.023

Abstract

This paper presents a systematic framework that uses a multilevel matching approach for plagiarism detection (PD). A multilevel structure, i.e. document–paragraph–sentence, is used to represent each document. At the document and paragraph levels, we use a traditional dimensionality reduction technique to project high-dimensional histograms into a latent semantic space. The Earth Mover’s Distance (EMD), instead of exhaustive matching, is employed to retrieve relevant documents, which enables us to markedly shrink the search domain. Two PD algorithms are designed and implemented to efficiently flag the suspected plagiarized document sources. We conduct extensive experimental verifications, including document retrieval, PD, a study of the effects of parameters, and an empirical study of the system response. The results corroborate that the proposed approach is accurate and computationally efficient for performing PD.

Introduction

The Internet has undoubtedly become an indispensable component of our daily life, with uses ranging from restaurant booking to technology research. This online trend, however, poses a severe challenge to textual intellectual property, because the Internet and computer technology have made it easy to disseminate knowledge across the world. People can search, copy, save, and reuse online sources with ease. The most flagrant instance of plagiarism is copying a document from another source without any modification. This type of plagiarism, however, is easy to identify with a plagiarism detection (PD) system. Less obvious examples occur when people integrate an existing work into their own. They attempt to bypass the detection system by substituting words or sentences within an existing document, or by pasting phrases from an outside source into a new document. Cut-and-paste plagiarism has become a growing concern in the education system. One of the difficulties of detecting plagiarism efficiently is searching for the source with a fast query response, because people may copy from any one of millions of documents on the Internet, each of which usually contains thousands of words.

Existing techniques for anti-plagiarism include fingerprinting, a method developed specifically for detecting co-derivatives, and ranking, a method developed for document retrieval. Hoad and Zobel [1] investigated the performance of these techniques and demonstrated that the ranking method is superior to the fingerprinting method. Chow et al. [2] also reported promising results by using the ranking approach. Following this line, this paper presents a coarse-to-fine framework to detect plagiarism using multilevel matching (MLM). The proposed approach delivers a number of desirable features that include generality, robustness, and efficiency. Concretely, these features can be described as follows:

  • The generality refers to the multilevel-structured document representation and its encoding features. We use a document–paragraph–sentence structure to form a coarse-to-fine representation of each document. At the document and paragraph levels, principal component analysis (PCA), a traditional dimensionality reduction tool, is used to capture the hidden latent semantic topics. Any other latent semantic analysis or dimensionality reduction technique can be incorporated into this scheme in place of PCA.

  • The proposed system is robust due to its use of signature matching. The signature at the document and paragraph levels is constructed from the length and the term histograms of each component. Each sentence is represented by the index numbers of its terms, which indicate the presence of the corresponding terms in the vocabulary. This signature encoding does not consider the order of terms in a sentence, which is reasonable because plagiarists strive to substitute words in each sentence or reorganize the sentence structure so as to bypass the PD system.

  • Document modeling and its applications are notoriously computationally intensive because each document involves at least thousands of words. Our proposed system is based on depth matching, using a coarse-to-fine strategy to filter out the unpromising parts of the search domain. This pruning capability yields a large gain in computational efficiency. Therefore, the proposed approach can be used for large datasets and practical online applications.
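The multilevel representation described in the list above can be illustrated with a minimal Python sketch. It is not the implementation from the paper: the class names, the tokenizer, and the toy vocabulary are assumptions made for illustration, and the PCA projection of the histograms is omitted. It only shows term histograms with lengths at the document and paragraph levels, and order-insensitive sets of vocabulary indices at the sentence level.

```python
import re
from collections import Counter
from dataclasses import dataclass

TOKEN = re.compile(r"[a-z]+")  # assumed tokenizer: lowercase alphabetic runs


def tokenize(text):
    return TOKEN.findall(text.lower())


@dataclass
class Paragraph:
    histogram: Counter  # vocabulary index -> count
    length: int         # number of in-vocabulary terms
    sentences: list     # one frozenset of vocabulary indices per sentence


@dataclass
class Document:
    histogram: Counter
    length: int
    paragraphs: list


def build_document(paragraph_texts, vocabulary):
    """paragraph_texts: list of paragraphs, each a list of sentence strings."""
    index = {term: i for i, term in enumerate(vocabulary)}
    doc_hist = Counter()
    paragraphs = []
    for sentences in paragraph_texts:
        para_hist = Counter()
        signatures = []
        for sentence in sentences:
            ids = [index[t] for t in tokenize(sentence) if t in index]
            para_hist.update(ids)
            # Term order is discarded: only presence in the vocabulary is kept.
            signatures.append(frozenset(ids))
        doc_hist.update(para_hist)
        paragraphs.append(Paragraph(para_hist, sum(para_hist.values()), signatures))
    return Document(doc_hist, sum(doc_hist.values()), paragraphs)


# Hypothetical toy vocabulary and two sentences that differ only in word order.
vocabulary = ["plagiarism", "detection", "document", "retrieval", "copy"]
original = build_document(
    [["Plagiarism detection needs document retrieval."]], vocabulary)
reordered = build_document(
    [["Document retrieval is needed by plagiarism detection."]], vocabulary)
```

In this toy case the two sentences yield the same sentence signature even though the words are reordered, which is the property the second bullet relies on.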

The main contributions of this paper are threefold. First, we propose a multilevel-structured document representation together with its encoding features. Second, we investigate MLM approaches, i.e. histogram-based MLM (MLMH) and signature-based MLM (MLMS), for relevant document retrieval (DR). Third, two detection algorithms are implemented by setting appropriate thresholds such that undesirable paths are pruned in advance during the multilevel matching process.

The remaining sections of this paper are organized as follows. Section 2 presents a brief overview of document modeling and its applications, and discusses the relationship of PD to document categorization (or classification and/or clustering) and to DR. Section 3 introduces a multilevel-structured document representation together with the document segmentation, dimensionality reduction, and feature encoding. Document segmentation is done using HTML tags. In Section 4, we discuss various document retrieval approaches based on histograms and signatures. Two detection algorithms are implemented in Section 5. We conduct extensive experimental verifications in Section 6. Section 7 discusses the observed results and proposes a system framework from a practical viewpoint. Finally, Section 8 concludes the paper and proposes future work.

Related work

This section briefly reviews previous work, as partially covered by Chow et al. [2] and our recent work [3], [4]. It covers document modeling and its applications (e.g. categorization, retrieval, and PD). We also discuss the relationship between PD and other applications.

Document representation

This section covers document preprocessing and the overall feature extraction procedure. It includes the detailed steps for partitioning an HTML document into paragraphs and further partitioning each paragraph into sentences (see Section 3.1), building two vocabularies of different sizes (see Section 3.2), and constructing the multilevel representation (see Section 3.3).
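The partitioning step can be sketched in Python as follows. The snippet above does not state which HTML tags are used or how sentences are delimited, so this sketch assumes that `<p>` elements mark paragraphs and that sentences end at `.`, `!` or `?` followed by whitespace. The class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text of each <p> element (assumed paragraph delimiter)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()  # tolerate a previous <p> that was never closed
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def flush(self):
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())  # normalize whitespace
            if text:
                self.paragraphs.append(text)
        self._buffer = None


# Assumed sentence boundary: terminal punctuation followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def segment(html):
    """Return a list of paragraphs, each a list of sentence strings."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return [SENTENCE_END.split(p) for p in parser.paragraphs]


page = ("<html><body><h1>Title</h1>"
        "<p>First sentence. Second <b>bold</b> one!</p>"
        "<p>Another paragraph?</p></body></html>")
segments = segment(page)
```

Text outside `<p>` elements, such as the heading in the sample page, is ignored by this sketch; a real pipeline would need a policy for headings, lists, and nested markup.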

Document retrieval

In this section, we present DR approaches that differ from most existing models, because our proposed methods add local information about a document to the retrieval process by taking advantage of the multilevel representation. They also pave the way for the subsequent PD. Current document modeling methods (e.g. VSM [5], LSI [6], PLSI [7], LDA [8], EFH [9], and RAP [10]) only consider the global information of a document (i.e. term frequency). Two documents, however, containing similar

Plagiarism detection

After retrieving Nret documents, which are regarded as the suspected plagiarized sources, we are now in a position to develop the PD algorithms. It is straightforward to sort the Nret documents again in ascending order by further matching sentences and using distance fusion techniques, and eventually return a short list to the users. This method is called ranking-based PD (see Section 5.1). Another way to implement PD is to set an offset value to make the binary decision on the presence of

Experiments

In this section, we conduct detailed experiments to verify the efficiency of our proposed PD approach. This section covers the dataset description (see Section 6.1), the performance of relevant DR (see Section 6.2), the performance of PD (see Section 6.3), a study of the impact of parameters (see Section 6.4), and an empirical study of computational time (see Section 6.5).

Discussion and extension

Developing an efficient PD system is currently very demanding work, because plagiarism occurs easily in the information age. Although we conducted the experiments on a simulation platform, several interesting results can be observed:

  • The usage of local information from sections or paragraphs significantly enhances the performance of the DR because it explores the spatial distributions of words.

  • The histogram-based DR approach with an appropriate distance fusion performs better in DR than in PD.

Conclusion

A coarse-to-fine framework to efficiently thwart plagiarism is proposed in this study. Each document is represented by a multilevel structure, i.e. document–paragraph–sentence. Different signatures are constructed to represent components at different levels. Relevant DR approaches that add, or rely solely on, local information to explore rich semantics from documents are introduced to retrieve the suspected sources. Two PD algorithms by further sentence matching are designed and implemented to

Acknowledgments

The authors would like to thank the anonymous reviewers for their helpful comments, which improved the standard of this paper.

References (45)

  • S.L. Bang et al., Hierarchical document categorization with k-NN and concept-based thesauri, Information Processing and Management (2006)
  • Francesc Serratosa et al., Signatures versus histograms: definitions, distances and algorithms, Pattern Recognition (2006)
  • X. Gao et al., Image categorization: graph edit distance+edge direction histogram, Pattern Recognition (2008)
  • T.C. Hoad et al., Methods for identifying versioned and plagiarized documents, Journal of the American Society for Information Science and Technology (2003)
  • Tommy W.S. Chow et al., Multi-layer SOM with tree structured data for efficient document retrieval and plagiarism detection, IEEE Transactions on Neural Networks (2009)
  • S. Deerwester et al., Indexing by latent semantic analysis, Journal of the American Society of Information Science (1990)
  • T. Hofmann, Probabilistic latent semantic indexing, in: Proceedings of the Twenty-Second Annual International SIGIR...
  • D. Blei et al., Latent Dirichlet allocation, Journal of Machine Learning Research (2003)
  • M. Welling, M. Rosen-Zvi, G. Hinton, Exponential family harmoniums with an application to information retrieval, in:...
  • P. Gehler, A. Holub, M. Welling, The rate adapting Poisson model for information retrieval and object recognition, in:...
  • R.B. Yates et al., Modern Information Retrieval (1999)
    Haijun Zhang received his B.Eng. degree in the Department of Civil Engineering and Master degree in the Department of Control Theory and Engineering from the Northeastern University, Shenyang, PR China in 2004 and 2007, respectively. He worked as a research assistant at the City University of Hong Kong in April–September 2007. He is currently working towards his Ph.D. degree at the City University of Hong Kong, Hong Kong. His research interests are evolutionary computation, structure optimization, neural networks, machine learning, data mining, pattern recognition and their applications.

    Tommy W.S. Chow (IEEE M’93–SM’03) received his B.Sc. (First Hons.) and Ph.D. degrees from the University of Sunderland, Sunderland, U.K. He joined the City University of Hong Kong, Hong Kong, as a Lecturer in 1988. He is currently a Professor in the Electronic Engineering Department. His research interests include machine fault diagnosis, HOS analysis, system identification, and neural network learning algorithms and applications.