Elsevier

Information Systems

Volume 31, Issue 7, November 2006, Pages 595-609

Accurate discovery of co-derivative documents via duplicate text detection

https://doi.org/10.1016/j.is.2005.11.006

Abstract

Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other, or some portion of both must be derived from a third document. An existing technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences, or chunks. Fingerprinting is hampered by an inability to accurately isolate information that is useful in identifying co-derivatives. In this paper we present spex, a novel hash-based algorithm for extracting duplicated chunks from a document collection. We discuss how information about shared chunks can be used for efficiently and reliably identifying co-derivative clusters, and describe deco, a prototype package that combines the spex algorithm with other optimisations and compressed indexing to produce a flexible and scalable co-derivative discovery system. Our experiments with multi-gigabyte document collections demonstrate the effectiveness of the approach.

Introduction

Many document collections contain sets of documents that are co-derived. Examples of co-derived documents include plagiarised documents, document revisions, and documents written by amending a template. Knowledge of co-derivative document relationships in a collection can be used for returning more informative results from search engines, detection of plagiarism, and management of document versioning in an enterprise.

Depending upon the application, we may wish to identify all pairs of co-derived documents in a given collection (the n×n or discovery problem) or only those documents that are co-derived with a specified query document (the 1×n or search problem). We focus in this research on the more difficult discovery problem. While the discovery problem can be solved naïvely by applying a search algorithm once per document, this approach becomes too time-consuming for practical use on large collections.

Though the task of detecting co-derivative documents is superficially similar to that of document search or categorisation, there are marked differences. Ranking and categorisation are concerned with the semantics of documents, while co-derivative detection is concerned with a document's syntactic structure. While independently authored documents can have similar semantics (student essays on the same topic are an example), it is exceedingly unlikely for documents from different sources to have the same syntactic structure.

Existing feasible techniques for solving the discovery problem are based on document fingerprinting, in which a compact representation of a selected subset of contiguous text chunks occurring in each document—its fingerprint—is stored. Pairs of documents are identified as possibly co-derived if enough of the chunks in their respective fingerprints match. Fingerprinting schemes differ primarily in the way in which chunks to be stored are selected.
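The general fingerprinting idea can be illustrated with a minimal Python sketch. The chunk length, the selection rule (keep hashes divisible by a modulus), the match threshold, and all function names are assumptions chosen for illustration; they do not reproduce any particular scheme from the literature.

```python
import hashlib
from itertools import combinations


def chunks(text, size=3):
    """Yield every contiguous run of `size` words (a 'chunk')."""
    words = text.lower().split()
    for i in range(len(words) - size + 1):
        yield " ".join(words[i:i + size])


def chunk_hash(chunk):
    """Stable 64-bit hash of a chunk (md5 is used only for determinism)."""
    return int.from_bytes(hashlib.md5(chunk.encode("utf-8")).digest()[:8], "big")


def fingerprint(text, size=3, modulus=1):
    """Hashes of the selected chunks.

    modulus=1 keeps every chunk; a larger modulus keeps roughly
    1/modulus of them, which is a lossy (heuristic) selection.
    """
    return {h for h in map(chunk_hash, chunks(text, size)) if h % modulus == 0}


def candidate_pairs(docs, min_shared=2, size=3, modulus=1):
    """Return pairs of document ids whose fingerprints share enough hashes."""
    prints = {doc_id: fingerprint(text, size, modulus) for doc_id, text in docs.items()}
    pairs = []
    for a, b in combinations(sorted(prints), 2):
        if len(prints[a] & prints[b]) >= min_shared:
            pairs.append((a, b))
    return pairs


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "yesterday the quick brown fox jumps over a fence",
    "c": "an entirely unrelated sentence about compressed indexes",
}
print(candidate_pairs(docs))  # [('a', 'b')]
```

Raising `modulus` shrinks each fingerprint but can discard exactly the chunks that two co-derived documents have in common, which is the selection trade-off on which fingerprinting schemes differ.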

In this paper we introduce spex, a novel and efficient algorithm for identifying those chunks that occur more than once within a collection. We present the deco package, which uses the shared-chunk indexes generated by spex as the basis for accurate and efficient identification of co-derivative documents in a collection. We show that deco effectively addresses some of the deficiencies of existing approaches to this problem. Using several collections, we experimentally demonstrate that deco is able to reliably and accurately identify co-derivative documents within a collection while using fewer resources than previous techniques of similar capability. Our results also suggest that deco scales well to large collections.

Section snippets

What is co-derivation?

We consider two documents to be co-derived if some portion of one document is derived from the other, or some portion that is present in both documents is derived from a third. The notion of co-derivation is in many ways analogous to the idea of a genetic or ‘blood’ relationship in a human family.

While the above is an intuitive and appealing definition, it is purely qualitative. It tells us nothing of how to detect co-derivation, or even what characteristics we expect a pair of co-derived

The relationship graph

We introduce the concept of a relationship graph for representing and analysing co-derivation relationships within a collection. In a relationship graph for a given collection, each document is represented by a vertex. A co-derivation relationship between a pair of documents is indicated by the presence of an edge between the vertices representing these documents. The relationship graph emphasises the essentially pairwise nature of the co-derivation relationship, and allows for easy
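A relationship graph of this kind can be sketched in Python as a set of adjacency lists. The function names are hypothetical, and treating connected components as co-derivative clusters is an assumption made for this illustration rather than a definition taken from the text.

```python
def build_relationship_graph(doc_ids, co_derived_pairs):
    """Adjacency sets: one vertex per document, one undirected edge per pair."""
    graph = {doc_id: set() for doc_id in doc_ids}
    for a, b in co_derived_pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def clusters(graph):
    """Connected components with more than one vertex, found by depth-first search."""
    seen, result = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            vertex = stack.pop()
            if vertex in component:
                continue
            component.add(vertex)
            stack.extend(graph[vertex] - component)
        seen |= component
        if len(component) > 1:
            result.append(sorted(component))
    return result


graph = build_relationship_graph(
    ["d1", "d2", "d3", "d4", "d5", "d6"],
    [("d1", "d2"), ("d2", "d3"), ("d4", "d5")],
)
print(clusters(graph))  # [['d1', 'd2', 'd3'], ['d4', 'd5']]
```

Note that `d1` and `d3` land in the same cluster although no edge joins them directly: the edges are pairwise, and grouping is a separate step layered on top.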

Existing work: strategies for co-derivative discovery

There are several approaches to solving the search problem, most of which can be categorised as being either relative-frequency or fingerprinting methods:

The spex algorithm

Our contribution in this work is the spex algorithm, a resource-efficient technique for lossless chunk selection. It is a novel hash-based method for duplicate-chunk extraction whose memory requirements are far more modest and flexible than those of the algorithms discussed in Section 4.3. It is thus the first selection algorithm able to provide lossless chunk selection within large collections. In the case of large collections, the memory needs of spex are in most cases many times
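This excerpt does not describe the internals of spex, so the following Python sketch is only an illustration of how a hash-based duplicate-chunk extractor might be organised, not the published algorithm. It relies on one certain observation: a chunk can occur more than once only if both of its sub-chunks (the chunk minus its last word, and minus its first word) also occur more than once. The names `slot`, `duplicated_chunks`, and `table_size`, and the final exact verification pass, are assumptions of this sketch.

```python
from collections import Counter


def word_chunks(words, size):
    """All contiguous word sequences of the given length, as tuples."""
    return [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]


def slot(chunk, table_size):
    """Map a chunk to a counter slot; distinct chunks may collide."""
    return hash(chunk) % table_size


def duplicated_chunks(docs, size, table_size=1 << 20):
    """Return the `size`-word chunks that occur more than once in `docs`."""
    tokenised = [text.lower().split() for text in docs]
    repeated = None  # slots counted more than once at the previous length
    for length in range(1, size + 1):
        counts = Counter()
        for words in tokenised:
            for chunk in word_chunks(words, length):
                if repeated is not None and (
                    slot(chunk[:-1], table_size) not in repeated
                    or slot(chunk[1:], table_size) not in repeated
                ):
                    continue  # a sub-chunk is unique, so this chunk is unique too
                counts[slot(chunk, table_size)] += 1
        repeated = {s for s, c in counts.items() if c > 1}
    # Collisions can only add false candidates, never hide a real duplicate,
    # so an exact count over the surviving candidates removes them.
    exact = Counter(
        chunk
        for words in tokenised
        for chunk in word_chunks(words, size)
        if slot(chunk, table_size) in repeated
    )
    return {" ".join(chunk) for chunk, c in exact.items() if c > 1}


docs = [
    "the cat sat on the mat",
    "a dog sat on the mat today",
    "nothing in common here",
]
print(sorted(duplicated_chunks(docs, size=3)))  # ['on the mat', 'sat on the']
```

In this sketch the counters are keyed by hash slot rather than by chunk text, so memory is bounded by `table_size` regardless of how many distinct chunks the collection contains; a smaller table costs extra false candidates, not missed duplicates.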

The deco package

Our deco system for co-derivative detection is a software package that combines the spex algorithm with advanced indexing techniques, sophisticated scoring functions, and other previous innovations in the field.

Deco operates in two phases: index construction and relationship graph generation.
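The two phases can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the index here is a plain in-memory dictionary rather than a compressed index, the score is simply the number of shared chunks, and the names `build_shared_chunk_index`, `build_relationship_graph`, and `min_shared` are hypothetical. The scoring functions actually used by deco are not given in this excerpt.

```python
from collections import Counter, defaultdict
from itertools import combinations


def chunk_set(text, size):
    """Distinct contiguous word chunks of the given length in one document."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def build_shared_chunk_index(docs, size=3):
    """Phase 1: map each chunk found in two or more documents to those documents."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for chunk in chunk_set(text, size):
            postings[chunk].add(doc_id)
    return {chunk: ids for chunk, ids in postings.items() if len(ids) > 1}


def build_relationship_graph(index, min_shared=2):
    """Phase 2: score document pairs by shared-chunk count and keep strong edges."""
    scores = Counter()
    for ids in index.values():
        for pair in combinations(sorted(ids), 2):
            scores[pair] += 1
    return {pair: score for pair, score in scores.items() if score >= min_shared}


docs = {
    "v1": "please install the package before running the tests",
    "v2": "please install the package before running any tests",
    "other": "this document covers something else entirely",
}
index = build_shared_chunk_index(docs)
print(build_relationship_graph(index))  # {('v1', 'v2'): 4}
```

Because only shared chunks are indexed, the second phase never touches a chunk that occurs in a single document, which keeps the pairwise scoring work proportional to the amount of actual overlap.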

Experimental methodology

We seek to experimentally investigate two facets of the deco package: the accuracy and reliability of the package in identifying co-derivative document pairs, and the scaling characteristics of the system.

Document collections: We make use of six document collections for our experiments. The webdata+xml and linuxdocs collections were accumulated by Hoad and Zobel [4]. The webdata+xml collection consists of 3307 web documents totalling approximately 35 MB, into which have been seeded 9 documents

Index growth rate

In order to investigate the growth trend of the shared-chunk index as the source collection grows, we extracted subcollections of various sizes from the LATimes collection and the linuxdocs collection, and observed the number of duplicate chunks extracted as the size of the collection was increased.
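A measurement of this kind can be sketched in Python by counting duplicated chunks over growing prefixes of a collection. The chunk length, the use of exact counting rather than spex, and the function names are assumptions made for illustration; the toy input stands in for real subcollections.

```python
from collections import Counter


def count_duplicate_chunks(texts, size=3):
    """Number of distinct `size`-word chunks that occur more than once."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        for i in range(len(words) - size + 1):
            counts[tuple(words[i:i + size])] += 1
    return sum(1 for c in counts.values() if c > 1)


def growth_curve(texts, steps):
    """Duplicate-chunk count for the first n documents, for each n in `steps`."""
    return [(n, count_duplicate_chunks(texts[:n])) for n in steps]


texts = [
    "alpha beta gamma delta",
    "alpha beta gamma epsilon",
    "zeta alpha beta gamma",
    "one two three four",
]
print(growth_curve(texts, steps=[1, 2, 3, 4]))  # [(1, 0), (2, 1), (3, 1), (4, 1)]
```

Plotting the second element of each pair against the first (or against subcollection size in bytes) shows whether the shared-chunk index grows roughly in proportion to the collection or faster.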

This growth trend is important for the scalability of spex and by extension the deco package: if the growth trend were quadratic, for example, this would set a practical upper bound on the size of the

Future work & conclusions

There are many reasons why one may wish to discover co-derivation relationships amongst the documents in a collection. Previous feasible solutions to this task have been based on fingerprinting algorithms that used heuristic chunk selection techniques. We have argued that, with these techniques, one can have either reliability or acceptable resource usage, but not both at once.

We have introduced the spex algorithm for efficiently identifying non-unique chunks in a collection. Unique chunks

Acknowledgements

This research was supported by the Australian Research Council.

References (24)

  • A.Z. Broder et al., Syntactic clustering of the Web, Computer Networks and ISDN Systems (1997)
  • I.H. Witten et al., Source models for natural language text, Int. J. Man Machine Studies (1990)
  • D. Harman, Overview of the second text retrieval conference (TREC-2), Information Processing and Management (1995)
  • A.Z. Broder, On the resemblance and containment of documents, in: Compression and Complexity of Sequences...
  • M. Sanderson, Duplicate detection in the Reuters collection, Technical Report TR-1997-5, University of Glasgow,...
  • N. Shivakumar, H. García-Molina, SCAM: a copy detection mechanism for digital documents, in: Proceedings of the Second...
  • T.C. Hoad et al., Methods for identifying versioned and plagiarised documents, Journal of the American Society for Information Science and Technology (2003)
  • U. Manber, Finding similar files in a large file system, in: Proceedings of the USENIX Winter 1994 Technical...
  • S. Brin, J. Davis, H. García-Molina, Copy detection mechanisms for digital documents, in: Proceedings of the ACM SIGMOD...
  • N. Heintze, Scalable document fingerprinting, in: 1996 USENIX Workshop on Electronic Commerce,...
  • I.H. Witten et al., Managing Gigabytes: Compressing and Indexing Documents and Images (1999)
  • S. Schleimer et al., Winnowing: local algorithms for document fingerprinting

This article is based on a conference presentation: Y. Bernstein, J. Zobel, A scalable system for identifying co-derivative documents, Proceedings of the String Processing and Information Retrieval Symposium, October 2004, Padua, Italy, pp. 55–67.