Accurate discovery of co-derivative documents via duplicate text detection☆
Introduction
Many document collections contain sets of documents that are co-derived. Examples of co-derived documents include plagiarised documents, document revisions, and documents written by amending a template. Knowledge of co-derivative document relationships in a collection can be used for returning more informative results from search engines, detection of plagiarism, and management of document versioning in an enterprise.
Depending upon the application, we may wish to identify all pairs of co-derived documents in a given collection (the discovery problem) or only those documents that are co-derived with a specified query document (the search problem). We focus in this research on the more difficult discovery problem. While it is possible to naïvely solve the discovery problem by repeated application of an algorithm for solving the search problem, such an application becomes too time-consuming for practical use.
Though the task of detecting co-derivative documents is superficially similar to that of document search or categorisation, there are marked differences. Ranking and categorisation are concerned with the semantics of documents, while co-derivative detection is concerned with a document's syntactic structure. While independently authored documents can have similar semantics (student essays on the same topic are an example), it is exceedingly unlikely for documents from different sources to have the same syntactic structure.
Existing feasible techniques for solving the discovery problem are based on document fingerprinting, in which a compact representation of a selected subset of contiguous text chunks occurring in each document—its fingerprint—is stored. Pairs of documents are identified as possibly co-derived if enough of the chunks in their respective fingerprints match. Fingerprinting schemes differ primarily in the way in which chunks to be stored are selected.
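To make the fingerprinting idea concrete, the following sketch is illustrative only: the function names and the modulo-based selection parameter are our own, not deco's. It hashes overlapping word chunks and retains only those whose hash satisfies the classic "0 mod p" selection heuristic; two fingerprints are then compared by counting matching chunks.

```python
import hashlib

def chunks(text, n=3):
    """Overlapping word n-grams (the 'chunks') of a document."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def fingerprint(text, n=3, mod=4):
    """Illustrative fingerprint: hash every chunk, keep those whose
    hash is 0 modulo `mod` (the '0 mod p' selection heuristic)."""
    hashes = {int(hashlib.md5(c.encode()).hexdigest(), 16) for c in chunks(text, n)}
    return {h for h in hashes if h % mod == 0}

def shared(fp_a, fp_b):
    """Number of matching chunks between two fingerprints; pairs with a
    high enough count are flagged as possibly co-derived."""
    return len(fp_a & fp_b)
```

Heuristic selection such as the `mod` filter above is what keeps fingerprints compact, but it is also the source of the reliability problems discussed later: a shared chunk that is never selected can never be matched.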
In this paper we introduce spex, a novel and efficient algorithm for identifying those chunks that occur more than once within a collection. We present the deco package, which uses the shared-chunk indexes generated by spex as the basis for accurate and efficient identification of co-derivative documents in a collection. We show that deco effectively addresses some of the deficiencies of existing approaches to this problem. Using several collections, we experimentally demonstrate that deco is able to reliably and accurately identify co-derivative documents within a collection while using fewer resources than previous techniques of similar capability. Our results also suggest that deco scales well to large collections.
Section snippets
What is co-derivation?
We consider two documents to be co-derived if some portion of one document is derived from the other, or some portion that is present in both documents is derived from a third. The notion of co-derivation is in many ways analogous to the idea of a genetic or ‘blood’ relationship in a human family.
While the above is an intuitive and appealing definition, it is purely qualitative. It tells us nothing of how to detect co-derivation, or even what characteristics we expect a pair of co-derived
The relationship graph
We introduce the concept of a relationship graph for representing and analysing co-derivation relationships within a collection. In a relationship graph for a given collection, each document is represented by a vertex. A co-derivation relationship between a pair of documents is indicated by the presence of an edge between the vertices representing these documents. The relationship graph emphasises the essentially pairwise nature of the co-derivation relationship, and allows for easy
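A relationship graph is straightforward to represent in code. The sketch below is our own illustration (it assumes pairwise co-derivation scores have already been computed by some scoring function, which is not part of the original formulation): each document becomes a vertex, and an edge is added between any pair whose score reaches a threshold.

```python
def build_relationship_graph(scores, threshold):
    """Build an undirected relationship graph as an adjacency map.
    `scores` maps (doc_a, doc_b) pairs to a co-derivation score."""
    graph = {}
    for (a, b), s in scores.items():
        graph.setdefault(a, set())   # every document appears as a vertex,
        graph.setdefault(b, set())   # even if it ends up with no edges
        if s >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

scores = {("d1", "d2"): 0.9, ("d2", "d3"): 0.8, ("d1", "d4"): 0.1}
g = build_relationship_graph(scores, threshold=0.5)
# d1, d2, d3 form one connected family of co-derived documents; d4 is isolated
```

Connected components of this graph then correspond to families of co-derived documents, which is what makes the pairwise representation convenient for analysis.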
Existing work: strategies for co-derivative discovery
There are several approaches to solving the search problem, most of which can be categorised as being either relative-frequency or fingerprinting methods:
The spex algorithm
Our contribution in this work is the spex algorithm, a resource-efficient technique for lossless chunk selection. Spex is a novel hash-based method for duplicate-chunk extraction with far more modest and flexible memory requirements than the algorithms discussed in Section 4.3; it is thus the first selection algorithm able to provide lossless chunk selection within large collections. In the case of large collections, the memory needs of spex are in most cases many times
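The core pruning insight behind spex can be sketched as follows. This is our own paraphrase, not the paper's implementation: it uses exact Python sets and word-tuple chunks where spex uses compact hash-table counters to stay within its modest memory budget. The observation is that a chunk can occur more than once in a collection only if both of its sub-chunks do, so candidate chunks of length k need only be generated from duplicated chunks of length k-1.

```python
from collections import Counter

def duplicated_chunks(docs, chunk_len=4):
    """Return word chunks of length `chunk_len` that occur more than once
    across `docs`, built up iteratively from shorter duplicated chunks."""
    # Pass 1: single words that occur more than once in the collection.
    counts = Counter(w for d in docs for w in d.split())
    dup = {(w,) for w, c in counts.items() if c > 1}
    for k in range(2, chunk_len + 1):
        counts = Counter()
        for d in docs:
            words = tuple(d.split())
            for i in range(len(words) - k + 1):
                cand = words[i:i + k]
                # Count a candidate only if both of its (k-1)-length
                # sub-chunks are already known duplicates.
                if cand[:-1] in dup and cand[1:] in dup:
                    counts[cand] += 1
        dup = {c for c, n in counts.items() if n > 1}
        if not dup:
            break   # no duplicates of this length, so none of any longer length
    return dup
```

Because unique chunks are discarded at every length, the working set stays proportional to the amount of duplication in the collection rather than to the total number of chunks.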
The deco package
Our deco system for co-derivative detection is a software package that combines the spex algorithm with advanced indexing techniques, sophisticated scoring functions, and other previous innovations in the field.
Deco operates in two phases: index construction and relationship graph generation.
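As a rough illustration of the second phase (our simplification: deco's actual scoring functions are more sophisticated than a raw shared-chunk count), document pairs can be scored by walking a shared-chunk index and accumulating a count for every pair of documents that contain the same chunk.

```python
from collections import Counter
from itertools import combinations

def score_pairs(chunk_index):
    """Given a shared-chunk index mapping each duplicated chunk to the set
    of documents containing it, accumulate a score for every document
    pair that shares at least one chunk."""
    pair_counts = Counter()
    for docs in chunk_index.values():
        for a, b in combinations(sorted(docs), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

index = {"b c d": {"doc1", "doc2"}, "c d e": {"doc1", "doc2", "doc3"}}
scores = score_pairs(index)   # doc1/doc2 share two chunks, doc1/doc3 one
```

The resulting pair scores are exactly what the relationship-graph construction consumes: pairs scoring above a threshold become edges.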
Experimental methodology
We seek to experimentally investigate two facets of the deco package: the accuracy and reliability of the package in identifying co-derivative document pairs, and the scaling characteristics of the system.
Document collections: We make use of six document collections for our experiments. The linuxdocs collection and one other were accumulated by Hoad and Zobel [4]; the latter consists of 3307 web documents totalling approximately 35 MB, into which have been seeded 9 documents
Index growth rate
In order to investigate the growth trend of the shared-chunk index as the source collection grows, we extracted subcollections of various sizes from the LATimes collection and the linuxdocs collection, and observed the number of duplicate chunks extracted as the size of the collection was increased.
This growth trend is important for the scalability of spex and by extension the deco package: if the growth trend were quadratic, for example, this would set a practical upper bound on the size of the
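One simple way to characterise such a growth trend (our own analysis sketch, using made-up illustrative numbers rather than the paper's measurements) is the least-squares slope of log(duplicate-chunk count) against log(collection size): a slope near 1 indicates linear growth, while a slope near 2 would indicate the problematic quadratic case.

```python
import math

def growth_exponent(sizes, counts):
    """Least-squares slope of log(count) versus log(size).
    ~1 suggests linear index growth; ~2 suggests quadratic."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Hypothetical measurements: duplicate-chunk counts at growing subcollection sizes.
sizes = [10, 20, 40, 80]          # subcollection size, MB
counts = [1000, 2100, 4300, 8800]  # duplicate chunks extracted
slope = growth_exponent(sizes, counts)   # close to 1: roughly linear growth
```

A slope comfortably below 2 on real subcollections is what would support the scalability claim made for spex.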
Future work & conclusions
There are many reasons why one may wish to discover co-derivation relationships amongst the documents in a collection. Previous feasible solutions to this task have been based on fingerprinting algorithms that used heuristic chunk selection techniques. We have argued that, with these techniques, one can have either reliability or acceptable resource usage, but not both at once.
We have introduced the spex algorithm for efficiently identifying non-unique chunks in a collection. Unique chunks
Acknowledgements
This research was supported by the Australian Research Council.
References (24)
- et al., Syntactic clustering of the Web, Computer Networks and ISDN Systems (1997)
- et al., Source models for natural language text, Int. J. Man Machine Studies (1990)
- Overview of the second text retrieval conference (TREC-2), Information Processing and Management (1995)
- A.Z. Broder, On the resemblance and containment of documents, in: Compression and Complexity of Sequences...
- M. Sanderson, Duplicate detection in the Reuters collection, Technical Report TR-1997-5, University of Glasgow,...
- N. Shivakumar, H. García-Molina, SCAM: a copy detection mechanism for digital documents, in: Proceedings of the Second...
- et al., Methods for identifying versioned and plagiarised documents, Journal of the American Society for Information Science and Technology (2003)
- U. Manber, Finding similar files in a large file system, in: Proceedings of the USENIX Winter 1994 Technical...
- S. Brin, J. Davis, H. García-Molina, Copy detection mechanisms for digital documents, in: Proceedings of the ACM SIGMOD...
- N. Heintze, Scalable document fingerprinting, in: 1996 USENIX Workshop on Electronic Commerce,...
- Managing Gigabytes: Compressing and Indexing Documents and Images
- Winnowing: local algorithms for document fingerprinting
☆ This article is based on a conference presentation: Y. Bernstein, J. Zobel, A scalable system for identifying co-derivative documents, Proceedings of the String Processing and Information Retrieval Symposium, October 2004, Padua, Italy, pp. 55–67.