Abstract.
Detecting duplicates in document image databases is a problem of growing importance. The task is made difficult by the various degradations suffered by printed documents, and by conflicting notions of what it means to be a “duplicate”. To address these issues, this paper introduces a framework for clarifying and formalizing the duplicate detection problem. Four distinct models are presented, each with a corresponding algorithm for its solution adapted from the realm of approximate string matching. The robustness of these techniques is demonstrated through a set of experiments using data derived from real-world noise sources. Also described are several heuristics that have the potential to speed up the computation by several orders of magnitude.
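As a rough illustration of the kind of approximate string matching the abstract refers to, the sketch below computes the classic Levenshtein edit distance between two text strings and uses it to flag likely duplicates. It is not one of the paper's four models, which are not specified here. The function names, the length normalization, and the threshold value are all assumptions made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming (two rows)."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def is_near_duplicate(a: str, b: str, threshold: float = 0.2) -> bool:
    """Hypothetical duplicate test: normalized edit distance under a threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return True  # two empty strings are trivially identical
    return edit_distance(a, b) / longest <= threshold
```

For instance, an OCR-style confusion such as "m" read as "rn" costs only two edits, so `is_near_duplicate("document image", "docurnent image")` holds under the assumed threshold, while unrelated strings fall well outside it.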
Additional information
Received February 3, 1999 / Revised December 2, 1999
About this article
Cite this article
Lopresti, D. String techniques for detecting duplicates in document databases. IJDAR 2, 186–199 (2000). https://doi.org/10.1007/PL00021525
DOI: https://doi.org/10.1007/PL00021525