DOI: 10.1145/2536146.2536157
research-article

Duplicate detection approaches for quality assurance of document image collections

Published: 28 October 2013

Abstract

This paper evaluates several methods for automatic duplicate detection in digitized collections. These approaches are meant to support quality assurance and decision making for long-term preservation of digital content in libraries and archives. We demonstrate the advantages and drawbacks of each approach, with the goal of selecting the most efficient method that satisfies digital preservation requirements for duplicate detection in document image collections. To this end, we designed workflows of differing complexity, each demonstrating a possible duplicate detection approach. We assess each approach by workflow simplicity, detection accuracy, and acceptable performance, since image processing methods typically require significant computation. The applied image processing methods create expert knowledge that facilitates decision making for long-term preservation. We employ AI technologies such as expert rules and clustering to infer explicit knowledge about the content of the digital collection. The evaluation part of the paper presents a statistical analysis of the aggregated information and a qualitative analysis of the aggregated knowledge.

Cited By

  • (2014) Quality Assurance Tool Suite for Error Detection in Digital Repositories. The Emergence of Digital Libraries – Research and Practices, pages 48-58. DOI: 10.1007/978-3-319-12823-8_6. Online publication date: 2014

Published In

MEDES '13: Proceedings of the Fifth International Conference on Management of Emergent Digital EcoSystems
October 2013
358 pages
ISBN:9781450320047
DOI:10.1145/2536146
  • Conference Chairs: Latif Ladid, Antonio Montes
  • General Chair: Peter A. Bruck
  • Program Chairs: Fernando Ferri, Richard Chbeir
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

  • LBBC: Luxembourg Brazil Business Council
  • IPv6 Luxembourg Council: Luxembourg IPv6 Council
  • Luxembourg Green Business Awards 2013: Luxembourg Green Business Awards 2013
  • LUXINNOVATION: Agence Nationale pour la Promotion de l'Innovation et de la Recherche
  • Pro Newtech: Pro Newtech
  • CTI: Centro de Tecnologia da Informação Renato Archer

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. digital preservation
  2. image processing
  3. quality assurance

Qualifiers

  • Research-article

Acceptance Rates

MEDES '13 Paper Acceptance Rate 56 of 122 submissions, 46%;
Overall Acceptance Rate 267 of 682 submissions, 39%

Article Metrics

  • Downloads (Last 12 months): 6
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 05 Mar 2025
