Abstract
Digitization workflows for automatic acquisition of image collections are susceptible to errors and require quality assurance. This paper presents the automated quality assurance tools aiming at detection of possible quality issues that supports decision making for document image collections. The main contribution of this research is the implementation of various image processing tools for different error detection scenarios and their combination in to a single tool suite. The tool suite includes: (1) The matchbox tool for accurate near-duplicate detection in document image collections, based on SIFT feature extraction. (2) The finger detection tool aims at automatic detection of fingers that mistakenly appear in scans from digitized image collections, which uses processing techniques for edge detection, local image information extraction and its analysis for reasoning on scan quality. (3) The cropping error detection tool supports the detection of common cropping problems such as text shifted to the edge of the image, unwanted page borders, or unwanted text from a previous page on the image. Another important contribution of this work is a definition of the quality assurance workflow and its automatic execution for error detection in digital document collections. The presented tool suite detects described errors and presents them for additional manual analysis and collection cleaning. A statistical overview of evaluated data and characteristics like performance and accuracy is delivered. The results of the analysis confirm our hypothesis that an automated approach is able to detect errors with reliable quality, thus making quality control for large digitisation projects a feasible and affordable process.
This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Canny, J.: A computational approach to edge detection. IEEE Trans. Pat. Anal. Mach. Intell., 679–698 (1986)
Csurka, G., Dance, C.R., Fan, L., Willamowski, J.: Visual categorization with bags of keypoints. In: Workshop on SLCV, ECCV, pp. 1–22 (2004)
Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 1627–1645 (2010)
Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981)
Graf, R., King, R.: Finger detection for quality assurance of digitized image collections. In: Archiving Conference (2013)
Lu, G., Phillips, J.: Using perceptually weighted histograms for colour-based image retrieval. In: Fourth International Conference on Signal Processing, vol. 2 (1998)
Huber-Mörk, R., Schindler, A.: Quality assurance for document image collections in digital preservation. In: Blanc-Talon, J., Philips, W., Popescu, D., Scheunders, P., Zemčík, P. (eds.) ACIVS 2012. LNCS, vol. 7517, pp. 108–119. Springer, Heidelberg (2012)
Huber-Mörk, R., Schindler, A.: Quality assurance for document image collections in digital preservation. In: Blanc-Talon, J., Philips, W., Popescu, D., Scheunders, P., Zemčík, P. (eds.) ACIVS 2012. LNCS, vol. 7517, pp. 108–119. Springer, Heidelberg (2012)
Ke, Y., Sukthankar, R., Huston, L.: An efficient parts-based near-duplicate and sub-image retrieval system. In: Proceedings of the 12th Annual ACM International Conference on Multimedia, MULTIMEDIA 2004, pp. 869–876. ACM, New York (2004)
Le Bourgeois, F., Trinh, E., Allier, B., Eglin, V., Emptoz, H.: Document images analysis solutions for digital libraries, document image analysis for libraries. In: Proceedings of the First International Workshop on Document Image Analysis for Libraries (DIAL 2004), pp. 2–24 (2004)
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. of Comput. Vision 60(2), 91–110 (2004)
Marr, D., Hildreth, E.: Theory of edge detection. In: Proc. of the Royal Soc. London, pp. 187–217 (1980)
Meyer, F.: Color image segmentation. In: Image Processing and its Applications, pp. 303–306 (1992)
Graf, R., King, R., Schlarb, S.: Blank page and duplicate detection for quality assurance of document image collections. In: APA CDAC 2014 (2014)
Wu, X., Zhao, W.-L., Ngo, C.-W.: Near-duplicate keyframe retrieval with visual keywords and semantic context. In: Proc. of the 6th ACM ICIVR, pp. 162–169. ACM, New York (2007)
Zhao, W.-L., Ngo, C.-W., Tan, H.-K., Wu, X.: Near-duplicate keyframe identification with interest point matching and pattern learning. IEEE Transactions on Multimedia 9(5), 1037–1048 (2007)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Graf, R., King, R. (2014). Quality Assurance Tool Suite for Error Detection in Digital Repositories. In: Tuamsuk, K., Jatowt, A., Rasmussen, E. (eds) The Emergence of Digital Libraries – Research and Practices. ICADL 2014. Lecture Notes in Computer Science, vol 8839. Springer, Cham. https://doi.org/10.1007/978-3-319-12823-8_6
Download citation
DOI: https://doi.org/10.1007/978-3-319-12823-8_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12822-1
Online ISBN: 978-3-319-12823-8
eBook Packages: Computer ScienceComputer Science (R0)