Abstract
When a page of a book is scanned or photocopied, textual noise (extraneous symbols from the neighboring page) and/or non-textual noise (black borders, speckles, ...) appear along the border of the document. Existing document analysis methods can handle non-textual noise reasonably well, whereas textual noise still presents a major issue for document analysis systems. Textual noise may result in undesired text in optical character recognition (OCR) output that needs to be removed afterwards. Existing document cleanup methods try to explicitly detect and remove marginal noise. This paper presents a new perspective for document image cleanup by detecting the page frame of the document. The goal of page frame detection is to find the actual page contents area, ignoring marginal noise along the page border. We use a geometric matching algorithm to find the optimal page frame of structured documents (journal articles, books, magazines) by exploiting their text alignment property. We evaluate the algorithm on the UW-III database. The results show that the error rates are below 4% for each of the performance measures used. Further tests were run on a dataset of magazine pages and on a set of camera-captured document images. To demonstrate the benefits of using page frame detection in practical applications, we choose OCR and layout-based document image retrieval as sample applications. Experiments using a commercial OCR system show that by removing characters outside the computed page frame, the OCR error rate is reduced from 4.3% to 1.7% on the UW-III dataset. The use of page frame detection in a layout-based document image retrieval application decreases the retrieval error rates by 30%.
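The OCR cleanup step described in the abstract, removing characters that fall outside the computed page frame, can be illustrated with a minimal sketch. It assumes the page frame is already known as an axis-aligned rectangle and that the OCR system reports one bounding box per character; it does not implement the geometric matching algorithm that finds the frame. The names `Box`, `inside_frame`, `clean_ocr_output`, and `error_rate` are hypothetical, and measuring the error as edit distance divided by ground-truth length is an assumption, not necessarily the paper's exact metric.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates (hypothetical type)."""
    x0: int
    y0: int
    x1: int
    y1: int


def inside_frame(char_box: Box, frame: Box) -> bool:
    """Return True when the centre of char_box lies within the page frame."""
    cx = (char_box.x0 + char_box.x1) / 2
    cy = (char_box.y0 + char_box.y1) / 2
    return frame.x0 <= cx <= frame.x1 and frame.y0 <= cy <= frame.y1


def clean_ocr_output(chars: List[Tuple[str, Box]], frame: Box) -> str:
    """Keep only the characters whose boxes are centred inside the frame."""
    return "".join(ch for ch, box in chars if inside_frame(box, frame))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ocr_text: str, ground_truth: str) -> float:
    """Assumed metric: edit distance normalised by ground-truth length."""
    return edit_distance(ocr_text, ground_truth) / max(len(ground_truth), 1)


# Toy example: "x" is textual noise from the neighbouring page.
chars = [("x", Box(0, 0, 5, 10)),
         ("a", Box(20, 0, 25, 10)),
         ("b", Box(30, 0, 35, 10))]
frame = Box(15, 0, 100, 50)

raw_text = "".join(ch for ch, _ in chars)
cleaned_text = clean_ocr_output(chars, frame)
```

In this toy input, the stray character to the left of the frame is dropped, so the cleaned text matches the ground truth `"ab"` while the raw OCR text does not. Testing the box centre rather than requiring full containment is a design choice made here so that characters touching the frame boundary are not discarded.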
Cite this article
Shafait, F., van Beusekom, J., Keysers, D. et al. Document cleanup using page frame detection. IJDAR 11, 81–96 (2008). https://doi.org/10.1007/s10032-008-0071-7
DOI: https://doi.org/10.1007/s10032-008-0071-7