research-article

A design of a preprocessing framework for large database of historical documents

Published: 16 September 2011

Abstract

The objective of document preprocessing is to ease text recognition or document indexing. Analyzing historical documents is challenging because most of them are noisy and exhibit many degradations. In this paper we propose a preprocessing framework for a large dataset of historical documents. The framework consists of two phases: selection and evaluation. In the selection phase, one or more methods are assigned to each book in the database. The evaluation phase validates the selection results. Experiments are carried out on printed and handwritten documents taken from the Google Books and Bayerische Staatsbibliothek databases, respectively. The results obtained in the evaluation phase are very promising.
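The abstract does not say which preprocessing methods or evaluation metrics the framework uses, so the following is only a minimal sketch of what a two-phase select-then-evaluate loop could look like. It assumes, for illustration, that the candidate methods are binarization functions, that a few pages per book have ground-truth binarizations, and that the score is the pixel-level F-measure. All names (`global_threshold`, `select_methods`, `evaluate_selection`, `tolerance`) are hypothetical and not taken from the paper.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Assumed representation: a grayscale page is a list of rows of 0..255 values,
# and a binarized page uses 1 for text pixels and 0 for background.
Image = List[List[int]]
Method = Callable[[Image], Image]
Page = Tuple[Image, Image]  # (grayscale page, ground-truth binarization)


def global_threshold(t: int) -> Method:
    """Return a toy binarization method: pixels darker than t become text."""
    def binarize(img: Image) -> Image:
        return [[1 if p < t else 0 for p in row] for row in img]
    return binarize


def f_measure(pred: Image, truth: Image) -> float:
    """Pixel-level F-measure of a binarization against its ground truth."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def score_methods(pages: List[Page], methods: Dict[str, Method]) -> Dict[str, float]:
    """Mean F-measure of every candidate method over the given pages."""
    return {
        name: mean(f_measure(method(img), truth) for img, truth in pages)
        for name, method in methods.items()
    }


def select_methods(pages: List[Page], methods: Dict[str, Method],
                   tolerance: float = 0.0) -> List[str]:
    """Selection phase: keep every method within `tolerance` of the best score."""
    scores = score_methods(pages, methods)
    best = max(scores.values())
    return [name for name, s in scores.items() if best - s <= tolerance]


def evaluate_selection(held_out: List[Page], methods: Dict[str, Method],
                       selected: List[str]) -> Dict[str, float]:
    """Evaluation phase: re-score only the selected methods on held-out pages."""
    return score_methods(held_out, {name: methods[name] for name in selected})
```

Returning a list from `select_methods` mirrors the abstract's statement that one or more methods may be assigned to a book; the `tolerance` parameter is one assumed way to allow ties.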

Published In

HIP '11: Proceedings of the 2011 Workshop on Historical Document Imaging and Processing
September 2011
195 pages
ISBN:9781450309165
DOI:10.1145/2037342

Sponsors

  • IAPR: International Association for Pattern Recognition

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. evaluation metrics
  2. ground-truth generation
  3. method selection
  4. preprocessing framework

Qualifiers

  • Research-article

Conference

HIP '11
Sponsor:
  • IAPR
HIP '11: Historical Document Imaging and Processing
September 16 - 17, 2011
Beijing, China

Acceptance Rates

Overall Acceptance Rate 52 of 90 submissions, 58%

Cited By

  • (2021) Benchmark and Survey of Automated Machine Learning Frameworks. Journal of Artificial Intelligence Research 70 (409-472). DOI: 10.1613/jair.1.11854. Online publication date: 1-May-2021
  • (2020) Historical Document Image Binarization: A Review. SN Computer Science 1:3. DOI: 10.1007/s42979-020-00176-1. Online publication date: 16-May-2020
  • (2019) Automatic Composition and Optimization of Multicomponent Predictive Systems With an Extended Auto-WEKA. IEEE Transactions on Automation Science and Engineering 16:2 (946-959). DOI: 10.1109/TASE.2018.2876430. Online publication date: Apr-2019
  • (2016) Beyond the Ground Truth: Alternative Quality Measures of Document Binarizations. 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR) (495-500). DOI: 10.1109/ICFHR.2016.0097. Online publication date: Oct-2016
  • (2016) AMADI_LontarSet: The First Handwritten Balinese Palm Leaf Manuscripts Dataset. 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR) (168-173). DOI: 10.1109/ICFHR.2016.0042. Online publication date: Oct-2016
  • (2016) Towards Automatic Composition of Multicomponent Predictive Systems. Hybrid Artificial Intelligent Systems (27-39). DOI: 10.1007/978-3-319-32034-2_3. Online publication date: 14-Apr-2016
  • (2015) An initial study on the construction of ground truth binarized images of ancient palm leaf manuscripts. Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR) (656-660). DOI: 10.1109/ICDAR.2015.7333843. Online publication date: 23-Aug-2015
  • (2013) Evaluating glyph binarizations based on their properties. Proceedings of the 2013 ACM symposium on Document engineering (127-130). DOI: 10.1145/2494266.2494318. Online publication date: 10-Sep-2013
  • (2013) Performance Evaluation Methodology for Historical Document Image Binarization. IEEE Transactions on Image Processing 22:2 (595-609). DOI: 10.1109/TIP.2012.2219550. Online publication date: 1-Feb-2013
  • (2012) Collaborative Access to Ancient Documents. International Journal of Mobile Computing and Multimedia Communications 4:3 (34-53). DOI: 10.4018/jmcmc.2012070103. Online publication date: 1-Jul-2012
