DOI: 10.1145/2809544.2809552
research-article

Layout Analysis Algorithm Based on Probabilistic Graphical Model for Dunhuang Historical Documents

Published: 22 August 2015

ABSTRACT

The Dunhuang historical documents are of great significance to the study of ancient Chinese Buddhist culture and other topics. Full-text information generated by historical document recognition technology would greatly benefit both the protection and the study of these documents. However, many historical documents from Dunhuang are old and damaged, and, to make recognition more challenging, their style and layout are irregular as well. Traditional layout analysis algorithms pay little attention to these problems. In this paper, a new layout analysis algorithm based on a Probabilistic Graphical Model is proposed, consisting of a rough segmentation step and a fine segmentation step. After the input historical document images are pre-processed by Gaussian smoothing and binarization, the rough segmentation step uses projection information to obtain rough text-column regions. In the fine segmentation step, a connected component analysis algorithm based on a Probabilistic Graphical Model is developed. The method models the extracted connected components with a Markov Random Field and merges connected components to produce the output text columns. Experiments were conducted on a set of Dunhuang historical documents, and the proposed method correctly segmented text columns with a recall rate of 90.0% and an accuracy of 77.7%. The segmented text-column regions covered 99.2% of the characters in the historical document images. The results show that the proposed layout analysis algorithm can be successfully applied to degraded historical document images.
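The pre-processing and rough segmentation steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no parameters, so the function names (`gaussian_smooth`, `rough_text_columns`), the fixed global threshold of 128, the kernel radius, and the `min_ink_fraction` and `min_width` cut-offs are all assumptions chosen for illustration. The sketch assumes vertically written text, so ink is projected onto the horizontal axis. The Markov Random Field fine segmentation step is not sketched here.

```python
import numpy as np


def gaussian_smooth(gray, sigma=1.0):
    """Separable Gaussian smoothing with edge padding (sigma is an assumed value)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    # Pad by repeating border pixels so the page margins are not darkened.
    padded = np.pad(gray.astype(float), radius, mode="edge")
    # Filter along rows, then along columns; "valid" removes the padding again.
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)


def find_runs(mask, min_width):
    """Return (start, end) index pairs, end exclusive, for runs of True values."""
    padded = np.concatenate(([False], mask, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = changes[0::2], changes[1::2]
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e - s >= min_width]


def rough_text_columns(gray, sigma=1.0, threshold=128,
                       min_ink_fraction=0.05, min_width=3):
    """Rough text-column regions from a vertical projection profile.

    gray: 2-D array, dark ink on a light background (assumption).
    Returns a list of (x_start, x_end) pixel ranges, x_end exclusive.
    """
    smoothed = gaussian_smooth(gray, sigma)
    ink = smoothed < threshold          # global threshold (assumed binarization)
    profile = ink.sum(axis=0)           # number of ink pixels in each pixel column
    is_text = profile > min_ink_fraction * gray.shape[0]
    return find_runs(is_text, min_width)
```

Calling `rough_text_columns(page)` on a grayscale page array returns one horizontal pixel range per candidate text column. On degraded pages a single global threshold is unlikely to be sufficient, which is one reason a refinement stage such as the connected component analysis described above is needed.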


  • Published in

    ACM Other conferences
    HIP '15: Proceedings of the 3rd International Workshop on Historical Document Imaging and Processing
    August 2015
    155 pages
    ISBN:9781450336024
    DOI:10.1145/2809544

    Copyright © 2015 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 52 of 90 submissions, 58%