ABSTRACT
The Dunhuang historical documents are of great significance to the study of ancient Chinese Buddhist culture and related topics. Full-text information generated by historical document recognition technology would greatly benefit both the preservation and the study of these documents. However, many Dunhuang documents are old and damaged, and, more challenging still, their writing style and layout are irregular. Traditional layout analysis algorithms pay little attention to these problems. In this paper, a new layout analysis algorithm based on a Probabilistic Graphical Model is proposed, consisting of a rough segmentation step and a fine segmentation step. After the input historical document images are pre-processed by Gaussian smoothing and binarization, the rough segmentation step uses projection information to obtain rough text-column regions. In the fine segmentation step, a connected component analysis algorithm based on a Probabilistic Graphical Model is developed: the method models the extracted connected components with a Markov Random Field and merges connected components to produce the output text columns. Experiments were conducted on a set of Dunhuang historical documents, and the proposed method correctly segmented text columns with a recall rate of 90.0% and an accuracy of 77.7%. The segmented text-column regions cover 99.2% of the characters in the historical document images. The results show that the proposed layout analysis algorithm can be successfully applied to degraded historical document images.
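The pre-processing and rough segmentation steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the binarization method, the filter width, or any thresholds, so the fixed global threshold, the `sigma` value, and the function names (`smooth`, `binarize`, `rough_columns`) are all assumptions chosen for illustration. The sketch also assumes dark ink on a light background and vertical text columns, and it does not attempt the Markov Random Field fine segmentation.

```python
import numpy as np


def gaussian_kernel(sigma: float) -> np.ndarray:
    """Normalized 1-D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return kernel / kernel.sum()


def smooth(image: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Separable Gaussian smoothing of a grayscale image (edge-padded)."""
    kernel = gaussian_kernel(sigma)
    pad = len(kernel) // 2
    padded = np.pad(image.astype(float), pad, mode="edge")
    # Filter along rows first, then along columns.
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)


def binarize(image: np.ndarray, threshold: float = 128.0) -> np.ndarray:
    """True where a pixel is ink. The fixed threshold is an assumption."""
    return image < threshold


def rough_columns(ink: np.ndarray, min_ink: int = 1, min_width: int = 2):
    """Return (start, end) x-ranges, end exclusive, of rough text columns.

    The vertical projection profile counts ink pixels in each image
    column; consecutive image columns with at least `min_ink` ink pixels
    form one candidate text-column region.
    """
    profile = ink.sum(axis=0)
    active = profile >= min_ink
    # Pad with False so every run has a detectable start and end.
    padded = np.concatenate(([False], active, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = changes[0::2], changes[1::2]
    return [(int(s), int(e)) for s, e in zip(starts, ends)
            if e - s >= min_width]


if __name__ == "__main__":
    # Synthetic page: white background with two dark vertical strokes.
    page = np.full((20, 30), 255, dtype=np.uint8)
    page[:, 5:10] = 0
    page[:, 18:24] = 0
    print(rough_columns(binarize(smooth(page, sigma=1.0))))
```

On real degraded manuscripts, a profile-based split of this kind would be expected to merge touching columns or break at stains, which is presumably why the paper follows it with a connected-component-level refinement.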