ABSTRACT
In this paper, we present the first effort in preprocessing and character segmentation of digitized Nom document pages toward their digital archiving. Nom is an ideographic script for writing Vietnamese, used from the 10th century to the 20th century. Because Nom documents appear in various complex layouts, we propose an efficient method based on connected component analysis for extracting characters from page images. The area Voronoi diagram is then employed to represent the neighborhood and boundary of connected components. Based on this representation, each character can be considered a group of adjacent Voronoi regions. To improve segmentation performance, we use the recursive x-y cut method to segment separated regions. We evaluate the method on several pages with different layouts. The results confirm that the method is effective for character segmentation in Nom documents.
Index Terms
- Development of Nom character segmentation for collecting patterns from historical document pages