Multi-font printed Mongolian document recognition system

  • Original Paper
  • International Journal on Document Analysis and Recognition (IJDAR)

Abstract

Mongolian is one of the most common written languages in China, Mongolia, and Russia. Many printed Mongolian documents still remain to be digitized for digital library applications. The traditional Mongolian script has a unique vertical cursive writing style and multiple font variations, which makes Mongolian Optical Character Recognition challenging. As the traditional Mongolian script has subcomponent characteristics, such that one character may be a constituent of another character, in this work we define a novel character set for recognition using segmented components. The components are combined into characters in a rule-based post-processing module. For overall character recognition, a method based on Visual Directional Features and multi-level classifiers is presented. For character segmentation, segmentation points are identified by analyzing the properties of projection profiles and connected components. Mongolian has dozens of different printed font types that can be categorized into two major groups, namely, standard and handwritten-style groups. The segmentation parameters are adjusted for each group. Additionally, script identification and relevant character recognition kernels are integrated for the recognition of Mongolian text mixed with Chinese and English. A novel multi-font printed Mongolian document recognition system based on the proposed methods is implemented. Experiments indicate a text recognition rate of 96.9% on the test samples from real documents with multiple font types and mixed script. The proposed methods can also be applied to other scripts in the Mongolian script family, such as Todo and Sibe, with significant potential for extension to historic Mongolian documents.
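
The abstract mentions locating segmentation points by analyzing projection profiles. As a rough illustration only (the paper's actual procedure is not reproduced on this page), the Python sketch below assumes a binarized, vertically written word image in which adjacent glyphs are joined by a thin connecting stroke, and proposes a cut row wherever the row-wise ink count stays small. The function names, the max_ink threshold, and the toy image are all hypothetical; in a multi-font setting such a threshold is the kind of parameter that might be tuned per font group.

```python
from typing import List

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def row_profile(image: Image) -> List[int]:
    """Count ink pixels in each row (projection onto the vertical axis)."""
    return [sum(row) for row in image]


def candidate_cut_rows(image: Image, max_ink: int) -> List[int]:
    """Return one candidate cut row per run of consecutive 'thin' rows.

    A row is 'thin' when it holds between 1 and max_ink ink pixels, which
    in this toy model means it likely contains only the connecting stroke.
    max_ink is an assumed tuning parameter, not a value from the paper.
    """
    profile = row_profile(image)
    cuts: List[int] = []
    run_start = None
    # A trailing 0 acts as a sentinel so a run reaching the last row is closed.
    for y, count in enumerate(profile + [0]):
        thin = 0 < count <= max_ink
        if thin and run_start is None:
            run_start = y
        elif not thin and run_start is not None:
            cuts.append((run_start + y - 1) // 2)  # middle row of the run
            run_start = None
    return cuts


# Toy word image: two blobs joined by a one-pixel-wide vertical stroke.
word = [
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 1, 1, 0, 0],
]

print(row_profile(word))
print(candidate_cut_rows(word, max_ink=1))
```

A practical system would combine such candidates with connected-component analysis, as the abstract indicates, since thin rows at the start or end of a word also pass this simple test.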

Author information

Corresponding author

Correspondence to Liangrui Peng.

About this article

Cite this article

Peng, L., Liu, C., Ding, X. et al. Multi-font printed Mongolian document recognition system. IJDAR 13, 93–106 (2010). https://doi.org/10.1007/s10032-009-0106-8

  • DOI: https://doi.org/10.1007/s10032-009-0106-8
