Lexicon reduction for printed Farsi subwords using pictorial and textual dictionaries

  • Original Paper
International Journal on Document Analysis and Recognition (IJDAR)

Abstract

In this paper, we present a method for reducing the lexicon of printed Farsi subwords. It uses the holistic shape together with key character information to dynamically reduce a lexicon that is organized in two ways: (1) by global shape description, which yields a pictorial dictionary, and (2) by constituent character information, which yields a textual dictionary. Given an input word image, the reduction proceeds in two successive stages. First, characteristic loci features are extracted and compared with the pictorial dictionary to select candidate subwords by shape similarity. Second, the lexicon is reduced further by determining the key character in the input image and comparing it with the textual dictionary. Key characters are those that can be segmented and recognized relatively easily and that, together with the global descriptors, characterize the word image efficiently. We also propose a method for optimal selection of key characters, based on the mutual information of the pictorial and textual dictionaries. The final candidate subwords are those that share the key character of the input image. The method was evaluated experimentally on a set of 5,000 subword samples, and it achieved a reduction rate of 97.83% on a lexicon of 6,900 printed Farsi subwords.
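
The two-stage reduction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The `Entry` record, the Euclidean `shape_distance`, the top-`keep` cut-off in stage 1, and the toy lexicon are all assumptions made for illustration. The feature vectors merely stand in for characteristic loci features, and the key character of the input is passed in directly rather than being segmented and recognized from an image.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Entry:
    """One lexicon entry (hypothetical layout, not taken from the paper)."""
    text: str                 # subword label
    shape: Tuple[float, ...]  # stand-in for a characteristic loci feature vector
    key_char: str             # key character stored in the textual dictionary


def shape_distance(a: Tuple[float, ...], b: Tuple[float, ...]) -> float:
    # Assumption: plain Euclidean distance between feature vectors.
    return math.dist(a, b)


def reduce_lexicon(lexicon: List[Entry], query_shape: Tuple[float, ...],
                   query_key_char: str, keep: int) -> List[Entry]:
    # Stage 1 (pictorial dictionary): keep the `keep` entries whose shape
    # is closest to the input. The top-k rule is an assumption.
    by_shape = sorted(
        lexicon, key=lambda e: shape_distance(e.shape, query_shape)
    )[:keep]
    # Stage 2 (textual dictionary): keep only the candidates that share
    # the key character found in the input image.
    return [e for e in by_shape if e.key_char == query_key_char]


def reduction_rate(lexicon_size: int, candidates_left: int) -> float:
    # Assumed definition: the fraction of the lexicon that was discarded.
    return 1.0 - candidates_left / lexicon_size


# Toy lexicon with made-up labels and features.
lexicon = [
    Entry("w1", (0.0, 0.0), "a"),
    Entry("w2", (0.1, 0.0), "b"),
    Entry("w3", (0.0, 0.2), "a"),
    Entry("w4", (5.0, 5.0), "a"),
    Entry("w5", (6.0, 5.0), "b"),
]

candidates = reduce_lexicon(lexicon, query_shape=(0.0, 0.1),
                            query_key_char="a", keep=3)
rate = reduction_rate(len(lexicon), len(candidates))
print([e.text for e in candidates], rate)
```

Under the assumed definition of reduction rate, the reported 97.83% on a lexicon of 6,900 subwords would correspond to roughly 6,900 × (1 − 0.9783) ≈ 150 candidates remaining on average. The abstract does not define the metric, so this figure is only indicative.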

Author information

Corresponding author

Correspondence to Homa Davoudi.

About this article

Cite this article

Davoudi, H., Kabir, E. Lexicon reduction for printed Farsi subwords using pictorial and textual dictionaries. IJDAR 17, 359–374 (2014). https://doi.org/10.1007/s10032-014-0223-x

  • DOI: https://doi.org/10.1007/s10032-014-0223-x
