Abstract
In this paper, we present a method for reducing the lexicon size for printed Farsi subwords. The method uses the holistic shape of a subword together with key character information to dynamically reduce a lexicon that is organized in two ways: (1) by global shape description, which yields a pictorial dictionary, and (2) by constituent character information, which yields a textual dictionary. Given an input word image, reduction proceeds in two successive stages. First, characteristic loci features are extracted and compared with the pictorial dictionary to select candidate subwords based on shape similarity. In the second stage, the lexicon is further reduced by determining the key character in the input image and comparing it with the textual dictionary. Key characters are defined as those that can be segmented and recognized relatively easily and that, together with the global descriptors, characterize the word image efficiently. We also propose a method for the optimal selection of key characters, based on the mutual information of the pictorial and textual dictionaries. The final candidate subwords are those that share the same key character with the input image. The performance of the proposed method was studied experimentally on a set of 5,000 subword samples. The results show a reduction rate of 97.83% on a lexicon of 6,900 printed Farsi subwords.
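The two-stage procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation. The `Entry` structure, the use of Euclidean distance between shape descriptors, the fixed shortlist size `k`, and the definition of reduction rate as one minus the fraction of the lexicon that remains are all assumptions made for illustration. The abstract does not specify how characteristic loci features are compared or how the rate is computed.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Entry:
    """Hypothetical lexicon entry holding both dictionary views of a subword."""
    text: str                   # subword label
    shape: tuple[float, ...]    # stand-in for a holistic shape descriptor (pictorial view)
    key_chars: frozenset[str]   # key characters the subword contains (textual view)


def reduce_by_shape(query_shape, lexicon, k):
    """Stage 1 (assumed form): keep the k entries nearest to the query descriptor."""
    return sorted(lexicon, key=lambda e: dist(e.shape, query_shape))[:k]


def reduce_by_key_char(key_char, candidates):
    """Stage 2: keep only candidates that contain the recognized key character."""
    return [e for e in candidates if key_char in e.key_chars]


def reduction_rate(lexicon_size, remaining):
    """Assumed definition: fraction of the lexicon that was eliminated."""
    return 1.0 - remaining / lexicon_size


# Toy lexicon with Latin placeholders instead of Farsi subwords.
toy_lexicon = [
    Entry("aba", (0.0, 0.0), frozenset({"a", "b"})),
    Entry("abc", (0.1, 0.0), frozenset({"a", "c"})),
    Entry("xyz", (5.0, 5.0), frozenset({"x"})),
    Entry("abd", (0.0, 0.2), frozenset({"b"})),
]

shortlist = reduce_by_shape((0.0, 0.1), toy_lexicon, k=3)   # drops the distant "xyz"
final = reduce_by_key_char("b", shortlist)                   # drops "abc" (no "b")
rate = reduction_rate(len(toy_lexicon), len(final))
```

Under the assumed definition, a reduction rate of 97.83% on a lexicon of 6,900 entries would correspond to roughly 150 remaining candidates on average (6,900 × 0.0217 ≈ 150).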
Cite this article
Davoudi, H., Kabir, E. Lexicon reduction for printed Farsi subwords using pictorial and textual dictionaries. IJDAR 17, 359–374 (2014). https://doi.org/10.1007/s10032-014-0223-x
DOI: https://doi.org/10.1007/s10032-014-0223-x