Abstract
Optical character recognition (OCR) systems hold great significance in human-machine interaction. OCR has been the subject of intensive research, especially for Latin, Chinese and Japanese scripts. Comparatively little work has been done on Urdu OCR, owing to the complexities and segmentation errors associated with its cursive script. This paper proposes an Urdu OCR system that aims at ligature-level recognition of Urdu text. This ligature-based recognition approach overcomes the character-level segmentation problems associated with cursive scripts. A newly developed OCR algorithm is introduced that uses semi-supervised multi-level clustering to categorize the ligatures. Classification is performed using four machine learning techniques: decision trees, linear discriminant analysis, naive Bayes and k-nearest neighbor (K-NN). The system was implemented, and the results show accuracies of 62%, 61%, 73% and 90% for decision tree, linear discriminant analysis, naive Bayes and K-NN, respectively.
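To make the cluster-then-classify idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes each ligature has already been reduced to a small numeric feature vector (the two-dimensional toy features below are hypothetical), groups the vectors with a single level of average-linkage agglomerative clustering (the abstract does not specify the linkage, the features or the number of levels), treats the resulting cluster indices as class labels, and classifies a new vector by majority vote among its k nearest neighbors. All function names are illustrative.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative(points, n_clusters):
    """Average-linkage agglomerative clustering (assumed linkage).

    Starts with one cluster per point and repeatedly merges the two
    closest clusters until n_clusters remain. Returns a list of
    clusters, each a list of indices into points.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        _, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training vectors closest to query."""
    order = sorted(range(len(train_x)), key=lambda i: euclidean(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy features for six ligature images (two obvious groups).
features = [(1.0, 0.0), (1.1, 0.1), (0.9, 0.0),
            (3.0, 2.0), (3.1, 2.1), (2.9, 1.9)]

clusters = agglomerative(features, n_clusters=2)

# Use the cluster index as the class label of every member.
labels = [None] * len(features)
for cluster_id, members in enumerate(clusters):
    for i in members:
        labels[i] = cluster_id

pred_a = knn_predict(features, labels, (1.05, 0.05), k=3)
pred_b = knn_predict(features, labels, (3.0, 2.05), k=3)
print(clusters, labels, pred_a, pred_b)
```

In a multi-level variant, the same clustering step could be reapplied inside each cluster using a different feature subset; that extension is left out here because the abstract gives no details about the levels.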
















Cite this article
Khan, N.H., Adnan, A. & Basar, S. Urdu ligature recognition using multi-level agglomerative hierarchical clustering. Cluster Comput 21, 503–514 (2018). https://doi.org/10.1007/s10586-017-0916-2
DOI: https://doi.org/10.1007/s10586-017-0916-2