Abstract
In recent years, a wide range of information in Indian regional languages such as Hindi, Urdu, Bengali, Tamil and Telugu has become available on the web as e-data. However, access to these data repositories remains very low because few efficient search engines or retrieval systems support these languages. Automatic information processing and retrieval has therefore become an urgent requirement. This paper presents an unsupervised approach to developing an Urdu stemmer. The system is trained on a dataset of 111,887 words taken from CRULP [22]. Two approaches are proposed for generating suffix rules: frequency-based stripping and length-based stripping. The evaluation was carried out on 1,200 words extracted from the Emille corpus. The experimental results show that the two algorithms are effective, with accuracies of 85.36% and 79.76%.
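The abstract names the two stripping strategies without giving their details, so the following is only a minimal sketch of how such rules might be derived and applied. It assumes suffix candidates are the trailing substrings of the training words, that a rule is kept when its count reaches a threshold, and that the two stemmers differ only in which matching rule they strip. The function names, the thresholds (`max_len`, `min_stem`, `min_freq`) and the transliterated toy word list are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def suffix_counts(words, max_len=4, min_stem=2):
    """Count trailing substrings (1..max_len chars) over the distinct words,
    skipping any split that would leave a stem shorter than min_stem."""
    counts = Counter()
    for word in set(words):
        for k in range(1, max_len + 1):
            if len(word) - k >= min_stem:
                counts[word[-k:]] += 1
    return counts


def build_rules(words, min_freq=2, max_len=4, min_stem=2):
    """Keep candidate suffixes seen in at least min_freq distinct words."""
    counts = suffix_counts(words, max_len=max_len, min_stem=min_stem)
    return {suffix: n for suffix, n in counts.items() if n >= min_freq}


def _matching(word, rules, min_stem):
    """Rules that end the word and still leave a long enough stem."""
    return [s for s in rules
            if word.endswith(s) and len(word) - len(s) >= min_stem]


def stem_by_frequency(word, rules, min_stem=2):
    """Assumed frequency-based stripping: remove the most frequent match
    (ties broken in favour of the longer suffix)."""
    candidates = _matching(word, rules, min_stem)
    if not candidates:
        return word
    best = max(candidates, key=lambda s: (rules[s], len(s)))
    return word[:-len(best)]


def stem_by_length(word, rules, min_stem=2):
    """Assumed length-based stripping: remove the longest match."""
    candidates = _matching(word, rules, min_stem)
    if not candidates:
        return word
    best = max(candidates, key=len)
    return word[:-len(best)]


# Hypothetical transliterated toy data, for illustration only.
training_words = ["larka", "larke", "larkon", "kitab", "kitaben", "kitabon"]
rules = build_rules(training_words)

print(rules)
print(stem_by_frequency("kitabon", rules))
print(stem_by_length("kitabon", rules))
```

On this toy list the two strategies disagree on the same word, which illustrates why they can score differently in an evaluation. A real system would learn its rules from a large corpus and would likely need tuned thresholds.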
References
Rizvi, J., et al.: Modeling case marking system of Urdu-Hindi languages by using semantic information. In: Proceedings of the IEEE International Conference on Natural Language Processing and Knowledge Engineering, IEEE NLP-KE 2005 (2005)
Butt, M., King, T.: Non-Nominative Subjects in Urdu: A Computational Analysis. In: Proceedings of the International Symposium on Non-nominative Subjects, Tokyo, pp. 525–548 (December 2001)
Savoy, J.: Stemming of French words based on grammatical categories. Journal of the American Society for Information Science 44(1), 1–9 (1993)
Chen, A., Gey, F.: Building an Arabic Stemmer for Information Retrieval. In: Proceedings of the Text Retrieval Conference, p. 47 (2002)
Mokhtaripour, A., Jahanpour, S.: Introduction to a New Farsi Stemmer. In: Proceedings of CIKM, Arlington, VA, USA, pp. 826–827 (2006)
Wicentowski, R.: Multilingual Noise-Robust Supervised Morphological Analysis using the Word Frame Model. In: Proceedings of Seventh Meeting of the ACL Special Interest Group on Computational Phonology (SIGPHON), pp. 70–77 (2004)
Rizvi, Hussain, M.: Analysis, Design and Implementation of Urdu Morphological Analyzer. In: SCONEST, pp. 1–7 (2005)
Krovetz, R.: Viewing Morphology as an Inference Process. In: The Proceedings of 5th International Conference on Research and Development in Information Retrieval (1993)
Porter, M.: An Algorithm for Suffix Stripping. Program 14(3), 130–137 (1980)
Thabet, N.: Stemming the Qur’an. In: The Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages (2004)
Paik, Pauri: A Simple Stemmer for Inflectional Languages. In: FIRE 2008 (2008)
Sharifloo, A.A., Shamsfard, M.: A Bottom up Approach to Persian Stemming. In: IJCNLP (2008)
Croft, Xu: Corpus-Based Stemming Using Co-occurrence of Word Variants. ACM Transactions on Information Systems, 61–81 (1998)
Kumar, A., Siddiqui, T.: An Unsupervised Hindi Stemmer with Heuristics Improvements. In: Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data (2008)
Kumar, M.S., Murthy, K.N.: Corpus Based Statistical Approach for Stemming Telugu. In: Creation of Lexical Resources for Indian Language Computing and Processing (LRIL), C-DAC, Mumbai, India (2007)
Akram, Q.-U.-A., Naseer, A., Hussain, S.: Assas-Band, an Affix-Exception-List Based Urdu Stemmer. In: Proceedings of ACL-IJCNLP 2009 (2009)
Siddiqui, T., Tiwary, U.S.: Natural Language Processing and Information Retrieval
Frakes, W.B., Baeza-Yates, R.: Information Retrieval: Data Structures and Algorithms
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Husain, M.S., Ahamad, F., Khalid, S. (2013). A Language Independent Approach to Develop Urdu Stemmer. In: Meghanathan, N., Nagamalai, D., Chaki, N. (eds) Advances in Computing and Information Technology. Advances in Intelligent Systems and Computing, vol 178. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31600-5_5
DOI: https://doi.org/10.1007/978-3-642-31600-5_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31599-2
Online ISBN: 978-3-642-31600-5
eBook Packages: Engineering, Engineering (R0)