A Language Independent Approach to Develop Urdu Stemmer

Husain, Mohd. Shahid; Ahamad, Faiyaz; Khalid, Saba

doi:10.1007/978-3-642-31600-5_5


Conference paper


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 178)

Abstract

In recent years, a wide range of information in Indian regional languages such as Hindi, Urdu, Bengali, Tamil and Telugu has become available on the web as e-data. Access to these data repositories remains low, however, because few efficient search engines or retrieval systems support these languages. Automatic information processing and retrieval has therefore become an urgent requirement. This paper presents an unsupervised approach to developing an Urdu stemmer. The system is trained on a dataset of 111,887 words taken from CRULP [22]. Two approaches to generating suffix rules are proposed: frequency-based stripping and length-based stripping. The evaluation uses 1200 words extracted from the Emille corpus. The experimental results show that the two algorithms achieve accuracies of 85.36% and 79.76%.
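The abstract names the two suffix-stripping strategies but does not spell out their rules, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that candidate suffixes are every word-final substring seen at least `min_count` times in the training list, that frequency-based stripping removes the most frequent matching suffix, and that length-based stripping removes the longest matching suffix. All function names, parameters, and the toy English word list are hypothetical; the same code operates on Urdu strings unchanged.

```python
from collections import Counter


def learn_suffixes(words, min_stem_len=2, min_count=2):
    """Count word-final substrings that leave a stem of at least
    min_stem_len characters; keep those seen min_count times or more.
    (Assumed rule-generation step, not taken from the paper.)"""
    counts = Counter()
    for word in words:
        for cut in range(min_stem_len, len(word)):
            counts[word[cut:]] += 1
    return {s: c for s, c in counts.items() if c >= min_count}


def candidate_suffixes(word, suffixes, min_stem_len=2):
    """Return the learned suffixes that this word ends with."""
    return [word[cut:] for cut in range(min_stem_len, len(word))
            if word[cut:] in suffixes]


def stem_by_frequency(word, suffixes, min_stem_len=2):
    """Strip the most frequent matching suffix (ties go to the longer one)."""
    cands = candidate_suffixes(word, suffixes, min_stem_len)
    if not cands:
        return word
    best = max(cands, key=lambda s: (suffixes[s], len(s)))
    return word[:len(word) - len(best)]


def stem_by_length(word, suffixes, min_stem_len=2):
    """Strip the longest matching suffix."""
    cands = candidate_suffixes(word, suffixes, min_stem_len)
    if not cands:
        return word
    best = max(cands, key=len)
    return word[:len(word) - len(best)]


# Toy training list (hypothetical data, English only for readability).
toy_words = ["walking", "talking", "jumping", "playing",
             "walked", "talked", "jumped", "played",
             "walks", "talks", "jumps", "plays"]
toy_suffixes = learn_suffixes(toy_words)

print(stem_by_frequency("walking", toy_suffixes))  # walk
print(stem_by_length("walking", toy_suffixes))     # wa
```

On this toy list the two strategies disagree: "ing" occurs four times, while the longer "lking" occurs only twice, so the frequency rule yields "walk" and the length rule over-strips to "wa". This illustrates the sketch only and says nothing about the accuracy figures reported in the abstract.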


References

1. Rizvi, J., et al.: Modeling case marking system of Urdu-Hindi languages by using semantic information. In: Proceedings of the IEEE International Conference on Natural Language Processing and Knowledge Engineering, IEEE NLP-KE 2005 (2005)
2. Butt, M., King, T.: Non-Nominative Subjects in Urdu: A Computational Analysis. In: Proceedings of the International Symposium on Non-nominative Subjects, Tokyo, pp. 525–548 (December 2001)
3. Savoy, J.: Stemming of French words based on grammatical categories. Journal of the American Society for Information Science 44(1), 1–9 (1993)
4. Chen, A., Gey, F.: Building an Arabic Stemmer for Information Retrieval. In: Proceedings of the Text Retrieval Conference, p. 47 (2002)
5. Mokhtaripour, A., Jahanpour, S.: Introduction to a New Farsi Stemmer. In: Proceedings of CIKM, Arlington, VA, USA, pp. 826–827 (2006)
6. Wicentowski, R.: Multilingual Noise-Robust Supervised Morphological Analysis using the Word Frame Model. In: Proceedings of Seventh Meeting of the ACL Special Interest Group on Computational Phonology (SIGPHON), pp. 70–77 (2004)
7. Rizvi, Hussain, M.: Analysis, Design and Implementation of Urdu Morphological Analyzer. In: SCONEST, pp. 1–7 (2005)
8. Krovetz, R.: View Morphology as an Inference Process. In: The Proceedings of 5th International Conference on Research and Development in Information Retrieval (1993)
9. Porter, M.: An Algorithm for Suffix Stripping. Program 14(3), 130–137 (1980)
10. Thabet, N.: Stemming the Qur’an. In: The Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages (2004)
11. Paik, Pauri: A Simple Stemmer for Inflectional Languages. In: FIRE 2008 (2008)
12. Sharifloo, A.A., Shamsfard, M.: A Bottom up Approach to Persian Stemming. In: IJCNLP (2008)
13. Croft, Xu: Corpus-Based Stemming Using Co-occurrence of Word Variants. ACM Transactions on Information Systems, 61–81 (1998)
14. Kumar, A., Siddiqui, T.: An Unsupervised Hindi Stemmer with Heuristics Improvements. In: Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data (2008)
15. Kumar, M.S., Murthy, K.N.: Corpus Based Statistical Approach for Stemming Telugu. In: Creation of Lexical Resources for Indian Language Computing and Processing (LRIL), C-DAC, Mumbai, India (2007)
16. Akram, Q.-U.-A., Naseer, A., Hussain, S.: Assas-Band, an Affix-Exception-List Based Urdu Stemmer. In: Proceedings of ACL-IJCNLP 2009 (2009)
17. http://en.wikipedia.org/wiki/Urdu
18. http://www.bbc.co.uk/languages/other/guide/urdu/steps.shtml
19. http://www.andaman.org/BOOK/reprints/weber/rep-weber.html
20. Siddiqui, T., Tiwary, U.S.: Natural Language Processing and Information Retrieval
21. Frakes, W.B., Baeza-Yates, R.: Information Retrieval: Data Structures and Algorithms
22. http://www.crulp.org/software/ling_resources.html


Author information

Authors and Affiliations

Department of Information Technology, Integral University, Lucknow, India
Mohd. Shahid Husain
Department of Computer Science & Engineering, Integral University, Lucknow, India
Faiyaz Ahamad & Saba Khalid


Corresponding author

Correspondence to Mohd. Shahid Husain.

Editor information

Editors and Affiliations

Department of Computer Science, Jackson State University, John R. Lynch Street 1400, Jackson, 39217, Mississippi, USA
Natarajan Meghanathan
Wireilla Net Solutions PTY ltd, Merlow Street 3, Melbourne, 3207, Australia
Dhinaharan Nagamalai
Department of Computer Science & Eng., University of Calcutta, Calcutta, 700 073, India
Nabendu Chaki


About this paper

Cite this paper

Husain, M.S., Ahamad, F., Khalid, S. (2013). A Language Independent Approach to Develop Urdu Stemmer. In: Meghanathan, N., Nagamalai, D., Chaki, N. (eds) Advances in Computing and Information Technology. Advances in Intelligent Systems and Computing, vol 178. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31600-5_5


DOI: https://doi.org/10.1007/978-3-642-31600-5_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31599-2
Online ISBN: 978-3-642-31600-5
eBook Packages: Engineering, Engineering (R0)
