A Rule-Based Approach to Identify Stop Words for Gujarati Language

Rakholia, Rajnish M.; Saini, Jatinderkumar R.

doi:10.1007/978-981-10-3153-3_79

Conference paper
First Online: 17 March 2017

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 515)

Abstract

Stop word removal is an important step in many natural language processing (NLP) tasks. To date, no standardized, exhaustive, and dynamic stop word list exists for documents written in Gujarati, an Indian language spoken by nearly 66 million people worldwide. Most existing stop word removal approaches are file or dictionary based: they rely on a hard-coded, static, nonstandardized, individually created list of stop words. These approaches are time consuming and complex, because the file or dictionary must be prepared by collecting candidate stop words from a large vocabulary, within a complex framework, for morphologically variant Gujarati documents. Other approaches proposed in the literature are also restricted by their dependence on word length, word frequency, and/or a training dataset. This paper proposes, for the first time, a dynamic approach that is independent of all of these factors: file or dictionary usage, word length, word frequency, and training dataset. The approach uses 11 rules to identify a complete list of Gujarati stop words automatically and dynamically. The proposed algorithm was evaluated on nearly 600 Gujarati documents, divided into routine and domain-specific categories. The average accuracies of 98.10% and 94.08%, respectively, show that the approach is effective and promising for NLP tasks involving Gujarati documents.
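The abstract does not list the 11 rules, so the following is only a minimal sketch of what rule-based (as opposed to dictionary-based) stop word identification can look like. Both rules, the helper `stem_plus_endings`, and the function names are hypothetical illustrations and are not taken from the paper. Each rule is a predicate over a single token, so no stop word file, frequency count, or training data is involved.

```python
import re
from typing import Callable, List

# A rule inspects one token and nothing else.
Rule = Callable[[str], bool]


def stem_plus_endings(stem: str, endings: List[str]) -> Rule:
    """Build a rule matching `stem` followed by exactly one of `endings`."""
    alternatives = "|".join(re.escape(e) for e in endings)
    pattern = re.compile(re.escape(stem) + "(?:" + alternatives + ")")
    return lambda token: pattern.fullmatch(token) is not None


# Hypothetical rule A (not from the paper): present-tense forms of the
# auxiliary verb, i.e. the stem U+0A9B plus a person/number ending.
AUXILIARY = stem_plus_endings(
    "\u0A9B",
    ["\u0AC1\u0A82", "\u0AC7", "\u0ACB", "\u0AC0\u0A8F"],
)

# Hypothetical rule B (not from the paper): genitive markers that appear
# as standalone tokens, i.e. the stem U+0AA8 plus a gender/number ending.
GENITIVE = stem_plus_endings(
    "\u0AA8",
    ["\u0ACB", "\u0AC0", "\u0AC1\u0A82", "\u0ABE"],
)

RULES: List[Rule] = [AUXILIARY, GENITIVE]


def is_stop_word(token: str) -> bool:
    """A token is a stop word if any rule accepts it."""
    return any(rule(token) for rule in RULES)


def remove_stop_words(text: str) -> List[str]:
    """Split on whitespace and drop every token flagged by a rule."""
    return [token for token in text.split() if not is_stop_word(token)]
```

Because each rule is a separate predicate, new rules can be appended to `RULES` without touching the filtering logic. A real system would also need Gujarati-aware tokenization and normalization, which this sketch omits.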

Author information

Authors and Affiliations

School of Computer Science, R K University, Rajkot, Gujarat, India
Rajnish M. Rakholia
Narmada College of Computer Application, Bharuch, Gujarat, India
Jatinderkumar R. Saini

Corresponding author

Correspondence to Rajnish M. Rakholia.

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, ANITS, Visakhapatnam, Andhra Pradesh, India
Suresh Chandra Satapathy
Department of ECE, Shri Ramswaroop Memorial Group of Professional Colleges, Lucknow, Uttar Pradesh, India
Vikrant Bhateja
SCIS, University of Hyderabad, Hyderabad, India
Siba K. Udgata
School of Computer Engineering, KIIT University, Bhubaneswar, Odisha, India
Prasant Kumar Pattnaik

About this paper

Cite this paper

Rakholia, R.M., Saini, J.R. (2017). A Rule-Based Approach to Identify Stop Words for Gujarati Language. In: Satapathy, S., Bhateja, V., Udgata, S., Pattnaik, P. (eds) Proceedings of the 5th International Conference on Frontiers in Intelligent Computing: Theory and Applications. Advances in Intelligent Systems and Computing, vol 515. Springer, Singapore. https://doi.org/10.1007/978-981-10-3153-3_79

DOI: https://doi.org/10.1007/978-981-10-3153-3_79
Published: 17 March 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-3152-6
Online ISBN: 978-981-10-3153-3
eBook Packages: Engineering, Engineering (R0)