
A Rule-Based Approach to Identify Stop Words for Gujarati Language

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 515)

Abstract

Stop word removal is an important step in many natural language processing (NLP) tasks. To date, no standardized, exhaustive, and dynamic stop word list exists for documents written in Gujarati, an Indian language spoken by nearly 66 million people worldwide. Most existing stop word removal approaches are file or dictionary based: they rely on a hard-coded, static, nonstandardized, and individually created list of stop words. These approaches are time consuming and complex because the file or dictionary must be prepared by collecting candidate stop words from a large vocabulary, within a complex framework, for morphologically variant Gujarati documents. Other approaches proposed in the literature are also restricted by their dependence on word length, word frequency, and/or a training dataset. This paper proposes, for the first time, a dynamic approach that is independent of all these factors, namely the use of a file or dictionary, word length, word frequency, and a training dataset. An approach based on 11 rules is presented that focuses on automatic and dynamic identification of a complete list of Gujarati stop words. The proposed algorithm was evaluated on nearly 600 Gujarati documents, divided into routine and domain-specific categories. The resulting average accuracies of 98.10% and 94.08%, respectively, show that the proposed approach is effective and promising for NLP tasks involving documents written in Gujarati.
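The abstract does not list the 11 rules, so the following Python sketch only illustrates the general shape of a rule-based identifier and one plausible way to score it. Everything in it is an assumption made for illustration: the three toy rules, the helper names (`identify_stop_words`, `accuracy`), the sample sentence, the hand-labeled gold set, and the accuracy definition (fraction of distinct tokens labeled correctly). The toy rules use tiny closed word sets purely to keep the example short; they are not the paper's rules, which are described as independent of any file or dictionary.

```python
import unicodedata
from typing import Callable, Iterable

# A rule takes a token and returns True if it flags the token as a stop word.
Rule = Callable[[str], bool]

# Hypothetical toy rules for illustration only; NOT the paper's 11 rules.
def is_conjunction_or_particle(token: str) -> bool:
    return token in {"અને", "પણ", "કે", "તો"}

def is_copula(token: str) -> bool:
    return token in {"છે", "છું", "છો"}

def is_postposition(token: str) -> bool:
    return token in {"માં", "થી", "ને"}

RULES: list[Rule] = [is_conjunction_or_particle, is_copula, is_postposition]

def tokenize(text: str) -> list[str]:
    # Normalize to NFC so visually identical strings compare equal,
    # then split on whitespace (a deliberate simplification).
    return unicodedata.normalize("NFC", text).split()

def identify_stop_words(tokens: Iterable[str], rules: list[Rule]) -> set[str]:
    # A token is flagged as soon as any single rule fires.
    return {t for t in tokens if any(rule(t) for rule in rules)}

def accuracy(predicted: set[str], gold: set[str], vocabulary: set[str]) -> float:
    # Assumed metric: share of distinct tokens whose stop/non-stop
    # label agrees with the hand-assigned gold label.
    if not vocabulary:
        return 0.0
    correct = sum((t in predicted) == (t in gold) for t in vocabulary)
    return correct / len(vocabulary)

# Sample sentence and gold labels are made up for this sketch.
tokens = tokenize("આ પુસ્તક સારું છે અને તે પણ સારું છે")
vocabulary = set(tokens)
gold = {"આ", "છે", "અને", "તે", "પણ"}

predicted = identify_stop_words(tokens, RULES)
acc = accuracy(predicted, gold, vocabulary)
print(sorted(predicted), round(acc, 4))
```

Because the toy rules do not cover demonstratives such as "આ" and "તે", the sketch misses part of the gold set. This is the kind of gap that adding further rules would be expected to close, and averaging the per-document score over a category would give a figure comparable in form, though not in method, to the averages reported above.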


Author information

Corresponding author

Correspondence to Rajnish M. Rakholia.

Copyright information

© 2017 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Rakholia, R.M., Saini, J.R. (2017). A Rule-Based Approach to Identify Stop Words for Gujarati Language. In: Satapathy, S., Bhateja, V., Udgata, S., Pattnaik, P. (eds) Proceedings of the 5th International Conference on Frontiers in Intelligent Computing: Theory and Applications. Advances in Intelligent Systems and Computing, vol 515. Springer, Singapore. https://doi.org/10.1007/978-981-10-3153-3_79


  • DOI: https://doi.org/10.1007/978-981-10-3153-3_79


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-3152-6

  • Online ISBN: 978-981-10-3153-3

  • eBook Packages: Engineering, Engineering (R0)
