A Study on Corpus-based Stopword Lists in Indian Language IR

Published: 25 July 2023

Abstract

We explore and evaluate the effect of different stopword lists (non-corpus-based and corpus-based) on information retrieval (IR) tasks in Indian languages (Bengali, Marathi, Gujarati, and Hindi) and in English. The issue was investigated from three viewpoints. (i) Is there any performance difference between non-corpus-based and corpus-based stopword removal in the chosen Indian languages? Can corpus-based stopword lists improve IR performance in Indian languages and, if so, to what extent? (ii) Among the different corpus-based stopword lists, which one provides the best IR performance? (iii) Does the length of a corpus-based stopword list affect retrieval performance in Indian languages and, if so, to what extent? We observed that a corpus-based stopword list provides better retrieval performance than a non-corpus-based stopword list in the different Indian languages. Among the corpus-based stopword lists generated and tested, the Zipf's law-based stopword list (the idf-based one) provides the best retrieval performance in the various Indian languages. The aggregation1-based stopword list provides better retrieval than the aggregation2-based list in Indian languages, but in English the aggregation2-based list performs better than the aggregation1-based list. The best-performing idf-based stopword list improves the MAP score over the corresponding baseline by 5.43% in Bengali, 1.91% in Marathi, 5.4% in Gujarati, 1.5% in Hindi, and 2.12% in English. The probabilistic retrieval models (BM25 and TF-IDF) perform best in the different Indian languages. Shorter corpus-based stopword lists perform better than longer non-corpus-based stopword lists for all the Indian languages considered.
The proposed schemes demonstrate that a stopword list can be generated heuristically by a language-independent statistical method and used effectively for IR tasks, with performance comparable to, or even better than, that of non-corpus-based stopword lists.

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 22, Issue 7
    July 2023
    422 pages
    ISSN:2375-4699
    EISSN:2375-4702
    DOI:10.1145/3610376

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 25 July 2023
    Online AM: 04 July 2023
    Accepted: 12 June 2023
    Revised: 15 November 2022
    Received: 08 December 2021
    Published in TALLIP Volume 22, Issue 7

    Author Tags

    1. Indian languages
    2. stopword
    3. evaluation

    Qualifiers

    • Research-article

    Funding Sources

    • IIT (B.H.U), Varanasi
    • National Supercomputing Mission, Government of India at the IIT (B.H.U)
