ABSTRACT
Stopword removal has traditionally been an integral step in information retrieval pre-processing. In this paper, we question the utility of this step when retrieving relevant documents for verbose queries on standard datasets. We show that removing stopwords makes no noticeable difference to retrieval performance compared with retaining them. We observe this phenomenon in 7 FIRE test collections for 4 Indian languages (Bangla, Hindi, Gujarati and Marathi), as well as for European languages such as Czech (CLEF 2007) and Hungarian (CLEF 2005 to 2007). Because these languages are inflective, their stopword lists have little impact. More interestingly, for languages such as English (TREC678 Ad Hoc) and French (CLEF 2005 to 2007), stopword removal leads to a statistically significant drop in performance. We attribute this drop to the use of a generic stopword list that is ill-suited to many document retrieval tasks.
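As a rough illustration of the pre-processing step under discussion, the sketch below strips stopwords from a verbose query so that the two query variants (with and without stopwords) can be compared. The stopword list, the tokenizer, and the example query are all hypothetical and chosen only for illustration; they are not the lists, tools, or topics used in the paper.

```python
# Hypothetical, deliberately tiny stopword list (NOT the list used in the paper).
STOPWORDS = frozenset(
    {"a", "an", "the", "of", "in", "on", "for", "to", "and", "is", "are", "that", "what"}
)


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens that are not in the stopword list, preserving order."""
    return [tok for tok in tokens if tok not in stopwords]


# A made-up verbose query, used only to show the two variants side by side.
query = "What are the effects of stopword removal on retrieval for verbose queries?"
with_stopwords = tokenize(query)
without_stopwords = remove_stopwords(with_stopwords)

print(with_stopwords)
print(without_stopwords)
```

In an actual experiment, each variant would be submitted to the same retrieval system and the per-query scores compared with a paired significance test; that part is omitted here because it depends on the collection and the engine.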
Index Terms
- Stopword Removal: Why Bother? A Case Study on Verbose Queries