DOI: 10.1145/3140107.3140125
research-article

Stopword Removal: Why Bother? A Case Study on Verbose Queries

Published: 16 November 2017

ABSTRACT

Stopword removal has traditionally been an integral step in information retrieval pre-processing. In this paper, we question the utility of this step in retrieving relevant documents for verbose queries on standard datasets. We show that removing stopwords makes no noticeable difference to retrieval performance compared with retaining them. We observe this phenomenon in 7 FIRE test collections for 4 Indian languages (Bangla, Hindi, Gujarati and Marathi), as well as for European languages such as Czech (CLEF 2007) and Hungarian (CLEF 2005 to 2007). Since these languages are inflective, the stopword lists are not significant. More interestingly, for languages such as English (TREC678 Ad Hoc) and French (CLEF 2005 to 2007), stopword removal leads to a statistically significant drop in performance. This is due to the use of a generic stopword list that does not suit many document retrieval tasks.
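
For readers unfamiliar with the pre-processing step in question, the following is a minimal sketch of stopword removal applied to a verbose query. The stopword list, the example query, and the helper names `tokenize` and `remove_stopwords` are all hypothetical and chosen for illustration; they are not the lists, queries, or retrieval setup used in the paper.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# this is NOT the generic list evaluated in the paper.
STOPWORDS = frozenset({
    "a", "an", "the", "of", "in", "on", "to", "for",
    "and", "is", "are", "what", "that",
})


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens not found in the stopword list, preserving order."""
    return [t for t in tokens if t not in stopwords]


# A made-up verbose query (an assumption, not taken from any test collection).
query = ("What are the effects of stopword removal "
         "on the retrieval of relevant documents")

tokens = tokenize(query)
filtered = remove_stopwords(tokens)

print(len(tokens), "tokens before removal")
print(len(filtered), "tokens after removal:", filtered)
```

With a list like this, more than half of the query terms are discarded before retrieval, which shows how much a generic list can change a verbose query.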


        • Published in

          Compute '17: Proceedings of the 10th Annual ACM India Compute Conference
          November 2017
          148 pages
          ISBN:9781450353236
          DOI:10.1145/3140107

          Copyright © 2017 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Qualifiers

          • research-article
          • Research
          • Refereed limited

          Acceptance Rates

          Compute '17 paper acceptance rate: 19 of 70 submissions, 27%. Overall acceptance rate: 114 of 622 submissions, 18%.
