Automatic Extraction of Domain-Specific Stopwords from Labeled Documents

  • Conference paper
Advances in Information Retrieval (ECIR 2008)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4956)

Abstract

This paper discusses the automatic extraction of domain-specific stopword lists from a large labeled corpus. Most studies remove stopwords using a standard stopword list together with high and low document frequency thresholds. This paper proposes a new approach to stopword extraction based on the notion of backward filter-level performance and a sparsity measure of the training data. First, we discuss the motivation for updating existing lists or building new ones. Second, based on the proposed backward filter-level performance, we examine the effectiveness of high document frequency filtering for stopword reduction. Finally, a new method for building general and domain-specific stopword lists is proposed. The method assumes that a set of candidate stopwords must have minimal information content and prediction capacity, which can be estimated from classifier performance. The proposed approach is compared extensively with other methods, including inverse document frequency and information gain. According to the comparative study, the proposed approach offers more promising results, filtering out most stopwords while guaranteeing minimum information loss.
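
For orientation, the two baseline criteria named in the abstract, inverse document frequency and information gain, can be illustrated on a labeled corpus roughly as follows. This is a minimal sketch under stated assumptions, not the authors' backward filter-level method: the function names (`score_terms`, `candidate_stopwords`), the tie-breaking rule, and the toy corpus are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def score_terms(docs, labels):
    """Return {term: (idf, information_gain)} for tokenized, labeled documents.

    Assumption: a term's presence or absence in a document is the only
    feature considered; term frequency inside a document is ignored.
    """
    n = len(docs)
    class_counts = Counter(labels)
    base_entropy = entropy(list(class_counts.values()))
    doc_sets = [set(d) for d in docs]
    vocab = set().union(*doc_sets)

    scores = {}
    for term in vocab:
        # Class distribution of the documents that contain the term.
        present = Counter(lab for s, lab in zip(doc_sets, labels) if term in s)
        df = sum(present.values())
        # Class distribution of the documents that do not contain it.
        absent = class_counts - present
        idf = math.log(n / df)
        conditional = (df / n) * entropy(list(present.values())) + (
            (n - df) / n
        ) * entropy(list(absent.values()))
        scores[term] = (idf, base_entropy - conditional)
    return scores


def candidate_stopwords(scores, k):
    """Hypothetical ranking: lowest information gain first, then lowest IDF."""
    ranked = sorted(scores, key=lambda t: (scores[t][1], scores[t][0], t))
    return ranked[:k]


# Toy labeled corpus (illustrative only).
docs = [
    ["the", "match", "was", "won"],
    ["the", "team", "won", "the", "cup"],
    ["the", "market", "fell"],
    ["the", "bank", "raised", "rates"],
]
labels = ["sport", "sport", "finance", "finance"]

scores = score_terms(docs, labels)
print(candidate_stopwords(scores, 1))
```

In this toy corpus a term that occurs in every document carries no class information, so it has both zero IDF and zero information gain and ranks first as a stopword candidate. The abstract's proposal differs in that it estimates information content through classifier performance rather than through these per-term scores.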



Editor information

Craig Macdonald, Iadh Ounis, Vassilis Plachouras, Ian Ruthven, Ryen W. White

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Makrehchi, M., Kamel, M.S. (2008). Automatic Extraction of Domain-Specific Stopwords from Labeled Documents. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds) Advances in Information Retrieval. ECIR 2008. Lecture Notes in Computer Science, vol 4956. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78646-7_22

  • DOI: https://doi.org/10.1007/978-3-540-78646-7_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-78645-0

  • Online ISBN: 978-3-540-78646-7

  • eBook Packages: Computer Science, Computer Science (R0)
