Automatic Extraction of Domain-Specific Stopwords from Labeled Documents

Makrehchi, Masoud; Kamel, Mohamed S.

doi:10.1007/978-3-540-78646-7_22

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4956)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

This paper discusses the automatic extraction of a domain-specific stopword list from a large labeled corpus. Most studies remove stopwords using a standard stopword list combined with high and low document frequency thresholds. We propose a new approach to stopword extraction based on the notion of backward filter-level performance and a sparsity measure of the training data. First, we discuss the motivation for updating existing lists or building new ones. Second, using the proposed backward filter-level performance, we examine the effectiveness of high document frequency filtering for stopword reduction. Finally, we propose a new method for building general and domain-specific stopword lists. The method assumes that a set of candidate stopwords must have minimum information content and prediction capacity, which can be estimated from classifier performance. The proposed approach is extensively compared with other methods, including inverse document frequency and information gain. According to the comparative study, the proposed approach offers more promising results, guaranteeing minimum information loss while filtering out most stopwords.
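
To make the inverse document frequency baseline mentioned in the abstract concrete, the following is a minimal Python sketch of ranking stopword candidates by lowest IDF. It is not the authors' proposed method (the backward filter-level performance and sparsity measure are not specified in the abstract); the function names, the `top_k` parameter, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() so a term counts once per document
    return df


def idf_stopword_candidates(docs, top_k=3):
    """Return the top_k terms with the lowest IDF (hypothetical helper).

    Low IDF means the term occurs in most documents, which is the usual
    heuristic for flagging it as a stopword candidate.
    """
    n_docs = len(docs)
    df = document_frequencies(docs)
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    # Sort by ascending IDF; break ties alphabetically for determinism.
    ranked = sorted(idf, key=lambda term: (idf[term], term))
    return ranked[:top_k]


# Toy corpus (illustrative only, not data from the paper).
toy_corpus = [
    "the engine of the car failed".split(),
    "the court ruled on the appeal".split(),
    "the team won the match".split(),
    "a judge in the court".split(),
]

print(idf_stopword_candidates(toy_corpus, top_k=2))
```

In this toy corpus the general function word "the" ranks first, followed by "court", which appears in half of the documents. A purely frequency-based ranking of this kind does not use class labels, which is one reason the abstract compares it against label-aware criteria such as information gain and classifier performance.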

Author information

Authors and Affiliations

Pattern Analysis and Machine Intelligence Lab, Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Ontario, N2L3G1, Canada
Masoud Makrehchi & Mohamed S. Kamel

Editor information

Craig Macdonald Iadh Ounis Vassilis Plachouras Ian Ruthven Ryen W. White

About this paper

Cite this paper

Makrehchi, M., Kamel, M.S. (2008). Automatic Extraction of Domain-Specific Stopwords from Labeled Documents. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds) Advances in Information Retrieval. ECIR 2008. Lecture Notes in Computer Science, vol 4956. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78646-7_22

DOI: https://doi.org/10.1007/978-3-540-78646-7_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78645-0
Online ISBN: 978-3-540-78646-7
eBook Packages: Computer Science, Computer Science (R0)
