An Empirical Study of Category Skew on Feature Selection for Text Categorization

Simeon, Mondelle; Hilderman, Robert

doi:10.1007/978-3-642-01818-3_35

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5549)

Included in the conference series: Canadian Conference on Artificial Intelligence

Abstract

In this paper, we present an empirical comparison of the effects of category skew on six feature selection methods. The methods were evaluated on 36 datasets generated from the 20 Newsgroups, OHSUMED, and Reuters-21578 text corpora. The datasets were generated to possess particular category skew characteristics (i.e., the number of documents assigned to each category). Our objective was to determine the best performance of the six feature selection methods, as measured by F-measure and Precision, regardless of the number of features needed to produce the best performance. We found the highest F-measure values were obtained by bi-normal separation and information gain and the highest Precision values were obtained by categorical proportional difference and chi-squared.
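As an illustration of the quantities named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the usual textbook definitions of chi-squared, Precision, and F-measure, and it assumes categorical proportional difference takes the commonly stated form (A - B) / (A + B), where A and B count the documents containing a term inside and outside a category. The class `TermCounts` and all function names are hypothetical, introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermCounts:
    """Document counts for one (term, category) pair (hypothetical helper)."""
    tp: int  # documents in the category that contain the term
    fp: int  # documents outside the category that contain the term
    fn: int  # documents in the category that lack the term
    tn: int  # documents outside the category that lack the term


def chi_squared(c: TermCounts) -> float:
    """Chi-squared statistic of the 2x2 term/category contingency table."""
    n = c.tp + c.fp + c.fn + c.tn
    denom = (c.tp + c.fp) * (c.fn + c.tn) * (c.tp + c.fn) * (c.fp + c.tn)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no signal.
        return 0.0
    return n * (c.tp * c.tn - c.fp * c.fn) ** 2 / denom


def cpd(c: TermCounts) -> float:
    """Categorical proportional difference, assumed to be (A - B) / (A + B)."""
    with_term = c.tp + c.fp
    if with_term == 0:
        # The term never occurs, so there is nothing to score.
        return 0.0
    return (c.tp - c.fp) / with_term


def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of documents assigned to a category that belong to it."""
    assigned = true_pos + false_pos
    return true_pos / assigned if assigned else 0.0


def recall(true_pos: int, false_neg: int) -> float:
    """Fraction of a category's documents that the classifier found."""
    relevant = true_pos + false_neg
    return true_pos / relevant if relevant else 0.0


def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0
```

Under these assumptions, a feature selection method scores every term against every category with a function such as `chi_squared` or `cpd`, keeps the highest-scoring terms, and the resulting classifier is then judged with `precision` and `f_measure`. Category skew enters through the counts: in a small category `tp + fn` is small relative to `fp + tn`, which shifts each score differently.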

Author information

Authors and Affiliations

Department of Computer Science, University of Regina, Regina, Saskatchewan, S4S 0A2, Canada
Mondelle Simeon & Robert Hilderman

Editor information

Editors and Affiliations

Department of Computer Science, Irving K. Barber School of Arts and Sciences, University of British Columbia Okanagan, 3333 University Way, V1V 1V5, Kelowna, British Columbia, Canada
Yong Gao
School of Information Technology & Engineering, University of Ottawa, 800 King Edward Avenue, P.O. Box 450, Stn. A, K1N 6N5, Ottawa, Ontario, Canada
Nathalie Japkowicz

About this paper

Cite this paper

Simeon, M., Hilderman, R. (2009). An Empirical Study of Category Skew on Feature Selection for Text Categorization. In: Gao, Y., Japkowicz, N. (eds) Advances in Artificial Intelligence. Canadian AI 2009. Lecture Notes in Computer Science, vol 5549. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01818-3_35

DOI: https://doi.org/10.1007/978-3-642-01818-3_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01817-6
Online ISBN: 978-3-642-01818-3
eBook Packages: Computer Science, Computer Science (R0)
