Feature Selection in Text Mining

Mladenić, Dunja

doi:10.1007/978-0-387-30164-8_307

Synonyms

Dimensionality reduction on text via feature selection

Definition

In machine learning, feature selection is the process of selecting a subset of the features (dimensions) used to represent the data (see Feature Selection and Dimensionality Reduction). Feature selection can be seen as a part of data preprocessing, potentially followed by or coupled with feature construction (see Feature Construction in Text Mining), but it can also be coupled with the learning phase if it is embedded in the learning algorithm. Feature selection assumes that an original feature space has been defined that can be used to represent the data; the goal is to reduce its dimensionality by selecting a subset of the original features. The original feature space of the data is then mapped onto a new feature space. Feature selection in text mining is addressed here separately because of the specific properties of textual data compared with the data commonly addressed in machine learning.
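
As a minimal sketch of the definition above, the following Python example builds a bag-of-words representation (the original feature space), selects a subset of words, and maps each document onto the reduced space. The scoring criterion used here, document frequency, is an assumption chosen only for illustration and is not prescribed by this entry; the function names and the toy documents are likewise hypothetical.

```python
from collections import Counter


def bag_of_words(docs):
    """Represent each document as word counts (the original feature space)."""
    return [Counter(doc.lower().split()) for doc in docs]


def select_features(vectors, k):
    """Keep the k words that occur in the most documents.

    Document frequency is an illustrative scoring choice (an assumption);
    ties are broken alphabetically so the result is deterministic.
    """
    doc_freq = Counter()
    for vec in vectors:
        doc_freq.update(vec.keys())  # each word counts once per document
    ranked = sorted(doc_freq, key=lambda w: (-doc_freq[w], w))
    return ranked[:k]


def project(vectors, selected):
    """Map each document onto the new feature space of selected words."""
    keep = set(selected)
    return [{w: c for w, c in vec.items() if w in keep} for vec in vectors]


# Hypothetical toy corpus.
docs = [
    "text mining uses text features",
    "feature selection reduces text dimensionality",
    "mining text with selected features",
]
vectors = bag_of_words(docs)
selected = select_features(vectors, k=3)
reduced = project(vectors, selected)
print(selected)
print(reduced)
```

Because the selected features are a subset of the original ones, each reduced document vector remains directly interpretable in terms of the original words.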

Motivation and Background

Editor information

Editors and Affiliations

School of Computer Science and Engineering, University of New South Wales, Sydney, Australia, 2052
Claude Sammut
Faculty of Information Technology, Clayton School of Information Technology, Monash University, P.O. Box 63, Victoria, Australia, 3800
Geoffrey I. Webb

Cite this entry

Mladenić, D. (2011). Feature Selection in Text Mining. In: Sammut, C., Webb, G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_307

DOI: https://doi.org/10.1007/978-0-387-30164-8_307
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-30768-8
Online ISBN: 978-0-387-30164-8
eBook Packages: Computer Science, Reference Module Computer Science and Engineering