Definition
The term feature selection is used in machine learning for the process of selecting a subset of features (dimensions) used to represent the data (see Feature Selection and Dimensionality Reduction). Feature selection can be seen as a part of data pre-processing, potentially followed by or coupled with feature construction (see Feature Construction in Text Mining), but it can also be coupled with the learning phase if it is embedded in the learning algorithm. An assumption of feature selection is that we have defined an original feature space that can be used to represent the data, and that our goal is to reduce its dimensionality by selecting a subset of the original features. The original feature space of the data is then mapped onto a new feature space. Feature selection in text mining is addressed here separately because textual data differ in specific ways from the data commonly addressed in machine learning.
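As a rough illustration of the definition above, the following sketch builds a bag-of-words feature space from a few toy documents, selects a subset of the original features, and maps each document onto the reduced space. The scoring function (document frequency), the toy documents, and all function names are assumptions made for this example; the entry itself does not prescribe them.

```python
from collections import Counter


def bag_of_words(docs):
    """Map each document to a Counter of lowercase, whitespace-separated tokens."""
    return [Counter(doc.lower().split()) for doc in docs]


def select_features(vectors, k):
    """Return the k features that occur in the most documents.

    Document frequency is an illustrative scoring choice (an assumption of
    this sketch); ties are broken alphabetically to keep the result stable.
    """
    doc_freq = Counter()
    for vec in vectors:
        doc_freq.update(vec.keys())  # count each word once per document
    ranked = sorted(doc_freq, key=lambda w: (-doc_freq[w], w))
    return ranked[:k]


def project(vectors, selected):
    """Represent each document using only the selected features."""
    return [[vec.get(w, 0) for w in selected] for vec in vectors]


# Hypothetical toy corpus, used only for illustration.
docs = [
    "text mining selects features from text",
    "feature selection reduces the feature space",
    "text data has many features",
]

vectors = bag_of_words(docs)
original_space = sorted(set().union(*vectors))  # all distinct words
selected = select_features(vectors, k=3)        # subset of original features
reduced = project(vectors, selected)            # documents in the new space

print(len(original_space), "original features")
print("selected:", selected)
print("reduced vectors:", reduced)
```

In this toy corpus the original space has 13 dimensions (one per distinct word) and the reduced space has three. The selected features are a subset of the original ones, not newly constructed features, which is what distinguishes selection from feature construction.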
Motivation and Background
Recommended Reading
Apte, C., Damerau, F., & Weiss, S. M. (1994). Toward language independent automated learning of text categorization models. In Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval (pp. 23–30). Dublin, Ireland.
Bekkerman, R., El-Yaniv, R., Tishby, N., & Winter, Y. (2003). Distributional word clusters vs. words for text categorization. Journal of Machine Learning Research, 3, 1183–1208.
Bi, J., Bennett, K. P., Embrechts, M., Breneman, C. M., & Song, M. (2003). Dimensionality reduction via sparse support vector machines. Journal of Machine Learning Research, 3, 1229–1243.
Brank, J., Grobelnik, M., Milič-Frayling, N., & Mladenić, D. (2002). Feature selection using support vector machines. In A. Zanasi (Ed.), Data mining III (pp. 261–273). Southampton, UK: WIT.
Chakrabarti, S., Dom, B., Agrawal, R., & Raghavan, P. (1998). Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. The VLDB Journal, 7, 163–178.
Dhillon, I., Mallela, S., & Kumar, R. (2003). A divisive information-theoretic feature clustering algorithm for text classification. Journal of Machine Learning Research, 3, 1265–1287.
Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3, 1289–1305.
Globerson, A., & Tishby, N. (2003). Sufficient dimensionality reduction. Journal of Machine Learning Research, 3, 1307–1331.
Koller, D., & Sahami, M. (1997). Hierarchically classifying documents using very few words. In Proceedings of the 14th international conference on machine learning ICML’97 (pp. 170–178). Nashville, TN.
Lewis, D. D., & Ringuette, M. (1994). Comparison of two learning algorithms for text categorization. In Proceedings of the 3rd annual symposium on document analysis and information retrieval SDAIR-1994. Las Vegas, NV.
Mladenić, D. (1998). Feature subset selection in text-learning. In Proceedings of the 10th European conference on machine learning ECML’98. Chemnitz, Germany.
Mladenić, D. (2006). Feature selection for dimensionality reduction. In C. Saunders, S. Gunn, J. Shawe-Taylor, & M. Grobelnik (Eds.), Subspace, Latent Structure and Feature Selection: Statistical and Optimization Perspectives Workshop: Lecture notes in computer science (Vol. 3940, pp. 84–102). Berlin, Heidelberg: Springer.
Mladenić, D., & Grobelnik, M. (2003). Feature selection on hierarchy of web documents. Journal of Decision Support Systems, 35, 45–87.
Quinlan, J. R. (1993). Constructing decision trees. In C4.5: Programs for machine learning. San Francisco: Morgan Kaufmann Publishers.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of the 14th international conference on machine learning ICML’97 (pp. 412–420). Nashville, TN.
© 2011 Springer Science+Business Media, LLC
Cite this entry
Mladenić, D. (2011). Feature Selection in Text Mining. In: Sammut, C., Webb, G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_307
DOI: https://doi.org/10.1007/978-0-387-30164-8_307
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-30768-8
Online ISBN: 978-0-387-30164-8