Abstract
In this paper, we focus on feature coverage policies used for feature selection in the text classification domain. Two alternative policies are discussed and compared: corpus-based and class-based selection of features. We analyze pruning and keyword selection in detail by varying the parameters of the policies, and we identify their optimal usage patterns. In addition, by combining the optimal forms of these methods, we propose a novel two-stage feature selection approach. Experiments on three independent datasets show that the proposed method yields a statistically significant increase in classifier success rates over traditional methods.
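The difference between the two coverage policies can be illustrated with a minimal sketch. The abstract does not state the pruning threshold, the scoring metric, or how the two stages are chained, so everything here is an assumption made for illustration: pruning is taken to mean dropping terms below a minimum document frequency (`min_df`), terms are scored by raw within-class frequency as a stand-in metric, and all function names and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def prune(docs, min_df):
    """Assumed stage 1: keep terms that occur in at least min_df documents."""
    df = Counter(term for tokens, _ in docs for term in set(tokens))
    return {term for term, n in df.items() if n >= min_df}


def term_scores(docs, vocab):
    """Score each (class, term) pair by raw frequency (a stand-in metric)."""
    scores = defaultdict(Counter)
    for tokens, label in docs:
        for term in tokens:
            if term in vocab:
                scores[label][term] += 1
    return scores


def top_k(counter, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term for term, _ in ranked[:k]}


def corpus_based(scores, k):
    """Corpus-based policy: one global ranking over all classes."""
    total = Counter()
    for per_class in scores.values():
        total.update(per_class)
    return top_k(total, k)


def class_based(scores, k):
    """Class-based policy: top k terms per class, then the union."""
    selected = set()
    for per_class in scores.values():
        selected |= top_k(per_class, k)
    return selected


def two_stage(docs, min_df, k, policy):
    """Assumed two-stage pipeline: prune first, then select keywords."""
    vocab = prune(docs, min_df)
    return policy(term_scores(docs, vocab), k)


# Hypothetical toy corpus with a majority class ("sport") and a minority class.
docs = [
    (["goal", "match", "team", "goal"], "sport"),
    (["team", "match", "goal"], "sport"),
    (["match", "team", "goal", "win"], "sport"),
    (["market", "price", "rise"], "finance"),
    (["market", "price", "stock"], "finance"),
]

corpus_features = two_stage(docs, min_df=2, k=2, policy=corpus_based)
class_features = two_stage(docs, min_df=2, k=1, policy=class_based)
```

On this toy corpus, both policies keep two features, but the corpus-based ranking is dominated by the majority class, while the class-based policy reserves a slot for the minority class. This is only meant to show why the two policies can cover classes differently; it says nothing about which settings the paper found optimal.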
Acknowledgments
This work was supported by the Boğaziçi University Research Fund under the grant number 05A103D and the Turkish State Planning Organization (DPT) under the TAM Project, number 200K120610.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Özgür, L., Güngör, T. (2016). Two-Stage Feature Selection for Text Classification. In: Abdelrahman, O., Gelenbe, E., Gorbil, G., Lent, R. (eds) Information Sciences and Systems 2015. Lecture Notes in Electrical Engineering, vol 363. Springer, Cham. https://doi.org/10.1007/978-3-319-22635-4_30
DOI: https://doi.org/10.1007/978-3-319-22635-4_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22634-7
Online ISBN: 978-3-319-22635-4
eBook Packages: Engineering, Engineering (R0)