Abstract
In text classification tasks, the high dimensionality of the data leads to high computational complexity and, because of strong correlations between features, can reduce classification accuracy; feature selection is therefore necessary. In this paper, we propose a Discriminant Mutual Information (DMI) criterion for selecting features in text classification tasks. DMI measures the discriminant ability of features from two aspects. The first is the mutual information between a feature and the label information. The second is the discriminant correlation degree, based on the label information, between a feature and a target feature subset, which is used to judge whether the feature is redundant with respect to that subset. DMI is thus a redundancy-removing text feature selection method that takes discriminant information into account. To demonstrate the superiority of DMI, we compare it with state-of-the-art filter methods for text feature selection in experiments on two datasets, Reuters-21578 and WebKB, with K-Nearest Neighbor (KNN) and Support Vector Machine (SVM) as the subsequent classifiers. Experimental results show that the proposed DMI significantly improves classification accuracy and F1-score on both Reuters-21578 and WebKB.
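The abstract does not give the exact DMI formula, so the following is only a minimal sketch of the general scheme it describes: greedily select features with high mutual information with the label while penalizing label-aware redundancy with the features already chosen. The function name dmi_select and the class-conditional correlation used as the redundancy term are illustrative assumptions, not the paper's definitions; mutual_info_classif from scikit-learn supplies the relevance term, and a dense feature matrix (e.g., densified TF-IDF) is assumed.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def dmi_select(X, y, k):
    """Hypothetical sketch of a DMI-style relevance-minus-redundancy selector.

    Relevance: mutual information between each feature and the label y.
    Redundancy: mean absolute class-conditional correlation between a
    candidate feature and the features already selected (a label-aware
    stand-in for the paper's discriminant correlation degree).
    """
    n_features = X.shape[1]
    relevance = mutual_info_classif(X, y)      # I(feature; label)
    selected = [int(np.argmax(relevance))]     # start from the most relevant feature
    classes = np.unique(y)

    def discriminant_corr(i, j):
        # Average |Pearson correlation| of features i and j within each class,
        # so correlation is measured conditioned on the label information.
        corrs = []
        for c in classes:
            mask = (y == c)
            xi, xj = X[mask, i], X[mask, j]
            if mask.sum() > 1 and xi.std() > 0 and xj.std() > 0:
                corrs.append(abs(np.corrcoef(xi, xj)[0, 1]))
        return float(np.mean(corrs)) if corrs else 0.0

    while len(selected) < k:
        best, best_score = None, -np.inf
        for i in range(n_features):
            if i in selected:
                continue
            # Penalize candidates that are redundant with the current subset.
            redundancy = np.mean([discriminant_corr(i, j) for j in selected])
            score = relevance[i] - redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```

Swapping the class-conditional correlation for the paper's discriminant correlation degree would recover the DMI criterion proper; the greedy relevance-minus-redundancy loop itself follows the standard filter-method pattern shared by mRMR-style selectors.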
This work was supported in part by the Natural Science Foundation of the Jiangsu Higher Education Institutions of China under Grant No. 19KJA550002, by the Six Talent Peak Project of Jiangsu Province of China under Grant No. XYDXX-054, by the Priority Academic Program Development of Jiangsu Higher Education Institutions, and by the Collaborative Innovation Center of Novel Software Technology and Industrialization.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, J., Zhang, L. (2021). Discriminant Mutual Information for Text Feature Selection. In: Jensen, C.S., et al. (eds.) Database Systems for Advanced Applications. DASFAA 2021. Lecture Notes in Computer Science, vol. 12682. Springer, Cham. https://doi.org/10.1007/978-3-030-73197-7_9
DOI: https://doi.org/10.1007/978-3-030-73197-7_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-73196-0
Online ISBN: 978-3-030-73197-7
eBook Packages: Computer Science; Computer Science (R0)