
Discriminant Mutual Information for Text Feature Selection

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2021)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12682)

Abstract

In text classification tasks, the high dimensionality of the data leads to high computational complexity and, because of strong correlations between features, can reduce classification accuracy; feature selection is therefore necessary. In this paper, we propose a Discriminant Mutual Information (DMI) criterion to select features for text classification tasks. DMI measures the discriminant ability of features from two aspects. One is the mutual information between a feature and the label information. The other is the discriminant correlation between a feature and a target feature subset, computed with the label information, which indicates whether the feature is redundant given the target subset. DMI is thus a redundancy-removing text feature selection method that takes discriminant information into account. To demonstrate the superiority of DMI, we compare it with state-of-the-art filter methods for text feature selection on two datasets, Reuters-21578 and WebKB, with K-Nearest Neighbor (KNN) and Support Vector Machine (SVM) as the subsequent classifiers. Experimental results show that the proposed DMI significantly improves classification accuracy and F1-score on both Reuters-21578 and WebKB.
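The paper's exact DMI formula is not reproduced in this excerpt, so the sketch below only illustrates the general "relevance minus redundancy" idea the abstract describes: greedily pick the feature whose mutual information with the labels is high and whose mutual information with already-selected features is low. The function names, the discrete-MI estimator, and the use of a plain average over the selected subset are illustrative assumptions, not the authors' DMI criterion (which additionally conditions the redundancy term on label information).

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete arrays."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))  # joint probability
            if pxy > 0:
                px = np.mean(x == xv)
                py = np.mean(y == yv)
                mi += pxy * np.log(pxy / (px * py))
    return mi

def greedy_mi_selection(X, y, k):
    """Select k columns of X by relevance minus redundancy --
    an mRMR-style stand-in for the paper's DMI criterion."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(k):
        best, best_score = None, -np.inf
        for f in remaining:
            relevance = mutual_information(X[:, f], y)
            # Redundancy: average MI with the features already chosen.
            redundancy = (np.mean([mutual_information(X[:, f], X[:, s])
                                   for s in selected])
                          if selected else 0.0)
            score = relevance - redundancy
            if score > best_score:
                best, best_score = f, score
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: feature 1 duplicates feature 0, so a redundancy-aware
# criterion should pick features 0 and 2 rather than 0 and 1.
f0 = np.tile([0, 0, 1, 1], 5)
f2 = np.tile([0, 1, 0, 1], 5)
X = np.column_stack([f0, f0.copy(), f2])
y = f0 + 2 * f2
print(greedy_mi_selection(X, y, 2))  # [0, 2]
```

A pure relevance ranking (MI with the labels only) would score the duplicated columns 0 and 1 identically and keep both; the redundancy term is what pushes the selector toward the complementary feature 2.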

This work was supported in part by the Natural Science Foundation of the Jiangsu Higher Education Institutions of China under Grant No. 19KJA550002, by the Six Talent Peak Project of Jiangsu Province of China under Grant No. XYDXX-054, by the Priority Academic Program Development of Jiangsu Higher Education Institutions, and by the Collaborative Innovation Center of Novel Software Technology and Industrialization.



Author information

Correspondence to Li Zhang.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, J., Zhang, L. (2021). Discriminant Mutual Information for Text Feature Selection. In: Jensen, C.S., et al. (eds.) Database Systems for Advanced Applications. DASFAA 2021. Lecture Notes in Computer Science, vol 12682. Springer, Cham. https://doi.org/10.1007/978-3-030-73197-7_9

  • DOI: https://doi.org/10.1007/978-3-030-73197-7_9

  • Publisher: Springer, Cham

  • Print ISBN: 978-3-030-73196-0

  • Online ISBN: 978-3-030-73197-7
