
Feature selection for multi-label classification by maximizing full-dimensional conditional mutual information

Abstract

Conditional mutual information (CMI) maximization is a promising criterion for stepwise, computationally efficient feature selection, but it is difficult to apply in its full form because of imprecise probability estimation and a heavy computational load. Many dimension-reduced CMI-based and mutual information (MI)-based methods have been reported to achieve state-of-the-art classification performance. However, these methods introduce model deviations into the CMI and MI formulations during dimension reduction. In this paper, we start from the full-dimensional CMI to address the feature selection problem, so that the full inter-feature and feature-label mutual information is retained when new features are selected. The cost function is approximated and simplified from a mathematical perspective to overcome the difficulties of maximizing the original full-dimensional CMI. A relationship is established between the proposed feature selection criterion and one based on Hilbert-Schmidt independence, which explains qualitatively how the new criterion achieves relevance maximization and redundancy minimization simultaneously. Experiments on real-world datasets demonstrate the superiority of the proposed method over existing ones.
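
To make the stepwise criterion concrete, the sketch below performs greedy forward selection, scoring each candidate feature by its mutual information with the label conditioned on the features already chosen. It is only an illustration of the general CMI-maximization idea, not the authors' algorithm: it uses a naive histogram (plug-in) estimator on pre-discretized data, conditions only on the selected subset rather than the full-dimensional CMI treated in the paper, handles a single label, and the function names and toy data are hypothetical.

```python
# Illustrative sketch only (not the paper's method): greedy forward
# feature selection that maximizes a plug-in estimate of the
# conditional mutual information I(X_j; y | selected) on discretized data.
import numpy as np


def cond_mutual_info(x, y, z):
    """Plug-in estimate of I(x; y | z) for discrete 1-D arrays.

    z encodes the joint state of the conditioning feature set."""
    cmi = 0.0
    for zv in np.unique(z):
        mask = (z == zv)
        pz = mask.mean()                                 # p(z)
        xs, ys = x[mask], y[mask]
        for xv in np.unique(xs):
            for yv in np.unique(ys):
                pxy = np.mean((xs == xv) & (ys == yv))   # p(x, y | z)
                px = np.mean(xs == xv)                   # p(x | z)
                py = np.mean(ys == yv)                   # p(y | z)
                if pxy > 0:
                    cmi += pz * pxy * np.log(pxy / (px * py))
    return cmi


def joint_state(features):
    """Encode several discrete columns as one joint discrete variable."""
    if features.shape[1] == 0:
        return np.zeros(features.shape[0], dtype=int)
    _, codes = np.unique(features, axis=0, return_inverse=True)
    return codes.reshape(-1)


def greedy_cmi_selection(X, y, k):
    """Select k feature indices by greedily maximizing I(X_j; y | selected)."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        z = joint_state(X[:, selected])
        scores = {j: cond_mutual_info(X[:, j], y, z) for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy usage: 200 samples, 6 pre-discretized features, one binary label
# that depends on features 0 and 2 (hypothetical data).
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 6))
y = (X[:, 0] + X[:, 2] > 2).astype(int)
print(greedy_cmi_selection(X, y, 2))        # expected to pick 0 and 2
```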

Notes

  1. In this paper, we consider the multi-label problem to avoid loss of generality. The single-label problem is the special case obtained by setting K = 1 and Y = y.

  2. A multiplier of K is introduced to weight the inter-feature mutual information term for independent labels, since K-label models are considered in this paper instead of the single-label model in [27].

  3. Available at: http://pubchem.ncbi.nlm.nih.gov

  4. Assay IDs include: 1416 (PERK), 1446 (JAK2), 1481 (ATPase), 1531 (MEK)

  5. More detailed descriptions of the two datasets can be found in [16].

References

  1. Bache K, Lichman M (2013) UCI machine learning repository

  2. Bennasar M, Hicks Y, Setchi R (2015) Feature selection using joint mutual information maximisation. Expert Syst Appl 42(22):8520–8532

  3. Blum AL, Langley P (1997) Selection of relevant features and examples in machine learning. Artif Intell 97(1):245–271

  4. Brown G, Pocock A, Zhao MJ, Luján M (2012) Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. J Mach Learn Res 13:27–66

  5. Bu Z, Li HJ, Zhang C, Cao J, Li A, Shi Y (2019) Graph k-means based on leader identification, dynamic game and opinion dynamics. IEEE Trans Knowl Data Eng. https://doi.org/10.1109/TKDE.2019.2903712

  6. Chandrashekar G, Sahin F (2014) A survey on feature selection methods. Comput Electr Eng 40(1):16–28

  7. Chen Y, Bi J, Wang J (2006) MILES: multiple-instance learning via embedded instance selection. IEEE Trans Pattern Anal Mach Intell 28(12):1931–1947

  8. Fleuret F (2004) Fast binary feature selection with conditional mutual information. J Mach Learn Res 5:1531–1555

  9. Gretton A, Bousquet O, Smola A, Schölkopf B (2005) Measuring statistical dependence with Hilbert-Schmidt norms. In: International conference on algorithmic learning theory. Springer, pp 63–77

  10. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182

  11. Jain AK, Duin RPW, Mao J (2000) Statistical pattern recognition: a review. IEEE Trans Pattern Anal Mach Intell 22(1):4–37

  12. Janecek A, Gansterer WN, Demel M, Ecker G (2008) On the relationship between feature selection and classification accuracy. FSDM 4:90–105

  13. Kalousis A, Prados J, Hilario M (2007) Stability of feature selection algorithms: a study on high-dimensional spaces. Knowl Inf Syst 12(1):95–116

  14. Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97(1-2):273–324

  15. Koller D, Sahami M (1996) Toward optimal feature selection. Technical report, Stanford InfoLab

  16. Kong X, Yu PS (2010) Multi-label feature selection for graph classification. In: 2010 IEEE 10th international conference on data mining (ICDM). IEEE, pp 274–283

  17. Kwak N, Choi CH (2002) Input feature selection for classification problems. IEEE Trans Neural Netw 13(1):143–159

  18. Li HJ, Bu Z, Wang Z, Cao J (2020) Dynamical clustering in electronic commerce systems via optimization and leadership expansion. IEEE, pp 5327–5334

  19. Liu H, Motoda H, Setiono R, Zhao Z (2010) Feature selection: an ever evolving frontier in data mining. In: Feature selection in data mining, pp 4–13

  20. Liu H, Sun J, Liu L, Zhang H (2009) Feature selection with dynamic mutual information. Pattern Recogn 42(7):1330–1339

  21. Liu H, Yu L (2005) Toward integrating feature selection algorithms for classification and clustering. IEEE Trans Knowl Data Eng 17(4):491–502

  22. Yamada M, Jitkrittum W, Sigal L, Xing EP, Sugiyama M (2014) High-dimensional feature selection by feature-wise kernelized lasso. Neural Comput 26(1):185–207

  23. Mitra P, Murthy C, Pal SK (2002) Unsupervised feature selection using feature similarity. IEEE Trans Pattern Anal Mach Intell 24(3):301–312

  24. Nakariyakul S, Casasent DP (2009) An improvement on floating search algorithms for feature subset selection. Pattern Recogn 42(9):1932–1940

  25. Neumann J, Schnörr C, Steidl G (2005) Combined SVM-based feature selection and classification. Mach Learn 61(1):129–150

  26. Pappu V, Pardalos PM (2014) High-dimensional data classification. In: Clusters, orders, and trees: methods and applications. Springer, pp 119–150

  27. Peng H, Long F, Ding C (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238

  28. Pudil P, Novovičová J, Kittler J (1994) Floating search methods in feature selection. Pattern Recogn Lett 15(11):1119–1125

  29. Saeys Y, Inza I, Larrañaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23(19):2507–2517

  30. Song L, Smola A, Gretton A, Bedo J, Borgwardt K (2012) Feature selection via dependence maximization. J Mach Learn Res 13:1393–1434

  31. Sugiyama M (2012) Machine learning with squared-loss mutual information. Entropy 15(1):80–112

  32. Suzuki T, Sugiyama M, Kanamori T, Sese J (2009) Mutual information estimation reveals global associations between stimuli and biological processes. BMC Bioinform 10(1):S52

  33. Torkkola K (2003) Feature extraction by non-parametric mutual information maximization. J Mach Learn Res 3:1415–1438

  34. Tu CJ, Chuang LY, Chang JY, Yang CH et al (2007) Feature selection using PSO-SVM. International Journal of Computer Science

  35. Unler A, Murat A, Chinnam RB (2011) mr2PSO: a maximum relevance minimum redundancy feature selection method based on swarm intelligence for support vector machine classification. Inf Sci 181(20):4625–4641

  36. Vergara JR, Estévez PA (2014) A review of feature selection methods based on mutual information. Neural Comput Appl 24(1):175–186

  37. Wang J, Wei JM, Yang Z, Wang SQ (2017) Feature selection by maximizing independent classification information. IEEE Trans Knowl Data Eng 29(4):828–841

  38. Wang T, Lu J, Zhang G (2018) Two-stage fuzzy multiple kernel learning based on Hilbert-Schmidt independence criterion. IEEE, pp 1–1

  39. Yan X, Cheng H, Han J, Yu PS (2008) Mining significant graph patterns by leap search. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data. ACM, pp 433–444

  40. Yan X, Han J (2002) gSpan: graph-based substructure pattern mining. In: Proceedings of the 2002 IEEE international conference on data mining (ICDM). IEEE, pp 721–724

  41. Yu L, Liu H (2003) Feature selection for high-dimensional data: a fast correlation-based filter solution. In: ICML, vol 3, pp 856–863

  42. Zhang ML, Zhou ZH (2007) ML-KNN: a lazy learning approach to multi-label learning. Pattern Recogn 40(7):2038–2048

  43. Zhang ML, Zhou ZH (2014) A review on multi-label learning algorithms. IEEE Trans Knowl Data Eng 26(8):1819–1837

  44. Zhang Y, Zhou ZH (2010) Multilabel dimensionality reduction via dependence maximization. ACM Trans Knowl Discov Data (TKDD) 4(3):14

  45. Zhou Y, Jin R, Hoi SC (2010) Exclusive lasso for multi-task feature selection. In: AISTATS, vol 9, pp 988–995


Author information

Corresponding author

Correspondence to Zhi-Chao Sha.


About this article


Cite this article

Sha, ZC., Liu, ZM., Ma, C. et al. Feature selection for multi-label classification by maximizing full-dimensional conditional mutual information. Appl Intell 51, 326–340 (2021). https://doi.org/10.1007/s10489-020-01822-0

