
On the Stability of Feature Selection in the Presence of Feature Correlations

  • Conference paper
  • In: Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2019)

Abstract

Feature selection is central to modern data science. The ‘stability’ of a feature selection algorithm refers to the sensitivity of its choices to small changes in training data. This is, in effect, the robustness of the chosen features. This paper considers the estimation of stability when we expect strong pairwise correlations, otherwise known as feature redundancy. We demonstrate that existing measures are inappropriate here, as they systematically underestimate the true stability, giving an overly pessimistic view of a feature set. We propose a new statistical measure which overcomes this issue, and generalises previous work.
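
To make the abstract's claim concrete, below is a minimal sketch of the stability estimator of Nogueira et al. [14], representative of the existing measures discussed here. The sketch is a Python illustration of ours, not the authors' released code; the selection matrix Z and the toy example are illustrative assumptions.

```python
import numpy as np

def stability(Z):
    """Stability estimator of Nogueira et al. [14].

    Z is an (M, d) binary selection matrix: Z[i, f] = 1 iff feature f
    was selected on the i-th of M resampled training sets. Returns a
    value <= 1, where 1 means the identical subset was always chosen
    and values near 0 indicate chance-level agreement.
    """
    Z = np.asarray(Z, dtype=float)
    M, d = Z.shape
    p_f = Z.mean(axis=0)                    # selection frequency of each feature
    s2_f = M / (M - 1) * p_f * (1 - p_f)    # unbiased variance of each column
    k_bar = Z.sum(axis=1).mean()            # average number of selected features
    return 1 - s2_f.mean() / ((k_bar / d) * (1 - k_bar / d))

# Toy illustration: two runs pick {X1, X2}, two runs pick {X1, X3}.
Z = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1]])
print(stability(Z))  # 0.0
```

If X2 and X3 are strongly correlated copies of the same signal, the four selected subsets are practically equivalent, yet the estimator returns exactly 0: this is the systematic underestimation under feature redundancy that motivates the proposed measure.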

Notes

  1. The software related to this paper is available at: https://github.com/sechidis.

References

  1. Allison, P.D.: Missing Data. Sage University Papers Series on Quantitative Applications in the Social Sciences, no. 07-136. Sage, Thousand Oaks (2001)

  2. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001). https://doi.org/10.1023/A:1010933404324

  3. Brown, G., Pocock, A., Zhao, M.J., Luján, M.: Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. J. Mach. Learn. Res. 13(1), 27–66 (2012)

  4. Cohen, J.: Statistical Power Analysis for the Behavioral Sciences, 2nd edn. Routledge Academic, Abingdon (1988)

  5. Dunne, K., Cunningham, P., Azuaje, F.: Solutions to instability problems with sequential wrapper-based approaches to feature selection. Technical report TCD-CS-2002-28, Trinity College Dublin, School of Computer Science (2002)

  6. Fleuret, F.: Fast binary feature selection with conditional mutual information. J. Mach. Learn. Res. 5, 1531–1555 (2004)

  7. Fonseca, C.M., Fleming, P.J.: On the performance assessment and comparison of stochastic multiobjective optimizers. In: Voigt, H.-M., Ebeling, W., Rechenberg, I., Schwefel, H.-P. (eds.) PPSN 1996. LNCS, vol. 1141, pp. 584–593. Springer, Heidelberg (1996). https://doi.org/10.1007/3-540-61723-X_1022

  8. Friedman, J.H.: Greedy function approximation: a gradient boosting machine. Ann. Stat. 29(5), 1189–1232 (2001)

  9. Kalousis, A., Prados, J., Hilario, M.: Stability of feature selection algorithms. In: IEEE International Conference on Data Mining, pp. 218–225 (2005)

  10. Kalousis, A., Prados, J., Hilario, M.: Stability of feature selection algorithms: a study on high-dimensional spaces. Knowl. Inf. Syst. 12, 95–116 (2007). https://doi.org/10.1007/s10115-006-0040-8

  11. Kuncheva, L.I.: A stability index for feature selection. In: Artificial Intelligence and Applications (2007)

  12. Lipkovich, I., Dmitrienko, A., D'Agostino Sr., R.B.: Tutorial in biostatistics: data-driven subgroup identification and analysis in clinical trials. Stat. Med. 36(1), 136–196 (2017)

  13. Mok, T.S., et al.: Gefitinib or carboplatin/paclitaxel in pulmonary adenocarcinoma. N. Engl. J. Med. 361(10), 947–957 (2009)

  14. Nogueira, S., Sechidis, K., Brown, G.: On the stability of feature selection algorithms. J. Mach. Learn. Res. 18(174), 1–54 (2018)

  15. Peng, H., Long, F., Ding, C.: Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27(8), 1226–1238 (2005)

  16. Sechidis, K., Papangelou, K., Metcalfe, P., Svensson, D., Weatherall, J., Brown, G.: Distinguishing prognostic and predictive biomarkers: an information theoretic approach. Bioinformatics 34(19), 3365–3376 (2018)

  17. Shi, L., Reid, L.H., Jones, W.D., et al.: The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat. Biotechnol. 24(9), 1151–1161 (2006)

  18. Yang, H.H., Moody, J.: Data visualization and feature selection: new algorithms for non-Gaussian data. In: Neural Information Processing Systems, pp. 687–693 (1999)

  19. Yu, L., Ding, C., Loscalzo, S.: Stable feature selection via dense feature groups. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 803–811. ACM (2008)

  20. Zhang, M., et al.: Evaluating reproducibility of differential expression discoveries in microarray studies by considering correlated molecular changes. Bioinformatics 25(13), 1662–1668 (2009)

Acknowledgements

KS was funded by the AstraZeneca Data Science Fellowship at the University of Manchester. KP was supported by the EPSRC through the Centre for Doctoral Training Grant [EP/1038099/1]. GB was supported by the EPSRC LAMBDA project [EP/N035127/1].

Author information

Correspondence to Gavin Brown.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 266 KB)

A IPASS description

The IPASS study [13] was a Phase III, multi-centre, randomised, open-label, parallel-group study comparing gefitinib (Iressa, AstraZeneca) with carboplatin (Paraplatin, Bristol-Myers Squibb) plus paclitaxel (Taxol, Bristol-Myers Squibb) as first-line treatment in clinically selected patients in East Asia with non-small-cell lung cancer (NSCLC). 1217 patients were randomised in a balanced 1:1 ratio between the treatment arms, and the primary endpoint was progression-free survival (PFS); for full details of the trial see [13]. For the purposes of our work we model PFS as a Bernoulli endpoint, neglecting its time-to-event nature. We analysed the data at \(78\%\) maturity, i.e. when 950 subjects had experienced progression events.

The covariates used in the IPASS study are shown in Table 5. The following covariates have missing observations (proportions shown in parentheses): \(X_5\) (0.4%), \(X_{12}\) (0.2%), \(X_{13}\) (0.7%), \(X_{14}\) (0.7%), \(X_{16}\) (2%), \(X_{17}\) (0.3%), \(X_{18}\) (1%), \(X_{19}\) (1%), \(X_{20}\) (0.3%), \(X_{21}\) (0.3%), \(X_{22}\) (0.3%), \(X_{23}\) (0.3%). Following Lipkovich et al. [12], for patients with a missing value in a biomarker X we create an additional category for the missing values, a procedure known as the missing indicator method [1].

Table 5. Covariates used in the IPASS clinical trial.
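
The missing indicator step above is straightforward to implement. The following is a minimal sketch assuming the covariates live in a pandas DataFrame with hypothetical column names X5, X12, and so on; the paper itself does not specify an implementation.

```python
import pandas as pd

def add_missing_category(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Missing indicator method [1]: instead of imputing or dropping
    rows, encode missingness in each listed biomarker as its own
    explicit category."""
    out = df.copy()
    for col in columns:
        # Cast to object so numeric biomarkers can also hold the label.
        out[col] = out[col].astype("object")
        out.loc[out[col].isna(), col] = "Missing"
    return out

# Hypothetical usage on the IPASS covariates with missing observations:
# ipass = add_missing_category(ipass, ["X5", "X12", "X13", "X14", "X16", "X17",
#                                      "X18", "X19", "X20", "X21", "X22", "X23"])
```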

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Sechidis, K., Papangelou, K., Nogueira, S., Weatherall, J., Brown, G. (2020). On the Stability of Feature Selection in the Presence of Feature Correlations. In: Brefeld, U., Fromont, E., Hotho, A., Knobbe, A., Maathuis, M., Robardet, C. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019. Lecture Notes in Computer Science, vol. 11906. Springer, Cham. https://doi.org/10.1007/978-3-030-46150-8_20

  • DOI: https://doi.org/10.1007/978-3-030-46150-8_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-46149-2

  • Online ISBN: 978-3-030-46150-8
