Abstract
Feature selection is central to modern data science. The ‘stability’ of a feature selection algorithm refers to the sensitivity of its choices to small changes in training data. This is, in effect, the robustness of the chosen features. This paper considers the estimation of stability when we expect strong pairwise correlations, otherwise known as feature redundancy. We demonstrate that existing measures are inappropriate here, as they systematically underestimate the true stability, giving an overly pessimistic view of a feature set. We propose a new statistical measure which overcomes this issue, and generalises previous work.
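For context, the stability measure of Nogueira et al. [23], which this work generalises, can be sketched in a few lines. This is an illustrative implementation of that earlier measure only (function name and toy selection matrix are our own), not the correlation-aware measure proposed in this paper:

```python
import numpy as np

def stability(Z):
    """Stability estimator of Nogueira et al. (2018) [23].

    Z is an (M, d) binary matrix over M repeated selection runs and
    d features: Z[i, f] = 1 iff feature f was selected in run i.
    """
    Z = np.asarray(Z, dtype=float)
    M, d = Z.shape
    p = Z.mean(axis=0)                     # selection frequency of each feature
    k_bar = Z.sum(axis=1).mean()           # average selected-subset size
    s2 = M / (M - 1) * p * (1 - p)         # unbiased per-feature variance
    return 1.0 - s2.mean() / ((k_bar / d) * (1 - k_bar / d))

# Identical selections across runs give maximal stability:
print(stability([[1, 1, 0, 0],
                 [1, 1, 0, 0],
                 [1, 1, 0, 0]]))           # 1.0
```

Because this estimator treats features independently, two runs that pick different members of a strongly correlated pair are scored as disagreeing, which is exactly the underestimation in the presence of redundancy that the paper addresses.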
Notes
1. The software related to this paper is available at: https://github.com/sechidis.
References
Allison, P.D.: Missing Data. Sage University Papers Series on Quantitative Applications in the Social Sciences, pp. 07–136. Sage, Thousand Oaks (2001)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001). https://doi.org/10.1023/A:1010933404324
Brown, G., Pocock, A., Zhao, M.J., Luján, M.: Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. J. Mach. Learn. Res. 13(1), 27–66 (2012)
Cohen, J.: Statistical Power Analysis for the Behavioral Sciences, 2nd edn. Routledge Academic, Abingdon (1988)
Dunne, K., Cunningham, P., Azuaje, F.: Solutions to instability problems with sequential wrapper-based approaches to feature selection. Technical report, TCD-CS-2002-28, Trinity College Dublin, School of Computer Science (2002)
Fleuret, F.: Fast binary feature selection with conditional mutual information. J. Mach. Learn. Res. (JMLR) 5, 1531–1555 (2004)
Fonseca, C.M., Fleming, P.J.: On the performance assessment and comparison of stochastic multiobjective optimizers. In: Voigt, H.-M., Ebeling, W., Rechenberg, I., Schwefel, H.-P. (eds.) PPSN 1996. LNCS, vol. 1141, pp. 584–593. Springer, Heidelberg (1996). https://doi.org/10.1007/3-540-61723-X_1022
Friedman, J.H.: Greedy function approximation: a gradient boosting machine. Ann. Stat. 29(5), 1189–1232 (2001)
Kalousis, A., Prados, J., Hilario, M.: Stability of feature selection algorithms. In: IEEE International Conference on Data Mining, pp. 218–255 (2005)
Kalousis, A., Prados, J., Hilario, M.: Stability of feature selection algorithms: a study on high-dimensional spaces. Knowl. Inf. Syst. 12, 95–116 (2007). https://doi.org/10.1007/s10115-006-0040-8
Kuncheva, L.I.: A stability index for feature selection. In: Artificial Intelligence and Applications (2007)
Lipkovich, I., Dmitrienko, A., D’Agostino Sr., R.B.: Tutorial in biostatistics: data-driven subgroup identification and analysis in clinical trials. Stat. Med. 36(1), 136–196 (2017)
Mok, T.S., et al.: Gefitinib or carboplatin/paclitaxel in pulmonary adenocarcinoma. N. Engl. J. Med. 361(10), 947–957 (2009)
Nogueira, S., Sechidis, K., Brown, G.: On the stability of feature selection algorithms. J. Mach. Learn. Res. 18(174), 1–54 (2018)
Peng, H., Long, F., Ding, C.: Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 27(8), 1226–1238 (2005)
Sechidis, K., Papangelou, K., Metcalfe, P., Svensson, D., Weatherall, J., Brown, G.: Distinguishing prognostic and predictive biomarkers: an information theoretic approach. Bioinformatics 34(19), 3365–3376 (2018)
Shi, L., Reid, L.H., Jones, W.D., et al.: The MicroArray Quality Control (MAQC) project shows inter- and intra-platform reproducibility of gene expression measurements. Nat. Biotechnol. 24(9), 1151–1161 (2006)
Yang, H.H., Moody, J.: Data visualization and feature selection: new algorithms for non-gaussian data. In: Neural Information Processing Systems, pp. 687–693 (1999)
Yu, L., Ding, C., Loscalzo, S.: Stable feature selection via dense feature groups. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 803–811. ACM (2008)
Zhang, M., et al.: Evaluating reproducibility of differential expression discoveries in microarray studies by considering correlated molecular changes. Bioinformatics 25(13), 1662–1668 (2009)
Acknowledgements
KS was funded by the AstraZeneca Data Science Fellowship at the University of Manchester. KP was supported by the EPSRC through the Centre for Doctoral Training Grant [EP/1038099/1]. GB was supported by the EPSRC LAMBDA project [EP/N035127/1].
A IPASS description
The IPASS study [13] was a Phase III, multi-centre, randomised, open-label, parallel-group study comparing gefitinib (Iressa, AstraZeneca) with carboplatin (Paraplatin, Bristol-Myers Squibb) plus paclitaxel (Taxol, Bristol-Myers Squibb) as first-line treatment in clinically selected patients in East Asia with NSCLC. 1217 patients were randomised (1:1) between the two treatment arms, and the primary endpoint was progression-free survival (PFS); for full details of the trial see [13]. For the purposes of our work we model PFS as a Bernoulli endpoint, neglecting its time-to-event nature. We analysed the data at \(78\%\) maturity, when 950 subjects had experienced progression events.
The covariates used in the IPASS study are shown in Table 5. The following covariates have missing observations (proportions shown in parentheses): \(X_5\) (0.4%), \(X_{12}\) (0.2%), \(X_{13}\) (0.7%), \(X_{14}\) (0.7%), \(X_{16}\) (2%), \(X_{17}\) (0.3%), \(X_{18}\) (1%), \(X_{19}\) (1%), \(X_{20}\) (0.3%), \(X_{21}\) (0.3%), \(X_{22}\) (0.3%), \(X_{23}\) (0.3%). Following Lipkovich et al. [12], for patients with missing values in a biomarker \(X\) we create an additional category, a procedure known as the missing-indicator method [1].
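The missing-indicator recoding described above can be sketched as follows; the column values and function name are hypothetical, for illustration only:

```python
# Missing-indicator method [1]: recode missing biomarker values as their
# own category, so no patients are dropped from the analysis.
def add_missing_indicator(values, missing_token="missing"):
    """Replace None entries with an explicit 'missing' category."""
    return [missing_token if v is None else v for v in values]

x5 = ["mutant", None, "wild-type", None]  # toy biomarker column with gaps
x5_encoded = add_missing_indicator(x5)    # no missing entries remain
print(x5_encoded)                         # ['mutant', 'missing', 'wild-type', 'missing']
```

The encoded column can then be treated as an ordinary categorical covariate by any downstream feature-selection procedure.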
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Sechidis, K., Papangelou, K., Nogueira, S., Weatherall, J., Brown, G. (2020). On the Stability of Feature Selection in the Presence of Feature Correlations. In: Brefeld, U., Fromont, E., Hotho, A., Knobbe, A., Maathuis, M., Robardet, C. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019. Lecture Notes in Computer Science(), vol 11906. Springer, Cham. https://doi.org/10.1007/978-3-030-46150-8_20
Print ISBN: 978-3-030-46149-2
Online ISBN: 978-3-030-46150-8