Abstract
The paper presents an algorithm to rank features in "small number of samples, large dimensionality" problems according to probabilistic feature relevance, a novel definition of feature relevance. Probabilistic feature relevance, defined as expected weak relevance, is introduced to address the problem of estimating conventional feature relevance in data settings where the number of samples is much smaller than the number of features. The resulting ranking algorithm relies on a blocking approach for estimation and consists of creating a large number of identical configurations in order to measure the conditional information of each feature in a paired manner. Its implementation can be made embarrassingly parallel in the case of very large n. A number of experiments on simulated and real data confirm the interest of the approach.
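To make the idea concrete, the following is a minimal illustrative sketch (not the authors' actual implementation) of a blocked, paired ranking scheme for discrete data: each "block" fixes one randomly drawn conditioning feature, the conditional mutual information I(x_i; y | z_b) is computed for every feature under that same block, and features are ranked by their score accumulated over blocks. The function names and the choice of a single conditioning feature per block are simplifying assumptions for illustration.

```python
import numpy as np

def mutual_info(x, y):
    """Plug-in empirical mutual information I(x; y) for discrete arrays."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                mi += pxy * np.log(pxy / (np.mean(x == xv) * np.mean(y == yv)))
    return mi

def cond_mutual_info(x, y, z):
    """Empirical conditional mutual information I(x; y | z),
    averaging I(x; y) within each stratum of the conditioning variable z."""
    cmi = 0.0
    for zv in np.unique(z):
        mask = z == zv
        cmi += mask.mean() * mutual_info(x[mask], y[mask])
    return cmi

def probabilistic_relevance_ranking(X, y, n_blocks=50, rng=None):
    """Hypothetical sketch of a blocking strategy: in each block, one shared
    conditioning feature z_b is drawn at random, and every feature's
    conditional information I(x_i; y | z_b) is measured under that identical
    configuration (a paired design). Features are ranked by total score."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    scores = np.zeros(d)
    for _ in range(n_blocks):
        b = rng.integers(d)  # one shared conditioning feature per block
        for i in range(d):
            if i != b:
                scores[i] += cond_mutual_info(X[:, i], y, X[:, b])
    return np.argsort(-scores)  # feature indices, most relevant first
```

Because each block is processed independently, the outer loop over blocks can be distributed across workers with no coordination, which is what makes the approach embarrassingly parallel for large sample sizes.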
Keywords
- Proposal Distribution
- Ranking Algorithm
- Feature Selection Technique
- Probabilistic Relevance
- Conditional Mutual Information
Notes
- 1. Boldface denotes random variables.
- 2. All details on the datasets (number of samples, number of variables, number of classes) are available at https://github.com/ramhiser/datamicroarray/blob/master/README.md.
Acknowledgements
The author acknowledges the support of the “BruFence: Scalable machine learning for automating defense system” project (RBC/14 PFS-ICT 5), funded by the Institute for the encouragement of Scientific Research and Innovation of Brussels (INNOVIRIS, Brussels Region, Belgium).
Copyright information
© 2016 Springer International Publishing AG
Cite this paper
Bontempi, G. (2016). A Blocking Strategy for Ranking Features According to Probabilistic Relevance. In: Pardalos, P., Conca, P., Giuffrida, G., Nicosia, G. (eds) Machine Learning, Optimization, and Big Data. MOD 2016. Lecture Notes in Computer Science, vol 10122. Springer, Cham. https://doi.org/10.1007/978-3-319-51469-7_5
DOI: https://doi.org/10.1007/978-3-319-51469-7_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-51468-0
Online ISBN: 978-3-319-51469-7