Abstract
We propose a novel feature selection filter for supervised learning which relies on the efficient estimation of the mutual information between a high-dimensional set of features and the classes. We bypass explicit estimation of the probability density function by means of the entropic-graphs approximation of the Rényi entropy, from which the Shannon entropy is subsequently approximated. The complexity therefore depends on the number of patterns/samples rather than on the number of dimensions, and the curse of dimensionality is circumvented. We show that it is then possible to outperform algorithms which rank features individually, as well as a greedy algorithm based on the maximal-relevance minimal-redundancy criterion. We successfully test our method in the contexts of both image classification and microarray data classification. For most of the tested data sets, we obtain better classification results than those reported in the literature.
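The entropic-graphs estimator mentioned in the abstract builds a minimal spanning tree (MST) over the samples and uses its total edge length to approximate the Rényi entropy without estimating the density. The sketch below follows the general form of the Hero–Michel MST estimator; the bias constant `beta` and the choice of `alpha` are illustrative assumptions, not values taken from this paper:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def renyi_entropy_mst(X, alpha=0.5, beta=1.0):
    """MST-based (entropic spanning graph) estimate of the Renyi entropy.

    With gamma = d * (1 - alpha) and L_gamma the sum of MST edge lengths
    raised to gamma, the estimator has the form
        H_alpha ~ (log(L_gamma / n**alpha) - log(beta)) / (1 - alpha),
    where beta is a density-independent bias constant (assumed given).
    alpha must lie in (0, 1) so that gamma > 0.
    """
    n, d = X.shape
    gamma = d * (1.0 - alpha)
    dists = squareform(pdist(X))              # pairwise Euclidean distances
    mst = minimum_spanning_tree(dists)        # sparse matrix with n-1 edges
    L_gamma = np.sum(mst.data ** gamma)       # gamma-weighted MST length
    return (np.log(L_gamma / n ** alpha) - np.log(beta)) / (1.0 - alpha)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # 200 samples, 5 features
print(renyi_entropy_mst(X, alpha=0.5))
```

Because only pairwise distances between the n samples are needed, the cost grows with the number of samples rather than with the dimensionality, which is what makes the approach viable for high-dimensional feature sets.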
Notes
The leave-one-out cross-validation (LOOCV) measure is used when the number of samples is so small that a separate test set cannot be built. It consists of building all possible classifiers, each time leaving exactly one sample out for testing. Note that in this work it is used only for evaluating the results, not as a selection criterion.
Datasets can be downloaded from the Broad Institute http://www.broad.mit.edu/, Stanford Genomic Resources http://genome-www.stanford.edu/, and Princeton University http://microarray.princeton.edu/.
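The LOOCV procedure described in the note above can be sketched as follows; the 1-nearest-neighbour classifier is an illustrative stand-in, not the classifier used in the paper:

```python
import numpy as np

def loocv_accuracy(X, y):
    """Leave-one-out cross-validation with a 1-nearest-neighbour classifier.

    Each sample is held out once; the classifier is fit on the remaining
    n - 1 samples and tested on the held-out one.  The resulting accuracy
    evaluates a feature subset; it is not used as a selection criterion.
    """
    n = len(y)
    correct = 0
    for i in range(n):
        mask = np.arange(n) != i                       # leave sample i out
        dists = np.linalg.norm(X[mask] - X[i], axis=1)
        pred = y[mask][np.argmin(dists)]               # nearest neighbour's label
        correct += int(pred == y[i])
    return correct / n

# two well-separated classes, so LOOCV accuracy should be perfect
rng = np.random.default_rng(1)
X = np.vstack([np.zeros((10, 3)), np.full((10, 3), 5.0)])
X += rng.normal(scale=0.1, size=X.shape)
y = np.array([0] * 10 + [1] * 10)
print(loocv_accuracy(X, y))  # prints 1.0
```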
Acknowledgments
This research is funded by the project DPI2005-01280 from the Spanish Government.
Cite this article
Bonev, B., Escolano, F. & Cazorla, M. Feature selection, mutual information, and the classification of high-dimensional patterns. Pattern Anal Applic 11, 309–319 (2008). https://doi.org/10.1007/s10044-008-0107-0