Abstract
Genomic data, and more generally biomedical data, are often characterized by high dimensionality. An input selection procedure serves two purposes: highlighting the relevant variables (genes) and, possibly, improving classification results. In this paper, we propose a wrapper approach to gene selection for the classification of gene expression data, using simulated annealing together with supervised classification. The proposed approach performs global combinatorial searches through the space of all possible input subsets, handles numerical, categorical, or mixed inputs, and finds (sub-)optimal subsets of inputs that give low classification errors. The method has been tested on publicly available bioinformatics data sets using support vector machines and on a mixed-type data set using classification trees. We also propose some heuristics that speed up convergence. The experimental results highlight the ability of the method to select minimal sets of relevant features.
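The wrapper idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the move set, cooling schedule, cost function, or speed-up heuristics, so the single-feature flip, the geometric cooling, the subset-size penalty, and all function and parameter names below are assumptions made for illustration. A nearest-centroid classifier with leave-one-out error stands in for the support vector machines and classification trees used in the paper, so that the example runs with the Python standard library only.

```python
import math
import random


def loo_error(X, y, subset):
    """Leave-one-out error of a nearest-centroid classifier using only `subset`."""
    if not subset:
        return 1.0  # assumption: an empty subset gets the worst possible error
    cols = sorted(subset)
    errors = 0
    for i in range(len(X)):
        # Per-class feature sums and counts, excluding the held-out sample i.
        sums, counts = {}, {}
        for j, (row, label) in enumerate(zip(X, y)):
            if j == i:
                continue
            acc = sums.setdefault(label, [0.0] * len(cols))
            for k, c in enumerate(cols):
                acc[k] += row[c]
            counts[label] = counts.get(label, 0) + 1

        def dist(label):
            # Squared Euclidean distance from sample i to the class centroid.
            return sum((X[i][c] - sums[label][k] / counts[label]) ** 2
                       for k, c in enumerate(cols))

        predicted = min(sums, key=dist)
        errors += predicted != y[i]
    return errors / len(X)


def anneal_select(X, y, n_iter=300, t0=0.5, alpha=0.98, penalty=0.01, seed=0):
    """Simulated annealing over feature subsets (hypothetical parameters)."""
    rng = random.Random(seed)
    n_features = len(X[0])

    def cost(subset):
        # Assumed objective: wrapped classifier error plus a small size penalty.
        return loo_error(X, y, subset) + penalty * len(subset)

    current = {f for f in range(n_features) if rng.random() < 0.5}
    current_cost = cost(current)
    best, best_cost = set(current), current_cost
    t = t0
    for _ in range(n_iter):
        # Move: flip one randomly chosen feature in or out of the subset.
        candidate = current ^ {rng.randrange(n_features)}
        cand_cost = cost(candidate)
        delta = cand_cost - current_cost
        # Metropolis criterion: always accept improvements, and accept
        # worsening moves with probability exp(-delta / t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_cost = candidate, cand_cost
            if current_cost < best_cost:
                best, best_cost = set(current), current_cost
        t *= alpha  # geometric cooling (assumed schedule)
    return best, best_cost


def make_toy(n=30, n_features=6, seed=1):
    """Synthetic two-class data in which only feature 0 carries class signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0, 1) for _ in range(n_features)]
        row[0] = 4.0 * label + rng.gauss(0, 0.5)
        X.append(row)
        y.append(label)
    return X, y


X, y = make_toy()
best, best_cost = anneal_select(X, y)
print(sorted(best), best_cost)
```

The size penalty is one simple way to push the search toward small subsets; any supervised classifier and error estimate could replace `loo_error` without changing the annealing loop.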
Acknowledgments
We thank Chih-Chung Chang for help with the internals of LIBSVM in R. A discussion with Giorgio Valentini helped us clarify an important issue in this paper. This work was funded by the Italian Ministry of Education, University and Research (code 2004062740).
Cite this article
Filippone, M., Masulli, F. & Rovetta, S. Simulated annealing for supervised gene selection. Soft Comput 15, 1471–1482 (2011). https://doi.org/10.1007/s00500-010-0597-8
DOI: https://doi.org/10.1007/s00500-010-0597-8