Simulated annealing for supervised gene selection

  • Focus
Soft Computing

Abstract

Genomic data, and more generally biomedical data, are often characterized by high dimensionality. An input selection procedure can attain the two objectives of highlighting the relevant variables (genes) and possibly improving classification results. In this paper, we propose a wrapper approach to gene selection for the classification of gene expression data, using simulated annealing together with supervised classification. The proposed approach can perform global combinatorial searches through the space of all possible input subsets, can handle numerical, categorical, or mixed inputs, and is able to find (sub-)optimal subsets of inputs giving low classification errors. The method has been tested on publicly available bioinformatics data sets using support vector machines and on a mixed-type data set using classification trees. We also propose some heuristics that speed up convergence. The experimental results highlight the ability of the method to select minimal sets of relevant features.
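As an illustration only, the wrapper idea summarized in the abstract can be sketched in a few lines of Python. This is not the authors' implementation: the classifier here is a plain nearest-centroid rule scored by leave-one-out error (standing in for the support vector machines and classification trees used in the paper), the cost adds a small per-feature penalty to favour small subsets, and the move is a single feature flip with geometric cooling. All function names and parameter values (`loo_error`, `cost`, `anneal_select`, `penalty`, `t0`, `cooling`) are assumptions made for this sketch.

```python
import math
import random


def loo_error(X, y, subset):
    """Leave-one-out error of a nearest-centroid classifier using only `subset`.

    Assumes at least two samples; an empty subset is scored as worst case.
    """
    if not subset:
        return 1.0
    cols = sorted(subset)
    errors = 0
    for i in range(len(X)):
        # Accumulate per-class sums over every sample except the held-out one.
        sums, counts = {}, {}
        for j, (row, label) in enumerate(zip(X, y)):
            if j == i:
                continue
            acc = sums.setdefault(label, [0.0] * len(cols))
            for k, c in enumerate(cols):
                acc[k] += row[c]
            counts[label] = counts.get(label, 0) + 1

        def dist(label):
            # Squared Euclidean distance from sample i to the class centroid.
            return sum((X[i][c] - sums[label][k] / counts[label]) ** 2
                       for k, c in enumerate(cols))

        errors += min(sums, key=dist) != y[i]
    return errors / len(X)


def cost(X, y, subset, penalty=0.01):
    """Wrapper cost: classification error plus an assumed size penalty."""
    return loo_error(X, y, subset) + penalty * len(subset)


def anneal_select(X, y, n_iter=300, t0=0.2, cooling=0.98, penalty=0.01, seed=0):
    """Simulated annealing over feature subsets (illustrative settings)."""
    rng = random.Random(seed)
    n_features = len(X[0])
    current = {f for f in range(n_features) if rng.random() < 0.5}
    current_cost = cost(X, y, current, penalty)
    best, best_cost = set(current), current_cost
    t = t0
    for _ in range(n_iter):
        # Propose a neighbour by flipping the membership of one random feature.
        candidate = current ^ {rng.randrange(n_features)}
        cand_cost = cost(X, y, candidate, penalty)
        delta = cand_cost - current_cost
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_cost = candidate, cand_cost
            if current_cost < best_cost:
                best, best_cost = set(current), current_cost
        t *= cooling  # geometric cooling schedule (an assumption)
    return best, best_cost


if __name__ == "__main__":
    # Toy data: feature 0 separates the classes, features 1-5 are noise.
    rng = random.Random(1)
    y = [i % 2 for i in range(20)]
    X = [[10.0 * label + rng.random()] + [rng.random() for _ in range(5)]
         for label in y]
    print(anneal_select(X, y))
```

Because every candidate subset is scored by retraining and evaluating a classifier, the cost of the search is dominated by the classifier; a real experiment on gene expression data would replace `loo_error` with the chosen learner and a suitable validation scheme.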

Notes

  1. http://www.broad.mit.edu/cancer/software/genepattern/datasets/.

  2. http://microarray.princeton.edu/oncology/affydata/index.html.

  3. http://mlearn.ics.uci.edu/databases/heart-disease/cleve.mo.

Acknowledgments

We thank Chih-Chung Chang for help with the internals of LIBSVM in R. A discussion with Giorgio Valentini helped us to clarify an important issue of this paper. This work was funded by the Italian Ministry of Education, University and Research (code 2004062740).

Author information

Corresponding author

Correspondence to Francesco Masulli.

About this article

Cite this article

Filippone, M., Masulli, F. & Rovetta, S. Simulated annealing for supervised gene selection. Soft Comput 15, 1471–1482 (2011). https://doi.org/10.1007/s00500-010-0597-8
