Feature clustering based support vector machine recursive feature elimination for gene selection

Applied Intelligence

Abstract

In DNA microarray datasets, gene expression data typically have a huge number of features (genes) but only a small number of samples. As DNA microarray technology develops, the number of dimensions grows even faster, which can lead to the curse of dimensionality. Good classification performance therefore requires preprocessing the gene expression data. Support vector machine recursive feature elimination (SVM-RFE) is a classical method for gene selection, but it suffers from high computational complexity. To remedy this, this paper enhances SVM-RFE for gene selection by incorporating feature clustering; the resulting method is called feature clustering SVM-RFE (FCSVM-RFE). The proposed method first performs a rough gene selection and then ranks the selected genes. First, a clustering algorithm groups the genes into gene groups, in each of which the genes have similar expression profiles. Then, a representative gene is chosen for each gene group, which yields a representative gene set. Finally, SVM-RFE is applied to rank these representative genes. FCSVM-RFE can reduce both the computational complexity and the redundancy among genes. Experiments on seven public gene expression datasets show that FCSVM-RFE can achieve better classification performance and lower computational complexity than state-of-the-art methods such as SVM-RFE.
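The three stages described in the abstract (cluster the genes, keep one representative per group, rank the representatives with SVM-RFE) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which clustering algorithm is used or how a representative is chosen. The sketch assumes k-means with a deterministic farthest-first initialisation, takes the gene nearest to each cluster centroid as the representative, and stands in a bias-free linear SVM trained by batch subgradient descent for the SVM solver. All function names, parameter values, and the toy data are hypothetical.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - z) ** 2 for x, z in zip(a, b))

def kmeans_genes(profiles, k, iters=20):
    """Cluster genes (one profile per gene). Assumption: plain k-means
    with farthest-first initialisation, chosen here for determinism."""
    centers = [list(profiles[0])]
    while len(centers) < k:
        far = max(profiles, key=lambda p: min(sqdist(p, c) for c in centers))
        centers.append(list(far))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sqdist(p, centers[c]))
                  for p in profiles]
        for c in range(k):
            members = [p for p, l in zip(profiles, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Stand-in linear SVM: batch subgradient descent on the L2-regularised
    hinge loss, without a bias term (data assumed roughly centred)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]
        for xi, yi in zip(X, y):
            if yi * sum(wj * xj for wj, xj in zip(w, xi)) < 1:
                for j in range(d):
                    grad[j] -= yi * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def svm_rfe(X, y, gene_ids):
    """Rank genes by repeatedly dropping the one with the smallest w_j^2.
    Returns gene ids ordered from most to least important."""
    remaining = list(range(len(gene_ids)))
    eliminated = []
    while remaining:
        Xr = [[row[j] for j in remaining] for row in X]
        w = train_linear_svm(Xr, y)
        worst = min(range(len(remaining)), key=lambda j: w[j] ** 2)
        eliminated.append(gene_ids[remaining.pop(worst)])
    return eliminated[::-1]

def fcsvm_rfe_sketch(profiles, y, k):
    """Cluster genes, pick one representative per cluster, rank them."""
    labels, centers = kmeans_genes(profiles, k)
    reps = []
    for c in range(k):
        members = [g for g, l in enumerate(labels) if l == c]
        if members:  # assumption: representative = gene nearest the centroid
            reps.append(min(members,
                            key=lambda g: sqdist(profiles[g], centers[c])))
    X = [[profiles[g][i] for g in reps] for i in range(len(y))]
    return labels, reps, svm_rfe(X, y, reps)

# Toy data (hypothetical): 6 genes x 8 samples. Genes 0-1 track the class
# label; genes 2-3 and 4-5 form two groups unrelated to the label.
y = [1, 1, 1, 1, -1, -1, -1, -1]
profiles = [
    [2, 2, 2, 2, -2, -2, -2, -2],
    [2.2, 2.2, 2.2, 2.2, -2.2, -2.2, -2.2, -2.2],
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1.1, -1.1, 1.1, -1.1, 1.1, -1.1, 1.1, -1.1],
    [1, 1, -1, -1, 1, 1, -1, -1],
    [1.1, 1.1, -1.1, -1.1, 1.1, 1.1, -1.1, -1.1],
]
labels, reps, ranking = fcsvm_rfe_sketch(profiles, y, k=3)
print("representatives:", reps, "ranking (best first):", ranking)
```

On this toy input the six genes fall into three groups of two, so SVM-RFE retrains on three representatives instead of six genes. That reduction in the number of features passed to the recursive elimination loop is where the claimed saving in computation comes from.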

Acknowledgment

This study was funded by the National Natural Science Foundation of China (grant numbers 61373093, 61672364, and 61672365), by the Natural Science Foundation of Jiangsu Province of China (grant number BK20140008), by the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (grant number 13KJA520001), and by the Soochow Scholar Project.

Author information

Corresponding author

Correspondence to Li Zhang.

About this article

Cite this article

Huang, X., Zhang, L., Wang, B. et al. Feature clustering based support vector machine recursive feature elimination for gene selection. Appl Intell 48, 594–607 (2018). https://doi.org/10.1007/s10489-017-0992-2

  • DOI: https://doi.org/10.1007/s10489-017-0992-2
