Abstract
In a DNA microarray dataset, gene expression data often has a huge number of features(which are referred to as genes) versus a small size of samples. With the development of DNA microarray technology, the number of dimensions increases even faster than before, which could lead to the problem of the curse of dimensionality. To get good classification performance, it is necessary to preprocess the gene expression data. Support vector machine recursive feature elimination (SVM-RFE) is a classical method for gene selection. However, SVM-RFE suffers from high computational complexity. To remedy it, this paper enhances SVM-RFE for gene selection by incorporating feature clustering, called feature clustering SVM-RFE (FCSVM-RFE). The proposed method first performs gene selection roughly and then ranks the selected genes. First, a clustering algorithm is used to cluster genes into gene groups, in each which genes have similar expression profile. Then, a representative gene is found to represent a gene group. By doing so, we can obtain a representative gene set. Then, SVM-RFE is applied to rank these representative genes. FCSVM-RFE can reduce the computational complexity and the redundancy among genes. Experiments on seven public gene expression datasets show that FCSVM-RFE can achieve a better classification performance and lower computational complexity when compared with the state-the-art-of methods, such as SVM-RFE.
Similar content being viewed by others
References
The dataset is download from gene expression model selector. http://www.gems-system.org/
The dataset is download from kent ridge bio-medical dataset. http://datam.i2r.a-star.edu.sg/datasets/krbd/
Ambroise C, McLachlan GJ (2002) Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Nat Acad Sci 99(10):6562–6566
Blum AL, Langley P (1997) Selection of relevant features and examples in machine learning. Artif Intell 97(1):245–271
Chen H, Tiho P, Yao X (2009) Predictive ensemble pruning by expectation propagation. IEEE Trans Knowl Data Eng 21(7):999–1013
Chu W, Ghahramani Z, Falciani F, Wild DL (2005) Biomarker discovery in microarray gene expression data with Gaussian processes. Bioinformatics 21(16):3385–3393
Demṡar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Dhanasekaran SM, Barrette TR, Ghosh D, Shah R, Varambally S (2001) Delineation of prognostic biomarkers in prostate cancer. Nature 412(6849):822–826
Díaz-Uriarte R, De Andres SA (2006) Gene selection and classification of microarray data using random forest. BMC Bioinform 7(1):1
Ding C, Peng H (2005) Minimum redundancy feature selection from microarray gene expression data. J Bioinform Comput Biol 3(02):185–205
Duan KB, Rajapakse JC, Wang H, Azuaje F (2005) Multiple svm-rfe for gene selection in cancer classification with expression data. IEEE Trans NanoBiosci 4(3):228–234
Dunn OJ (1961) Multiple comparisons among means. J Am Stat Assoc 56(293):52–64
Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proc Nat Acad Sci 95(25):14,863–14,868
Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Amer Stat Assoc 32(200):675–701
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA et al (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439):531–537
Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46(1–3):389–422
Hartigan JA, Wong MA (1979) Algorithm as 136: a k-means clustering algorithm. J R Stat Soc Series C (Appl Stat) 28(1):100– 108
Inza I, Larrañaga P., Blanco R, Cerrolaza AJ (2004) Filter versus wrapper gene selection approaches in dna microarray domains. Artif Intell Med 31(2):91–103
Islam AT, Jeong BS, Bari AG, Lim CG, Jeon SH (2015) Mapreduce based parallel gene selection method. Appl Intell 42(2):147–156
Jäger J, Sengupta R, Ruzzo WL (2002) Improved gene selection for classification of microarrays. In: Proceedings of the eighth Pacific symposium on biocomputing. Lihue, pp 53–64
Karan D, Kelly DL, Rizzino A, Lin MF, Batra SK (2002) Expression profile of differentially-regulated genes during progression of androgen-independent growth in human prostate cancer cells. Carcinogenesis 23(6):967–976
Kira K, Rendell LA (1992) The feature selection problem: traditional methods and a new algorithm. In: AAAI, vol 2, pp 129–134
Kishino H, Waddell PJ (2000) Correspondence analysis of genes and tissue types and finding genetic links from microarray data. Genome Inform 11:83–95
Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97(1):273–324
Kononenko I (1994) Estimating attributes: analysis and extensions of relief. In: Machine learning: ECML-94. Springer, pp 171–182
Lee S, Park YT, d’Auriol BJ, et al. (2012) A novel feature selection method based on normalized mutual information. Appl Intell 37(1):100–120
Liu X, Krishnan A, Mondry A (2005) An entropy-based gene selection method for cancer classification using microarray data. BMC Bioinform 6(1):1
Magee JA, Araki T, Patil S, Ehrig T, True L, Humphrey PA, Catalona WJ, Watson MA, Milbrandt J (2001) Expression profiling reveals hepsin overexpression in prostate cancer. Cancer Res 61(15):5692–5696
Mao Z, Cai W, Shao X (2013) Selecting significant genes by randomization test for cancer classification using gene expression data. J Biomed Inform 46(4):594–601
Mundra PA, Rajapakse JC (2010) Svm-rfe with mrmr filter for gene selection. IEEE Trans NanoBiosci 9(1):31–37
Nazeer KA, Sebastian M (2009) Improving the accuracy and efficiency of the k-means clustering algorithm. In: Proceedings of the world congress on engineering, vol 1, pp 1–3
Peng H, Long F, Ding C (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238
Richards AL, Holmans P, O’Donovan MC, Owen MJ, Jones L (2008) A comparison of four clustering methods for brain expression microarray data. BMC Bioinform 9(1):1
Ruiz R, Riquelme JC, Aguilar-Ruiz JS (2006) Incremental wrapper-based gene selection from microarray data for cancer classification. Pattern Recogn 39(12):2383–2392
Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D’Amico AV, Richie JP et al (2002) Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1(2):203–209
Sun S, Peng Q, Shakoor A (2014) A kernel-based multivariate feature selection method for microarray data classification. PloS One 9(7):e102,541
Szedmak S, Shawe-Taylor J, Saunders CJ, Hardoon DR et al (2004) Multiclass classification by l1 norm support vector machine. In: Pattern recognition and machine learning in computer vision workshop. Citeseer, pp 02–04
Tan M, Wang L, Tsang IW (2010) Learning sparse svm for feature selection on very high dimensional datasets. In: Proceedings of the 27th international conference on machine learning (ICML-10), pp 1047–1054
Tang Y, Zhang YQ, Huang Z (2007) Development of two-stage svm-rfe gene selection strategy for microarray expression data analysis. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 4(3):365–381
Vapnik VN, Vapnik V (1998) Statistical learning theory, vol 1. Wiley, New York
Wang X, Gotoh O (2009) Accurate molecular classification of cancer using simple rules. BMC Med Genom 2(1):1
Xie ZX, Hu QH, Yu DR (2006) Improved feature selection algorithm based on svm and correlation. In: Advances in neural networks-ISNN 2006. Springer, pp 1373–1380
Yedla M, Pathakota SR, Srinivasa T (2010) Enhancing k-means clustering algorithm with improved initial center. Int J Comput Sci Inform Technol 1(2):121–125
Yu L, Liu H (2004) Efficient feature selection via analysis of relevance and redundancy. J Mach Learn Res 5:1205–1224
Zhang Y, Ding C, Li T (2008) Gene selection algorithm by combining relieff and mrmr. BMC Genom 9(2):1
Zhou X, Tuck DP (2007) Msvm-rfe: extensions of svm-rfe for multiclass gene selection on dna microarray data. Bioinformatics 23(9):1106–1114
Acknowledgment
This study was funded by the National Natural Science Foundation of China (grant numbers 61373093, 61672364, and 61672365), by the Natural Science Foundation of Jiangsu Province of China (grant number BK20140008), by the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (grant number 13KJA520001), and by the Soochow Scholar Project.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Huang, X., Zhang, L., Wang, B. et al. Feature clustering based support vector machine recursive feature elimination for gene selection. Appl Intell 48, 594–607 (2018). https://doi.org/10.1007/s10489-017-0992-2
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10489-017-0992-2