Abstract
Classification of high-dimensional data with a small number of samples is a challenging problem in both machine learning and medicine, because there is a wide variety of possible modeling approaches and it is not always clear which is optimal for a given prediction task. The modeling choices include feature selection (dimensionality reduction), classification algorithms, and ensemble selection, and these can be combined in many ways. Previous work has shown that evolutionary computation is useful for building an ensemble from pairs of feature selection and classification algorithms. However, evolutionary computation has several parameters that must be determined, and the optimization is computationally expensive. In this paper, we attempt to improve on this approach by adopting meta-classification with the farthest-first clustering algorithm. The effectiveness and accuracy of our method are validated by experiments on four publicly available real microarray datasets (colon, breast, prostate, and lymphoma cancers). The results confirm that the proposed method outperforms individual classifiers and other alternatives (a standard genetic algorithm and methods from the literature).
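For readers unfamiliar with farthest-first clustering, the following is a minimal sketch of the generic farthest-first traversal heuristic. It is not the authors' implementation: the abstract does not state what objects are clustered or how the clusters feed the meta-classifier, so the function names, the Euclidean distance, and the choice of the first center are all assumptions made for illustration.

```python
import math

def farthest_first_centers(points, k, first=0):
    """Return the indices of k centers chosen by farthest-first traversal.

    `first` (an assumption here) is the index of the initial center;
    implementations often pick it at random instead.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    centers = [first]
    # nearest[i] = distance from point i to its closest chosen center so far
    nearest = [math.dist(p, points[first]) for p in points]
    while len(centers) < k:
        # The next center is the point farthest from all current centers.
        nxt = max(range(len(points)), key=nearest.__getitem__)
        centers.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))
    return centers

def assign_clusters(points, centers):
    """Label each point with the position (in `centers`) of its nearest center."""
    return [
        min(range(len(centers)), key=lambda c: math.dist(p, points[centers[c]]))
        for p in points
    ]

# Toy data (hypothetical): two tight pairs and one isolated point.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 9)]
centers = farthest_first_centers(points, k=3)
labels = assign_clusters(points, centers)
```

Each pass costs one distance computation per point, so selecting k centers from n points takes on the order of n·k distance evaluations, with no iterative refinement and only k to tune.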
Abbreviations
- AVG: Average
- CC: Cosine coefficient
- CF: Classification
- DCGA: Deterministic crowding genetic algorithm
- DLDA: Diagonal linear discriminant analysis
- ED: Euclidean distance
- F1–F4: Fitness functions
- FS: Feature selection
- G: The number of genes
- G1–G2: Global ranking feature selection methods
- GA: Genetic algorithm
- IG: Information gain
- IV: Ideal vector
- KNN: K-nearest neighbor
- KNNC: KNN with cosine coefficient
- KNNE: KNN with Euclidean distance
- KNNP: KNN with Pearson correlation
- KNNS: KNN with Spearman correlation
- LOOCV: Leave-one-out cross-validation
- M: The number of classification algorithms
- MDL: Minimum description length
- MI: Mutual information
- MLP: Multi-layer perceptron
- N: The number of feature selection methods
- NNGE: Non-nested generalized exemplars
- P: The number of training samples
- PAM: Prediction analysis with microarray
- PC: Pearson correlation
- PCP: Pattern classification program
- SNR: Signal-to-noise ratio
- SP: Spearman correlation
- SPEGASOS: Stochastic variant of primal estimated sub-gradient solver for SVM
- SVM: Support vector machine
- SVML: Linear SVM
- TS: Training sample
Acknowledgement
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (2013 R1A2A2A01016589, 2010-0018950, 2010-0018948).
Cite this article
Kim, KJ., Cho, SB. Meta-classifiers for high-dimensional, small sample classification for gene expression analysis. Pattern Anal Applic 18, 553–569 (2015). https://doi.org/10.1007/s10044-014-0369-7
DOI: https://doi.org/10.1007/s10044-014-0369-7