Abstract
In this paper, we focus on the problem of extreme learning machine (ELM)-based microarray data classification. Unlike the traditional classification problem, the goal here is not only to predict class labels for unseen samples but also to make clear what leads to the results, i.e., which genes are involved in a specific disease. This is especially significant for biologists, who need to decipher the causes of disease. As a black-box method, ELM cannot accomplish this task by itself. In this work, we propose a framework based on diversified sequence feature selection to address the problem. In this framework, (1) a sequence model, EWave, is introduced to make the structural ordering information among genes exploitable; (2) the concept of an irreducible sequence is proposed, in which the genes work as an ordered whole to maintain high confidence with respect to a specific class, and removing any gene decreases that confidence substantially. An efficient sequence mining algorithm, together with effective pruning rules, is developed to mine such sequences; and (3) we study how to extract a set of diversified sequence features that represents all mined results. This problem is proved to be NP-hard, and a greedy algorithm is presented to approximate the optimal solution. Experimental results show that the proposed approach significantly improves the efficiency and effectiveness of ELM compared with some widely used feature selection techniques.
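To give a feel for step (3), the following is a minimal sketch of a greedy diversified selection, not the algorithm from the paper. The abstract does not state the objective function, so everything here is an assumption: candidate sequences are represented as tuples of hypothetical gene names, each with a confidence score, redundancy is measured by Jaccard overlap of their gene sets, and the trade-off weight `penalty` is an illustrative parameter. The selection rule is a generic max-marginal-relevance style heuristic.

```python
from typing import Dict, List, Tuple

Sequence = Tuple[str, ...]  # an ordered gene sequence (hypothetical representation)


def jaccard(a: Sequence, b: Sequence) -> float:
    """Overlap between the gene sets of two sequences (0 = disjoint, 1 = identical)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def greedy_diverse_select(
    confidence: Dict[Sequence, float], k: int, penalty: float = 0.5
) -> List[Sequence]:
    """Pick up to k sequences, trading confidence against redundancy.

    Assumed objective: at each step choose the candidate maximizing
    confidence - penalty * (largest overlap with anything already chosen).
    """
    selected: List[Sequence] = []
    remaining = set(confidence)
    while remaining and len(selected) < k:

        def gain(candidate: Sequence) -> float:
            redundancy = max((jaccard(candidate, s) for s in selected), default=0.0)
            return confidence[candidate] - penalty * redundancy

        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(remaining), key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `penalty=0` the procedure reduces to ranking by confidence alone; larger values favor sequences that share fewer genes with those already chosen.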
Notes
That is, IG (information gain), TR (twoing rule), SM (sum minority), MM (max minority), GI (Gini index) and SV (sum of variance).
Acknowledgments
Supported by the 863 program (2012AA011004), the 973 program (2011CB302200-G), the National Science Fund for Distinguished Young Scholars (61025007), the State Key Program of National Natural Science of China (60933001, 61332014), the National Natural Science Foundation of China (61272182, 61100028, 61073063, 61173030), New Century Excellent Talents (NCET-11-0085), the China Postdoctoral Science Foundation (2012T50263, 2011M500568), and the Fundamental Research Funds for the Central Universities (N110404005, N110404017).
Cite this article
Zhao, Y., Wang, G., Yin, Y. et al. Improving ELM-based microarray data classification by diversified sequence features selection. Neural Comput & Applic 27, 155–166 (2016). https://doi.org/10.1007/s00521-014-1571-7
DOI: https://doi.org/10.1007/s00521-014-1571-7