Abstract
Microarray datasets suffer from the curse of dimensionality: each sample is described by a very large number of genes, while only a few samples are available. Efficient classification of samples therefore requires selecting a smaller set of relevant and non-redundant genes. In this paper, we propose a two-stage algorithm, GSUCE, for finding a set of discriminatory genes responsible for classification in high-dimensional microarray datasets. In the first stage, correlated genes are grouped into clusters and the best gene is selected from each cluster to create a pool of independent genes, which reduces redundancy. We use maximal information compression to measure similarity between genes. In the second stage, a wrapper-based forward feature selection method is used to obtain a set of informative genes for a given classifier. The proposed algorithm is tested on five well-known, publicly available datasets. Comparison with other state-of-the-art methods shows that the proposed algorithm achieves better classification accuracy with fewer features.
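The two stages described above can be illustrated with a minimal sketch. It is not the authors' GSUCE implementation: the abstract does not say how clusters are formed, how the "best" gene in a cluster is chosen, or which classifier is wrapped. The sketch assumes that maximal information compression is computed as the smaller eigenvalue of the 2x2 covariance matrix of two genes, and it uses hypothetical stand-ins for everything else: a greedy threshold grouping, a simple two-class relevance score, and a nearest-centroid classifier evaluated by leave-one-out accuracy. All function names and the `threshold` parameter are illustrative.

```python
import numpy as np

def mici(x, y):
    """Smaller eigenvalue of the 2x2 covariance matrix of (x, y).
    It is 0 when the two genes are perfectly linearly dependent."""
    vx, vy = np.var(x), np.var(y)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    s = vx + vy
    disc = max(s * s - 4.0 * (vx * vy - cov * cov), 0.0)
    return 0.5 * (s - np.sqrt(disc))

def relevance(X, y):
    """Hypothetical per-gene score for labels 0/1:
    |difference of class means| / (sum of class standard deviations)."""
    a, b = X[y == 0], X[y == 1]
    return np.abs(a.mean(0) - b.mean(0)) / (a.std(0) + b.std(0) + 1e-12)

def stage1_representatives(X, y, threshold):
    """Assumed greedy grouping: the most relevant remaining gene seeds a
    cluster and absorbs every remaining gene whose MICI to it is below
    `threshold`; only the seed is kept as the cluster representative."""
    order = [int(g) for g in np.argsort(-relevance(X, y))]
    reps = []
    while order:
        seed = order.pop(0)
        reps.append(seed)
        order = [g for g in order if mici(X[:, seed], X[:, g]) >= threshold]
    return reps

def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier
    (a stand-in for the unspecified 'given classifier')."""
    n, hits = len(y), 0
    for i in range(n):
        mask = np.arange(n) != i
        Xtr, ytr = X[mask], y[mask]
        classes = np.unique(ytr)
        cents = np.array([Xtr[ytr == c].mean(0) for c in classes])
        pred = classes[np.argmin(((cents - X[i]) ** 2).sum(1))]
        hits += int(pred == y[i])
    return hits / n

def stage2_forward_selection(X, y, pool):
    """Wrapper-based forward selection: add the gene that most improves
    accuracy; stop as soon as no candidate improves it."""
    selected, best, remaining = [], 0.0, list(pool)
    while remaining:
        scores = [loo_accuracy(X[:, selected + [g]], y) for g in remaining]
        k = int(np.argmax(scores))
        if scores[k] <= best:
            break
        best = scores[k]
        selected.append(remaining.pop(k))
    return selected, best

# Toy data: 8 samples, 3 genes. Gene 0 and gene 1 are nearly redundant,
# gene 2 is unrelated to the class label.
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
g0 = np.array([0.0, 0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 1.3])
g1 = g0 + np.array([0.05, -0.05] * 4)
g2 = np.array([1.0, -1.0] * 4)
X = np.column_stack([g0, g1, g2])

pool = stage1_representatives(X, y, threshold=0.01)
genes, acc = stage2_forward_selection(X, y, pool)
print(pool, genes, acc)
```

On this toy matrix, stage 1 keeps one gene from the redundant pair plus the unrelated gene, and stage 2 then retains only the gene that actually improves accuracy. With real microarray data the threshold, the relevance score, and the wrapped classifier would all need to be chosen and validated separately.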
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bala, R., Agrawal, R.K. (2010). Entropy Based Clustering to Determine Discriminatory Genes for Microarray Dataset. In: Ranka, S., et al. Contemporary Computing. IC3 2010. Communications in Computer and Information Science, vol 94. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14834-7_38
DOI: https://doi.org/10.1007/978-3-642-14834-7_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14833-0
Online ISBN: 978-3-642-14834-7
eBook Packages: Computer Science, Computer Science (R0)