Abstract
Supervised Machine Learning methods have been successfully applied to gene expression based cancer diagnosis. Characteristics intrinsic to cancer gene expression data sets, such as high dimensionality, a low number of samples and the presence of noise, make the classification task very difficult. Furthermore, limitations in classifier performance can often be attributed to characteristics intrinsic to a particular data set.
This paper presents an analysis of gene expression data sets for cancer diagnosis using classification complexity measures. Such measures take data geometry, distribution and linear separability as indications of the complexity of the classification task. The results indicate that the cancer data sets investigated consist mostly of linearly separable, non-overlapping classes, which supports the good predictive performance of robust linear classifiers, such as SVMs, on these data sets. Furthermore, we found two complexity indices that were good indicators of the difficulty of gene expression based cancer diagnosis.
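To make the idea of a classification complexity measure concrete, the following is a minimal sketch of one widely used index, the maximum Fisher's discriminant ratio (often called F1 in the complexity-measure literature of Ho and Basu). The abstract does not say which indices the authors computed or which two proved informative, so the choice of F1, the function name `fisher_ratio_f1`, the use of population variance, and the toy data are all assumptions made for illustration only.

```python
import numpy as np


def fisher_ratio_f1(X, y):
    """Maximum per-feature Fisher discriminant ratio for a two-class problem.

    Illustrative sketch only: for each feature it computes
    (mean_a - mean_b)^2 / (var_a + var_b) and returns the largest value.
    Larger values suggest that at least one feature separates the classes well.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    if classes.size != 2:
        raise ValueError("this sketch handles exactly two classes")

    a = X[y == classes[0]]
    b = X[y == classes[1]]
    num = (a.mean(axis=0) - b.mean(axis=0)) ** 2
    den = a.var(axis=0) + b.var(axis=0)  # population variance (ddof=0)

    # Zero pooled variance: treat as infinite ratio if the means differ, else 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(den > 0, num / den, np.where(num > 0, np.inf, 0.0))
    return float(ratio.max())


# Hypothetical toy data: 4 samples, 2 "genes"; only the first gene separates the classes.
X_toy = [[0.0, 5.0], [1.0, 6.0], [10.0, 5.0], [11.0, 6.0]]
y_toy = [0, 0, 1, 1]
print(fisher_ratio_f1(X_toy, y_toy))  # 200.0
```

On real expression data the same function would be applied to a samples-by-genes matrix; a low value would hint that no single gene discriminates the classes, without ruling out separability by a combination of genes.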
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Costa, I.G., Lorena, A.C., Peres, L.R.M.P.y., de Souto, M.C.P. (2009). Using Supervised Complexity Measures in the Analysis of Cancer Gene Expression Data Sets. In: Guimarães, K.S., Panchenko, A., Przytycka, T.M. (eds) Advances in Bioinformatics and Computational Biology. BSB 2009. Lecture Notes in Computer Science, vol 5676. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03223-3_5
DOI: https://doi.org/10.1007/978-3-642-03223-3_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03222-6
Online ISBN: 978-3-642-03223-3
eBook Packages: Computer Science (R0)