Using Supervised Complexity Measures in the Analysis of Cancer Gene Expression Data Sets

Costa, Ivan G.; Lorena, Ana C.; Peres, Liciana R. M. P. y; de Souto, Marcilio C. P.

doi:10.1007/978-3-642-03223-3_5

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 5676)

Included in the following conference series:

Brazilian Symposium on Bioinformatics

Abstract

Supervised machine learning methods have been successfully applied to gene expression based cancer diagnosis. Characteristics intrinsic to cancer gene expression data sets, such as high dimensionality, a low number of samples and the presence of noise, make the classification task very difficult. Furthermore, limitations in classifier performance may often be attributed to characteristics intrinsic to a particular data set.

This paper presents an analysis of gene expression data sets for cancer diagnosis using classification complexity measures. Such measures consider data geometry, distribution and linear separability as indications of the complexity of the classification task. The results obtained indicate that the cancer data sets investigated are formed by mostly linearly separable, non-overlapping classes, supporting the good predictive performance of robust linear classifiers, such as SVMs, on the given data sets. Furthermore, we found two complexity indices that were good indicators of the difficulty of gene expression based cancer diagnosis.
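
To make the idea of a classification complexity measure concrete, the sketch below computes two simple quantities of this kind: a Fisher discriminant ratio (how well the best single feature separates two classes) and the leave-one-out error of a 1-nearest-neighbour classifier (how often a sample's closest neighbour belongs to the other class). The abstract does not say which measures the paper uses, so the choice of these two, the function names `fisher_ratio` and `loo_1nn_error`, and the toy data are all assumptions made for illustration.

```python
from statistics import mean, pvariance


def fisher_ratio(X, y):
    """Maximum per-feature Fisher discriminant ratio for a two-class problem.

    X: list of samples, each a list of feature values.
    y: list of class labels with exactly two distinct values.
    """
    classes = sorted(set(y))
    if len(classes) != 2:
        raise ValueError("this sketch handles two classes only")
    a = [x for x, label in zip(X, y) if label == classes[0]]
    b = [x for x, label in zip(X, y) if label == classes[1]]
    best = 0.0
    for j in range(len(X[0])):
        col_a = [row[j] for row in a]
        col_b = [row[j] for row in b]
        gap = (mean(col_a) - mean(col_b)) ** 2
        spread = pvariance(col_a) + pvariance(col_b)
        if spread == 0:
            # No within-class variance: fully separated if the means differ.
            ratio = float("inf") if gap > 0 else 0.0
        else:
            ratio = gap / spread
        best = max(best, ratio)
    return best


def loo_1nn_error(X, y):
    """Leave-one-out error rate of a 1-nearest-neighbour classifier."""
    errors = 0
    for i, xi in enumerate(X):
        # Closest other sample by squared Euclidean distance.
        nearest = min(
            (k for k in range(len(X)) if k != i),
            key=lambda k: sum((p - q) ** 2 for p, q in zip(xi, X[k])),
        )
        errors += y[nearest] != y[i]
    return errors / len(X)


# Toy data, invented for illustration: feature 0 separates the classes,
# feature 1 does not.
X = [[0.0, 5.0], [1.0, 3.0], [10.0, 4.0], [11.0, 4.0]]
y = [0, 0, 1, 1]

print(fisher_ratio(X, y))
print(loo_1nn_error(X, y))
```

A high discriminant ratio together with a low nearest-neighbour error suggests well-separated, non-overlapping classes, which is the kind of evidence the abstract describes for the cancer data sets. Real gene expression data would have thousands of features and would normally be handled with a numerical library rather than plain lists.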

Author information

Authors and Affiliations

Center of Informatics, Federal University of Pernambuco, Recife, Brazil
Ivan G. Costa
Center of Mathematics, Computation and Cognition, ABC Fed. Univ., SP, Brazil
Ana C. Lorena & Liciana R. M. P. y Peres
Dept. of Informatics and Applied Mathematics, Fed. Univ. of Rio Grande do Norte, Brazil
Marcilio C. P. de Souto

Editor information

Editors and Affiliations

Center of Informatics, Av. Prof. Luiz Freire, Federal University of Pernambuco, s/n, Cidade Universitária, PE 50740-540, Recife, Brazil
Katia S. Guimarães
National Library of Medicine, National Institutes of Health, National Center for Biotechnology Information, 8600 Rockville Pike, Building 38A 8S814, Bethesda, MD 20894, USA
Anna Panchenko
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Building 38A 8S814, Bethesda, MD 20894, USA
Teresa M. Przytycka

About this paper

Cite this paper

Costa, I.G., Lorena, A.C., Peres, L.R.M.P.y., de Souto, M.C.P. (2009). Using Supervised Complexity Measures in the Analysis of Cancer Gene Expression Data Sets. In: Guimarães, K.S., Panchenko, A., Przytycka, T.M. (eds) Advances in Bioinformatics and Computational Biology. BSB 2009. Lecture Notes in Computer Science, vol 5676. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03223-3_5

DOI: https://doi.org/10.1007/978-3-642-03223-3_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03222-6
Online ISBN: 978-3-642-03223-3
eBook Packages: Computer Science, Computer Science (R0)
