Abstract
Cluster analysis is usually the first step in extracting information from gene expression data. One of its common applications is the clustering of cancer samples, which is associated with the detection of previously unknown cancer subtypes. Although guidelines have been established for choosing appropriate clustering algorithms, little attention has been given to proximity measures. The Pearson correlation coefficient is the de facto proximity measure in this scenario, yet no comprehensive study has analyzed other correlation coefficients as alternatives to it. We therefore evaluated five correlation coefficients (along with Euclidean distance) for the clustering of cancer samples. The evaluation was conducted on 35 publicly available datasets and covered both (i) the intrinsic separation ability and (ii) the clustering predictive ability of the correlation coefficients. Our results indicate that correlation coefficients rarely considered in the gene expression literature may provide results competitive with those of more commonly employed ones. Finally, we show that a recently introduced measure is a promising alternative to the commonly employed Pearson coefficient, providing results that are competitive with, and in some cases superior to, those of Pearson.
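To make the notion of a correlation-based proximity measure concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common conversion of a correlation coefficient r into a dissimilarity 1 - r, which the abstract itself does not specify, and it uses Pearson and Spearman purely as illustrations (the abstract does not name the five coefficients evaluated). The toy expression profiles and all function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def correlation_distance(x, y, coefficient=pearson):
    """Assumed conversion from a correlation r to a dissimilarity 1 - r."""
    return 1.0 - coefficient(x, y)


def euclidean(x, y):
    """Euclidean distance, sensitive to the magnitude of the values."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical expression profiles of three samples over five genes.
sample_a = [1.0, 2.0, 3.0, 4.0, 5.0]
sample_b = [10.0, 20.0, 30.0, 40.0, 50.0]  # same shape as sample_a, 10x scale
sample_c = [1.0, 4.0, 9.0, 16.0, 100.0]    # monotone in sample_a, nonlinear

d_pearson_ab = correlation_distance(sample_a, sample_b, pearson)
d_pearson_ac = correlation_distance(sample_a, sample_c, pearson)
d_spearman_ac = correlation_distance(sample_a, sample_c, spearman)
d_euclidean_ab = euclidean(sample_a, sample_b)
```

In this toy setting, the Pearson-based dissimilarity between `sample_a` and `sample_b` is essentially zero even though their Euclidean distance is large, and the Spearman-based dissimilarity between `sample_a` and `sample_c` is zero while the Pearson-based one is not. Differences of this kind are why the choice of proximity measure can change which samples end up clustered together.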
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Jaskowiak, P.A., Campello, R.J.G.B., Costa, I.G. (2012). Evaluating Correlation Coefficients for Clustering Gene Expression Profiles of Cancer. In: de Souto, M.C., Kann, M.G. (eds) Advances in Bioinformatics and Computational Biology. BSB 2012. Lecture Notes in Computer Science(), vol 7409. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31927-3_11
DOI: https://doi.org/10.1007/978-3-642-31927-3_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31926-6
Online ISBN: 978-3-642-31927-3
eBook Packages: Computer Science, Computer Science (R0)