Abstract
Clustering gene expression data is an important problem in bioinformatics because understanding which genes behave similarly can lead to the discovery of important biological information. Many clustering methods have been applied to gene clustering. This paper proposes a new method for clustering gene expression data based on an improved expectation maximization (EM) method for multivariate Gaussian mixture models. To reduce the over-reliance of classical EM on its initialization, we propose a remove-and-add initialization and apply a random perturbation to the solution before continuing the EM iterations. The number of clusters is estimated with the quasi-Akaike information criterion. The improved EM method is tested and compared extensively with other clustering methods on several simulated and real gene expression data sets. Our results indicate that the improved EM clustering method is superior to the compared clustering algorithms and can be widely used for gene clustering.
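As a rough illustration of the kind of pipeline the abstract describes, the sketch below fits a multivariate Gaussian mixture with classical EM and picks the number of clusters with an information criterion. It is not the method of the paper: the abstract gives no details of the remove-and-add initialization, the random perturbation, or the quasi-Akaike criterion, so the sketch assumes a plain random initialization and the standard AIC as stand-ins. The function names (`fit_gmm_em`, `aic`, `select_k`) and the regularization constant are hypothetical.

```python
import numpy as np


def log_gaussian(X, mean, cov):
    """Log-density of each row of X under N(mean, cov)."""
    d = X.shape[1]
    L = np.linalg.cholesky(cov)
    z = np.linalg.solve(L, (X - mean).T)  # whitened residuals, shape (d, n)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + np.sum(z ** 2, axis=0))


def fit_gmm_em(X, k, n_iter=200, tol=1e-6, reg=1e-6, seed=0):
    """Classical EM for a k-component Gaussian mixture (random init, an assumption)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, size=k, replace=False)].copy()
    base_cov = np.atleast_2d(np.cov(X.T)) + reg * np.eye(d)
    covs = np.array([base_cov.copy() for _ in range(k)])
    weights = np.full(k, 1.0 / k)
    prev = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability.
        log_p = np.stack(
            [np.log(weights[j]) + log_gaussian(X, means[j], covs[j]) for j in range(k)],
            axis=1,
        )
        m = log_p.max(axis=1, keepdims=True)
        log_norm = m + np.log(np.exp(log_p - m).sum(axis=1, keepdims=True))
        resp = np.exp(log_p - log_norm)
        loglik = float(log_norm.sum())
        # M-step: update mixing weights, means, and full covariances.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j] + reg * np.eye(d)
        if abs(loglik - prev) < tol:
            break
        prev = loglik
    # loglik is the value from the last E-step, i.e. before the final M-step.
    return weights, means, covs, resp, loglik


def aic(loglik, k, d):
    """Standard AIC (not the quasi-AIC of the paper) for a full-covariance mixture."""
    n_params = (k - 1) + k * d + k * d * (d + 1) // 2
    return -2.0 * loglik + 2.0 * n_params


def select_k(X, k_values):
    """Fit one mixture per candidate k and return the k with the lowest AIC."""
    scores = {}
    for k in k_values:
        *_, loglik = fit_gmm_em(X, k)
        scores[k] = aic(loglik, k, X.shape[1])
    return min(scores, key=scores.get), scores
```

With an expression matrix `X` of shape (genes, conditions), `select_k(X, range(1, 8))` would return a candidate cluster count together with the AIC score of each fit, and `resp.argmax(axis=1)` from `fit_gmm_em` gives a hard cluster label per gene. Because plain EM only finds a local optimum, results depend on `seed`, which is the sensitivity to initialization that the abstract sets out to reduce.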





Acknowledgments
This work was supported by the National Natural Science Foundation of China (No. 61402204), the Jiangsu Province Natural Science Foundation (Nos. BK20130529, BK2012209), the Research Fund for the Doctoral Program of Higher Education of China (No. 20113227110010), the research foundation for talented scholars, Jiangsu University (No. 14JDG141), and the science and technology program of Zhenjiang city (SH20140110).
Cite this article
Liu, Z., Song, Yq., Xie, Ch. et al. A new clustering method of gene expression data based on multivariate Gaussian mixture models. SIViP 10, 359–368 (2016). https://doi.org/10.1007/s11760-015-0749-5
DOI: https://doi.org/10.1007/s11760-015-0749-5