Abstract
In this paper, we devise a multiobjective optimization framework for clustering gene expression data in a reduced feature space. The clustering problem is viewed from two different aspects: clustering of genes in a reduced sample space, and clustering of samples in a reduced gene space. Three objective functions, namely two internal cluster validity indices and the number of selected features, are optimized simultaneously by a popular multiobjective simulated annealing based approach, AMOSA. A point symmetry based distance is used to assign gene data points to clusters. Seven publicly available benchmark gene expression data sets are used for the experiments, and both aspects of clustering in reduced feature space are demonstrated. The proposed gene expression clustering technique, FSC-MOO, outperforms nine existing clustering techniques. In addition, statistical and biological significance tests have been carried out to show that the results of the proposed FSC-MOO technique are statistically and biologically more significant.
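To illustrate what optimizing three objectives simultaneously involves, the following is a minimal sketch of Pareto dominance over three-valued objective vectors. It is not the AMOSA procedure itself, only the comparison that domination-based multiobjective methods rely on. It assumes all three objectives are expressed so that smaller is better; the function names `dominates` and `non_dominated` and the numeric values are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` is no worse than `b` on every objective and
    strictly better on at least one (all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(solutions: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the solutions that no other solution dominates."""
    return [
        s
        for i, s in enumerate(solutions)
        if not any(dominates(t, s) for j, t in enumerate(solutions) if j != i)
    ]


# Hypothetical objective vectors, for illustration only:
# (validity index 1, validity index 2, number of selected features)
candidates = [
    (0.40, 1.20, 12),
    (0.35, 1.50, 10),
    (0.45, 1.30, 15),  # worse than the first candidate on all three objectives
    (0.40, 1.20, 9),   # same index values as the first, with fewer features
]

front = non_dominated(candidates)
print(front)
```

Under these assumptions, a solution using fewer features survives only if it does not degrade both validity indices relative to some other candidate, which is the trade-off the three-objective formulation exposes.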
Acknowledgements
Dr. Sriparna Saha would like to acknowledge the support of SERB Women in Excellence Award-SB/WEA-08/2017 for conducting this research.
About this article
Cite this article
Alok, A.K., Gupta, P., Saha, S. et al. Simultaneous feature selection and clustering of micro-array and RNA-sequence gene expression data using multiobjective optimization. Int. J. Mach. Learn. & Cyber. 11, 2541–2563 (2020). https://doi.org/10.1007/s13042-020-01139-x
DOI: https://doi.org/10.1007/s13042-020-01139-x