Simultaneous feature selection and clustering of micro-array and RNA-sequence gene expression data using multiobjective optimization

  • Original Article
International Journal of Machine Learning and Cybernetics

Abstract

In this paper, we devise a multiobjective optimization framework for clustering gene expression data in a reduced feature space. The clustering problem is viewed from two different aspects: clustering genes in a reduced sample space, or clustering samples in a reduced gene space. Three objective functions, namely two internal cluster validity indices and the number of selected features, are optimized simultaneously by AMOSA, a popular multiobjective simulated annealing based approach. A point symmetry based distance is used to assign gene data points to clusters. Seven publicly available benchmark gene expression data sets are used in the experiments, and both aspects of clustering in reduced feature space are demonstrated. The proposed gene expression clustering technique outperforms nine existing clustering techniques. In addition, statistical and biological significance tests have been carried out to show that the proposed FSC-MOO technique yields statistically and biologically more significant results.
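
As a rough illustration of what optimizing three objectives simultaneously means, the sketch below filters a set of candidate solutions down to its non-dominated (Pareto-optimal) subset. It is not the authors' AMOSA implementation. The abstract does not say which validity indices are used or whether they are minimized or maximized, so the code assumes every objective has been expressed as a quantity to minimize (for example by negating an index that should be maximized). The tuple layout, the function names, and the numeric values are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout: (validity_index_a, validity_index_b, n_features),
# with all three assumed to be minimized.
Objectives = Tuple[float, float, float]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points: List[Objectives]) -> List[Objectives]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up objective vectors, for illustration only.
candidates: List[Objectives] = [
    (0.40, 0.90, 12.0),
    (0.35, 0.95, 10.0),
    (0.45, 0.92, 12.0),  # dominated by the first candidate
    (0.35, 0.95, 15.0),  # dominated by the second candidate
]

front = non_dominated(candidates)
print(front)
```

The first two candidates trade off against each other (one is better on the first index, the other on the second index and on feature count), so neither dominates the other and both stay in the front. A multiobjective optimizer returns such a set of trade-off solutions rather than a single best one.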

Notes

  1. http://cmgm.stanford.edu/pbrown/sporulation

  2. http://faculty.washington.edu/kayee/cluster

  3. http://homes.esat.kuleuven.be/thijs/Work/Clustering.html

  4. http://www.stat.ucla.edu/jingyi.li/software-and-data.html,mode

  5. http://www.stat.ucla.edu/jingyi.li/software-and-data.html,mode

  6. http://www.biolab.si/supp/bi-cancer/projections/info/SRBCT.htm

  7. http://www.biolab.si/supp/bi-cancer/projections/info/leukemia.htm

  8. http://db.yeastgenome.org/cgi-bin/GO/goTermFinder

  9. http://geneontology.org/

Acknowledgements

Dr. Sriparna Saha would like to acknowledge the support of SERB Women in Excellence Award-SB/WEA-08/2017 for conducting this research.

Author information

Corresponding author

Correspondence to Abhay Kumar Alok.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 126 KB)

About this article

Cite this article

Alok, A.K., Gupta, P., Saha, S. et al. Simultaneous feature selection and clustering of micro-array and RNA-sequence gene expression data using multiobjective optimization. Int. J. Mach. Learn. & Cyber. 11, 2541–2563 (2020). https://doi.org/10.1007/s13042-020-01139-x

  • DOI: https://doi.org/10.1007/s13042-020-01139-x
