Semi-supervised clustering for gene-expression data in multiobjective optimization framework

  • Original Article
  • International Journal of Machine Learning and Cybernetics

Abstract

Studying the patterns hidden in gene expression data helps in understanding the functionality of genes. However, because of the large number of genes and the complexity of biological networks, the resulting mass of data, which often consists of millions of measurements, is difficult to study. Clustering techniques are applied to reveal natural structures and to identify interesting patterns in a given gene expression data set. Semi-supervised classification is a relatively new direction in machine learning. It uses a large amount of unlabeled data together with a small amount of labeled data, and in general it performs better than unsupervised classification. To the best of our knowledge, however, no existing work solves the gene expression data clustering problem using semi-supervised classification techniques. In this paper we attempt to solve this problem using a multiobjective optimization based semi-supervised classification technique, with the aim of attaining good-quality partitions from only a few labeled data points. The labeled data are generated by first applying the Fuzzy C-means clustering technique. To determine the partitioning automatically, multiple cluster centers corresponding to a cluster are encoded in the form of a string. The quality of each obtained partitioning is measured by computing the values of five objective functions. The effectiveness of the proposed semi-supervised clustering technique is demonstrated on five publicly available benchmark gene expression data sets. Comparisons with existing techniques for gene expression data clustering indicate that the proposed method is the most effective among them. Statistical and biological significance tests have also been carried out.
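
The abstract states that the labeled data are generated by an initial run of Fuzzy C-means, but it does not say how the labeled points are then chosen. The following Python sketch illustrates one plausible reading: run a standard Fuzzy C-means and treat only the points whose highest membership exceeds a cutoff as labeled. The function names `fuzzy_c_means` and `select_pseudo_labels`, the cutoff of 0.9, the fuzzifier `m = 2`, and the synthetic two-group data are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Standard Fuzzy C-means (Bezdek). Returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    # Random initial membership matrix; each row sums to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Centers are membership-weighted means of the data points.
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance of every point to every center, shape (n, c).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)  # guard against division by zero
        # Membership update: u_ik proportional to d_ik^(-2 / (m - 1)).
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    # Recompute centers so they match the final membership matrix.
    W = U ** m
    centers = (W.T @ X) / W.sum(axis=0)[:, None]
    return centers, U


def select_pseudo_labels(U, threshold=0.9):
    """Assumed rule: keep points whose top membership is >= threshold."""
    labels = U.argmax(axis=1)
    confident = U.max(axis=1) >= threshold
    return np.flatnonzero(confident), labels[confident]


# Synthetic stand-in for expression profiles: 100 "genes", 5 "conditions".
data_rng = np.random.default_rng(1)
X = np.vstack([
    data_rng.normal(0.0, 0.3, size=(50, 5)),
    data_rng.normal(5.0, 0.3, size=(50, 5)),
])

centers, U = fuzzy_c_means(X, n_clusters=2)
idx, labels = select_pseudo_labels(U, threshold=0.9)
print(f"{len(idx)} of {X.shape[0]} points kept as labeled data")
```

Raising the cutoff keeps fewer but more reliable labeled points, which matches the abstract's stated aim of working with only a few labeled data; the right value would have to be tuned per data set.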

Notes

  1. http://anirbanmukhopadhyay.50webs.com/mogasvm.html.

  2. http://db.yeastgenome.org/cgi-bin/GO/goTermFinder.

Author information

Corresponding author

Correspondence to Abhay Kumar Alok.

About this article

Cite this article

Alok, A.K., Saha, S. & Ekbal, A. Semi-supervised clustering for gene-expression data in multiobjective optimization framework. Int. J. Mach. Learn. & Cyber. 8, 421–439 (2017). https://doi.org/10.1007/s13042-015-0335-8

  • DOI: https://doi.org/10.1007/s13042-015-0335-8
