
Finding best algorithmic components for clustering microarray data

  • Regular Paper
Knowledge and Information Systems

Abstract

The analysis of microarray data is fundamental to microbiology. Although clustering has long been recognized as central to the discovery of gene functions and to disease diagnosis, researchers have found constructing good algorithms surprisingly difficult. In this paper, we address this problem with a component-based approach to clustering algorithm design for class retrieval from microarray data. The idea is to break up existing algorithms into independent building blocks for typical sub-problems, which are then reassembled in new ways to generate as yet unexplored methods. As a test, 432 algorithms were generated and evaluated on published microarray data sets. We found the top performers to be better than the original, component-providing ancestors and also competitive with a set of recently proposed algorithms. Finally, we identified components that showed consistently good performance for clustering microarray data and that should be considered in the further development of clustering algorithms.
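The reassembly idea can be illustrated with a minimal sketch. It assumes three hypothetical sub-problems of a k-means-style algorithm (initialization, distance measure, center update) with two interchangeable components each. The component names, the `assemble` and `generate_algorithms` helpers, and the toy data are illustrative assumptions, not the building blocks or data sets used in the paper, and the 432 algorithms reported above are not reproduced here.

```python
import itertools
import random
import statistics

# Hypothetical components; none of these names come from the paper.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def init_random(points, k, rng):
    # Pick k distinct points at random as starting centers.
    return rng.sample(points, k)

def init_farthest_first(points, k, rng):
    # Start from a random point, then repeatedly add the point that is
    # farthest (Euclidean) from its nearest already-chosen center.
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(euclidean(p, c) for c in centers))
        )
    return centers

def update_mean(members):
    return tuple(statistics.fmean(col) for col in zip(*members))

def update_median(members):
    return tuple(statistics.median(col) for col in zip(*members))

# One entry per sub-problem, each offering interchangeable building blocks.
COMPONENTS = {
    "init": {"random": init_random, "farthest_first": init_farthest_first},
    "distance": {"euclidean": euclidean, "manhattan": manhattan},
    "update": {"mean": update_mean, "median": update_median},
}

def assemble(init, distance, update):
    """Build one clustering algorithm from one component per sub-problem."""
    def cluster(points, k, iterations=10, seed=0):
        rng = random.Random(seed)
        centers = list(init(points, k, rng))
        labels = []
        for _ in range(iterations):
            # Assign every point to its nearest center.
            labels = [
                min(range(k), key=lambda j: distance(p, centers[j]))
                for p in points
            ]
            # Recompute each non-empty cluster's center.
            for j in range(k):
                members = [p for p, lab in zip(points, labels) if lab == j]
                if members:
                    centers[j] = update(members)
        return labels
    return cluster

def generate_algorithms(components):
    """Enumerate every combination of components (a Cartesian product)."""
    names = list(components)
    algorithms = {}
    for choice in itertools.product(*(components[n].items() for n in names)):
        key = tuple(label for label, _ in choice)
        kwargs = {n: fn for n, (_, fn) in zip(names, choice)}
        algorithms[key] = assemble(**kwargs)
    return algorithms

ALGORITHMS = generate_algorithms(COMPONENTS)

# Toy data: two well-separated groups of three points each.
DATA = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]

for key, algorithm in ALGORITHMS.items():
    print(key, algorithm(DATA, k=2))
```

With two options for each of three sub-problems, the sketch yields 2 × 2 × 2 = 8 algorithms. The count grows multiplicatively as components are added, which is how a modest component library can produce hundreds of candidate algorithms to evaluate.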



Acknowledgments

This research was partially funded by a grant from the German Academic Exchange Office (DAAD) and the Serbian Ministry of Science, Project-ID 50453023.

Author information


Corresponding author

Correspondence to Milan Vukićević.


About this article

Cite this article

Vukićević, M., Kirchner, K., Delibašić, B. et al. Finding best algorithmic components for clustering microarray data. Knowl Inf Syst 35, 111–130 (2013). https://doi.org/10.1007/s10115-012-0542-5


  • DOI: https://doi.org/10.1007/s10115-012-0542-5
