Finding best algorithmic components for clustering microarray data

Vukićević, Milan; Kirchner, Kathrin; Delibašić, Boris; Jovanović, Miloš; Ruhland, Johannes; Suknović, Milija

doi:10.1007/s10115-012-0542-5

Regular Paper
Published: 06 September 2012

Volume 35, pages 111–130, (2013)

Knowledge and Information Systems

Milan Vukićević¹,
Kathrin Kirchner²,
Boris Delibašić¹,
Miloš Jovanović¹,
Johannes Ruhland² &
…
Milija Suknović¹

Abstract

The analysis of microarray data is fundamental to microbiology. Although clustering has long been recognized as central to the discovery of gene functions and to disease diagnosis, constructing good algorithms has proved surprisingly difficult. In this paper, we address this problem with a component-based approach to clustering algorithm design for class retrieval from microarray data. The idea is to break existing algorithms into independent building blocks for typical sub-problems and to reassemble these blocks in new ways to generate as yet unexplored methods. As a test, 432 algorithms were generated and evaluated on published microarray data sets. The top performers were better than the original, component-providing ancestor algorithms and competitive with a set of recently proposed algorithms. Finally, we identified components that performed consistently well for clustering microarray data and that should be considered in the further development of clustering algorithms.
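
The component-based idea described in the abstract can be illustrated with a minimal sketch. The abstract does not list the actual sub-problems or components, so everything named here is an assumption chosen for illustration: three hypothetical sub-problems (`init`, `distance`, `update`), two hypothetical components for each, and a simple k-means-style loop that accepts any combination. Enumerating all combinations yields 8 algorithm variants in this toy setup; the 432 algorithms reported in the paper come from the authors' own component set, which is not reproduced here.

```python
import itertools
import random

# Hypothetical components for three assumed sub-problems of a
# partitioning clustering algorithm. None of these names come from the paper.

def init_random(points, k, rng):
    """Pick k distinct points at random as initial centers."""
    return rng.sample(points, k)

def init_first_k(points, k, rng):
    """Use the first k points as initial centers (deterministic)."""
    return list(points[:k])

def dist_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def dist_manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def update_mean(members):
    """New center is the coordinate-wise mean of the cluster members."""
    return tuple(sum(col) / len(members) for col in zip(*members))

def update_median(members):
    """New center is the coordinate-wise (upper) median of the members."""
    return tuple(sorted(col)[len(col) // 2] for col in zip(*members))

COMPONENTS = {
    "init": {"random": init_random, "first_k": init_first_k},
    "distance": {"euclidean": dist_euclidean, "manhattan": dist_manhattan},
    "update": {"mean": update_mean, "median": update_median},
}

def build_algorithm(init, distance, update):
    """Assemble one clustering algorithm from one component per sub-problem."""
    def cluster(points, k, iterations=10, seed=0):
        rng = random.Random(seed)
        centers = init(points, k, rng)
        labels = []
        for _ in range(iterations):
            # Assignment step: each point goes to its nearest center.
            labels = [
                min(range(k), key=lambda c: distance(p, centers[c]))
                for p in points
            ]
            # Update step: recompute each non-empty cluster's center.
            for c in range(k):
                members = [p for p, lab in zip(points, labels) if lab == c]
                if members:
                    centers[c] = update(members)
        return labels
    return cluster

def generate_algorithms(components=COMPONENTS):
    """Return one algorithm per combination of components, keyed by names."""
    slots = list(components)
    algorithms = {}
    for names in itertools.product(*(components[s] for s in slots)):
        chosen = {s: components[s][n] for s, n in zip(slots, names)}
        algorithms[names] = build_algorithm(**chosen)
    return algorithms

if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 10), (10, 11)]
    for names, algorithm in generate_algorithms().items():
        print(names, algorithm(data, k=2))
```

Because each variant is just a tuple of component names, the same evaluation can be run over every combination, and results can then be grouped by component to see which building blocks tend to perform well.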

Acknowledgments

This research was partially funded by a grant from the German Academic Exchange Service (DAAD) and the Serbian Ministry of Science, Project-ID 50453023.

Author information

Authors and Affiliations

Faculty of Organizational Sciences, University of Belgrade, Jove Ilića 154, Belgrade, Serbia
Milan Vukićević, Boris Delibašić, Miloš Jovanović & Milija Suknović
Faculty of Economics and Business Administration, Friedrich Schiller University of Jena, Carl-Zeiß Straße 3, Jena, Germany
Kathrin Kirchner & Johannes Ruhland

Corresponding author

Correspondence to Milan Vukićević.

About this article

Cite this article

Vukićević, M., Kirchner, K., Delibašić, B. et al. Finding best algorithmic components for clustering microarray data. Knowl Inf Syst 35, 111–130 (2013). https://doi.org/10.1007/s10115-012-0542-5

Received: 22 September 2011
Revised: 30 March 2012
Accepted: 14 August 2012
Published: 06 September 2012
Issue Date: April 2013
DOI: https://doi.org/10.1007/s10115-012-0542-5
