Abstract
DNA microarray analysis is an important technology in genetic research for exploring and recognizing possible genomic features of many diseases. Because it is a high-throughput technology, it requires advanced tools for dimensionality reduction of massive data sets. Clustering is among the most appropriate tools for mining these data, although it suffers from the following problems: instability of the results, a large number of genes compared with the number of samples, a high noise level, complexity of initialization, and the need to group genes and samples simultaneously. Almost all of these problems can be addressed by novel techniques such as biclustering. In this paper, a new biclustering algorithm is proposed, hereafter denoted the combinatorial biclustering algorithm (CBA), that addresses the problems listed above. The algorithm analyzes the data to find biclusters of a desired size and allowable error. The performance of CBA is compared with that of other biclustering algorithms by discussing the output of the different methods when run on a synthetic data set. CBA appears to perform better, and for this reason it has also been applied to a real data set. In particular, CBA was used to analyze the transcriptional profiles of 38 gastric cancer tissues with microsatellite instability (MSI) and without it (microsatellite stable, MSS). The results clearly show much more coherent gene expression behavior in normal tissues than in tumoral ones. The high level of gene misregulation in tumoral tissues affects any further bicluster analysis, and it is only partially smoothed in the MSI/MSS study, even when a much higher initial admissible error is allowed.
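To make the idea of a bicluster "of the desired size and allowable error" concrete, the following is a minimal sketch of an acceptance test for a candidate bicluster. The abstract does not state which error measure CBA uses, so the Cheng-Church mean squared residue is swapped in here purely as an assumption. The function names `mean_squared_residue` and `is_admissible` and the parameters `min_rows`, `min_cols`, and `max_error` are hypothetical and are not taken from the paper.

```python
import numpy as np


def mean_squared_residue(sub):
    """Cheng-Church mean squared residue of a submatrix.

    ASSUMPTION: used as a stand-in error measure; the abstract does not
    say which measure CBA uses. The value is 0 for a perfectly additive
    pattern (entry = row effect + column effect).
    """
    sub = np.asarray(sub, dtype=float)
    row_means = sub.mean(axis=1, keepdims=True)
    col_means = sub.mean(axis=0, keepdims=True)
    residue = sub - row_means - col_means + sub.mean()
    return float(np.mean(residue ** 2))


def is_admissible(data, rows, cols, min_rows, min_cols, max_error):
    """Hypothetical test: candidate is large enough and coherent enough."""
    if len(rows) < min_rows or len(cols) < min_cols:
        return False
    sub = np.asarray(data, dtype=float)[np.ix_(rows, cols)]
    return mean_squared_residue(sub) <= max_error


# Toy expression matrix (genes x samples) with a planted additive bicluster.
rng = np.random.default_rng(0)
data = rng.normal(size=(6, 5))
gene_effect = np.array([[1.0], [2.0], [3.0]])
sample_effect = np.array([[0.0, 0.5, 1.0]])
data[np.ix_([0, 1, 2], [1, 2, 3])] = gene_effect + sample_effect

planted_ok = is_admissible(data, [0, 1, 2], [1, 2, 3],
                           min_rows=3, min_cols=3, max_error=1e-6)
print(planted_ok)
```

In this sketch, raising `max_error` admits noisier candidates, which mirrors the trade-off the abstract describes for the MSI/MSS study.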
Acknowledgments
This work is supported by Istituto Nazionale di Alta Matematica Francesco Severi (INdAM) with the scholarship N U 2007/000458 07/09/2007.
Cite this article
Nosova, E., Napolitano, F., Amato, R. et al. An improved combinatorial biclustering algorithm. Neural Comput & Applic 22 (Suppl 1), 293–302 (2013). https://doi.org/10.1007/s00521-012-0902-9
DOI: https://doi.org/10.1007/s00521-012-0902-9