Abstract
The advent of microarray technology enables us to monitor an entire genome on a single chip using a systematic approach. Clustering, a widely used data mining approach, has been used to discover phenotypes from raw expression data. However, traditional clustering algorithms are limited because they cannot identify the substructures of samples and features hidden in the data. Unlike clustering, biclustering is a new methodology for discovering genes that are highly related to a subset of samples. Several biclustering models and methods have been presented and used for clinical tumor diagnosis and pathological research. In this paper, we present a new biclustering model using Binary Matrix Factorization (BMF). BMF is a new variant of non-negative matrix factorization (NMF). We begin by proving a new boundedness property of NMF. We then present two algorithms that implement the model and compare them. We show that the microarray data biclustering problem can be formulated as a BMF problem and can be solved effectively using our proposed algorithms. Unlike algorithms based on a greedy strategy, our proposed algorithms for BMF are more likely to find the global optima. Experimental results on synthetic and real datasets demonstrate the advantages of BMF over existing biclustering methods. Besides its attractive clustering performance, BMF can generate sparse results (i.e., the number of genes/features involved in each bicluster is very small relative to the total number of genes/features), which is in accordance with common practice in molecular biology.
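To make the BMF formulation concrete, the following minimal sketch assumes that the expression matrix X has already been discretized to 0/1 values and that a binary factorization X ≈ WH is given. It does not implement either of the algorithms proposed in the paper; it only illustrates how a binary factor pair can be read as a set of biclusters and scored by the squared reconstruction error. The function names, the toy matrices, and the reading of column k of W and row k of H as the genes and samples of bicluster k are illustrative assumptions.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def matmul(W: Matrix, H: Matrix) -> Matrix:
    """Ordinary integer matrix product W @ H."""
    inner = len(H)
    cols = len(H[0])
    return [[sum(row[k] * H[k][j] for k in range(inner)) for j in range(cols)]
            for row in W]


def squared_error(X: Matrix, W: Matrix, H: Matrix) -> int:
    """Squared Frobenius error ||X - WH||_F^2 of the factorization."""
    WH = matmul(W, H)
    return sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry))


def biclusters(W: Matrix, H: Matrix) -> List[Tuple[List[int], List[int]]]:
    """Assumed reading: bicluster k = (genes i with W[i][k] == 1,
    samples j with H[k][j] == 1)."""
    result = []
    for k in range(len(H)):
        genes = [i for i, row in enumerate(W) if row[k] == 1]
        samples = [j for j, value in enumerate(H[k]) if value == 1]
        result.append((genes, samples))
    return result


# Hypothetical toy data: 5 genes (rows) x 4 samples (columns), already binary.
X = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
# Hypothetical binary factors with two components.
W = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0]]
H = [[1, 1, 0, 0], [0, 0, 1, 1]]

print(squared_error(X, W, H))  # 0: the factors reproduce X exactly
print(biclusters(W, H))        # [([0, 1], [0, 1]), ([2, 3], [2, 3])]
```

In this toy case gene 4 belongs to no bicluster, which is the kind of sparsity the abstract refers to: each bicluster involves only a small fraction of the genes.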
Additional information
Responsible editor: Pierre Baldi.
Cite this article
Zhang, ZY., Li, T., Ding, C. et al. Binary matrix factorization for analyzing gene expression data. Data Min Knowl Disc 20, 28–52 (2010). https://doi.org/10.1007/s10618-009-0145-2
DOI: https://doi.org/10.1007/s10618-009-0145-2