Binary matrix factorization for analyzing gene expression data

Data Mining and Knowledge Discovery

Abstract

The advent of microarray technology enables us to monitor an entire genome on a single chip using a systematic approach. Clustering, a widely used data mining approach, has been used to discover phenotypes from raw expression data. However, traditional clustering algorithms are limited because they cannot identify the substructures of samples and features hidden in the data. Unlike clustering, biclustering is a newer methodology for discovering genes that are highly related to a subset of samples. Several biclustering models and methods have been presented and used for tumor clinical diagnosis and pathological research. In this paper, we present a new biclustering model using Binary Matrix Factorization (BMF). BMF is a new variant rooted in non-negative matrix factorization (NMF). We begin by proving a new boundedness property of NMF. We then present two different algorithms that implement the model, together with a comparison of the two. We show that the microarray data biclustering problem can be formulated as a BMF problem and can be solved effectively using our proposed algorithms. Unlike greedy strategy-based algorithms, our proposed algorithms for BMF are more likely to find the global optima. Experimental results on synthetic and real datasets demonstrate the advantages of BMF over existing biclustering methods. Besides its attractive clustering performance, BMF can generate sparse results (i.e., the number of genes/features involved in each biclustering structure is very small relative to the total number of genes/features), in accordance with common practice in molecular biology.
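To make the formulation concrete, the following minimal sketch illustrates how a binary factorization X ≈ WH can be read as a set of biclusters. It does not reproduce either of the paper's two algorithms, which the abstract does not specify; the toy matrix, the hand-chosen factors, and the helper names `matmul`, `biclusters`, and `squared_error` are all assumptions made for illustration. Rows stand for genes and columns for samples.

```python
from typing import List, Set, Tuple

Matrix = List[List[int]]


def matmul(W: Matrix, H: Matrix) -> Matrix:
    """Ordinary integer matrix product of W (n x k) and H (k x m)."""
    k = len(H)
    n_cols = len(H[0])
    return [
        [sum(row[t] * H[t][j] for t in range(k)) for j in range(n_cols)]
        for row in W
    ]


def biclusters(W: Matrix, H: Matrix) -> List[Tuple[Set[int], Set[int]]]:
    """Read one bicluster per factor t: genes with W[g][t] == 1,
    paired with samples with H[t][s] == 1."""
    result = []
    for t in range(len(H)):
        genes = {g for g, row in enumerate(W) if row[t] == 1}
        samples = {s for s, value in enumerate(H[t]) if value == 1}
        result.append((genes, samples))
    return result


def squared_error(X: Matrix, W: Matrix, H: Matrix) -> int:
    """Squared Frobenius error between X and the product WH."""
    R = matmul(W, H)
    return sum(
        (x - r) ** 2
        for x_row, r_row in zip(X, R)
        for x, r in zip(x_row, r_row)
    )


# Hypothetical binarized expression matrix: 5 genes x 4 samples.
X = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]

# Hand-chosen binary factors with k = 2 (assumed, not computed by any algorithm).
W = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0]]
H = [[1, 1, 0, 0], [0, 0, 1, 1]]

if __name__ == "__main__":
    print("reconstruction error:", squared_error(X, W, H))
    for genes, samples in biclusters(W, H):
        print("genes", sorted(genes), "samples", sorted(samples))
```

In this toy case each column of W selects a small gene set and the matching row of H selects the samples in which those genes are active, so sparsity in W corresponds directly to biclusters that involve few genes. Finding such factors for real data is the hard part, and that is what the paper's algorithms address.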

Author information

Corresponding author

Correspondence to Tao Li.

Additional information

Responsible editor: Pierre Baldi.

About this article

Cite this article

Zhang, ZY., Li, T., Ding, C. et al. Binary matrix factorization for analyzing gene expression data. Data Min Knowl Disc 20, 28–52 (2010). https://doi.org/10.1007/s10618-009-0145-2

  • DOI: https://doi.org/10.1007/s10618-009-0145-2
