Binary matrix factorization for analyzing gene expression data

Zhang, Zhong-Yuan; Li, Tao; Ding, Chris; Ren, Xian-Wen; Zhang, Xiang-Sun

doi:10.1007/s10618-009-0145-2

Published: 02 September 2009

Data Mining and Knowledge Discovery, Volume 20, pages 28–52 (2010)

Abstract

The advent of microarray technology enables us to monitor an entire genome on a single chip using a systematic approach. Clustering, a widely used data mining approach, has been used to discover phenotypes from the raw expression data. However, traditional clustering algorithms are limited because they cannot identify the substructures of samples and features hidden in the data. Unlike clustering, biclustering is a methodology for discovering genes that are highly related to a subset of samples. Several biclustering models and methods have been presented and used for tumor clinical diagnosis and pathological research. In this paper, we present a new biclustering model using Binary Matrix Factorization (BMF). BMF is a new variant rooted in non-negative matrix factorization (NMF). We begin by proving a new boundedness property of NMF. We then present two different algorithms that implement the model, together with a comparison of the two. We show that the microarray data biclustering problem can be formulated as a BMF problem and can be solved effectively using our proposed algorithms. Unlike greedy strategy-based algorithms, our proposed algorithms for BMF are more likely to find the global optima. Experimental results on synthetic and real datasets demonstrate the advantages of BMF over existing biclustering methods. Besides the attractive clustering performance, BMF can generate sparse results (i.e., the number of genes/features involved in each biclustering structure is very small relative to the total number of genes/features), which is in accordance with common practice in molecular biology.
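To make the formulation in the abstract concrete, the following is a minimal sketch of how a bicluster structure might be read off a binary factorization. It assumes the usual reading of BMF, in which a binarized gene-by-sample matrix X is approximated by a product WH with both W and H binary. The toy matrices and the helper names (`matmul`, `biclusters`, `reconstruction_error`) are hypothetical and are not taken from the paper. The sketch only illustrates the model; it does not implement either of the two algorithms the authors propose.

```python
from typing import List, Set, Tuple

Matrix = List[List[int]]


def matmul(w: Matrix, h: Matrix) -> Matrix:
    """Plain integer matrix product W @ H."""
    rank = len(h)
    n_cols = len(h[0])
    return [
        [sum(row[r] * h[r][j] for r in range(rank)) for j in range(n_cols)]
        for row in w
    ]


def biclusters(w: Matrix, h: Matrix) -> List[Tuple[Set[int], Set[int]]]:
    """Read bicluster r as (genes with W[i][r] == 1, samples with H[r][j] == 1)."""
    result = []
    for r in range(len(h)):
        genes = {i for i, row in enumerate(w) if row[r] == 1}
        samples = {j for j, value in enumerate(h[r]) if value == 1}
        result.append((genes, samples))
    return result


def reconstruction_error(x: Matrix, w: Matrix, h: Matrix) -> int:
    """Squared Frobenius error ||X - WH||^2."""
    wh = matmul(w, h)
    return sum(
        (x[i][j] - wh[i][j]) ** 2
        for i in range(len(x))
        for j in range(len(x[0]))
    )


# Hypothetical binarized expression matrix: 5 genes (rows) x 4 samples (columns).
X = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hypothetical binary factors of rank 2: W is genes x 2, H is 2 x samples.
W = [[1, 0], [1, 0], [0, 0], [0, 1], [0, 1]]
H = [[1, 1, 0, 0], [0, 0, 1, 1]]

found = biclusters(W, H)
error = reconstruction_error(X, W, H)
print(found)
print(error)
```

In this toy case each column of W, paired with the matching row of H, marks one block of genes and samples. Gene 2 belongs to no bicluster, which is a small-scale picture of the sparsity the abstract describes.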

Author information

Authors and Affiliations

School of Statistics, Central University of Finance and Economics, Beijing, People’s Republic of China
Zhong-Yuan Zhang
School of Computing and Information Sciences, Florida International University, Miami, FL, USA
Tao Li
Department of Computer Science and Engineering, University of Texas, Arlington, TX, USA
Chris Ding
Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, People’s Republic of China
Xian-Wen Ren & Xiang-Sun Zhang

Corresponding author

Correspondence to Tao Li.

Additional information

Responsible editor: Pierre Baldi.

About this article

Cite this article

Zhang, ZY., Li, T., Ding, C. et al. Binary matrix factorization for analyzing gene expression data. Data Min Knowl Disc 20, 28–52 (2010). https://doi.org/10.1007/s10618-009-0145-2

Received: 31 March 2008
Accepted: 03 August 2009
Published: 02 September 2009
Issue Date: January 2010
DOI: https://doi.org/10.1007/s10618-009-0145-2
