Abstract
Non-negative matrix factorization (NMF) has become a standard tool in data mining, information retrieval, and signal processing. It factorizes a non-negative data matrix into two non-negative matrix factors that contain basis elements and linear coefficients, respectively. Often, the columns of the first factor are interpreted as “cluster centroids” of the input data, and the columns of the second factor are understood to contain cluster membership indicators. When analyzing data such as collections of gene expressions, documents, or images, it is often beneficial to ensure that the resulting cluster centroids are meaningful, for instance, by restricting them to be convex combinations of data points. However, known approaches to convex-NMF suffer from high computational costs and are therefore hardly applicable to large-scale data analysis problems. This paper presents a new framework for convex-NMF that allows for an efficient factorization of data matrices with millions of data points. Motivated by the simple observation that each data point can be expressed as a convex combination of vertices of the data convex hull, we require the basic factors to be vertices of the data convex hull. The benefits of convex-hull NMF are twofold. First, as the number of data points grows, the expected size of the convex hull, i.e., the number of its vertices, grows much more slowly than the dataset. Second, distance-preserving low-dimensional embeddings allow us to efficiently sample the convex hull and hence to quickly determine candidate vertices. Our extensive experimental evaluation on large datasets shows that convex-hull NMF compares favorably to convex-NMF in terms of both speed and reconstruction quality. We demonstrate that our method can easily be applied to large-scale, real-world datasets, in our case consisting of 750,000 DBLP entries, 4,000,000 digital images, and 150,000,000 votes on World of Warcraft® guilds, respectively.
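The observation underlying the method, that every data point is a convex combination of convex-hull vertices, can be illustrated with a minimal two-dimensional sketch. This is not the authors' algorithm: the exact hull computation (Andrew's monotone chain), the fan triangulation, and the function names `convex_hull` and `convex_weights` are assumptions chosen for illustration, whereas the paper samples candidate vertices through low-dimensional embeddings.

```python
import random


def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def convex_weights(p, hull, eps=1e-9):
    """Return one weight per hull vertex (non-negative, summing to 1) that
    reproduces p. Only three weights are non-zero: the barycentric coordinates
    of p in the fan triangle (hull[0], hull[i], hull[i + 1]) that contains it."""
    a = hull[0]
    for i in range(1, len(hull) - 1):
        b, c = hull[i], hull[i + 1]
        area = cross(a, b, c)
        if area <= 0:
            continue  # degenerate triangle, skipped defensively
        wa = cross(b, c, p) / area
        wb = cross(c, a, p) / area
        wc = cross(a, b, p) / area
        if min(wa, wb, wc) >= -eps:
            weights = [0.0] * len(hull)
            weights[0], weights[i], weights[i + 1] = wa, wb, wc
            return weights
    raise ValueError("point lies outside the hull")


# Hypothetical toy data: uniform random points in the unit square.
random.seed(0)
for n in (100, 1000, 10000):
    sample = [(random.random(), random.random()) for _ in range(n)]
    print(n, "points ->", len(convex_hull(sample)), "hull vertices")
```

In this sketch the hull vertices play the role of candidate basis elements, and the weight vectors returned by `convex_weights` play the role of the coefficient columns. The printed vertex counts give an informal feel for how slowly the hull grows relative to the dataset on this toy distribution.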
Cite this article
Thurau, C., Kersting, K., Wahabzada, M. et al. Convex non-negative matrix factorization for massive datasets. Knowl Inf Syst 29, 457–478 (2011). https://doi.org/10.1007/s10115-010-0352-6
DOI: https://doi.org/10.1007/s10115-010-0352-6