Abstract
Climate change, the global energy footprint, and strategies for sustainable development have become topics of considerable political and public interest. The public debate is informed by an exponentially growing amount of data and there are diverse partisan interest when it comes to interpretation. We therefore believe that data analysis methods are called for that provide results which are intuitively understandable even to non-experts. Moreover, such methods should be efficient so that non-experts users can perform their own analysis at low expense in order to understand the effects of different parameters and influential factors. In this paper, we discuss a new technique for factorizing data matrices that meets both these requirements. The basic idea is to represent a set of data by means of convex combinations of extreme data points. This often accommodates human cognition. In contrast to established factorization methods, the approach presented in this paper can also determine over-complete bases. At the same time, convex combinations allow for highly efficient matrix factorization. Based on techniques adopted from the field of distance geometry, we derive a linear time algorithm to determine suitable basis vectors for factorization. By means of the example of several environmental and developmental data sets we discuss the performance and characteristics of the proposed approach and validate that significant efficiency gains are obtainable without performance decreases compared to existing convexity constrained approaches.
Similar content being viewed by others
References
Achlioptas D, McSherry F (2007) Fast computation of low-rank matrix approximations. J ACM 54(9): 1–19
Aguilar O, Huerta G, Prado R, West M (1998) Bayesian inference on latent structure in time series. In: Bernardo J, Bergen J, Dawid A, Smith A (eds) Bayesian statistics. Oxford University Press, Oxford
Blumenthal LM (1953) Theory and applications of distance geometry. Oxford University Press, Oxford
Chan B, Mitchell D, Cram L (2003) Archetypal analysis of galaxy spectra. Mon Not R Astron Soc 338(3): 790–795
Chang CI, Wu CC, Liu WM, Ouyang YC (2006) A new growing method for simplex-based endmember extraction algorithm. IEEE T Geosci Remote 44(10): 2804–2819
Crippen G (1988) Distance geometry and molecular conformation. Wiley, New York
Cutler A, Breiman L (1994) Archetypal analysis. Technometrics 36(4): 338–347
Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1): 107–113
Ding C, Li T, Jordan M (2010) Convex and semi-nonnegative matrix factorizations. IEEE T Pattern Anal 32(1): 45–55
Drineas P, Kannan R, Mahoney M (2006) Fast Monte Carlo algorithms III: computing a compressed approixmate matrix decomposition. SIAM J Comput 36(1): 184–206
Faloutsos C, Lin KI (1995) FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In: Proceedings of the ACM SIGMOD international conference on management of data, San Diego
Foster D, Nascimento S, Amano K (2004) Information limits on neural identification of coloured surfaces in natural scenes. Visual Neurosci 21: 331–336
Gomes C (2009) Computational sustainability. The Bridge, National Academy of Engineering 39(4): 6–11
Goreinov SA, Tyrtyshnikov EE (2001) The maximum-volume concept in approximation by low-rank matrices. Contemp Math 280: 47–51
Hotelling H (1933) Analysis of a complex of statistical variables into principal components. J Educ Psychol 24(7): 498–520
Kersting K, Wahabzada M, Thurau C, Bauckhage C (2010) Hierarchical convex NMF for clustering massive data. In: Proceedings of the 2nd Asian Conference on Machine Learning (ACML-10)
Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401(6755): 788–799
Lucas A, Klaassen P, Spreij P, Straetmans S (2003) Tail behaviour of credit loss distributions for general latent factor models. Appl Math Finance 10(4): 337–357
MacKay D (2009) Sustainable energy—without the hot air. UIT Cambridge Ltd, Cambridge
Miao L, Qi H (2007) Endmember extraction from highly mixed data using minimum volume constrained nonnegative matrix factorization. IEEE T Geosci Remote 45(3): 765–777
Nascimento JMP, Dias JMB (2005) Vertex component analysis: a fast algorithm to unmix hyperspectral data. IEEE T Geosci Remote 43(4): 898–910
Ostrouchov G, Samatova N (2005) On fastmap and the convex hull of multivariate data: toward fast and robust dimension reduction. IEEE T Pattern Anal 27(8): 1340–1434
Sippl M, Sheraga H (1986) Cayley-Menger coordinates. Proc Natl Acad Sci 83(8): 2283–2287
Spearman C (1904) General intelligence objectively determined and measured. Am J Psychol 15: 201–293
Thurau C, Kersting K, Bauckhage C (2009) Convex non-negative matrix factorization in the wild. In: Proceedings of the IEEE International Conference on Data Mining, Miami
Thurau C, Kersting K, Wahabzada M, Bauckhage C (2010) Convex non-negative matrix factorization for massive datasets. Knowl Inf Syst (KAIS). doi:10.1007/s10115-010-0352-6
Winter ME (1999) N-FINDR: an algorithm for fast and autonomous spectral endmember determination in hyperspectral data. In: Proceedings of the International Conference on Applied Geologic Remote Sensing, Vancouver
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: Katharina Morik, Kanishka Bhaduri and Hillol Kargupta.
Rights and permissions
About this article
Cite this article
Thurau, C., Kersting, K., Wahabzada, M. et al. Descriptive matrix factorization for sustainability Adopting the principle of opposites. Data Min Knowl Disc 24, 325–354 (2012). https://doi.org/10.1007/s10618-011-0216-z
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-011-0216-z