
Exemplar-based low-rank matrix decomposition for data clustering

Published in: Data Mining and Knowledge Discovery

Abstract

Digital data is now accumulated faster than ever in science, engineering, biomedicine, and real-world sensing. This ubiquitous combination of massive data and sparse information poses considerable challenges for data mining research. In this paper, we propose a theoretical framework, Exemplar-based low-rank sparse matrix decomposition (EMD), for clustering large-scale datasets. Capitalizing on recent advances in matrix approximation and decomposition, EMD efficiently partitions datasets of high dimensionality and scalable size. Specifically, given a data matrix, EMD first computes a representative data subspace and a near-optimal low-rank approximation. The cluster centroids and indicators are then obtained through a matrix decomposition in which the cluster centroids are constrained to lie within the representative data subspace. By selecting representative exemplars, we obtain a compact “sketch” of the data, which makes the clustering highly efficient and robust to noise. In addition, the clustering results are sparse and easy to interpret. From a theoretical perspective, we prove the correctness and convergence of the EMD algorithm, and provide a detailed analysis of its efficiency in terms of both running time and space requirements. Through extensive experiments on both synthetic and real datasets, we demonstrate the performance of EMD for clustering large-scale data.
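The two-stage pipeline the abstract describes, selecting exemplar columns to form a representative subspace and near-optimal low-rank "sketch", then clustering with centroids confined to that subspace, can be sketched roughly as follows. This is an illustrative numpy approximation only, not the authors' EMD algorithm: the function names (`select_exemplars`, `cx_approximation`, `lloyd_kmeans`), the norm-based column sampling, and the plain k-means stand-in for the decomposition step are all our own simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_exemplars(X, c):
    # Sample c distinct columns of X with probability proportional to
    # their squared norms (norm-based column sampling, in the spirit of
    # Monte-Carlo CX-style approximations).
    p = (X ** 2).sum(axis=0)
    p = p / p.sum()
    idx = rng.choice(X.shape[1], size=c, replace=False, p=p)
    return X[:, idx]

def cx_approximation(X, C):
    # Project every column of X onto span(C): X ~= C @ pinv(C) @ X.
    # The exemplar columns C play the role of the representative subspace.
    return C @ (np.linalg.pinv(C) @ X)

def lloyd_kmeans(X, k, iters=50):
    # Plain Lloyd's k-means over the columns of X; a simplified stand-in
    # for the decomposition step that yields centroids and indicators.
    n = X.shape[1]
    centroids = X[:, rng.choice(n, size=k, replace=False)].copy()
    labels = np.zeros(n, dtype=int)
    for _ in range(iters):
        # (n, k) squared distances between data columns and centroids.
        d = ((X[:, :, None] - centroids[:, None, :]) ** 2).sum(axis=0)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = X[:, labels == j]
            if members.shape[1] > 0:
                centroids[:, j] = members.mean(axis=1)
    return centroids, labels

# Example: the columns of X lie in a 2-D subspace, so two linearly
# independent exemplars make the sketch exact.
X = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 200))
C = select_exemplars(X, 2)
X_sketch = cx_approximation(X, C)
```

Clustering then operates on `X_sketch` (or its reduced coordinates) rather than the full data, which is the source of the efficiency and noise robustness claimed above.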


[Figs. 1–10 appear in the full text.]

Notes

  1. http://people.csail.mit.edu/jrennie/20Newsgroups/.

  2. http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html.

  3. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/.

  4. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.


Author information


Corresponding author

Correspondence to Lijun Wang.

Additional information

Responsible editor: Sugato Basu.


Cite this article

Wang, L., Dong, M. Exemplar-based low-rank matrix decomposition for data clustering. Data Min Knowl Disc 29, 324–357 (2015). https://doi.org/10.1007/s10618-014-0347-0

