Abstract
With advances in data collection technologies, multiple data sources are becoming increasingly prominent in many applications. Clustering from multiple data sources has emerged as a topic of critical significance in the data mining and machine learning community. Different data sources provide knowledge at different levels of detail, so combining them is pivotal to the clustering process. In reality, however, the data usually exhibit heterogeneity and incompleteness. The key challenge is how to effectively integrate information from multiple heterogeneous sources in the presence of missing data. Conventional methods mainly focus on clustering heterogeneous data in which all sources are complete, or in which at least one source has no missing values. In this paper, we propose a more general framework, T-MIC (Tensor based Multi-source Incomplete data Clustering), to integrate multiple incomplete data sources. Specifically, we first use the kernel matrices to form an initial tensor across all sources. We then formulate a joint tensor factorization process with a sparsity constraint and use it to iteratively push the initial tensor towards a quality-driven exploration of the latent factors, taking into account the uncertainty of the missing data. Finally, these factors serve as features for clustering. Extensive experiments on both synthetic and real datasets demonstrate that the proposed approach can effectively boost clustering performance, even with large amounts of missing data.
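The pipeline outlined in the abstract (kernel matrices stacked into a tensor, a sparse factorization that accounts for missing entries, factors used as clustering features) can be illustrated with a minimal sketch. This is not the T-MIC algorithm itself, whose objective and update rules are not given here. It rests on several assumptions: an RBF kernel, NaN rows marking instances missing from a source, a plain CP model fitted by alternating least squares with missing entries re-imputed from the current model, and soft-thresholding as a heuristic stand-in for the sparsity constraint. All function and parameter names are hypothetical.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """RBF kernel matrix between the rows of X (assumed kernel choice)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def build_kernel_tensor(sources, gamma=1.0):
    """Stack per-source kernels into an (n, n, V) tensor with an observation mask.
    An instance counts as missing in a source if its row contains NaN."""
    n, V = sources[0].shape[0], len(sources)
    T = np.zeros((n, n, V))
    M = np.zeros((n, n, V), dtype=bool)
    for v, X in enumerate(sources):
        idx = np.flatnonzero(~np.isnan(X).any(axis=1))
        T[:, :, v][np.ix_(idx, idx)] = rbf_kernel(X[idx], gamma)
        M[:, :, v][np.ix_(idx, idx)] = True
    return T, M

def khatri_rao(P, Q):
    """Column-wise Kronecker product, shape (rows(P) * rows(Q), R)."""
    return (P[:, None, :] * Q[None, :, :]).reshape(-1, P.shape[1])

def soft_threshold(F, lam):
    """Shrink entries towards zero (heuristic sparsity step)."""
    return np.sign(F) * np.maximum(np.abs(F) - lam, 0.0)

def als_update(Xn, P, Q, ridge):
    """Least-squares update of one factor given the other two."""
    G = (P.T @ P) * (Q.T @ Q) + ridge * np.eye(P.shape[1])
    return np.linalg.solve(G, (Xn @ khatri_rao(P, Q)).T).T

def masked_sparse_cp(T, M, rank, lam=0.01, n_iter=50, ridge=1e-6, seed=0):
    """CP factorization T ~ sum_r a_r o b_r o c_r fitted on observed entries.
    Unobserved entries are replaced by the current model estimate each pass."""
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A, B, C = rng.random((I, rank)), rng.random((J, rank)), rng.random((K, rank))
    for _ in range(n_iter):
        X = np.where(M, T, np.einsum('ir,jr,kr->ijk', A, B, C))
        A = soft_threshold(als_update(X.reshape(I, -1), B, C, ridge), lam)
        B = soft_threshold(
            als_update(X.transpose(1, 0, 2).reshape(J, -1), A, C, ridge), lam)
        C = als_update(X.transpose(2, 0, 1).reshape(K, -1), A, B, ridge)
    return A, B, C

# Toy data: two sources, six instances, one instance missing from each source.
rng = np.random.default_rng(0)
labels = np.array([0, 0, 0, 1, 1, 1])
src1 = 0.1 * rng.normal(size=(6, 3)) + 5.0 * labels[:, None]
src2 = 0.1 * rng.normal(size=(6, 2)) - 5.0 * labels[:, None]
src1[0] = np.nan   # instance 0 is absent from source 1
src2[5] = np.nan   # instance 5 is absent from source 2

T, M = build_kernel_tensor([src1, src2], gamma=0.5)
A, B, C = masked_sparse_cp(T, M, rank=2)
```

Each row of `A` is a low-dimensional representation of one instance, including instances missing from some sources, and could be passed to any standard clustering routine such as k-means. The sketch does not enforce the symmetry of the kernel slices (`A` and `B` are fitted independently), which a more careful formulation might exploit.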
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Shao, W., He, L., Yu, P.S. (2015). Clustering on Multi-source Incomplete Data via Tensor Modeling and Factorization. In: Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D., Motoda, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2015. Lecture Notes in Computer Science, vol 9078. Springer, Cham. https://doi.org/10.1007/978-3-319-18032-8_38
DOI: https://doi.org/10.1007/978-3-319-18032-8_38
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18031-1
Online ISBN: 978-3-319-18032-8
eBook Packages: Computer Science (R0)