Clustering on Multi-source Incomplete Data via Tensor Modeling and Factorization

Shao, Weixiang; He, Lifang; Yu, Philip S.

doi:10.1007/978-3-319-18032-8_38


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9078)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining


Abstract

With advances in data collection technologies, multiple data sources are assuming increasing prominence in many applications. Clustering from multiple data sources has emerged as a topic of critical significance in the data mining and machine learning community. Different data sources provide different levels of necessary detail, so combining multiple data sources is pivotal to the clustering process. In reality, however, the data usually exhibit heterogeneity and incompleteness. The key challenge is how to effectively integrate information from multiple heterogeneous sources in the presence of missing data. Conventional methods mainly focus on clustering heterogeneous data when all sources carry full information, or when at least one source has no missing values. In this paper, we propose a more general framework, T-MIC (Tensor based Multi-source Incomplete data Clustering), to integrate multiple incomplete data sources. Specifically, we first use the kernel matrices to form an initial tensor across all the sources. We then formulate a joint tensor factorization process with a sparsity constraint and use it to iteratively push the initial tensor towards a quality-driven exploration of the latent factors, taking into account the uncertainty of the missing data. Finally, these factors serve as features for clustering. Extensive experiments on both synthetic and real datasets demonstrate that our proposed approach can effectively boost clustering performance, even with large amounts of missing data.
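The first step described in the abstract, forming an initial tensor from per-source kernel matrices, might look roughly like the sketch below. It is an illustration under stated assumptions, not the T-MIC implementation: the RBF kernel, the use of None to mark an instance missing from a source, the observation mask, and the mean-based initial fill are all choices made here for illustration, and the function names are hypothetical. The sparse joint factorization and the final clustering step are not shown.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) similarity between two feature vectors (assumed kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def build_kernel_tensor(sources, gamma=1.0):
    """Stack one n x n kernel matrix per source into an n x n x m tensor.

    `sources` is a list of m lists, each holding n feature vectors. An
    instance missing from a source is given as None (assumed convention).
    Returns (tensor, mask); mask[i][j][s] is True when entry (i, j) of
    source s was computed from observed data.
    """
    n, m = len(sources[0]), len(sources)
    tensor = [[[0.0] * m for _ in range(n)] for _ in range(n)]
    mask = [[[False] * m for _ in range(n)] for _ in range(n)]
    for s, data in enumerate(sources):
        for i in range(n):
            for j in range(n):
                if data[i] is not None and data[j] is not None:
                    tensor[i][j][s] = rbf_kernel(data[i], data[j], gamma)
                    mask[i][j][s] = True
    return tensor, mask

def fill_missing_with_mean(tensor, mask):
    """Initialise unobserved entries with the mean over the sources that
    observed the same instance pair (an assumed starting point only)."""
    n, m = len(tensor), len(tensor[0][0])
    filled = [[cell[:] for cell in row] for row in tensor]
    for i in range(n):
        for j in range(n):
            seen = [tensor[i][j][s] for s in range(m) if mask[i][j][s]]
            if seen:
                mean = sum(seen) / len(seen)
                for s in range(m):
                    if not mask[i][j][s]:
                        filled[i][j][s] = mean
    return filled

# Two toy sources over three instances; instance 1 is missing from source_b.
source_a = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
source_b = [[1.0], None, [9.0]]
tensor, mask = build_kernel_tensor([source_a, source_b])
filled = fill_missing_with_mean(tensor, mask)
```

Keeping the mask separate from the filled tensor lets a later factorization step treat the filled values as provisional rather than observed, which is one plausible way to account for the uncertainty of the missing data.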

Author information

Authors and Affiliations

Department of Computer Science, University of Illinois at Chicago, Chicago, IL, USA
Weixiang Shao & Philip S. Yu
Institute for Computer Vision, Shenzhen University, Shenzhen, China
Lifang He
Institute for Data Science, Tsinghua University, Beijing, China
Philip S. Yu

Corresponding author

Correspondence to Lifang He.

Editor information

Editors and Affiliations

Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
Tru Cao
Singapore Management University, Singapore, Singapore
Ee-Peng Lim
Nanjing University, Nanjing, China
Zhi-Hua Zhou
Japan Advanced Institute of Science and Technology, Nomi City, Japan
Tu-Bao Ho
The University of Hong Kong, Hong Kong, Hong Kong SAR
David Cheung
Osaka University, Osaka, Japan
Hiroshi Motoda

Cite this paper

Shao, W., He, L., Yu, P.S. (2015). Clustering on Multi-source Incomplete Data via Tensor Modeling and Factorization. In: Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D., Motoda, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2015. Lecture Notes in Computer Science, vol 9078. Springer, Cham. https://doi.org/10.1007/978-3-319-18032-8_38

DOI: https://doi.org/10.1007/978-3-319-18032-8_38
Published: 09 May 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18031-1
Online ISBN: 978-3-319-18032-8
eBook Packages: Computer Science, Computer Science (R0)
