Clustering Pairwise Distances with Missing Data: Maximum Cuts Versus Normalized Cuts

Poland, Jan; Zeugmann, Thomas

doi:10.1007/11893318_21

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4265)

Included in the following conference series:

International Conference on Discovery Science

Abstract

Clustering algorithms based on a matrix of pairwise similarities (kernel matrix) for the data are widely known and used, a particularly popular class being spectral clustering algorithms. In contrast, algorithms that work directly with the pairwise distance matrix have rarely been studied for clustering. This is surprising, since in many applications the distances are given directly, and computing similarities requires an additional step that, although computationally cheap, is error-prone because the kernel has to be chosen appropriately. This paper proposes a clustering algorithm that uses the SDP relaxation of the max-k-cut of the graph of pairwise distances, following the work of Frieze and Jerrum. We compare the algorithm with Yu and Shi’s algorithm, which is based on a spectral relaxation of the normalized k-cut (norm-k-cut). Moreover, we propose a simple heuristic for dealing with missing data, i.e., the case where some of the pairwise distances or similarities are not known. We evaluate the algorithms on the task of clustering natural language terms with the Google distance, a semantic distance recently introduced by Cilibrasi and Vitányi that uses relative frequency counts from WWW queries and is based on the theory of Kolmogorov complexity.
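
As a rough illustration of how a semantic distance can be obtained from frequency counts, the following sketch computes the normalized Google distance in the form published by Cilibrasi and Vitányi: NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))). The function name, the argument names, and the choice to raise an error on non-positive counts are assumptions made for this example and are not taken from the paper; the counts in the usage lines are made up.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Normalized Google distance from hit counts (illustrative sketch).

    fx, fy -- number of pages containing term x, term y
    fxy    -- number of pages containing both terms
    n      -- total number of indexed pages (a normalizing constant)
    All names here are hypothetical; only the formula follows
    Cilibrasi and Vitanyi.
    """
    if min(fx, fy, fxy) <= 0:
        # Assumption: the distance is treated as undefined in this case.
        raise ValueError("counts must be positive")
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    numerator = max(log_fx, log_fy) - log_fxy
    denominator = math.log(n) - min(log_fx, log_fy)
    return numerator / denominator


# Made-up counts: terms that always co-occur have distance 0.
print(normalized_google_distance(100, 100, 100, 10_000))
# Made-up counts: terms that rarely co-occur are farther apart.
print(normalized_google_distance(1000, 100, 10, 1_000_000))
```

A pair of terms for which no usable counts are available would simply have no entry in the distance matrix, which corresponds to the missing-data case mentioned in the abstract.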

This work was supported by JSPS 21st century COE program C01. Additional support has been provided by the MEXT Grant-in-Aid for Scientific Research on Priority Areas under Grant No. 18049001.

References

Borchers, B., Young, J.G.: Implementation of a primal-dual method for SDP on a shared memory parallel architecture (March 27, 2006)
Cilibrasi, R., Vitányi, P.M.B.: Automatic meaning discovery using Google. CWI, Amsterdam (Manuscript, 2006)
Ding, C.H.Q., He, X., Zha, H., Gu, M., Simon, H.D.: A min-max cut algorithm for graph partitioning and data clustering. In: ICDM 2001: Proceedings of the 2001 IEEE International Conference on Data Mining, pp. 107–114. IEEE Computer Society, Los Alamitos (2001)
Frieze, A., Jerrum, M.: Improved algorithms for max k-cut and max bisection. Algorithmica 18, 67–81 (1997)
Goemans, M.X., Williamson, D.P.: .879-approximation algorithms for MAX CUT and MAX 2SAT. In: STOC 1994: Proceedings of the twenty-sixth annual ACM symposium on Theory of computing, pp. 422–431. ACM Press, New York (1994)
Graepel, T.: Kernel matrix completion by semidefinite programming. In: Dorronsoro, J.R. (ed.) ICANN 2002. LNCS, vol. 2415, pp. 694–699. Springer, Heidelberg (2002)
Lanckriet, G.R.G., Cristianini, N., Bartlett, P., Ghaoui, L.E., Jordan, M.I.: Learning the kernel matrix with semidefinite programming. JMLR 5, 27–72 (2004)
Li, M., Chen, X., Li, X., Ma, B., Vitányi, P.M.B.: The similarity metric. IEEE Transactions on Information Theory 50(12), 3250–3264 (2004)
Schölkopf, B., Smola, A., Müller, K.-R.: Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10(5), 1299–1319 (1998)
Sturm, J.: Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software 11(12), 625–653 (1999)
Xing, E.P., Jordan, M.I.: On semidefinite relaxation for normalized k-cut and connections to spectral clustering. Technical Report UCB/CSD-03-1265, EECS Department, University of California, Berkeley (2003)
Yu, S.X., Shi, J.: Multiclass spectral clustering. In: ICCV 2003: Proceedings of the Ninth IEEE International Conference on Computer Vision, pp. 313–319. IEEE Computer Society, Los Alamitos (2003)

Author information

Authors and Affiliations

Division of Computer Science, Hokkaido University, Sapporo, 060-0814, Japan
Jan Poland & Thomas Zeugmann

Editor information

Editors and Affiliations

Jozef Stefan Institute, Jamova 39, 1000, Ljubljana, Slovenia
Ljupčo Todorovski
University of Nova Gorica, Nova Gorica, Slovenia
Nada Lavrač
Meme Media Laboratory, Hokkaido University Sapporo, Kita 13, Nishi 8, Kita-ku, P.O. Box, 060-8628, Sapporo, Japan
Klaus P. Jantke

About this paper

Cite this paper

Poland, J., Zeugmann, T. (2006). Clustering Pairwise Distances with Missing Data: Maximum Cuts Versus Normalized Cuts. In: Todorovski, L., Lavrač, N., Jantke, K.P. (eds) Discovery Science. DS 2006. Lecture Notes in Computer Science, vol 4265. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11893318_21

DOI: https://doi.org/10.1007/11893318_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-46491-4
Online ISBN: 978-3-540-46493-8
eBook Packages: Computer Science, Computer Science (R0)
