A Set Correlation Model for Partitional Clustering

Vinh, Nguyen Xuan; Houle, Michael E.

doi:10.1007/978-3-642-13657-3_4

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6118)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

This paper introduces GlobalRSC, a novel formulation for partitional data clustering based on the Relevant Set Correlation (RSC) clustering model. Our formulation resembles that of the K-means clustering model, but uses a shared-neighbor similarity measure in place of the Euclidean distance. K-means and most other clustering heuristics work only with real-valued data and with distance measures taken from specific families; GlobalRSC, by contrast, can work with any distance measure and any data representation. We also discuss various techniques for boosting the scalability of GlobalRSC.
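
The idea of a shared-neighbor similarity that accepts any distance measure and any data representation can be illustrated with a minimal Python sketch. This is not the GlobalRSC objective, which the abstract does not specify; it only shows a plain neighbor-set overlap in the spirit of shared-near-neighbor clustering. The names `hamming`, `knn_sets`, and `shared_neighbor_similarity`, the toy word list, and the choice `k = 3` are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Set, TypeVar

T = TypeVar("T")


def hamming(a: str, b: str) -> int:
    """Hypothetical example distance: mismatched positions of equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def knn_sets(items: Sequence[T], dist: Callable[[T, T], float], k: int) -> List[Set[int]]:
    """For each item, return the indices of its k nearest items (itself included).

    Only pairwise distances are used, so the items need not be real-valued vectors.
    Ties are broken by index to keep the result deterministic.
    """
    neighbors = []
    for x in items:
        order = sorted(range(len(items)), key=lambda j: (dist(x, items[j]), j))
        neighbors.append(set(order[:k]))
    return neighbors


def shared_neighbor_similarity(a: Set[int], b: Set[int]) -> float:
    """Fraction of neighbors that two neighbor sets have in common."""
    return len(a & b) / max(len(a), len(b))


# Toy, non-numeric data: two groups of three-letter words.
words = ["cat", "cab", "car", "dog", "dug", "dig"]
nbrs = knn_sets(words, hamming, k=3)

same_group = shared_neighbor_similarity(nbrs[0], nbrs[1])   # "cat" vs "cab"
other_group = shared_neighbor_similarity(nbrs[0], nbrs[3])  # "cat" vs "dog"
print(same_group, other_group)
```

Because the similarity depends only on which items rank among each other's nearest neighbors, swapping `hamming` for any other pairwise distance leaves the rest of the sketch unchanged.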

Author information

Authors and Affiliations

School of Electrical Engineering and Telecommunications, The University of New South Wales, Sydney, NSW, 2052, Australia
Nguyen Xuan Vinh
National ICT Australia (NICTA)
Nguyen Xuan Vinh
National Institute of Informatics, Tokyo, Japan
Nguyen Xuan Vinh & Michael E. Houle

Editor information

Editors and Affiliations

Computer Science Department, Rensselaer Polytechnic Institute, USA
Mohammed J. Zaki
The Chinese University of Hong Kong, China
Jeffrey Xu Yu
IIT Madras, Chennai, India
B. Ravindran
IIIT, Hyderabad, India
Vikram Pudi

About this paper

Cite this paper

Vinh, N.X., Houle, M.E. (2010). A Set Correlation Model for Partitional Clustering. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2010. Lecture Notes in Computer Science, vol 6118. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13657-3_4

DOI: https://doi.org/10.1007/978-3-642-13657-3_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13656-6
Online ISBN: 978-3-642-13657-3
eBook Packages: Computer Science, Computer Science (R0)
