Abstract
Clustering is an important data exploration task. A prominent clustering algorithm is agglomerative hierarchical clustering, which, roughly, merges the closest pair of clusters in each iteration. It was first proposed as far back as 1951, and numerous modifications have followed. Its strengths include a natural, simple, and non-parametric grouping of similar objects that can find clusters of various shapes, both spherical and arbitrary. However, large CPU time and high memory requirements limit its use on large data sets. In this paper we show that geometric metric algorithms (centroid, median, and minimum variance) obey a 90-10 relationship: roughly the first 90% of iterations are spent merging clusters whose distance is less than 10% of the maximum merging distance. This characteristic is exploited by partially overlapping partitioning. Experiments and analyses show that different types of existing algorithms benefit substantially, with drastic reductions in CPU time and memory. Other contributions of this paper include a comparative study of multi-dimensional versus single-dimensional partitioning, and analytical and experimental discussions on setting parameters such as the number of partitions and the dimensions used for partitioning.
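To make the algorithm the abstract describes concrete, the following is a minimal sketch of centroid-linkage agglomerative clustering: each point starts as its own cluster, and the closest pair of centroids is merged each iteration. The function name and stopping criterion (`k` remaining clusters) are illustrative assumptions, not the paper's implementation; the naive O(n³) pairwise search is exactly the cost that motivates partitioning schemes.

```python
import numpy as np

def agglomerative_centroid(points, k):
    """Naive centroid-linkage agglomerative clustering (illustrative sketch).

    Repeatedly merges the closest pair of clusters, measured by centroid
    distance, until k clusters remain. Runs in O(n^3) time overall.
    """
    # Each cluster is a (centroid, member-indices) pair; start with singletons.
    clusters = [(np.asarray(p, dtype=float), [i]) for i, p in enumerate(points)]
    while len(clusters) > k:
        # Scan all pairs for the smallest centroid distance.
        best = (np.inf, 0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(clusters[i][0] - clusters[j][0])
                if d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        ci, cj = clusters[i], clusters[j]
        members = ci[1] + cj[1]
        # The merged centroid is the size-weighted mean of the two centroids.
        centroid = (ci[0] * len(ci[1]) + cj[0] * len(cj[1])) / len(members)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append((centroid, members))
    return [sorted(c[1]) for c in clusters]
```

The 90-10 observation applies here: most early merges in such a loop involve very small distances, which is what makes it safe to restrict the pairwise search to partially overlapping partitions instead of all pairs.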
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
Cite this paper
Dash, M., Liu, H. (2001). Efficient Hierarchical Clustering Algorithms Using Partially Overlapping Partitions. In: Cheung, D., Williams, G.J., Li, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2001. Lecture Notes in Computer Science(), vol 2035. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45357-1_52
Print ISBN: 978-3-540-41910-5
Online ISBN: 978-3-540-45357-4