Abstract
Clustering is an important data exploration task. A prominent clustering algorithm is agglomerative hierarchical clustering, which, roughly, merges the closest pair of clusters in each iteration. It was first proposed as far back as 1951, and numerous modifications have followed. Its strengths include a natural, simple, and non-parametric grouping of similar objects that can find clusters of various shapes, both spherical and arbitrary. However, large CPU time and high memory requirements limit its use on large data sets. In this paper we show that geometric metric algorithms (centroid, median, and minimum variance) obey a 90-10 relationship: roughly the first 90% of iterations are spent merging clusters whose distance is less than 10% of the maximum merging distance. This characteristic is exploited by partially overlapping partitioning. Experiments and analyses show that different types of existing algorithms benefit substantially, with drastic reductions in CPU time and memory. Other contributions of this paper include a comparative study of multi-dimensional versus single-dimensional partitioning, and analytical and experimental discussions on setting parameters such as the number of partitions and the dimensions used for partitioning.
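To make the algorithm the abstract describes concrete, the following is a minimal sketch of centroid-linkage agglomerative clustering: each point starts as its own cluster, and the closest pair of centroids is merged each iteration. The function name and stopping criterion (`k` remaining clusters) are illustrative assumptions, not the paper's implementation; the naive O(n³) pairwise search is exactly the cost that motivates partitioning schemes.

```python
import numpy as np

def agglomerative_centroid(points, k):
    """Naive centroid-linkage agglomerative clustering (illustrative sketch).

    Repeatedly merges the closest pair of clusters, measured by centroid
    distance, until k clusters remain. Runs in O(n^3) time overall.
    """
    # Each cluster is a (centroid, member-indices) pair; start with singletons.
    clusters = [(np.asarray(p, dtype=float), [i]) for i, p in enumerate(points)]
    while len(clusters) > k:
        # Scan all pairs for the smallest centroid distance.
        best = (np.inf, 0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(clusters[i][0] - clusters[j][0])
                if d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        ci, cj = clusters[i], clusters[j]
        members = ci[1] + cj[1]
        # The merged centroid is the size-weighted mean of the two centroids.
        centroid = (ci[0] * len(ci[1]) + cj[0] * len(cj[1])) / len(members)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append((centroid, members))
    return [sorted(c[1]) for c in clusters]
```

The 90-10 observation applies here: most early merges in such a loop involve very small distances, which is what makes it safe to restrict the pairwise search to partially overlapping partitions instead of all pairs.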
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
Cite this paper
Dash, M., Liu, H. (2001). Efficient Hierarchical Clustering Algorithms Using Partially Overlapping Partitions. In: Cheung, D., Williams, G.J., Li, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2001. Lecture Notes in Computer Science(), vol 2035. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45357-1_52
Print ISBN: 978-3-540-41910-5
Online ISBN: 978-3-540-45357-4