Abstract
Sequential agglomerative hierarchical non-overlapping (SAHN) clustering techniques are classical clustering methods that are heavily used in many application domains, e.g., in cheminformatics. Asymptotically optimal SAHN clustering algorithms are known for arbitrary dissimilarity measures, but their quadratic time and space complexity, even in the best case, still limits their applicability to small data sets. We present a new pivot-based heuristic SAHN clustering algorithm that exploits the properties of metric distance measures to obtain a best-case running time of \(\mathcal{O}(n\log n)\) for input size \(n\). Our approach requires only linear space and supports median and centroid linkage. It is especially suitable for expensive distance measures, as it needs only a linear number of exact distance computations. In extensive experimental evaluations on real-world and synthetic data sets, we compare our approach to exact state-of-the-art SAHN algorithms in terms of quality and running time. The evaluations show a subquadratic running time in practice and a very low memory footprint.
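To illustrate the general idea behind pivot-based search in a metric space, the following is a minimal sketch and not the algorithm of the paper. It relies only on the triangle inequality, which gives \(|d(x,p) - d(y,p)| \le d(x,y)\) for any pivot \(p\). The function names (`pivot_table`, `lower_bound`, `heuristic_nearest_neighbor`), the Euclidean distance, and the choice of pivots are assumptions made for illustration.

```python
import math

def euclidean(a, b):
    # Stand-in for an expensive metric; any metric distance would do here.
    return math.dist(a, b)

def pivot_table(points, pivots, dist):
    # Exact distances from every point to every pivot:
    # len(points) * len(pivots) distance computations in total.
    return [[dist(x, p) for p in pivots] for x in points]

def lower_bound(table, i, j):
    # Triangle inequality: |d(x_i, p) - d(x_j, p)| <= d(x_i, x_j) for each pivot p,
    # so the maximum over all pivots is still a valid lower bound.
    return max(abs(a - b) for a, b in zip(table[i], table[j]))

def heuristic_nearest_neighbor(table, i):
    # Heuristic choice: the index j != i with the smallest lower bound.
    # This is not guaranteed to be the true nearest neighbor.
    candidates = (j for j in range(len(table)) if j != i)
    return min(candidates, key=lambda j: lower_bound(table, i, j))
```

With a fixed number of pivots, building the table costs a number of exact distance computations that is linear in the number of points; afterwards, candidate neighbors can be ranked from the table alone, without evaluating the metric again.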
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Kriege, N., Mutzel, P., Schäfer, T. (2014). SAHN Clustering in Arbitrary Metric Spaces Using Heuristic Nearest Neighbor Search. In: Pal, S.P., Sadakane, K. (eds) Algorithms and Computation. WALCOM 2014. Lecture Notes in Computer Science, vol 8344. Springer, Cham. https://doi.org/10.1007/978-3-319-04657-0_11
DOI: https://doi.org/10.1007/978-3-319-04657-0_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-04656-3
Online ISBN: 978-3-319-04657-0
eBook Packages: Computer Science (R0)