Abstract
Cluster analysis in a large dataset is an interesting challenge in many fields of Science and Engineering. One important clustering approach is hierarchical clustering, which outputs hierarchical (nested) structures of a given dataset. The single-link is a distance-based hierarchical clustering method, which can find non-convex (arbitrary)-shaped clusters in a dataset. However, this method cannot be used for clustering large dataset as this method either keeps entire dataset in main memory or scans dataset multiple times from secondary memory of the machine. Both of them are potentially severe problems for cluster analysis in large datasets. One remedy for both problems is to create a summary of a given dataset efficiently, and the summary is subsequently used to speed up clustering methods in large datasets. In this paper, we propose a summarization scheme termed data sphere (ds) to speed up single-link clustering method in large datasets. The ds utilizes sequential leaders clustering method to collect important statistics of a given dataset. The single-link method is modified to work with ds. Modified clustering method is termed as summarized single-link (SSL). The SSL method is considerably faster than the single-link method applied directly to the dataset, and clustering results produced by SSL method are close to the clustering results produced by single-link method. The SSL method outperforms single-link using data bubble (summarization scheme) both in terms of clustering accuracy and computation time. To speed up proposed summarization scheme, a technique is introduced to reduce a large number of distance computations in leaders method. Experimental studies demonstrate effectiveness of the proposed summarization scheme for large datasets.
Similar content being viewed by others
Notes
Let \(C_1\) and \(C_2\) be two groups (clusters) of patterns, respectively. Then, distance between them: \(\hbox {Distance}(C_1,C_2) =\min \{||x_i-x_j||\mid x_i\in C_1, x_j\in C_2\}\).
Two approaches can be found in [16].
References
Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) OPTICS: ordering points to identify the clustering structure. SIGMOD Rec 28(2):49–60
Babu VS, Viswanath P (2009) Rough-fuzzy weighted k-nearest leader classifier for large data sets. Pattern Recognit 42(9):1719–1731
Bradley PS, Fayyad UM, Reina C (1998) Scaling clustering algorithms to large databases. In: Agrawal R, Stolorz P, Piatetsky-Shapiro G (eds) Proceedings of the fourth international conference on knowledge discovery and data mining (KDD-98). New York City, New York, USA, August, pp 9–15
Breunig MM, Kriegel H-P, Krger P, Sander J (2001) Data bubbles: quality preserving performance boosting for hierarchical clustering. In: Mehrotra S, Sellis T (eds) Proceedings of the 2001 ACM SIGMOD international conference on management of data. Santa Barbara, CA, USA, May, pp 79–90
Breunig MM, Kriegel HP, Sander J (2000) Fast hierarchical clustering based on compressed data and OPTICS. In: Zighed D, Komorowski H, Zytkow J (eds) Proceedings of principles of data mining and knowledge discovery, 4th European conference, PKDD 2000. Lyon, France, September, pp 232–242
Duda RO, Hart PE, Stork DG (2000) Pattern classification. Wiley, Singapore
Ester M, Kriegel H-P, Sander J (1996) Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis E, Han J, Fayyad U (eds) Proceedings of the second international conference on knowledge discovery and data mining (KDD-96). Portland, Oregon, USA, pp 226–231
Hartigan JA (1975) Clustering algorithms. Wiley, New York
Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recognit Lett 31(8):651–666
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323
King B (1967) Step-wise clustering procedures. J Am Stat Assoc 62(317):86–101
Kryszkiewicz M, Lasek P (2010) TI-DBSCAN: clustering with DBSCAN by means of the triangle inequality. In: Szczuka M, Kryszkiewicz M, Ramanna S, Jensen R, Hu Q (eds) Proceedings of rough sets and current trends in computing (RSCTC 2010), 7th international conference, RSCTC 2010, Warsaw, Poland, June, 2010. Lecture Notes in Computer Science 6086, Springer, pp 60–69
Lloyd S (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–137
Murtagh F (1984) Complexities of hierarchic clustering algorithms: state of the art. Comput Stat Q 1:101–113
Ng RT, Han J (2002) CLARANS: a method for clustering objects for spatial data mining. IEEE Trans Knowl Data Eng 14(5):1003–1016
Patra BK, Nandi S, Viswanath P (2011) A distance based clustering method for arbitrary shaped clusters in large datasets. Pattern Recognit 44(12):2862–2870
Rand WM (1971) Objective criteria for evaluation of clustering methods. J Am Stat Assoc 66(336):846–850
Ross S (2002) A first course in probability. Pearson Education, New Delhi
Sarma T, Viswanath P, Reddy B (2013) A hybrid approach to speed-up the k-means clustering method. Int J Mach Learn Cybern 4(2):107–117
Sneath A, Sokal PH (1973) Numerical taxonomy. Freeman, London
Steinhaus H (1956) Sur la division des corps matériels en parties. Bull Acad Polon Sci Cl III 4:801–804
Viswanath P, Babu V (2009) Rough-DBSCAN: a fast hybrid density based clustering method for large data sets. Pattern Recognit Lett 30(16):1477–1488
Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Jagadish H, Mumick I (eds) Proceedings of the 1996 ACM SIGMOD international conference on management of data. Montreal, Quebec, Canada, June, pp 103–114
Zhao Y, Karypis G (2002) Criterion functions for document clustering: experiments and analysis. Technical report, University of Minnesota
Zhou J, Sander J (2003) Data bubbles for non-vector data: speeding-up hierarchical clustering in arbitrary metric spaces. In: Freytag J, Lockemann P, Abiteboul S, Carey M, Selinger P, Heuer A (eds) Proceedings of 29th international conference on very large data bases (VLDB 2003), September, 2003. Germany, Berlin, pp 452–463
Acknowledgments
We thank anonymous reviewers for their very useful comments and suggestions. This work was partially carried out during the tenure of an ERCIM “Alain Bensoussan” Fellowship Programme. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007–2013) under Grant Agreement 246016.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Patra, B.K., Nandi, S. Effective data summarization for hierarchical clustering in large datasets. Knowl Inf Syst 42, 1–20 (2015). https://doi.org/10.1007/s10115-013-0709-8
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-013-0709-8