Effective data summarization for hierarchical clustering in large datasets

Patra, Bidyut Kr.; Nandi, Sukumar

doi:10.1007/s10115-013-0709-8

Regular Paper
Published: 30 November 2013

Volume 42, pages 1–20 (2015)
Knowledge and Information Systems

Bidyut Kr. Patra^1,2 & Sukumar Nandi^3

Abstract

Cluster analysis in a large dataset is an interesting challenge in many fields of science and engineering. One important clustering approach is hierarchical clustering, which outputs hierarchical (nested) structures of a given dataset. Single-link is a distance-based hierarchical clustering method that can find non-convex (arbitrarily shaped) clusters in a dataset. However, this method cannot be used for clustering large datasets, as it either keeps the entire dataset in main memory or scans the dataset multiple times from the secondary memory of the machine. Both are potentially severe problems for cluster analysis in large datasets. One remedy for both problems is to create a summary of a given dataset efficiently and then use the summary to speed up clustering methods in large datasets. In this paper, we propose a summarization scheme termed data sphere (ds) to speed up the single-link clustering method in large datasets. The ds utilizes the sequential leaders clustering method to collect important statistics of a given dataset. The single-link method is modified to work with ds; the modified clustering method is termed summarized single-link (SSL). The SSL method is considerably faster than the single-link method applied directly to the dataset, and the clustering results produced by SSL are close to those produced by the single-link method. The SSL method outperforms single-link using data bubble (a summarization scheme) in terms of both clustering accuracy and computation time. To speed up the proposed summarization scheme, a technique is introduced to reduce a large number of distance computations in the leaders method. Experimental studies demonstrate the effectiveness of the proposed summarization scheme for large datasets.
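
For readers unfamiliar with the sequential leaders method mentioned in the abstract, the following is a minimal Python sketch of the textbook one-pass leaders algorithm. It is not the paper's data sphere scheme: the function name `leaders`, the threshold parameter `tau`, and the choice to keep only a follower count per leader are assumptions made for illustration, and the statistics actually collected by ds are not specified in the text available here.

```python
import math

def leaders(points, tau):
    """Textbook one-pass leaders clustering (illustrative sketch).

    Each point joins the first existing leader within distance tau;
    otherwise it becomes a new leader. Returns the leaders and the
    number of points (leader included) assigned to each.
    """
    leader_list = []   # representative points found so far
    counts = []        # counts[i] = points assigned to leader_list[i]
    for p in points:
        for i, leader in enumerate(leader_list):
            if math.dist(p, leader) <= tau:
                counts[i] += 1
                break
        else:
            # No leader is close enough: p starts a new group.
            leader_list.append(p)
            counts.append(1)
    return leader_list, counts

# Hypothetical toy data: two well-separated groups in the plane.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.2)]
leader_list, counts = leaders(points, tau=0.5)
print(leader_list)  # [(0.0, 0.0), (5.0, 5.0)]
print(counts)       # [3, 2]
```

The dataset is scanned once, and every point is compared only against the current leaders, which is why a summary of this kind can be much smaller than the data it represents. The result depends on the scan order and on the assumed threshold `tau`.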

Notes

Let \(C_1\) and \(C_2\) be two groups (clusters) of patterns. The distance between them is \(\hbox {Distance}(C_1,C_2) =\min \{||x_i-x_j||\mid x_i\in C_1, x_j\in C_2\}\).
Two approaches can be found in [16].
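
The cluster distance defined in the first note can be sketched in a few lines of Python. This is a brute-force illustration, assuming Euclidean distance for the norm and representing each cluster as a list of coordinate tuples; the function name `single_link_distance` is chosen here and does not come from the paper.

```python
import math

def single_link_distance(c1, c2):
    """Minimum Euclidean distance over all pairs (x_i in c1, x_j in c2)."""
    return min(math.dist(x_i, x_j) for x_i in c1 for x_j in c2)

# Hypothetical clusters on a line: the closest pair is (1, 0) and (4, 0).
c1 = [(0.0, 0.0), (1.0, 0.0)]
c2 = [(4.0, 0.0), (9.0, 0.0)]
print(single_link_distance(c1, c2))  # 3.0
```

Evaluating this distance directly takes one distance computation per pair of patterns, which illustrates why applying single-link to a large dataset without a summary is expensive.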

References

Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) OPTICS: ordering points to identify the clustering structure. SIGMOD Rec 28(2):49–60
Babu VS, Viswanath P (2009) Rough-fuzzy weighted k-nearest leader classifier for large data sets. Pattern Recognit 42(9):1719–1731
Bradley PS, Fayyad UM, Reina C (1998) Scaling clustering algorithms to large databases. In: Agrawal R, Stolorz P, Piatetsky-Shapiro G (eds) Proceedings of the fourth international conference on knowledge discovery and data mining (KDD-98). New York City, New York, USA, August, pp 9–15
Breunig MM, Kriegel HP, Kröger P, Sander J (2001) Data bubbles: quality preserving performance boosting for hierarchical clustering. In: Mehrotra S, Sellis T (eds) Proceedings of the 2001 ACM SIGMOD international conference on management of data. Santa Barbara, CA, USA, May, pp 79–90
Breunig MM, Kriegel HP, Sander J (2000) Fast hierarchical clustering based on compressed data and OPTICS. In: Zighed D, Komorowski H, Zytkow J (eds) Proceedings of principles of data mining and knowledge discovery, 4th European conference, PKDD 2000. Lyon, France, September, pp 232–242
Duda RO, Hart PE, Stork DG (2000) Pattern classification. Wiley, Singapore
Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis E, Han J, Fayyad U (eds) Proceedings of the second international conference on knowledge discovery and data mining (KDD-96). Portland, Oregon, USA, pp 226–231
Hartigan JA (1975) Clustering algorithms. Wiley, New York
Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recognit Lett 31(8):651–666
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323
King B (1967) Step-wise clustering procedures. J Am Stat Assoc 62(317):86–101
Kryszkiewicz M, Lasek P (2010) TI-DBSCAN: clustering with DBSCAN by means of the triangle inequality. In: Szczuka M, Kryszkiewicz M, Ramanna S, Jensen R, Hu Q (eds) Proceedings of rough sets and current trends in computing (RSCTC 2010), 7th international conference, RSCTC 2010, Warsaw, Poland, June, 2010. Lecture Notes in Computer Science 6086, Springer, pp 60–69
Lloyd S (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–137
Murtagh F (1984) Complexities of hierarchic clustering algorithms: state of the art. Comput Stat Q 1:101–113
Ng RT, Han J (2002) CLARANS: a method for clustering objects for spatial data mining. IEEE Trans Knowl Data Eng 14(5):1003–1016
Patra BK, Nandi S, Viswanath P (2011) A distance based clustering method for arbitrary shaped clusters in large datasets. Pattern Recognit 44(12):2862–2870
Rand WM (1971) Objective criteria for evaluation of clustering methods. J Am Stat Assoc 66(336):846–850
Ross S (2002) A first course in probability. Pearson Education, New Delhi
Sarma T, Viswanath P, Reddy B (2013) A hybrid approach to speed-up the k-means clustering method. Int J Mach Learn Cybern 4(2):107–117
Sneath PHA, Sokal RR (1973) Numerical taxonomy. Freeman, London
Steinhaus H (1956) Sur la division des corps matériels en parties. Bull Acad Polon Sci Cl III 4:801–804
Viswanath P, Babu V (2009) Rough-DBSCAN: a fast hybrid density based clustering method for large data sets. Pattern Recognit Lett 30(16):1477–1488
Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Jagadish H, Mumick I (eds) Proceedings of the 1996 ACM SIGMOD international conference on management of data. Montreal, Quebec, Canada, June, pp 103–114
Zhao Y, Karypis G (2002) Criterion functions for document clustering: experiments and analysis. Technical report, University of Minnesota
Zhou J, Sander J (2003) Data bubbles for non-vector data: speeding-up hierarchical clustering in arbitrary metric spaces. In: Freytag J, Lockemann P, Abiteboul S, Carey M, Selinger P, Heuer A (eds) Proceedings of 29th international conference on very large data bases (VLDB 2003), September, 2003. Berlin, Germany, pp 452–463

Acknowledgments

We thank anonymous reviewers for their very useful comments and suggestions. This work was partially carried out during the tenure of an ERCIM “Alain Bensoussan” Fellowship Programme. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007–2013) under Grant Agreement 246016.

Author information

Authors and Affiliations

National Institute of Technology Rourkela, Rourkela, 769 008, Orissa, India
Bidyut Kr. Patra
VTT Technical Research Centre of Finland, PO Box 1000, 02044, Espoo, Finland
Bidyut Kr. Patra
Indian Institute of Technology Guwahati, Guwahati, 789 039, Assam, India
Sukumar Nandi

Corresponding author

Correspondence to Bidyut Kr. Patra.

About this article

Cite this article

Patra, B.K., Nandi, S. Effective data summarization for hierarchical clustering in large datasets. Knowl Inf Syst 42, 1–20 (2015). https://doi.org/10.1007/s10115-013-0709-8

Received: 30 December 2012
Revised: 25 July 2013
Accepted: 15 November 2013
Published: 30 November 2013
Issue Date: January 2015
DOI: https://doi.org/10.1007/s10115-013-0709-8
