Effective data summarization for hierarchical clustering in large datasets

  • Regular Paper
Knowledge and Information Systems

Abstract

Cluster analysis of large datasets is a challenging problem in many fields of science and engineering. One important approach is hierarchical clustering, which outputs hierarchical (nested) structures of a given dataset. Single-link is a distance-based hierarchical clustering method that can find non-convex (arbitrarily shaped) clusters in a dataset. However, it cannot be applied to large datasets, because it either keeps the entire dataset in main memory or scans the dataset multiple times from secondary memory. Both are potentially severe problems for cluster analysis of large datasets. One remedy for both problems is to create a summary of the dataset efficiently and then use the summary to speed up clustering. In this paper, we propose a summarization scheme termed data sphere (ds) to speed up the single-link clustering method in large datasets. The ds scheme uses the sequential leaders clustering method to collect important statistics of a given dataset. The single-link method is modified to work with ds; the modified clustering method is termed summarized single-link (SSL). The SSL method is considerably faster than the single-link method applied directly to the dataset, and the clustering results produced by SSL are close to those produced by single-link. SSL also outperforms single-link applied to data bubbles (an existing summarization scheme) in terms of both clustering accuracy and computation time. To speed up the proposed summarization scheme, a technique is introduced that reduces the large number of distance computations in the leaders method. Experimental studies demonstrate the effectiveness of the proposed summarization scheme for large datasets.
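
The sequential leaders method mentioned in the abstract can be illustrated with a minimal sketch. This is the classic single-scan leaders procedure, not the paper's data sphere construction: the function name `leaders`, the threshold parameter `tau`, and the choice of per-leader statistics (a follower count and a coordinate-wise sum) are assumptions made here for illustration only.

```python
import math


def leaders(points, tau):
    """Single-scan leaders clustering (illustrative sketch).

    Each pattern joins the first leader found within distance tau;
    otherwise it becomes a new leader. The statistics kept per leader
    (count and coordinate-wise sum) are assumed for this example and
    are not taken from the paper.
    """
    summaries = []
    for p in points:
        for s in summaries:
            if math.dist(p, s["leader"]) <= tau:
                # p becomes a follower of this leader; update its statistics.
                s["count"] += 1
                s["sum"] = [a + b for a, b in zip(s["sum"], p)]
                break
        else:
            # No leader is close enough, so p starts a new summary.
            summaries.append({"leader": tuple(p), "count": 1, "sum": list(p)})
    return summaries


data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.2)]
result = leaders(data, tau=1.0)
for s in result:
    print(s["leader"], s["count"])
```

Because the dataset is scanned only once and only the summaries are retained, a hierarchical method can then operate on the (much smaller) set of leaders rather than on every pattern. The result depends on the scan order and on the choice of `tau`.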

Notes

  1. Let \(C_1\) and \(C_2\) be two groups (clusters) of patterns. The distance between them is \(\mathrm{Distance}(C_1,C_2) =\min \{\|x_i-x_j\| \mid x_i\in C_1, x_j\in C_2\}\).

  2. Two approaches can be found in [16].
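
The single-link distance defined in note 1 can be computed directly from its definition. The following is a brute-force sketch assuming Euclidean distance and clusters given as lists of coordinate tuples; the function name `single_link_distance` is hypothetical.

```python
import math
from itertools import product


def single_link_distance(c1, c2):
    """Minimum Euclidean distance over all pairs (x_i in c1, x_j in c2)."""
    return min(math.dist(x, y) for x, y in product(c1, c2))


c1 = [(0.0, 0.0), (1.0, 0.0)]
c2 = [(4.0, 0.0), (9.0, 0.0)]
d = single_link_distance(c1, c2)
print(d)  # closest pair is (1, 0) and (4, 0)
```

This evaluates every cross-cluster pair, so its cost grows with the product of the two cluster sizes, which is one reason the unmodified method becomes expensive on large datasets.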

References

  1. Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) OPTICS: ordering points to identify the clustering structure. SIGMOD Rec 28(2):49–60

  2. Babu VS, Viswanath P (2009) Rough-fuzzy weighted k-nearest leader classifier for large data sets. Pattern Recognit 42(9):1719–1731

  3. Bradley PS, Fayyad UM, Reina C (1998) Scaling clustering algorithms to large databases. In: Agrawal R, Stolorz P, Piatetsky-Shapiro G (eds) Proceedings of the fourth international conference on knowledge discovery and data mining (KDD-98). New York City, New York, USA, August, pp 9–15

  4. Breunig MM, Kriegel H-P, Kröger P, Sander J (2001) Data bubbles: quality preserving performance boosting for hierarchical clustering. In: Mehrotra S, Sellis T (eds) Proceedings of the 2001 ACM SIGMOD international conference on management of data. Santa Barbara, CA, USA, May, pp 79–90

  5. Breunig MM, Kriegel HP, Sander J (2000) Fast hierarchical clustering based on compressed data and OPTICS. In: Zighed D, Komorowski H, Zytkow J (eds) Proceedings of principles of data mining and knowledge discovery, 4th European conference, PKDD 2000. Lyon, France, September, pp 232–242

  6. Duda RO, Hart PE, Stork DG (2000) Pattern classification. Wiley, Singapore

  7. Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis E, Han J, Fayyad U (eds) Proceedings of the second international conference on knowledge discovery and data mining (KDD-96). Portland, Oregon, USA, pp 226–231

  8. Hartigan JA (1975) Clustering algorithms. Wiley, New York

  9. Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recognit Lett 31(8):651–666

  10. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323

  11. King B (1967) Step-wise clustering procedures. J Am Stat Assoc 62(317):86–101

  12. Kryszkiewicz M, Lasek P (2010) TI-DBSCAN: clustering with DBSCAN by means of the triangle inequality. In: Szczuka M, Kryszkiewicz M, Ramanna S, Jensen R, Hu Q (eds) Proceedings of rough sets and current trends in computing, 7th international conference, RSCTC 2010. Warsaw, Poland, June. Lecture Notes in Computer Science 6086, Springer, pp 60–69

  13. Lloyd S (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–137

  14. Murtagh F (1984) Complexities of hierarchic clustering algorithms: state of the art. Comput Stat Q 1:101–113

  15. Ng RT, Han J (2002) CLARANS: a method for clustering objects for spatial data mining. IEEE Trans Knowl Data Eng 14(5):1003–1016

  16. Patra BK, Nandi S, Viswanath P (2011) A distance based clustering method for arbitrary shaped clusters in large datasets. Pattern Recognit 44(12):2862–2870

  17. Rand WM (1971) Objective criteria for evaluation of clustering methods. J Am Stat Assoc 66(336):846–850

  18. Ross S (2002) A first course in probability. Pearson Education, New Delhi

  19. Sarma T, Viswanath P, Reddy B (2013) A hybrid approach to speed-up the k-means clustering method. Int J Mach Learn Cybern 4(2):107–117

  20. Sneath PHA, Sokal RR (1973) Numerical taxonomy. Freeman, London

  21. Steinhaus H (1956) Sur la division des corps matériels en parties. Bull Acad Polon Sci Cl III 4:801–804

  22. Viswanath P, Babu V (2009) Rough-DBSCAN: a fast hybrid density based clustering method for large data sets. Pattern Recognit Lett 30(16):1477–1488

  23. Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Jagadish H, Mumick I (eds) Proceedings of the 1996 ACM SIGMOD international conference on management of data. Montreal, Quebec, Canada, June, pp 103–114

  24. Zhao Y, Karypis G (2002) Criterion functions for document clustering: experiments and analysis. Technical report, University of Minnesota

  25. Zhou J, Sander J (2003) Data bubbles for non-vector data: speeding-up hierarchical clustering in arbitrary metric spaces. In: Freytag J, Lockemann P, Abiteboul S, Carey M, Selinger P, Heuer A (eds) Proceedings of 29th international conference on very large data bases (VLDB 2003). Berlin, Germany, September, pp 452–463

Acknowledgments

We thank the anonymous reviewers for their very useful comments and suggestions. This work was partially carried out during the tenure of an ERCIM “Alain Bensoussan” Fellowship Programme. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007–2013) under Grant Agreement 246016.

Author information

Corresponding author

Correspondence to Bidyut Kr. Patra.

About this article

Cite this article

Patra, B.K., Nandi, S. Effective data summarization for hierarchical clustering in large datasets. Knowl Inf Syst 42, 1–20 (2015). https://doi.org/10.1007/s10115-013-0709-8

  • DOI: https://doi.org/10.1007/s10115-013-0709-8
