BISDBx: towards batch-incremental clustering for dynamic datasets using SNN-DBSCAN

  • Industrial and commercial application

Pattern Analysis and Applications

Abstract

Many important applications, such as recommender systems, e-commerce sites, and web crawlers, involve dynamic datasets. Dynamic datasets undergo frequent changes in the form of insertions or deletions of data that affect their size. A naive algorithm may not process these frequent changes efficiently, as it involves the entire set of data points each time a change is made. Fast incremental algorithms process these updates efficiently by avoiding redundant computation. In this article, we propose incremental extensions to the shared nearest neighbor density-based clustering (SNNDB) algorithm for both addition and deletion of data points. The existing incremental extension to SNNDB, InSDB, cannot handle deletion and handles insertions one point at a time. Our method overcomes both of these bottlenecks by efficiently identifying the affected parts of clusters while processing updates to the dataset in batch mode. We propose three batch-mode incremental variants of SNNDB for both addition and deletion, with the third variant being the most effective. Experimental observations on real-world and synthetic datasets showed that our algorithms are up to 4 orders of magnitude faster than the naive SNNDB algorithm and about 2 orders of magnitude faster than the pointwise incremental method.
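
The contrast between a naive rebuild and an incremental update can be illustrated at the level of KNN lists, on which SNNDB is built. The sketch below is not the BISDB algorithm from the article. It assumes Euclidean distance, brute-force neighbor search, and insertion only, and the names `build_knn` and `insert_batch` are hypothetical. It shows only that a batch of new points can change the KNN list of an existing point if at least one new point is closer than that point's current K-th neighbor.

```python
import bisect
import math

def build_knn(points, k):
    """Naive construction: every point is compared with every other point.
    Each list holds (distance, index) pairs sorted by increasing distance."""
    return {
        i: sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)[:k]
        for i, p in enumerate(points)
    }

def insert_batch(points, knn, batch, k):
    """Append `batch` to `points`, update `knn` in place, and return the
    indices of pre-existing points whose KNN list changed (assumed sketch)."""
    n_old = len(points)
    points.extend(batch)
    affected = set()
    for i in range(n_old):
        for j in range(n_old, len(points)):
            d = math.dist(points[i], points[j])
            # A new point matters only if it beats the current k-th neighbor.
            if len(knn[i]) < k or d < knn[i][-1][0]:
                bisect.insort(knn[i], (d, j))
                del knn[i][k:]  # keep only the top-k entries
                affected.add(i)
    # New points have no list yet, so theirs is computed from scratch.
    for j in range(n_old, len(points)):
        knn[j] = sorted(
            (math.dist(points[j], q), m) for m, q in enumerate(points) if m != j
        )[:k]
    return affected
```

On a dataset with two well-separated groups, inserting a point near one group changes only that group's lists. Incremental methods exploit this kind of locality to avoid reprocessing the whole dataset.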

Notes

  1. Having a point in the KNN list does not guarantee the formation of a shared strong link between the concerned point and its neighbor. For a shared strong link to exist, each of the two conditions for strong link formation must be satisfied.

  2. The link is broken because the points are no longer in each other's KNN lists (a necessary condition for a shared strong link).

  3. The core point with which the shared link strength is highest becomes the “nearest” core point.

  4. When no more points remain to prevent the shrinkage of the top-K window due to further deletion, Batch-Dec1 involves the entire \(D'\) to rebuild e-\(\hbox{KNN}_{updated}(p)\).

  5. We assume that P does not share a strong link with P10, P9 and P7, yet they can be present in \(\hbox{KNN}_{updated}(P)\).

  6. The KNN list of any data point holds its top-K neighboring points, sorted in increasing order of distance from that point.

  7. Point-based deletion signifies the execution of \(BISDB_{del}\) (the most effective batch-incremental deletion algorithm) with batches of size 1.

  8. We provide the detailed results pertaining to constant updates for \(BISDB_{add}\) vs InSDB and \(BISDB_{del}\) vs point-based deletion in the supplementary material “Online Resource.pdf” that accompanies this article.
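
Notes 1 to 3 and 6 can be made concrete with a small sketch. The Python below is illustrative only and is not the authors' implementation. It assumes Euclidean distance and brute-force KNN lists. It also assumes that the second strong-link condition, which these notes do not spell out, is a minimum number of shared neighbors, given here by the hypothetical parameter `delta_sim`. The function names are likewise invented for illustration.

```python
import math

def knn_lists(points, k):
    """For every point index, return its k nearest neighbors,
    sorted by increasing distance (note 6)."""
    lists = {}
    for i, p in enumerate(points):
        others = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        lists[i] = [j for _, j in others[:k]]
    return lists

def link_strength(knn, a, b):
    """Number of neighbors shared by the KNN lists of a and b."""
    return len(set(knn[a]) & set(knn[b]))

def has_strong_link(knn, a, b, delta_sim):
    """Assumed conditions: a and b appear in each other's KNN lists (note 2)
    and share at least `delta_sim` neighbors (an assumption, see lead-in)."""
    mutual = b in knn[a] and a in knn[b]
    return mutual and link_strength(knn, a, b) >= delta_sim

def nearest_core(knn, p, core_points, delta_sim):
    """Core point with the highest shared link strength to p (note 3),
    or None when p has no strong link to any core point."""
    linked = [c for c in core_points
              if c != p and has_strong_link(knn, p, c, delta_sim)]
    return max(linked, key=lambda c: link_strength(knn, p, c), default=None)
```

With `delta_sim` set high enough, `has_strong_link` returns `False` even for a pair of mutual neighbors, which is the situation note 1 describes: membership in the KNN list alone does not guarantee a shared strong link.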

Author information

Corresponding author

Correspondence to Panthadeep Bhattacharjee.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 116 KB)

About this article

Cite this article

Bhattacharjee, P., Mitra, P. BISDBx: towards batch-incremental clustering for dynamic datasets using SNN-DBSCAN. Pattern Anal Applic 23, 975–1009 (2020). https://doi.org/10.1007/s10044-019-00831-1

  • DOI: https://doi.org/10.1007/s10044-019-00831-1
