BISDBx: towards batch-incremental clustering for dynamic datasets using SNN-DBSCAN

Bhattacharjee, Panthadeep; Mitra, Pinaki

doi:10.1007/s10044-019-00831-1

Industrial and commercial application
Published: 01 July 2019

Volume 23, pages 975–1009, (2020)

Pattern Analysis and Applications

Abstract

Many important applications, such as recommender systems, e-commerce sites, and web crawlers, involve dynamic datasets. Dynamic datasets undergo frequent changes in the form of insertions or deletions of data that affect their size. A naive algorithm may not process these frequent changes efficiently, as it involves the entire set of data points each time a change is made. Fast incremental algorithms process these updates efficiently by avoiding redundant computation. In this article, we propose incremental extensions to the shared nearest neighbor density-based clustering (SNNDB) algorithm for both addition and deletion of data points. The existing incremental extension to SNNDB, InSDB, cannot handle deletion and handles insertions one point at a time. Our method overcomes both of these bottlenecks by efficiently identifying the affected parts of clusters while processing updates to the dataset in batch mode. We propose three incremental variants of SNNDB in batch mode for both addition and deletion, with the third variant being the most effective. Experimental observations on real-world and synthetic datasets showed that our algorithms are up to 4 orders of magnitude faster than the naive SNNDB algorithm and about 2 orders of magnitude faster than the pointwise incremental method.
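The general idea of restricting work to the affected points can be illustrated with a small sketch. This is not the BISDB algorithm from the article, whose details are not given on this page; it only shows, under simple assumptions, why a batch insertion need not trigger a full recomputation of K-nearest-neighbor (KNN) lists. The function names `build_knn` and `insert_batch` are hypothetical, and Euclidean distance is assumed.

```python
import math

def build_knn(points, k):
    """Naive baseline: for every point, the k nearest (distance, index) pairs, closest first."""
    knn = []
    for i, p in enumerate(points):
        cand = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        knn.append(cand[:k])
    return knn

def insert_batch(points, knn, batch, k):
    """Append `batch` to `points`, update `knn` in place, and return the
    indices of the old points whose KNN lists actually changed."""
    n_old = len(points)
    points.extend(batch)
    affected = set()
    for i in range(n_old):
        # Distance to the current k-th neighbor; infinite if the list is not yet full.
        kth = knn[i][-1][0] if len(knn[i]) == k else math.inf
        # Only new points strictly closer than the k-th neighbor can enter the list.
        closer = [(math.dist(points[i], points[j]), j) for j in range(n_old, len(points))]
        closer = [c for c in closer if c[0] < kth]
        if closer:
            knn[i] = sorted(knn[i] + closer)[:k]
            affected.add(i)
    # Newly inserted points have no prior list, so theirs are built from scratch.
    for j in range(n_old, len(points)):
        cand = sorted((math.dist(points[j], q), m) for m, q in enumerate(points) if m != j)
        knn.append(cand[:k])
    return affected

points = [(0, 0), (1, 0), (5, 0), (6, 0)]
knn = build_knn(points, k=2)
affected = insert_batch(points, knn, [(5.5, 0)], k=2)
print(affected)  # only the old points near the inserted one are touched
```

In this toy example the inserted point lies between the two right-hand points, so only their lists change; the lists of the two left-hand points are left untouched, and the updated structure matches what a full rebuild would produce.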

Notes

Having a point in the KNN list does not guarantee the formation of a shared strong link between the concerned point and its neighbor. For a shared strong link to exist, each of the two conditions for strong link formation must be satisfied.
The link is broken because the points are no longer in each other's KNN lists (a necessary condition for constructing a shared strong link).
The core point with which the shared link strength is highest becomes the “nearest” core point.
When no more points remain to prevent the shrinkage of the top-K window due to further deletion, Batch-Dec1 involves the entire \(D^{\prime}\) to rebuild e-\(\hbox{KNN}_{updated}\)(p).
We assume that P does not share a strong link with P10, P9 and P7, yet they can be present in \(\hbox{KNN}_{updated}\)(P).
The KNN list of any data point is sorted in increasing order of distances with its top-K neighboring points.
Point-based deletion signifies the execution of \(BISDB_{del}\) (the most effective batch-incremental deletion algorithm) with batches of size 1.
We provide the detailed results pertaining to constant updates for \(BISDB_{add}\) vs InSDB and \(BISDB_{del}\) vs point-based deletion in the supplementary material “Online Resource.pdf” beyond this article.
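The notes above mention sorted KNN lists, shared strong links, and the "nearest" core point. A minimal sketch of how these pieces could fit together follows. The exact definitions used in the article are not reproduced on this page, so the sketch makes two labeled assumptions in the style of standard shared nearest neighbor clustering: link strength is the number of neighbors two points have in common, and a strong link requires both mutual KNN membership and a strength of at least a threshold. The names `knn_lists`, `link_strength`, `has_strong_link`, `nearest_core`, and `delta_sim` are hypothetical.

```python
import math

def knn_lists(points, k):
    """For each point index, its k nearest neighbors sorted by increasing distance."""
    lists = {}
    for i, p in enumerate(points):
        others = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        lists[i] = [j for _, j in others[:k]]
    return lists

def link_strength(knn, p, q):
    """Assumed definition: number of shared neighbors, or 0 when the points
    are not in each other's KNN lists (the necessary condition in the notes)."""
    if q not in knn[p] or p not in knn[q]:
        return 0
    return len(set(knn[p]) & set(knn[q]))

def has_strong_link(knn, p, q, delta_sim):
    """Assumed second condition: the strength must reach the threshold delta_sim."""
    return link_strength(knn, p, q) >= delta_sim

def nearest_core(knn, p, core_points):
    """Core point with the highest shared link strength to p, or None if no link exists."""
    best = max(core_points, key=lambda c: link_strength(knn, p, c), default=None)
    if best is None or link_strength(knn, p, best) == 0:
        return None
    return best

points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
knn = knn_lists(points, k=3)
print(link_strength(knn, 1, 2))        # mutual neighbors with common neighbors
print(link_strength(knn, 3, 4))        # point 4 is not in point 3's list
print(nearest_core(knn, 0, [4, 2]))    # core point most strongly linked to point 0
```

Here the isolated point 4 has point 3 in its own KNN list, but point 3 does not list point 4, so no link forms between them. This mirrors the first note: presence in a KNN list alone does not guarantee a shared strong link.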

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Indian Institute of Technology Guwahati, North Guwahati, Amingaon, Assam, 781039, India
Panthadeep Bhattacharjee & Pinaki Mitra

Corresponding author

Correspondence to Panthadeep Bhattacharjee.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 116 KB)

About this article

Cite this article

Bhattacharjee, P., Mitra, P. BISDBx: towards batch-incremental clustering for dynamic datasets using SNN-DBSCAN. Pattern Anal Applic 23, 975–1009 (2020). https://doi.org/10.1007/s10044-019-00831-1

Received: 25 October 2018
Accepted: 17 June 2019
Published: 01 July 2019
Issue Date: May 2020
DOI: https://doi.org/10.1007/s10044-019-00831-1
