Abstract
Many important applications such as recommender systems, e-commerce sites, web crawlers involve dynamic datasets. Dynamic datasets undergo frequent changes in the form of insertion or deletion of data that affects its size. A naive algorithm may not process these frequent changes efficiently as it involves the entire set of data points each time a change is inflicted. Fast incremental algorithms process these updates to datasets efficiently to avoid redundant computation. In this article, we propose incremental extensions to shared nearest neighbor density-based clustering (SNNDB) algorithm for both addition and deletion of data points. Existing incremental extension to SNNDB viz. InSDB cannot handle deletion and handles insertions one point at a time. Our method overcomes both these bottlenecks by efficiently identifying affected parts of clusters while processing updates to dataset in batch mode. We propose three incremental variants of SNNDB in batch mode for both addition and deletion with the third variant being the most effective. Experimental observations on real world and synthetic datasets showed that our algorithms are up to 4 orders of magnitude faster than the naive SNNDB algorithm and about 2 orders of magnitude faster than the pointwise incremental method.
Similar content being viewed by others
Notes
Having a point in the KNN list does not guarantee the formation of a shared strong link between the concerned point and its neighbor. For a shared strong link to exist, each of the two conditions for strong link formation must be satisfied.
The link gets broken as the points are no longer in each others’ KNN list (a necessary condition to construct a shared strong link).
The core point with which the shared link strength is highest becomes the “nearest” core point.
When no more points remain to prevent the shrinkage of top-K window due to further deletion, Batch-Dec1 involves entire \(D^ \prime\) to rebuild e-\(\hbox {KNN}_{updated}\)(p).
We assume that P does not share a strong link with P10, P9 and P7 yet they can be present in \(\hbox {KNN}_{updated}\)(P).
The KNN list of any data point is sorted in increasing order of distances with its top-K neighboring points.
Point-based deletion signifies the execution of \(BISDB_{del}\) (most effective batch-incremental deletion algorithm) with batches of size 1.
We provide the detailed results pertaining to constant updates for \(BISDB_{add}\) vs InSDB and \(BISDB_{del}\) vs point-based deletion in the supplementary material “Online Resource.pdf” beyond this article.
References
Aggarwal CC, Reddy CK (2013) Data clustering: algorithms and applications. CRC Press, Boca Raton
Berkhin P (2006) A survey of clustering data mining techniques. In: Kogan J, Nicholas C, Teboulle M (eds) Grouping multidimensional data. Springer, Berlin, Heidelberg, pp 25–71
Can F (1993) Incremental clustering for dynamic information processing. ACM Trans Inf Syst (TOIS) 11(2):143–164
Charikar M, Chekuri C, Feder T, Motwani R (2004) Incremental clustering and dynamic information retrieval. SIAM J Comput 33(6):1417–1440
Crespo F, Weber R (2005) A methodology for dynamic data mining based on fuzzy clustering. Fuzzy Sets Syst 150(2):267–284
Ertöz L, Steinbach M, Kumar V (2003) Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In: Proceedings of the 2003 SIAM international conference on data mining. SIAM, pp 47–58
Ester M, Kriegel HP, Sander J, Xu X et al (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. KDD 96:226–231
Ester M, Kriegel HP, Sander J, Wimmer M, Xu X (1998) Incremental clustering for mining in a data warehousing environment. VLDB, Citeseer 98:323–333
Fahad A, Alshatri N, Tari Z, Alamri A, Khalil I, Zomaya AY, Foufou S, Bouras A (2014) A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans Emerg Top Comput 2(3):267–279
Han J, Pei J, Kamber M (2011) Data mining: concepts and techniques. Elsevier, Amsterdam
Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recognit Lett 31(8):651–666
Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Inc, Englewood Cliffs, NJ
Jarvis RA, Patrick EA (1973) Clustering using a similarity measure based on shared near neighbors. IEEE Trans Comput 100(11):1025–1034
Kong D, Ding C, Huang H (2011) Robust nonnegative matrix factorization using l21-norm. In: Proceedings of the 20th ACM international conference on information and knowledge management. ACM, New York, NY, USA, CIKM ’11, pp 673–682. https://doi.org/10.1145/2063576.2063676
Li Z, Tang J (2015) Unsupervised feature selection via nonnegative spectral analysis and redundancy control. IEEE Trans Image Process 24(12):5343–5355
Li Z, Tang J, He X (2018) Robust structured nonnegative matrix factorization for image representation. IEEE Trans Neural Netw Learn Syst 29(5):1947–1960. https://doi.org/10.1109/TNNLS.2017.2691725
Liao TW (2005) Clustering of time series dataa survey. Pattern Recognit 38(11):1857–1874
Lühr S, Lazarescu M (2009) Incremental clustering of dynamic data streams using connectivity based representative points. Data Knowl Eng 68(1):1–27
Peterson LE (2009) K-nearest neighbor. Scholarpedia 4(2):1883
Reed JW, Jiao Y, Potok TE, Klump BA, Elmore MT, Hurson AR (2006) TF-ICF: a new term weighting scheme for clustering dynamic data streams. In: 5th international conference on machine learning and applications, 2006. ICMLA’06. IEEE, pp 258–263
Singh S, Awekar A (2013) Incremental shared nearest neighbor density-based clustering. In: Proceedings of the 22nd ACM international conference on information & knowledge management. ACM, pp 1533–1536
Ting KM, Zhu Y, Carman M, Zhu Y, Zhou ZH (2016) Overcoming key weaknesses of distance-based neighbourhood methods using a data dependent dissimilarity measure. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1205–1214
Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
About this article
Cite this article
Bhattacharjee, P., Mitra, P. BISDBx: towards batch-incremental clustering for dynamic datasets using SNN-DBSCAN. Pattern Anal Applic 23, 975–1009 (2020). https://doi.org/10.1007/s10044-019-00831-1
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10044-019-00831-1