Abstract
Sparse Data Observers (SDO) is an unsupervised learning approach developed to cover the need for fast, highly interpretable, and intuitively parameterizable anomaly detection. We present SDOclust, an extension that performs clustering while preserving the simplicity and applicability of the original approach. In a nutshell, SDOclust considers observers as graph nodes and applies local thresholding to divide the obtained graph into clusters; observers' labels are then propagated to data points following the observation principle. We tested SDOclust on multiple datasets for clustering evaluation using no input parameters (defaults or self-tuning) and nevertheless obtained outstanding performance. SDOclust is a powerful option when statistical estimates are representative and feature spaces conform to distance-based analysis. Its main characteristics are: lightweight, intuitive, self-adjusting, noise-resistant, able to extract non-convex clusters, and built on robust parameters and interpretable models. Feasibility and rapid integration into real-world applications are the core design goals of SDOclust.
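As an illustration of the pipeline summarized above, the following is a minimal, hypothetical sketch, not the authors' implementation: the sampling scheme, the exact thresholding rule, and the parameter names (`n_observers`, `x`, `zeta`, `seed`) are simplifying assumptions. It samples observers, links each observer to its x nearest fellow observers, cuts long edges by local distance thresholding, takes connected components as clusters, and propagates observer labels to data points by nearest observer.

```python
# Illustrative sketch of the SDOclust idea: observers as graph nodes,
# local thresholding to split the observer graph, label propagation to points.
# NOT the authors' implementation; sampling, thresholding rule, and parameter
# names (n_observers, x, zeta) are simplifying assumptions.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def sdoclust_sketch(X, n_observers=50, x=5, zeta=0.6, seed=None):
    rng = np.random.default_rng(seed)
    # 1) Sample observers (SDO derives its low-density model from a sample).
    idx = rng.choice(len(X), size=min(n_observers, len(X)), replace=False)
    O = X[idx]
    # 2) Pairwise distances among observers; each observer's x nearest peers.
    D = np.linalg.norm(O[:, None, :] - O[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    nn = np.argsort(D, axis=1)[:, :x]
    local = np.take_along_axis(D, nn, axis=1).mean(axis=1)  # local scale
    # 3) Local thresholding: zeta blends each observer's local scale with the
    #    global mean scale; an edge survives if short for either endpoint.
    thr = zeta * local + (1.0 - zeta) * local.mean()
    rows, cols = [], []
    for i in range(len(O)):
        for j in nn[i]:
            if D[i, j] <= max(thr[i], thr[j]):
                rows.append(i)
                cols.append(j)
    A = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(len(O), len(O)))
    n_clusters, obs_labels = connected_components(A, directed=False)
    # 4) Observation principle (simplified): each data point inherits the
    #    label of its nearest observer.
    d_po = np.linalg.norm(X[:, None, :] - O[None, :, :], axis=-1)
    return obs_labels[d_po.argmin(axis=1)], n_clusters
```

Because no x-nearest-neighbor edge crosses well-separated groups, such groups can never merge; the actual algorithm additionally self-tunes the number of observers and the thresholds via the parameters listed in the notes.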
Notes
- 1. We term a data point as impure when it lies in unclear zones between clusters.
- 2. With distance, or the d(.) function, we refer to Euclidean distance, but the method is not restricted to it.
- 3. I.e., the coordinate with the highest value. In the absence of a dominant coordinate, the algorithm chooses one randomly among the highest candidates.
- 4. \(\boldsymbol{q}\) is commonly obtained as \(\boldsymbol{q}=Q(\boldsymbol{\rho },P)\), where Q(.) is the quantile function.
- 5.
- 6. HDBSCAN parameters: min_cluster_size = 5, cluster_selection_epsilon = 0.0, approx_min_span_tree = True, allow_single_cluster = False, min_samples = None, algorithm = 'best', p = None, alpha = 1.0, metric = 'euclidean', leaf_size = 40, memory = Memory(location = None), cluster_selection_method = 'eom', gen_min_span_tree = False, core_dist_n_jobs = 4, prediction_data = False, match_reference_implementation = False; SDOclust parameters: x = 5, qv = 0.3, zeta = 0.6, chi_min = 8, chi_prop = 0.05, e = 3, chi = None, xc = None, k = None, q = None.
- 7. k-means-- is tuned with maximum_iterations = 1000 and tol = 0.0001, where tol is the convergence criterion for centroid displacement.
- 8.
- 9.
- 10.
- 11. Extracted with Go-flows [28].
- 12.
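The quantile operation in note 4 maps directly onto, e.g., NumPy's quantile function; the score values below are hypothetical and serve only to show the call:

```python
import numpy as np

# Hypothetical observation scores rho and quantile parameter P (cf. note 4):
rho = np.array([0.2, 0.5, 0.9, 1.4, 2.0])
P = 0.3
q = np.quantile(rho, P)  # q = Q(rho, P), a cutoff on the score distribution
```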
References
Archana, N., Pawar, S.: Periodicity detection of outlier sequences using constraint based pattern tree with MAD. Int. J. Adv. Stud. Comput. Sci. Eng. 4(6), 34 (2015)
Böhm, C., Faloutsos, C., Pan, J.Y., Plant, C.: Robust information-theoretic clustering. In: 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2006, pp. 65–75. Association for Computing Machinery, New York (2006)
Campello, R.J.G.B., Moulavi, D., Zimek, A., Sander, J.: Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Trans. Knowl. Discov. Data 10(1), 5:1–5:51 (2015)
Chawla, S., Gionis, A.: k-means--: a unified approach to clustering and outlier detection. In: 2013 SIAM International Conference on Data Mining, pp. 189–197. SIAM (2013)
Chen, L., Xu, L., Li, G.: Anomaly detection using spatio-temporal correlation and information entropy in wireless sensor networks. In: IEEE Congress on Cybermatics: iThings, GreenCom, CPSCom, SmartData, pp. 121–128 (2020)
Cho, K., Mitsuya, K., Kato, A.: Traffic data repository at the wide project. In: Proceedings of the Annual Conference on USENIX Annual Technical Conference, ATEC 2000, p. 51. USENIX Association, USA (2000)
Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)
Fränti, P., Virmajoki, O.: Iterative shrinking method for clustering problems. Pattern Recogn. 39(5), 761–765 (2006)
Fränti, P., Virmajoki, O., Hautamäki, V.: Fast agglomerative clustering using a k-nearest neighbor graph. IEEE Trans. Pattern Anal. Mach. Intell. 28(11), 1875–1881 (2006)
Fränti, P., Sieranoja, S.: K-means properties on six clustering benchmark datasets. Appl. Intell. 48(12), 4743–4759 (2018)
Gionis, A., Mannila, H., Tsaparas, P.: Clustering aggregation. ACM Trans. Knowl. Disc. Data (TKDD) 1(1), 4-es (2007)
Hartl, A., Iglesias, F., Zseby, T.: SDOstream: low-density models for streaming outlier detection. In: 28th ESANN, Bruges, Belgium, 2–4 October 2020, pp. 661–666 (2020)
Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)
Iglesias, F., Zseby, T., Ferreira, D., Zimek, A.: MDCGen: multidimensional dataset generator for clustering. J. Classif. 36(3), 599–618 (2019)
Iglesias, F., Zseby, T., Zimek, A.: Outlier detection based on low density models. In: IEEE International Conference on Data Mining Workshops (ICDMW), pp. 970–979 (2018)
Iglesias Vázquez, F.: SDOclust Evaluation Tests (2023). https://doi.org/10.48436/3q7jp-mg161
Kärkkäinen, I., Fränti, P.: Gradual model generator for single-pass clustering. Pattern Recogn. 40(3), 784–795 (2007)
von Luxburg, U., Williamson, R.C., Guyon, I.: Clustering: science or art? In: Proceedings of ICML Workshop on Unsupervised and Transfer Learning, vol. 27, pp. 65–79. PMLR (2012)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
McInnes, L., Healy, J., Astels, S.: hdbscan: Hierarchical density based clustering. J. Open Source Softw. 2 (2017). https://hdbscan.readthedocs.io/en/latest/
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Rezaei, M., Fränti, P.: Can the number of clusters be determined by external indices? IEEE Access 8, 89239–89257 (2020)
Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
Ruiz Martorell, G., López Plazas, F., Cuchí Burgos, A.: Sistema d’informació del consum d’energia i d’aigua de la UPC (Sirena). 1r Congrés UPC Sostenible (2007)
Sun, J., Qiu, Y., Shang, Y., Lu, G.: A multi-fault advanced diagnosis method based on sparse data observers for lithium-ion batteries. J. Energy Storage 50, 104694 (2022)
Tsai, C.-Y., Chiu, C.-C.: A clustering-oriented star coordinate translation method for reliable clustering parameterization. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds.) PAKDD 2008. LNCS (LNAI), vol. 5012, pp. 749–758. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-68125-0_72
Veenman, C.J., Reinders, M.J.T., Backer, E.: A maximum variance cluster algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 24(9), 1273–1280 (2002)
Vormayr, G., Fabini, J., Zseby, T.: Why are my flows different? a tutorial on flow exporters. IEEE Commun. Surv. Tutor. 22(3), 2064–2103 (2020)
Xu, D., Tian, Y.: A comprehensive survey of clustering algorithms. Ann. Data Sci. 2 (2015)
Xu, R., Wunsch, D.: Survey of clustering algorithms. IEEE Trans. Neural Netw. 16(3), 645–678 (2005)
Zimek, A., Gaudet, M., Campello, R.J., Sander, J.: Subsampling for efficient and effective unsupervised outlier detection ensembles. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 428–436 (2013)
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Iglesias, F., Zseby, T., Hartl, A., Zimek, A. (2023). SDOclust: Clustering with Sparse Data Observers. In: Pedreira, O., Estivill-Castro, V. (eds) Similarity Search and Applications. SISAP 2023. Lecture Notes in Computer Science, vol 14289. Springer, Cham. https://doi.org/10.1007/978-3-031-46994-7_16
DOI: https://doi.org/10.1007/978-3-031-46994-7_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-46993-0
Online ISBN: 978-3-031-46994-7
eBook Packages: Computer Science (R0)