
SDOclust: Clustering with Sparse Data Observers

  • Conference paper
  • In: Similarity Search and Applications (SISAP 2023)

Abstract

Sparse Data Observers (SDO) is an unsupervised learning approach developed to cover the need for fast, highly interpretable and intuitively parameterizable anomaly detection. We present SDOclust, an extension that performs clustering while preserving the simplicity and applicability of the original approach. In a nutshell, SDOclust treats observers as graph nodes and applies local thresholding to split the resulting graph into clusters; observer labels are then propagated to data points following the observation principle. We tested SDOclust on multiple clustering-evaluation datasets using no input parameters (defaults or self-tuning) and nevertheless obtained outstanding performance. SDOclust is a powerful option when statistical estimates are representative and feature spaces are suitable for distance-based analysis. It is lightweight, intuitive, self-adjusting, noise-resistant, able to extract non-convex clusters, and built on robust parameters and interpretable models. Feasibility and rapid integration into real-world applications are the core goals behind the design of SDOclust.
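To make the workflow above concrete, the following minimal Python sketch follows the same steps: sample observers, connect each observer to its x nearest observers, cut long edges with a locally derived threshold, take connected components as clusters, and propagate observer labels to the data points. It is an illustration under stated assumptions, not the authors' implementation: the sampling scheme, the thresholding rule, and the names n_observers and seed are assumptions (x and zeta echo the parameter names in note 6, and n_observers loosely plays the role of the model size); the reference code is available at https://github.com/CN-TU/pysdoclust (note 5).

import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def sdoclust_sketch(X, n_observers=100, x=5, zeta=0.6, seed=0):
    # 1) Sample a small set of observers from the data (assumption: uniform sampling).
    rng = np.random.default_rng(seed)
    O = X[rng.choice(len(X), size=min(n_observers, len(X)), replace=False)]

    # 2) Build a graph on the observers: each observer links to its x nearest observers.
    D = cdist(O, O)
    np.fill_diagonal(D, np.inf)
    nn = np.argsort(D, axis=1)[:, :x]
    nn_dist = np.take_along_axis(D, nn, axis=1)

    # 3) Local thresholding (illustrative rule): keep an edge only if it is not much
    #    longer than the blended local/global neighbourhood scale controlled by zeta.
    local_scale = nn_dist.mean(axis=1)
    thr = zeta * local_scale + (1.0 - zeta) * local_scale.mean()
    rows, cols = [], []
    for i in range(len(O)):
        for j in nn[i]:
            if D[i, j] <= max(thr[i], thr[j]):
                rows.append(i)
                cols.append(j)
    A = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(len(O), len(O)))

    # 4) Clusters of observers = connected components of the thresholded graph.
    _, obs_labels = connected_components(A, directed=False)

    # 5) Propagate observer labels to data points: each point takes the majority
    #    label among its x closest observers (the "observation principle").
    closest = np.argsort(cdist(X, O), axis=1)[:, :x]
    return np.array([np.bincount(obs_labels[c]).argmax() for c in closest])

Even this crude variant separates well-isolated groups; the actual SDOclust additionally self-tunes its model size and deals with impure points and outliers (see the notes below), which this sketch omits.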

Notes

  1. We term a data point impure when it lies in unclear zones between clusters.

  2. With distance (or the d(.) function) we refer to the Euclidean distance, but the method is not restricted to it.

  3. I.e., the coordinate with the highest value. In the absence of a single dominant coordinate, the algorithm picks one at random among the highest candidates.

  4. \(\boldsymbol{q}\) is commonly obtained as \(\boldsymbol{q}=Q(\boldsymbol{\rho },P)\), where Q(.) is the quantile function.

  5. https://github.com/CN-TU/pysdoclust.

  6. HDBSCAN parameters: min_cluster_size = 5, cluster_selection_epsilon = 0.0, approx_min_span_tree = True, allow_single_cluster = False, min_samples = None, algorithm = 'best', p = None, alpha = 1.0, metric = 'euclidean', leaf_size = 40, memory = Memory(location = None), cluster_selection_method = 'eom', gen_min_span_tree = False, core_dist_n_jobs = 4, prediction_data = False, match_reference_implementation = False (see the snippet after these notes);

     SDOclust parameters: x = 5, qv = 0.3, zeta = 0.6, chi_min = 8, chi_prop = 0.05, e = 3, chi = None, xc = None, k = None, q = None.

  7. k-means-- is tuned with maximum_iterations = 1000 and tol = 0.0001, where tol is the convergence criterion for centroid displacement.

  8. https://cs.joensuu.fi/sipu/datasets/.

  9. https://scikit-learn.org/stable/datasets.html.

  10. https://mawi.wide.ad.jp/mawi/samplepoint-F/2022/202207311400.html.

  11. Extracted with Go-flows [28].

  12. https://upcsirena.app.dexma.com/.
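As a concrete reference for notes 3, 4 and 6, the snippet below shows how those settings map onto standard Python tooling: random tie-breaking among maximal coordinates, the quantile expression q = Q(ρ, P) via numpy.quantile, and the HDBSCAN baseline instantiated with exactly the parameters listed in note 6 (hdbscan package, reference [20]). The vector v, the placeholder rho, P and the toy data are illustrative assumptions, not values used in the paper.

import numpy as np
import hdbscan
from joblib import Memory

rng = np.random.default_rng(0)

# Note 3: pick the dominant coordinate; break ties randomly among the maxima.
v = np.array([0.2, 0.5, 0.5, 0.1])                         # placeholder vector
dominant = rng.choice(np.flatnonzero(v == v.max()))

# Note 4: q = Q(rho, P), the P-quantile of the vector rho (placeholder values).
rho = rng.gamma(shape=5.0, size=500)
P = 0.3
q = np.quantile(rho, P)

# Note 6: the HDBSCAN baseline with the parameters listed above.
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=5, cluster_selection_epsilon=0.0,
    approx_min_span_tree=True, allow_single_cluster=False,
    min_samples=None, algorithm='best', p=None, alpha=1.0,
    metric='euclidean', leaf_size=40, memory=Memory(location=None),
    cluster_selection_method='eom', gen_min_span_tree=False,
    core_dist_n_jobs=4, prediction_data=False,
    match_reference_implementation=False,
)
labels = clusterer.fit_predict(rng.normal(size=(500, 2)))  # toy data

The SDOclust parameters in note 6 would be passed analogously to the pysdoclust implementation; its exact class and method names are not reproduced here to avoid guessing its API.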

References

  1. Archana, N., Pawar, S.: Periodicity detection of outlier sequences using constraint based pattern tree with MAD. Int. J. Adv. Stud. Comput. Sci. Eng. 4(6), 34 (2015)

  2. Böhm, C., Faloutsos, C., Pan, J.Y., Plant, C.: Robust information-theoretic clustering. In: 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2006, pp. 65–75. Association for Computing Machinery, New York (2006)

  3. Campello, R.J.G.B., Moulavi, D., Zimek, A., Sander, J.: Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Trans. Knowl. Discov. Data 10(1), 5:1–5:51 (2015)

  4. Chawla, S., Gionis, A.: k-means--: a unified approach to clustering and outlier detection. In: 2013 SIAM International Conference on Data Mining, pp. 189–197. SIAM (2013)

  5. Chen, L., Xu, L., Li, G.: Anomaly detection using spatio-temporal correlation and information entropy in wireless sensor networks. In: IEEE Congress on Cybermatics: iThings, GreenCom, CPSCom, SmartData, pp. 121–128 (2020)

  6. Cho, K., Mitsuya, K., Kato, A.: Traffic data repository at the WIDE project. In: Proceedings of the Annual Conference on USENIX Annual Technical Conference, ATEC 2000, p. 51. USENIX Association, USA (2000)

  7. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)

  8. Fränti, P., Virmajoki, O.: Iterative shrinking method for clustering problems. Pattern Recogn. 39(5), 761–765 (2006)

  9. Fränti, P., Virmajoki, O., Hautamäki, V.: Fast agglomerative clustering using a k-nearest neighbor graph. IEEE Trans. Pattern Anal. Mach. Intell. 28(11), 1875–1881 (2006)

  10. Fränti, P., Sieranoja, S.: K-means properties on six clustering benchmark datasets. Appl. Intell. 48(12), 4743–4759 (2018)

  11. Gionis, A., Mannila, H., Tsaparas, P.: Clustering aggregation. ACM Trans. Knowl. Discov. Data (TKDD) 1(1), 4-es (2007)

  12. Hartl, A., Iglesias, F., Zseby, T.: SDOstream: low-density models for streaming outlier detection. In: 28th ESANN, Bruges, Belgium, 2–4 October 2020, pp. 661–666 (2020)

  13. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)

  14. Iglesias, F., Zseby, T., Ferreira, D., Zimek, A.: MDCGen: multidimensional dataset generator for clustering. J. Classif. 36(3), 599–618 (2019)

  15. Iglesias, F., Zseby, T., Zimek, A.: Outlier detection based on low density models. In: IEEE International Conference on Data Mining Workshops (ICDMW), pp. 970–979 (2018)

  16. Iglesias Vázquez, F.: SDOclust Evaluation Tests (2023). https://doi.org/10.48436/3q7jp-mg161

  17. Kärkkäinen, I., Fränti, P.: Gradual model generator for single-pass clustering. Pattern Recogn. 40(3), 784–795 (2007)

  18. von Luxburg, U., Williamson, R.C., Guyon, I.: Clustering: science or art? In: Proceedings of ICML Workshop on Unsupervised and Transfer Learning, vol. 27, pp. 65–79. PMLR (2012)

  19. Van der Maaten, L., Hinton, G.: Visualizing high-dimensional data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)

  20. McInnes, L., Healy, J., Astels, S.: hdbscan: hierarchical density based clustering. J. Open Source Softw. 2 (2017). https://hdbscan.readthedocs.io/en/latest/

  21. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  22. Rezaei, M., Fränti, P.: Can the number of clusters be determined by external indices? IEEE Access 8, 89239–89257 (2020)

  23. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)

  24. Ruiz Martorell, G., López Plazas, F., Cuchí Burgos, A.: Sistema d'informació del consum d'energia i d'aigua de la UPC (Sirena). 1r Congrés UPC Sostenible (2007)

  25. Sun, J., Qiu, Y., Shang, Y., Lu, G.: A multi-fault advanced diagnosis method based on sparse data observers for lithium-ion batteries. J. Energy Storage 50, 104694 (2022)

  26. Tsai, C.-Y., Chiu, C.-C.: A clustering-oriented star coordinate translation method for reliable clustering parameterization. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds.) PAKDD 2008. LNCS (LNAI), vol. 5012, pp. 749–758. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-68125-0_72

  27. Veenman, C.J., Reinders, M.J.T., Backer, E.: A maximum variance cluster algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 24(9), 1273–1280 (2002)

  28. Vormayr, G., Fabini, J., Zseby, T.: Why are my flows different? A tutorial on flow exporters. IEEE Commun. Surv. Tutor. 22(3), 2064–2103 (2020)

  29. Xu, D., Tian, Y.: A comprehensive survey of clustering algorithms. Ann. Data Sci. 2 (2015)

  30. Xu, R., Wunsch, D.: Survey of clustering algorithms. IEEE Trans. Neural Netw. 16(3), 645–678 (2005)

  31. Zimek, A., Gaudet, M., Campello, R.J., Sander, J.: Subsampling for efficient and effective unsupervised outlier detection ensembles. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 428–436 (2013)

Author information

Correspondence to Félix Iglesias or Arthur Zimek.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Iglesias, F., Zseby, T., Hartl, A., Zimek, A. (2023). SDOclust: Clustering with Sparse Data Observers. In: Pedreira, O., Estivill-Castro, V. (eds) Similarity Search and Applications. SISAP 2023. Lecture Notes in Computer Science, vol 14289. Springer, Cham. https://doi.org/10.1007/978-3-031-46994-7_16

  • DOI: https://doi.org/10.1007/978-3-031-46994-7_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-46993-0

  • Online ISBN: 978-3-031-46994-7

  • eBook Packages: Computer Science (R0)
