
SDOclust: Clustering with Sparse Data Observers

  • Conference paper
  • In: Similarity Search and Applications (SISAP 2023)

Abstract

Sparse Data Observers (SDO) is an unsupervised learning approach developed to cover the need for fast, highly interpretable and intuitively parameterizable anomaly detection. We present SDOclust, an extension that performs clustering while preserving the simplicity and applicability of the original approach. In a nutshell, SDOclust treats observers as graph nodes and applies local thresholding to split the resulting graph into clusters; observer labels are then propagated to data points following the observation principle. We tested SDOclust on multiple clustering-evaluation datasets using no input parameters (defaults or self-tuning) and nevertheless obtained outstanding performance. SDOclust is a powerful option when statistical estimates are representative and feature spaces are suitable for distance-based analysis. It is lightweight, intuitive, self-adjusting, noise-resistant, able to extract non-convex clusters, and built on robust parameters and interpretable models. Feasibility and rapid integration into real-world applications are the core goals behind the design of SDOclust.
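To make the workflow above concrete, the following minimal Python sketch follows the same steps: sample observers, connect each observer to its x nearest observers, cut long edges with a locally derived threshold, take connected components as clusters, and propagate observer labels to the data points. It is an illustration under stated assumptions, not the authors' implementation: the sampling scheme, the thresholding rule, and the names n_observers and seed are assumptions (x and zeta echo the parameter names in note 6, and n_observers loosely plays the role of the model size); the reference code is available at https://github.com/CN-TU/pysdoclust (note 5).

import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def sdoclust_sketch(X, n_observers=100, x=5, zeta=0.6, seed=0):
    # 1) Sample a small set of observers from the data (assumption: uniform sampling).
    rng = np.random.default_rng(seed)
    O = X[rng.choice(len(X), size=min(n_observers, len(X)), replace=False)]

    # 2) Build a graph on the observers: each observer links to its x nearest observers.
    D = cdist(O, O)
    np.fill_diagonal(D, np.inf)
    nn = np.argsort(D, axis=1)[:, :x]
    nn_dist = np.take_along_axis(D, nn, axis=1)

    # 3) Local thresholding (illustrative rule): keep an edge only if it is not much
    #    longer than the blended local/global neighbourhood scale controlled by zeta.
    local_scale = nn_dist.mean(axis=1)
    thr = zeta * local_scale + (1.0 - zeta) * local_scale.mean()
    rows, cols = [], []
    for i in range(len(O)):
        for j in nn[i]:
            if D[i, j] <= max(thr[i], thr[j]):
                rows.append(i)
                cols.append(j)
    A = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(len(O), len(O)))

    # 4) Clusters of observers = connected components of the thresholded graph.
    _, obs_labels = connected_components(A, directed=False)

    # 5) Propagate observer labels to data points: each point takes the majority
    #    label among its x closest observers (the "observation principle").
    closest = np.argsort(cdist(X, O), axis=1)[:, :x]
    return np.array([np.bincount(obs_labels[c]).argmax() for c in closest])

Even this crude variant separates well-isolated groups; the actual SDOclust additionally self-tunes its model size and deals with impure points and outliers (see the notes below), which this sketch omits.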

Notes

  1. We term a data point impure when it lies in unclear zones between clusters.

  2. With distance (or the d(.) function) we refer to the Euclidean distance, but the method is not restricted to it.

  3. I.e., the coordinate with the highest value. In the absence of a single dominant coordinate, the algorithm picks one at random among the highest candidates.

  4. \(\boldsymbol{q}\) is commonly obtained as \(\boldsymbol{q}=Q(\boldsymbol{\rho },P)\), where Q(.) is the quantile function.

  5. https://github.com/CN-TU/pysdoclust.

  6. HDBSCAN parameters: min_cluster_size = 5, cluster_selection_epsilon = 0.0, approx_min_span_tree = True, allow_single_cluster = False, min_samples = None, algorithm = 'best', p = None, alpha = 1.0, metric = 'euclidean', leaf_size = 40, memory = Memory(location = None), cluster_selection_method = 'eom', gen_min_span_tree = False, core_dist_n_jobs = 4, prediction_data = False, match_reference_implementation = False (see the snippet after these notes);

     SDOclust parameters: x = 5, qv = 0.3, zeta = 0.6, chi_min = 8, chi_prop = 0.05, e = 3, chi = None, xc = None, k = None, q = None.

  7. k-means-- is tuned with maximum_iterations = 1000 and tol = 0.0001, where tol is the convergence criterion for centroid displacement.

  8. https://cs.joensuu.fi/sipu/datasets/.

  9. https://scikit-learn.org/stable/datasets.html.

  10. https://mawi.wide.ad.jp/mawi/samplepoint-F/2022/202207311400.html.

  11. Extracted with Go-flows [28].

  12. https://upcsirena.app.dexma.com/.
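As a concrete reference for notes 3, 4 and 6, the snippet below shows how those settings map onto standard Python tooling: random tie-breaking among maximal coordinates, the quantile expression q = Q(ρ, P) via numpy.quantile, and the HDBSCAN baseline instantiated with exactly the parameters listed in note 6 (hdbscan package, reference [20]). The vector v, the placeholder rho, P and the toy data are illustrative assumptions, not values used in the paper.

import numpy as np
import hdbscan
from joblib import Memory

rng = np.random.default_rng(0)

# Note 3: pick the dominant coordinate; break ties randomly among the maxima.
v = np.array([0.2, 0.5, 0.5, 0.1])                         # placeholder vector
dominant = rng.choice(np.flatnonzero(v == v.max()))

# Note 4: q = Q(rho, P), the P-quantile of the vector rho (placeholder values).
rho = rng.gamma(shape=5.0, size=500)
P = 0.3
q = np.quantile(rho, P)

# Note 6: the HDBSCAN baseline with the parameters listed above.
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=5, cluster_selection_epsilon=0.0,
    approx_min_span_tree=True, allow_single_cluster=False,
    min_samples=None, algorithm='best', p=None, alpha=1.0,
    metric='euclidean', leaf_size=40, memory=Memory(location=None),
    cluster_selection_method='eom', gen_min_span_tree=False,
    core_dist_n_jobs=4, prediction_data=False,
    match_reference_implementation=False,
)
labels = clusterer.fit_predict(rng.normal(size=(500, 2)))  # toy data

The SDOclust parameters in note 6 would be passed analogously to the pysdoclust implementation; its exact class and method names are not reproduced here to avoid guessing its API.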

References

  1. Archana, N., Pawar, S.: Periodicity detection of outlier sequences using constraint based pattern tree with MAD. Int. J. Adv. Stud. Comput. Sci. Eng. 4(6), 34 (2015)

  2. Böhm, C., Faloutsos, C., Pan, J.Y., Plant, C.: Robust information-theoretic clustering. In: 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2006, pp. 65–75. Association for Computing Machinery, New York (2006)

  3. Campello, R.J.G.B., Moulavi, D., Zimek, A., Sander, J.: Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Trans. Knowl. Discov. Data 10(1), 5:1–5:51 (2015)

  4. Chawla, S., Gionis, A.: k-means--: a unified approach to clustering and outlier detection. In: 2013 SIAM International Conference on Data Mining, pp. 189–197. SIAM (2013)

  5. Chen, L., Xu, L., Li, G.: Anomaly detection using spatio-temporal correlation and information entropy in wireless sensor networks. In: IEEE Congress on Cybermatics: iThings, GreenCom, CPSCom, SmartData, pp. 121–128 (2020)

  6. Cho, K., Mitsuya, K., Kato, A.: Traffic data repository at the WIDE project. In: Proceedings of the Annual Conference on USENIX Annual Technical Conference, ATEC 2000, p. 51. USENIX Association, USA (2000)

  7. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)

  8. Fränti, P., Virmajoki, O.: Iterative shrinking method for clustering problems. Pattern Recogn. 39(5), 761–765 (2006)

  9. Fränti, P., Virmajoki, O., Hautamäki, V.: Fast agglomerative clustering using a k-nearest neighbor graph. IEEE Trans. Pattern Anal. Mach. Intell. 28(11), 1875–1881 (2006)

  10. Fränti, P., Sieranoja, S.: K-means properties on six clustering benchmark datasets. Appl. Intell. 48(12), 4743–4759 (2018)

  11. Gionis, A., Mannila, H., Tsaparas, P.: Clustering aggregation. ACM Trans. Knowl. Discov. Data (TKDD) 1(1), 4-es (2007)

  12. Hartl, A., Iglesias, F., Zseby, T.: SDOstream: low-density models for streaming outlier detection. In: 28th ESANN, Bruges, Belgium, 2–4 October 2020, pp. 661–666 (2020)

  13. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)

  14. Iglesias, F., Zseby, T., Ferreira, D., Zimek, A.: MDCGen: multidimensional dataset generator for clustering. J. Classif. 36(3), 599–618 (2019)

  15. Iglesias, F., Zseby, T., Zimek, A.: Outlier detection based on low density models. In: IEEE International Conference on Data Mining Workshops (ICDMW), pp. 970–979 (2018)

  16. Iglesias Vázquez, F.: SDOclust Evaluation Tests (2023). https://doi.org/10.48436/3q7jp-mg161

  17. Kärkkäinen, I., Fränti, P.: Gradual model generator for single-pass clustering. Pattern Recogn. 40(3), 784–795 (2007)

  18. von Luxburg, U., Williamson, R.C., Guyon, I.: Clustering: science or art? In: Proceedings of ICML Workshop on Unsupervised and Transfer Learning, vol. 27, pp. 65–79. PMLR (2012)

  19. Van der Maaten, L., Hinton, G.: Visualizing high-dimensional data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)

  20. McInnes, L., Healy, J., Astels, S.: hdbscan: hierarchical density based clustering. J. Open Source Softw. 2 (2017). https://hdbscan.readthedocs.io/en/latest/

  21. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  22. Rezaei, M., Fränti, P.: Can the number of clusters be determined by external indices? IEEE Access 8, 89239–89257 (2020)

  23. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)

  24. Ruiz Martorell, G., López Plazas, F., Cuchí Burgos, A.: Sistema d'informació del consum d'energia i d'aigua de la UPC (Sirena). 1r Congrés UPC Sostenible (2007)

  25. Sun, J., Qiu, Y., Shang, Y., Lu, G.: A multi-fault advanced diagnosis method based on sparse data observers for lithium-ion batteries. J. Energy Storage 50, 104694 (2022)

  26. Tsai, C.-Y., Chiu, C.-C.: A clustering-oriented star coordinate translation method for reliable clustering parameterization. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds.) PAKDD 2008. LNCS (LNAI), vol. 5012, pp. 749–758. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-68125-0_72

  27. Veenman, C.J., Reinders, M.J.T., Backer, E.: A maximum variance cluster algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 24(9), 1273–1280 (2002)

  28. Vormayr, G., Fabini, J., Zseby, T.: Why are my flows different? A tutorial on flow exporters. IEEE Commun. Surv. Tutor. 22(3), 2064–2103 (2020)

  29. Xu, D., Tian, Y.: A comprehensive survey of clustering algorithms. Ann. Data Sci. 2 (2015)

  30. Xu, R., Wunsch, D.: Survey of clustering algorithms. IEEE Trans. Neural Netw. 16(3), 645–678 (2005)

  31. Zimek, A., Gaudet, M., Campello, R.J., Sander, J.: Subsampling for efficient and effective unsupervised outlier detection ensembles. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 428–436 (2013)

Author information

Correspondence to Félix Iglesias or Arthur Zimek.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Iglesias, F., Zseby, T., Hartl, A., Zimek, A. (2023). SDOclust: Clustering with Sparse Data Observers. In: Pedreira, O., Estivill-Castro, V. (eds) Similarity Search and Applications. SISAP 2023. Lecture Notes in Computer Science, vol 14289. Springer, Cham. https://doi.org/10.1007/978-3-031-46994-7_16

  • DOI: https://doi.org/10.1007/978-3-031-46994-7_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-46993-0

  • Online ISBN: 978-3-031-46994-7

  • eBook Packages: Computer Science (R0)
