Data with Density-Based Clusters: A Generator for Systematic Evaluation of Clustering Algorithms

  • Conference paper
Machine Learning and Knowledge Discovery in Databases. Research Track (ECML PKDD 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14947)

Abstract

Mining data containing density-based clusters is well-established and widespread but faces problems when it comes to systematic and reproducible comparison and evaluation. Although the success of clustering methods hinges on data quality and availability, reproducibly generating suitable data for this setting is not easy, leading to mostly low-dimensional toy datasets being used. To resolve this issue, we propose DENSIRED (DENSIty-based Reproducible Experimental Data), a novel data generator for data containing density-based clusters. It is highly flexible w.r.t. a large variety of properties of the data and produces reproducible datasets in a two-step approach. First, skeletons of the clusters are constructed following a random walk. In the second step, these skeletons are enriched with data samples. DENSIRED enables the systematic generation of data for a robust and reliable analysis of methods aimed toward examining data containing density-connected clusters. In extensive experiments, we analyze the impact of user-defined properties on the generated datasets and the intrinsic dimensionalities of synthesized clusters. Our code and novel benchmark datasets are publicly available at: https://github.com/PhilJahn/DENSIRED.
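The two-step idea described in the abstract (a random-walk skeleton, then enrichment with samples) can be illustrated with a minimal sketch. This is not the DENSIRED implementation or its API: the function names `build_skeleton` and `enrich_skeleton`, the fixed step length, and the uniform sampling inside a ball around each core are all assumptions made for illustration only.

```python
import numpy as np


def build_skeleton(n_cores, dim, step, rng):
    """Step 1 (assumed form): place cluster cores along a random walk.

    Each new core lies exactly `step` away from the previous one,
    in a direction drawn uniformly from the unit sphere.
    """
    cores = [np.zeros(dim)]
    for _ in range(n_cores - 1):
        direction = rng.normal(size=dim)
        direction /= np.linalg.norm(direction)
        cores.append(cores[-1] + step * direction)
    return np.asarray(cores)


def enrich_skeleton(cores, n_points, radius, rng):
    """Step 2 (assumed form): sample points around the cores.

    Every point is drawn uniformly from the ball of the given
    radius around a core chosen uniformly at random.
    """
    dim = cores.shape[1]
    chosen = rng.integers(len(cores), size=n_points)
    directions = rng.normal(size=(n_points, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # The 1/dim exponent makes the radial distribution uniform in volume.
    radii = radius * rng.random(n_points) ** (1.0 / dim)
    return cores[chosen] + directions * radii[:, None]


# A fixed seed makes the generated dataset reproducible.
rng = np.random.default_rng(0)
cores = build_skeleton(n_cores=20, dim=5, step=1.0, rng=rng)
points = enrich_skeleton(cores, n_points=500, radius=1.0, rng=rng)
print(cores.shape, points.shape)  # (20, 5) (500, 5)
```

Because the step length here equals the ball radius, neighbouring balls overlap, so the sampled points form one connected, arbitrarily shaped region rather than a set of separate blobs. The actual generator exposes many more properties than this sketch; see the repository linked in the abstract for the real interface.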

Notes

  1. https://scikit-learn.org, last accessed: Oct 5th, 2023.

  2. https://github.com/collinleiber/ClustPy, last accessed: Nov 15th, 2023.

  3. https://github.com/scikit-learn-contrib/hdbscan, last accessed: Oct 5th, 2023.

  4. https://github.com/tobinjo96/DCFcluster, last accessed: Oct 5th, 2023.

  5. https://github.com/SpectralClusteringAcceleratedRobust/SCAR, last accessed: Nov 4th, 2023.

  6. https://bitbucket.org/Sibylse/spectacl, last accessed: Oct 11th, 2023.

  7. https://github.com/colinwke/dpca, last accessed: Oct 31st, 2023.

  8. Exact values: \(\omega = 0.8\), \(\delta = 1.5\), \(\beta = 0.1\), \(\varkappa = 1\), 200 cores.

Author information

Corresponding author

Correspondence to Philipp Jahn.

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2492 KB)

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Jahn, P., Frey, C.M.M., Beer, A., Leiber, C., Seidl, T. (2024). Data with Density-Based Clusters: A Generator for Systematic Evaluation of Clustering Algorithms. In: Bifet, A., Davis, J., Krilavičius, T., Kull, M., Ntoutsi, E., Žliobaitė, I. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2024. Lecture Notes in Computer Science, vol 14947. Springer, Cham. https://doi.org/10.1007/978-3-031-70368-3_1

  • DOI: https://doi.org/10.1007/978-3-031-70368-3_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70367-6

  • Online ISBN: 978-3-031-70368-3

  • eBook Packages: Computer Science, Computer Science (R0)
