Data with Density-Based Clusters: A Generator for Systematic Evaluation of Clustering Algorithms

Jahn, Philipp; Frey, Christian M. M.; Beer, Anna; Leiber, Collin; Seidl, Thomas

doi:10.1007/978-3-031-70368-3_1

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14947)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

Mining data containing density-based clusters is well-established and widespread but faces problems when it comes to systematic and reproducible comparison and evaluation. Although the success of clustering methods hinges on data quality and availability, reproducibly generating suitable data for this setting is not easy, leading to mostly low-dimensional toy datasets being used. To resolve this issue, we propose DENSIRED (DENSIty-based Reproducible Experimental Data), a novel data generator for data containing density-based clusters. It is highly flexible w.r.t. a large variety of properties of the data and produces reproducible datasets in a two-step approach. First, skeletons of the clusters are constructed following a random walk. In the second step, these skeletons are enriched with data samples. DENSIRED enables the systematic generation of data for a robust and reliable analysis of methods aimed toward examining data containing density-connected clusters. In extensive experiments, we analyze the impact of user-defined properties on the generated datasets and the intrinsic dimensionalities of synthesized clusters. Our code and novel benchmark datasets are publicly available at: https://github.com/PhilJahn/DENSIRED.
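
The two-step idea described in the abstract can be illustrated with a minimal sketch. This is not the DENSIRED implementation or its API: the function names, the parameters (`n_cores`, `step`, `radius`), the fixed step length, and the uniform sampling inside balls around the skeleton points are all assumptions chosen for illustration. The sketch only shows how a seeded random walk can produce a reproducible cluster skeleton that is then enriched with data samples.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly from the unit sphere in `dim` dimensions."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 0.0:
            return [x / norm for x in v]


def build_skeleton(n_cores, dim, step, rng):
    """Step 1 (illustrative): place skeleton points along a random walk."""
    cores = [[0.0] * dim]
    for _ in range(n_cores - 1):
        direction = random_unit_vector(dim, rng)
        prev = cores[-1]
        # Each new skeleton point lies exactly `step` away from the previous one.
        cores.append([p + step * d for p, d in zip(prev, direction)])
    return cores


def sample_around(cores, n_points, radius, rng):
    """Step 2 (illustrative): enrich the skeleton with data samples."""
    dim = len(cores[0])
    points = []
    for _ in range(n_points):
        center = rng.choice(cores)
        direction = random_unit_vector(dim, rng)
        # r = radius * u^(1/dim) yields a uniform sample inside the ball.
        r = radius * rng.random() ** (1.0 / dim)
        points.append([c + r * d for c, d in zip(center, direction)])
    return points


def generate_cluster(n_cores=20, n_points=500, dim=5, step=1.0, radius=1.0, seed=0):
    """Hypothetical helper: one cluster from a seeded, reproducible generator."""
    rng = random.Random(seed)  # a fixed seed makes the output reproducible
    cores = build_skeleton(n_cores, dim, step, rng)
    points = sample_around(cores, n_points, radius, rng)
    return cores, points


cores, points = generate_cluster(seed=42)
```

Because consecutive skeleton points are at most `step` apart and every sample lies within `radius` of a skeleton point, the samples form one connected region of arbitrary shape, which is the kind of structure density-based methods are meant to find. Calling `generate_cluster` again with the same seed returns the same data.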

Notes

1. https://scikit-learn.org, last accessed: Oct 5th, 2023.
2. https://github.com/collinleiber/ClustPy, last accessed: Nov 15th, 2023.
3. https://github.com/scikit-learn-contrib/hdbscan, last accessed: Oct 5th, 2023.
4. https://github.com/tobinjo96/DCFcluster, last accessed: Oct 5th, 2023.
5. https://github.com/SpectralClusteringAcceleratedRobust/SCAR, last accessed: Nov 4th, 2023.
6. https://bitbucket.org/Sibylse/spectacl, last accessed: Oct 11th, 2023.
7. https://github.com/colinwke/dpca, last accessed: Oct 31st, 2023.
8. Exact values: $\omega = 0.8$, $\delta = 1.5$, $\beta = 0.1$, $\varkappa = 1$, 200 cores.

Author information

Authors and Affiliations

LMU Munich, Munich, Germany
Philipp Jahn, Collin Leiber & Thomas Seidl
Munich Center for Machine Learning (MCML), Munich, Germany
Philipp Jahn, Collin Leiber & Thomas Seidl
Fraunhofer IIS, Erlangen, Germany
Christian M. M. Frey & Thomas Seidl
University of Vienna, Vienna, Austria
Anna Beer

Corresponding author

Correspondence to Philipp Jahn.

Editor information

Editors and Affiliations

LTCI, Télécom Paris, Palaiseau Cedex, France
Albert Bifet
KU Leuven, Leuven, Belgium
Jesse Davis
Faculty of Informatics, Vytautas Magnus University, Akademija, Lithuania
Tomas Krilavičius
Institute of Computer Science, University of Tartu, Tartu, Estonia
Meelis Kull
Department of Computer Science, Bundeswehr University Munich, Munich, Germany
Eirini Ntoutsi
Dept. of Computer Science, University of Helsinki, Helsinki, Finland
Indrė Žliobaitė

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2492 KB)

About this paper

Cite this paper

Jahn, P., Frey, C.M.M., Beer, A., Leiber, C., Seidl, T. (2024). Data with Density-Based Clusters: A Generator for Systematic Evaluation of Clustering Algorithms. In: Bifet, A., Davis, J., Krilavičius, T., Kull, M., Ntoutsi, E., Žliobaitė, I. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2024. Lecture Notes in Computer Science, vol 14947. Springer, Cham. https://doi.org/10.1007/978-3-031-70368-3_1

DOI: https://doi.org/10.1007/978-3-031-70368-3_1
Published: 22 August 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70367-6
Online ISBN: 978-3-031-70368-3
eBook Packages: Computer Science, Computer Science (R0)
