Abstract
Mining data containing density-based clusters is well-established and widespread but faces problems when it comes to systematic and reproducible comparison and evaluation. Although the success of clustering methods hinges on data quality and availability, reproducibly generating suitable data for this setting is not easy, leading to mostly low-dimensional toy datasets being used. To resolve this issue, we propose DENSIRED (DENSIty-based Reproducible Experimental Data), a novel data generator for data containing density-based clusters. It is highly flexible w.r.t. a large variety of properties of the data and produces reproducible datasets in a two-step approach. First, skeletons of the clusters are constructed following a random walk. In the second step, these skeletons are enriched with data samples. DENSIRED enables the systematic generation of data for a robust and reliable analysis of methods aimed toward examining data containing density-connected clusters. In extensive experiments, we analyze the impact of user-defined properties on the generated datasets and the intrinsic dimensionalities of synthesized clusters. Our code and novel benchmark datasets are publicly available at: https://github.com/PhilJahn/DENSIRED.
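The two-step idea described in the abstract (a random-walk skeleton, then enrichment with samples) can be illustrated with a minimal sketch. This is not the DENSIRED implementation: the function names, the parameters (`n_cores`, `step`, `radius`), the branching rule, and the uniform-ball sampling are all assumptions chosen for illustration, and the sketch makes no attempt to keep clusters separated. It only shows how a seeded random walk followed by sampling around the walk's positions yields a reproducible dataset with connected, arbitrarily shaped clusters.

```python
import math
import random

def unit_vector(dim, rng):
    """Draw a direction uniformly from the unit sphere in `dim` dimensions."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 0.0:
            return [x / norm for x in v]

def random_walk_skeleton(n_cores, dim, step, start, rng):
    """Step 1: place core points by a random walk with fixed step length."""
    cores = [list(start)]
    for _ in range(n_cores - 1):
        # Assumption: continue from a random earlier core so branches can form.
        base = rng.choice(cores)
        direction = unit_vector(dim, rng)
        cores.append([b + step * d for b, d in zip(base, direction)])
    return cores

def enrich(cores, n_points, radius, rng):
    """Step 2: sample points uniformly inside balls around randomly chosen cores."""
    dim = len(cores[0])
    points = []
    for _ in range(n_points):
        core = rng.choice(cores)
        direction = unit_vector(dim, rng)
        # The 1/dim exponent makes the samples uniform in volume, not in radius.
        dist = radius * rng.random() ** (1.0 / dim)
        points.append([c + dist * d for c, d in zip(core, direction)])
    return points

def generate(n_clusters=3, n_cores=20, n_points=200, dim=2,
             step=1.0, radius=0.5, seed=0):
    """Return (points, labels); the same seed always yields the same dataset."""
    rng = random.Random(seed)
    points, labels = [], []
    for label in range(n_clusters):
        # Hypothetical placement: each skeleton starts at a random location.
        start = [rng.uniform(-50.0, 50.0) for _ in range(dim)]
        cores = random_walk_skeleton(n_cores, dim, step, start, rng)
        points.extend(enrich(cores, n_points, radius, rng))
        labels.extend([label] * n_points)
    return points, labels
```

Calling `generate(seed=42)` twice returns identical points and labels, which is the reproducibility property the abstract emphasizes. Keeping `step` at most `2 * radius` means each new ball touches or overlaps the ball it branched from, so every skeleton stays connected.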
Notes
- 1. https://scikit-learn.org, last accessed: Oct 5th, 2023.
- 2. https://github.com/collinleiber/ClustPy, last accessed: Nov 15th, 2023.
- 3. https://github.com/scikit-learn-contrib/hdbscan, last accessed: Oct 5th, 2023.
- 4. https://github.com/tobinjo96/DCFcluster, last accessed: Oct 5th, 2023.
- 5. https://github.com/SpectralClusteringAcceleratedRobust/SCAR, last accessed: Nov 4th, 2023.
- 6. https://bitbucket.org/Sibylse/spectacl, last accessed: Oct 11th, 2023.
- 7. https://github.com/colinwke/dpca, last accessed: Oct 31st, 2023.
- 8. Exact values: \(\omega = 0.8\), \(\delta = 1.5\), \(\beta = 0.1\), \(\varkappa = 1\), 200 cores.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Jahn, P., Frey, C.M.M., Beer, A., Leiber, C., Seidl, T. (2024). Data with Density-Based Clusters: A Generator for Systematic Evaluation of Clustering Algorithms. In: Bifet, A., Davis, J., Krilavičius, T., Kull, M., Ntoutsi, E., Žliobaitė, I. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2024. Lecture Notes in Computer Science, vol 14947. Springer, Cham. https://doi.org/10.1007/978-3-031-70368-3_1
DOI: https://doi.org/10.1007/978-3-031-70368-3_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70367-6
Online ISBN: 978-3-031-70368-3
eBook Packages: Computer Science, Computer Science (R0)