Generating Diverse Clustering Datasets with Targeted Characteristics

dos Santos Fernandes, Luiz Henrique; Smith-Miles, Kate; Lorena, Ana Carolina

doi:10.1007/978-3-031-21686-2_28

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13653)

Included in the following conference series:

Brazilian Conference on Intelligent Systems

Abstract

When evaluating clustering algorithms, it is important to assess their performance in retrieving clusters of datasets with known structures. Nonetheless, generating and choosing diverse datasets to compose such test benchmarks is non-trivial. The datasets must present a large variety of structures and characteristics so that the algorithms can be challenged and their strengths and weaknesses can be revealed. The use of generators currently available in the literature relies on trial-and-error procedures that can be quite costly and inaccurate. Taking advantage of an Instance Space Analysis of popular clustering benchmarks, where datasets are projected into a 2-D embedding with linear trends according to different characteristics, we use a genetic algorithm to produce new datasets at targeted locations in the instance space. This is a natural extension of the Instance Space Analysis framework, and as a result, we are able to produce diverse datasets for composing test benchmarks for clustering.
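The idea of evolving a dataset toward a target location in a 2-D instance space can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the two meta-features in `meta_features`, the projection matrix `A`, the truncation-selection genetic algorithm, and all parameter values are hypothetical stand-ins for the actual characterization measures, projection, and algorithm settings used by the authors.

```python
import numpy as np


def meta_features(X):
    """Two illustrative meta-features of a 2-D point set (hypothetical choices)."""
    std = X.std(axis=0)
    # Ratio of the smaller to the larger per-axis standard deviation.
    elongation = std.min() / (std.max() + 1e-12)
    # Absolute Pearson correlation between the two coordinates.
    corr = abs(np.nan_to_num(np.corrcoef(X[:, 0], X[:, 1])[0, 1]))
    return np.array([elongation, corr])


def project(X, A):
    """Linear projection of the meta-features into a 2-D instance space."""
    return A @ meta_features(X)


def evolve_dataset(target, A, n_points=60, pop_size=40, generations=150, seed=0):
    """Evolve a dataset whose projection approaches `target` (illustrative GA)."""
    rng = np.random.default_rng(seed)
    pop = [rng.normal(size=(n_points, 2)) for _ in range(pop_size)]

    def error(X):
        # Fitness: Euclidean distance to the target location (lower is better).
        return float(np.linalg.norm(project(X, A) - target))

    history = []
    for _ in range(generations):
        errors = [error(X) for X in pop]
        order = np.argsort(errors)
        pop = [pop[k] for k in order]
        history.append(errors[order[0]])
        # Elitism: the best quarter survives unchanged into the next generation.
        elite = pop[: pop_size // 4]
        children = []
        while len(elite) + len(children) < pop_size:
            i, j = rng.choice(len(elite), size=2, replace=False)
            # Uniform crossover: each point (row) is taken from one of two parents.
            mask = rng.random(n_points) < 0.5
            child = np.where(mask[:, None], elite[i], elite[j])
            # Gaussian mutation applied to every coordinate.
            child = child + rng.normal(scale=0.1, size=child.shape)
            children.append(child)
        pop = elite + children

    best = min(pop, key=error)
    return best, error(best), history


if __name__ == "__main__":
    A = np.eye(2)  # assumed projection; a real one would be fitted to benchmarks
    target = np.array([0.5, 0.6])
    best, err, history = evolve_dataset(target, A)
    print("distance to target:", err)
```

Because the elite individuals are carried over unchanged, the best distance to the target never increases from one generation to the next. How close the search gets depends on the features, the projection, and the mutation scale chosen.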

Supported by FAPESP (grant 2021/06870-3), CNPq and the Australian Research Council (Laureate Fellowship scheme FL140100012).

Author information

Authors and Affiliations

Instituto Tecnológico de Aeronáutica, São José Dos Campos/SP, Brazil
Luiz Henrique dos Santos Fernandes & Ana Carolina Lorena
The University of Melbourne, Melbourne, Australia
Kate Smith-Miles

Corresponding author

Correspondence to Luiz Henrique dos Santos Fernandes.

Editor information

Editors and Affiliations

Federal University of Rio Grande do Norte, Natal, Brazil
João Carlos Xavier-Junior
Federal University of Bahia, Salvador, Brazil
Ricardo Araújo Rios

Cite this paper

dos Santos Fernandes, L.H., Smith-Miles, K., Lorena, A.C. (2022). Generating Diverse Clustering Datasets with Targeted Characteristics. In: Xavier-Junior, J.C., Rios, R.A. (eds) Intelligent Systems. BRACIS 2022. Lecture Notes in Computer Science, vol 13653. Springer, Cham. https://doi.org/10.1007/978-3-031-21686-2_28

DOI: https://doi.org/10.1007/978-3-031-21686-2_28
Published: 19 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21685-5
Online ISBN: 978-3-031-21686-2
eBook Packages: Computer Science, Computer Science (R0)
