ABSTRACT
In this paper, we propose a method for synthesizing datasets with specific characteristics for the clustering task. Specifically, we propose an algorithm that generates a clustering dataset from its meta-feature description. The method is based on an evolutionary algorithm whose crossover and mutation operators are capable of improving candidate datasets in a natural way. We experimentally compared this method with two other approaches to dataset synthesis, using the meta-feature vectors of 247 real-world datasets as inputs. The proposed method outperformed the existing approaches with respect to the Mahalanobis distance between the target meta-feature vectors and the characteristics of the generated datasets.
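As an illustration of the evaluation criterion named above, the following minimal sketch computes the Mahalanobis distance between a target meta-feature vector and the meta-feature vector of a generated dataset. The abstract does not say how the covariance matrix is obtained, so estimating it from a reference collection of meta-feature vectors is an assumption made here. The function names `inverse_covariance` and `mahalanobis` are hypothetical, and the pseudo-inverse is used only as a guard against a singular covariance estimate.

```python
import numpy as np


def inverse_covariance(reference_vectors):
    """Estimate the inverse covariance of meta-feature vectors.

    Assumption: rows are datasets, columns are meta-features.
    The pseudo-inverse keeps the sketch usable if the estimate is singular.
    """
    cov = np.cov(np.asarray(reference_vectors, dtype=float), rowvar=False)
    return np.linalg.pinv(np.atleast_2d(cov))


def mahalanobis(target, candidate, cov_inv):
    """Mahalanobis distance between two meta-feature vectors."""
    diff = np.asarray(target, dtype=float) - np.asarray(candidate, dtype=float)
    squared = float(diff @ cov_inv @ diff)
    # Clamp tiny negative values caused by floating-point round-off.
    return float(np.sqrt(max(squared, 0.0)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical reference collection: 50 datasets, 3 meta-features each.
    reference = rng.normal(size=(50, 3))
    cov_inv = inverse_covariance(reference)
    target = reference[0]
    candidate = reference[1]
    print(mahalanobis(target, candidate, cov_inv))
```

In an evolutionary search of the kind the abstract describes, a distance like this could serve as the fitness to minimize, with lower values meaning the candidate dataset matches the target description more closely.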
Index Terms
- Synthesis of Datasets with Specific Characteristics for the Clustering Problem