Abstract
Clustering is one of the most prominent data analysis techniques for structuring large datasets and producing a human-understandable overview. In this paper, we focus on the case where the data has many categorical attributes and thus cannot be represented faithfully in Euclidean space. We follow the graph-based paradigm and propose a graph-based genetic algorithm for clustering, whose flexibility stems mainly from the possibility of using various kernels. Because our approach parallelizes naturally, we distribute the computations over several CPUs when implementing and testing it. Although the problem is NP-hard, our experiments show that the algorithm scales well on well-clusterable data. We also perform experiments on real medical data.
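To make the general idea concrete, the following is a minimal sketch of graph-based clustering of categorical records with a genetic search. It is not the algorithm of the paper, whose details the abstract does not give. Everything here is an assumption chosen for illustration: the simple-matching kernel `overlap_kernel`, the similarity threshold, the pair-agreement fitness, the uniform crossover, and all function and parameter names.

```python
import random
from functools import partial
from itertools import combinations


def overlap_kernel(a, b):
    """Assumed kernel: fraction of categorical attributes on which two records agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def build_graph(records, kernel=overlap_kernel, threshold=0.5):
    """Link two records by an edge when their kernel value reaches the threshold."""
    return {(i, j) for i, j in combinations(range(len(records)), 2)
            if kernel(records[i], records[j]) >= threshold}


def agreement(labels, edges, n):
    """Assumed fitness: pairs that are linked and clustered together, or unlinked and apart."""
    return sum(((i, j) in edges) == (labels[i] == labels[j])
               for i, j in combinations(range(n), 2))


def evolve(edges, n, k=2, pop_size=20, generations=60, seed=0, map_fn=map):
    """Toy genetic search over labelings; map_fn evaluates the fitness of a population."""
    rng = random.Random(seed)
    fitness = partial(agreement, edges=edges, n=n)
    population = [[rng.randrange(k) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = list(map_fn(fitness, population))
        ranked = [p for _, p in sorted(zip(scores, population),
                                       key=lambda t: t[0], reverse=True)]
        parents = ranked[: pop_size // 2]
        children = [ranked[0]]  # elitism: the best labeling survives unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            child[rng.randrange(n)] = rng.randrange(k)        # point mutation
            children.append(child)
        population = children
    scores = list(map_fn(fitness, population))
    best_score, best = max(zip(scores, population), key=lambda t: t[0])
    return best, best_score
```

Fitness evaluations of different individuals are independent, which is what makes this kind of search easy to distribute. In this sketch, passing the `map` method of a `concurrent.futures.ProcessPoolExecutor` as `map_fn` would spread the evaluations over several processes. A different kernel can be swapped in through the `kernel` argument of `build_graph` without touching the search itself.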
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Buza, K., Buza, A., Kis, P.B. (2011). A Distributed Genetic Algorithm for Graph-Based Clustering. In: Czachórski, T., Kozielski, S., Stańczyk, U. (eds) Man-Machine Interactions 2. Advances in Intelligent and Soft Computing, vol 103. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23169-8_35
DOI: https://doi.org/10.1007/978-3-642-23169-8_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23168-1
Online ISBN: 978-3-642-23169-8
eBook Packages: Engineering, Engineering (R0)