Abstract
Clustering algorithms are well established and widely used for solving data-mining tasks. Every clustering algorithm is composed of several solutions to specific sub-problems in the clustering process. These solutions are linked together within the algorithm, and they define its process and structure. Many of these solutions occur in more than one clustering algorithm. New clustering algorithms mostly combine frequently occurring solutions to typical sub-problems from clustering, as well as from other machine-learning algorithms. The problem is that these solutions are usually integrated into their algorithms, and the original algorithms are not designed to share solutions to sub-problems with other algorithms easily. We propose a way of designing clustering algorithms, and of improving existing ones, based on reusable components. Reusable components are well-documented, frequently occurring solutions to specific sub-problems in a specific area. We first identify reusable components as solutions to characteristic sub-problems in partitioning clustering algorithms, and then identify a generic structure for the design of partitioning clustering algorithms. We analyze several partitioning algorithms (K-means, X-means, MPCK-means, and Kohonen SOM) and identify reusable components in them. We give examples of how new clustering algorithms can be designed from these components.
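The idea of a generic structure with interchangeable components can be illustrated with a minimal sketch. It is not the authors' component catalog or design: the split into an initialization, a distance, and a representative-update component, and every name used (partition_cluster, random_init, first_k_init, squared_euclidean, manhattan, mean_update), are assumptions made for illustration only.

```python
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]

# Hypothetical component signatures (illustrative, not taken from the paper).
Initializer = Callable[[Sequence[Point], int, random.Random], List[Point]]
Distance = Callable[[Point, Point], float]
Updater = Callable[[Sequence[Point]], Point]


def random_init(data: Sequence[Point], k: int, rng: random.Random) -> List[Point]:
    """Initialization component: k distinct data points chosen at random."""
    return rng.sample(list(data), k)


def first_k_init(data: Sequence[Point], k: int, rng: random.Random) -> List[Point]:
    """Initialization component: the first k data points (deterministic)."""
    return list(data[:k])


def squared_euclidean(a: Point, b: Point) -> float:
    """Distance component: squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def manhattan(a: Point, b: Point) -> float:
    """Distance component: city-block distance."""
    return sum(abs(x - y) for x, y in zip(a, b))


def mean_update(members: Sequence[Point]) -> Point:
    """Update component: coordinate-wise mean of the cluster members."""
    n = len(members)
    return tuple(sum(col) / n for col in zip(*members))


def partition_cluster(
    data: Sequence[Point],
    k: int,
    init: Initializer = random_init,
    distance: Distance = squared_euclidean,
    update: Updater = mean_update,
    max_iter: int = 100,
    seed: int = 0,
) -> Tuple[List[Point], List[int]]:
    """Generic partitioning loop: the structure is fixed, the components are swappable."""
    rng = random.Random(seed)
    centroids = init(data, k, rng)
    assignment: List[int] = []
    for _ in range(max_iter):
        # Assign every point to its nearest representative.
        new_assignment = [
            min(range(k), key=lambda j: distance(p, centroids[j])) for p in data
        ]
        # Stop when the assignments no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recompute each representative; an empty cluster keeps its old one.
        for j in range(k):
            members = [p for p, a in zip(data, assignment) if a == j]
            if members:
                centroids[j] = update(members)
    return centroids, assignment
```

With the default components the loop behaves like a basic K-means. Passing distance=manhattan or init=first_k_init changes one sub-problem solution without touching the rest of the algorithm, which is the kind of reuse the abstract argues for.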
Cite this article
Delibašić, B., Kirchner, K., Ruhland, J. et al. Reusable components for partitioning clustering algorithms. Artif Intell Rev 32, 59–75 (2009). https://doi.org/10.1007/s10462-009-9133-6
DOI: https://doi.org/10.1007/s10462-009-9133-6