Abstract
k-means is traditionally viewed as an algorithm for the unsupervised clustering of a heterogeneous population into a number of more homogeneous groups of objects. However, it is not guaranteed to group objects of the same type (class) together. In such cases, some supervision is needed so that objects with the same label are placed in the same cluster. This paper demonstrates how the popular k-means clustering algorithm can be profitably modified for use as a classifier. The output field itself cannot be used in the clustering, but it is used to develop a suitable metric defined on the other fields. The proposed algorithm combines simulated annealing with the modified k-means algorithm. We apply the proposed algorithm to real data sets and compare the output of the resulting classifier with that of C4.5.
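The idea of keeping the output field out of the clustering while using it to shape the metric can be illustrated with a minimal sketch. The abstract does not specify the metric, the objective, or the annealing schedule, so everything below is an assumption made for illustration: a feature-weighted squared Euclidean distance, cluster purity as the score, a geometric cooling schedule, and the hypothetical helper names `assign`, `kmeans`, `purity` and `anneal_weights`. It is not the authors' implementation.

```python
import math
import random
from collections import Counter


def assign(X, centroids, w):
    """Nearest-centroid index per row under a weighted squared Euclidean distance."""
    def d2(x, c):
        return sum(wi * (xi - ci) ** 2 for xi, ci, wi in zip(x, c, w))
    return [min(range(len(centroids)), key=lambda j: d2(x, centroids[j])) for x in X]


def kmeans(X, k, w, iters=20):
    """Plain k-means with feature weights w; the class labels are never used here."""
    step = max(1, len(X) // k)
    # Deterministic initialisation from evenly spaced rows (an assumption).
    centroids = [list(X[i * step]) for i in range(k)]
    for _ in range(iters):
        a = assign(X, centroids, w)
        for j in range(k):
            members = [x for x, aj in zip(X, a) if aj == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign(X, centroids, w)


def purity(assignments, y):
    """Fraction of rows whose label equals the majority label of their cluster."""
    correct = 0
    for j in set(assignments):
        labels = [yi for yi, aj in zip(y, assignments) if aj == j]
        correct += Counter(labels).most_common(1)[0][1]
    return correct / len(y)


def anneal_weights(X, y, k, steps=200, t0=0.5, cooling=0.98, seed=0):
    """Simulated annealing over feature weights; labels only score the metric."""
    rng = random.Random(seed)
    w = [1.0] * len(X[0])
    score = purity(kmeans(X, k, w), y)
    best_w, best = list(w), score
    t = t0
    for _ in range(steps):
        cand = list(w)
        i = rng.randrange(len(cand))
        cand[i] = max(0.0, cand[i] + rng.uniform(-0.5, 0.5))  # perturb one weight
        cand_score = purity(kmeans(X, k, cand), y)
        delta = cand_score - score
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / t):
            w, score = cand, cand_score
            if score > best:
                best_w, best = list(w), score
        t *= cooling  # geometric cooling (an assumption)
    return best_w, best


# Toy data (invented): feature 0 tracks the class, feature 1 is large-scale noise.
X = [[0.0, 10.0], [0.2, -9.0], [0.1, 8.0], [0.3, -10.0],
     [1.0, 9.0], [1.2, -8.0], [1.1, 10.0], [0.9, -9.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

baseline = purity(kmeans(X, 2, [1.0, 1.0]), y)
weights, score = anneal_weights(X, y, k=2)
print("uniform-weight purity:", baseline)
print("annealed weights:", weights, "purity:", score)
```

Because the search starts from uniform weights and tracks the best weights seen, the returned purity is never lower than that of unweighted k-means on the same data. A classifier in this spirit would then label a new object with the majority class of its nearest centroid.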
About this article
Cite this article
Al-Harbi, S.H., Rayward-Smith, V.J. Adapting k-means for supervised clustering. Appl Intell 24, 219–226 (2006). https://doi.org/10.1007/s10489-006-8513-8
DOI: https://doi.org/10.1007/s10489-006-8513-8