Adapting k-means for supervised clustering

Applied Intelligence

Abstract

k-means is traditionally viewed as an algorithm for the unsupervised clustering of a heterogeneous population into a number of more homogeneous groups of objects. However, it is not guaranteed to group objects of the same type (class) together. In such cases, some supervision is needed so that objects with the same label are placed in the same cluster. This paper demonstrates how the popular k-means clustering algorithm can be profitably modified for use as a classifier. The output field itself cannot be used in the clustering, but it is used to develop a suitable metric defined on the other fields. The proposed algorithm combines Simulated Annealing with the modified k-means algorithm. We apply the proposed algorithm to real data sets and compare the output of the resultant classifier with that of C4.5.
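
The idea summarised in the abstract can be illustrated with a minimal sketch. The abstract does not give the metric, the objective, or the annealing schedule, so everything below is an assumption made for illustration: a weighted squared Euclidean distance with one weight per input field, cluster purity against the class labels as the objective, and a geometric cooling schedule. The names `weighted_dist`, `kmeans`, `purity`, and `anneal_weights` are hypothetical and are not taken from the paper. The class labels are never passed to the clustering step; they are used only to score a candidate set of weights.

```python
import math
import random
from collections import Counter


def weighted_dist(a, b, w):
    """Weighted squared Euclidean distance (assumed metric); one weight per field."""
    return sum(wi * (ai - bi) ** 2 for ai, bi, wi in zip(a, b, w))


def kmeans(points, k, w, iters=20):
    """Plain k-means under the weighted metric, seeded with the first k points."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid under the current weights.
        assign = [min(range(k), key=lambda c: weighted_dist(p, centroids[c], w))
                  for p in points]
        # Update step: mean of each cluster's members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def purity(assign, labels, k):
    """Fraction of points whose label equals the majority label of their cluster."""
    correct = 0
    for c in range(k):
        in_cluster = [lab for a, lab in zip(assign, labels) if a == c]
        if in_cluster:
            correct += Counter(in_cluster).most_common(1)[0][1]
    return correct / len(labels)


def anneal_weights(points, labels, k, steps=200, t0=1.0, cooling=0.97, seed=0):
    """Simulated annealing over field weights; labels only score the clustering."""
    rng = random.Random(seed)
    w = [1.0] * len(points[0])
    score = purity(kmeans(points, k, w), labels, k)
    best_w, best_score = list(w), score
    t = t0
    for _ in range(steps):
        # Neighbour move: perturb one weight, clamped at zero.
        cand = list(w)
        i = rng.randrange(len(cand))
        cand[i] = max(0.0, cand[i] + rng.uniform(-0.5, 0.5))
        cand_score = purity(kmeans(points, k, cand), labels, k)
        delta = cand_score - score
        # Accept improvements always, and worse moves with probability exp(delta / t).
        if delta >= 0 or rng.random() < math.exp(delta / t):
            w, score = cand, cand_score
            if score > best_score:
                best_w, best_score = list(w), score
        t *= cooling  # geometric cooling (assumed schedule)
    return best_w, best_score
```

Calling `anneal_weights(points, labels, k)` with a list of numeric tuples and a parallel list of class labels returns the best weights found and their purity on the training data. A new object could then be classified by assigning it to the nearest centroid under those weights and reporting that cluster's majority label, although this sketch stops short of that step.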

Author information

Corresponding author

Correspondence to S. H. Al-Harbi.

About this article

Cite this article

Al-Harbi, S.H., Rayward-Smith, V.J. Adapting k-means for supervised clustering. Appl Intell 24, 219–226 (2006). https://doi.org/10.1007/s10489-006-8513-8

  • DOI: https://doi.org/10.1007/s10489-006-8513-8
