Abstract
The recurrent use of databases with categorical variables in different fields of science demands new approaches to applying cluster analysis techniques to this type of data. For this reason, in this article we compare Matlab's kmeans() function with a K-Means function implemented by us, which integrates a similarity measure that the Matlab function does not have: the chi-square distance. Both algorithms were tested on databases with quantitative and categorical variables. The experimental results showed a higher classification success rate for the function implemented by us, confirming the correct functioning of the implemented algorithm and demonstrating that the chi-square distance is an appropriate similarity measure for categorical databases.
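As an illustration of the idea summarized in the abstract, the following is a minimal Python sketch of a K-Means loop that uses a chi-square distance in place of the Euclidean distance. It is not the authors' Matlab implementation. The abstract does not state the exact formula used, so the sketch assumes one common definition, d(x, y) = 1/2 Σ (x_j − y_j)² / (x_j + y_j), applied to one-hot encoded categorical data. The names `one_hot`, `chi_square_distance`, and `kmeans_chi2` are hypothetical and chosen only for this example.

```python
import random


def one_hot(rows):
    """Encode rows of categorical values as 0/1 indicator vectors."""
    n_cols = len(rows[0])
    # Sorted list of the distinct levels observed in each column.
    levels = [sorted({r[j] for r in rows}) for j in range(n_cols)]
    return [
        [1.0 if r[j] == v else 0.0 for j in range(n_cols) for v in levels[j]]
        for r in rows
    ]


def chi_square_distance(x, y):
    """Assumed definition: 0.5 * sum((x_j - y_j)^2 / (x_j + y_j))."""
    total = 0.0
    for a, b in zip(x, y):
        if a + b > 0:  # skip coordinates where both entries are zero
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total


def kmeans_chi2(points, k, n_iter=100, seed=0):
    """K-Means with chi-square assignment and mean-based centroid update."""
    rng = random.Random(seed)
    # Start from k distinct points so that no two centroids coincide.
    distinct = sorted(set(map(tuple, points)))
    centroids = [list(p) for p in rng.sample(distinct, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centroid under the chi-square distance.
        new_labels = [
            min(range(k), key=lambda c: chi_square_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable, so stop
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy categorical data, invented for this example.
rows = [("red", "small")] * 3 + [("blue", "large")] * 3
labels, centroids = kmeans_chi2(one_hot(rows), k=2)
print(labels)
```

With one-hot encoding, each centroid holds the proportion of cluster members in each category, so every coordinate stays non-negative and the chi-square denominator is well defined.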
Acknowledgments
We would like to thank the Corporacion Instituto de Administracion y Finanzas (CIAF) and the research group of organizations and innovation belonging to the same institution, who supported us in the development and financing of the article.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Ariosto Serna, L., Alejandro Hernández, K., Navarro González, P. (2019). A K-Means Clustering Algorithm: Using the Chi-Square as a Distance. In: Tang, Y., Zu, Q., Rodríguez García, J. (eds) Human Centered Computing. HCC 2018. Lecture Notes in Computer Science, vol 11354. Springer, Cham. https://doi.org/10.1007/978-3-030-15127-0_46
DOI: https://doi.org/10.1007/978-3-030-15127-0_46
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-15126-3
Online ISBN: 978-3-030-15127-0
eBook Packages: Computer Science, Computer Science (R0)