Abstract
Measuring similarity for categorical data is a challenging task in data mining because categorical data are poorly structured. This paper presents a dissimilarity measure for categorical data based on the relations among attributes. The measure not only has the advantage of value variance but also overcomes the limitations of the conditional probability-based measure when applied to databases whose attributes are independent. Experiments with 30 databases also showed that the proposed measure improved the accuracy of Nearest Neighbor classification in comparison with the other tested measures.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Le, S.Q., Ho, T.B., Vinh, L.S. (2006). Association-Based Dissimilarity Measures for Categorical Data: Limitation and Improvement. In: Ng, W.K., Kitsuregawa, M., Li, J., Chang, K. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2006. Lecture Notes in Computer Science, vol 3918. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11731139_57
DOI: https://doi.org/10.1007/11731139_57
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33206-0
Online ISBN: 978-3-540-33207-7
eBook Packages: Computer Science, Computer Science (R0)