Association-Based Dissimilarity Measures for Categorical Data: Limitation and Improvement

Le, Si Quang; Ho, Tu Bao; Vinh, Le Sy

doi:10.1007/11731139_57

Si Quang Le, Tu Bao Ho & Le Sy Vinh

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3918)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Measuring similarity for categorical data is a challenging task in data mining because of the poor structure of categorical data. This paper presents a dissimilarity measure for categorical data based on the relations among attributes. The measure not only has the advantage of value variance but also overcomes the limitations of the conditional probability-based measure when applied to databases whose attributes are independent. Experiments with 30 databases also showed that the proposed measure improved the accuracy of Nearest Neighbor classification in comparison with the other tested measures.
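
To illustrate the evaluation setting described in the abstract, the following is a minimal sketch of Nearest Neighbor classification over categorical records with a pluggable dissimilarity function. It is not the measure proposed in the paper: the abstract does not give its formula, so a plain simple-matching (overlap) dissimilarity is used as a stand-in. The names `overlap_dissimilarity` and `nearest_neighbor_predict`, and the toy data, are hypothetical.

```python
from typing import Callable, Hashable, Sequence

Record = Sequence[Hashable]


def overlap_dissimilarity(x: Record, y: Record) -> float:
    """Fraction of attributes on which two records disagree.

    Stand-in only (simple matching); NOT the association-based
    measure from the paper.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("records must be non-empty and of equal length")
    return sum(a != b for a, b in zip(x, y)) / len(x)


def nearest_neighbor_predict(
    train: Sequence[Record],
    labels: Sequence[Hashable],
    query: Record,
    dissimilarity: Callable[[Record, Record], float] = overlap_dissimilarity,
) -> Hashable:
    """Return the label of the training record least dissimilar to `query`."""
    if len(train) == 0 or len(train) != len(labels):
        raise ValueError("train and labels must be non-empty and aligned")
    # Ties are broken in favour of the earliest training record.
    best = min(range(len(train)), key=lambda i: dissimilarity(train[i], query))
    return labels[best]


# Hypothetical toy data: (colour, size, shape) -> label
train = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("green", "large", "long"),
]
labels = ["apple", "apple", "cucumber"]
prediction = nearest_neighbor_predict(train, labels, ("green", "small", "long"))
```

Because the dissimilarity is passed in as a function, a different categorical measure could be swapped in and compared under the same classifier, which is the kind of comparison the abstract reports.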

Author information

Authors and Affiliations

Japan Advanced Institute of Science and Technology, Tatsunokuchi, Ishikawa, 923-1292, Japan
Si Quang Le & Tu Bao Ho
LIRMM, Montpellier Cedex 5, France
Si Quang Le
John von Neumann Institute for Computing, Juelich, Germany
Le Sy Vinh
American Museum of Natural History, New York, USA
Le Sy Vinh

Editor information

Editors and Affiliations

Nanyang Technological University, Singapore
Wee-Keong Ng
Institute of Industrial Science, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, 153-8505, Tokyo, Japan
Masaru Kitsuregawa
School of Computer Science and Technology, Heilongjiang University, China
Jianzhong Li
School of Computer Engineering, Nanyang Technological University, 639798, Singapore, Singapore
Kuiyu Chang

About this paper

Cite this paper

Le, S.Q., Ho, T.B., Vinh, L.S. (2006). Association-Based Dissimilarity Measures for Categorical Data: Limitation and Improvement. In: Ng, WK., Kitsuregawa, M., Li, J., Chang, K. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2006. Lecture Notes in Computer Science, vol 3918. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11731139_57

DOI: https://doi.org/10.1007/11731139_57
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33206-0
Online ISBN: 978-3-540-33207-7
eBook Packages: Computer Science, Computer Science (R0)
