Abstract
Measuring the similarity between objects described by categorical attributes is difficult because relations between categorical values cannot be specified mathematically or established easily. Most similarity (dissimilarity) measures for categorical data in the literature compare a pair of values only by checking whether the two values are identical, so the similarity (dissimilarity) of a non-identical value pair is simply set to 0 (1). In this paper, we introduce a dissimilarity measure for categorical data that imposes association relations between non-identical value pairs of an attribute based on their relations with other attributes. The key idea is to measure the similarity between two values of a categorical attribute by the similarity of the conditional probability distributions of the other attributes conditioned on these two values. Experiments with a nearest neighbor algorithm demonstrate the merits of our proposal on real-life data sets.
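The key idea can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so several choices here are assumptions rather than the authors' method: a symmetrised Kullback-Leibler divergence compares the conditional distributions, additive smoothing (the hypothetical parameter `alpha`) keeps the divergence finite, and the per-attribute terms are summed without weights. The function names and the toy data are illustrative only.

```python
from collections import Counter
from math import log


def conditional_distribution(rows, attr, value, other, domain, alpha=1.0):
    """Estimate P(other | attr == value) over `domain` with additive smoothing.

    `alpha` must be positive so that every probability is non-zero.
    """
    counts = Counter(r[other] for r in rows if r[attr] == value)
    total = sum(counts.values()) + alpha * len(domain)
    return {v: (counts[v] + alpha) / total for v in domain}


def symmetric_kl(p, q):
    """Symmetrised KL divergence, KL(p||q) + KL(q||p) (an assumed choice)."""
    return sum(p[v] * log(p[v] / q[v]) + q[v] * log(q[v] / p[v]) for v in p)


def value_dissimilarity(rows, attr, a, b, alpha=1.0):
    """Dissimilarity of values a and b of attribute `attr`.

    Sums, over every other attribute, the divergence between the
    distributions of that attribute conditioned on a and on b.
    """
    if a == b:
        return 0.0
    total = 0.0
    for other in range(len(rows[0])):
        if other == attr:
            continue
        domain = sorted({r[other] for r in rows})
        p = conditional_distribution(rows, attr, a, other, domain, alpha)
        q = conditional_distribution(rows, attr, b, other, domain, alpha)
        total += symmetric_kl(p, q)
    return total


def object_dissimilarity(rows, x, y, alpha=1.0):
    """Dissimilarity of two objects: sum of per-attribute value dissimilarities."""
    return sum(value_dissimilarity(rows, i, x[i], y[i], alpha)
               for i in range(len(x)))


# Toy data set (hypothetical): columns are colour, shape, size.
rows = [
    ("red", "round", "small"),
    ("red", "round", "small"),
    ("crimson", "round", "small"),
    ("crimson", "round", "small"),
    ("blue", "square", "large"),
    ("blue", "square", "large"),
]

d_close = value_dissimilarity(rows, 0, "red", "crimson")
d_far = value_dissimilarity(rows, 0, "red", "blue")
```

In this toy data, "red" and "crimson" co-occur with the same shapes and sizes, so they come out closer to each other than "red" and "blue", whereas a simple matching measure would treat both pairs as equally dissimilar.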
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Quang, L.S., Bao, H.T. (2004). A Conditional Probability Distribution-Based Dissimilarity Measure for Categorial Data. In: Dai, H., Srikant, R., Zhang, C. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2004. Lecture Notes in Computer Science, vol 3056. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24775-3_69
DOI: https://doi.org/10.1007/978-3-540-24775-3_69
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22064-0
Online ISBN: 978-3-540-24775-3
eBook Packages: Springer Book Archive