A Conditional Probability Distribution-Based Dissimilarity Measure for Categorial Data

Quang, Le Si; Bao, Ho Tu

doi:10.1007/978-3-540-24775-3_69

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3056)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Measuring the similarity between objects described by categorical attributes is difficult because relations between categorical values cannot be specified mathematically or established easily. Most similarity (dissimilarity) measures for categorical data in the literature compare a pair of values only by checking whether the two values are identical, so the similarity (dissimilarity) of a non-identical value pair is simply taken to be 0 (1). In this paper, we introduce a dissimilarity measure for categorical data that imposes association relations between non-identical value pairs of an attribute based on their relations with other attributes. The key idea is to measure the similarity between two values of a categorical attribute by the similarity of the conditional probability distributions of the other attributes conditioned on these two values. Experiments with a nearest neighbor algorithm demonstrate the merits of our proposal on real-life data sets.
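The key idea can be illustrated with a small sketch. The abstract does not give the paper's exact formula, so everything specific here is an assumption made for illustration: the symmetrised Kullback-Leibler divergence as the comparison between distributions, add-one smoothing to avoid zero probabilities, an unweighted sum over the other attributes, and all function names and the toy data.

```python
from collections import Counter
from math import log


def conditional_distribution(rows, cond_attr, cond_value, target_attr, smoothing=1.0):
    """Estimate P(target_attr | cond_attr == cond_value) with additive smoothing.

    The smoothing constant is an assumption of this sketch, not taken from the paper.
    """
    # Domain of the target attribute, collected over the whole data set.
    domain = sorted({row[target_attr] for row in rows})
    counts = Counter(row[target_attr] for row in rows if row[cond_attr] == cond_value)
    total = sum(counts.values()) + smoothing * len(domain)
    return {v: (counts[v] + smoothing) / total for v in domain}


def symmetric_kl(p, q):
    """KL(p||q) + KL(q||p) for two distributions over the same domain."""
    return sum((p[v] - q[v]) * log(p[v] / q[v]) for v in p)


def value_dissimilarity(rows, attr, x, y, smoothing=1.0):
    """Dissimilarity of two values of one attribute (assumed form).

    Compares the conditional distributions of every other attribute
    given attr == x and given attr == y, and sums the divergences.
    """
    if x == y:
        return 0.0
    n_attrs = len(rows[0])
    return sum(
        symmetric_kl(
            conditional_distribution(rows, attr, x, other, smoothing),
            conditional_distribution(rows, attr, y, other, smoothing),
        )
        for other in range(n_attrs)
        if other != attr
    )


def object_dissimilarity(rows, a, b, smoothing=1.0):
    """Sum of per-attribute value dissimilarities between objects a and b."""
    return sum(value_dissimilarity(rows, i, a[i], b[i], smoothing) for i in range(len(a)))


# Hypothetical toy data: (color, shape, size).
rows = [
    ("red", "round", "small"),
    ("red", "round", "small"),
    ("crimson", "round", "small"),
    ("crimson", "round", "large"),
    ("blue", "square", "large"),
    ("blue", "square", "large"),
]

d_red_crimson = value_dissimilarity(rows, 0, "red", "crimson")
d_red_blue = value_dissimilarity(rows, 0, "red", "blue")
print(d_red_crimson, d_red_blue)
```

In this toy data, "red" and "crimson" co-occur with similar shapes and sizes, so their dissimilarity comes out smaller than that of "red" and "blue", whereas a plain identical-or-not comparison would score both pairs the same.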

Author information

Authors and Affiliations

Japan Advanced Institute of Science and Technology, School of Knowledge science, Tatsunokuchi, Ishikawa, 923-1292, Japan
Le Si Quang & Ho Tu Bao

Editor information

Editors and Affiliations

School of Engineering and Information Technology, Deakin University, VIC 3125, Australia
Honghua Dai
University of Illinois at Urbana-Champaign, 61801, Urbana, IL, USA
Ramakrishnan Srikant
Faculty of Engineering and Information Technology, Centre for Quantum Computation and Intelligent Systems, and Australian ACS National Committee for Artificial Intelligence, University of Technology, Sydney, Australia
Chengqi Zhang

Cite this paper

Quang, L.S., Bao, H.T. (2004). A Conditional Probability Distribution-Based Dissimilarity Measure for Categorial Data. In: Dai, H., Srikant, R., Zhang, C. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2004. Lecture Notes in Computer Science, vol 3056. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24775-3_69

DOI: https://doi.org/10.1007/978-3-540-24775-3_69
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22064-0
Online ISBN: 978-3-540-24775-3
eBook Packages: Springer Book Archive
