A New Context-Based Similarity Measure for Categorical Data Using Information Theory

Nguyen, Thanh-Phu; Ryoke, Mina; Huynh, Van-Nam

doi:10.1007/978-3-319-75429-1_10

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10758)

Included in the following conference series:

International Symposium on Integrated Uncertainty in Knowledge Modelling and Decision Making

Abstract

Similarity is a common notion in many fields, including machine learning and data mining. For numerical data, similarity measures are relatively straightforward because well-established metrics are defined in numerical space. For categorical data, however, measures that quantify resemblance are still not well understood. In this research, we propose a new similarity measure, based on an information-theoretic approach, that can integrate context information into the quantification of similarity between categorical data. An evaluation experiment on a classification task shows that the proposed measure is competitive with current state-of-the-art similarity measures.
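
To make the idea of a context-based, information-theoretic similarity concrete, here is a minimal sketch in Python. It is not the measure proposed in the paper, whose definition is not given in the abstract. It assumes a generic scheme in which two values of one attribute are considered similar when they induce similar conditional distributions over a second ("context") attribute, with the comparison done by Jensen-Shannon divergence. The function names, the toy records, and the choice of divergence are all illustrative assumptions.

```python
from collections import Counter, defaultdict
from math import log2


def conditional_distributions(rows, target, context):
    """Estimate P(context value | target value) from a list of dict records."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[target]][row[context]] += 1
    return {
        value: {c: n / sum(ctr.values()) for c, n in ctr.items()}
        for value, ctr in counts.items()
    }


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions.

    With base-2 logarithms the result lies in [0, 1].
    """
    keys = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the union of supports.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a):
        # KL(a || m); terms with a[k] == 0 contribute nothing.
        return sum(a[k] * log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


def context_similarity(rows, target, context, x, y):
    """Similarity of values x and y of `target`, judged through `context`.

    Hypothetical illustration: 1 minus the Jensen-Shannon divergence of the
    two conditional distributions of the context attribute.
    """
    dists = conditional_distributions(rows, target, context)
    return 1.0 - js_divergence(dists[x], dists[y])


# Toy records (made up for illustration only).
rows = [
    {"color": "red", "shape": "circle"},
    {"color": "red", "shape": "circle"},
    {"color": "crimson", "shape": "circle"},
    {"color": "blue", "shape": "square"},
]

sim_red_crimson = context_similarity(rows, "color", "shape", "red", "crimson")
sim_red_blue = context_similarity(rows, "color", "shape", "red", "blue")
```

In this toy data, "red" and "crimson" always co-occur with the same shape, so their similarity is 1.0, whereas "red" and "blue" never share a shape, so their similarity is 0.0. A plain overlap measure would score both pairs as equally dissimilar, which is the gap that context information is meant to close.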

Acknowledgment

This paper is based upon work supported in part by the Air Force Office of Scientific Research/Asian Office of Aerospace Research and Development (AFOSR/AOARD) under award number FA2386-17-1-4046.

Author information

Authors and Affiliations

School of Knowledge Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa, 923-1292, Japan
Thanh-Phu Nguyen & Van-Nam Huynh
Graduate School of Business Sciences, University of Tsukuba, Tokyo, Japan
Mina Ryoke

Corresponding author

Correspondence to Thanh-Phu Nguyen.

Editor information

Editors and Affiliations

Japan Advanced Institute of Science and Technology, Nomi, Japan
Van-Nam Huynh
Osaka University, Osaka, Japan
Masahiro Inuiguchi
Hanoi National University of Education, Hanoi, Vietnam
Dang Hung Tran
Université de Technologie de Compiègne, Compiègne, France
Thierry Denoeux

About this paper

Cite this paper

Nguyen, TP., Ryoke, M., Huynh, VN. (2018). A New Context-Based Similarity Measure for Categorical Data Using Information Theory. In: Huynh, VN., Inuiguchi, M., Tran, D., Denoeux, T. (eds) Integrated Uncertainty in Knowledge Modelling and Decision Making. IUKM 2018. Lecture Notes in Computer Science, vol 10758. Springer, Cham. https://doi.org/10.1007/978-3-319-75429-1_10

DOI: https://doi.org/10.1007/978-3-319-75429-1_10
Published: 04 February 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75428-4
Online ISBN: 978-3-319-75429-1
eBook Packages: Computer Science, Computer Science (R0)
