Abstract
Similarity is a common notion in many fields, including machine learning and data mining. For numerical data, similarity measures are relatively straightforward because well-defined metrics exist in numerical space. For categorical data, however, how to quantify resemblance is still not well understood. In this research, we propose a new similarity measure based on an information-theoretic approach that integrates context information into the quantification of similarity between categorical data. An evaluation experiment on a classification task shows that the proposed measure is competitive with current state-of-the-art similarity measures.
Acknowledgment
This paper is based upon work supported in part by the Air Force Office of Scientific Research/Asian Office of Aerospace Research and Development (AFOSR/AOARD) under award number FA2386-17-1-4046.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Nguyen, TP., Ryoke, M., Huynh, VN. (2018). A New Context-Based Similarity Measure for Categorical Data Using Information Theory. In: Huynh, VN., Inuiguchi, M., Tran, D., Denoeux, T. (eds) Integrated Uncertainty in Knowledge Modelling and Decision Making. IUKM 2018. Lecture Notes in Computer Science, vol 10758. Springer, Cham. https://doi.org/10.1007/978-3-319-75429-1_10
DOI: https://doi.org/10.1007/978-3-319-75429-1_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75428-4
Online ISBN: 978-3-319-75429-1
eBook Packages: Computer Science, Computer Science (R0)