A New Context-Based Similarity Measure for Categorical Data Using Information Theory

  • Conference paper
Integrated Uncertainty in Knowledge Modelling and Decision Making (IUKM 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10758)


Abstract

Similarity is a common notion in many fields, including machine learning and data mining. For numerical data, similarity measures are relatively straightforward because well-defined metrics exist in numerical space. For categorical data, however, measures that quantify resemblance are still not well understood. In this research, we propose a new similarity measure, based on an information-theoretic approach, that integrates context information into the quantification of similarity between categorical data. An evaluation experiment on a classification task shows that the proposed measure is competitive with current state-of-the-art similarity measures.
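To make the idea of a context-based, information-theoretic similarity concrete, the sketch below compares two values of one categorical attribute by looking at how a second (context) attribute is distributed when each value occurs. It is not the measure proposed in this paper, whose definition is not given in the abstract. It is an illustrative stand-in that uses one minus the Jensen-Shannon divergence between the two conditional distributions. The function names, the attribute names, and the toy rows are all hypothetical.

```python
from collections import Counter
from math import log2


def conditional_distribution(rows, target, context, value):
    """Empirical P(context | target == value), returned as a dict."""
    counts = Counter(r[context] for r in rows if r[target] == value)
    total = sum(counts.values())
    if total == 0:
        raise ValueError(f"value {value!r} never occurs in attribute {target!r}")
    return {k: c / total for k, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a):
        # KL(a || m); terms with a[k] == 0 contribute nothing.
        return sum(a[k] * log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


def context_similarity(rows, target, context, x, y):
    """Illustrative similarity in [0, 1]: 1 - JSD of the context distributions.

    Assumption: a single context attribute is used. A real measure would
    typically combine several context attributes.
    """
    p = conditional_distribution(rows, target, context, x)
    q = conditional_distribution(rows, target, context, y)
    return 1.0 - js_divergence(p, q)


# Hypothetical toy data: "color" is the attribute being compared,
# "shape" is its context.
toy_rows = [
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "round"},
    {"color": "crimson", "shape": "round"},
    {"color": "crimson", "shape": "round"},
    {"color": "blue", "shape": "square"},
    {"color": "blue", "shape": "square"},
]

if __name__ == "__main__":
    # Values that co-occur with the same context are maximally similar.
    print(context_similarity(toy_rows, "color", "shape", "red", "crimson"))  # 1.0
    # Values with disjoint contexts are maximally dissimilar.
    print(context_similarity(toy_rows, "color", "shape", "red", "blue"))  # 0.0
```

A plain overlap measure would score "red" and "crimson" as completely different because the labels do not match. Under the assumptions above, the context attribute is what lets the sketch treat them as alike.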



Acknowledgment

This paper is based upon work supported in part by the Air Force Office of Scientific Research/Asian Office of Aerospace Research and Development (AFOSR/AOARD) under award number FA2386-17-1-4046.

Corresponding author

Correspondence to Thanh-Phu Nguyen.


Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper


Cite this paper

Nguyen, TP., Ryoke, M., Huynh, VN. (2018). A New Context-Based Similarity Measure for Categorical Data Using Information Theory. In: Huynh, VN., Inuiguchi, M., Tran, D., Denoeux, T. (eds) Integrated Uncertainty in Knowledge Modelling and Decision Making. IUKM 2018. Lecture Notes in Computer Science, vol 10758. Springer, Cham. https://doi.org/10.1007/978-3-319-75429-1_10


  • DOI: https://doi.org/10.1007/978-3-319-75429-1_10


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-75428-4

  • Online ISBN: 978-3-319-75429-1

  • eBook Packages: Computer Science, Computer Science (R0)
