DISC: Data-Intensive Similarity Measure for Categorical Data

Desai, Aditya; Singh, Himanshu; Pudi, Vikram

doi:10.1007/978-3-642-20847-8_39

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6635)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

The concept of similarity is fundamentally important in almost every scientific field. Clustering, distance-based outlier detection, classification, regression, and search are major data mining techniques that compute similarities between instances, so the choice of similarity measure can be a major cause of an algorithm's success or failure. The notion of similarity or distance is not as straightforward for categorical data as for continuous data, and is therefore a major challenge: the values taken by a categorical attribute are not inherently ordered, so two categorical values cannot be compared directly. In addition, the notion of similarity can differ depending on the particular domain, dataset, or task at hand. In this paper we present a new similarity measure for categorical data, DISC (Data-Intensive Similarity Measure for Categorical Data). DISC captures the semantics of the data without any help from a domain expert in defining the similarity. It is also generic and simple to implement. These features make it an attractive alternative to existing approaches. Our experimental study compares it with 14 other similarity measures on 24 standard real datasets, of which 12 are used for classification and 12 for regression, and shows that it is more accurate than all its competitors.
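
To make the difficulty with unordered values concrete, the following minimal sketch implements the simple overlap measure, a common baseline for categorical data. It is not the DISC measure, which the abstract does not specify. The function name `overlap_similarity` and the example records are hypothetical and chosen purely for illustration.

```python
from typing import Sequence


def overlap_similarity(x: Sequence[str], y: Sequence[str]) -> float:
    """Fraction of attributes on which two categorical records agree.

    Baseline overlap measure only; this is NOT the DISC measure.
    """
    if len(x) != len(y):
        raise ValueError("records must have the same number of attributes")
    if not x:
        raise ValueError("records must have at least one attribute")
    # Each attribute contributes 1 on an exact match and 0 otherwise.
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


# Hypothetical records with attributes (colour, shape, size).
r1 = ("red", "round", "small")
r2 = ("red", "square", "small")
r3 = ("crimson", "round", "small")

# Both pairs agree on two of three attributes, so they score the same:
# the measure cannot tell that "crimson" is closer to "red" than
# "square" is to "round".
print(overlap_similarity(r1, r2))
print(overlap_similarity(r1, r3))
```

Because every mismatch counts equally, a measure of this kind ignores any semantic closeness between values, which is the gap a data-driven measure aims to fill.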

Author information

Authors and Affiliations

International Institute of Information Technology-Hyderabad, Hyderabad, India
Aditya Desai, Himanshu Singh & Vikram Pudi

Editor information

Editors and Affiliations

Shenzhen Institutes of Advanced Technology (SIAT), Chinese Academy of Sciences, 518055, Shenzhen, China
Joshua Zhexue Huang
Faculty of Engineering and Information Technology, Center for Quantum Computation and Intelligent Systems, Data Sciences and Knowledge Discovery Lab, University of Technology Sydney, 2007, Sydney, NSW, Australia
Longbing Cao
Department of Computer Science and Engineering, University of Minnesota, 55455, Minneapolis, MN, USA
Jaideep Srivastava

About this paper

Cite this paper

Desai, A., Singh, H., Pudi, V. (2011). DISC: Data-Intensive Similarity Measure for Categorical Data. In: Huang, J.Z., Cao, L., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2011. Lecture Notes in Computer Science, vol 6635. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20847-8_39

DOI: https://doi.org/10.1007/978-3-642-20847-8_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20846-1
Online ISBN: 978-3-642-20847-8
eBook Packages: Computer Science, Computer Science (R0)
