Skip to main content

DISC: Data-Intensive Similarity Measure for Categorical Data

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2011)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 6635))

Included in the following conference series:

Abstract

The concept of similarity is fundamentally important in almost every scientific field. Clustering, distance-based outlier detection, classification, regression and search are major data mining techniques which compute the similarities between instances and hence the choice of a particular similarity measure can turn out to be a major cause of success or failure of the algorithm. The notion of similarity or distance for categorical data is not as straightforward as for continuous data and hence, is a major challenge. This is due to the fact that different values taken by a categorical attribute are not inherently ordered and hence a notion of direct comparison between two categorical values is not possible. In addition, the notion of similarity can differ depending on the particular domain, dataset, or task at hand. In this paper we present a new similarity measure for categorical data DISC - Data-Intensive Similarity Measure for Categorical Data. DISC captures the semantics of the data without any help from domain expert for defining the similarity. In addition to these, it is generic and simple to implement. These desirable features make it a very attractive alternative to existing approaches. Our experimental study compares it with 14 other similarity measures on 24 standard real datasets, out of which 12 are used for classification and 12 for regression, and shows that it is more accurate than all its competitors.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Boriah, S., Chandola, V., Kumar, V.: Similarity Measures for Categorical Data: A Comparative Evaluation. In: Proceedings of SDM 2008. SIAM, Atlanta (2008)

    Google Scholar 

  2. Sneath, P.H.A., Sokal, R.R.: Numerical Taxonomy: The Principles and Practice of Numerical Classification. W. H. Freeman and Company, San Francisco (1973)

    MATH  Google Scholar 

  3. Anderberg, M.R.: Cluster Analysis for Applications. Academic Press, London (1973)

    MATH  Google Scholar 

  4. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs (1988)

    MATH  Google Scholar 

  5. Hartigan, J.A.: Clustering Algorithms. John Wiley & Sons, New York (1975)

    MATH  Google Scholar 

  6. Wilson, D.R., Martinez, T.R.: Improved heterogeneous distance functions. J. Artif. Intell. Res. (JAIR) 6, 1–34 (1997)

    MATH  Google Scholar 

  7. Biberman, Y.: A context similarity measure. In: Bergadano, F., De Raedt, L. (eds.) ECML 1994. LNCS, vol. 784, pp. 49–63. Springer, Heidelberg (1994)

    Chapter  Google Scholar 

  8. Das, G., Mannila, H.: Context-based similarity measures for categorical databases. In: Zighed, D.A., Komorowski, J., Żytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 201–210. Springer, Heidelberg (2000)

    Chapter  Google Scholar 

  9. Palmer, C.R., Faloutsos, C.: Electricity based external similarity of categorical attributes. In: Whang, K.-Y., Jeon, J., Shim, K., Srivastava, J. (eds.) PAKDD 2003. LNCS (LNAI), vol. 2637, pp. 486–500. Springer, Heidelberg (2003)

    Google Scholar 

  10. Huang, Z.: Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery 2(3), 283–304 (1998)

    Article  Google Scholar 

  11. Ganti, V., Gehrke, J., Ramakrishnan, R.: CACTUS–clustering categorical data using summaries. In: KDD 1999. ACM Press, New York (1999)

    Google Scholar 

  12. Jones, W.P., Furnas, G.W.: Pictures of relevance: a geometric analysis of similarity measures. J. Am. Soc. Inf. Sci. 38(6), 420–442 (1987)

    Article  Google Scholar 

  13. Noreault, T., McGill, M., Koll, M.B.: A performance evaluation of similarity measures, document term weighting schemes and representations in a boolean environment. In: SIGIR 1980: Proceedings of the 3rd Annual ACM Conference on Research and Development in Information Retrieval, Kent, UK, pp. 57–76. Butterworth & Co. (1981)

    Google Scholar 

  14. Zwick, R., Carlstein, E., Budescu, D.V.: Measures of similarity among fuzzy concepts: A comparative analysis. International Journal of Approximate Reasoning 1(2), 221–242 (1987)

    Article  Google Scholar 

  15. Pappis, C.P., Karacapilidis, N.I.: A comparative assessment of measures of similarity of fuzzy values. Fuzzy Sets and Systems 56(2), 171–174 (1993)

    Article  MATH  Google Scholar 

  16. Wang, X., De Baets, B., Kerre, E.: A comparative study of similarity measures. Fuzzy Sets and Systems 73(2), 259–268 (1995)

    Article  MATH  Google Scholar 

  17. Gibson, D., Kleinberg, J.M., Raghavan, P.: Clustering categorical data: An approach based on dynamical systems. VLDB Journal 8(34), 222–236 (2000)

    Article  Google Scholar 

  18. Guha, S., Rastogi, R., Shim, K.: ROCK–a robust clusering algorith for categorical attributes. In: Proceedings of IEEE International Conference on Data Engineering (1999)

    Google Scholar 

  19. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)

    MATH  Google Scholar 

  20. Fayyad, U.M., Irani, K.B.: On the handling of continuous-valued attributes in decision tree generation. Machine Learning 8, 87–102 (1992)

    MATH  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Desai, A., Singh, H., Pudi, V. (2011). DISC: Data-Intensive Similarity Measure for Categorical Data. In: Huang, J.Z., Cao, L., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2011. Lecture Notes in Computer Science(), vol 6635. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20847-8_39

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-20847-8_39

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20846-1

  • Online ISBN: 978-3-642-20847-8

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics