Categorical Data Clustering

Synonyms

Clustering of nonnumerical data; Grouping

Definition

Data clustering is informally defined as the problem of partitioning a set of objects into groups, such that the objects in the same group are similar, while the objects in different groups are dissimilar. Categorical data clustering refers to the case where the data objects are defined over categorical attributes. A categorical attribute is an attribute whose domain is a set of discrete values that are not inherently comparable. That is, there is no single ordering or inherent distance function for the categorical values, and there is no mapping from categorical to numerical values that is semantically meaningful.
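
As a rough illustration of the definition above, the sketch below compares objects defined over categorical attributes. Because the values have no ordering or numeric meaning, the only comparison it uses is equality: it counts the attributes on which two objects disagree (simple matching dissimilarity). The toy records, the attribute names, and the choice of this particular measure are assumptions made for the example; they are not prescribed by the entry, and other measures may suit a given data set better.

```python
# Hypothetical toy records over three categorical attributes
# (director, genre, country). The values are illustrative only.
movies = [
    ("Scorsese", "Crime", "USA"),
    ("Coppola", "Crime", "USA"),
    ("Hitchcock", "Thriller", "UK"),
    ("Hitchcock", "Thriller", "USA"),
]


def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records disagree.

    Only equality tests are used, since categorical values have no
    inherent ordering or distance. This is one simple choice of
    dissimilarity, assumed here for illustration.
    """
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    # Print the dissimilarity of every pair of records.
    for i in range(len(movies)):
        for j in range(i + 1, len(movies)):
            d = matching_dissimilarity(movies[i], movies[j])
            print(movies[i], movies[j], d)
```

Under this measure the two crime films differ on one attribute, while a crime film and a UK thriller differ on all three, which matches the informal goal of grouping similar objects together.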

Motivation and Background

Clustering is a problem of great practical importance that has been the focus of substantial research in several domains for decades. As storage capacities grow, ever larger amounts of data become available for analysis and mining. Clustering plays an instrumental role in this...

Copyright information

© 2011 Springer Science+Business Media, LLC

About this entry

Cite this entry

Andritsos, P., Tsaparas, P. (2011). Categorical Data Clustering. In: Sammut, C., Webb, G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_99
