Integrative Parameter-Free Clustering of Data with Mixed Type Attributes

Böhm, Christian; Goebl, Sebastian; Oswald, Annahita; Plant, Claudia; Plavinski, Michael; Wackersreuther, Bianca

doi:10.1007/978-3-642-13657-3_7

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6118)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Integrative mining of heterogeneous data is one of the major challenges for data mining in the next decade. We address the problem of integrative clustering of data with mixed-type attributes. Most existing solutions suffer from one or both of the following drawbacks: they require input parameters that are difficult to estimate, or they do not adequately support mixed-type attributes. Our technique INTEGRATE is a novel clustering approach that truly integrates the information provided by heterogeneous numerical and categorical attributes. Originating from information theory, the Minimum Description Length (MDL) principle allows a unified view of numerical and categorical information and thus naturally balances the influence of both sources of information in clustering. Moreover, supported by the MDL principle, parameter-free clustering can be performed, which enhances the usability of INTEGRATE on real-world data. Extensive experiments demonstrate the effectiveness of INTEGRATE in exploiting numerical and categorical information for clustering. As an efficient iterative algorithm, INTEGRATE is scalable to large data sets.
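
The abstract does not spell out the coding scheme, so the following is only a minimal sketch of the general MDL idea it describes, not the INTEGRATE algorithm itself. It assumes a Gaussian code for each numerical attribute, an empirical-frequency code for each categorical attribute, and a simple (1/2)·log2(n) penalty per model parameter. All function names and the `resolution` parameter are hypothetical. Because both attribute types are measured in bits, their costs can be added directly, and a candidate clustering with a lower total cost is preferred without a user-supplied weight or cluster count.

```python
import math
from collections import Counter

def numeric_cost_bits(values, resolution=0.01):
    """Bits to encode values with a Gaussian fitted to them.

    Assumption: values are discretised at `resolution`, so each value costs
    -log2(density * resolution) bits.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    var = max(var, resolution ** 2)  # guard against zero variance
    bits = 0.0
    for v in values:
        density = math.exp(-(v - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        bits -= math.log2(density * resolution)
    return bits

def categorical_cost_bits(values):
    """Bits to encode values with a code built from their empirical frequencies."""
    counts = Counter(values)
    n = len(values)
    return -sum(c * math.log2(c / n) for c in counts.values())

def cluster_cost_bits(rows, numeric_idx, categorical_idx):
    """Data cost plus parameter cost for one cluster (illustrative penalty)."""
    n = len(rows)
    data_bits = 0.0
    num_params = 0
    for j in numeric_idx:
        data_bits += numeric_cost_bits([r[j] for r in rows])
        num_params += 2  # mean and variance
    for j in categorical_idx:
        column = [r[j] for r in rows]
        data_bits += categorical_cost_bits(column)
        num_params += len(set(column)) - 1  # free category probabilities
    return data_bits + 0.5 * num_params * math.log2(n)

def clustering_cost_bits(clusters, numeric_idx, categorical_idx):
    """Total description length of a clustering, including cluster-id bits."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        total += cluster_cost_bits(c, numeric_idx, categorical_idx)
        total -= len(c) * math.log2(len(c) / n)  # cost of naming the cluster
    return total

# Toy data: one numerical and one categorical attribute per row.
group_a = [(1.0, "a"), (1.1, "a"), (0.9, "a"), (1.0, "a")]
group_b = [(5.0, "b"), (5.1, "b"), (4.9, "b"), (5.0, "b")]

one_cluster = clustering_cost_bits([group_a + group_b], [0], [1])
two_clusters = clustering_cost_bits([group_a, group_b], [0], [1])
print(f"one cluster: {one_cluster:.1f} bits, two clusters: {two_clusters:.1f} bits")
```

On this toy data, the split into two groups needs fewer bits than a single cluster, because both the numerical spread and the categorical entropy shrink by more than the extra model and cluster-id cost.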

Author information

Authors and Affiliations

University of Munich
Christian Böhm, Sebastian Goebl, Annahita Oswald, Michael Plavinski & Bianca Wackersreuther
Technische Universität München
Claudia Plant

Editor information

Editors and Affiliations

Computer Science Department, Rensselaer Polytechnic Institute, USA
Mohammed J. Zaki
The Chinese University of Hong Kong, China
Jeffrey Xu Yu
IIT Madras, Chennai, India
B. Ravindran
IIIT, Hyderabad, India
Vikram Pudi

Cite this paper

Böhm, C., Goebl, S., Oswald, A., Plant, C., Plavinski, M., Wackersreuther, B. (2010). Integrative Parameter-Free Clustering of Data with Mixed Type Attributes. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2010. Lecture Notes in Computer Science, vol 6118. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13657-3_7

DOI: https://doi.org/10.1007/978-3-642-13657-3_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13656-6
Online ISBN: 978-3-642-13657-3
eBook Packages: Computer Science, Computer Science (R0)
