Clustering of Heterogeneously Typed Data with Soft Computing - A Case Study

Kuri-Morales, Angel; Trejo-Baños, Daniel; Cortes-Berrueco, Luis Enrique

doi:10.1007/978-3-642-25330-0_21

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7095)

Included in the following conference series:

Mexican International Conference on Artificial Intelligence

Abstract

The problem of finding clusters in arbitrary sets of data has been attempted using different approaches. In most cases, the use of metrics to determine the adequacy of the clusters is assumed. That is, the criterion yielding a measure of quality of the clusters depends on the distance between the elements of each cluster. Typically, one considers a cluster to be adequately characterized if the elements within it are close to one another while, simultaneously, they appear to be far from those of different clusters. This intuitive approach fails if the variables of the elements of a cluster are not amenable to distance measurements, i.e., if the vectors of such elements cannot be quantified. This case arises frequently in real-world applications where several variables (if not most of them) correspond to categories. The usual tendency is to assign arbitrary numbers to every category, that is, to encode the categories. This, however, may result in spurious patterns: relationships between the variables which are not really there at the outset. It is evident that no assignment can ensure a universally valid numerical value for this kind of variable. But there is a strategy which guarantees that the encoding will, in general, not bias the results. In this paper we explore such a strategy. We discuss the theoretical foundations of our approach and prove that this is the best strategy in terms of the statistical behavior of the sampled data. We also show that, when applied to a complex real-world problem, it allows us to generalize soft computing methods to find the number and characteristics of a set of clusters. We contrast the characteristics of the clusters obtained from the automated method with those of the experts.
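The dependence of distance-based clustering on an arbitrary encoding can be illustrated with a small sketch. This is not the strategy proposed in the paper, which the abstract does not detail; the records, the category codes, and the helper names (`encode`, `nearest_to`) are all hypothetical and chosen only to show how swapping two codes changes which record appears closest.

```python
import math

# Hypothetical toy records (colour, size); not data from the paper.
records = {
    "a": ("red", "small"),
    "b": ("green", "small"),
    "c": ("blue", "small"),
}

def encode(record, codes):
    """Map each categorical value of a record to its assigned number."""
    return tuple(codes[value] for value in record)

def nearest_to(name, codes):
    """Return the other record closest to `name` under Euclidean distance."""
    target = encode(records[name], codes)
    others = [key for key in records if key != name]
    return min(others, key=lambda key: math.dist(target, encode(records[key], codes)))

# Two equally arbitrary encodings of the same categories.
codes_1 = {"red": 1, "green": 2, "blue": 3, "small": 0}
codes_2 = {"red": 1, "green": 3, "blue": 2, "small": 0}

print(nearest_to("a", codes_1))  # b
print(nearest_to("a", codes_2))  # c
```

The three colours are equally different from one another as categories, yet the nearest neighbour of record "a" flips from "b" to "c" when two codes are exchanged. Any structure a distance-based method finds here is a product of the encoding, not of the data.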

Author information

Authors and Affiliations

Instituto Tecnológico Autónomo de México, Río Hondo No. 1, México, D.F., México
Angel Kuri-Morales
Universidad Nacional Autónoma de México, Apartado Postal 70-600, Ciudad Universitaria, México, D.F., México
Daniel Trejo-Baños & Luis Enrique Cortes-Berrueco

Editor information

Editors and Affiliations

Mexican Petroleum Institute (IMP), Eje Central Lazaro Cardenas Norte, 152, Col. San Bartolo Atepehuacan, CP 07730, Mexico DF, Mexico
Ildar Batyrshin
National Polytechnic Institute (IPN), Center for Computing Research (CIC), Av. Juan Dios Bátiz, s/n, Col. Nueva Industrial Vallejo, CP 07738, Mexico D.F., Mexico
Grigori Sidorov

About this paper

Cite this paper

Kuri-Morales, A., Trejo-Baños, D., Cortes-Berrueco, L.E. (2011). Clustering of Heterogeneously Typed Data with Soft Computing - A Case Study. In: Batyrshin, I., Sidorov, G. (eds) Advances in Soft Computing. MICAI 2011. Lecture Notes in Computer Science, vol 7095. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25330-0_21

DOI: https://doi.org/10.1007/978-3-642-25330-0_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25329-4
Online ISBN: 978-3-642-25330-0
eBook Packages: Computer Science, Computer Science (R0)
