Clustering mixed type data: a space structure-based approach

Li, Feijiang; Qian, Yuhua; Wang, Jieting; Peng, Furong; Liang, Jiye

doi:10.1007/s13042-022-01602-x

Original Article
Published: 05 July 2022

International Journal of Machine Learning and Cybernetics, volume 13, pages 2799–2812 (2022)

Feijiang Li¹,
Yuhua Qian¹,² (ORCID: orcid.org/0000-0001-6772-4247),
Jieting Wang¹,
Furong Peng¹ &
Jiye Liang²

Abstract

Clustering mixed type data is important in areas such as knowledge discovery and machine learning. Although many clustering algorithms have been developed for mixed type data, the task remains challenging. The main difficulty is that the numerical attributes and categorical attributes of mixed type data do not lie in the same space. Most mixed data clustering methods handle the two types of attributes separately, and the gap between numerical and categorical attributes is not handled well. To address these issues, we extend the space structure representation scheme for categorical data to mixed type data. In the new scheme, all attributes of the mixed type data are expressed as numerical attributes in a Euclidean space. In addition, we propose an accelerated approximate space structure based on the Nyström method, which reduces the time and memory cost of constructing a space structure. We then propose general frameworks for mixed type data clustering based on the space structure data (SBM) and the accelerated approximate space structure (Ap-SBM). Experimental analyses show that the space structure can express the original mixed type data and that the accelerated approximate space structure can express the space structure. Experimental results on thirteen mixed type data sets from UCI show the superiority of the proposed frameworks over six other representative mixed type data clustering algorithms.
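
The abstract does not give the construction of the space structure, so the following is only a rough sketch of the general idea of a Nyström-style approximation, not the authors' SBM or Ap-SBM. It assumes a Gower-style similarity as a stand-in for the paper's similarity measure, and the function names (`mixed_similarity`, `nystrom_embedding`), the random landmark selection, and the eigenvalue threshold are all hypothetical choices made for illustration. Each object is mapped to a numerical vector computed from its similarities to a small set of landmark objects, so only an n × m block of similarities is built rather than the full n × n matrix.

```python
import numpy as np


def mixed_similarity(num_a, cat_a, num_b, cat_b, num_range):
    """Gower-style similarity between rows of A and rows of B.

    This is an illustrative stand-in, not the similarity used in the paper.
    """
    # Numerical attributes: 1 minus the range-normalised absolute difference.
    num_sim = 1.0 - np.abs(num_a[:, None, :] - num_b[None, :, :]) / num_range
    # Categorical attributes: 1 where the categories match, 0 otherwise.
    cat_sim = (cat_a[:, None, :] == cat_b[None, :, :]).astype(float)
    # Average the per-attribute similarities over all attributes.
    return np.concatenate([num_sim, cat_sim], axis=2).mean(axis=2)


def nystrom_embedding(num, cat, n_landmarks, seed=0):
    """Map mixed-type rows to Euclidean vectors Z so that Z @ Z.T ~ C W^+ C.T."""
    rng = np.random.default_rng(seed)
    n = num.shape[0]
    # Assumption: landmarks are sampled uniformly at random without replacement.
    idx = rng.choice(n, size=n_landmarks, replace=False)

    num_range = num.max(axis=0) - num.min(axis=0)
    num_range[num_range == 0] = 1.0  # guard against constant attributes

    C = mixed_similarity(num, cat, num[idx], cat[idx], num_range)  # n x m block
    W = C[idx]                                                     # m x m block
    W = (W + W.T) / 2.0                                            # enforce symmetry

    # Pseudo-inverse square root of W via its eigendecomposition.
    vals, vecs = np.linalg.eigh(W)
    keep = vals > 1e-10  # drop numerically zero directions
    Z = C @ vecs[:, keep] / np.sqrt(vals[keep])
    return Z, idx
```

The rows of `Z` are ordinary numerical vectors, so any clustering algorithm for numerical data (for example k-means) could be run on them. When every object is used as a landmark, `Z @ Z.T` reproduces the full similarity matrix up to numerical error; with fewer landmarks it is a low-rank approximation whose quality depends on the landmark choice.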

Acknowledgements

This work was supported by the National Key Research and Development Program of China (No. 2021ZD0112400), the National Natural Science Foundation of China (Nos. 62136005, 62106132), and the Shanxi Province Science Foundation for Youths (Nos. 201901D211168, 20210302124271, 202103021223026).

Author information

Feijiang Li and Yuhua Qian contributed equally to this work.

Authors and Affiliations

Institute of Big Data Science and Industry, Shanxi University, Taiyuan, 030006, Shanxi, China
Feijiang Li, Yuhua Qian, Jieting Wang & Furong Peng
Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Taiyuan, 030006, Shanxi, China
Yuhua Qian & Jiye Liang

Corresponding author

Correspondence to Yuhua Qian.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, F., Qian, Y., Wang, J. et al. Clustering mixed type data: a space structure-based approach. Int. J. Mach. Learn. & Cyber. 13, 2799–2812 (2022). https://doi.org/10.1007/s13042-022-01602-x

Received: 28 December 2021
Accepted: 10 June 2022
Published: 05 July 2022
Issue Date: September 2022
DOI: https://doi.org/10.1007/s13042-022-01602-x
