Abstract
Researchers have explored various clustering strategies for partitioning heterogeneous data sets with numeric, binary, categorical and ordinal attributes. Data sets from real-life applications are often heterogeneous in nature, and converting them to a homogeneous form leads to information loss. In this paper, we propose an interblend fusing of genetic algorithm-based attribute selection to increase clustering accuracy in credit risk assessment. The proposed technique groups similar objects together without changing the characteristics of the heterogeneous data sets. The algorithm also identifies the importance of attributes when clustering a large number of objects with many attributes. The fusing technique yields a contextual distance measure for clustering the objects. The results presented in this paper show how the methodology applies to the data sets, and the algorithm compares favourably with the approaches reported in the related literature.
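To make the idea of clustering mixed attributes without converting them to a single type more concrete, the following is a minimal sketch of a Gower-style distance restricted to a selected attribute subset. It is not the contextual distance measure of the paper, which the abstract does not specify. The function name `mixed_distance`, the attribute kinds, the example records and the binary `mask` (standing in for a genetic-algorithm chromosome that switches attributes on or off) are all assumptions made for illustration.

```python
from typing import Sequence

# Hypothetical attribute kinds; binary attributes are treated as categorical.
NUMERIC, CATEGORICAL, ORDINAL = "numeric", "categorical", "ordinal"


def mixed_distance(
    a: Sequence,
    b: Sequence,
    kinds: Sequence[str],
    ranges: Sequence[float],
    mask: Sequence[int],
) -> float:
    """Gower-style distance over the attributes switched on in `mask`.

    Numeric and ordinal values (ordinal encoded as integer ranks) contribute
    a range-normalised absolute difference; categorical values contribute
    0 on a match and 1 on a mismatch. The result is the mean contribution
    of the selected attributes, so it always lies in [0, 1].
    """
    total = 0.0
    used = 0
    for x, y, kind, rng, keep in zip(a, b, kinds, ranges, mask):
        if not keep:
            continue  # attribute dropped by the (assumed) selection mask
        if kind == CATEGORICAL:
            total += 0.0 if x == y else 1.0
        else:
            total += abs(x - y) / rng if rng > 0 else 0.0
        used += 1
    return total / used if used else 0.0


# Two made-up records: age, housing, education rank, has-phone flag.
kinds = (NUMERIC, CATEGORICAL, ORDINAL, CATEGORICAL)
ranges = (40.0, 0.0, 4.0, 0.0)  # ranges are ignored for categorical attributes
a = (30, "own", 2, 1)
b = (50, "rent", 3, 1)

print(mixed_distance(a, b, kinds, ranges, (1, 1, 1, 1)))  # 0.4375
print(mixed_distance(a, b, kinds, ranges, (1, 0, 1, 0)))  # 0.375
```

In a wrapper-style search, each candidate mask would be scored by the quality of the clustering obtained with this distance, and the search would keep the masks that score best. That scoring step is left out here because the abstract does not describe it.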



Change history
17 October 2024
This article has been retracted. Please see the Retraction Notice for more detail: https://doi.org/10.1007/s00500-024-10194-3
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Additional information
Communicated by P. Pandian.
About this article
Cite this article
Dhayanithi, J., Akilandeswari, J. RETRACTED ARTICLE: Interblend fusing of genetic algorithm-based attribute selection for clustering heterogeneous data set. Soft Comput 23, 2747–2759 (2019). https://doi.org/10.1007/s00500-018-3669-9
DOI: https://doi.org/10.1007/s00500-018-3669-9