Abstract
Imputing incomplete data plays an important role in big data analysis and smart computing. Existing algorithms are both inefficient and inaccurate when imputing incomplete high-dimensional data. This paper proposes an incomplete high-dimensional data imputation algorithm based on feature selection and cluster analysis (IHDIFC), which works in three steps. First, a hierarchical clustering-based feature subset selection algorithm reduces the dimensionality of the data set. Second, a parallel \(k\)-means algorithm based on partial distance clusters the selected data subset efficiently. Finally, the data objects in the same cluster as the target object are used to estimate its missing feature values. Extensive experiments compare IHDIFC with two representative missing data imputation algorithms, FIMUS and DMI. The results demonstrate that the proposed algorithm achieves better imputation accuracy and takes significantly less time than FIMUS and DMI when imputing high-dimensional data.
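The second and third steps can be illustrated with a minimal, serial sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the code assumes the commonly used partial distance (squared differences over observed features, rescaled by the ratio of total to observed features) and assumes that a missing value is estimated by the mean of the observed values of that feature within the object's cluster. The function names `partial_distance_sq`, `kmeans_partial`, and `impute_by_cluster` are hypothetical, and the feature selection step and the parallel execution are omitted.

```python
import numpy as np

def partial_distance_sq(x, center):
    """Squared partial distance between an incomplete object and a center.

    Only features observed in x (non-NaN) contribute; the sum is rescaled
    by (total features / observed features). Assumed formulation.
    """
    observed = ~np.isnan(x)
    n_obs = observed.sum()
    if n_obs == 0:
        return np.inf  # nothing observed, so no distance can be measured
    diff = x[observed] - center[observed]
    return (x.size / n_obs) * float(np.dot(diff, diff))

def kmeans_partial(X, k, n_iter=20, seed=0):
    """Serial k-means on data with NaNs, using the partial distance.

    Assumes X is a float array and every feature is observed at least once.
    """
    rng = np.random.default_rng(seed)
    col_means = np.nanmean(X, axis=0)
    # Initial centers: k random objects, with their gaps filled by column means.
    idx = rng.choice(len(X), size=k, replace=False)
    centers = np.where(np.isnan(X[idx]), col_means, X[idx])
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: nearest center under the partial distance.
        dists = np.array([[partial_distance_sq(x, c) for c in centers] for x in X])
        labels = dists.argmin(axis=1)
        # Update step: per-feature mean over the observed values in each cluster.
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue  # keep the previous center for an empty cluster
            obs_count = (~np.isnan(members)).sum(axis=0)
            sums = np.nansum(members, axis=0)
            has_obs = obs_count > 0
            centers[j, has_obs] = sums[has_obs] / obs_count[has_obs]
    return labels, centers

def impute_by_cluster(X, labels, centers):
    """Fill each missing value with its cluster center's value for that feature."""
    X_filled = X.copy()
    missing = np.isnan(X)
    X_filled[missing] = centers[labels][missing]
    return X_filled
```

Calling `kmeans_partial(X, k)` on a float array with `NaN` marking missing entries, then passing the result to `impute_by_cluster`, returns a completed copy of `X` in which the observed entries are unchanged. In the pipeline described above, `X` would be the data restricted to the selected feature subset.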


Additional information
This work was supported by Project U1301253 of NSFC and Project 201202032 of Liaoning Provincial Natural Science Foundation of China.
About this article
Cite this article
Bu, F., Chen, Z., Zhang, Q. et al. Incomplete high-dimensional data imputation algorithm using feature selection and clustering analysis on cloud. J Supercomput 72, 2977–2990 (2016). https://doi.org/10.1007/s11227-015-1433-9
DOI: https://doi.org/10.1007/s11227-015-1433-9