Incomplete high-dimensional data imputation algorithm using feature selection and clustering analysis on cloud

Bu, Fanyu; Chen, Zhikui; Zhang, Qingchen; Yang, Laurence T.

doi:10.1007/s11227-015-1433-9

Published: 06 May 2015

Volume 72, pages 2977–2990 (2016)

The Journal of Supercomputing

Fanyu Bu¹,², Zhikui Chen¹, Qingchen Zhang¹ & Laurence T. Yang³

Abstract

Incomplete data imputation plays an important role in big data analysis and smart computing. Existing algorithms are neither efficient nor accurate enough when imputing incomplete high-dimensional data. This paper proposes an incomplete high-dimensional data imputation algorithm based on feature selection and cluster analysis (IHDIFC), which works in three steps. First, a hierarchical clustering-based feature subset selection algorithm reduces the dimensionality of the data set. Second, a parallel $k$-means algorithm based on partial distance clusters the selected data subset efficiently. Finally, the data objects in the same cluster as the target object are used to estimate its missing feature values. Extensive experiments compare IHDIFC with two representative missing data imputation algorithms, FIMUS and DMI. The results demonstrate that the proposed algorithm achieves better imputation accuracy and takes significantly less time than the compared algorithms when imputing high-dimensional data.
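The second and third steps can be illustrated with a minimal sketch. It is not the authors' implementation: it runs serially rather than in parallel on a cloud platform, it skips the feature selection step, and it assumes the commonly used partial distance (squared differences over observed features only, rescaled by the ratio of total to observed features) together with cluster-mean imputation. The function names `partial_distance_sq` and `cluster_and_impute` and the `init_rows` parameter are hypothetical and chosen for illustration only.

```python
import numpy as np


def partial_distance_sq(x, c):
    """Squared partial distance from object x (NaN = missing) to centroid c."""
    observed = ~np.isnan(x)
    n_obs = int(observed.sum())
    if n_obs == 0:
        return np.inf  # nothing observed, so no basis for comparison
    diff = x[observed] - c[observed]
    # Rescale by d / (number of observed features) so that objects with
    # different amounts of missing data remain comparable.
    return (x.size / n_obs) * float(diff @ diff)


def cluster_and_impute(X, k, init_rows=None, n_iter=50, seed=0):
    """Serial k-means on incomplete data, then cluster-mean imputation.

    Assumption: a missing value is estimated by the mean of the observed
    values of that feature within the object's cluster.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    missing = np.isnan(X)
    # Column means over observed entries, used only to complete the seeds.
    col_sum = np.where(missing, 0.0, X).sum(axis=0)
    col_cnt = np.maximum((~missing).sum(axis=0), 1)
    col_mean = col_sum / col_cnt
    if init_rows is None:
        init_rows = np.random.default_rng(seed).choice(n, size=k, replace=False)
    centroids = np.where(missing[init_rows], col_mean, X[init_rows])
    labels = np.full(n, -1)
    for _ in range(n_iter):
        # Assignment step: nearest centroid under the partial distance.
        new_labels = np.array([
            np.argmin([partial_distance_sq(x, c) for c in centroids])
            for x in X
        ])
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
        # Update step: per-feature mean over the members that observe it.
        for j in range(k):
            members = labels == j
            obs = ~missing[members]
            cnt = obs.sum(axis=0)
            tot = np.where(obs, X[members], 0.0).sum(axis=0)
            centroids[j] = np.where(cnt > 0, tot / np.maximum(cnt, 1), centroids[j])
    # Imputation step: copy the centroid value into each missing entry.
    X_filled = np.where(missing, centroids[labels], X)
    return X_filled, labels


# Toy data: two well-separated groups, one missing value in row 4.
X_demo = np.array([
    [0.0, 0.2, 0.1],
    [0.1, 0.0, 0.2],
    [0.2, 0.1, 0.0],
    [10.0, 10.2, 10.1],
    [10.1, 10.0, np.nan],
    [10.2, 10.1, 10.0],
])
X_filled, labels = cluster_and_impute(X_demo, k=2, init_rows=[0, 3])
print(labels)
print(X_filled[4])
```

In this toy run the missing entry is filled from the two observed third-feature values in its own cluster (about 10.05), not from the global column mean, which is the behaviour the cluster-based step is meant to provide. Passing `init_rows` explicitly keeps the example deterministic; with random seeds, plain $k$-means can settle in a poor local optimum.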

References

Chen S et al (2012) Capacity of data collection in arbitrary wireless sensor networks. IEEE Trans Parallel Distrib Syst 23(1):52–60
Zhang Q, Chen Z (2014) A distributed weighted possibilistic c-means algorithm for clustering incomplete big sensor data. Int J Distrib Sens Networks 2014:161–169
Schneider T (2001) Analysis of incomplete climate data: estimation of mean values and covariance matrices and imputation of missing values. J Clim 14(5):853–871
Rahman MG, Islam MZ (2011) A decision tree-based missing value imputation technique for data pre-processing. In: Proceedings of Australasian data mining conference, pp 41–50
Batista G, Monard MC (2003) An analysis of four missing data treatment methods for supervised learning. Appl Artif Intell 17(5–6):519–533
Hathaway RJ, Bezdek JC (2001) Fuzzy c-means clustering of incomplete data. IEEE Trans Syst Man Cybern Part B: Cybern 31(5):735–744
Zhu C et al (2013) A review of key issues that concern the feasibility of mobile cloud computing. In: Proceedings of IEEE international conference on cyber, physical, and social computing, pp 769–776
Zhu C et al (2015) An authenticated trust and reputation calculation and management system for cloud and sensor networks integration. IEEE Trans Inf Forensics Secur 10(1):118–131
Rahman MG, Islam MZ (2014) FIMUS: a framework for imputing missing values using co-appearance, correlation and similarity analysis. Knowl Based Syst 56:311–327
Liu C, Dai D, Yan H (2010) The theoretic framework of local weighted approximation for microarray missing value estimation. Pattern Recognit 43(8):2993–3002
Rahman MG, Islam MZ (2013) Data quality improvement by imputation of missing values. In: Proceedings of international conference on computer science and information technology, pp 82–88
Wang X et al (2006) Missing value estimation for DNA microarray gene expression data by support vector regression imputation and orthogonal coding scheme. BMC Bioinform 7:32
Silva EL, Rafael PL (2011) Missing value imputation on missing completely at random data using multilayer perceptrons. Neural Networks 24(1):121–129
Smola AJ, Schölkopf B (2004) A tutorial on support vector regression. Stat Comput 14(3):199–222
Abdella M, Marwala T (2005) The use of genetic algorithms and neural networks to approximate missing data in database. In: Proceedings of IEEE 3rd international conference on computational cybernetics, pp 207–212
Li D et al (2004) Towards missing data imputation: A study of fuzzy k-means clustering method. In: Proceedings of rough sets and current trends in computing, pp 573–579
Liao Z et al (2009) Missing data imputation: a fuzzy K-means clustering algorithm over sliding window. In: Proceedings of IEEE international conference on fuzzy systems and knowledge discovery, pp 133–137
Aydilek IB, Arslan A (2013) A hybrid method for imputation of missing values using optimized fuzzy c-means with support vector regression and a genetic algorithm. Inf Sci 233:25–35
Krishnapuram R, Keller JM (1996) The possibilistic c-means algorithm: insights and recommendations. IEEE Trans Fuzzy Syst 4(3):385–393
Di Nuovo AG (2011) Missing data analysis with fuzzy c-means: a study of its application in a psychological scenario. Expert Syst Appl 38(6):6793–6797
Bu F, Chen Z, Zhang Q (2014) Incomplete big data clustering algorithm using feature selection and partial distance. In: Proceedings of 2014 5th IEEE conference on digital home, pp 263–266
Javed K, Babri HA, Saeed M (2012) Feature selection based on class-dependent densities for high-dimensional binary data. IEEE Trans Knowl Data Eng 24(3):465–477
Guyon I et al (2004) Result analysis of the NIPS 2003 feature selection challenge. In: Proceedings of advances in neural information processing systems, pp 545–552

Author information

Authors and Affiliations

School of Software Technology, Dalian University of Technology, Dalian, 116620, China
Fanyu Bu, Zhikui Chen & Qingchen Zhang
College of Vocation, Inner Mongolia University of Finance and Economics, Hohhot, 010010, China
Fanyu Bu
Department of Computer Science, St. Francis Xavier University, Antigonish, B2G 2W5, Canada
Laurence T. Yang

Corresponding author

Correspondence to Qingchen Zhang.

Additional information

This work was supported by Project U1301253 of NSFC and Project 201202032 of Liaoning Provincial Natural Science Foundation of China.

About this article

Cite this article

Bu, F., Chen, Z., Zhang, Q. et al. Incomplete high-dimensional data imputation algorithm using feature selection and clustering analysis on cloud. J Supercomput 72, 2977–2990 (2016). https://doi.org/10.1007/s11227-015-1433-9

Issue Date: August 2016
