Missing Value Imputation Based on K-Mean Clustering with Weighted Distance

Patil, Bankat M.; Joshi, Ramesh C.; Toshniwal, Durga

doi:10.1007/978-3-642-14834-7_56

Part of the book series: Communications in Computer and Information Science (CCIS, volume 94)

Included in the following conference series:

International Conference on Contemporary Computing

Abstract

It is common to encounter databases in which up to half of the entries are missing, and this is especially true of medical databases. Most statistical and data mining techniques require complete datasets and do not give accurate results when values are missing. Several methods have been proposed to deal with missing data. A commonly used method is to delete instances that have a missing attribute value. This approach is suitable when there are few missing values; when many values are missing, deleting these instances discards the bulk of the information. Another way to cope with the problem is imputation, that is, filling in the missing attribute values. We propose an efficient missing value imputation method based on clustering with weighted distance. We divide the data set into clusters based on a user-specified value K. We then find the complete-valued neighbor that is nearest to the instance with the missing value, and compute the missing value by taking the average of the centroid value and the centroidal distance of the neighbor. This value is used as the imputed value. Our approach uses the K-means technique with weighted distance, and we show that it results in better performance.
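
The steps outlined in the abstract can be sketched in a few lines of Python. This is only one possible reading, not the authors' implementation: the abstract does not define the weighting scheme or the "centroidal distance of the neighbor", so the sketch assumes per-attribute weights in a weighted Euclidean distance and imputes the average of the cluster centroid's value and the nearest complete neighbor's value for the missing attribute. The names `weighted_distance`, `kmeans`, `impute`, and `weights` are hypothetical, and seeding the centroids with the first K complete rows is a simplification.

```python
import math

def weighted_distance(a, b, weights, dims):
    """Weighted Euclidean distance over the attribute indices in dims."""
    return math.sqrt(sum(weights[j] * (a[j] - b[j]) ** 2 for j in dims))

def kmeans(rows, k, weights, iterations=20):
    """Plain K-means on complete rows (assumption: first k rows seed the centroids)."""
    dims = range(len(rows[0]))
    centroids = [list(r) for r in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iterations):
        labels = [min(range(k),
                      key=lambda c: weighted_distance(r, centroids[c], weights, dims))
                  for r in rows]
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def impute(data, k, weights):
    """Fill None entries from the nearest centroid and nearest complete neighbor."""
    complete = [r for r in data if None not in r]
    centroids, labels = kmeans(complete, k, weights)
    result = []
    for row in data:
        if None not in row:
            result.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]
        # Cluster whose centroid is closest on the observed attributes.
        c = min(range(k),
                key=lambda i: weighted_distance(row, centroids[i], weights, observed))
        # Complete rows in that cluster; fall back to all complete rows if empty.
        members = [r for r, l in zip(complete, labels) if l == c] or complete
        neighbor = min(members,
                       key=lambda r: weighted_distance(row, r, weights, observed))
        # Assumed rule: average of the centroid value and the neighbor value.
        result.append([v if v is not None else (centroids[c][j] + neighbor[j]) / 2
                       for j, v in enumerate(row)])
    return result

# Toy data: two well-separated groups, with one missing value in each of two rows.
data = [
    [1.0, 2.0], [8.0, 9.0], [1.2, 1.8], [8.2, 9.2], [0.8, 2.2], [7.8, 8.8],
    [1.15, None], [None, 9.15],
]
filled = impute(data, k=2, weights=[1.0, 1.0])
print(filled[6], filled[7])
```

On this toy data the two gaps are filled with values near 1.9 and 8.1, each lying between its cluster centroid and the closest complete row. Raising one entry of `weights` makes that attribute count more when choosing the cluster and the neighbor.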

Author information

Authors and Affiliations

Department of Electronics and Computer Engineering, Indian Institute of Technology, Roorkee, Uttarakhand, India, 247667
Bankat M. Patil, Ramesh C. Joshi & Durga Toshniwal

Editor information

Editors and Affiliations

Dept. of Computer Sciences, University of Florida, 32611, Gainesville, FL, USA
Sanjay Ranka
University of Florida, Gainesville, FL, USA
Arunava Banerjee
Department of Computer Science and Engineering, Indian Institute of Technology, 110016, New Delhi, India
Kanad Kishore Biswas
Computer Science, College of Engineering and Science, Louisiana Tech University, LA 71272, Ruston, USA
Sumeet Dua
University of Florida, Gainesville, FL, USA
Prabhat Mishra
Department of Computer Science & Engineering, Indian Institute of Technology, 208016, Kanpur, India
Rajat Moona
National Tsing Hua University, Hsin-Chu, Taiwan, R.O.C.
Sheung-Hung Poon
Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong
Cho-Li Wang

About this paper

Cite this paper

Patil, B.M., Joshi, R.C., Toshniwal, D. (2010). Missing Value Imputation Based on K-Mean Clustering with Weighted Distance. In: Ranka, S., et al. Contemporary Computing. IC3 2010. Communications in Computer and Information Science, vol 94. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14834-7_56

DOI: https://doi.org/10.1007/978-3-642-14834-7_56
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14833-0
Online ISBN: 978-3-642-14834-7
eBook Packages: Computer Science, Computer Science (R0)
