Missing Value Imputation Based on K-Mean Clustering with Weighted Distance

  • Conference paper
Contemporary Computing (IC3 2010)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 94)

Abstract

It is common to encounter databases in which up to half of the entries are missing, and this is especially true of medical databases. Most statistical and data mining techniques require complete datasets and do not give accurate results when values are missing. Several methods have been proposed to deal with missing data. A commonly used method is to delete instances that have a missing attribute value. This approach is suitable when there are few missing values; when many values are missing, deleting these instances discards a large share of the information. Another way to cope with the problem is imputation, that is, filling in the missing attribute values. We propose an efficient missing value imputation method based on clustering with weighted distance. We divide the data set into clusters based on a user-specified value K, then find the complete-valued neighbor nearest to the instance with the missing value. We then compute the missing value by taking the average of the centroid value and the centroidal distance of the neighbor, and use this as the imputed value. Our approach uses the K-means technique with weighted distance, and we show that it results in better performance.
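
The steps outlined in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the weighting scheme or the exact combination formula, so the sketch assumes a per-attribute weighted Euclidean distance (weights default to 1.0) and reads the imputed value as the mean of the cluster centroid's value and the nearest complete neighbor's value for the missing attribute. All function names (`weighted_dist`, `assign`, `kmeans`, `impute`) are hypothetical, and missing entries are represented as `None`.

```python
import math
import random


def weighted_dist(a, b, weights):
    """Weighted Euclidean distance over attributes present (not None) in both rows."""
    total = 0.0
    for x, y, w in zip(a, b, weights):
        if x is not None and y is not None:
            total += w * (x - y) ** 2
    return math.sqrt(total)


def assign(row, centroids, weights):
    """Index of the centroid closest to `row` under the weighted distance."""
    return min(range(len(centroids)),
               key=lambda c: weighted_dist(row, centroids[c], weights))


def kmeans(rows, k, weights, iters=50, seed=0):
    """Plain K-means on complete rows; returns the list of centroids."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for r in rows:
            clusters[assign(r, centroids, weights)].append(r)
        # Recompute each centroid as the column-wise mean of its members;
        # an empty cluster keeps its previous centroid.
        new = [
            [sum(col) / len(members) for col in zip(*members)]
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new == centroids:
            break
        centroids = new
    return centroids


def impute(data, k, weights=None, seed=0):
    """Fill None entries using the cluster centroid and nearest complete neighbor.

    Assumption: imputed value = mean of the centroid's value and the
    nearest complete neighbor's value for the missing attribute.
    """
    weights = weights or [1.0] * len(data[0])
    complete = [r for r in data if None not in r]
    centroids = kmeans(complete, k, weights, seed=seed)
    labels = [assign(r, centroids, weights) for r in complete]

    result = []
    for row in data:
        if None not in row:
            result.append(list(row))
            continue
        # Cluster and neighbor are chosen using only the observed attributes.
        j = assign(row, centroids, weights)
        members = [r for r, lab in zip(complete, labels) if lab == j] or complete
        neighbor = min(members, key=lambda r: weighted_dist(row, r, weights))
        result.append([
            (centroids[j][i] + neighbor[i]) / 2 if v is None else v
            for i, v in enumerate(row)
        ])
    return result
```

Calling `impute(data, k=2)` on a list of equal-length rows returns a copy of the data with each `None` filled in. Clustering only the complete rows keeps the centroids well defined; other choices, such as clustering all rows on their observed attributes, are equally plausible readings of the abstract.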

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Patil, B.M., Joshi, R.C., Toshniwal, D. (2010). Missing Value Imputation Based on K-Mean Clustering with Weighted Distance. In: Ranka, S., et al. Contemporary Computing. IC3 2010. Communications in Computer and Information Science, vol 94. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14834-7_56

  • DOI: https://doi.org/10.1007/978-3-642-14834-7_56

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-14833-0

  • Online ISBN: 978-3-642-14834-7

  • eBook Packages: Computer Science, Computer Science (R0)
