Dynamic Clustering-Based Estimation of Missing Values in Mixed Type Data

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5691)

Abstract

Choosing an appropriate method for imputing missing data becomes especially important when the fraction of missing values is large and the data are of mixed type. The proposed dynamic clustering imputation (DCI) algorithm relies on similarity information from shared neighbors and considers mixed type variables together. Evaluated on a public social science dataset of 46,043 mixed type instances with up to 33% missing values, DCI improved imputation accuracy by more than 20% over Multiple Imputation, Predictive Mean Matching, Linear and Multilevel Regression, and Mean Mode Replacement methods. Data imputed by the six methods were then used in prediction tests with NB-Tree, Random Subset Selection and Neural Network-based classification models. In our experiments, classification accuracy on DCI-preprocessed data was much better than on data preprocessed with the alternative imputation methods.
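
The abstract does not give the details of DCI, so the following is not the authors' algorithm. It is only a minimal sketch of the general idea of neighbor-based imputation on mixed type data: a plain k-nearest-neighbor donor scheme with a Gower-style distance that treats numeric and categorical fields together. The function names (`numeric_spans`, `mixed_distance`, `knn_impute`), the parameter `k`, and the toy records are all assumptions made for illustration.

```python
from collections import Counter

def numeric_spans(records, numeric_keys):
    """Observed range of each numeric field, used to scale differences to [0, 1]."""
    spans = {}
    for key in numeric_keys:
        seen = [r[key] for r in records if r.get(key) is not None]
        width = max(seen) - min(seen) if seen else 0
        spans[key] = width or 1.0  # avoid division by zero for constant fields
    return spans

def mixed_distance(a, b, spans):
    """Gower-style distance over the fields that both records have observed."""
    total, used = 0.0, 0
    for key, va in a.items():
        vb = b.get(key)
        if va is None or vb is None:
            continue
        if key in spans:                       # numeric: range-scaled difference
            total += abs(va - vb) / spans[key]
        else:                                  # categorical: simple mismatch
            total += 0.0 if va == vb else 1.0
        used += 1
    return total / used if used else float("inf")

def knn_impute(records, numeric_keys, k=2):
    """Fill each missing value (None) from the k nearest records that observe it."""
    spans = numeric_spans(records, numeric_keys)
    completed = []
    for i, rec in enumerate(records):
        filled = dict(rec)
        for key, value in rec.items():
            if value is not None:
                continue
            # Distances always use the original records, never imputed values.
            ranked = sorted(
                (mixed_distance(rec, other, spans), j)
                for j, other in enumerate(records)
                if j != i and other.get(key) is not None
            )
            donors = [records[j][key] for d, j in ranked[:k] if d != float("inf")]
            if not donors:
                continue                       # no usable neighbor: leave missing
            if key in spans:
                filled[key] = sum(donors) / len(donors)              # mean
            else:
                filled[key] = Counter(donors).most_common(1)[0][0]   # mode
        completed.append(filled)
    return completed

# Invented toy data: two numeric fields and one categorical field.
records = [
    {"age": 30, "hours": 40, "job": "clerk"},
    {"age": 32, "hours": 42, "job": "clerk"},
    {"age": 60, "hours": 20, "job": "farmer"},
    {"age": 62, "hours": 18, "job": "farmer"},
    {"age": 31, "hours": None, "job": "clerk"},
    {"age": 58, "hours": 22, "job": None},
]
completed = knn_impute(records, numeric_keys=["age", "hours"], k=2)
print(completed[4]["hours"])  # 41.0
print(completed[5]["job"])    # farmer
```

In this sketch a missing numeric value is replaced by the mean of its donors and a missing categorical value by their mode. A shared-neighbor or clustering approach such as the one the abstract describes would choose donors differently.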

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ayuyev, V.V., Jupin, J., Harris, P.W., Obradovic, Z. (2009). Dynamic Clustering-Based Estimation of Missing Values in Mixed Type Data. In: Pedersen, T.B., Mohania, M.K., Tjoa, A.M. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2009. Lecture Notes in Computer Science, vol 5691. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03730-6_29

  • DOI: https://doi.org/10.1007/978-3-642-03730-6_29

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-03729-0

  • Online ISBN: 978-3-642-03730-6

  • eBook Packages: Computer Science, Computer Science (R0)