Dynamic Clustering-Based Estimation of Missing Values in Mixed Type Data

Ayuyev, Vadim V.; Jupin, Joseph; Harris, Philip W.; Obradovic, Zoran

doi:10.1007/978-3-642-03730-6_29

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5691)

Abstract

Choosing an appropriate method for imputing missing data becomes especially important when the fraction of missing values is large and the data are of mixed type. The proposed dynamic clustering imputation (DCI) algorithm relies on similarity information from shared neighbors, where mixed type variables are considered together. When evaluated on a public social science dataset of 46,043 mixed type instances with up to 33% missing values, DCI improved imputation accuracy by more than 20% over the Multiple Imputation, Predictive Mean Matching, Linear and Multilevel Regression, and Mean Mode Replacement methods. Data imputed by the six methods were then used in prediction tests with NB-Tree, Random Subset Selection and Neural Network-based classification models. In our experiments, classification accuracy obtained using DCI-preprocessed data was much better than accuracy obtained when alternative imputation methods were used for data preprocessing.
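To make the contrast in the abstract concrete, the following is a minimal sketch of two simple imputers for mixed type rows. The first is Mean Mode Replacement, one of the baselines named above. The second is a generic nearest-neighbor imputer using a Gower-style distance that treats numeric and categorical attributes together. It is not the DCI algorithm, whose details are not given in this abstract; it only illustrates the general idea of filling a gap from similar records. All function names, the `None` encoding of missing values, and the toy rows are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean

MISSING = None  # assumption: missing values are encoded as None


def is_numeric(values):
    """A column is treated as numeric if every observed value is a number."""
    return bool(values) and all(isinstance(v, (int, float)) for v in values)


def mean_mode_replace(rows):
    """Baseline: fill gaps with the column mean (numeric) or mode (categorical).

    Assumes every column has at least one observed value.
    """
    fills = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not MISSING]
        if is_numeric(observed):
            fills.append(mean(observed))
        else:
            fills.append(Counter(observed).most_common(1)[0][0])
    return [[fills[j] if v is MISSING else v for j, v in enumerate(r)]
            for r in rows]


def column_ranges(rows):
    """Per column: max - min for numeric columns, None for categorical ones."""
    ranges = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not MISSING]
        ranges.append(max(observed) - min(observed) if is_numeric(observed) else None)
    return ranges


def mixed_distance(a, b, ranges):
    """Gower-style dissimilarity over attributes observed in both rows."""
    total, used = 0.0, 0
    for x, y, rng in zip(a, b, ranges):
        if x is MISSING or y is MISSING:
            continue
        if rng is None:                      # categorical: simple mismatch
            total += 0.0 if x == y else 1.0
        else:                                # numeric: range-normalised gap
            total += abs(x - y) / rng if rng > 0 else 0.0
        used += 1
    return total / used if used else float("inf")


def neighbor_impute(rows, k=1):
    """Fill each gap from the k most similar rows that observe that attribute."""
    ranges = column_ranges(rows)
    fallback = mean_mode_replace(rows)       # used when no comparable donor exists
    result = []
    for i, row in enumerate(rows):
        filled = list(row)
        for j, v in enumerate(row):
            if v is not MISSING:
                continue
            scored = sorted(
                (mixed_distance(row, other, ranges), idx)
                for idx, other in enumerate(rows)
                if idx != i and other[j] is not MISSING
            )
            donors = [rows[idx][j] for d, idx in scored if d != float("inf")][:k]
            if not donors:
                filled[j] = fallback[i][j]
            elif ranges[j] is not None:
                filled[j] = mean(donors)
            else:
                filled[j] = Counter(donors).most_common(1)[0][0]
        result.append(filled)
    return result


# Toy rows (hypothetical): age, employer type, weekly hours.
rows = [
    [25, "private", 40],
    [27, "private", None],
    [52, "gov", 20],
    [50, None, 22],
]
print(mean_mode_replace(rows))
print(neighbor_impute(rows, k=1))
```

On these toy rows the baseline fills every gap with a global summary, while the neighbor-based variant borrows each value from the most similar record, which is why the two can disagree on the same missing cell.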

Author information

Authors and Affiliations

FN1-KF Department, Bauman Moscow State Technical University (Kaluga Branch), Bazgenova Str. 2, Kaluga, 248600, Russian Federation
Vadim V. Ayuyev
Center for Information Science and Technology, Temple University, 303 Wachman Hall, 1805 N. Broad St., Philadelphia, PA, 19122, USA
Joseph Jupin & Zoran Obradovic
Department of Criminal Justice, Temple University, 512 Glatfelter Hall, 1115 W Berks Str., Philadelphia, PA, 19122, USA
Philip W. Harris

Editor information

Editors and Affiliations

Department of Computer Science, Aalborg University, Selma Lagerlöfsvej 300, 9220, Aalborg Ø, Denmark
Torben Bach Pedersen
IBM India Research Lab, Plot No. 4, Block C, Institutional Area, Vasant Kunj, 110 070, New Delhi, India
Mukesh K. Mohania
Institute of Software Technology and Interactive Systems, Vienna University of Technology, Favoritenstr. 9-11/188, 1040, Wien, Austria
A Min Tjoa

Cite this paper

Ayuyev, V.V., Jupin, J., Harris, P.W., Obradovic, Z. (2009). Dynamic Clustering-Based Estimation of Missing Values in Mixed Type Data. In: Pedersen, T.B., Mohania, M.K., Tjoa, A.M. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2009. Lecture Notes in Computer Science, vol 5691. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03730-6_29

DOI: https://doi.org/10.1007/978-3-642-03730-6_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03729-0
Online ISBN: 978-3-642-03730-6
eBook Packages: Computer Science, Computer Science (R0)
