Abstract
One way to complete missing values is to exploit attribute correlations within the data. However, such relations are difficult to identify when the data themselves contain missing values. Accordingly, we develop a kernel-based missing data imputation method in this paper. The approach aims to optimize the inference on two statistical parameters, the mean and the distribution function, after the missing data are imputed. We refer to this approach as the parameter optimization method (the POP algorithm, a random regression imputation). We evaluate the approach experimentally and demonstrate that the POP algorithm outperforms deterministic regression imputation in the efficiency of inference on these two parameters. The results also show that the algorithm is computationally efficient, robust and stable for missing data imputation.
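To illustrate the distinction the abstract draws between deterministic and random regression imputation, the following is a minimal sketch, not the POP algorithm itself, whose details are not given here. It assumes a single fully observed attribute `x`, a response `y` with missing entries marked as `None`, a Gaussian kernel with a fixed bandwidth, and a Nadaraya-Watson estimator; the function names and the residual-resampling step are illustrative assumptions.

```python
import math
import random


def gaussian_kernel(u):
    """Unnormalised Gaussian kernel weight."""
    return math.exp(-0.5 * u * u)


def kernel_regression(x0, xs, ys, bandwidth):
    """Nadaraya-Watson estimate of E[y | x = x0] from complete (x, y) pairs."""
    weights = [gaussian_kernel((x0 - x) / bandwidth) for x in xs]
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed: fall back to the observed mean.
        return sum(ys) / len(ys)
    return sum(w * y for w, y in zip(weights, ys)) / total


def impute(xs, ys, bandwidth=1.0, randomize=False, rng=None):
    """Fill None entries of ys using kernel regression on xs.

    randomize=False: deterministic regression imputation (fitted value only).
    randomize=True:  random regression imputation (fitted value plus a
                     residual drawn from the observed cases); this is an
                     assumed, simplified form, not the paper's procedure.
    """
    rng = rng or random.Random(0)
    obs_x = [x for x, y in zip(xs, ys) if y is not None]
    obs_y = [y for y in ys if y is not None]
    # Residuals of the kernel fit on the observed cases.
    residuals = [
        y - kernel_regression(x, obs_x, obs_y, bandwidth)
        for x, y in zip(obs_x, obs_y)
    ]
    completed = []
    for x, y in zip(xs, ys):
        if y is None:
            y = kernel_regression(x, obs_x, obs_y, bandwidth)
            if randomize:
                y += rng.choice(residuals)
        completed.append(y)
    return completed


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [0.0, 1.1, None, 2.9, 4.2]
    print(impute(xs, ys))                  # deterministic variant
    print(impute(xs, ys, randomize=True))  # random variant
```

The deterministic variant places every imputed value exactly on the fitted curve, which tends to understate the spread of the completed data. Adding a resampled residual is one common way to preserve that spread, which matters when the target of inference is a distribution function rather than only a mean.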
This work is partially supported by Australian large ARC grants (DP0449535, DP0559536 and DP0667060), a China NSFC major research program (60496327), a China NSFC grant (60463003) and a grant from the Overseas Outstanding Talent Research Program of the Chinese Academy of Sciences (06S3011S01).
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhang, S., Qin, Y., Zhu, X., Zhang, J., Zhang, C. (2006). Optimized Parameters for Missing Data Imputation. In: Yang, Q., Webb, G. (eds) PRICAI 2006: Trends in Artificial Intelligence. PRICAI 2006. Lecture Notes in Computer Science(), vol 4099. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-36668-3_124
DOI: https://doi.org/10.1007/978-3-540-36668-3_124
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-36667-6
Online ISBN: 978-3-540-36668-3
eBook Packages: Computer Science (R0)