Abstract
A nearly universal problem with real data is that they are incomplete, with some values missing. Furthermore, the ways in which values can go missing are quite varied: arbitrary interdependencies between variables and their values can lead to missing values. To test and compare data mining algorithms, it is therefore necessary to generate artificial data which have the same characteristics. We introduce DataZapper, a tool for uncreating data. Given a dataset containing joint samples over variables, DataZapper will make a specified percentage of observed values disappear, replacing each with an indication that the measurement failed. DataZapper also supports any kind of dependence, and any degree of dependence, in its generation of missing values. We illustrate its use in a machine learning experiment and offer it to the data mining and machine learning communities.
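The idea of removing a specified percentage of values, with or without dependence on another variable, can be illustrated by a minimal Python sketch. It is not DataZapper's interface or algorithm: the function `zap`, its parameters (`target`, `fraction`, `parent`, `trigger`, `bias`) and the `"?"` marker are hypothetical names chosen for illustration, and the dependence shown is only the simplest case, where one value of one other variable makes a loss more likely.

```python
import random

MISSING = "?"  # hypothetical marker standing in for a failed measurement


def zap(rows, target, fraction, parent=None, trigger=None, bias=1.0, seed=0):
    """Return a copy of `rows` with a fraction of `target` values removed.

    Illustrative assumptions, not DataZapper's actual method:
    - rows is a list of dicts, one dict per joint sample;
    - with bias == 1.0 every row is equally likely to lose its value;
    - with bias > 1.0, rows where row[parent] == trigger are favoured.
    """
    rng = random.Random(seed)
    n_missing = round(fraction * len(rows))

    def weight(row):
        if parent is not None and row[parent] == trigger:
            return bias
        return 1.0

    # Weighted sampling without replacement: each row draws the key
    # u ** (1 / w), and the rows with the largest keys are selected.
    keys = [(rng.random() ** (1.0 / weight(row)), i) for i, row in enumerate(rows)]
    chosen = {i for _, i in sorted(keys, reverse=True)[:n_missing]}

    # Build new rows so the input dataset is left untouched.
    return [
        {**row, target: MISSING} if i in chosen else dict(row)
        for i, row in enumerate(rows)
    ]


# Toy dataset: 1000 samples over two variables.
data = [{"smoker": i % 2 == 0, "bp": 120 + i % 40} for i in range(1000)]

independent = zap(data, "bp", 0.2)
dependent = zap(data, "bp", 0.2, parent="smoker", trigger=True, bias=10.0)
```

Both calls remove the same percentage of `bp` values; they differ only in which rows are affected. Raising `bias` strengthens the dependence, and setting it to 1.0 removes it.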
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wen, Y., Korb, K.B., Nicholson, A.E. (2010). Generating Incomplete Data with DataZapper. In: Filipe, J., Fred, A., Sharp, B. (eds) Agents and Artificial Intelligence. ICAART 2009. Communications in Computer and Information Science, vol 67. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11819-7_9
DOI: https://doi.org/10.1007/978-3-642-11819-7_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-11818-0
Online ISBN: 978-3-642-11819-7
eBook Packages: Computer Science, Computer Science (R0)