Generating Incomplete Data with DataZapper

Wen, Yingying; Korb, Kevin B.; Nicholson, Ann E.

doi:10.1007/978-3-642-11819-7_9


Part of the book series: Communications in Computer and Information Science (CCIS, volume 67)

Included in the following conference series:

International Conference on Agents and Artificial Intelligence


Abstract

A nearly universal problem with real data is that they are incomplete, with some values missing. Furthermore, the ways in which values can go missing are quite varied, with arbitrary interdependencies between variables and their values leading to missing values. In order to test and compare data mining algorithms it is necessary to generate artificial data which have the same characteristics. We introduce DataZapper, a tool for uncreating data. Given a dataset containing joint samples over variables, DataZapper will make a specified percentage of observed values disappear, replaced by an indication that the measurement failed. DataZapper also supports any kind of dependence, and any degree of dependence, in its generation of missing values. We illustrate its use in a machine learning experiment and offer it to the data mining and machine learning communities.
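
The two behaviours the abstract describes, removing a specified percentage of observed values and making missingness depend on other variables, can be illustrated with a minimal sketch. This is not DataZapper's interface or algorithm, which this page does not show; the names `MISSING`, `zap_uniform`, `zap_dependent`, and `prob_missing` are hypothetical and were chosen only for illustration.

```python
import random

MISSING = None  # hypothetical marker standing in for "the measurement failed"


def zap_uniform(rows, fraction, seed=0):
    """Return a copy of rows with a fixed fraction of cells set to MISSING.

    Cells are chosen uniformly at random, independently of any value.
    """
    rng = random.Random(seed)
    cells = [(r, c) for r, row in enumerate(rows) for c in range(len(row))]
    k = round(fraction * len(cells))
    out = [list(row) for row in rows]
    for r, c in rng.sample(cells, k):
        out[r][c] = MISSING
    return out


def zap_dependent(rows, target, parent, prob_missing, seed=0):
    """Return a copy of rows where column `target` goes missing with a
    probability that depends on the value in column `parent`.

    prob_missing maps a parent value to a probability in [0, 1].
    """
    rng = random.Random(seed)
    out = [list(row) for row in rows]
    for row in out:
        if rng.random() < prob_missing(row[parent]):
            row[target] = MISSING
    return out


# Toy dataset: 100 joint samples over three discrete variables.
data = [[i % 2, i % 3, i % 5] for i in range(100)]

# Remove 10% of all values regardless of content.
uniform = zap_uniform(data, 0.10)

# Variable 2 is far more likely to be missing when variable 0 equals 1.
dependent = zap_dependent(
    data, target=2, parent=0,
    prob_missing=lambda v: 0.8 if v == 1 else 0.1,
)
```

In this sketch the strength of the dependence is set by how far apart the probabilities returned by `prob_missing` are; equal probabilities for every parent value reduce it to value-independent missingness.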



Author information

Authors and Affiliations

Clayton School of Information Technology, Monash University, 3800, VIC, Australia
Yingying Wen, Kevin B. Korb & Ann E. Nicholson


Editor information

Editors and Affiliations

Department of Systems and Informatics, Polytechnic Institute of Setúbal – INSTICC, Rua do Vale de Chaves - Estefanilha, 2910-761, Setúbal, Portugal
Joaquim Filipe
Instituto de Telecomunicações, IST - Instituto Superior Técnico, Av. Rovisco Pais, 1, 1049-001, Lisbon, Portugal
Ana Fred
School of Computing, Staffordshire University, Beaconside, Stafford, UK
Bernadette Sharp


About this paper

Cite this paper

Wen, Y., Korb, K.B., Nicholson, A.E. (2010). Generating Incomplete Data with DataZapper. In: Filipe, J., Fred, A., Sharp, B. (eds) Agents and Artificial Intelligence. ICAART 2009. Communications in Computer and Information Science, vol 67. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11819-7_9


DOI: https://doi.org/10.1007/978-3-642-11819-7_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-11818-0
Online ISBN: 978-3-642-11819-7
eBook Packages: Computer Science, Computer Science (R0)
