A web-based approach to data imputation

World Wide Web

Abstract

In this paper, we present WebPut, a prototype system that takes a novel web-based approach to the data imputation problem. To this end, WebPut combines the information available in an incomplete database with the data consistency principle. WebPut also extends effective Information Extraction (IE) methods to formulate web search queries that retrieve missing values with high accuracy. A confidence-based scheme draws on this suite of data imputation queries to automatically select the most effective imputation query for each missing value. We propose a greedy iterative algorithm that schedules the order in which the missing values in a database are imputed, and hence the order in which their corresponding imputation queries are issued, to improve the accuracy and efficiency of WebPut. We further propose several optimization techniques that reduce the cost of estimating the confidence of imputation queries at both the tuple level and the database level. Experiments on several real-world data collections demonstrate the effectiveness of WebPut compared with existing approaches, as well as the efficiency of the proposed algorithms and optimization techniques.
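As a rough illustration of the confidence-based selection and greedy scheduling described in the abstract, the following Python sketch fills the missing value whose best query has the highest estimated confidence, then re-evaluates the remaining missing values. It is not WebPut's actual algorithm: the names `FAKE_WEB`, `make_query`, `best_candidate`, `greedy_impute`, and `min_conf` are hypothetical, the confidence scores are made-up constants, and a lookup table stands in for real web search queries.

```python
# Hypothetical stand-in for the web: (known attribute, known value, target attribute) -> value.
# A real system would issue web search queries here instead.
FAKE_WEB = {
    ("name", "A. Smith", "university"): "Example University",
    ("name", "A. Smith", "city"): "Sampletown",
    ("university", "Example University", "city"): "Exampleville",
}


def make_query(known_attr, target_attr, confidence):
    """Build a toy imputation query with an assumed, fixed confidence."""
    def query(row):
        known = row.get(known_attr)
        if known is None:
            return None  # the query cannot be formed without the known value
        value = FAKE_WEB.get((known_attr, known, target_attr))
        return None if value is None else (confidence, value)
    return query


def best_candidate(row, attr, queries):
    """Return the highest-confidence (confidence, value) pair for row[attr], or None."""
    results = [q(row) for q in queries.get(attr, [])]
    results = [r for r in results if r is not None]
    return max(results, default=None)


def greedy_impute(table, queries, min_conf=0.5):
    """Repeatedly fill the single missing cell with the most confident candidate."""
    order = []
    while True:
        best = None
        for i, row in enumerate(table):
            for attr, val in row.items():
                if val is None:
                    cand = best_candidate(row, attr, queries)
                    if cand is not None and (best is None or cand[0] > best[0]):
                        best = (cand[0], cand[1], i, attr)
        if best is None or best[0] < min_conf:
            return order  # nothing left that is confident enough to impute
        _, value, i, attr = best
        table[i][attr] = value  # the new value may enable better queries next round
        order.append((i, attr, value))


QUERIES = {
    "university": [make_query("name", "university", 0.7)],
    "city": [make_query("name", "city", 0.4),
             make_query("university", "city", 0.9)],
}

table = [{"name": "A. Smith", "university": None, "city": None}]
order = greedy_impute(table, QUERIES)
print(order)
```

In this toy run the university is imputed first, because the only query available for the city at that point has low confidence. Once the university is known, a higher-confidence city query becomes usable, which is the kind of ordering effect a greedy schedule is meant to exploit.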

Author information

Corresponding author

Correspondence to Zhixu Li.

About this article

Cite this article

Li, Z., Sharaf, M.A., Sitbon, L. et al. A web-based approach to data imputation. World Wide Web 17, 873–897 (2014). https://doi.org/10.1007/s11280-013-0263-z

  • DOI: https://doi.org/10.1007/s11280-013-0263-z
