Skip to main content

A Data Cleaning Framework Based on User Feedback

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 7923))

Abstract

In this paper, we present our design of a data cleaning framework that combines interaction of data quality rules (CFDS, CINDS and MDs) with user feedback through an interactive process. First, to generate candidate repairs for each potentially dirty attribute, we propose an optimization model based on genetic algorithm. We then create a Bayesian machine learning model with several committees to predict the correctness of the repair and rank these repairs by uncertainly score to improve the learned model. User feedback is used to decide whether the model is accurate while inspecting the suggestions. Finally, our experiments on real-world datasets show significant improvement in data quality.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Herzog, T.N., Scheuren, F.J., Winkler, W.E.: Data Quality and Record Linkage Techniques. Springer, Heidelberg (2007)

    MATH  Google Scholar 

  2. Bohannon, P., Fan, W., Flaster, M., Rastogi, R.: A cost-based model and effective heuristic for repairing constraints by value modication. In: ACM SIGMOD, pp. 143–154 (2005)

    Google Scholar 

  3. Cong, G., Fan, W., Geerts, F., Jia, X., Ma, S.: Improving data quality:consistency and accuracy. In: VLDB, pp. 315–326 (2007)

    Google Scholar 

  4. Lopatenko, A., Bravo, L.: Efficient approximation algorithms for repairing inconsistent databases. In: ICDE, pp. 216–225 (2007)

    Google Scholar 

  5. Fan, W., Geerts, F.: Foundations of Data Quality Management. In: Synthesis Lectures on Data Management (2012)

    Google Scholar 

  6. Batini, C., Scannapieco, M.: Data Quality: Concepts, Methodologies and Techniques. Springer (2006)

    Google Scholar 

  7. Greco, G., Greco, S., Zumpano, E.: A logical framework for querying and repairing inconsistent databases. IEEE Trans. Knowl. Data Eng. 15(6), 1389–1408 (2003)

    Article  Google Scholar 

  8. Galhardas, H., Florescu, D., Shasha, D., Simon, E., Saita, C.: Declarative Data Cleaning: Language, Model and Algorithms. In: VLDB (2001)

    Google Scholar 

  9. Raman, V., Hellerstein, J.M.: Potter’s Wheel: An Interactive Data Cleaning System. In: VLDB (2001)

    Google Scholar 

  10. Jeery, S.R., Franklin, M.J., Halevy, A.Y.: Pay-as-you-go user feedback for dataspace systems. In: ACM SIGMOD, pp. 847–860 (2008)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Xie, H., Wang, H., Li, J., Gao, H. (2013). A Data Cleaning Framework Based on User Feedback. In: Wang, J., Xiong, H., Ishikawa, Y., Xu, J., Zhou, J. (eds) Web-Age Information Management. WAIM 2013. Lecture Notes in Computer Science, vol 7923. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38562-9_52

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-38562-9_52

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-38561-2

  • Online ISBN: 978-3-642-38562-9

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics