AutoRepair: an automatic repairing approach over multi-source data

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Truth discovery and rule-based data repairing are two classic lines of approaches for improving data quality in the database field. Truth discovery methods resolve multi-source conflicts about the same entity by estimating the reliability of the different sources, while rule-based data repairing methods resolve inconsistencies among different entities using integrity constraints. However, both lines of methods suffer from unsatisfactory performance because each lacks sufficient evidence. In this paper, we propose AutoRepair, a novel automatic multi-source data repairing approach that enriches the evidence by combining the advantages of truth discovery and data repairing. We use functional dependencies, one of the most common types of constraints, to detect violations, and use source reliability as evidence to discover and repair the errors among these violations. At the same time, the repaired results are used to estimate the source reliability. As the source reliability is unknown in advance, we model the process as an iterative framework to ensure better performance. Extensive experiments are conducted on both simulated and real-world datasets. The results clearly demonstrate the advantages of our approach, which outperforms both recent truth discovery and rule-based data repairing methods.
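
The iterative idea described in the abstract can be illustrated with a small sketch. This is not the AutoRepair algorithm itself, whose details are not given here. It is a simplified stand-in that assumes a single functional dependency, pooled weighted voting as the repair rule, and smoothed accuracy as the reliability estimate. The function names (`pick`, `iterative_repair`), the claim layout, and the toy data are all hypothetical.

```python
from collections import defaultdict


def pick(scores):
    """Return the highest-scoring value; ties go to the smallest value."""
    return max(sorted(scores), key=scores.get)


def iterative_repair(claims, fd, max_iters=10):
    """Toy loop: vote, repair FD violations, re-estimate source reliability.

    claims: list of (source, entity, attribute, value) tuples (assumed layout).
    fd: (lhs_attr, rhs_attr), meaning lhs_attr -> rhs_attr.
    """
    lhs, rhs = fd
    sources = sorted({s for s, _, _, _ in claims})
    weights = {s: 1.0 for s in sources}  # reliability unknown: start uniform
    repaired = {}
    for _ in range(max_iters):
        # Truth discovery step: reliability-weighted vote per cell.
        votes = defaultdict(lambda: defaultdict(float))
        for s, e, a, v in claims:
            votes[(e, a)][v] += weights[s]
        truth = {cell: pick(vals) for cell, vals in votes.items()}

        # Repair step: entities sharing an lhs value must share the rhs value,
        # so pool the weighted votes of the whole group (assumed repair rule).
        groups = defaultdict(list)
        for (e, a), v in truth.items():
            if a == lhs:
                groups[v].append(e)
        for members in groups.values():
            pooled = defaultdict(float)
            for e in members:
                for v, w in votes.get((e, rhs), {}).items():
                    pooled[v] += w
            if pooled:
                best = pick(pooled)
                for e in members:
                    if (e, rhs) in truth:
                        truth[(e, rhs)] = best

        # Reliability step: smoothed share of claims that match the repairs.
        hits, total = defaultdict(int), defaultdict(int)
        for s, e, a, v in claims:
            total[s] += 1
            hits[s] += truth[(e, a)] == v
        weights = {s: (hits[s] + 1) / (total[s] + 2) for s in sources}

        if truth == repaired:  # repaired values stopped changing
            break
        repaired = truth
    return repaired, weights


# Hypothetical toy data: three sources, two records, FD zip -> city.
claims = [
    ("A", "r1", "zip", "10001"), ("A", "r1", "city", "New York"),
    ("A", "r2", "zip", "10001"), ("A", "r2", "city", "New York"),
    ("B", "r1", "zip", "10001"), ("B", "r1", "city", "New York"),
    ("B", "r2", "zip", "10001"), ("B", "r2", "city", "Newark"),
    ("C", "r1", "zip", "10001"), ("C", "r1", "city", "New York"),
    ("C", "r2", "zip", "10001"), ("C", "r2", "city", "Newark"),
]
repaired, weights = iterative_repair(claims, fd=("zip", "city"))
print(repaired[("r2", "city")], weights)
```

In this toy input, a plain majority vote on record `r2` alone would choose "Newark", which violates zip -> city against `r1`. Pooling the evidence across both records that share the zip code yields "New York", and source A, which agrees with every repaired value, ends up with a higher weight than B and C.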

Notes

  1. We use functional dependencies as the constraint type because of their importance in improving data quality. However, other types of constraints can also be adopted.

  2. To reduce the influence of incomplete tuples on the results, attributes with a large number of missing values are removed.

  3. http://www.hospitalcompare.hhs.gov/.

  4. http://data.gov.uk/data.

  5. We set the number of simulated sources to 5, as it is relatively easy to acquire information from 5 sources in real-world settings, and this information is sufficient for our multi-source data repairing.

  6. https://opendata.cityofnewyork.us/.

  7. https://www.yelp.com/nyc.

  8. https://www.yellowpages.com/.

  9. http://www.nychealthratings.com/.

  10. http://test.supertour.com/newyork-ny.us.aspx.

  11. https://tools.usps.com/go/ZipLookupAction.

Acknowledgements

This paper was partially supported by NSFC Grants U1509216, U1866602, the National Key Research and Development Program of China 2016YFB1000703, NSFC Grants 61472099, 61602129, NSF IIS 1553411, and the Chinese Scholarship Council Funding (No. 201606120227).

Author information

Corresponding author

Correspondence to Hongzhi Wang.

About this article

Cite this article

Ye, C., Li, Q., Zhang, H. et al. AutoRepair: an automatic repairing approach over multi-source data. Knowl Inf Syst 61, 227–257 (2019). https://doi.org/10.1007/s10115-018-1284-9

  • DOI: https://doi.org/10.1007/s10115-018-1284-9
