Skip to main content
Log in

Distilling relations using knowledge bases

  • Regular Paper
  • Published:
The VLDB Journal Aims and scope Submit manuscript

Abstract

Given a relational table, we study the problem of detecting and repairing erroneous data, as well as marking correct data, using well curated knowledge bases (KBs). We propose detective rules (DRs), a new type of data cleaning rules that can make actionable decisions on relational data, by building connections between a relation and a KB. The main invention is that a DR simultaneously models two opposite semantics of an attribute belonging to a relation using types and relationships in a KB: The positive semantics explains how its value should be linked to other attribute values in a correct tuple, and the negative semantics indicate how a wrong attribute value is connected to other correct attribute values within the same tuple. Naturally, a DR can mark correct values in a tuple if it matches the positive semantics. Meanwhile, a DR can detect/repair an error if it matches the negative semantics. We study fundamental problems associated with DRs, e.g., rule consistency and rule implication. We present efficient algorithms to apply DRs to clean a relation, based on rule order selection and inverted indexes. Moreover, we discuss approaches on how to generate DRs from examples. Extensive experiments, using both real-world and synthetic datasets, verify the effectiveness and efficiency of applying DRs in practice.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13

Similar content being viewed by others

Notes

  1. Edit distance of two instances is the minimum number of edit transformations from one to the other, where the edit operations include insertion, deletion and substitution. For example \(\mathsf {ED} (\mathsf {Chemistry}, \mathsf {Chamstry})=2\).

  2. https://www.cse.iitb.ac.in/~sunita/wwt/

  3. http://wiki.freebase.com/wiki/WEX

  4. https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country

  5. https://en.wikipedia.org/wiki/List_of_countries_by_Nobel_laureates_per_capita

  6. http://sherlock.ics.uci.edu/data.html

References

  1. Abedjan, Z., Chu, X., Deng, D., Fernandez, R.C., Ilyas, I.F., Ouzzani, M., Papotti, P., Stonebraker, M., Tang, N.: Detecting data errors: where are we and what needs to be done? PVLDB 9(12), 993–1004 (2016)

    Google Scholar 

  2. Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley, Boston (1995)

    MATH  Google Scholar 

  3. Anchuri, P., Zaki, M.J., Barkol, O., Golan, S., Shamy, M.: Approximate graph mining with label costs. In: KDD, pp. 518–526 (2013)

  4. Arenas, M., Bertossi, L.E., Chomicki, J.: Consistent query answers in inconsistent databases. In: SIGMOD, pp. 68–79. ACM (1999)

  5. Bach, S.H., Broecheler, M., Huang, B., Getoor, L.: Hinge-loss markov random fields and probabilistic soft logic. CoRR, arXiv:1505.04406 (2015)

  6. Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S.E., Widom, J.: Swoosh: a generic approach to entity resolution. VLDB J. 18(1), 255–276 (2009)

    Article  Google Scholar 

  7. Bohannon, P., Fan, W., Flaster, M., Rastogi, R.: A cost-based model and effective heuristic for repairing constraints by value modification. In: SIGMOD (2005)

  8. Chai, C., Li, G., Li, J., Deng, D., Feng, J.: Cost-effective crowdsourced entity resolution: a partial-order approach. In: SIGMOD, pp. 969–984 (2016)

  9. Chiang, F., Miller, R.J.: A unified model for data and constraint repair. In: ICDE (2011)

  10. Chu, X., Ilyas, I.F., Papotti, P.: Holistic data cleaning: putting violations into context. In: ICDE (2013)

  11. Chu, X., Morcos, J., Ilyas, I.F., Ouzzani, M., Papotti, P., Tang, N., Ye, Y.: KATARA: a data cleaning system powered by knowledge bases and crowdsourcing. In: SIGMOD (2015)

  12. Cong, G., Fan, W., Geerts, F., Jia, X., Ma, S.: Improving data quality: consistency and accuracy. In: VLDB (2007)

  13. Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid, A.K., Ilyas, I.F., Ouzzani, M., Tang, N.: NADEEF: a commodity data cleaning system. In: SIGMOD (2013)

  14. Deng, D., Jiang, Y., Li, G., Li, J., Yu, C.: Scalable column concept determination for web tables using large knowledge bases. PVLDB 6(13), 1606–1617 (2013)

    Google Scholar 

  15. Deng, D., Li, G., Wen, H., Feng, J.: An efficient partition based method for exact set similarity joins. PVLDB 9(4), 360–371 (2015)

    Google Scholar 

  16. Deshpande, O., Lamba, D.S., Tourn, M., Das, S., Subramaniam, S., Rajaraman, A., Harinarayan, V., Doan, A.: Building, maintaining, and using knowledge bases: a report from the trenches. In: SIGMOD Conference (2013)

  17. Dong, X., Gabrilovich, E., Heitz, G., Horn, W., Lao, N., Murphy, K., Strohmann, T., Sun, S., Zhang, W.: Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In: SIGKDD (2014)

  18. Dong, X.L., Gabrilovich, E., Heitz, G., Horn, W., Murphy, K., Sun, S., Zhang, W.: From data fusion to knowledge fusion. PVLDB 7(10), 881–892 (2014)

    Google Scholar 

  19. Fan, W.: Dependencies revisited for improving data quality. In: PODS (2008)

  20. Fan, W., Fan, Z., Tian, C., Dong, X.L.: Keys for graphs. PVLDB 8(12), 1590–1601 (2015)

    Google Scholar 

  21. Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: Conditional functional dependencies for capturing data inconsistencies. TODS 33(2), 6 (2008)

    Article  Google Scholar 

  22. Fan, W., Jia, X., Li, J., Ma, S.: Reasoning about record matching rules. PVLDB 2(1), 407–418 (2009)

    Google Scholar 

  23. Fan, W., Li, J., Ma, S., Tang, N., Yu, W.: Towards certain fixes with editing rules and master data. VLDB J. 21(2), 213–238 (2012)

    Article  Google Scholar 

  24. Feng, J., Wang, J., Li, G.: Trie-join: a trie-based method for efficient string similarity joins. VLDB J. 21(4), 437–461 (2012)

    Article  Google Scholar 

  25. Geerts, F., Mecca, G., Papotti, P., Santoro, D.: The LLUNATIC data-cleaning framework. PVLDB 6(9), 625–636 (2013)

    Google Scholar 

  26. Hao, S., Tang, N., Li, G., Li, J.: Cleaning relations using knowledge bases. In: ICDE (2017)

  27. He, J., Veltri, E., Santoro, D., Li, G., Mecca, G., Papotti, P., Tang, N.: Interactive and deterministic data cleaning. In: SIGMOD (2016)

  28. Heer, J., Hellerstein, J.M., Kandel, S.: Predictive interaction for data transformation. In: CIDR (2015)

  29. Herzog, T.N., Scheuren, F.J., Winkler, W.E.: Data Quality and Record Linkage Techniques. Springer, Berlin (2009)

    MATH  Google Scholar 

  30. Hoffart, J., Suchanek, F.M., Berberich, K., Weikum, G.: YAGO2: A spatially and temporally enhanced knowledge base from wikipedia. Artif. Intell. 194, 28–61 (2013)

    Article  MathSciNet  MATH  Google Scholar 

  31. Interlandi, M., Tang, N.: Proof positive and negative in data cleaning. In: ICDE (2015)

  32. Jiang, Y., Li, G., Feng, J., Li, W.: String similarity joins: an experimental evaluation. PVLDB 7(8), 625–636 (2014)

    Google Scholar 

  33. Khayyat, Z., Ilyas, I.F., Jindal, A., Madden, S., Ouzzani, M., Papotti, P., Quiané-Ruiz, J.-A., Tang, N., Yin, S.: Bigdansing: a system for big data cleansing. In: SIGMOD (2015)

  34. Li, G.: A human-machine method for web table understanding. In: WAIM, pp. 179–189 (2013)

  35. Li, G.: Human-in-the-loop data integration. PVLDB 10(12), 2006–2017 (2017)

    Google Scholar 

  36. Li, G., Chai, C., Fan, J., Weng, X., Li, J., Zheng, Y., Li, Y., Yu, X., Zhang, X., Yuan, H.: CDB: optimizing queries with crowd-based selections and joins. In: SIGMOD, pp. 1463–1478 (2017)

  37. Li, G., Deng, D., Wang, J., Feng, J.: PASS-JOIN: a partition-based method for similarity joins. PVLDB 5(3), 253–264 (2011)

    Google Scholar 

  38. Li, G., Ooi, B.C., Feng, J., Wang, J., Zhou, L.: EASE: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data. In: SIGMOD, pp. 903–914 (2008)

  39. Li, G., Wang, J., Zheng, Y., Franklin, M.J.: Crowdsourced data management: a survey. IEEE Trans. Knowl. Data Eng. 28(9), 2296–2319 (2016)

    Article  Google Scholar 

  40. Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. PVLDB 3(12), 1338–1347 (2010)

    Google Scholar 

  41. Morsey, M., Lehmann, J., Auer, S., Ngomo, A.N.: Dbpedia SPARQL benchmark—performance assessment with real queries on real data. In: ISWC (2011)

  42. Niu, F., Ré, C., Doan, A., Shavlik, J.W.: Tuffy: Scaling up statistical inference in markov logic networks using an RDBMS. PVLDB 4(6), 373–384 (2011)

    Google Scholar 

  43. Raman, V., Hellerstein, J.M.: Potter’s wheel: an interactive data cleaning system. In: VLDB (2001)

  44. Rekatsinas, T., Chu, X., Ilyas, I.F., Ré, C.: Holoclean Holistic data repairs with probabilistic inference. PVLDB 10(11), 1190–1201 (2017)

    Google Scholar 

  45. Shang, Z., Liu, Y., Li, G., Feng, J.: K-join: knowledge-aware similarity join. IEEE Trans. Knowl. Data Eng. 28(12), 3293–3308 (2016)

    Article  Google Scholar 

  46. Shin, J., Wu, S., Wang, F., Sa, C.D., Zhang, C., Ré, C.: Incremental knowledge base construction using deepdive. PVLDB 8(11), 1310–1321 (2015)

    Google Scholar 

  47. Singh, R., Meduri, V., Elmagarmid, A.K., Madden, S., Papotti, P., Quiané-Ruiz, J., Solar-Lezama, A., Tang, N.: Generating concise entity matching rules. In: PVLDB (2017)

  48. Singh, R., Meduri, V., Elmagarmid, A.K., Madden, S., Papotti, P., Quiané-Ruiz, J., Solar-Lezama, A., Tang, N.: Synthesizing entity matching rules by examples. In: SIGMOD demo (2017)

  49. Song, S., Cheng, H., Yu, J.X., Chen, L.: Repairing vertex labels under neighborhood constraints. PVLDB 7(11), 987–998 (2014)

    Google Scholar 

  50. Venetis, P., Halevy, A.Y., Madhavan, J., Pasca, M., Shen, W., Wu, F., Miao, G., Wu, C.: Recovering semantics of tables on the web. PVLDB 4(9), 528–538 (2011)

    Google Scholar 

  51. Volkovs, M., Chiang, F., Szlichta, J., Miller, R.J.: Continuous data cleaning. In: ICDE (2014)

  52. Wang, J., Li, G., Feng, J.: Trie-join: efficient trie-based string similarity joins with edit-distance constraints. PVLDB 3(1), 1219–1230 (2010)

    Google Scholar 

  53. Wang, J., Li, G., Feng, J.: Fast-join: an efficient method for fuzzy token matching based string similarity join. In: Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11–16, 2011, Hannover, Germany, pp. 458–469 (2011)

  54. Wang, J., Li, G., Kraska, T., Franklin, M.J., Feng, J.: Leveraging transitive relations for crowdsourced joins. In: SIGMOD, pp. 229–240 (2013)

  55. Wang, J., Tang, N.: Towards dependable data repairing with fixing rules. In: SIGMOD (2014)

  56. Yakout, M., Berti-Equille, L., Elmagarmid, A.K.: Don’t be scared: use scalable automatic repairing with maximal likelihood and bounded changes. In: SIGMOD (2013)

  57. Yakout, M., Elmagarmid, A.K., Neville, J., Ouzzani, M., Ilyas, I.F.: Guided data repair. PVLDB 4(5), 279–289 (2011)

    Google Scholar 

  58. Yu, M., Wang, J., Li, G., Zhang, Y., Deng, D., Feng, J.: A unified framework for string similarity search with edit-distance constraint. VLDB J. 26(2), 249–274 (2017)

    Article  Google Scholar 

  59. Zhuang, Y., Li, G., Feng, Z.Z.J.: Hike: a hybrid human-machine method for entity alignment in large-scale knowledge bases. In: CIKM (2017)

  60. Zhuang, Y., Li, G., Zhong, Z., Feng, J.: PBA: partition and blocking based alignment for large knowledge bases. In: DASFAA, pp. 415–431 (2016)

Download references

Acknowledgements

This work was supported by the 973 Program of China (2015CB358700), NSF of China (61632016, 61472198, 61521002, 61661166012), and TAL education.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Guoliang Li.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Hao, S., Tang, N., Li, G. et al. Distilling relations using knowledge bases. The VLDB Journal 27, 497–519 (2018). https://doi.org/10.1007/s00778-018-0506-9

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00778-018-0506-9

Keywords

Navigation