ScaLeKB: scalable learning and inference over large knowledge bases

  • Regular Paper
The VLDB Journal

Abstract

Recent years have seen a drastic rise in the construction of web knowledge bases (e.g., Freebase, YAGO, DBPedia). These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to the limitations of human knowledge, web corpora, and information extraction algorithms, the knowledge bases are still far from complete. To infer the missing knowledge, we propose the Ontological Pathfinding (OP) algorithm to mine first-order inference rules from these web knowledge bases. The OP algorithm scales up via a series of optimization techniques, including a new parallel rule mining algorithm, a pruning strategy that eliminates unsound and inefficient rules before they are applied, and a novel partitioning algorithm that breaks the learning task into smaller independent sub-tasks. Combining these techniques, we develop the first rule mining system that scales to Freebase, the largest public knowledge base, with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 h; no existing system achieves this scale.
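To make the notion of mining a first-order inference rule concrete, the following is a minimal sketch, not the OP algorithm itself. The abstract does not state the rule form or the scoring metric, so the sketch assumes a length-2 Horn rule head(x, y) ← q(x, z), r(z, y) and a commonly used confidence measure (correct predictions divided by all predictions). The toy facts, predicate names, and function names are hypothetical.

```python
from collections import defaultdict

# Toy knowledge base of (predicate, subject, object) facts.
# All entities and predicates are hypothetical illustrations.
FACTS = {
    ("bornIn", "alice", "paris"),
    ("bornIn", "bob", "lyon"),
    ("bornIn", "carol", "rome"),
    ("cityOf", "paris", "france"),
    ("cityOf", "lyon", "france"),
    ("cityOf", "rome", "italy"),
    ("nationality", "alice", "france"),
    ("nationality", "carol", "italy"),
}


def index_by_predicate(facts):
    """Group (subject, object) pairs under their predicate."""
    idx = defaultdict(set)
    for pred, subj, obj in facts:
        idx[pred].add((subj, obj))
    return idx


def body_groundings(idx, q, r):
    """Return the (x, y) pairs with q(x, z) and r(z, y) for some z (a hash join on z)."""
    r_by_subject = defaultdict(set)
    for z, y in idx[r]:
        r_by_subject[z].add(y)
    return {(x, y) for x, z in idx[q] for y in r_by_subject[z]}


def score_rule(facts, head, q, r):
    """Score the assumed rule head(x, y) <- q(x, z), r(z, y).

    support    = predictions that are already facts in the knowledge base
    confidence = support / all predictions (an assumed, standard metric)
    """
    idx = index_by_predicate(facts)
    predicted = body_groundings(idx, q, r)
    support = len(predicted & idx[head])
    confidence = support / len(predicted) if predicted else 0.0
    return support, confidence


support, confidence = score_rule(FACTS, "nationality", "bornIn", "cityOf")
print(support, round(confidence, 3))
```

In this toy setting, a pruning step of the kind the abstract mentions could discard candidate rules whose confidence falls below a chosen threshold; the actual pruning criteria used by OP are not given in the abstract.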

Based on the mining algorithm and the optimizations, we develop an efficient inference engine. With it, we infer 0.9 billion new facts from Freebase in 17.19 h. We use cross validation to evaluate the inferred facts and estimate a degree of expansion of 0.6 over Freebase, with a precision approaching 1.0. Our approach outperforms state-of-the-art mining algorithms and inference engines in terms of both performance and quality.
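As a rough illustration of what applying mined rules to infer new facts can look like, here is a small forward-chaining sketch. It is not the paper's inference engine: it assumes rules of the form head(x, y) ← q(x, z), r(z, y), evaluates them with an in-memory join, and repeats until no new facts appear. The facts, rules, and function names are hypothetical.

```python
def apply_rule(facts, head, q, r):
    """Derive head(x, y) from q(x, z), r(z, y); return only facts not already known."""
    r_by_subject = {}
    for pred, z, y in facts:
        if pred == r:
            r_by_subject.setdefault(z, set()).add(y)
    derived = set()
    for pred, x, z in facts:
        if pred == q:
            for y in r_by_subject.get(z, ()):
                derived.add((head, x, y))
    return derived - set(facts)


def infer(facts, rules, max_rounds=10):
    """Apply all rules repeatedly until a fixpoint (or max_rounds) is reached."""
    known = set(facts)
    for _ in range(max_rounds):
        new = set()
        for head, q, r in rules:
            new |= apply_rule(known, head, q, r)
        if not new:
            break
        known |= new
    return known - set(facts)


# Hypothetical toy knowledge base and rules, each rule given as (head, q, r).
facts = {
    ("bornIn", "bob", "lyon"),
    ("cityOf", "lyon", "france"),
    ("partOf", "france", "europe"),
}
rules = [
    ("nationality", "bornIn", "cityOf"),
    ("continentOf", "nationality", "partOf"),
]

inferred = infer(facts, rules)
print(sorted(inferred))
```

The second rule only fires after the first has produced a nationality fact, which is why the sketch iterates in rounds rather than applying each rule once.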


Notes

  1. http://dsr.cise.ufl.edu/projects/probkb-web-scale-probabilistic-knowledge-base.

  2. In Freebase, domains are used to conceptually organize the types. We do not use this terminology elsewhere in the paper.


Acknowledgments

This work was partially supported by NSF IIS Award # 1526753, DARPA under FA8750-12-2-0348-2 (DEFT/CUBISM), and a generous gift from Google. We also thank Dr. Milenko Petrovic and Dr. Alin Dobra for the helpful discussions on query optimization.

Author information


Correspondence to Yang Chen.

About this article

Cite this article

Chen, Y., Wang, D.Z. & Goldberg, S. ScaLeKB: scalable learning and inference over large knowledge bases. The VLDB Journal 25, 893–918 (2016). https://doi.org/10.1007/s00778-016-0444-3

  • DOI: https://doi.org/10.1007/s00778-016-0444-3
