ScaLeKB: scalable learning and inference over large knowledge bases

Chen, Yang; Wang, Daisy Zhe; Goldberg, Sean

doi:10.1007/s00778-016-0444-3

Regular Paper
Published: 31 October 2016

Volume 25, pages 893–918, (2016)

The VLDB Journal

Abstract

Recent years have seen a drastic rise in the construction of web knowledge bases (e.g., Freebase, YAGO, DBpedia). These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to the limitations of human knowledge, web corpora, and information extraction algorithms, the knowledge bases are still far from complete. To infer the missing knowledge, we propose the Ontological Pathfinding (OP) algorithm to mine first-order inference rules from these web knowledge bases. The OP algorithm scales up via a series of optimization techniques: a new parallel rule mining algorithm, a pruning strategy that eliminates unsound and inefficient rules before they are applied, and a novel partitioning algorithm that breaks the learning task into smaller independent sub-tasks. Combining these techniques, we develop the first rule mining system that scales to Freebase, the largest public knowledge base, with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 h; no existing system achieves this scale.
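As a toy illustration of what mining a first-order inference rule from facts can look like, the sketch below enumerates chain-shaped rules head(x, y) <- q(x, z), r(z, y) over a tiny in-memory knowledge base and scores each by support and confidence. It is not the OP algorithm: the facts, the predicate names, the function names, the brute-force enumeration, and the thresholds are all assumptions made for illustration, and none of the parallelization, pruning, or partitioning described above is attempted.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy knowledge base of (predicate, subject, object) facts.
FACTS = {
    ("bornIn", "ann", "paris"),
    ("bornIn", "bob", "lyon"),
    ("bornIn", "cy", "rome"),
    ("cityOf", "paris", "france"),
    ("cityOf", "lyon", "france"),
    ("cityOf", "rome", "italy"),
    ("citizenOf", "ann", "france"),
    ("citizenOf", "bob", "france"),
}


def index_by_predicate(facts):
    """Group (subject, object) pairs under their predicate."""
    idx = defaultdict(set)
    for pred, subj, obj in facts:
        idx[pred].add((subj, obj))
    return idx


def body_pairs(idx, q, r):
    """Pairs (x, y) such that q(x, z) and r(z, y) both hold for some z."""
    r_by_subject = defaultdict(set)
    for z, y in idx[r]:
        r_by_subject[z].add(y)
    return {(x, y) for x, z in idx[q] for y in r_by_subject.get(z, ())}


def mine_chain_rules(facts, min_support=1, min_confidence=0.5):
    """Brute-force search over rules head(x, y) <- q(x, z), r(z, y)."""
    idx = index_by_predicate(facts)
    preds = sorted(idx)
    rules = []
    for head, q, r in product(preds, repeat=3):
        if head in (q, r):
            continue  # skip rules whose head also appears in the body
        derived = body_pairs(idx, q, r)
        if not derived:
            continue
        # Support: body instantiations whose head fact is already known.
        support = len(derived & idx[head])
        # Confidence: fraction of body instantiations confirmed by the KB.
        confidence = support / len(derived)
        if support >= min_support and confidence >= min_confidence:
            rules.append((head, q, r, support, confidence))
    return rules


RULES = mine_chain_rules(FACTS)
for head, q, r, support, confidence in RULES:
    print(f"{head}(x, y) <- {q}(x, z), {r}(z, y)  "
          f"support={support} confidence={confidence:.2f}")
```

On this toy data the only rule that survives is citizenOf(x, y) <- bornIn(x, z), cityOf(z, y). The cubic enumeration over predicates is exactly the kind of cost that makes a naive search impractical at Freebase scale.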

Based on the mining algorithm and the optimizations, we develop an efficient inference engine, with which we infer 0.9 billion new facts from Freebase in 17.19 h. We use cross validation to evaluate the inferred facts and estimate a degree of expansion of 0.6 over Freebase, with a precision approaching 1.0. Our approach outperforms state-of-the-art mining algorithms and inference engines in terms of both performance and quality.
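A minimal sketch of the inference step, under stated assumptions: one chain-shaped rule is applied to a toy set of training facts, and the newly derived facts are checked against a small held-out set. The facts, the rule, and the names `apply_chain_rule`, `TRAIN`, and `HELD_OUT` are hypothetical. The abstract does not give the cross-validation protocol or the definition of degree of expansion, so the sketch only counts how many inferred facts the held-out set confirms and does not reproduce either metric.

```python
from collections import defaultdict


def apply_chain_rule(facts, head, q, r):
    """Derive head(x, y) from q(x, z), r(z, y); return only facts not already known."""
    r_by_subject = defaultdict(set)
    for pred, subj, obj in facts:
        if pred == r:
            r_by_subject[subj].add(obj)
    derived = {
        (head, x, y)
        for pred, x, z in facts
        if pred == q
        for y in r_by_subject.get(z, ())
    }
    return derived - set(facts)


# Hypothetical training facts; one citizenOf fact has been withheld.
TRAIN = {
    ("bornIn", "ann", "paris"),
    ("bornIn", "bob", "lyon"),
    ("bornIn", "cy", "rome"),
    ("cityOf", "paris", "france"),
    ("cityOf", "lyon", "france"),
    ("cityOf", "rome", "italy"),
    ("citizenOf", "ann", "france"),
}
HELD_OUT = {("citizenOf", "bob", "france")}

# Apply the illustrative rule citizenOf(x, y) <- bornIn(x, z), cityOf(z, y).
NEW_FACTS = apply_chain_rule(TRAIN, "citizenOf", "bornIn", "cityOf")
CONFIRMED = NEW_FACTS & HELD_OUT

print(sorted(NEW_FACTS))
print(f"{len(CONFIRMED)} of {len(NEW_FACTS)} inferred facts appear in the held-out set")
```

An inferred fact that is absent from the held-out set is not necessarily wrong, since the knowledge base is incomplete by assumption; that is why a held-out count like this one is only a rough proxy for precision.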

Notes

http://dsr.cise.ufl.edu/projects/probkb-web-scale-probabilistic-knowledge-base.
In Freebase, domains are used to conceptually organize the types. We do not use this terminology elsewhere in the paper.

Acknowledgments

This work was partially supported by NSF IIS Award # 1526753, DARPA under FA8750-12-2-0348-2 (DEFT/CUBISM), and a generous gift from Google. We also thank Dr. Milenko Petrovic and Dr. Alin Dobra for the helpful discussions on query optimization.

Author information

Authors and Affiliations

Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, 32608, USA
Yang Chen, Daisy Zhe Wang & Sean Goldberg

Corresponding author

Correspondence to Yang Chen.

About this article

Cite this article

Chen, Y., Wang, D.Z. & Goldberg, S. ScaLeKB: scalable learning and inference over large knowledge bases. The VLDB Journal 25, 893–918 (2016). https://doi.org/10.1007/s00778-016-0444-3

Received: 03 December 2015
Revised: 29 September 2016
Accepted: 11 October 2016
Published: 31 October 2016
Issue Date: December 2016
DOI: https://doi.org/10.1007/s00778-016-0444-3
