ABSTRACT
Information extraction and human collaboration techniques are widely applied in the construction of web-scale knowledge bases. However, these knowledge bases are often incomplete or uncertain. In this paper, we present ProbKB, a probabilistic knowledge base designed to infer missing facts in a scalable, probabilistic, and principled manner using a relational DBMS. We make three contributions to achieve scalability and high quality: 1) we present a formal definition and a novel relational model for probabilistic knowledge bases, which allows an efficient SQL-based inference algorithm for knowledge expansion that applies inference rules in batches; 2) we implement ProbKB on massively parallel processing databases to achieve further scalability; and 3) we combine several quality control methods that identify erroneous rules, erroneous facts, and ambiguous entities to improve the precision of inferred facts. Our experiments show that the ProbKB system outperforms the state-of-the-art inference engine in terms of both performance and quality.
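The idea of applying inference rules in batches with SQL can be illustrated with a minimal sketch. It is not ProbKB's actual schema or algorithm: the `facts` and `rules` tables, their columns, and the example relations are hypothetical, the rules are restricted to the form head(x, z) :- body1(x, y), body2(y, z), and rule weights and fact probabilities are ignored. The point is only that storing rules as rows lets a single join apply every rule of the same shape at once, rather than issuing one query per rule.

```python
import sqlite3


def build_demo():
    """Create a tiny in-memory knowledge base (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE facts (rel TEXT, x TEXT, y TEXT, PRIMARY KEY (rel, x, y));
        -- Each row encodes one rule: head(x, z) :- body1(x, y), body2(y, z)
        CREATE TABLE rules (head TEXT, body1 TEXT, body2 TEXT);
    """)
    conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", [
        ("born_in_city", "alice", "gainesville"),
        ("city_in_state", "gainesville", "florida"),
        ("state_in_country", "florida", "usa"),
    ])
    conn.executemany("INSERT INTO rules VALUES (?, ?, ?)", [
        ("born_in_state", "born_in_city", "city_in_state"),
        ("born_in_country", "born_in_state", "state_in_country"),
    ])
    return conn


def count_facts(conn):
    return conn.execute("SELECT COUNT(*) FROM facts").fetchone()[0]


def expand_once(conn):
    """Apply all rules to all matching fact pairs in one join; return new-fact count."""
    before = count_facts(conn)
    conn.execute("""
        INSERT OR IGNORE INTO facts (rel, x, y)
        SELECT r.head, f1.x, f2.y
        FROM rules r
        JOIN facts f1 ON f1.rel = r.body1
        JOIN facts f2 ON f2.rel = r.body2 AND f2.x = f1.y
    """)
    return count_facts(conn) - before


def expand_to_fixpoint(conn):
    """Repeat the batch step until no new facts are derived; return total added."""
    total = 0
    while True:
        added = expand_once(conn)
        if added == 0:
            return total
        total += added


if __name__ == "__main__":
    conn = build_demo()
    print("facts added:", expand_to_fixpoint(conn))
    for row in conn.execute("SELECT rel, x, y FROM facts ORDER BY rel"):
        print(row)
```

In this sketch the primary key on `facts` together with `INSERT OR IGNORE` keeps already-known facts from being re-inserted, so the loop stops once an iteration derives nothing new.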
Index Terms
- Knowledge expansion over probabilistic knowledge bases