Knowledge expansion over probabilistic knowledge bases

Authors:
Yang Chen, University of Florida, Gainesville, FL, USA
Daisy Zhe Wang, University of Florida, Gainesville, FL, USA

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, June 2014
Pages 649–660
https://doi.org/10.1145/2588555.2610516

Published: 18 June 2014

ABSTRACT

Information extraction and human collaboration techniques are widely applied in the construction of web-scale knowledge bases. However, these knowledge bases are often incomplete or uncertain. In this paper, we present ProbKB, a probabilistic knowledge base designed to infer missing facts in a scalable, probabilistic, and principled manner using a relational DBMS. The novel contributions we make to achieve scalability and high quality are: 1) we present a formal definition and a novel relational model for probabilistic knowledge bases, which allows an efficient SQL-based inference algorithm for knowledge expansion that applies inference rules in batches; 2) we implement ProbKB on massively parallel processing (MPP) databases to achieve further scalability; and 3) we combine several quality control methods that identify erroneous rules, facts, and ambiguous entities to improve the precision of inferred facts. Our experiments show that the ProbKB system outperforms the state-of-the-art inference engine in terms of both performance and quality.
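To make the idea of applying inference rules in batches concrete, here is a minimal sketch using Python's built-in sqlite3 module. The schema (a `facts` table and a `rules` table), the single rule shape head(x, y) <- body(x, y), and all relation names are assumptions made for illustration; they are not ProbKB's actual relational model. The point is only that one join can apply every rule of a given shape at once instead of issuing one query per rule.

```python
import sqlite3


def build_kb():
    """Create a toy knowledge base in an in-memory SQLite database.

    Hypothetical schema: facts(rel, x, y) holds binary facts, and
    rules(head, body) holds rules of the shape head(x, y) <- body(x, y).
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE facts (rel TEXT, x TEXT, y TEXT, PRIMARY KEY (rel, x, y));
        CREATE TABLE rules (head TEXT, body TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO facts VALUES (?, ?, ?)",
        [("born_in", "Ann", "Paris"), ("capital_of", "Paris", "France")],
    )
    conn.executemany(
        "INSERT INTO rules VALUES (?, ?)",
        [("from_place", "born_in"), ("located_in", "capital_of")],
    )
    return conn


def count_facts(conn):
    """Return the current number of rows in the facts table."""
    return conn.execute("SELECT COUNT(*) FROM facts").fetchone()[0]


def apply_rules_batch(conn):
    """Apply all stored rules with a single join; return how many facts were added."""
    before = count_facts(conn)
    conn.execute(
        """
        INSERT OR IGNORE INTO facts (rel, x, y)
        SELECT r.head, f.x, f.y
        FROM rules AS r
        JOIN facts AS f ON f.rel = r.body
        """
    )
    return count_facts(conn) - before
```

Calling `apply_rules_batch(build_kb())` adds one inferred fact per matching rule in one statement; the primary key together with `INSERT OR IGNORE` keeps repeated applications from duplicating facts.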

Index Terms

Knowledge expansion over probabilistic knowledge bases
1. Computing methodologies
  1. Artificial intelligence
    1. Knowledge representation and reasoning
  2. Machine learning
    1. Machine learning approaches
      1. Rule learning

Reviews

Reviewer: Vincent J Kovarik

Extrapolating implicit facts from existing data using logic and rules enables the construction of a more semantically complete knowledge base. The authors extend previous approaches based on Markov logic networks (MLNs) by defining a probabilistic knowledge base consisting of a set of entities, class relations, and weighted facts. The work focuses on two areas: improving grounding efficiency by applying inference rules in batches in a relational database management system (DBMS), and identifying and recovering from errors in the grounding process, which inhibits their propagation along the inference chain. The knowledge elements are represented as a collection of relational database tables, which enables a structured query language (SQL)-based inference algorithm to perform the knowledge expansion and construct the MLN graphs in batches. An MLN makes it possible to compute the probability, or degree of certainty, of an inferred fact from the network structure and link probabilities. Semantic constraints are defined within the process as first-order formulas with infinite weight, each defining a fact or assertion that must be satisfied by all possible combinations of rules within the system. Thus, semantic constraints enable the identification of potential errors in assertions due to inconsistent or incorrect rules. This is a beneficial capability in any system that attempts to extrapolate or infer new information because it provides a compensation mechanism for incorrect information and ambiguous rules. Inconsistencies and conflicts are identified through the construction of ground factor graphs.
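For context, the role of rule weights and of infinite-weight constraints can be read off the standard Markov logic network definition. The formula below is the general textbook form, not an equation quoted from the review or the paper, and the symbols are chosen here for illustration: $x$ is a possible world (a truth assignment to all ground atoms), $w_i$ is the weight of formula $i$, and $n_i(x)$ is the number of groundings of formula $i$ that are true in $x$.

```latex
P(X = x) \;=\; \frac{1}{Z}\,\exp\!\Big(\sum_{i} w_i\, n_i(x)\Big),
\qquad
Z \;=\; \sum_{x'} \exp\!\Big(\sum_{i} w_i\, n_i(x')\Big)
```

As $w_i \to \infty$, any world that violates a grounding of formula $i$ has probability tending to zero relative to worlds that satisfy all of its groundings, provided at least one such world exists. This is the sense in which an infinite-weight formula acts as a hard constraint.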
The grounding algorithm that constructs the graphs consists of two steps: 1) compute the ground atoms, which comprise both given and inferred facts, until the transitive closure is reached, and 2) “apply the rules again to construct the ground factors.” The paper's experiments provide quantitative data: the parallel inference algorithm using SQL-based expressions is shown to improve the performance of the inference process over sequential algorithms.

Online Computing Reviews Service
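The two grounding steps described in the review can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the paper's SQL implementation: facts are `(relation, x, y)` tuples, every rule is a hypothetical length-2 chain head(x, z) <- b1(x, y), b2(y, z) written as `(head, b1, b2)`, and weights are omitted. The names `FACTS`, `RULES`, `apply_rules`, and `ground` are invented for this example.

```python
# Toy input: given facts and two chain rules (all names are illustrative).
FACTS = {
    ("born_in", "Ann", "Paris"),
    ("located_in", "Paris", "France"),
    ("located_in", "France", "Europe"),
}
RULES = [
    ("located_in", "located_in", "located_in"),  # transitivity of located_in
    ("born_in", "born_in", "located_in"),        # born in a place inside a larger place
]


def apply_rules(atoms, rules):
    """Return every grounding as (head_atom, body_atom_1, body_atom_2)."""
    groundings = set()
    for head, b1, b2 in rules:
        for r1, x, y in atoms:
            if r1 != b1:
                continue
            for r2, y2, z in atoms:
                if r2 == b2 and y2 == y:
                    groundings.add(((head, x, z), (b1, x, y), (b2, y, z)))
    return groundings


def ground(facts, rules):
    """Compute the closure of ground atoms, then build the ground factors."""
    atoms = set(facts)
    # Step 1: keep applying the rules until no new atom is inferred.
    while True:
        inferred = {g[0] for g in apply_rules(atoms, rules)}
        if inferred <= atoms:
            break
        atoms |= inferred
    # Step 2: apply the rules once more over the closed atom set; each
    # grounding links a head atom to its body atoms, i.e. one ground factor.
    factors = apply_rules(atoms, rules)
    return atoms, factors
```

With the toy input, `ground(FACTS, RULES)` returns the three given facts plus the atoms inferred from them, along with one factor per rule grounding. An atom such as `("born_in", "Ann", "Europe")` can be derived in more than one way, so it appears as the head of more than one factor.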

Published in
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
June 2014
1645 pages
ISBN: 9781450323765
DOI: 10.1145/2588555
General Chairs: Curtis Dyreson (Utah State University, USA), Feifei Li (University of Utah, USA)
Program Chair: M. Tamer Özsu (University of Waterloo, Canada)
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 18 June 2014
Author Tags
databases
knowledge bases
probabilistic reasoning
Qualifiers
- research-article
Acceptance Rates
SIGMOD '14 Paper Acceptance Rate: 107 of 421 submissions, 25%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%
Article Metrics
- Total Citations: 56
- Total Downloads: 611
- Downloads (last 12 months): 22
- Downloads (last 6 weeks): 2