DOI: 10.1145/2588555.2610516

Knowledge expansion over probabilistic knowledge bases

Published: 18 June 2014

ABSTRACT

Information extraction and human collaboration techniques are widely applied in the construction of web-scale knowledge bases. However, these knowledge bases are often incomplete or uncertain. In this paper, we present ProbKB, a probabilistic knowledge base designed to infer missing facts in a scalable, probabilistic, and principled manner using a relational DBMS. The novel contributions we make to achieve scalability and high quality are: 1) We present a formal definition and a novel relational model for probabilistic knowledge bases. This model allows an efficient SQL-based inference algorithm for knowledge expansion that applies inference rules in batches; 2) We implement ProbKB on massively parallel processing (MPP) databases to achieve further scalability; and 3) We combine several quality control methods that identify erroneous rules, erroneous facts, and ambiguous entities to improve the precision of inferred facts. Our experiments show that the ProbKB system outperforms the state-of-the-art inference engine in terms of both performance and quality.
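
To illustrate what applying inference rules in batches can look like in SQL, here is a minimal sketch in Python using the standard sqlite3 module. The schema is an assumption made for illustration and is not taken from the paper: a hypothetical `facts(rel, x, y)` table and a hypothetical `rules(head, body)` table holding rules of the single shape head(x, y) :- body(x, y). Because the rules are stored as rows, one join applies all of them in a single statement instead of issuing one query per rule.

```python
import sqlite3


def apply_rules_once(conn):
    """Apply every stored rule head(x, y) :- body(x, y) with one join.

    Returns the number of facts that were newly added.
    """
    count = "SELECT COUNT(*) FROM facts"
    before = conn.execute(count).fetchone()[0]
    conn.execute(
        """
        INSERT OR IGNORE INTO facts (rel, x, y)
        SELECT r.head, f.x, f.y
        FROM rules AS r
        JOIN facts AS f ON f.rel = r.body
        """
    )
    return conn.execute(count).fetchone()[0] - before


# Hypothetical toy schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE facts (rel TEXT, x TEXT, y TEXT, PRIMARY KEY (rel, x, y));
    CREATE TABLE rules (head TEXT, body TEXT);
    """
)
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("born_in", "Ann", "Paris"),
        ("born_in", "Bob", "Rome"),
        ("capital_of", "Paris", "France"),
    ],
)
conn.executemany(
    "INSERT INTO rules VALUES (?, ?)",
    [("lived_in", "born_in"), ("located_in", "capital_of")],
)

added = apply_rules_once(conn)
inferred = conn.execute(
    "SELECT rel, x, y FROM facts WHERE rel IN ('lived_in', 'located_in') "
    "ORDER BY rel, x"
).fetchall()
print(added, inferred)
```

In this sketch the cost of one pass is a single join whose size depends on the tables, not on the number of rules, which is the intuition behind batching. Rules with other shapes (for example, two body atoms) would need their own table and join.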

      Reviews

      Vincent J Kovarik

      Extrapolating implicit facts from existing data using logic and rules enables the construction of a more semantically complete knowledge base. The authors extend previous approaches based on Markov logic networks (MLNs) by defining a probabilistic knowledge base consisting of a set of entities, class relations, and weighted facts. The work focuses on two areas: improving grounding efficiency using a relational database management system (DBMS) by applying inference rules in batches, and identifying and recovering from errors in the grounding process, which inhibits the propagation of those errors along the inference chain. The knowledge elements are represented as a collection of relational database tables, which enables a structured query language (SQL)-based inference algorithm to perform the knowledge expansion and construct the MLN graphs in batches. An MLN makes it possible to compute an inferred fact together with a degree of probability or certainty based on the network structure and link probabilities. A semantic constraint is defined as a first-order formula with an infinite weight, that is, a fact or assertion that must be satisfied by all possible combinations of rules within the system. Semantic constraints therefore make it possible to identify potential errors in assertions caused by inconsistent or incorrect rules. This is a beneficial capability in any system that attempts to extrapolate or infer new information, because it provides a compensation mechanism for incorrect information and ambiguous rules. Inconsistencies and conflicts are identified through the construction of ground factor graphs.
      The grounding algorithm that constructs the graphs consists of two steps: 1) compute the ground atoms, which comprise both given and inferred facts, until the transitive closure is reached, and 2) “apply the rules again to construct the ground factors.” The paper's experiments provide quantitative data: the parallel inference algorithm using SQL-based expressions is shown to improve the performance of the inference process over sequential algorithms.
      Online Computing Reviews Service
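
The two grounding steps described in the review can be sketched as follows, again in Python with the standard sqlite3 module. This is an illustration under stated assumptions, not the paper's implementation: the `facts`, `rules`, and `factors` tables, their columns, the single rule shape head(x, z) :- body1(x, y), body2(y, z), and the weight 0.8 are all hypothetical. Step 1 repeats a batch join until no new ground atoms appear; step 2 runs the same join once more to record one ground factor per rule application.

```python
import sqlite3


def ground(conn):
    """Two-step grounding sketch; returns (number of atoms, number of factors)."""
    count = "SELECT COUNT(*) FROM facts"
    # Step 1: derive ground atoms until a fixpoint (closure) is reached.
    while True:
        before = conn.execute(count).fetchone()[0]
        conn.execute(
            """
            INSERT OR IGNORE INTO facts (rel, x, y)
            SELECT r.head, a.x, b.y
            FROM rules AS r
            JOIN facts AS a ON a.rel = r.body1
            JOIN facts AS b ON b.rel = r.body2 AND b.x = a.y
            """
        )
        if conn.execute(count).fetchone()[0] == before:
            break
    # Step 2: apply the rules again to record the ground factors,
    # linking each derived atom to the two atoms that support it.
    conn.execute(
        """
        INSERT INTO factors (head_id, body1_id, body2_id, weight)
        SELECT h.id, a.id, b.id, r.weight
        FROM rules AS r
        JOIN facts AS a ON a.rel = r.body1
        JOIN facts AS b ON b.rel = r.body2 AND b.x = a.y
        JOIN facts AS h ON h.rel = r.head AND h.x = a.x AND h.y = b.y
        """
    )
    n_atoms = conn.execute(count).fetchone()[0]
    n_factors = conn.execute("SELECT COUNT(*) FROM factors").fetchone()[0]
    return n_atoms, n_factors


# Hypothetical toy schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE facts (
        id INTEGER PRIMARY KEY, rel TEXT, x TEXT, y TEXT,
        UNIQUE (rel, x, y)
    );
    CREATE TABLE rules (head TEXT, body1 TEXT, body2 TEXT, weight REAL);
    CREATE TABLE factors (
        head_id INTEGER, body1_id INTEGER, body2_id INTEGER, weight REAL
    );
    """
)
conn.executemany(
    "INSERT INTO facts (rel, x, y) VALUES (?, ?, ?)",
    [("located_in", "a", "b"), ("located_in", "b", "c"), ("located_in", "c", "d")],
)
# One transitivity rule: located_in(x, z) :- located_in(x, y), located_in(y, z).
conn.execute(
    "INSERT INTO rules VALUES ('located_in', 'located_in', 'located_in', 0.8)"
)

n_atoms, n_factors = ground(conn)
print(n_atoms, n_factors)
```

On this toy input the closure adds three atoms to the three given ones, and four rule applications are recorded as factors. Separating the two steps means factors are built only after every atom they can reference already exists.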

      • Published in

        SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
        June 2014
        1645 pages
        ISBN:9781450323765
        DOI:10.1145/2588555

        Copyright © 2014 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 18 June 2014

        Qualifiers

        • research-article

        Acceptance Rates

        SIGMOD '14 paper acceptance rate: 107 of 421 submissions, 25%
        Overall acceptance rate: 785 of 4,003 submissions, 20%
