DeepDive: Declarative Knowledge Base Construction

Published: 02 June 2016

Abstract

The dark data extraction or knowledge base construction (KBC) problem is to populate a SQL database with information from unstructured data sources including emails, web pages, and PDF reports. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is that statistical inference and machine learning are key tools to attack classical data problems in extraction, cleaning, and integration in a unified and more effective manner. DeepDive programs are declarative in that one cannot write probabilistic inference algorithms; instead, one interacts by defining features or rules about the domain. A key reason for this design choice is to enable domain experts to build their own KBC systems. We present the applications, abstractions, and techniques of DeepDive employed to accelerate construction of KBC systems.

Published In

ACM SIGMOD Record, Volume 45, Issue 1
March 2016
73 pages
ISSN: 0163-5808
DOI: 10.1145/2949741

Publisher

Association for Computing Machinery
New York, NY, United States

Qualifiers

• Column