column

DeepDive: Declarative Knowledge Base Construction

Authors: Christopher De Sa, Christopher Ré, Ce Zhang

ACM SIGMOD Record, Volume 45, Issue 1

Pages 60 - 67

https://doi.org/10.1145/2949741.2949756

Published: 02 June 2016

Abstract

The dark data extraction or knowledge base construction (KBC) problem is to populate a SQL database with information from unstructured data sources, including emails, webpages, and PDF reports. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The central idea in DeepDive is that statistical inference and machine learning are key tools for attacking classical data problems in extraction, cleaning, and integration in a unified and more effective manner. DeepDive programs are declarative in that one cannot write probabilistic inference algorithms; instead, one interacts by defining features or rules about the domain. A key reason for this design choice is to enable domain experts to build their own KBC systems. We present the applications, abstractions, and techniques of DeepDive employed to accelerate construction of KBC systems.

Published In

ACM SIGMOD Record Volume 45, Issue 1

March 2016

73 pages

ISSN:0163-5808

DOI:10.1145/2949741

Editors:
Yanlei Diao, University of Massachusetts Amherst
Vanessa Braganholo, Universidade Federal Fluminense
Marco Brambilla, Politecnico di Milano
Chee Yong Chan, National University of Singapore
Rada Chirkova, North Carolina State University
Zackary Ives, University of Pennsylvania
Anastasios Kementsietsidis, Google Research
Jeffrey Naughton, University of Wisconsin-Madison
Olga Papaemmanouil, Brandeis University
Aditya Parameswaran, University of Illinois
Anish Das Sarma, Google Research
Alkis Simitsis, HP Labs
Wang-Chiew Tan, University of California Santa Cruz
Nesime Tatbul, MIT CSAIL
Marianne Winslett, University of Illinois
Jun Yang, Duke University

Copyright © 2016 Authors.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Column
