DeepEx: A Robust Weak Supervision System for Knowledge Base Augmentation

Moreira, Johny; Barbosa, Luciano

doi:10.1007/s13740-021-00134-x

Original Article
Published: 06 July 2021

Volume 10, pages 309–325 (2021)
Journal on Data Semantics

Abstract

Knowledge bases support data organization and exploration, making the semantics of data easier to understand and easier for machines to use. Traditional strategies for knowledge base construction and augmentation have mostly relied on manual effort or on automatic extraction of content from structured and semi-structured sources. In this work, we present DeepEx, a system that autonomously extracts missing attributes of entities in knowledge bases from unstructured text. We use Wikipedia as the data source. Given entities on Wikipedia represented by their articles (text and infobox), DeepEx uses a classifier to detect sentences in the articles that mention possibly missing attributes of the entities, and then applies a deep-learning extraction model to those sentences to identify the attribute values. The sentence classifier and the attribute extractor are trained on labels produced automatically by a weak supervision approach that uses the structured information in infoboxes as the supervision source. We compared our strategy with previous approaches to this problem on 29 different attributes from 4 domains. The results show that our extraction pipeline achieved statistically superior performance compared with some baselines and variations of our approach.
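The weak supervision idea in the abstract, using infobox values to label article sentences automatically, can be illustrated with a small sketch. This is not the DeepEx implementation: the exact-match rule, the crude tokenizer, the B/I/O tagging scheme, and the names `tokenize`, `find_span`, and `weak_label` are all assumptions made for illustration, and the sample sentences and infobox are invented. A sentence is marked positive for an attribute when it contains the infobox value, and the matched tokens receive the tags an extractor could be trained on.

```python
import re


def tokenize(text):
    # Lowercased word/number tokens; a deliberately crude tokenizer (assumption).
    return re.findall(r"\w+", text.lower())


def find_span(tokens, value_tokens):
    """Return the start index of value_tokens inside tokens, or -1 if absent."""
    n = len(value_tokens)
    if n == 0:
        return -1
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == value_tokens:
            return i
    return -1


def weak_label(sentences, infobox, attribute):
    """Produce (tokens, is_positive, tags) per sentence for one attribute.

    is_positive is a weak label for a sentence classifier; tags are weak
    B/I/O labels for a token-level extractor. Exact matching is an
    assumption of this sketch, not the matching rule used by DeepEx.
    """
    value_tokens = tokenize(infobox.get(attribute, ""))
    labeled = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        tags = ["O"] * len(tokens)
        start = find_span(tokens, value_tokens)
        if start >= 0:
            tags[start] = "B"
            for j in range(start + 1, start + len(value_tokens)):
                tags[j] = "I"
        labeled.append((tokens, start >= 0, tags))
    return labeled


# Hypothetical article sentences and infobox for a single entity.
sentences = [
    "Ada Lovelace was born in London in 1815.",
    "She is known for her notes on the Analytical Engine.",
]
infobox = {"birth_place": "London"}
examples = weak_label(sentences, infobox, "birth_place")
```

Labels obtained this way are noisy: a sentence can mention the value without expressing the attribute, and a paraphrased value is missed entirely. That noise is the reason a robust labeling and training strategy matters in this setting.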

Notes

The source code of our solution and datasets used in this paper are publicly available on https://github.com/guardiaum/DeepEx.
https://en.wikipedia.org/wiki/Help:Infobox.
https://en.wikipedia.org/wiki/Help:Wikitext.
http://wikidata.dbpedia.org/develop/datasets/dbpedia-version-2016-10.
https://mwparserfromhell.readthedocs.io/en/latest/.
http://downloads.dbpedia.org/2016-10/core-i18n/en/.
https://scikit-learn.org/.
https://sklearn-crfsuite.readthedocs.io/en/latest/api.html.

Funding

This work was supported by Fundação de Amparo à Ciência e Tecnologia do Estado de Pernambuco (FACEPE) under grant No. IBPG-1172-1.03/16.

Author information

Authors and Affiliations

Centro de Informática, Universidade Federal de Pernambuco, Av. Prof. Moraes Rego, 1235 - Cidade Universitária, Recife - PE, 50670-901, Brazil
Johny Moreira & Luciano Barbosa

Contributions

JM and LB contributed to conceptualization; JM and LB contributed to methodology; JM contributed to formal analysis and investigation; JM contributed to software; JM contributed to writing (original draft preparation); LB contributed to writing (review and editing); LB contributed to funding acquisition; LB contributed to supervision.

Corresponding author

Correspondence to Johny Moreira.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Data Availability

The datasets used in this paper are publicly available on https://cin.ufpe.br/~jms5/deepex-data.zip.

Code Availability

The source code of our solution is publicly available on https://github.com/guardiaum/DeepEx.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Moreira, J., Barbosa, L. DeepEx: A Robust Weak Supervision System for Knowledge Base Augmentation. J Data Semant 10, 309–325 (2021). https://doi.org/10.1007/s13740-021-00134-x

Received: 12 January 2021
Revised: 28 April 2021
Accepted: 22 June 2021
Published: 06 July 2021
Issue Date: December 2021
DOI: https://doi.org/10.1007/s13740-021-00134-x
