DeepEx: A Robust Weak Supervision System for Knowledge Base Augmentation

  • Original Article
Journal on Data Semantics

Abstract

Knowledge bases support the organization and exploration of data, making its semantics easier to understand and easier for machines to use. Traditional strategies for knowledge base construction and augmentation have mostly relied on manual effort or on automatic extraction of content from structured and semi-structured sources. In this work, we present DeepEx, a system that autonomously extracts missing attributes of entities in knowledge bases from unstructured text. We use Wikipedia as the data source. Given entities on Wikipedia represented by their articles (text and infobox), DeepEx uses a classifier to detect sentences in the articles that mention the possibly missing attributes of the entities, and then applies a deep-learning extraction model to those sentences to identify the attributes. The sentence classifier and the attribute extractor are built with labels produced automatically by a weak supervision approach that uses the structured information in infoboxes as the supervision source. We compared our strategy with previous approaches to this problem on 29 different attributes from 4 domains. The results show that our extraction pipeline achieves statistically superior performance compared with some baselines and with variations of our approach.
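The weak supervision idea in the abstract can be illustrated with a minimal sketch. It is not the DeepEx implementation: it assumes exact token matching between an infobox value and the article sentences, and the function names (`tokenize`, `weak_label`), the `B-VAL`/`I-VAL` tag names, and the example sentences are hypothetical, chosen only for illustration. A sentence that contains the infobox value becomes a positive example for a sentence classifier, and the matched tokens become BIO tags that an extraction model could be trained on.

```python
import re


def tokenize(text):
    """Lowercase and split into word/number tokens (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def weak_label(sentences, infobox_value):
    """Label sentences using an infobox value as the only supervision source.

    Returns one (tokens, sentence_label, bio_tags) tuple per sentence.
    sentence_label is 1 if the value occurs in the sentence, else 0.
    bio_tags marks each matched span with B-VAL / I-VAL and everything else with O.
    Assumption: exact token matching; a real system would likely need fuzzier matching.
    """
    value = tokenize(infobox_value)
    labeled = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        tags = ["O"] * len(tokens)
        found = 0
        if value:
            # Slide a window of len(value) tokens over the sentence.
            for i in range(len(tokens) - len(value) + 1):
                if tokens[i:i + len(value)] == value:
                    tags[i] = "B-VAL"
                    for j in range(i + 1, i + len(value)):
                        tags[j] = "I-VAL"
                    found = 1
        labeled.append((tokens, found, tags))
    return labeled


# Made-up article sentences and a made-up infobox entry (birth_place = London).
sentences = [
    "Ada Lovelace was born in London in 1815.",
    "She is known for her work on the Analytical Engine.",
]
examples = weak_label(sentences, "London")
for tokens, label, tags in examples:
    print(label, list(zip(tokens, tags)))
```

Labels produced this way are noisy: a value can appear in a sentence that does not actually express the attribute, which is one reason a robust training procedure matters for this kind of pipeline.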

Notes

  1. The source code of our solution and the datasets used in this paper are publicly available at https://github.com/guardiaum/DeepEx.

  2. https://en.wikipedia.org/wiki/Help:Infobox.

  3. https://en.wikipedia.org/wiki/Help:Wikitext.

  4. http://wikidata.dbpedia.org/develop/datasets/dbpedia-version-2016-10.

  5. https://mwparserfromhell.readthedocs.io/en/latest/.

  6. http://downloads.dbpedia.org/2016-10/core-i18n/en/.

  7. https://scikit-learn.org/.

  8. https://sklearn-crfsuite.readthedocs.io/en/latest/api.html.

Funding

This work was supported by Fundação de Amparo a Ciência e Tecnologia do Estado de Pernambuco (FACEPE) under grant No. IBPG-1172-1.03/16.

Author information

Contributions

JM and LB contributed to conceptualization and methodology; JM contributed to formal analysis and investigation, software, and writing (original draft preparation); LB contributed to writing (review and editing), funding acquisition, and supervision.

Corresponding author

Correspondence to Johny Moreira.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Data Availability

The datasets used in this paper are publicly available at https://cin.ufpe.br/~jms5/deepex-data.zip.

Code Availability

The source code of our solution is publicly available at https://github.com/guardiaum/DeepEx.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Moreira, J., Barbosa, L. DeepEx: A Robust Weak Supervision System for Knowledge Base Augmentation. J Data Semant 10, 309–325 (2021). https://doi.org/10.1007/s13740-021-00134-x

  • DOI: https://doi.org/10.1007/s13740-021-00134-x
