Semi-structured Document Annotation Using Entity and Relation Types

  • Conference paper
Machine Learning and Knowledge Discovery in Databases. Research Track (ECML PKDD 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12977)

Abstract

Semi-structured documents such as web pages and reports contain text units connected by a complex structure. We motivate and address the problem of annotating such semi-structured documents using a knowledge graph schema with entity and relation types. This poses significant challenges not addressed by the existing literature: the latent document structure needs to be recovered, and paths in the latent structure need to be jointly annotated with entities and relationships. We present a two-stage solution. First, the most likely document structure is recovered by structure search using a probabilistic graphical model. Next, nodes and edges in the recovered document structure are jointly annotated using a probabilistic logic program, considering logical constraints as well as uncertainty. We additionally discover new entity and relation types beyond those in the specified schema. Experiments on real web page and complex table data show that our model outperforms existing table and web page annotation models for entity and relation annotation.
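
The two-stage idea can be illustrated with a toy sketch. This is not the authors' model: the probabilistic graphical model is replaced here by a hand-written parent score, and the probabilistic logic program by exhaustive search over type assignments under hard schema constraints. The schema, the function names `recover_structure` and `annotate`, the depth-based score, and all numeric scores are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical toy schema: allowed (parent type, relation, child type) triples.
SCHEMA = {
    ("Company", "hasProduct", "Product"),
    ("Product", "hasPrice", "Price"),
}


def recover_structure(units):
    """Stage 1 (toy): attach each unit to its best-scoring earlier unit.

    Each unit is a (text, depth) pair. The assumed score prefers a parent
    exactly one level shallower, with a small penalty for distance.
    """
    edges = []
    for c in range(1, len(units)):
        def score(p):
            return -abs(units[c][1] - units[p][1] - 1) - 0.1 * (c - p)
        edges.append((max(range(c), key=score), c))
    return edges


def annotate(units, edges, type_scores):
    """Stage 2 (toy): joint search over entity types for all nodes.

    An assignment is valid only if every edge matches a schema triple;
    among valid assignments, the product of per-node scores is maximised.
    """
    types = sorted({t for (a, _, b) in SCHEMA for t in (a, b)})
    best, best_score = None, 0.0
    for assignment in product(types, repeat=len(units)):
        relations = []
        for p, c in edges:
            rel = next((r for (a, r, b) in SCHEMA
                        if a == assignment[p] and b == assignment[c]), None)
            if rel is None:
                break  # this edge violates the schema
            relations.append(rel)
        else:
            score = 1.0
            for i, t in enumerate(assignment):
                score *= type_scores[i].get(t, 0.01)
            if score > best_score:
                best, best_score = (list(assignment), relations), score
    return best


units = [("Acme", 0), ("Widget", 1), ("$5", 2)]
# Locally, "Acme" looks slightly more like a Product than a Company.
type_scores = [
    {"Company": 0.45, "Product": 0.55},
    {"Product": 0.7, "Company": 0.3},
    {"Price": 0.9},
]
edges = recover_structure(units)
entity_types, relations = annotate(units, edges, type_scores)
```

In this toy input the first node would be labelled `Product` if each node were annotated independently, but the schema constraints on the recovered path force the joint assignment `Company`, `Product`, `Price`, which is the kind of interaction between structure and labels that the abstract describes.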

Author information

Corresponding author

Correspondence to Arpita Kundu.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Kundu, A., Ghosh, S., Bhattacharya, I. (2021). Semi-structured Document Annotation Using Entity and Relation Types. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science, vol 12977. Springer, Cham. https://doi.org/10.1007/978-3-030-86523-8_4

  • DOI: https://doi.org/10.1007/978-3-030-86523-8_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86522-1

  • Online ISBN: 978-3-030-86523-8

  • eBook Packages: Computer Science, Computer Science (R0)
