Abstract
Semi-structured documents such as webpages and reports contain text units connected by complex structure. We motivate and address the problem of annotating such semi-structured documents using a knowledge graph schema with entity and relation types. This poses significant challenges not addressed by the existing literature: the latent document structure needs to be recovered, and paths in the latent structure need to be jointly annotated with entities and relationships. We present a two-stage solution. First, the most likely document structure is recovered by structure search using a probabilistic graphical model. Next, nodes and edges in the recovered document structure are jointly annotated using a probabilistic logic program, considering logical constraints as well as uncertainty. We additionally discover new entity and relation types beyond those in the specified schema. Experiments on real webpage and complex table data show that our model outperforms existing table and webpage annotation models for entity and relation annotation.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Kundu, A., Ghosh, S., Bhattacharya, I. (2021). Semi-structured Document Annotation Using Entity and Relation Types. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science, vol 12977. Springer, Cham. https://doi.org/10.1007/978-3-030-86523-8_4
DOI: https://doi.org/10.1007/978-3-030-86523-8_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86522-1
Online ISBN: 978-3-030-86523-8
eBook Packages: Computer Science (R0)