To Extend or Not to Extend? Context-Specific Corpus Enrichment

Kuhr, Felix; Braun, Tanya; Bender, Magnus; Möller, Ralf

doi:10.1007/978-3-030-35288-2_29

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11919)

Included in the following conference series:

Australasian Joint Conference on Artificial Intelligence

Abstract

An agent in pursuit of a task may work with a corpus of documents with linked subjective content descriptions. Faced with a new document, the agent has to decide whether to include that document in its corpus. Basing the decision only on words, topics, or entities has been shown not to yield balanced performance across varying documents. Therefore, this paper presents an approach that lets an agent decide whether a new document adds value to its existing corpus by combining texts and content descriptions. Furthermore, an agent can use the approach as a starting point for high-quality content descriptions for new documents. A case study shows the effectiveness of the approach for varying types of new documents.

Author information

Authors and Affiliations

Institute of Information Systems, University of Lübeck, Lübeck, Germany
Felix Kuhr, Tanya Braun, Magnus Bender & Ralf Möller

Corresponding author

Correspondence to Felix Kuhr.

Editor information

Editors and Affiliations

University of South Australia, Adelaide, SA, Australia
Jixue Liu
The University of Melbourne, Melbourne, VIC, Australia
James Bailey

About this paper

Cite this paper

Kuhr, F., Braun, T., Bender, M., Möller, R. (2019). To Extend or Not to Extend? Context-Specific Corpus Enrichment. In: Liu, J., Bailey, J. (eds) AI 2019: Advances in Artificial Intelligence. AI 2019. Lecture Notes in Computer Science, vol 11919. Springer, Cham. https://doi.org/10.1007/978-3-030-35288-2_29

DOI: https://doi.org/10.1007/978-3-030-35288-2_29
Published: 25 November 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-35287-5
Online ISBN: 978-3-030-35288-2
eBook Packages: Computer Science, Computer Science (R0)
