Incremental Information Extraction Using Tree-Based Context Representations

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2005)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3406)

Abstract

The purpose of information extraction (IE) is to find desired pieces of information in natural language texts and store them in a form suitable for automatic processing. Providing annotated training data to adapt a trainable IE system to a new domain requires considerable effort. To address this, we explore incremental learning: training documents are annotated sequentially by a user and immediately incorporated into the extraction model. The system can thus support the user by proposing extractions based on the current extraction model, reducing the user's workload over time.
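The annotate-propose-update loop described above can be sketched with a mistake-driven, multiplicatively updated classifier in the style of Winnow. This is a minimal illustration, not the paper's actual implementation; the class name, label set, and update parameters are all assumptions chosen for the sketch.

```python
from collections import defaultdict

class IncrementalTokenClassifier:
    """Sketch of incremental, mistake-driven learning: weights are
    updated multiplicatively whenever the user corrects a proposed
    label, so each annotated document immediately improves the model.
    Names and parameters are illustrative, not from the paper."""

    def __init__(self, promote=1.35, demote=0.8):
        self.promote, self.demote = promote, demote
        # weights[label][feature] -> weight, initialized to 1.0
        self.weights = defaultdict(lambda: defaultdict(lambda: 1.0))

    def predict(self, features, labels=("O", "PERSON", "DATE")):
        """Propose the label whose feature weights score highest.
        The label set here is a made-up example."""
        return max(labels,
                   key=lambda lbl: sum(self.weights[lbl][f] for f in features))

    def update(self, features, predicted, correct):
        """Update only on mistakes: promote the correct label's
        feature weights, demote the wrongly predicted label's."""
        if predicted == correct:
            return
        for f in features:
            self.weights[correct][f] *= self.promote
            self.weights[predicted][f] *= self.demote
```

In use, the system predicts a label for each token, the user confirms or corrects it, and `update` is called with the correction, so proposals improve as annotation proceeds.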

We introduce an approach to modeling IE as a token classification task that allows incremental training. To provide sufficient information to the token classifiers, we use rich, tree-based context representations of each token as feature vectors. These representations make use of the heuristically deduced document structure in addition to linguistic and semantic information. We consider the resulting feature vectors as ordered and combine proximate features into more expressive joint features, called “Orthogonal Sparse Bigrams” (OSB). Our results indicate that this setup makes it possible to employ IE in an incremental fashion without a serious performance penalty.
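The OSB combination step can be sketched as follows: within a sliding window, each token is paired with every earlier token in the window, with skipped positions marked so that pairs at different distances remain distinct features. This is a sketch of the general technique under the assumption of a default window of four tokens; the function name and `<skip>` marker are illustrative.

```python
def osb_features(tokens, window=4):
    """Combine proximate tokens into joint "orthogonal sparse bigram"
    features: each token is paired with each of the preceding
    window-1 tokens, marking skipped positions in between."""
    features = []
    for i in range(1, len(tokens)):
        for dist in range(1, window):
            j = i - dist
            if j < 0:
                break
            # the <skip> markers keep, e.g., "a <skip> c" distinct
            # from the adjacent bigram "a b"
            features.append(" ".join([tokens[j]]
                                     + ["<skip>"] * (dist - 1)
                                     + [tokens[i]]))
    return features
```

For the token sequence `["a", "b", "c"]` with `window=3`, this yields the joint features `"a b"`, `"b c"`, and `"a <skip> c"`, giving the classifier access to token combinations rather than isolated features.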





Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Siefkes, C. (2005). Incremental Information Extraction Using Tree-Based Context Representations. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, vol 3406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30586-6_55


  • DOI: https://doi.org/10.1007/978-3-540-30586-6_55

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-24523-0

  • Online ISBN: 978-3-540-30586-6

  • eBook Packages: Computer Science (R0)
