Abstract
The purpose of information extraction (IE) is to find desired pieces of information in natural language texts and store them in a form suitable for automatic processing. Providing annotated training data to adapt a trainable IE system to a new domain requires considerable work. To address this, we explore incremental learning: training documents are annotated sequentially by a user and immediately incorporated into the extraction model. The system can thus support the user by proposing extractions based on the current extraction model, reducing the user's workload over time.
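To make the incremental setting concrete, the following is a minimal sketch of such a training loop, assuming a mistake-driven online learner in the spirit of Winnow. The abstract does not fix the learning algorithm, and the threshold choice, feature names, and toy document stream below are hypothetical illustrations, not details from the paper.

    from collections import defaultdict

    class WinnowClassifier:
        """Mistake-driven online learner: weights change only on errors,
        so each newly annotated document is incorporated immediately."""

        def __init__(self, promotion=1.5, demotion=0.5):
            self.promotion = promotion
            self.demotion = demotion
            self.weights = defaultdict(lambda: 1.0)  # one weight per feature

        def predict(self, features):
            # Positive iff the summed weights of the active features reach
            # the threshold; using the number of active features as the
            # threshold is one simple choice, not prescribed by the paper.
            return sum(self.weights[f] for f in features) >= len(features)

        def update(self, features, label):
            if self.predict(features) == label:
                return  # correct predictions leave the weights untouched
            factor = self.promotion if label else self.demotion
            for f in features:
                self.weights[f] *= factor

    # Hypothetical stream: each token comes with its active context
    # features and a user-supplied label (e.g. "starts a target field").
    model = WinnowClassifier()
    stream = [
        ({"word=Seminar", "pos=NN", "parent=title"}, True),
        ({"word=the", "pos=DT", "parent=body"}, False),
    ]
    for features, label in stream:
        proposal = model.predict(features)  # suggestion shown to the user
        model.update(features, label)       # the user's decision trains the model

A mistake-driven learner is a natural fit for this setting because each user correction triggers at most one cheap multiplicative update, so the model is always up to date before the next document is annotated.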
We introduce an approach that models IE as a token classification task and allows incremental training. To provide sufficient information to the token classifiers, we use rich, tree-based context representations of each token as feature vectors. These representations draw on the heuristically deduced document structure in addition to linguistic and semantic information. We treat the resulting feature vectors as ordered and combine proximate features into more expressive joint features, called “Orthogonal Sparse Bigrams” (OSB). Our results indicate that this setup makes it possible to employ IE in an incremental fashion without a serious performance penalty.
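As an illustration of the OSB combination step, the sketch below pairs each feature of an ordered feature vector with each of its predecessors inside a sliding window, recording skipped positions. The window size of 5 and the "<skip>" marker syntax are assumptions made for illustration, not details taken from the paper.

    def osb_features(ordered_features, window=5):
        """Combine proximate features of an ordered feature vector into
        joint "orthogonal sparse bigram" features: each feature is paired
        with every predecessor inside the window, recording the gap."""
        joint = []
        for i, current in enumerate(ordered_features):
            for distance in range(1, window):
                j = i - distance
                if j < 0:
                    break
                gap = "<skip> " * (distance - 1)  # mark skipped positions
                joint.append(f"{ordered_features[j]} {gap}{current}")
        return joint

    # Example: four context features of a single token, in order.
    print(osb_features(["word=Seminar", "pos=NN", "parent=title", "lemma=seminar"]))
    # -> ['word=Seminar pos=NN',
    #     'pos=NN parent=title', 'word=Seminar <skip> parent=title', ...]

Compared with plain bigrams, these gapped pairs expose mid-range feature co-occurrences to the classifier while the number of joint features per position grows only linearly in the window size.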
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
Cite this paper
Siefkes, C. (2005). Incremental Information Extraction Using Tree-Based Context Representations. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, vol. 3406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30586-6_55
DOI: https://doi.org/10.1007/978-3-540-30586-6_55
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24523-0
Online ISBN: 978-3-540-30586-6