Incremental Information Extraction Using Tree-Based Context Representations

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2005)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3406)


The purpose of information extraction (IE) is to find desired pieces of information in natural language texts and store them in a form that is suitable for automatic processing. Providing annotated training data to adapt a trainable IE system to a new domain requires a considerable amount of work. To address this, we explore incremental learning: a user annotates training documents sequentially, and each annotated document is immediately incorporated into the extraction model. The system can thus support the user by proposing extractions based on the current extraction model, reducing the user's workload over time.
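The annotate-then-update workflow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class `ToyIncrementalExtractor`, its `propose` and `update` methods, and the label `"O"` for non-extracted tokens are hypothetical names, and the count-based model is a deliberately trivial stand-in rather than the classifier used in the paper.

```python
from collections import Counter, defaultdict


class ToyIncrementalExtractor:
    """Hypothetical stand-in for a trainable extraction model.

    It only counts how often each token was given each label. This is
    NOT the paper's classifier, just enough to show the workflow.
    """

    def __init__(self):
        self.label_counts = defaultdict(Counter)

    def propose(self, tokens):
        """Suggest one label per token from the counts seen so far."""
        proposals = []
        for token in tokens:
            counts = self.label_counts.get(token.lower())
            proposals.append(counts.most_common(1)[0][0] if counts else "O")
        return proposals

    def update(self, tokens, labels):
        """Incorporate one annotated document immediately."""
        for token, label in zip(tokens, labels):
            self.label_counts[token.lower()][label] += 1


def annotate_incrementally(model, documents):
    """Process documents one by one; return the corrections needed per document."""
    corrections_per_doc = []
    for tokens, gold_labels in documents:
        proposed = model.propose(tokens)
        # The user only has to fix the proposals that are wrong.
        corrections = sum(p != g for p, g in zip(proposed, gold_labels))
        corrections_per_doc.append(corrections)
        # The corrected document becomes training data right away.
        model.update(tokens, gold_labels)
    return corrections_per_doc
```

In this toy setting, feeding the same annotated sentence twice requires corrections only the first time, which mirrors the intended effect of proposals improving as more documents are annotated.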

We introduce an approach to modeling IE as a token classification task that allows incremental training. To provide sufficient information to the token classifiers, we use rich, tree-based context representations of each token as feature vectors. These representations make use of the heuristically deduced document structure in addition to linguistic and semantic information. We consider the resulting feature vectors as ordered and combine proximate features into more expressive joint features, called “Orthogonal Sparse Bigrams” (OSB). Our results indicate that this setup makes it possible to employ IE in an incremental fashion without a serious performance penalty.
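As a rough illustration of combining proximate features into joint features, the sketch below pairs each feature in an ordered list with the features shortly before it and marks the skipped positions. The abstract does not spell out the construction, so this follows one common sliding-window formulation of orthogonal sparse bigrams; the function name, the `window` size, the `<skip>` marker, and the feature strings in the usage note are assumptions for illustration.

```python
def orthogonal_sparse_bigrams(features, window=5, skip="<skip>"):
    """Pair each feature with each of the up to `window - 1` features before it.

    Skipped positions are marked so that pairs at different distances stay
    distinct. `window` and the `skip` marker are illustrative choices.
    """
    pairs = []
    for i, current in enumerate(features):
        for distance in range(1, window):
            j = i - distance
            if j < 0:
                break  # no earlier feature at this distance
            gap = [skip] * (distance - 1)
            pairs.append(" ".join([features[j], *gap, current]))
    return pairs
```

For an ordered feature list such as `["word=Ann", "pos=NNP", "parent=speaker"]` (hypothetical feature names) and `window=3`, the function yields the two adjacent pairs plus one pair that spans a skipped position. A classifier can then use these joint features alongside, or instead of, the individual ones.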

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Siefkes, C. (2005). Incremental Information Extraction Using Tree-Based Context Representations. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, vol 3406. Springer, Berlin, Heidelberg.

  • DOI:

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-24523-0

  • Online ISBN: 978-3-540-30586-6

  • eBook Packages: Computer Science (R0)