Learning Information Extraction Rules for Protein Annotation from Unannotated Corpora

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2005)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3406)

Abstract

As the number of published papers on proteins increases rapidly, manual protein annotation for biological sequence databases struggles to keep pace with publication. Automated information extraction for protein annotation offers a solution to this problem. Information extraction tasks have generally relied on the availability of pre-defined templates as well as annotated corpora. However, in many real-world applications this requirement is difficult to fulfill; only relevant sentences for the target domain can be easily collected. At the same time, other resources can be harnessed to compensate for this difficulty: natural language processing provides reliable tools for syntactic text analysis, and in biomedical domains a large amount of background knowledge is available, e.g., in the form of ontologies. In this paper, we present a method for learning information extraction rules without pre-defined templates or labor-intensive pre-annotation by exploiting various types of background knowledge in an inductive logic programming framework.
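
The abstract does not describe the learning algorithm itself, so the following is only a toy sketch of the general idea of generalizing over relevant sentences with ontological background knowledge; it is not the authors' method or an actual inductive logic programming system. It assumes each relevant sentence has already been reduced by a parser to a (subject, verb, object) triple, and that an is-a hierarchy is available. The ontology `IS_A`, the example triples, and the functions `ancestors`, `generalize`, `induce_rule`, and `covers` are all hypothetical names invented for this illustration.

```python
# Toy illustration only: the ontology, triples, and function names are hypothetical.
from typing import Dict, List, Optional, Tuple

# Background knowledge: a tiny is-a hierarchy (child -> parent).
IS_A: Dict[str, str] = {
    "kinase": "enzyme",
    "protease": "enzyme",
    "enzyme": "protein",
    "membrane": "cell_component",
    "nucleus": "cell_component",
}

Triple = Tuple[str, str, str]  # (subject class, verb lemma, object class)


def ancestors(term: str) -> List[str]:
    """Return the term followed by its ancestors, most specific first."""
    chain = [term]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain


def generalize(a: str, b: str) -> Optional[str]:
    """Most specific class covering both terms, or None if they are unrelated."""
    b_chain = set(ancestors(b))
    for candidate in ancestors(a):
        if candidate in b_chain:
            return candidate
    return None


def induce_rule(examples: List[Triple]) -> Optional[Triple]:
    """Generalize triples that share one verb into a single rule pattern."""
    subj, verb, obj = examples[0]
    for s, v, o in examples[1:]:
        if v != verb:
            return None
        subj_g = generalize(subj, s)
        obj_g = generalize(obj, o)
        if subj_g is None or obj_g is None:
            return None
        subj, obj = subj_g, obj_g
    return (subj, verb, obj)


def covers(rule: Triple, example: Triple) -> bool:
    """A rule covers an example if the verbs match and both arguments
    fall under the rule's classes in the hierarchy."""
    return (
        rule[1] == example[1]
        and rule[0] in ancestors(example[0])
        and rule[2] in ancestors(example[2])
    )


# Unannotated but relevant sentences, already parsed into triples (assumed).
relevant = [
    ("kinase", "localize_to", "membrane"),
    ("protease", "localize_to", "nucleus"),
]
rule = induce_rule(relevant)
print(rule)
```

In this sketch no template is given in advance: the pattern emerges from the relevant sentences, and the hierarchy decides how far each argument is generalized. A real system would also need negative evidence or a scoring criterion to avoid over-general rules, which this toy omits.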

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kim, JH., Hilario, M. (2005). Learning Information Extraction Rules for Protein Annotation from Unannotated Corpora. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, vol 3406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30586-6_56

  • DOI: https://doi.org/10.1007/978-3-540-30586-6_56

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-24523-0

  • Online ISBN: 978-3-540-30586-6

  • eBook Packages: Computer Science, Computer Science (R0)