Learning Information Extraction Rules for Protein Annotation from Unannotated Corpora

Kim, Jee-Hyub; Hilario, Melanie

doi:10.1007/978-3-540-30586-6_56

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3406)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

As the number of published papers on proteins increases rapidly, manual protein annotation for biological sequence databases struggles to keep pace with publication. Automated information extraction for protein annotation offers a solution to this problem. Generally, information extraction tasks have relied on the availability of pre-defined templates as well as annotated corpora. However, in many real-world applications, this requirement is difficult to fulfill; only sentences relevant to the target domain can be collected easily. At the same time, other resources can be harnessed to compensate for this difficulty: natural language processing provides reliable tools for syntactic text analysis, and in bio-medical domains a large amount of background knowledge is available, e.g., in the form of ontologies. In this paper, we present a method for learning information extraction rules without pre-defined templates or labor-intensive pre-annotation by exploiting various types of background knowledge in an inductive logic programming framework.

Author information

Authors and Affiliations

Artificial Intelligence Lab, University of Geneva, CH-1211, Geneva 4, Switzerland
Jee-Hyub Kim & Melanie Hilario

Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, México
Alexander Gelbukh

About this paper

Cite this paper

Kim, JH., Hilario, M. (2005). Learning Information Extraction Rules for Protein Annotation from Unannotated Corpora. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, vol 3406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30586-6_56

DOI: https://doi.org/10.1007/978-3-540-30586-6_56
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24523-0
Online ISBN: 978-3-540-30586-6
eBook Packages: Computer Science, Computer Science (R0)
