Abstract
As the number of published papers on proteins grows rapidly, manual protein annotation for biological sequence databases struggles to keep pace with publication. Automated information extraction for protein annotation offers a solution to this problem. Information extraction tasks have generally relied on pre-defined templates and annotated corpora. In many real-world applications, however, this requirement is difficult to meet; only sentences relevant to the target domain can be collected easily. Other resources can compensate for this difficulty: natural language processing provides reliable tools for syntactic text analysis, and in biomedical domains a large amount of background knowledge is available, e.g., in the form of ontologies. In this paper, we present a method for learning information extraction rules without pre-defined templates or labor-intensive pre-annotation by exploiting various types of background knowledge in an inductive logic programming framework.
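To make the idea of learning rules from background knowledge concrete, the following is a minimal illustrative sketch, not the method of the paper. It assumes each sentence has already been reduced to a set of ground facts (from a syntactic analysis and an ontology lookup) and that a candidate rule is a conjunction of such facts. The fact names, the example sentences, the contrast set of non-relevant sentences, and the `coverage` helper are all hypothetical, and the representation is propositional, whereas an inductive logic programming system searches over first-order clauses with variables.

```python
from typing import FrozenSet, List, NamedTuple, Tuple

# A ground fact about a sentence, e.g. ("verb", "bind").
# All fact names and values here are hypothetical.
Fact = Tuple[str, str]


class Sentence(NamedTuple):
    facts: FrozenSet[Fact]  # background knowledge attached to the sentence
    relevant: bool          # True if collected as relevant to the target domain


def covers(rule: FrozenSet[Fact], sentence: Sentence) -> bool:
    """A rule (conjunction of facts) covers a sentence if every fact holds."""
    return rule <= sentence.facts


def coverage(rule: FrozenSet[Fact], sentences: List[Sentence]) -> Tuple[int, int]:
    """Count (relevant, non-relevant) sentences covered by the rule."""
    pos = sum(1 for s in sentences if s.relevant and covers(rule, s))
    neg = sum(1 for s in sentences if not s.relevant and covers(rule, s))
    return pos, neg


# Toy data: syntactic roles plus assumed ontology types for their fillers.
SENTENCES = [
    Sentence(frozenset({("verb", "bind"), ("subj_type", "protein"),
                        ("obj_type", "dna")}), True),
    Sentence(frozenset({("verb", "bind"), ("subj_type", "protein"),
                        ("obj_type", "protein")}), True),
    Sentence(frozenset({("verb", "bind"), ("subj_type", "book"),
                        ("obj_type", "cover")}), False),
    Sentence(frozenset({("verb", "describe"), ("subj_type", "protein"),
                        ("obj_type", "method")}), False),
]

GENERAL_RULE = frozenset({("verb", "bind")})
SPECIFIC_RULE = frozenset({("verb", "bind"), ("subj_type", "protein")})

if __name__ == "__main__":
    print(coverage(GENERAL_RULE, SENTENCES))
    print(coverage(SPECIFIC_RULE, SENTENCES))
```

In this toy setting, adding the ontology-derived condition on the subject type specializes the rule so that it no longer covers the non-relevant sentence. This is the kind of refinement step a rule learner would score when deciding which clause to keep.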
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kim, JH., Hilario, M. (2005). Learning Information Extraction Rules for Protein Annotation from Unannotated Corpora. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, vol 3406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30586-6_56
DOI: https://doi.org/10.1007/978-3-540-30586-6_56
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24523-0
Online ISBN: 978-3-540-30586-6
eBook Packages: Computer Science, Computer Science (R0)