Abstract
As biomedical literature databases grow, there is an urgent need for intelligent systems that automatically discover protein-protein interactions from text. Despite resource-intensive efforts to create manually curated interaction databases, the sheer volume of biological literature makes significant coverage impossible to achieve by hand. In this paper, we describe a scalable hierarchical Support Vector Machine (SVM)-based framework that mines protein interactions efficiently and with high precision. In addition, we describe a convolution tree-vector kernel, based on the syntactic similarity of natural language text, that further enhances the mining process. Using the inherent syntactic similarity of interaction phrases as a kernel significantly improves classification quality. The hierarchical framework reduces the search space dramatically at each stage while sustaining a high level of accuracy. We test the framework on a corpus of over 10,000 manually annotated phrases gathered from various sources. The convolution kernel technique identifies sentences describing interactions with a precision of 95% and a recall of 92%, a significant improvement over previous machine learning techniques.
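To make the idea of a syntactic convolution kernel concrete, the following is a minimal sketch of the standard subtree-counting kernel of Collins and Duffy over constituency parse trees. It is not the tree-vector kernel of this paper, whose details the abstract does not give. The nested-tuple tree encoding, the function names, the decay parameter `lam`, and the example sentences are all assumptions made for illustration.

```python
import math
from functools import lru_cache

# Assumed encoding: a tree is a nested tuple (label, child, child, ...);
# a word is a plain string sitting under its part-of-speech node.

def label(t):
    """Label of a node, or the word itself for a leaf string."""
    return t if isinstance(t, str) else t[0]

def production(node):
    """Grammar production at a node: (parent label, child labels)."""
    return (node[0], tuple(label(c) for c in node[1:]))

def nodes(tree):
    """All internal (non-word) nodes of the tree, root first."""
    if isinstance(tree, str):
        return []
    found = [tree]
    for child in tree[1:]:
        found.extend(nodes(child))
    return found

def tree_kernel(t1, t2, lam=1.0):
    """Weighted count of the tree fragments shared by t1 and t2."""
    @lru_cache(maxsize=None)
    def common(n1, n2):
        # Fragments rooted at n1 and n2 can match only if the
        # productions at the two nodes are identical.
        if production(n1) != production(n2):
            return 0.0
        score = lam
        for c1, c2 in zip(n1[1:], n2[1:]):
            # Words contribute no further factor; subtrees recurse.
            if not isinstance(c1, str) and not isinstance(c2, str):
                score *= 1.0 + common(c1, c2)
        return score

    return sum(common(a, b) for a in nodes(t1) for b in nodes(t2))

def normalized_tree_kernel(t1, t2, lam=1.0):
    """Kernel rescaled so that identical trees score 1."""
    denom = math.sqrt(tree_kernel(t1, t1, lam) * tree_kernel(t2, t2, lam))
    return tree_kernel(t1, t2, lam) / denom

# Two hypothetical interaction sentences with the same syntactic shape
# but different protein names.
t1 = ("S", ("NP", ("NNP", "RAD51")),
      ("VP", ("VBZ", "binds"), ("NP", ("NNP", "BRCA2"))))
t2 = ("S", ("NP", ("NNP", "p53")),
      ("VP", ("VBZ", "binds"), ("NP", ("NNP", "MDM2"))))

similarity = normalized_tree_kernel(t1, t2)
```

Because the two example trees differ only in the protein names, most of their fragments are shared and the normalized similarity is well above zero even though no protein name matches. A kernel of this kind can be supplied to an SVM as a precomputed Gram matrix; choosing `lam` below 1 down-weights large fragments so that the largest shared subtrees do not dominate the score.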
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Narayanan, R., Misra, S., Lin, S., Choudhary, A. (2010). Mining Protein Interactions from Text Using Convolution Kernels. In: Theeramunkong, T., et al. New Frontiers in Applied Data Mining. PAKDD 2009. Lecture Notes in Computer Science, vol 5669. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14640-4_9
DOI: https://doi.org/10.1007/978-3-642-14640-4_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14639-8
Online ISBN: 978-3-642-14640-4
eBook Packages: Computer Science (R0)