Abstract
Supervised machine learning methods have been employed with great success in biomedical relation extraction. However, existing methods are of limited practical use because manually constructing large training sets is very expensive. Active learning is therefore urgently needed to design practical relation extraction methods that require little human effort. In this paper, we describe a unified active learning framework. In particular, the framework systematically addresses several practical issues in the active learning process, including a strategy for selecting informative data, a data diversity selection algorithm, an active feature acquisition method, and an informative feature selection algorithm, in order to meet the challenges posed by the immense amount of complex and diverse biomedical text. The framework is evaluated on protein-protein interaction (PPI) extraction and is shown to achieve promising results with a significant reduction in editorial effort and labeling time.
Additional information
This work was supported by the National Natural Science Foundation of China under Grant No. 60973104 and the National Basic Research 973 Program of China under Grant No. 2012CB316301.
Cite this article
Zhang, HT., Huang, ML. & Zhu, XY. A Unified Active Learning Framework for Biomedical Relation Extraction. J. Comput. Sci. Technol. 27, 1302–1313 (2012). https://doi.org/10.1007/s11390-012-1306-0
DOI: https://doi.org/10.1007/s11390-012-1306-0