Abstract
In this paper, we propose two independent solutions to the problems of spelling variants and the lack of annotated corpora, which are the main difficulties in SVM (support vector machine) and other machine-learning-based approaches to biological named entity recognition. To address spelling variants, we propose using edit distance as a feature for the SVM. To address the lack of annotated corpora, we propose using virtual examples, by which the annotated corpus can be expanded automatically, quickly, and easily. The experimental results show that introducing edit distance yields some improvement, and that the model trained on the corpus expanded with virtual examples outperforms the model trained on the original corpus. Finally, we achieved an F-measure of 71.46% (precision 64.03%, recall 80.84%) in five-category named entity recognition on the GENIA corpus (version 3.0).
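As an illustration of the two quantities named in the abstract, the sketch below computes a minimum edit distance between two strings with the standard dynamic-programming recurrence, and checks that the reported precision and recall are consistent with the reported F-measure. It is a minimal sketch, not the authors' implementation: the function names `edit_distance` and `f_measure`, the unit cost for every operation, and the example strings are assumptions made here for illustration, and the abstract does not say how the distance is turned into an SVM feature.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b` (unit costs assumed)."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            substitution_cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,                      # delete ca
                curr[j - 1] + 1,                  # insert cb
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical spelling variants of the same entity name.
    print(edit_distance("NF-kappa B", "NF-kappaB"))
    # Precision and recall as reported in the abstract, in percent.
    print(round(f_measure(64.03, 80.84), 2))
```

Spelling variants that differ only by a space or a hyphen end up at a small distance from each other, which is the property that makes such a distance a plausible signal for matching variant forms of the same name.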
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yi, E., Lee, G.G., Song, Y., Park, SJ. (2005). SVM-Based Biological Named Entity Recognition Using Minimum Edit-Distance Feature Boosted by Virtual Examples. In: Su, KY., Tsujii, J., Lee, JH., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2004. IJCNLP 2004. Lecture Notes in Computer Science, vol 3248. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30211-7_86
DOI: https://doi.org/10.1007/978-3-540-30211-7_86
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24475-2
Online ISBN: 978-3-540-30211-7
eBook Packages: Computer Science (R0)