Abstract
Natural language processing (NLP) techniques are widely used to extract and mine knowledge from biomedical literature. A critical step in this task is biomedical named entity tagging (BNER), which usually consists of two steps: identifying biomedical names in text, and assigning predefined semantic classes to the names identified in the first step. BNER systems frequently use headwords and suffixes as features for assigning semantic classes to names in text. However, few studies have evaluated how well headwords and suffixes predict the semantic classes of biomedical terms when knowledge sources are used in an unsupervised way. We conducted such a study using names in the Unified Medical Language System (UMLS); the semantic classes associated with these names were obtained by modifying an existing UMLS semantic group system and incorporating the GENIA ontology. We define headwords and suffixes that are significantly associated with a specific semantic class as semantic suffixes. Semantic class assignment using semantic suffixes achieved an F-measure of 86.4%, with a precision of 91.6% and a recall of 81.7%. When the semantic suffixes obtained from the UMLS were applied to names extracted from the GENIA corpus, the system achieved an F-measure of 73.4%, with a precision of 84.2% and a recall of 65.1%; these measures could be improved dramatically when evaluation was limited to names associated with classes that have corresponding GENIA types.
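The idea of a semantic suffix can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which statistical test defines "significantly associated", so the sketch substitutes a simple count and purity threshold. The function names, the thresholds, the toy name list, and the choice of the last token as the headword are all assumptions made for illustration. The final function is the standard balanced F-measure, which reproduces the reported figures from the stated precision and recall.

```python
from collections import Counter, defaultdict

def candidate_keys(name, min_len=3, max_len=5):
    """Headword first, then its suffixes from longest to shortest.
    Assumption: the headword is simply the last whitespace-separated token."""
    hw = name.lower().split()[-1]
    keys = [hw]
    for n in range(min(max_len, len(hw) - 1), min_len - 1, -1):
        keys.append("-" + hw[-n:])
    return keys

def learn_semantic_suffixes(labeled_names, min_count=2, min_purity=0.9):
    """Keep keys whose dominant class covers at least min_purity of their
    occurrences. This threshold is a stand-in for a significance test."""
    counts = defaultdict(Counter)
    for name, sem_class in labeled_names:
        for key in candidate_keys(name):
            counts[key][sem_class] += 1
    table = {}
    for key, by_class in counts.items():
        top_class, top_count = by_class.most_common(1)[0]
        total = sum(by_class.values())
        if total >= min_count and top_count / total >= min_purity:
            table[key] = top_class
    return table

def classify(name, table):
    """Return the class of the most specific matching key, or None."""
    for key in candidate_keys(name):
        if key in table:
            return table[key]
    return None

def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy lexicon; the study itself used names from the UMLS.
toy = [
    ("protein kinase", "protein"),
    ("tyrosine kinase", "protein"),
    ("map kinase", "protein"),
    ("hepatitis", "disorder"),
    ("dermatitis", "disorder"),
    ("arthritis", "disorder"),
]
table = learn_semantic_suffixes(toy)

print(classify("gastritis", table))   # matched through the "-itis" suffix
print(classify("lipase", table))      # matched through the "-ase" suffix
print(classify("aspirin", table))     # no key matches
print(round(100 * f_measure(0.916, 0.817), 1))
print(round(100 * f_measure(0.842, 0.651), 1))
```

Checking the headword before any suffix, and longer suffixes before shorter ones, means the most specific available evidence decides the class. Names with no matching key are left unassigned, which is one way recall can fall below precision in a scheme like this.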
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Torii, M., Liu, H. (2006). Headwords and Suffixes in Biomedical Names. In: Bremer, E.G., Hakenberg, J., Han, EH., Berrar, D., Dubitzky, W. (eds) Knowledge Discovery in Life Science Literature. KDLL 2006. Lecture Notes in Computer Science, vol 3886. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11683568_3
DOI: https://doi.org/10.1007/11683568_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32809-4
Online ISBN: 978-3-540-32810-0
eBook Packages: Computer Science, Computer Science (R0)