Abstract
Most research on biomedical named entity recognition (NER) has focused on English texts, e.g., MEDLINE abstracts. However, recent years have also seen significant growth in biomedical publications in other languages. For example, the Chinese Biomedical Bibliographic Database has collected over 3 million articles published after 1978 from 1600 Chinese biomedical journals. We present a Conditional Random Field (CRF) based system for recognizing biomedical named entities in Chinese texts. Viewing Chinese sentences as sequences of characters, we trained and tested the CRF model on a manually annotated corpus of 106 research abstracts (481 sentences in total). The features used by the CRF model include word segmentation tags provided by a segmenter trained on newswire corpora, and lists of frequent characters gathered from the training data and external resources. With 400 randomly selected sentences used for training and the remaining 81 for testing, our system obtained a 68.60% F-score on average, significantly outperforming the baseline system (F-score 60.54% using a simple dictionary match). This suggests that statistical approaches such as CRFs trained on annotated corpora hold promise for the biomedical NER task in Chinese texts.
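The character-level framing, the dictionary-match baseline, and the F-score comparison described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the B/I/O tagging scheme, the greedy longest-match strategy, exact-span scoring, the function names, and the toy sentence and lexicon are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def spans_to_bio(sentence: str, spans: List[Span]) -> List[str]:
    """Label each character B (entity start), I (inside entity) or O (outside).

    Assumption: a B/I/O scheme; the abstract only says that sentences are
    treated as sequences of characters.
    """
    tags = ["O"] * len(sentence)
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def dictionary_match(sentence: str, lexicon: Set[str]) -> List[Span]:
    """Baseline sketch: greedy left-to-right longest match against a lexicon.

    Assumption: longest-match; the abstract only says "simple dictionary match".
    """
    max_len = max((len(term) for term in lexicon), default=0)
    spans: List[Span] = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            if sentence[i:i + length] in lexicon:
                spans.append((i, i + length))
                i += length
                break
        else:
            i += 1  # no lexicon entry starts at this character
    return spans


def f_score(gold: List[Span], predicted: List[Span]) -> float:
    """Entity-level F-score, counting a prediction as correct only on exact span match."""
    correct = len(set(gold) & set(predicted))
    if correct == 0:
        return 0.0
    precision = correct / len(set(predicted))
    recall = correct / len(set(gold))
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration (not taken from the paper's corpus).
sentence = "胰岛素和p53蛋白"
gold = [(0, 3), (4, 9)]          # "胰岛素" and "p53蛋白"
lexicon = {"胰岛素", "蛋白"}      # hypothetical dictionary entries

gold_tags = spans_to_bio(sentence, gold)
predicted = dictionary_match(sentence, lexicon)
score = f_score(gold, predicted)
```

In this toy case the baseline finds "胰岛素" exactly but only the "蛋白" suffix of "p53蛋白", so one of two predictions is correct and one of two gold entities is recovered. A sequence model such as a CRF would instead be trained to predict the per-character tags in `gold_tags` from features of each character and its context.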
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gu, B., Popowich, F., Dahl, V. (2008). Recognizing Biomedical Named Entities in Chinese Research Abstracts. In: Bergler, S. (ed.) Advances in Artificial Intelligence. Canadian AI 2008. Lecture Notes in Computer Science, vol 5032. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68825-9_12
DOI: https://doi.org/10.1007/978-3-540-68825-9_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68821-1
Online ISBN: 978-3-540-68825-9
eBook Packages: Computer Science, Computer Science (R0)