Recognizing Biomedical Named Entities in Chinese Research Abstracts

  • Conference paper
Advances in Artificial Intelligence (Canadian AI 2008)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5032)

Abstract

Most research on biomedical named entity recognition (NER) has focused on English texts, e.g., MEDLINE abstracts. However, recent years have also seen significant growth of biomedical publications in other languages. For example, the Chinese Biomedical Bibliographic Database has collected over 3 million articles published after 1978 from 1,600 Chinese biomedical journals. We present here a Conditional Random Field (CRF) based system for recognizing biomedical named entities in Chinese texts. Viewing Chinese sentences as sequences of characters, we trained and tested the CRF model using a manually annotated corpus containing 106 research abstracts (481 sentences in total). The features we used for the CRF model include word segmentation tags provided by a segmenter trained on newswire corpora, and lists of frequent characters gathered from training data and external resources. Randomly selecting 400 sentences for training and the rest for testing, our system obtained a 68.60% F-score on average, significantly outperforming the baseline system (F-score 60.54% using a simple dictionary match). This suggests that statistical approaches such as CRFs based on annotated corpora hold promise for the biomedical NER task in Chinese texts.
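
To make the character-level view concrete, the following is a minimal sketch of how a sentence might be turned into per-character labels and features of the kinds named in the abstract (word segmentation tags and frequent-character lists). It is not the authors' implementation: the B/I/O label set, the B/M/E/S segmentation tags, the context-character features, the example sentence, and all function names are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def bio_labels(sentence: str, spans: List[Tuple[int, int]]) -> List[str]:
    """Label each character B (entity start), I (inside entity), or O (outside).

    `spans` holds hypothetical (start, end) character offsets, end exclusive.
    """
    labels = ["O"] * len(sentence)
    for start, end in spans:
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels


def segmentation_tags(words: List[str]) -> List[str]:
    """Turn a word segmentation into per-character position tags.

    Assumed tag set: B (word start), M (middle), E (end), S (single-character word).
    """
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def char_features(
    sentence: str, seg_tags: List[str], frequent_chars: Set[str]
) -> List[Dict[str, object]]:
    """Build one feature dictionary per character, suitable as CRF input."""
    features = []
    for i, ch in enumerate(sentence):
        features.append({
            "char": ch,
            "prev_char": sentence[i - 1] if i > 0 else "<s>",
            "next_char": sentence[i + 1] if i + 1 < len(sentence) else "</s>",
            "seg_tag": seg_tags[i],
            # Membership in a list of characters that occur often inside entities.
            "is_frequent": ch in frequent_chars,
        })
    return features


# Illustrative sentence: "insulin lowers blood glucose".
sentence = "胰岛素降低血糖"
words = ["胰岛素", "降低", "血糖"]   # output of a hypothetical segmenter
spans = [(0, 3)]                     # "胰岛素" (insulin) annotated as an entity
frequent_chars = {"素", "酶"}        # hypothetical frequent-character list

labels = bio_labels(sentence, spans)
seg_tags = segmentation_tags(words)
features = char_features(sentence, seg_tags, frequent_chars)
```

Each (feature dictionary, label) pair corresponds to one position in the character sequence; a sequence labeller such as a CRF would then be trained on these sequences, one per sentence.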

Editor information

Sabine Bergler

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gu, B., Popowich, F., Dahl, V. (2008). Recognizing Biomedical Named Entities in Chinese Research Abstracts. In: Bergler, S. (eds) Advances in Artificial Intelligence. Canadian AI 2008. Lecture Notes in Computer Science, vol 5032. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68825-9_12

  • DOI: https://doi.org/10.1007/978-3-540-68825-9_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-68821-1

  • Online ISBN: 978-3-540-68825-9

  • eBook Packages: Computer Science, Computer Science (R0)
