Exploring feature sets for two-phase biomedical named entity recognition using semi-CRFs

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

This paper presents a two-phase approach based on the semi-Markov conditional random fields model (semi-CRFs) and explores novel feature sets for classifying entities in text into five types: protein, DNA, RNA, cell_line and cell_type. Semi-CRFs assign a label to a segment rather than to a single word, which is more natural than other machine learning methods such as the conditional random fields model (CRFs). Our approach divides the biomedical named entity recognition task into two sub-tasks: term boundary detection and semantic labeling. In the first phase, the term boundary detection sub-task detects the boundaries of the entities and classifies all of them into a single type C. In the second phase, the semantic labeling sub-task assigns the correct entity type to each entity detected in the first phase. We explore novel feature sets at both phases to improve performance. For comparison, experiments are conducted on both CRFs and semi-CRFs models at each phase. Our experiments on the JNLPBA 2004 datasets achieve an F-score of 74.64 % based on semi-CRFs without deep domain knowledge or post-processing algorithms, which outperforms most state-of-the-art systems.
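
The data flow of the two phases can be illustrated with a minimal Python sketch. It is not the authors' implementation: no CRF or semi-CRF model is trained here, the phase-one tags are written by hand, and `toy_classify` is a hypothetical rule-based stand-in for the semantic labeling model. The function names, the example sentence and the rules are all assumptions made for illustration. The sketch only shows how token-level tags with the single type C become segments, and how each segment then receives one of the five entity types.

```python
from typing import Callable, List, Tuple

# A segment is (start, end_exclusive, label); a segment-level model such as a
# semi-CRF assigns one label to the whole span instead of one per token.
Segment = Tuple[int, int, str]


def bio_to_segments(tags: List[str]) -> List[Segment]:
    """Convert token-level B-/I-/O tags into labeled segments."""
    segments: List[Segment] = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            segments.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return segments


def label_segments(
    tokens: List[str],
    segments: List[Segment],
    classify: Callable[[List[str]], str],
) -> List[Segment]:
    """Phase 2: replace the generic type C with a specific entity type."""
    return [(s, e, classify(tokens[s:e])) for s, e, _ in segments]


def toy_classify(entity_tokens: List[str]) -> str:
    """Hypothetical stand-in for the phase-two model (illustration only)."""
    head = entity_tokens[-1].lower()
    if head == "gene":
        return "DNA"
    if head in {"cell", "cells"}:
        return "cell_type"
    return "protein"


tokens = ["IL-2", "gene", "expression", "in", "T", "cells"]
# Phase 1 output (hand-written here): boundaries only, single type C.
phase1_tags = ["B-C", "I-C", "O", "O", "B-C", "I-C"]

boundaries = bio_to_segments(phase1_tags)
typed = label_segments(tokens, boundaries, toy_classify)
print(typed)
```

With these hand-written tags, the first phase yields the segments (0, 2) and (4, 6), both of type C, and the toy second phase labels them DNA and cell_type. In a real system each phase would be a trained sequence model with its own feature set.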

Notes

  1. The CRF package supports both CRFs and semi-CRFs and is available at http://crf.sourceforge.net.

Abbreviations

Semi-CRFs: Semi-Markov conditional random fields

CRFs: Conditional random fields

NER: Named entity recognition

References

  1. Chan S, Lam W, Yu X (2007) A cascaded approach to biomedical named entity recognition using a unified model. In: Proceedings of the 2007 7th IEEE international conference on data mining (ICDM ’07), pp 93–102

  2. Cohen A, Hersh W (2005) A survey of current work in biomedical text mining. Brief Bioinformatics 6(1):57–71

  3. Finkel J, Dingare S, Nguyen H et al (2004) Exploiting context for biomedical entity recognition: from syntax to the web. In: Proceedings of the international joint workshop on natural language processing in biomedicine and its applications (JNLPBA ’04), pp 88–91

  4. Kim J, Ohta T, Tateisi Y, Tsujii J (2003) GENIA corpus-a semantically annotated corpus for bio-text mining. Bioinformatics 19(suppl 1):i180–i182

  5. Kim J, Ohta T, Tsuruoka Y et al (2004) Introduction to the bio-entity recognition task at JNLPBA. In: Proceedings of the international joint workshop on natural language processing in biomedicine and its applications (JNLPBA ’04), pp 70–75

  6. Kim S, Yoon J, Park K, Rim HC (2005) Two-phase biomedical named entity recognition using a hybrid method. In: Proceedings of the 2nd international joint conference (IJCNLP 2005), pp 646–657

  7. Kim S, Yoon J (2007) Experimental study on a two phase method for biomedical named entity recognition. IEICE Trans Inf Syst E90–D(7):1103–1110

  8. Kulick S, Bies A, Liberman M (2004) Integrated annotation for biomedical information extraction. In: HLT-NAACL 2004 workshop, linking biological literature, ontologies and databases, pp 61–68

  9. Lafferty J, McCallum A, Pereira F (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th international conference on machine learning (ICML ’01), pp 282–289

  10. Lee C, Hou W, Chen H (2004) Annotating multiple types of biomedical entities: a single word classification approach. In: Proceedings of the international joint workshop on natural language processing in biomedicine and its applications (JNLPBA ’04), pp 83–86

  11. Lee K, Hwang YS, Rim HC (2003) Two-phase biomedical NE recognition based on SVMs. In: Proceedings of the ACL 2003 workshop on natural language processing in biomedicine (BioMed ’03), pp 33–40

  12. Li L, Zhou R, Huang D (2009) Two-phase biomedical named entity recognition using CRFs. Comput Biol Chem 33(4):334–338

  13. McDonald R, Pereira F (2005) Identifying gene and protein mentions in text using conditional random fields. BMC Bioinformatics 6(suppl 1):s6

  14. Okanohara D, Miyao Y, Tsuruoka Y, Tsujii J (2006) Improving the scalability of semi-Markov conditional random fields for named entity recognition. In: Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the ACL, pp 465–472

  15. Olsson F, Eriksson G, Franzen K et al (2002) Notions of correctness when evaluating protein name taggers. In: Proceedings of the 19th international conference on computational linguistics, pp 765–771

  16. Pablo-Sánchez CD, Segura-Bedmar I, Martínez P, Iglesias-Maqueda A (2012) Lightly supervised acquisition of named entities and linguistic patterns for multilingual text mining. Knowl Inf Syst. doi:10.1007/s10115-012-0502-0

  17. Pérez-Catalán M, Berlanga R, Sanz I, Aramburu MJ (2012) A semantic approach for the requirement-driven discovery of web resources in the Life Sciences. Knowl Inf Syst 34(3):671–690. doi:10.1007/s10115-012-0498-5

  18. Sarawagi S, Cohen W (2004) Semi-Markov conditional random fields for information extraction. Adv Neural Inf Process Syst 17:1185–1192

  19. Settles B (2004) Biomedical named entity recognition using conditional random fields and novel feature sets. In: Proceedings of the international joint workshop on natural language processing in biomedicine and its applications (JNLPBA ’04), pp 104–107

  20. Shehata S, Karray F, Kamel M (2012) An efficient concept-based retrieval model for enhancing text retrieval quality. Knowl Inf Syst. doi:10.1007/s10115-012-0504-y

  21. Sundheim B (1995) Overview of results of the MUC-6 evaluation. In: Proceedings of the 6th conference on message understanding (MUC6 ’95), pp 13–31

  22. Tsai R, Sung C, Dai H et al (2006) NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition. BMC Bioinformatics 7(suppl 5):s11

  23. Yang L, Zhou Y (2010) Two-phase biomedical named entity recognition based on semi-CRFs. In: Proceedings of the IEEE international conference on bio-inspired computing: theories and applications, pp 1061–1065

  24. Yang Z, Lin H, Li Y (2008) Exploiting the performance of dictionary-based bio-entity name recognition in biomedical literature. Comput Biol Chem 32(4):287–291

  25. You W, Fontaine D, Barthès J (2012) An automatic keyphrase extraction system for scientific documents. Knowl Inf Syst 34(3):691–724. doi:10.1007/s10115-012-0480-2

  26. Zhou G, Su J (2004) Exploring deep knowledge resources in biomedical name recognition. In: Proceedings of the international joint workshop on natural language processing in biomedicine and its applications (JNLPBA ’04), pp 96–99

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant 30971642 and the Natural Science Foundation of Hubei Province under Grant 2009CDA161.

Author information

Corresponding author

Correspondence to Li Yang.

Additional information

A preliminary version of this paper appeared in the 2010 IEEE International Conference on Bio-inspired Computing: Theories and Applications [23].

About this article

Cite this article

Yang, L., Zhou, Y. Exploring feature sets for two-phase biomedical named entity recognition using semi-CRFs. Knowl Inf Syst 40, 439–453 (2014). https://doi.org/10.1007/s10115-013-0637-7

  • DOI: https://doi.org/10.1007/s10115-013-0637-7
