A comparative study for biomedical named entity recognition

Wang, Xu; Yang, Chen; Guan, Renchu

doi:10.1007/s13042-015-0426-6

Original Article
Published: 15 September 2015

International Journal of Machine Learning and Cybernetics
Volume 9, pages 373–382 (2018)

Xu Wang¹,
Chen Yang² &
Renchu Guan¹

Abstract

With high-throughput technologies now applied across biomedical research, the volume of biomedical literature is growing exponentially. Extracting knowledge from manuscripts quickly and accurately is therefore increasingly important, especially in the era of big data. Named entity recognition (NER), which aims to identify chunks of text that refer to specific entities, is the essential first step of information extraction. In this paper, we review the three models of biomedical NER and two well-known machine learning methods, the Hidden Markov Model and Conditional Random Fields, which have been widely applied in bioinformatics. Based on these two methods, six prominent biomedical NER tools are compared in terms of programming language, feature sets, underlying mathematical methods, post-processing techniques and flowcharts. Experiments with these tools are conducted on two widely used corpora, GENETAG and JNLPBA. The comparison ranges from individual entity types to overall performance. Finally, we offer suggestions on selecting Bio-NER tools for different applications.
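As a rough illustration of how a Hidden Markov Model can tag entity chunks, the sketch below runs Viterbi decoding over B/I/O labels (begin, inside, outside an entity). It is not taken from the paper or from any of the compared tools: the function name `viterbi`, the label set, and every probability are hypothetical values chosen only to make the example run.

```python
import math

def viterbi(tokens, states, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable label sequence for tokens under a toy HMM.

    unk is an assumed floor probability for words a state has never emitted.
    """
    if not tokens:
        return []

    def log_emit(state, word):
        return math.log(emit_p[state].get(word, unk))

    # best[t][s]: log-probability of the best path that ends in state s at position t
    best = [{s: math.log(start_p[s]) + log_emit(s, tokens[0]) for s in states}]
    back = []  # back[t-1][s]: best predecessor of state s at position t
    for word in tokens[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + math.log(trans_p[p][s]))
            scores[s] = best[-1][prev] + math.log(trans_p[prev][s]) + log_emit(s, word)
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical toy parameters (illustrative only, not estimated from any corpus).
STATES = ["B", "I", "O"]
START_P = {"B": 0.30, "I": 0.01, "O": 0.69}
TRANS_P = {
    "B": {"B": 0.05, "I": 0.60, "O": 0.35},
    "I": {"B": 0.05, "I": 0.50, "O": 0.45},
    "O": {"B": 0.30, "I": 0.01, "O": 0.69},
}
EMIT_P = {
    "B": {"IL-2": 0.5, "p53": 0.5},
    "I": {"gene": 0.5, "receptor": 0.3, "protein": 0.2},
    "O": {"the": 0.3, "of": 0.3, "expression": 0.2, "is": 0.1, "gene": 0.1},
}

if __name__ == "__main__":
    sentence = ["expression", "of", "IL-2", "gene"]
    print(viterbi(sentence, STATES, START_P, TRANS_P, EMIT_P))  # ['O', 'O', 'B', 'I']
```

With these invented parameters, "IL-2 gene" is labelled as one entity chunk (B followed by I). A Conditional Random Field could use the same decoding step but would replace the per-state emission and transition probabilities with feature-weighted scores.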

Acknowledgments

This paper is supported by the National Key Basic Research Program of China (No. 2015CB453000), National Natural Science Foundation of China (Nos. 61572228, 41101376, 61272207 and 61300147), and the Science Technology Development Project of Jilin Province of China (20130101070JC, 20130522106JH and 20140520070JH).

Author information

Authors and Affiliations

College of Computer Science and Technology, Jilin University, 2699 Qianjin Street, Changchun, 130012, People’s Republic of China
Xu Wang & Renchu Guan
College of Earth Sciences, Jilin University, 2699 Qianjin Street, Changchun, 130012, People’s Republic of China
Chen Yang

Corresponding author

Correspondence to Renchu Guan.

About this article

Cite this article

Wang, X., Yang, C. & Guan, R. A comparative study for biomedical named entity recognition. Int. J. Mach. Learn. & Cyber. 9, 373–382 (2018). https://doi.org/10.1007/s13042-015-0426-6

Received: 31 March 2015
Accepted: 07 September 2015
Published: 15 September 2015
Issue Date: March 2018
DOI: https://doi.org/10.1007/s13042-015-0426-6
