Brief Communication
Two-phase biomedical named entity recognition using CRFs

https://doi.org/10.1016/j.compbiolchem.2009.07.004

Abstract

As a fundamental step of biomedical text mining, Biomedical Named Entity Recognition (Bio-NER) remains a challenging task. This paper explores a two-phase approach to identifying biomedical entities, in which the recognition task is divided into two subtasks, Named Entity Detection (NED) and Named Entity Classification (NEC), each handled in its own phase. In the first phase, a Conditional Random Fields (CRFs) model identifies each named entity without determining its type; in the second phase, another CRFs model determines the correct entity type for each identified entity. This division reduces the training time significantly and allows more relevant features to be selected for each subtask. To improve performance, post-processing algorithms are applied before the NEC subtask. Experiments on the JNLPBA2004 datasets show that the two-phase approach achieves an F-score of 74.31%, which outperforms most state-of-the-art systems.

Introduction

With the rapid development of computational and biological technology, the biomedical literature is expanding at an exponential rate. This growth has created an opportunity for text mining techniques in the field. Bio-NER, which aims to identify words or phrases that refer to specific entities in biomedical literature, is a critical step in biomedical text mining. Only when biomedical named entities are correctly identified can more complex tasks, such as human gene/protein normalization and protein–protein interaction extraction, be performed effectively. Although many algorithms have been proposed, Bio-NER remains challenging, and a large gap persists between the best Bio-NER systems and the best systems in the newswire domain. The best NER systems on newswire articles achieve an F-score of over 96% (Sundheim, 1995), whereas state-of-the-art gene/protein NER systems achieve F-scores between 75% and 85% (Cohen and Hersh, 2005).

Current methods for the Bio-NER task fall into three general classes: dictionary-based methods (Yang et al., 2008), heuristic rule-based methods (Olsson et al., 2002) and statistical machine learning methods. Compared with the other two classes, machine learning methods are more robust and can identify potential biomedical entities that are not already included in standard dictionaries. Many machine learning techniques have been applied to identifying named entities in biomedical literature, including Hidden Markov Models (HMM) (Zhou and Su, 2004), Support Vector Machines (SVM) (Lee et al., 2004), Maximum Entropy Markov Models (MEMM) (Finkel et al., 2004) and CRFs (McDonald and Pereira, 2005; Tsai et al., 2006; Settles, 2004). However, most state-of-the-art systems adopt one-phase approaches, in which the NED and NEC subtasks are integrated. In these approaches, each output label combines region information (B/I/O) with a semantic class C (such as protein, DNA or RNA). The larger label set increases the number of features, and the training cost is substantially higher as a result.
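To make the label-set growth concrete, the following minimal sketch counts the output labels under each scheme. It assumes the five JNLPBA2004 entity classes (protein, DNA, RNA, cell line, cell type); the function names and the label spellings such as `B-protein` are illustrative and are not taken from the paper.

```python
# Assumption: the five JNLPBA2004 entity classes; label spellings are illustrative.
ENTITY_TYPES = ["protein", "DNA", "RNA", "cell_line", "cell_type"]


def one_phase_labels(types):
    """One-phase scheme: a B- and an I- label for every type, plus a single O label."""
    labels = [f"{prefix}-{t}" for t in types for prefix in ("B", "I")]
    return labels + ["O"]


def two_phase_labels(types):
    """Two-phase scheme: untyped B/I/O for detection, then one class per type."""
    detection_labels = ["B", "I", "O"]
    classification_labels = list(types)
    return detection_labels, classification_labels


if __name__ == "__main__":
    print(len(one_phase_labels(ENTITY_TYPES)))
    detection, classification = two_phase_labels(ENTITY_TYPES)
    print(len(detection), len(classification))
```

Under these assumptions the one-phase tagger chooses among 2 × 5 + 1 = 11 labels at every token, whereas the detection model chooses among 3. In a first-order linear-chain model the number of label-pair transitions grows with the square of the label count, which is one plausible source of the higher training cost described above.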

This paper presents a two-phase approach that divides the Bio-NER task into two subtasks, NED and NEC. In the first phase, all entity types of interest are grouped into a single type, and biomedical entities are identified by a CRFs model; in the second phase, another CRFs model determines the correct entity type for each identified entity. To improve performance, post-processing algorithms are applied before the NEC subtask. Experiments on the JNLPBA2004 datasets (Kim et al., 2004) show that the presented approach not only reduces training time but also improves identification performance on the Bio-NER task.
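The data flow between the two phases can be sketched as follows. This is only an illustration of the label handling, not the authors' implementation: the helper names are hypothetical, the tag spellings are assumed, and the CRFs models and the post-processing step are omitted entirely.

```python
def to_detection_tags(typed_tags):
    """Collapse typed tags such as 'B-protein' into untyped 'B'/'I'/'O' (phase 1 targets)."""
    return [tag.split("-", 1)[0] for tag in typed_tags]


def extract_spans(detection_tags):
    """Return (start, end) token spans, end exclusive, from untyped B/I/O tags.

    An 'I' that does not follow an open span is treated as the start of a new span.
    """
    spans = []
    start = None
    for i, tag in enumerate(detection_tags):
        if tag == "B" or (tag == "I" and start is None):
            if start is not None:
                spans.append((start, i))  # close the previous span
            start = i
        elif tag == "O":
            if start is not None:
                spans.append((start, i))
            start = None
        # an 'I' inside an open span simply extends it
    if start is not None:
        spans.append((start, len(detection_tags)))
    return spans


if __name__ == "__main__":
    gold = ["B-protein", "I-protein", "O", "B-DNA", "O"]
    untyped = to_detection_tags(gold)
    print(untyped)
    print(extract_spans(untyped))
```

Each span returned by `extract_spans` is what a second-phase classifier would receive as input, and the type it assigns to the span replaces the single grouped type used during detection.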

The remainder of this paper is organized as follows. Section 2 describes our methods in detail. Section 3 presents the experiments and analyzes the results. Section 4 compares our two-phase approach with other systems. Finally, Section 5 gives conclusions and future work.

Methods

CRFs were introduced by Lafferty et al. (2001) to solve sequence labeling problems. They represent the state of the art in sequence labeling and have shown good performance on the NER task.
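For reference, the conditional distribution of a linear-chain CRF is commonly written as below. The notation is the standard textbook form and is an assumption here, since this excerpt does not show the paper's own formulation: x is the token sequence, y the label sequence, f_k are feature functions with weights λ_k, and Z(x) is the normalization term.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \right),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \right)
```

Because each f_k depends on the label pair (y_{t-1}, y_t), a smaller label set directly reduces the number of label-dependent features and weights the model has to estimate.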

Experiments and results

Our experiments are all based on the JNLPBA2004 datasets. The training data of the shared task is the GENIA corpus v3.02 (Kim et al., 2003), which consists of 2000 abstracts retrieved from MEDLINE using the MeSH terms “human”, “blood cells” and “transcription factors”. The test data comprises 404 abstracts: half are from the same domain as the training data, and the other half are from the super-domain of “blood cells” and “transcription factors”. In JNLPBA2004, the participating systems were

Comparisons

We compare our two-phase approach with the one-phase approach in Table 8. All experiments here were carried out on the same computer platform, with a 2.2 GHz CPU and 2 GB of RAM. The experimental results show that our two-phase approach achieves better performance and reduces the training time.

Some related work has explored two-phase approaches to identifying biomedical entities. Lee et al. (2003) present a two-phase Bio-NER method based on SVM, which is the first

Conclusions and future work

This paper presents a two-phase approach for extracting biomedical entities from unstructured text using two CRFs models. Dividing Bio-NER into two subtasks allows more relevant features to be selected for each subtask and reduces the training time. We demonstrate the effectiveness of our approach on the JNLPBA2004 datasets and compare it with related work. The results show that our two-phase approach achieves better performance, with an F-score of 74.31%. However, the

References (20)

  • Z.H. Yang et al. Exploiting the performance of dictionary-based bio-entity name recognition in biomedical literature. Comput. Biol. Chem. (2008)
  • S.K. Chan et al. A cascaded approach to biomedical named entity recognition using a unified model
  • A.M. Cohen et al. A survey of current work in biomedical text mining. Brief Bioinform. (2005)
  • S. Dingare et al. A system for identifying named entities in biomedical text: how results from two evaluations reflect on both the system and the evaluations. Comp. Funct. Genom. (2005)
  • J. Finkel et al. Exploiting context for biomedical entity recognition: from syntax to the web
  • J.D. Kim et al. GENIA corpus—a semantically annotated corpus for bio-text mining. Bioinformatics (2003)
  • J.D. Kim et al. Introduction to the bioentity recognition task at JNLPBA
  • S. Kim et al. Experimental study on a two phase method for biomedical named entity recognition. IEICE-Trans. Info Syst. (2007)
  • S. Kulick et al. Integrated annotation for biomedical information extraction
  • J. Lafferty et al. Conditional random fields: probabilistic models for segmenting and labeling sequence data

There are more references available in the full text version of this article.
