Brief Communication
Two-phase biomedical named entity recognition using CRFs
Introduction
With the rapid development of computational and biological technology, the biomedical literature is expanding at an exponential rate, which has created an opportunity for text mining techniques in this field. Biomedical named entity recognition (Bio-NER), which aims to identify words or phrases referring to specific entities in the biomedical literature, is a critical step in biomedical text mining. Only when biomedical named entities are correctly identified can more complex tasks, such as human gene/protein normalization and protein–protein interaction extraction, be performed effectively. Although many algorithms have been proposed, Bio-NER remains challenging, and there is still a large gap between the best Bio-NER systems and the best systems in the newswire domain. The best NER systems on newswire articles achieve an F-score of over 96% (Sundheim, 1995), while state-of-the-art gene/protein NER systems reach F-scores between 75% and 85% (Cohen and Hersh, 2005).
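The F-score quoted above is the harmonic mean of precision and recall. The following is a minimal sketch of how it can be computed at the entity level, assuming exact matching of both span boundaries and entity type. The function name `f_score` and the example spans are hypothetical and are not taken from the systems cited.

```python
def f_score(gold, predicted):
    """Return (precision, recall, F-score) for exact-match entity scoring.

    Each entity is a (start_token, end_token, type) triple; a prediction
    counts as correct only if all three fields match a gold entity.
    """
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical annotations for one sentence.
gold = [(0, 2, "protein"), (5, 6, "DNA"), (9, 10, "cell_type")]
pred = [(0, 2, "protein"), (5, 6, "RNA"), (9, 10, "cell_type"), (12, 13, "protein")]

p, r, f = f_score(gold, pred)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.5 0.667 0.571
```

In this example the entity at tokens 5 to 6 has the right boundaries but the wrong type, so it counts against both precision and recall.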
Current methods for the Bio-NER task fall into three general classes: dictionary-based methods (Yang et al., 2008), heuristic rule-based methods (Olsson et al., 2002) and statistical machine learning methods. Compared with the other two classes, machine learning methods are more robust and can identify potential biomedical entities that are not yet included in standard dictionaries. Many machine learning techniques have been applied to identifying named entities in the biomedical literature, including the Hidden Markov Model (HMM) (Zhou and Su, 2004), the Support Vector Machine (SVM) (Lee et al., 2004), the Maximum Entropy Markov Model (MEMM) (Finkel et al., 2004) and conditional random fields (CRFs) (McDonald and Pereira, 2005, Tsai et al., 2006, Settles, 2004). However, most state-of-the-art systems adopt one-phase approaches, in which the named entity detection (NED) and named entity classification (NEC) subtasks are integrated. In these approaches, each output label combines region information (B/I/O) with a semantic class C (such as protein, DNA or RNA). The larger label set increases the number of features, and the training cost becomes substantially higher.
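The growth of the label set can be made concrete with a short sketch. It assumes the five JNLPBA2004 entity classes (protein, DNA, RNA, cell line, cell type) and a plain B/I/O scheme. The squared label count is used only as a rough proxy for the number of label-pair (transition) combinations in a first-order sequence model, not as a measurement of the authors' feature sets.

```python
# Entity classes assumed for illustration (the JNLPBA2004 class inventory).
CLASSES = ["protein", "DNA", "RNA", "cell_line", "cell_type"]


def one_phase_labels(classes):
    """Build the combined label set: B-C and I-C for every class C, plus O."""
    labels = ["O"]
    for c in classes:
        labels += [f"B-{c}", f"I-{c}"]
    return labels


# In a detection-only phase all classes are merged, so three labels suffice.
NED_LABELS = ["B", "I", "O"]

one = one_phase_labels(CLASSES)
print(len(one), len(NED_LABELS))            # 11 3
print(len(one) ** 2, len(NED_LABELS) ** 2)  # 121 9
```

Every observation feature is paired with each label, so a model over 11 labels carries several times as many weights as one over 3 labels for the same feature templates.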
This paper presents a two-phase approach that divides the Bio-NER task into two subtasks, NED and NEC. In the first phase, all entity types of interest are grouped into a single type, and biomedical entities are identified by a CRF model; in the second phase, another CRF model determines the correct entity type for each identified entity. To improve performance, post-processing algorithms are applied before the NEC subtask. Experiments on the JNLPBA2004 datasets (Kim et al., 2004) show that the presented approach not only reduces the training time but also improves identification performance on the Bio-NER task.
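The data flow of the two phases can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the collapsed gold tags stand in for the output of the first-phase CRF, `toy_classifier` is a hypothetical placeholder for the second-phase CRF, and the post-processing step is omitted. All function names are invented for this example.

```python
def collapse_to_ned(tags):
    """Map combined tags such as 'B-protein' to class-free 'B'/'I'/'O' tags."""
    return [t.split("-", 1)[0] for t in tags]


def extract_spans(ned_tags):
    """Return (start, end) token spans, end exclusive, from B/I/O tags."""
    spans, start = [], None
    for i, tag in enumerate(ned_tags):
        if tag == "B" or (tag == "I" and start is None):
            if start is not None:          # a new B closes the open entity
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:                  # entity running to the sentence end
        spans.append((start, len(ned_tags)))
    return spans


def merge(length, spans, types):
    """Combine detected spans and predicted types into full B-C/I-C/O tags."""
    full = ["O"] * length
    for (start, end), etype in zip(spans, types):
        full[start] = f"B-{etype}"
        for i in range(start + 1, end):
            full[i] = f"I-{etype}"
    return full


def toy_classifier(tokens, span):
    """Hypothetical stand-in for the second-phase (NEC) model."""
    text = " ".join(tokens[span[0]:span[1]])
    return "DNA" if text.endswith("gene") else "protein"


tokens = ["IL-2", "gene", "expression", "requires", "NF-kappa", "B"]
gold = ["B-DNA", "I-DNA", "O", "O", "B-protein", "I-protein"]

ned_tags = collapse_to_ned(gold)           # phase 1 output (simulated)
spans = extract_spans(ned_tags)            # [(0, 2), (4, 6)]
types = [toy_classifier(tokens, s) for s in spans]  # phase 2 output
print(merge(len(tokens), spans, types))
```

Because the first phase only sees B/I/O tags, its training data is obtained from the combined annotation with `collapse_to_ned`, and the second phase only has to choose a class for each span.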
The remainder of this paper is organized as follows. Section 2 describes our methods in detail. Section 3 presents the experiments and analyzes the results. Section 4 compares our two-phase approach with other systems. Finally, Section 5 gives conclusions and future work.
Methods
CRFs were introduced by Lafferty et al. (2001) to solve sequence labeling problems. They represent the state of the art in sequence labeling and have shown good performance on the NER task.
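For reference, the standard linear-chain form of a CRF can be written as below. This is the textbook formulation, given here as an assumption about the model family rather than as the exact parameterization used in this paper. The symbols are: x the token sequence, y the label sequence of length T, f_k the feature functions and λ_k their learned weights.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \right),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \right)
```

The normalizer Z(x) sums over all label sequences, and the dynamic programming used to compute it scales with the square of the number of labels, which is why a smaller label set shortens training.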
Experiments and results
All of our experiments are based on the JNLPBA2004 datasets. The training data of the shared task is the GENIA corpus v3.02 (Kim et al., 2003), which consists of 2000 abstracts retrieved from MEDLINE using the MeSH terms “human”, “blood cells” and “transcription factors”. The test data consists of 404 abstracts, half from the same domain as the training data and half from the super-domain of “blood cells” and “transcription factors”. In JNLPBA2004, the participating systems were
Comparisons
We compare our two-phase approach with the one-phase approach in Table 8. All experiments here are carried out on the same machine, with a 2.2 GHz CPU and 2 GB of RAM. The experimental results show that our two-phase approach achieves better performance and reduces the training time.
Related work has also explored identifying biomedical entities with two-phase approaches. Lee et al. (2003) present a two-phase Bio-NER method based on SVM, which is the first
Conclusions and future work
This paper presents a two-phase approach for extracting biomedical entities from unstructured text using two CRF models. Dividing Bio-NER into two subtasks allows more relevant features to be selected for each subtask and reduces the training time. We demonstrate the effectiveness of our approach on the JNLPBA2004 datasets and compare it with related work. The results show that our two-phase approach achieves better performance, with an F-score of 74.31%. However, the
References (20)
- et al. Exploiting the performance of dictionary-based bio-entity name recognition in biomedical literature. Comput. Biol. Chem. (2008)
- et al. A cascaded approach to biomedical named entity recognition using a unified model
- et al. A survey of current work in biomedical text mining. Brief Bioinform. (2005)
- et al. A system for identifying named entities in biomedical text: how results from two evaluations reflect on both the system and the evaluations. Comp. Funct. Genom. (2005)
- et al. Exploiting context for biomedical entity recognition: from syntax to the web
- et al. GENIA corpus—a semantically annotated corpus for bio-text mining. Bioinformatics (2003)
- et al. Introduction to the bioentity recognition task at JNLPBA
- et al. Experimental study on a two phase method for biomedical named entity recognition. IEICE-Trans. Info Syst. (2007)
- et al. Integrated annotation for biomedical information extraction
- et al. Conditional random fields: probabilistic models for segmenting and labeling sequence data