Classifier subset selection for biomedical named entity recognition

Dimililer, Nazife; Varoğlu, Ekrem; Altınçay, Hakan

doi:10.1007/s10489-008-0124-0


Published: 08 March 2008

Applied Intelligence, Volume 31, pages 267–282 (2009)

Nazife Dimililer¹,
Ekrem Varoğlu¹ &
Hakan Altınçay¹


Abstract

A classifier ensembling approach is considered for the biomedical named entity recognition task. A vote-based classifier selection scheme is developed whose search complexity lies between that of static classifier selection and that of real-valued, class-dependent weighting approaches. Assuming that the reliability of each classifier's predictions differs among classes, the proposed approach selects classifiers by taking their individual votes into account. For this purpose, a wide set of classifiers is generated, each based on a different feature set and modeling parameter setting. A genetic algorithm is developed to label the predictions of these classifiers as reliable or not. During testing, the votes labeled as reliable are combined using weighted majority voting. The classifier ensemble formed by the proposed scheme surpasses the full object F-score of the best individual classifier by 2.75%, which is the highest score achieved on the data set considered.
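The scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the per-classifier, per-class binary reliability mask, the function names (`combine`, `fitness`, `evolve`), the classifier weights, the genetic operators (truncation selection, uniform crossover, bit-flip mutation), and the use of token accuracy in place of the full object F-score are all assumptions made here for illustration.

```python
import random

def combine(votes, mask, weights, n_classes):
    """Weighted majority vote over the votes labeled reliable.

    votes[c]   : class index predicted by classifier c for one token
    mask[c][k] : 1 if classifier c's votes for class k are labeled reliable
    weights[c] : positive voting weight of classifier c (hypothetical,
                 e.g. a validation score)
    """
    scores = [0.0] * n_classes
    for c, k in enumerate(votes):
        if mask[c][k]:
            scores[k] += weights[c]
    # Assumed fallback: if every vote was discarded, trust the
    # highest-weight classifier.
    if max(scores) == 0.0:
        return votes[max(range(len(votes)), key=lambda c: weights[c])]
    return max(range(n_classes), key=lambda k: scores[k])

def fitness(mask, all_votes, gold, weights, n_classes):
    """Fraction of validation tokens labeled correctly.

    A simple stand-in for the F-score used in the paper.
    """
    hits = sum(combine(v, mask, weights, n_classes) == g
               for v, g in zip(all_votes, gold))
    return hits / len(gold)

def evolve(all_votes, gold, weights, n_classes, pop_size=20,
           generations=30, mutation_rate=0.05, seed=0):
    """Search for a reliability mask with a small genetic algorithm."""
    rng = random.Random(seed)
    n_clf = len(weights)

    def random_mask():
        return [[rng.randint(0, 1) for _ in range(n_classes)]
                for _ in range(n_clf)]

    def score(m):
        return fitness(m, all_votes, gold, weights, n_classes)

    population = [random_mask() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[:pop_size // 2]  # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover per bit, followed by bit-flip mutation.
            child = [[(x if rng.random() < 0.5 else y)
                      ^ (rng.random() < mutation_rate)
                      for x, y in zip(row_a, row_b)]
                     for row_a, row_b in zip(a, b)]
            children.append(child)
        population = parents + children
    return max(population, key=score)
```

In this sketch each chromosome carries one bit per (classifier, class) pair, so the search space is larger than choosing a fixed subset of classifiers but smaller than tuning a real-valued weight for every pair, which mirrors the intermediate search complexity the abstract describes.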



Author information

Authors and Affiliations

Department of Computer Engineering, Eastern Mediterranean University, Mağusa, Northern Cyprus
Nazife Dimililer, Ekrem Varoğlu & Hakan Altınçay

Authors

Nazife Dimililer
Ekrem Varoğlu
Hakan Altınçay

Corresponding author

Correspondence to Hakan Altınçay.


About this article

Cite this article

Dimililer, N., Varoğlu, E. & Altınçay, H. Classifier subset selection for biomedical named entity recognition. Appl Intell 31, 267–282 (2009). https://doi.org/10.1007/s10489-008-0124-0


Received: 03 October 2007
Accepted: 21 February 2008
Published: 08 March 2008
Issue Date: December 2009
DOI: https://doi.org/10.1007/s10489-008-0124-0
