Abstract
A classifier ensembling approach is considered for the biomedical named entity recognition task. A vote-based classifier selection scheme is developed whose search complexity lies between that of static classifier selection and that of real-valued, class-dependent weighting approaches. Assuming that the reliability of each classifier's predictions differs among classes, the proposed approach selects classifiers by taking their individual votes into account. For this purpose, a wide set of classifiers is generated, each based on a different set of features and a different modeling parameter setting. A genetic algorithm is developed to label the predictions of these classifiers as reliable or not. During testing, the votes labeled as reliable are combined using weighted majority voting. The classifier ensemble formed by the proposed scheme surpasses the full object F-score of the best individual classifier by 2.75%, and its score is the highest achieved on the data set considered.
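The test-time combination step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `combine_votes`, the per-classifier sets of trusted labels (standing in for the reliable/unreliable labeling that the genetic algorithm would produce), the example weights, and the fallback label "O" are all assumptions made for illustration.

```python
from collections import defaultdict


def combine_votes(votes, reliable, weights, default="O"):
    """Weighted majority voting restricted to votes marked as reliable.

    votes    -- predicted label from each classifier for one token
    reliable -- reliable[i] is the set of labels for which classifier i's
                vote is trusted (hypothetical stand-in for the GA output)
    weights  -- one non-negative weight per classifier (assumed values)
    default  -- label returned when no vote is reliable (an assumption)
    """
    scores = defaultdict(float)
    for i, label in enumerate(votes):
        # A vote counts only if classifier i is trusted for this label.
        if label in reliable[i]:
            scores[label] += weights[i]
    if not scores:
        return default
    # Iterate labels in sorted order so ties are broken deterministically.
    return max(sorted(scores), key=scores.get)


# Three hypothetical classifiers voting on a single token.
votes = ["B-protein", "O", "B-protein"]
weights = [0.6, 0.7, 0.9]

# Classifier 2 is not trusted when it predicts "B-protein".
reliable = [{"B-protein"}, {"O"}, {"O"}]
masked = combine_votes(votes, reliable, weights)

# Baseline: every vote is trusted (plain weighted majority voting).
trust_all = [{"B-protein", "O"}] * 3
unmasked = combine_votes(votes, trust_all, weights)

print(masked, unmasked)
```

In this toy case the third classifier's vote is discarded, so the masked decision differs from plain weighted majority voting. Under this reading, a genetic algorithm would search over the contents of `reliable`, encoded as one bit per classifier and class, using a validation score as fitness.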
Cite this article
Dimililer, N., Varoğlu, E. & Altınçay, H. Classifier subset selection for biomedical named entity recognition. Appl Intell 31, 267–282 (2009). https://doi.org/10.1007/s10489-008-0124-0
DOI: https://doi.org/10.1007/s10489-008-0124-0