Classifier subset selection for biomedical named entity recognition

Applied Intelligence

Abstract

A classifier ensembling approach is considered for the biomedical named entity recognition task. A vote-based classifier selection scheme is developed whose search complexity lies between that of static classifier selection and that of real-valued, class-dependent weighting approaches. Assuming that the reliability of each classifier's predictions differs among classes, the proposed approach selects classifiers by taking their individual votes into account. A wide set of classifiers, each based on a different feature set and modeling parameter setting, is generated for this purpose. A genetic algorithm is developed to label the predictions of these classifiers as reliable or not. During testing, the votes labeled as reliable are combined using weighted majority voting. The classifier ensemble formed by the proposed scheme surpasses the full object F-score of the best individual classifier by 2.75%, and its score is the highest achieved on the data set considered.
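
The test-time combination step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the class labels, the weights, the fallback to a default label when no vote is reliable, and the alphabetical tie-break are all assumptions made for illustration. The reliability labeling, which the paper obtains with a genetic algorithm, is simply given here as a fixed input.

```python
from collections import defaultdict


def combine_reliable_votes(votes, reliable, weights, default="O"):
    """Weighted majority vote over the votes marked as reliable.

    votes[i]    -- class label predicted by classifier i for one token
    reliable[i] -- set of class labels for which classifier i is trusted
                   (assumed given; the paper searches for it with a GA)
    weights[i]  -- voting weight of classifier i (hypothetical values)
    default     -- label returned when no vote is reliable (an assumption)
    """
    scores = defaultdict(float)
    for i, label in enumerate(votes):
        # A vote counts only if this classifier is trusted for this class.
        if label in reliable[i]:
            scores[label] += weights[i]
    if not scores:
        return default
    # Highest total weight wins; ties go to the alphabetically first label.
    return max(sorted(scores), key=scores.get)


# Illustrative data: four classifiers voting on a single token.
votes = ["B-protein", "B-DNA", "B-protein", "O"]
reliable = [{"B-protein", "O"}, {"B-DNA"}, {"O"}, {"O"}]
weights = [0.6, 0.7, 0.9, 0.5]

decision = combine_reliable_votes(votes, reliable, weights)
print(decision)
```

In this example the third classifier votes for `B-protein` with the largest weight, but that vote is discarded because the classifier is not marked reliable for that class. The remaining totals are 0.6 for `B-protein`, 0.7 for `B-DNA`, and 0.5 for `O`, so the decision is `B-DNA`, whereas an unfiltered weighted vote would have chosen `B-protein`.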

Author information

Corresponding author

Correspondence to Hakan Altınçay.

About this article

Cite this article

Dimililer, N., Varoğlu, E. & Altınçay, H. Classifier subset selection for biomedical named entity recognition. Appl Intell 31, 267–282 (2009). https://doi.org/10.1007/s10489-008-0124-0

  • DOI: https://doi.org/10.1007/s10489-008-0124-0
