Information theoretic-PSO-based feature selection: an application in biomedical entity extraction

  • Regular Paper
Knowledge and Information Systems

Abstract

Named entity recognition is a vital task for various applications related to biomedical natural language processing. It aims at extracting different biomedical entities from text and classifying them into predefined categories. The entity types vary with genre and domain, such as gene versus non-gene in a coarse-grained scenario, or protein, DNA, RNA, cell line, and cell type in a fine-grained scenario. In this paper, we present a novel filter-based feature selection technique that uses the search capability of particle swarm optimization (PSO) to determine an optimal feature combination. The technique yields an optimized feature set that, when used for classifier learning, enhances system performance. The proposed approach is assessed on four popular biomedical corpora, namely GENIA, GENETAG, AIMed, and Biocreative-II Gene Mention Recognition (BC-II). Our proposed model obtains F-score values of \(74.49\%\), \(91.11\%\), \(90.47\%\), and \(88.64\%\) on the GENIA, GENETAG, AIMed, and BC-II datasets, respectively. The efficiency of feature pruning through PSO is evident from the significant performance gains, even with a much reduced set of features.
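To make the idea of a filter-based PSO search concrete, the following is a minimal sketch and not the authors' formulation. The abstract does not state the information-theoretic criterion, so the sketch assumes an mRMR-style score (mean mutual information with the labels minus mean pairwise mutual information among the selected features) and a sigmoid-based binary PSO update. All function names, parameter values, and the toy data are hypothetical.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def make_fitness(columns, labels):
    """Assumed filter score: mean relevance minus mean pairwise redundancy."""
    k = len(columns)
    relevance = [mutual_information(col, labels) for col in columns]
    redundancy = [[mutual_information(columns[i], columns[j]) for j in range(k)]
                  for i in range(k)]

    def fitness(mask):
        chosen = [i for i, bit in enumerate(mask) if bit]
        if not chosen:
            return float("-inf")  # an empty feature subset is never acceptable
        rel = sum(relevance[i] for i in chosen) / len(chosen)
        pairs = [(i, j) for i in chosen for j in chosen if i < j]
        red = sum(redundancy[i][j] for i, j in pairs) / len(pairs) if pairs else 0.0
        return rel - red

    return fitness


def binary_pso(fitness, n_features, n_particles=12, n_iters=40,
               w=0.7, c1=1.5, c2=1.5, v_max=4.0, seed=0):
    """Binary PSO: velocities are real-valued, positions are 0/1 feature masks."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for p in range(n_particles):
            for d in range(n_features):
                v = (w * vel[p][d]
                     + c1 * rng.random() * (pbest[p][d] - pos[p][d])
                     + c2 * rng.random() * (gbest[d] - pos[p][d]))
                vel[p][d] = max(-v_max, min(v_max, v))
                # The sigmoid maps the velocity to the probability of a 1 bit.
                prob = 1.0 / (1.0 + math.exp(-vel[p][d]))
                pos[p][d] = 1 if rng.random() < prob else 0
            fit = fitness(pos[p])
            if fit > pbest_fit[p]:
                pbest[p], pbest_fit[p] = pos[p][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[p][:], fit
    return gbest, gbest_fit


# Toy data (hypothetical): feature 0 copies the label, feature 1 duplicates
# feature 0 (fully redundant), feature 2 is independent of the label.
labels = [0, 1] * 8
columns = [labels[:], labels[:], [0, 0, 1, 1] * 4]
fitness = make_fitness(columns, labels)
best_mask, best_fit = binary_pso(fitness, n_features=len(columns))
print("selected mask:", best_mask, "fitness:", round(best_fit, 4))
```

Because the score is computed from the data alone, no classifier is trained inside the search loop, which is what distinguishes a filter approach from a wrapper approach. In this toy setting the score rewards keeping one copy of the informative feature and penalizes keeping its duplicate.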

Notes

  1. http://www.nactem.ac.uk/GENIA/tagger/.

  2. https://www.taku910.github.io/crfpp/.

  3. ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/.

  4. http://www.nactem.ac.uk/tsujii/GENIA/ERtask/report.html.

  5. https://sourceforge.net/projects/biocreative/files/biocreative2entitytagging/1.1/.

  6. The details are provided in Supplementary.

Author information

Corresponding author

Correspondence to Shweta Yadav.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 202 KB)

About this article

Cite this article

Yadav, S., Ekbal, A. & Saha, S. Information theoretic-PSO-based feature selection: an application in biomedical entity extraction. Knowl Inf Syst 60, 1453–1478 (2019). https://doi.org/10.1007/s10115-018-1265-z

  • DOI: https://doi.org/10.1007/s10115-018-1265-z
