Abstract
Named entity recognition is a vital task for many applications in biomedical natural language processing. It aims to extract biomedical entities from text and classify them into predefined categories. The entity types vary with genre and domain: gene versus non-gene in a coarse-grained scenario, or protein, DNA, RNA, cell line, and cell type in a fine-grained scenario. In this paper, we present a novel filter-based feature selection technique that uses the search capability of particle swarm optimization (PSO) to determine the optimal feature combination. The technique yields an optimized feature set that, when used to train the classifiers, enhances system performance. The proposed approach is assessed on four popular biomedical corpora, namely GENIA, GENETAG, AIMed, and BioCreative-II Gene Mention Recognition (BC-II). Our proposed model obtains F-score values of \(74.49\%\), \(91.11\%\), \(90.47\%\), and \(88.64\%\) on the GENIA, GENETAG, AIMed, and BC-II datasets, respectively. The effectiveness of PSO-based feature pruning is evident in the significant performance gains obtained even with a much-reduced set of features.
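The abstract does not spell out the fitness function or the particle update rule, so the following is only a minimal sketch of what a filter-based, information-theoretic PSO feature selector could look like. It assumes a binary PSO with a sigmoid transfer function and a mutual-information score (mean relevance to the labels minus mean pairwise redundancy). The function names, hyperparameters, and toy data are all illustrative assumptions, not the authors' implementation.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def filter_fitness(mask, features, labels):
    """Assumed filter criterion: mean relevance minus mean pairwise redundancy."""
    chosen = [f for f, bit in zip(features, mask) if bit]
    if not chosen:
        return float("-inf")  # an empty feature subset is never acceptable
    relevance = sum(mutual_information(f, labels) for f in chosen) / len(chosen)
    pairs = [(a, b) for i, a in enumerate(chosen) for b in chosen[i + 1:]]
    redundancy = (sum(mutual_information(a, b) for a, b in pairs) / len(pairs)
                  if pairs else 0.0)
    return relevance - redundancy


def binary_pso(features, labels, n_particles=20, n_iters=50,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Binary PSO: each particle is a 0/1 mask over the feature columns."""
    rng = random.Random(seed)
    dim = len(features)
    pos = [[rng.randint(0, 1) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [filter_fitness(p, features, labels) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                # Standard velocity update: inertia + cognitive + social terms.
                v = (w * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-4.0, min(4.0, v))  # clamp to avoid saturation
                # The sigmoid turns the velocity into the probability of bit = 1.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            fit = filter_fitness(pos[i], features, labels)
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


# Toy data (hypothetical): one informative feature, one noisy copy of it,
# one feature independent of the labels, and one constant feature.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = [
    [0, 0, 0, 0, 1, 1, 1, 1],  # identical to the labels
    [0, 0, 0, 1, 1, 1, 1, 0],  # noisy, partly redundant copy
    [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the labels
    [0, 0, 0, 0, 0, 0, 0, 0],  # constant, carries no information
]
best_mask, best_score = binary_pso(features, labels)
print(best_mask, best_score)
```

Because the score is computed from the data alone, no classifier is trained during the search, which is what makes this a filter rather than a wrapper method. Note that this particular mean-based score strongly favours compact subsets; a practical system would likely use a richer criterion and then train the sequence classifier on the selected columns.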
Cite this article
Yadav, S., Ekbal, A. & Saha, S. Information theoretic-PSO-based feature selection: an application in biomedical entity extraction. Knowl Inf Syst 60, 1453–1478 (2019). https://doi.org/10.1007/s10115-018-1265-z
DOI: https://doi.org/10.1007/s10115-018-1265-z