Abstract
Entity extraction is an important step in biomedical text mining. Among many other challenges, there are two very crucial issues, viz. determining the most applicable feature set so that the model can be precise and less complex, and adapting the system across multiple benchmark corpora. In this paper, we propose a novel method for feature selection using the search capability of particle swarm optimization. The compact feature set used for training the classifier yields much better results when compared to the baseline model, which was developed with a complete set of features. A large number of features suitable for named entity recognition task from biomedical domain are also developed in the current paper. The complete set of features is implemented by studying the properties of datasets and from the domain knowledge. We have used conditional random field, a robust classifier as the underlying learning algorithm which has shown success in solving similar kinds of problems. Our experiments on multiple benchmark corpora yield the level of performance which are at par the state-of-the-art techniques.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Notes
A part of training set is used as validation set. We have divided the original training set into two sets: validation set and new training set.
I, O and B represent the intermediate, outside and beginning token of a NE.
References
Aghdam MH, Heidari S (2015) Feature selection using particle swarm optimization in text categorization. J Artif Intell Soft Comput Res 5(4):231–238
Alatas B, Akin E (2008) Rough particle swarm optimization and its applications in data mining. Soft Comput 12(12):1205–1218
Ando RK (2007) Biocreative II gene mention tagging system at IBM Watson. In: Proceedings of the second biocreative challenge evaluation workshop, vol 23, pp 101–103
Baumgartner Jr WA, Lu Z, Johnson HL, Caporaso JG, Paquette J, Lindemann A (2007) An integrated approach to concept recognition in biomedical text. In: Proceedings of the second biocreative challenge evaluation workshop, vol 23, pp 257–271
Berger AL, Pietra VJD, Pietra SAD (1996) A maximum entropy approach to natural language processing. Comput Linguist 22(1):39–71
Bickel S, Brefeld U, Faulstich L, Hakenberg J, Leser U, Plake C (2004) A support vector machine classifier for gene name recognition. In: Embo workshop: a critical assessment of text mining methods in molecular biology, Granada, Spain
Cagnina LC, Errecalde ML, Ingaramo DA, Rosso P (2008) A discrete particle swarm optimizer for clustering short-text corpora. In: Proceedings of bioinspired optimization methods and their applications, BIOMA-2008, Ljubljana, Slovenia
Chen W-N, Zhang J, Lin Y, Chen N, Zhan Z-H, Chung HS-H (2013) Particle swarm optimization with an aging leader and challengers. IEEE Trans Evol Comput 17(2):241–258
Chinnaswamy A, Srinivasan R (2016) Hybrid feature selection using correlation coefficient and particle swarm optimization on microarray gene expression data. In: Innovations in bio-inspired computing and applications. Springer, Berlin, pp 229–239
Chuang L-Y, Chang H-W, Tu C-J, Yang C-H (2008) Improved binary PSO for feature selection using gene expression data. Comput Biol Chem 32(1):29–38
Correa ES, Freitas AA, Johnson CG (2006) A new discrete particle swarm algorithm applied to attribute selection in a bioinformatics data set. In: Proceedings of the 8th annual conference on genetic and evolutionary computation, pp 35–42
Das S (2001) Filters, wrappers and a boosting-based hybrid for feature selection. In: ICML, vol 1, pp 74–81
Ding C, Peng H (2005) Minimum redundancy feature selection from microarray gene expression data. J Bioinform Comput Biol 3(02):185–205
Eberhart RC, Shi Y (1998) Comparison between genetic algorithms and particle swarm optimization. In: International conference on evolutionary programming, pp 611–616
Ekbal A, Saha S (2013) Stacked ensemble coupled with feature selection for biomedical entity extraction. Knowl Based Syst 46:22–32
Ekbal A, Saha S, Sikdar UK (2013) Biomedical named entity extraction: some issues of corpus compatibilities. SpringerPlus 2(1):1
Finkel J, Dingare S, Manning CD, Nissim M, Alex B, Grover C (2005) Exploring the boundaries: gene and protein identification in biomedical text. BMC Bioinform 6(Suppl 1):S5
Finkel J, Dingare S, Nguyen H, Nissim M, Manning C, Sinclair G (2004) Exploiting context for biomedical entity recognition: from syntax to the web. In: Proceedings of the international joint workshop on natural language processing in biomedicine and its applications, pp 88–91
Ganchev K, Crammer K, Pereira F, Mann G, Bellare K, McCallum A, Carroll S, Jin Y, White P (2007) Penn/umass/chop biocreative II systems. In: Proceedings of the second biocreative challenge evaluation workshop, vol 23. pp 119–124
Ghamisi P, Benediktsson JA (2015) Feature selection based on hybridization of genetic algorithm and particle swarm optimization. IEEE Geosci Remote Sens Lett 12(2):309–313
Grover C, Haddow B, Klein E, Matthews M, Nielsen LA, Tobin R (2007) Adapting a relation extraction pipeline for the biocreative II task. In: Proceedings of the biocreative II workshop, vol 2
GuoDong Z, Jian S (2004) Exploring deep knowledge resources in biomedical name recognition. In: Proceedings of the international joint workshop on natural language processing in biomedicine and its applications, pp 96–99
Gupta DK, Reddy KS, Ekbal A (2015) Pso-asent: feature selection using particle swarm optimization for aspect based sentiment analysis. In: International conference on applications of natural language to information systems, pp 220–233
Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46(1–3):389–422
Holland JH (1992) Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT Press, Cambridge, MA, USA
Hsieh S-T, Sun T-Y, Liu C-C, Tsai S-J (2009) Efficient population utilization strategy for particle swarm optimizer. IEEE Trans Syst Man Cybern Part B (Cybern) 39(2):444–456
Huang H-S, Lin Y-S, Lin K-T, Kuo C-J, Chang Y-M, Yang B-H (2007) High-recall gene mention recognition by unification of multiple backward parsing models. In: Proceedings of the second biocreative challenge evaluation workshop, vol 23, pp 109–111
Juang C-F (2004) A hybrid of genetic algorithm and particle swarm optimization for recurrent network design. IEEE Trans Syst Man Cybern Part B (Cybern) 34(2):997–1006
Kao Y-T, Zahara E (2008) A hybrid genetic algorithm and particle swarm optimization for multimodal functions. Appl Soft Comput 8(2):849–857
Kennedy J, Eberhart RC (1997) A discrete binary version of the particle swarm algorithm. In: 1997 IEEE international conference on systems, man, and cybernetics, 1997. Computational cybernetics and simulation, vol 5, pp 4104–4108
Kim J-D, Ohta T, Tsuruoka Y, Tateisi Y, Collier N (2004) Introduction to the bio-entity recognition task at JNLPBA. In: Proceedings of the international joint workshop on natural language processing in biomedicine and its applications, pp 70–75
Kim S, Yoon J, Park K-M, Rim H-C (2005) Two-phase biomedical named entity recognition using a hybrid method. In: Natural language processing—IJCNLP 2005. Springer, Berlin, pp 646–657
Kinoshita S, Cohen KB, Ogren PV, Hunter L (2005) Biocreative Task1A: entity identification with a stochastic tagger. BMC Bioinform 6(Suppl 1):S4
Kittler J (1978) Feature set search algorithms. In: Pattern recognition and signal processing, pp 41–60
Klinger R, Friedrich CM, Fluck J, Hofmann-Apitius M (2007) Named entity recognition with combinations of conditional random fields. In: Proceedings of the second biocreative challenge evaluation workshop
Krisshna NA, Deepak VK, Manikantan K, Ramachandran S (2014) Face recognition using transform domain feature extraction and pso-based feature selection. Appl Soft Comput 22:141–161
Kumar A, Patidar V, Khazanchi D, Saini P (2016) Optimizing feature selection using particle swarm optimization and utilizing ventral sides of leaves for plant leaf classification. Procedia Comput Sci 89:324–332
Kuo C-J, Chang Y-M, Huang H-S, Lin K-T, Yang B-H, Lin Y-S (2007) Rich feature set, unification of bidirectional parsing and dictionary filtering for high f-score gene mention tagging. In: Proceedings of the second biocreative challenge evaluation workshop, vol 23, pp 105–107
Lafferty JD, McCallum A, Pereia FCN (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, Morgan Kaufmann Publishers Inc. San Francisco, CA, USA, pp 282–289. http://dl.acm.org/citation.cfm?id=645530.655813
Lin S-W, Lee Z-J, Chen S-C, Tseng T-Y (2008) Parameter determination of support vector machine and feature selection using simulated annealing approach. Appl Soft Comput 8(4):1505–1512
Lin S-W, Ying K-C, Chen S-C, Lee Z-J (2008) Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert Syst Appl 35(4):1817–1824
Liu H, Torii M, Hu Z, Wu C (2007) Gene mention and gene normalization based on machine learning and online resources. In: Proceedings of the second biocreative challenge workshop, pp 135–140
Liu Y, Wang G, Chen H, Dong H, Zhu X, Wang S (2011) An improved particle swarm optimization for feature selection. J Bionic Eng 8(2):191–200
Liu Z, Liu S, Liu L, Sun J, Peng X, Wang T (2016) Sentiment recognition of online course reviews using multi-swarm optimization-based selected features. Neurocomputing 185:11–20
Lu Y, Liang M, Ye Z, Cao L (2015) Improved particle swarm optimization algorithm and its application in text feature selection. Appl Soft Comput 35:629–636
McDonald R, Pereira F (2005) Identifying gene and protein mentions in text using conditional random fields. BMC Bioinform 6(Suppl 1):S6
Merwe D, Van der Engelbrecht AP (2003) Data clustering using particle swarm optimization. In: The 2003 Congress on evolutionary computation, 2003. CEC’03, vol 1, pp 215–220
Mitsumori T, Fation S, Murata M, Doi K, Doi H (2005) Gene/protein name recognition based on support vector machine using dictionary as features. BMC Bioinform 6(Suppl 1):S8
Park K-M, Kim S-H, Rim H-C, Hwang Y-S (2006) ME-based biomedical named entity recognition using lexical knowledge. ACM Trans Asian Lang Inf Process (TALIP) 5(1):4–21
Pedersen MEH (2010) Good parameters for particle swarm optimization. Hvass Lab., Copenhagen, Denmark, Tech. Rep. HL1001
Peram T, Veeramachaneni K, Mohan CK (2003) Fitness-distance-ratio based particle swarm optimization. In: Proceedings of the 2003 IEEE Swarm intelligence symposium, 2003. SIS’03
Ponomareva N, Pla F, Molina A, Rosso P (2007) Biomedical named entity recognition: a poor knowledge hmm-based approach. In: Natural language processing and information systems. Springer, Berlin, pp 382–387
Rabiner L, Juang B-H (1993) Fundamentals of speech recognition. Prentice-Hall, Inc., NJ, USA
Ramadan RM, Abdel-Kader RF (2009) Face recognition using particle swarm optimization-based selected features. Int J Signal Process Image Process Pattern Recognit 2(2):51–65
Saha SK, Sarkar S, Mitra P (2009) Feature selection techniques for maximum entropy based biomedical named entity recognition. J Biomed Inform 42(5):905–911
Samadzadegan F, Saeedi S (2009) Clustering of lidar data using particle swarm optimization algorithm in urban area. Laserscanning 09(38):334–339
Settles B (2004) Biomedical named entity recognition using conditional random fields and rich feature sets. In: Proceedings of the international joint workshop on natural language processing in biomedicine and its applications, pp 104–107
Shang L, Zhou Z, Liu X (2016) Particle swarm optimization-based feature selection in sentiment classification. Soft Comput 20(10):1–14. doi:10.1007/s00500-016-2093-2
Shanno DF (1970) Conditioning of quasi-Newton methods for function minimization. Math Comput 24(111):647–656
Shapiro SS, Wilk MB (1965) An analysis of variance test for normality (complete samples). Biometrika 52(3/4):591–611
Sheikhpour R, Sarram MA, Sheikhpour R (2016) Particle swarm optimization for bandwidth determination and feature selection of kernel density estimation based classifiers in diagnosis of breast cancer. Appl Soft Comput 40:113–131
Shi Y, Eberhart R (1998) A modified particle swarm optimizer. In: The 1998 IEEE international conference on evolutionary computation proceedings, 1998. IEEE World Congress on computational intelligence, pp 69–73
Shi Y, Eberhart RC (2001) Fuzzy adaptive particle swarm optimization. In: Proceedings of the 2001 Congress on evolutionary computation, vol 1, pp 101–106
Skalak DB (1994) Prototype and feature selection by sampling and random mutation hill climbing algorithms. In: Proceedings of the eleventh international conference on machine learning, pp 293–301
Song Y, Kim E, Lee GG, Yi B-k (2004) POSBIOTM-NER in the shared task of BioNLP/NLPBA 2004. In: Proceedings of the international joint workshop on natural language processing in biomedicine and its applications, pp 100–103
Struble CA, Povinelli RJ, Johnson MT, Berchanskiy D, Tao J, Trawicki M (2007) Combined conditional random fields and n-gram language models for gene mention recognition. In: Proceedings of the second biocreative challenge evaluation workshop; 23–25 April 2007; Madrid, Spain, pp 81–83
Tran B, Xue B, Zhang M (2014) Overview of particle swarm optimisation for feature selection in classification. In: Asia-Pacific conference on simulated evolution and learning, pp 605–617
Vlachos A (2007) Tackling the biocreative2 gene mention task with conditional random fields and syntactic parsing. In: Proceedings of the second biocreative challenge evaluation workshop; 23–25 April 2007; Madrid, Spain, pp 85–87
Wang H, Zhao T, Tan H, Zhang S (2008) Biomedical named entity recognition based on classifiers ensemble. IJCSA 5(2):1–11
Xi M-L, Sun J, Wu Y (2010) Quantum-behaved particle swarm optimization with binary encoding. Control Decis 1:019
Yan X, Wu Q, Liu H, Huang W (2013) An improved particle swarm optimization algorithm and its application. Int J Comput Sci Issues (IJCSI) 10(1):316–324
Yu L, Liu H (2004) Efficient feature selection via analysis of relevance and redundancy. J Mach Learn Res 5:1205–1224
Zhang J-R, Zhang J, Lok T-M, Lyu MR (2007) A hybrid particle swarm optimization-back-propagation algorithm for feedforward neural network training. Appl Math Comput 185(2):1026–1037
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants performed by any of the authors.
Additional information
Communicated by V. Loia.
Rights and permissions
About this article
Cite this article
Yadav, S., Ekbal, A. & Saha, S. Feature selection for entity extraction from multiple biomedical corpora: A PSO-based approach. Soft Comput 22, 6881–6904 (2018). https://doi.org/10.1007/s00500-017-2714-4
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00500-017-2714-4