
Feature selection for entity extraction from multiple biomedical corpora: A PSO-based approach

  • Methodologies and Application
Soft Computing

Abstract

Entity extraction is an important step in biomedical text mining. Two issues are especially crucial: determining the most applicable feature set, so that the model is both precise and less complex, and adapting the system across multiple benchmark corpora. In this paper, we propose a novel feature selection method that exploits the search capability of particle swarm optimization (PSO). The compact feature set it selects, when used to train the classifier, yields much better results than the baseline model, which was developed with the complete set of features. We also develop a large number of features suited to the named entity recognition task in the biomedical domain; the complete feature set is derived from the properties of the datasets and from domain knowledge. As the underlying learning algorithm we use a conditional random field (CRF), a robust classifier that has proved successful on similar problems. Our experiments on multiple benchmark corpora yield performance on par with state-of-the-art techniques.
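To make the idea of PSO-driven feature selection concrete, the following is a minimal sketch of a generic binary PSO, in which each particle is a 0/1 mask over the features and a sigmoid of the velocity gives the probability of switching a feature on. It is not the authors' implementation: the function names, the parameter values (`w`, `c1`, `c2`, `v_max`) and the toy fitness function are all assumptions made for illustration. In a wrapper setting the fitness would typically be a validation-set score of the trained classifier; here a cheap stand-in rewards three hypothetical "useful" features and penalizes mask size.

```python
import math
import random


def binary_pso(fitness, n_features, n_particles=12, n_iters=30,
               w=0.7, c1=1.5, c2=1.5, v_max=4.0, seed=0):
    """Maximize fitness(mask) over 0/1 feature masks with a generic binary PSO.

    All default parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    positions = [[rng.randint(0, 1) for _ in range(n_features)]
                 for _ in range(n_particles)]
    velocities = [[0.0] * n_features for _ in range(n_particles)]

    # Personal bests start at the initial positions.
    pbest = [p[:] for p in positions]
    pbest_fit = [fitness(p) for p in positions]

    # Global best is the best of the personal bests.
    g = max(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                v = (w * velocities[i][d]
                     + c1 * r1 * (pbest[i][d] - positions[i][d])
                     + c2 * r2 * (gbest[d] - positions[i][d]))
                v = max(-v_max, min(v_max, v))  # clamp the velocity
                velocities[i][d] = v
                # Sigmoid of the velocity = probability that the bit is 1.
                prob_one = 1.0 / (1.0 + math.exp(-v))
                positions[i][d] = 1 if rng.random() < prob_one else 0

            fit = fitness(positions[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = positions[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = positions[i][:], fit

    return gbest, gbest_fit


# Hypothetical stand-in for a classifier's validation score.
USEFUL = {0, 2, 5}


def toy_fitness(mask):
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in USEFUL)
    return hits - 0.1 * sum(mask)  # reward useful features, penalize size


best_mask, best_fit = binary_pso(toy_fitness, n_features=8)
print(best_mask, best_fit)
```

The size penalty in `toy_fitness` mirrors the stated goal of a model that is both precise and less complex; with a real classifier, each fitness call would involve training and scoring a model, so the particle and iteration counts would need to be chosen with that cost in mind.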


Notes

  1. Part of the training set is used as a validation set: we divided the original training set into a validation set and a new training set.

  2. http://www.nactem.ac.uk/GENIA/tagger/.

  3. https://taku910.github.io/crfpp/.

  4. http://www.nactem.ac.uk/tsujii/GENIA/ERtask/report.html.

  5. http://www.geniaproject.org/shared-tasks/bionlp-jnlpba-shared-task-2004

  6. ftp://ftp.cs.utexas.edu/pub/mooney/bio-data/.

  7. ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/GENETAG.tar.gz.

  8. http://biocreative.sourceforge.net/biocreative_2_dataset.html.

  9. I, O and B denote, respectively, a token inside (intermediate), outside, and at the beginning of a named entity (NE).
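The B/I/O labelling described in note 9 can be illustrated with a short sketch. The helper name `to_iob`, the example sentence, the token spans and the entity types below are hypothetical and chosen only for illustration; they are not taken from the corpora used in the paper.

```python
def to_iob(tokens, spans):
    """Convert entity spans to per-token B/I/O tags.

    spans: list of (start, end_exclusive, entity_type) token index ranges.
    """
    tags = ["O"] * len(tokens)           # every token starts as outside
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"       # remaining tokens of the entity
    return tags


# Hypothetical example sentence with two entity mentions.
tokens = ["IL-2", "gene", "expression", "requires", "NF-kappa", "B"]
spans = [(0, 2, "DNA"), (4, 6, "protein")]
tags = to_iob(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Each token receives exactly one tag, so a sequence labeller can treat entity extraction as per-token classification while still recovering entity boundaries from the B and I prefixes.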


Author information

Corresponding author

Correspondence to Shweta Yadav.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants performed by any of the authors.

Additional information

Communicated by V. Loia.

About this article

Cite this article

Yadav, S., Ekbal, A. & Saha, S. Feature selection for entity extraction from multiple biomedical corpora: A PSO-based approach. Soft Comput 22, 6881–6904 (2018). https://doi.org/10.1007/s00500-017-2714-4


  • DOI: https://doi.org/10.1007/s00500-017-2714-4
