An Effective Biomedical Named Entity Recognition by Handling Imbalanced Data Sets Using Deep Learning and Rule-Based Methods

  • Original Research
SN Computer Science

Abstract

Biomedical named entity recognition (Bio-NER) is a fundamental task in biomedical text mining. It has many applications, especially in natural language processing, and its performance heavily affects dependent tasks. However, class imbalance is the major obstacle to effective Bio-NER. Over the past few years, rule-based and deep learning-based methods have been widely used, but neither handles class imbalance effectively. We therefore propose a hybrid method, Hybrid Bio-NER (HBio-NER), that takes advantage of both. We also propose a training data format called ‘single_entity-O’, which reduces the class imbalance problem. Together, data preprocessing and the ‘single_entity-O’ representation improve the class balance, which in turn improves the performance of the HBio-NER system. HBio-NER identifies named entities using a Bidirectional Long Short-Term Memory (Bi-LSTM) model and determines entity boundaries (the beginning and inside of entity mentions) using rules. The method is tested on two data sets (the NCBI disease and CHEMDNER chemical data sets) with three word representation models (GloVe, Word2Vec and FastText). The comparative analysis indicates that the proposed method achieves a significant performance gain, especially for disease named entity recognition.
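
The abstract does not spell out the ‘single_entity-O’ format, so the following is only a minimal sketch of one plausible reading: the begin and inside tags of a BIO-style annotation are collapsed into a single entity label for training, and a simple rule restores the boundaries afterwards. The function names `to_single_entity_o` and `restore_boundaries`, the `B-`/`I-` tag spelling, and the toy sentence are assumptions made for illustration, not taken from the paper.

```python
from collections import Counter


def to_single_entity_o(bio_tags):
    """Collapse BIO tags (e.g. B-Disease, I-Disease) into one entity label plus O.

    Assumed reading of 'single_entity-O': each entity type keeps a single
    label, so the rare B-/I- classes are merged into one larger class.
    """
    return ["O" if tag == "O" else tag.split("-", 1)[1] for tag in bio_tags]


def restore_boundaries(labels):
    """Hypothetical boundary rule: the first token in a run of identical
    entity labels becomes B-<type>; every following token becomes I-<type>."""
    bio, prev = [], "O"
    for label in labels:
        if label == "O":
            bio.append("O")
        elif label == prev:
            bio.append("I-" + label)
        else:
            bio.append("B-" + label)
        prev = label
    return bio


# Toy sentence, invented for illustration only.
tokens = ["Mutations", "in", "BRCA1", "cause", "breast", "cancer", "."]
bio = ["O", "O", "O", "O", "B-Disease", "I-Disease", "O"]

collapsed = to_single_entity_o(bio)    # labels a tagger would be trained on
restored = restore_boundaries(collapsed)

print(Counter(bio))        # three classes: O, B-Disease, I-Disease
print(Counter(collapsed))  # two classes: O, Disease
print(restored == bio)
```

Under this reading the tagger only has to separate entity tokens from non-entity tokens, and the boundary decision moves to the rule. One known limitation of such a rule is that two adjacent mentions of the same type are merged into one span, so the rules used in practice would need to be richer than this sketch.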

Data Availability

The data that support the findings of this study are publicly available.

Notes

  1. NCBI disease corpus: https://www.ncbi.nlm.nih.gov/research/bionlp/Data/disease/

  2. CHEMDNER corpus: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-iv/chemdner/

Author information

Corresponding author

Correspondence to S. M. Archana.

Ethics declarations

Conflict of Interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Research Trends in Computational Intelligence” guest edited by Anshul Verma, Pradeepika Verma, Vivek Kumar Singh and S. Karthikeyan.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Archana, S.M., Prakash, J., Singh, P.K. et al. An Effective Biomedical Named Entity Recognition by Handling Imbalanced Data Sets Using Deep Learning and Rule-Based Methods. SN COMPUT. SCI. 4, 650 (2023). https://doi.org/10.1007/s42979-023-02068-6

  • DOI: https://doi.org/10.1007/s42979-023-02068-6
