Abstract
Biomedical named entity recognition (Bio-NER) is a fundamental task in biomedical text mining. It has many applications, especially in natural language processing, and its performance heavily affects the tasks that depend on it. However, class imbalance is a major obstacle to effective Bio-NER. Over the past few years, rule-based and deep learning-based methods have been widely used, but neither handles class imbalance effectively. We therefore propose a hybrid method, Hybrid Bio-NER (HBio-NER), that combines the advantages of both. We also propose a training data format called ‘single_entity-O’, which reduces class imbalance. Together, data preprocessing and the ‘single_entity-O’ representation improve the class balance, which in turn improves HBio-NER performance. HBio-NER identifies named entities with a Bidirectional Long Short-Term Memory (Bi-LSTM) model and assigns entity boundaries (the beginning and inside of entity mentions) with rules. The method is evaluated on two data sets (the NCBI disease and CHEMDNER chemical data sets) with three word representation models (GloVe, Word2Vec and FastText). The comparative analysis indicates that the proposed method achieves a significant performance gain, especially for disease named entity recognition.
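The abstract does not spell out the ‘single_entity-O’ format or the boundary rules, so the following is only a minimal sketch of one plausible reading: begin/inside tags are collapsed into a single entity label before training, and a simple rule restores the boundaries afterwards. The function names `to_single_entity_o` and `restore_boundaries`, the `B-`/`I-` tag spelling, and the rule itself are assumptions made for illustration, not the authors' implementation.

```python
from typing import List


def to_single_entity_o(tags: List[str]) -> List[str]:
    """Collapse BIO tags such as 'B-Disease' / 'I-Disease' into one entity label plus 'O'.

    Assumed reading of 'single_entity-O': the tagger sees two classes
    instead of three, so the rare 'B' and 'I' classes are merged.
    """
    return [t if t == "O" else t.split("-", 1)[1] for t in tags]


def restore_boundaries(labels: List[str]) -> List[str]:
    """Hypothetical boundary rule: the first token of a run gets 'B-', the rest 'I-'."""
    out: List[str] = []
    prev = "O"
    for label in labels:
        if label == "O":
            out.append("O")
        elif label == prev:
            out.append("I-" + label)  # continues the current entity mention
        else:
            out.append("B-" + label)  # starts a new entity mention
        prev = label
    return out


bio = ["O", "B-Disease", "I-Disease", "O", "B-Disease"]
collapsed = to_single_entity_o(bio)
restored = restore_boundaries(collapsed)
```

Under this rule, two directly adjacent mentions of the same type would be merged into one, which is a limitation of the sketch rather than a statement about HBio-NER.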
Data Availability
The data that support the findings of this study are publicly available.
Ethics declarations
Conflict of Interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Additional information
This article is part of the topical collection “Research Trends in Computational Intelligence” guest edited by Anshul Verma, Pradeepika Verma, Vivek Kumar Singh and S. Karthikeyan.
About this article
Cite this article
Archana, S.M., Prakash, J., Singh, P.K. et al. An Effective Biomedical Named Entity Recognition by Handling Imbalanced Data Sets Using Deep Learning and Rule-Based Methods. SN COMPUT. SCI. 4, 650 (2023). https://doi.org/10.1007/s42979-023-02068-6
DOI: https://doi.org/10.1007/s42979-023-02068-6