Abstract
Class imbalance is a key factor affecting the performance of many machine learning classification tasks. It arises when some classes contain far more samples than others. Classifiers trained on such data tend to favor the majority class or classes, which results in high-precision/low-recall classifiers. Named entity recognition (NER) in free text suffers from this problem to a large extent, because most samples in any given free text do not belong to any entity. Furthermore, the data in this type of classification is sequential. It therefore differs from the data in other common classification tasks, such as image classification, spam detection, and text classification, in which no semantic or syntactic relation exists between samples. In this study, we propose an undersampling approach for sequential data that preserves the existing correlations between the sequenced samples that make up sentences and thus improves classifier performance. We call this method balanced undersampling (BUS). Given the recent increase in interest in NER in the chemical and biomedical domains, the proposed method is developed and tested on four recent state-of-the-art corpora in these domains: the BioCreative IV ChemDNER, Bio-entity Recognition Challenge of JNLPBA (JNLPBA), SemEval2013 DDI DrugBank, and SemEval2013 DDI Medline datasets. The performance of the proposed method is evaluated against two other common undersampling methods: random undersampling and stop-word filtering. Our method outperforms both with respect to F-score on all datasets used.
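To make the idea of sentence-preserving undersampling concrete, the following is a minimal sketch under stated assumptions, not the BUS algorithm itself, whose selection rules the abstract does not give. It assumes each sentence is a list of (token, label) pairs with the label "O" marking tokens outside any entity; the function names, the `keep_ratio` parameter, the label names, and the toy corpus are all hypothetical. The sketch removes whole entity-free sentences rather than individual tokens, so every retained sentence keeps its token order intact.

```python
import random


def undersample_sentences(sentences, keep_ratio=0.5, seed=0):
    """Generic sentence-level undersampling (illustrative, not BUS).

    Sentences containing at least one entity token are always kept.
    Entity-free sentences are kept with probability `keep_ratio`.
    Sentences are never split, so the token order inside each
    retained sentence is unchanged.
    """
    rng = random.Random(seed)
    kept = []
    for sentence in sentences:
        has_entity = any(label != "O" for _, label in sentence)
        if has_entity or rng.random() < keep_ratio:
            kept.append(sentence)
    return kept


def outside_fraction(sentences):
    """Fraction of tokens labelled "O" (the majority class)."""
    labels = [label for sentence in sentences for _, label in sentence]
    return labels.count("O") / len(labels) if labels else 0.0


# Hypothetical toy corpus; the label names are assumptions.
corpus = [
    [("Aspirin", "B-CHEM"), ("inhibits", "O"), ("COX-1", "B-GENE"), (".", "O")],
    [("The", "O"), ("study", "O"), ("was", "O"), ("repeated", "O"), (".", "O")],
    [("Results", "O"), ("are", "O"), ("shown", "O"), ("below", "O"), (".", "O")],
    [("Ibuprofen", "B-CHEM"), ("was", "O"), ("also", "O"), ("tested", "O"), (".", "O")],
]

# keep_ratio=0.0 drops every entity-free sentence.
balanced = undersample_sentences(corpus, keep_ratio=0.0)
print(len(corpus), len(balanced))
print(outside_fraction(corpus), outside_fraction(balanced))
```

Token-level random undersampling would instead delete individual "O" tokens, which breaks the sentence context that a sequence labeller depends on; dropping at sentence granularity is one simple way to avoid that.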




Cite this article
Akkasi, A., Varoğlu, E. & Dimililer, N. Balanced undersampling: a novel sentence-based undersampling method to improve recognition of named entities in chemical and biomedical text. Appl Intell 48, 1965–1978 (2018). https://doi.org/10.1007/s10489-017-0920-5
DOI: https://doi.org/10.1007/s10489-017-0920-5