Abstract
The class imbalance problem is noteworthy because of its impact on the induction of predictive models and its constant presence in many application areas. It is a challenge in supervised classification, since most classifiers are very sensitive to class distributions. Consequently, the predictive model is biased toward the majority class, which leads to poor performance on the minority class. In this paper, we analyze the reliability of resampling strategies in terms of factors such as dataset characteristics and the classifiers used to build the models, in order to improve performance and to determine which resampling method should be used given these factors. Experiments over 24 real datasets with different imbalance ratios, using six classifiers, seven resampling algorithms and six performance evaluation measures, were conducted to show which resampling method is most suitable depending on these factors.
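As a rough illustration of the terms used in the abstract, the sketch below computes an imbalance ratio (majority-class count divided by minority-class count) and applies plain random undersampling. It is not one of the seven resampling algorithms evaluated in the paper; the function names `imbalance_ratio` and `random_undersample` and the toy 90/10 data are assumptions made for this example only.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_undersample(samples, labels, seed=0):
    """Randomly drop examples so every class matches the smallest class size.

    Hypothetical helper for illustration; not the authors' method.
    """
    rng = random.Random(seed)
    target = min(Counter(labels).values())

    # Group the samples by their class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    out_x, out_y = [], []
    for y, xs in by_class.items():
        kept = rng.sample(xs, target)  # sample without replacement
        out_x.extend(kept)
        out_y.extend([y] * target)
    return out_x, out_y


# Toy data (assumed): 90 majority examples, 10 minority examples.
labels = [0] * 90 + [1] * 10
samples = list(range(100))

ir_before = imbalance_ratio(labels)
bal_x, bal_y = random_undersample(samples, labels)
ir_after = imbalance_ratio(bal_y)
print(ir_before, ir_after)
```

Undersampling discards majority-class examples, so it trades information for balance; whether that trade pays off depends on the dataset and classifier, which is the kind of question the study addresses.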
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Kraiem, M.S., Moreno, M.N. (2017). Effectiveness of Basic and Advanced Sampling Strategies on the Classification of Imbalanced Data. A Comparative Study Using Classical and Novel Metrics. In: Martínez de Pisón, F., Urraca, R., Quintián, H., Corchado, E. (eds) Hybrid Artificial Intelligent Systems. HAIS 2017. Lecture Notes in Computer Science, vol 10334. Springer, Cham. https://doi.org/10.1007/978-3-319-59650-1_20
DOI: https://doi.org/10.1007/978-3-319-59650-1_20
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59649-5
Online ISBN: 978-3-319-59650-1
eBook Packages: Computer Science, Computer Science (R0)