Abstract
The class imbalance problem has received considerable attention in machine learning research. Among the methods that handle class imbalance, undersampling is a data-level approach that preprocesses the data set to reduce the number of majority class instances. Most existing undersampling methods apply either prototype selection or clustering techniques to balance the data set. They are effective and popular, but both processes are complex. Cluster-based undersampling methods have two drawbacks: the quality of the chosen majority class samples varies with the clustering algorithm and the number of clusters, and convergence is difficult. Prototype selection methods have to compare each majority instance with its k nearest neighbors to decide which majority class instances should be selected or discarded, which is not only time consuming but also difficult to implement for large datasets. The proposed undersampling method, Mahalanobis Centroid based Undersampling with Filter (MahalCUSFilter), overcomes these problems: parameter dependence, complexity and information loss. The proposed method is used in conjunction with the C4.5 and kNN classifiers, and is found to improve the minority class classification rate on all datasets, with comparable overall performance on the entire dataset. To the best of our knowledge, this kind of grouping has not been used in undersampling to improve the classification accuracy of imbalanced data sets.
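The abstract names the Mahalanobis distance and the class centroid but does not spell out the algorithm. The following is only a minimal sketch of that one building block, the Mahalanobis distance of each majority class instance to the majority class centroid, and should not be read as the MahalCUSFilter procedure itself. The function name `mahalanobis_to_centroid`, the use of NumPy, the pseudo-inverse of the sample covariance, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def mahalanobis_to_centroid(X):
    """Return the Mahalanobis distance of every row of X to the centroid of X.

    X is an (n_samples, n_features) array, assumed here to hold the
    majority class instances only (an illustrative assumption).
    """
    X = np.asarray(X, dtype=float)
    centroid = X.mean(axis=0)
    # Sample covariance of the features (columns are variables).
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # The pseudo-inverse keeps the sketch usable when the covariance is singular.
    inv_cov = np.linalg.pinv(cov)
    diff = X - centroid
    # Quadratic form diff_i^T * inv_cov * diff_i, computed for every row i.
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return np.sqrt(np.maximum(d2, 0.0))


# Toy majority class: four points on the corners of a square.
majority = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
distances = mahalanobis_to_centroid(majority)
# Ordering instances by distance is one hypothetical way such distances could be used.
order = np.argsort(distances)
```

How the distances are then grouped and filtered to pick the retained majority instances is defined in the full paper, not in this abstract, so the sketch stops at the distance computation.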
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Chennuru, V.K., Timmappareddy, S.R. (2017). MahalCUSFilter: A Hybrid Undersampling Method to Improve the Minority Classification Rate of Imbalanced Datasets. In: Ghosh, A., Pal, R., Prasath, R. (eds) Mining Intelligence and Knowledge Exploration. MIKE 2017. Lecture Notes in Computer Science, vol 10682. Springer, Cham. https://doi.org/10.1007/978-3-319-71928-3_5
DOI: https://doi.org/10.1007/978-3-319-71928-3_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-71927-6
Online ISBN: 978-3-319-71928-3
eBook Packages: Computer Science, Computer Science (R0)