MahalCUSFilter: A Hybrid Undersampling Method to Improve the Minority Classification Rate of Imbalanced Datasets

Chennuru, Venkata Krishnaveni; Timmappareddy, Sobha Rani

doi:10.1007/978-3-319-71928-3_5

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10682)

Included in the following conference series:

International Conference on Mining Intelligence and Knowledge Exploration

Abstract

The class imbalance problem has received considerable attention in machine learning research. Among the methods that handle class imbalance, undersampling is a data-level approach that preprocesses the data set to reduce the number of majority class instances. Most existing undersampling methods apply either prototype selection or clustering techniques to balance the data set. They are effective and popular, but both processes are complex. Cluster-based undersampling methods have two drawbacks: the quality of the chosen majority class samples varies with the clustering algorithm and the number of clusters, and convergence is difficult. Prototype selection methods must compare each majority instance with its k nearest neighbors to decide which majority class instances to select or discard, which is time consuming and difficult to implement for large datasets. The proposed undersampling method, Mahalanobis Centroid-based Undersampling with Filter (MahalCUSFilter), overcomes these problems: parameter dependence, complexity and information loss. Used in conjunction with the C4.5 and kNN classifiers, the proposed method is found to improve the minority class classification rate on all datasets, with comparable overall performance for the entire dataset. To the best of our knowledge, this kind of grouping has not been used in undersampling to improve the classification accuracy of imbalanced data sets.
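The abstract does not spell out the algorithm, so the following is only a minimal sketch of one possible reading of "Mahalanobis centroid-based undersampling", not the authors' method. It assumes that majority instances are ranked by Mahalanobis distance to the majority class centroid, split into distance bands, and sampled proportionally from each band. The function names, the `n_groups` and `seed` parameters, and the banding strategy are all hypothetical, and the filter step named in the method's title is not covered.

```python
import numpy as np


def mahalanobis_to_centroid(X):
    """Mahalanobis distance of each row of X to the centroid of X."""
    centroid = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # Pseudo-inverse guards against a singular covariance matrix.
    cov_inv = np.linalg.pinv(np.atleast_2d(cov))
    diff = X - centroid
    # Quadratic form diff_i^T * cov_inv * diff_i for every row i.
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(d2, 0.0))


def centroid_distance_undersample(X_maj, n_keep, n_groups=5, seed=0):
    """Hypothetical helper: pick roughly n_keep majority rows, spread
    across bands of increasing distance from the majority centroid.

    Returns the sorted indices of the rows to keep.
    """
    rng = np.random.default_rng(seed)
    dist = mahalanobis_to_centroid(X_maj)
    order = np.argsort(dist)                 # nearest to farthest
    groups = np.array_split(order, n_groups)  # assumed banding scheme
    keep = []
    for g in groups:
        # Proportional allocation; rounding may shift the total slightly.
        k = min(len(g), int(round(n_keep * len(g) / len(X_maj))))
        keep.extend(rng.choice(g, size=k, replace=False))
    return np.sort(np.array(keep, dtype=int))


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X_majority = rng.normal(size=(200, 3))  # synthetic majority class
    kept = centroid_distance_undersample(X_majority, n_keep=50)
    print(len(kept), "majority instances kept out of", len(X_majority))
```

Sampling from every distance band, rather than only near the centroid, is one way to limit the information loss the abstract mentions, since instances far from the centroid remain represented in the reduced majority set. Note that `n_groups` is itself a tunable parameter introduced for this sketch, whereas the abstract lists parameter dependence among the problems the method overcomes.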

Author information

Authors and Affiliations

SCIS, University of Hyderabad, Hyderabad, India
Venkata Krishnaveni Chennuru & Sobha Rani Timmappareddy

Corresponding authors

Correspondence to Venkata Krishnaveni Chennuru or Sobha Rani Timmappareddy.

Editor information

Editors and Affiliations

Indian Statistical Institute, Kolkata, India
Ashish Ghosh
Institute for Development and Research in Banking Technology, Hyderabad, India
Rajarshi Pal
Indian Institute of Information Technology, Sri City, India
Rajendra Prasath

About this paper

Cite this paper

Chennuru, V.K., Timmappareddy, S.R. (2017). MahalCUSFilter: A Hybrid Undersampling Method to Improve the Minority Classification Rate of Imbalanced Datasets. In: Ghosh, A., Pal, R., Prasath, R. (eds) Mining Intelligence and Knowledge Exploration. MIKE 2017. Lecture Notes in Computer Science, vol 10682. Springer, Cham. https://doi.org/10.1007/978-3-319-71928-3_5

DOI: https://doi.org/10.1007/978-3-319-71928-3_5
Published: 28 November 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-71927-6
Online ISBN: 978-3-319-71928-3
eBook Packages: Computer Science (R0)
