Skip to main content

MahalCUSFilter: A Hybrid Undersampling Method to Improve the Minority Classification Rate of Imbalanced Datasets

  • Conference paper
  • First Online:
Book cover Mining Intelligence and Knowledge Exploration (MIKE 2017)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 10682))

Abstract

Class Imbalance problem has received considerable attention in the machine learning research. Among the methods which handle class imbalance problem, undersampling is a data level approach which preprocesses the data set to reduce the size of the majority class instances. Most of the existing undersampling methods apply either prototype selection or clustering techniques to balance the data set. They are effective and popular, but both processes are complex. Drawbacks of the cluster based undersampling methods are: The quality of the chosen majority class samples varies depending on clustering algorithm, number of clusters and also the convergence is difficult. Drawback of prototype selection methods is that they have to compare each majority instance with it’s k nearest neighbors to decide which majority class instance should be selected/discarded which is not only time consuming and is also difficult to implement for large datasets. Proposed undersampling method MahalanobisCentroidbasedUndersampingwithFilter (MahalCUSFilter) overcomes the above said problems: parameter dependence, complexity and information loss. Proposed method is used in conjunction with c4.5 and kNN classifiers, and found to improve the minority class classification rate of all datasets with comparable overall performance for the entire dataset. To the best of our knowledge this kind of grouping has not been used in undersampling to improve the classification accuracy of imbalanced data sets.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Alcalá-Fdez, J., Sanchez, L., Garcia, S., del Jesus, M.J., Ventura, S., Garrell, J.M., Otero, J., Romero, C., Bacardit, J., Rivas, V.M., et al.: Keel: a software tool to assess evolutionary algorithms for data mining problems. Soft Comput.-A Fus. Found. Methodol. Appl. 13(3), 307–318 (2009)

    Google Scholar 

  2. Alshomrani, S., Bawakid, A., Shim, S.-O., Fernández, A., Herrera, F.: A proposal for evolutionary fuzzy systems using feature weighting: dealing with overlapping in imbalanced datasets. Knowl.-Based Syst. 73, 1–17 (2015)

    Article  Google Scholar 

  3. Asuncion, A., Newman, D.: Uci machine learning repository (2007)

    Google Scholar 

  4. Barella, V.H., Costa, E.P., Carvalho, A.C.P.L.F.: Clusteross: a new undersampling method for imbalanced learning. In: Brazilian Conference on Intelligent Systems, 3rd; Encontro Nacional de Inteligência Artificial e Computacional, 11th. Universidade de São Paulo-USP (2014)

    Google Scholar 

  5. Batista, G.E., Prati, R.C., Monard, M.C.: A study of the behavior of several methods for balancing machine learning training data. ACM Sigkdd Explor. Newsl. 6(1), 20–29 (2004)

    Article  Google Scholar 

  6. Beyan, C., Fisher, R.: Classifying imbalanced data sets using similarity based hierarchical decomposition. Pattern Recogn. 48(5), 1653–1672 (2015)

    Article  Google Scholar 

  7. Díez-Pastor, J.F., Rodríguez, J.J., García-Osorio, C., Kuncheva, L.I.: Random balance: ensembles of variable priors classifiers for imbalanced data. Knowl.-Based Syst. 85, 96–111 (2015)

    Article  Google Scholar 

  8. Hart, P.: The condensed nearest neighbor rule (corresp.). IEEE Trans. Inf. Theory 14(3), 515–516 (1968)

    Article  Google Scholar 

  9. Kubat, M., Matwin, S., et al.: Addressing the curse of imbalanced training sets: one-sided selection. In: ICML, vol. 97, Nashville, USA, pp. 179–186 (1997)

    Google Scholar 

  10. Laurikkala, J.: Improving identification of difficult small classes by balancing class distribution. In: Quaglini, S., Barahona, P., Andreassen, S. (eds.) AIME 2001. LNCS (LNAI), vol. 2101, pp. 63–66. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-48229-6_9

    Chapter  Google Scholar 

  11. Longadge, M.R., Dongre, M.S.S., Malik, L.: Multi-cluster based approach for skewed data in data mining. J. Comput. Eng. (IOSR-JCE) 12(6), 66–73 (2013)

    Google Scholar 

  12. Manjula, M., Seeniselvi, T.: Ensembles of first order logical decision trees for imbalanced classification problems

    Google Scholar 

  13. Ng, W.W., Hu, J., Yeung, D.S., Yin, S., Roli, F.: Diversified sensitivity-based undersampling for imbalance classification problems. IEEE Trans. Cybern. 45(11), 2402–2412 (2015)

    Article  Google Scholar 

  14. Rahman, M.M., Davis, D.: Cluster based under-sampling for unbalanced cardiovascular data. In: Proceedings of the World Congress on Engineering, vol. 3, pp. 3–5 (2013)

    Google Scholar 

  15. Rencher, A.C.: Methods of Multivariate Analysis, vol. 492. Wiley, Hoboken (2003)

    MATH  Google Scholar 

  16. Sobhani, P., Viktor, H., Matwin, S.: Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: Appice, A., Ceci, M., Loglisci, C., Manco, G., Masciari, E., Ras, Z.W. (eds.) NFMCP 2014. LNCS (LNAI), vol. 8983, pp. 69–83. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-17876-9_5

    Google Scholar 

  17. Sun, Z., Song, Q., Zhu, X., Sun, H., Xu, B., Zhou, Y.: A novel ensemble method for classifying imbalanced data. Pattern Recogn. 48(5), 1623–1637 (2015)

    Article  Google Scholar 

  18. Tomek, I.: Two modifications of CNN. IEEE Trans. Syst. Man Cybern. 6, 769–772 (1976)

    MathSciNet  MATH  Google Scholar 

  19. Wang, C., Hu, L., Guo, M., Liu, X., Zou, Q.: imDC: an ensemble learning method for imbalanced classification with mirna data. Genet. Mol. Res. 14(1), 123–133 (2015)

    Article  Google Scholar 

  20. Witten, I.H., Frank, E., Trigg, L.E., Hall, M.A., Holmes, G., Cunningham, S.J.: Weka: practical machine learning tools and techniques with Java implementations (1999)

    Google Scholar 

  21. Yen, S.-J., Lee, Y.-S.: Cluster-based under-sampling approaches for imbalanced data distributions. Expert Syst. Appl. 36(3), 5718–5727 (2009)

    Article  Google Scholar 

  22. Zhang, S., Sadaoui, S., Mouhoub, M.: An empirical analysis of imbalanced data classification. Comput. Inf. Sci. 8(1), 151 (2015)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Venkata Krishnaveni Chennuru or Sobha Rani Timmappareddy .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2017 Springer International Publishing AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Chennuru, V.K., Timmappareddy, S.R. (2017). MahalCUSFilter: A Hybrid Undersampling Method to Improve the Minority Classification Rate of Imbalanced Datasets. In: Ghosh, A., Pal, R., Prasath, R. (eds) Mining Intelligence and Knowledge Exploration. MIKE 2017. Lecture Notes in Computer Science(), vol 10682. Springer, Cham. https://doi.org/10.1007/978-3-319-71928-3_5

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-71928-3_5

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-71927-6

  • Online ISBN: 978-3-319-71928-3

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics