
Calibrated Resampling for Imbalanced and Long-Tails in Deep Learning

  • Conference paper
Discovery Science (DS 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12986)


Abstract

Long-tailed distributions and class imbalance are critical problems in applied deep learning, where trained models support and automate decisions in areas such as health and medicine, transportation, and finance. Learning deep models from such data remains challenging, and state-of-the-art solutions are typically data-dependent and focused primarily on images. Important real-world problems, however, involve much more diverse data types, necessitating a general solution. In this paper, we propose ReMix, a training technique that seamlessly leverages batch resampling, instance mixing, and soft labels to efficiently induce robust deep models from imbalanced and long-tailed datasets. Our results show that fully connected neural networks and convolutional neural networks (CNNs) trained with ReMix generally outperform the alternatives according to the g-mean and are better calibrated according to the balanced Brier score.
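The abstract describes ReMix only at a high level. As an illustrative sketch (not the authors' implementation), the three ingredients it names can be combined as follows: class-balanced batch resampling, mixup-style instance mixing, and the resulting soft labels. The function names and the Beta-distribution mixing parameter `alpha` here are assumptions for illustration.

```python
import numpy as np

def balanced_batch(X, y, batch_size, rng):
    # Class-balanced resampling: draw equally from each class (with
    # replacement, so minority classes can fill their share), making every
    # batch balanced regardless of the dataset's class distribution.
    classes = np.unique(y)
    per_class = batch_size // len(classes)
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=per_class, replace=True)
        for c in classes
    ])
    rng.shuffle(idx)
    return X[idx], y[idx]

def mix_batch(Xb, yb, n_classes, alpha, rng):
    # Mixup-style instance mixing: convex combinations of input pairs,
    # with matching convex combinations of one-hot labels (soft labels).
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(Xb))
    X_mix = lam * Xb + (1.0 - lam) * Xb[perm]
    onehot = np.eye(n_classes)[yb]
    y_soft = lam * onehot + (1.0 - lam) * onehot[perm]
    return X_mix, y_soft
```

Training would then proceed on `(X_mix, y_soft)` pairs with a cross-entropy loss that accepts soft targets; because the sketch is model-agnostic, it applies equally to fully connected networks on tabular data and to CNNs on images.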


Notes

  1. The code will be made available after publication.

  2. We use 1-BBS, so higher scores are better.

  3. Individual results for all datasets, including means and standard deviations, are included in the supplementary material.
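The notes refer to the two evaluation metrics used in the abstract: the g-mean and the balanced Brier score (BBS). A minimal sketch of both is given below; the g-mean is the standard geometric mean of per-class recalls, while the BBS definition here (the multi-class Brier score averaged within each class before averaging across classes) is an assumption for illustration, and the paper's exact formulation may differ.

```python
import numpy as np

def g_mean(y_true, y_pred):
    # Geometric mean of per-class recalls; drops to 0 if any class
    # is missed entirely, which penalizes ignoring minority classes.
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

def balanced_brier(y_true, proba):
    # Assumed BBS: multi-class Brier score averaged per class, then
    # across classes, so each class contributes equally regardless of size.
    n_classes = proba.shape[1]
    onehot = np.eye(n_classes)[y_true]
    sq_err = np.sum((proba - onehot) ** 2, axis=1)
    per_class = [sq_err[y_true == c].mean() for c in np.unique(y_true)]
    return float(np.mean(per_class))
```

As note 2 states, results are reported as 1-BBS so that, like the g-mean, higher values are better.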


Author information


Correspondence to Colin Bellinger.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 98 KB)


Copyright information

© 2021 National Research Council Canada

About this paper


Cite this paper

Bellinger, C., Corizzo, R., Japkowicz, N. (2021). Calibrated Resampling for Imbalanced and Long-Tails in Deep Learning. In: Soares, C., Torgo, L. (eds) Discovery Science. DS 2021. Lecture Notes in Computer Science, vol 12986. Springer, Cham. https://doi.org/10.1007/978-3-030-88942-5_19


  • DOI: https://doi.org/10.1007/978-3-030-88942-5_19


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-88941-8

  • Online ISBN: 978-3-030-88942-5

  • eBook Packages: Computer Science, Computer Science (R0)
