JSMix: a holistic algorithm for learning with label noise

Original Article · Neural Computing and Applications

Abstract

The success of deep learning depends largely on large-scale, accurately labeled datasets. However, real-world datasets often contain considerable label noise, and training directly on such data can lead to overfitting. Recent research has therefore focused on algorithms that learn robust models from noisy datasets, mainly by designing the loss function and integrating ideas from semi-supervised learning (SSL). This paper proposes a robust algorithm for learning with label noise that requires neither additional clean data nor an auxiliary model. On the one hand, the Jensen–Shannon (JS) divergence is introduced as a component of the loss function to measure the distance between the predicted distribution and the noisy label distribution; we show theoretically and experimentally that it alleviates the overfitting caused by the traditional cross-entropy loss. On the other hand, a dynamic sample selection mechanism is proposed: the dataset is divided into a pseudo-clean labeled subset and a pseudo-noisy labeled subset, the two subsets are treated differently to exploit prior information about the data, and the model is then trained with SSL. Unlike conventional training, the dynamic sample selection alternates between updating the two subsets and updating the model parameters. Because the labels of the pseudo-clean subset are not entirely correct, they are further refined by linear interpolation. Furthermore, we show experimentally that integrating SSL helps the model divide the two subsets more precisely and build more explicit decision boundaries. Extensive experiments on corrupted benchmark datasets and a real-world dataset, including CIFAR-10, CIFAR-100, and Clothing1M, demonstrate that our method outperforms many state-of-the-art approaches for learning with label noise.
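
To make the main components concrete, the following is a minimal NumPy sketch of (i) the JS-divergence loss between a prediction and a (possibly noisy) label distribution, (ii) a split into pseudo-clean and pseudo-noisy subsets, assumed here for illustration to be driven by a per-sample loss threshold, and (iii) label refinement by linear interpolation. This is an illustrative sketch under those assumptions, not the authors' implementation; the function names and the parameters threshold and w are hypothetical.

```python
import numpy as np

def js_loss(p, q, eps=1e-12):
    """Jensen-Shannon divergence between predicted distribution p and label distribution q."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b), axis=-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def split_by_loss(per_sample_loss, threshold):
    """Dynamic sample selection (illustrative): samples with small loss form the
    pseudo-clean subset, the rest form the pseudo-noisy subset."""
    clean_idx = np.where(per_sample_loss < threshold)[0]
    noisy_idx = np.where(per_sample_loss >= threshold)[0]
    return clean_idx, noisy_idx

def refine_labels(noisy_onehot, model_probs, w):
    """Refine pseudo-clean labels by linear interpolation between the given
    (possibly noisy) one-hot labels and the model's current predictions."""
    return w * noisy_onehot + (1.0 - w) * model_probs
```

In the full algorithm, the subset split and the model parameters are updated alternately, and the pseudo-noisy subset is treated as unlabeled data within the SSL stage.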



Author information

Corresponding author: Correspondence to Shihui Ying.

Ethics declarations

Conflicts of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work is supported by the National Key R&D Program of China (No. 2021YFA1003004) and the National Natural Science Foundation of China (No. 11971296).

Appendices

1.1 Proof of the robustness of JS loss

Lemma 1

For any x with ground-truth label \(y_x\) and any \(i \ne y_x\), we have \( {\mathcal {L}}_{JS}(f(x), e_i) \le \log 2 \), where the classifier \(f(\cdot )\) includes a softmax output layer.

Proof

Since \(f(\cdot )\) is a classifier with a softmax output layer, we have \(\sum \nolimits _{c=1}^C f_c(x) = 1\). Therefore,

$$\begin{aligned} {\mathcal {L}}_{JS} (f(x),e_i)&= \frac{1}{2} \sum \limits _{c=1}^C \bigg (e_{ic} \log \frac{2e_{ic}}{e_{ic}+f_c(x)} + f_c(x) \log \frac{2f_c(x)}{e_{ic}+f_c(x)} \bigg ) \\&= \frac{1}{2} \bigg ( \log \frac{2}{1+ f_i(x)} + \sum \limits _{c \ne i} f_c(x) \log 2 + f_i(x) \log \frac{2f_i(x)}{1+f_i(x)} \bigg ) \\&= \frac{1}{2} \bigg ( \log \frac{2}{1+ f_i(x)} + \sum \limits _{c \ne i} f_c(x) \log 2 + f_i(x) \log \Big(2-\frac{2}{1+f_i(x)}\Big)\bigg ) \\&\le \frac{1}{2} \Big(\log 2 + \sum \limits _{c \ne i} f_c(x) \log 2 +f_i(x) \log 2\Big) \\&= \log 2. \end{aligned}$$

This completes the proof. \(\square \)
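
As a quick numerical sanity check of Lemma 1 (not part of the paper), the snippet below draws random softmax outputs and verifies that the JS loss against any one-hot target \(e_i\) never exceeds \(\log 2\):

```python
import numpy as np

rng = np.random.default_rng(0)
C = 10

def js_loss(p, q, eps=1e-12):
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    m = 0.5 * (p + q)
    return 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))

for _ in range(1000):
    logits = rng.normal(size=C)
    f_x = np.exp(logits) / np.exp(logits).sum()      # softmax output f(x)
    e_i = np.eye(C)[rng.integers(C)]                 # arbitrary one-hot label e_i
    assert js_loss(f_x, e_i) <= np.log(2.0) + 1e-9   # Lemma 1: bounded by log 2
```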

Theorem 1

In a C-class classification task, for any softmax output \(f\), under symmetric noise with noise rate \(\eta < 1 - 1/C\), we have

$$\begin{aligned} {\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta } (f^*) - {\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta } (f) < \frac{1}{C}, \end{aligned}$$
(15)

where \(f^*\) is the global minimizer of \({\mathcal {R}}_{{\mathcal {L}}_{JS}}(f)\).

Proof

For any f,

$$\begin{aligned} {\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta }(f) &= {\mathbb {E}}_{x,{\hat{y}}_x}\, {\mathcal {L}}_{JS}(f(x), {\hat{y}}_x) \\ &= {\mathbb {E}}_{x}\, {\mathbb {E}}_{y_x|x}\, {\mathbb {E}}_{{\hat{y}}_x|x,y_x}\, {\mathcal {L}}_{JS}(f(x), {\hat{y}}_x) \\ &= {\mathbb {E}}_{x}\, {\mathbb {E}}_{y_x|x} \bigg [ (1-\eta )\, {\mathcal {L}}_{JS}(f(x),y_x)+\frac{\eta }{C-1} \sum \limits _{c \ne y_x}{\mathcal {L}}_{JS}(f(x), e_c)\bigg ] \\ &=(1-\eta )\, {\mathcal {R}}_{{\mathcal {L}}_{JS}}(f) + \frac{\eta }{C-1}\, {\mathbb {E}}_{x}\,{\mathbb {E}}_{y_x|x} \bigg (\sum \limits _{c=1}^C {\mathcal {L}}_{JS}(f(x),e_c)-{\mathcal {L}}_{JS}(f(x),y_x) \bigg ) \\ &= \bigg (1-\frac{\eta C}{C-1}\bigg )\, {\mathcal {R}}_{{\mathcal {L}}_{JS}}(f) + \frac{\eta }{C-1}\, {\mathbb {E}}_{x} \sum \limits _{c=1}^C {\mathcal {L}}_{JS}(f(x),e_c). \end{aligned}$$

Since \(f^*\) is the global minimizer of \({\mathcal {R}}_{{\mathcal {L}}_{JS}}(f)\) and \(\eta < 1 - 1/C\) (so that \(1-\frac{\eta C}{C-1} > 0\)),

$$\begin{aligned} {\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta }(f^*) -{\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta }(f) \le \Big(1-\frac{\eta C}{C-1}\Big)\big({\mathcal {R}}_{{\mathcal {L}}_{JS}}(f^*)-{\mathcal {R}}_{{\mathcal {L}}_{JS}}(f)\big) + \frac{\eta C}{C-1} < \frac{1}{C}. \end{aligned}$$

This completes the proof. \(\square \)
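
The decomposition used above, \({\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta }(f) = \big(1-\frac{\eta C}{C-1}\big){\mathcal {R}}_{{\mathcal {L}}_{JS}}(f) + \frac{\eta }{C-1}\,{\mathbb {E}}_x \sum _{c=1}^C {\mathcal {L}}_{JS}(f(x),e_c)\), can also be verified numerically. The snippet below (illustrative, not from the paper) uses synthetic Dirichlet vectors as stand-ins for the softmax outputs \(f(x)\):

```python
import numpy as np

rng = np.random.default_rng(1)
C, N, eta = 10, 500, 0.4                 # classes, samples, symmetric noise rate < 1 - 1/C

def js_loss(p, q, eps=1e-12):
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    m = 0.5 * (p + q)
    return 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))

probs = rng.dirichlet(np.ones(C), size=N)          # stand-in softmax outputs f(x)
labels = rng.integers(C, size=N)                   # ground-truth labels y_x
eye = np.eye(C)
L = np.array([[js_loss(p, eye[c]) for c in range(C)] for p in probs])   # L_JS(f(x), e_c)

R_clean = L[np.arange(N), labels].mean()           # clean risk R_{L_JS}(f)
# Noisy risk computed directly from the symmetric-noise label distribution.
R_noisy = np.mean((1 - eta) * L[np.arange(N), labels]
                  + eta / (C - 1) * (L.sum(axis=1) - L[np.arange(N), labels]))
# The decomposition derived in the proof of Theorem 1.
R_decomp = (1 - eta * C / (C - 1)) * R_clean + eta / (C - 1) * L.sum(axis=1).mean()
assert np.isclose(R_noisy, R_decomp)
```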

Theorem 2

In a C-class classification task under asymmetric noise with noise rates \(\eta _{y_xc}<1-\eta _{y_x}\), where \(\sum _{c \ne y_x} \eta _{y_xc}= \eta _{y_x}\), for any softmax output \(f\), if \({\mathcal {R}}_{{\mathcal {L}}_{JS}}(f^*) =0\) we have

$$\begin{aligned} {\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta } (f^*) - {\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta } (f) \le B, \end{aligned}$$
(16)

where \(B=C{\mathbb {E}}(1-\eta _{y_x}) \ge 0\), and \(f^*\) is the global minimizer of \({\mathcal {R}}_{{\mathcal {L}}_{JS}}(f)\).

Proof

$$\begin{aligned} {\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta }(f) &= {\mathbb {E}}_{x}\, {\mathbb {E}}_{y_x|x} \big [(1-\eta _{y_x})\,{\mathcal {L}}_{JS}(f(x),y_x)\big ] + {\mathbb {E}}_{x}\, {\mathbb {E}}_{y_x|x}\bigg [\sum \limits _{c \ne y_x} \eta _{y_xc}\,{\mathcal {L}}_{JS}(f(x),e_c) \bigg ] \\ &= {\mathbb {E}}_D \bigg [(1-\eta _{y_x}) \bigg ( \sum \limits _{c=1}^C {\mathcal {L}}_{JS}(f(x),e_c)-\sum \limits _{c \ne y_x} {\mathcal {L}}_{JS}(f(x),e_c)\bigg )\bigg ] + {\mathbb {E}}_D \bigg [ \sum \limits _{c \ne y_x} \eta _{y_xc}\, {\mathcal {L}}_{JS}(f(x),e_c) \bigg ] \\ &\le {\mathbb {E}}_D \bigg [(1-\eta _{y_x}) \bigg ( C-\sum \limits _{c \ne y_x} {\mathcal {L}}_{JS}(f(x),e_c)\bigg )\bigg ] + {\mathbb {E}}_D \bigg [ \sum \limits _{c \ne y_x} \eta _{y_xc}\, {\mathcal {L}}_{JS}(f(x),e_c) \bigg ] \\ &= C\, {\mathbb {E}}_D (1- \eta _{y_x}) - {\mathbb {E}}_D\bigg [\sum \limits _{c \ne y_x}(1-\eta _{y_x}-\eta _{y_xc})\,{\mathcal {L}}_{JS}(f(x),e_c) \bigg ]. \end{aligned}$$

Thus,

$$\begin{aligned} {\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta }(f^*) - {\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta }(f) &\le C\, {\mathbb {E}}_D(1-\eta _{y_x}) - {\mathbb {E}}_D \bigg [\sum \limits _{c \ne y_x} (1-\eta _{y_x}-\eta _{y_xc})\,{\mathcal {L}}_{JS}(f^*(x),e_c)\bigg ] \\ &\quad + {\mathbb {E}}_D\bigg [\sum \limits _{c \ne y_x} (1-\eta _{y_x}-\eta _{y_xc})\,{\mathcal {L}}_{JS}(f(x),e_c)\bigg ] \\ &= C\, {\mathbb {E}}_D(1-\eta _{y_x}) + {\mathbb {E}}_D \bigg [\sum \limits _{c \ne y_x} (1-\eta _{y_x}-\eta _{y_xc})\big ({\mathcal {L}}_{JS}(f(x),e_c)-{\mathcal {L}}_{JS}(f^*(x),e_c)\big )\bigg ]. \end{aligned}$$

From the assumption that \({\mathcal {R}}_{{\mathcal {L}}_{JS}}(f^*) =0\), we have \({\mathcal {L}}_{JS}(f^*(x),y_x)=0\), so \(f^*(x)=e_{y_x}\) and hence \({\mathcal {L}}_{JS}(f^*(x),e_i)= \log 2\) for any \(i \ne y_x\). From Lemma 1, \({\mathcal {L}}_{JS}(f(x), e_i) \le \log 2\) for \(i \ne y_x\), and since \(1-\eta _{y_x}-\eta _{y_xc} > 0\) by assumption, every term of the second expectation above is non-positive. Therefore,

$$\begin{aligned} {\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta }(f^*) - {\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta }(f) \le C {\mathbb {E}}_D(1-\eta _{y_x}) \triangleq B, \end{aligned}$$

where \(B=C\,{\mathbb {E}}_D(1-\eta _{y_x}) \ge 0\). This completes the proof. \(\square \)
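
The bound \(B = C\,{\mathbb {E}}_D(1-\eta _{y_x})\) can likewise be checked on a toy asymmetric noise matrix. The snippet below (again illustrative, not from the paper) builds a row-stochastic noise matrix satisfying \(\eta _{y_xc} < 1-\eta _{y_x}\), takes \(f^*\) to be the perfect classifier so that \({\mathcal {R}}_{{\mathcal {L}}_{JS}}(f^*)=0\), and verifies the inequality of Theorem 2 against an arbitrary competitor \(f\):

```python
import numpy as np

rng = np.random.default_rng(2)
C, N = 4, 200

def js_loss(p, q, eps=1e-12):
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    m = 0.5 * (p + q)
    return 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))

# T[y, c] = P(noisy label = c | true label = y); off-diagonal entries (0.1) are
# below 1 - eta_y (0.7), as the theorem requires.
T = np.full((C, C), 0.1) + np.eye(C) * (1.0 - 0.1 * C)
labels = rng.integers(C, size=N)
eta_y = 1.0 - T[labels, labels]                    # per-sample noise rate eta_{y_x}
B = C * np.mean(1.0 - eta_y)                       # the bound of Theorem 2

eye = np.eye(C)
f_star = eye[labels]                               # perfect classifier: R_{L_JS}(f*) = 0
f_rand = rng.dirichlet(np.ones(C), size=N)         # an arbitrary competitor f

def noisy_risk(probs):
    # R^eta under T: expectation over each sample's noisy-label distribution.
    return np.mean([sum(T[y, c] * js_loss(p, eye[c]) for c in range(C))
                    for p, y in zip(probs, labels)])

assert noisy_risk(f_star) - noisy_risk(f_rand) <= B
```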

1.2 The value of hyperparameter \(\beta \)

Noise type    Dataset                              Noise rate              \(\beta \)
Symmetric     CIFAR-10                             0.2 / 0.4               0
Symmetric     CIFAR-10                             0.6 / 0.8               25
Symmetric     CIFAR-100                            0.2 / 0.4               25
Symmetric     CIFAR-100                            0.6 / 0.8               125
Asymmetric    CIFAR-10 / CIFAR-100 / Clothing1M    0.1 / 0.2 / 0.3 / 0.4   0
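
For convenience, the table above can be encoded as a small lookup helper; the function name beta_for is hypothetical and the values simply mirror the table:

```python
def beta_for(noise_type, dataset, noise_rate):
    """Return the hyperparameter beta for a given setting (values from the table above)."""
    if noise_type == "asymmetric":                   # CIFAR-10 / CIFAR-100 / Clothing1M, rates 0.1-0.4
        return 0
    if dataset == "CIFAR-10":
        return 0 if noise_rate <= 0.4 else 25        # symmetric 0.2/0.4 -> 0, 0.6/0.8 -> 25
    if dataset == "CIFAR-100":
        return 25 if noise_rate <= 0.4 else 125      # symmetric 0.2/0.4 -> 25, 0.6/0.8 -> 125
    raise ValueError("unsupported setting")

# e.g. beta_for("symmetric", "CIFAR-100", 0.8) -> 125
```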

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wen, Z., Xu, H. & Ying, S. JSMix: a holistic algorithm for learning with label noise. Neural Comput & Applic 35, 1519–1533 (2023). https://doi.org/10.1007/s00521-022-07770-9
