JSMix: a holistic algorithm for learning with label noise

Original Article · Neural Computing and Applications

Abstract

The success of deep learning depends largely on large-scale, accurately labeled datasets. However, real-world datasets often contain considerable label noise, and training directly on such data can lead to overfitting. Recent research has therefore focused on algorithms that learn robust models from noisy datasets, mainly by designing the loss function and integrating ideas from semi-supervised learning (SSL). This paper proposes a robust algorithm for learning with label noise that requires neither additional clean data nor an auxiliary model. On the one hand, the Jensen–Shannon (JS) divergence is introduced as a component of the loss function to measure the distance between the predicted distribution and the noisy label distribution; we show theoretically and experimentally that it alleviates the overfitting caused by the traditional cross-entropy loss. On the other hand, a dynamic sample selection mechanism is proposed: the dataset is divided into a pseudo-clean labeled subset and a pseudo-noisy labeled subset, the two subsets are treated differently to exploit prior information about the data, and the model is then trained with SSL. Unlike conventional training, the dynamic sample selection alternates between updating the two subsets and updating the model parameters. Because the labels of the pseudo-clean subset are not entirely correct, they are further refined by linear interpolation. Furthermore, we show experimentally that integrating SSL helps the model divide the two subsets more precisely and build more explicit decision boundaries. Extensive experiments on corrupted benchmark datasets and a real-world dataset, including CIFAR-10, CIFAR-100, and Clothing1M, demonstrate that our method outperforms many state-of-the-art approaches for learning with label noise.
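
To make the main components concrete, the following is a minimal NumPy sketch of (i) the JS-divergence loss between a prediction and a (possibly noisy) label distribution, (ii) a split into pseudo-clean and pseudo-noisy subsets, assumed here for illustration to be driven by a per-sample loss threshold, and (iii) label refinement by linear interpolation. This is an illustrative sketch under those assumptions, not the authors' implementation; the function names and the parameters threshold and w are hypothetical.

```python
import numpy as np

def js_loss(p, q, eps=1e-12):
    """Jensen-Shannon divergence between predicted distribution p and label distribution q."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b), axis=-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def split_by_loss(per_sample_loss, threshold):
    """Dynamic sample selection (illustrative): samples with small loss form the
    pseudo-clean subset, the rest form the pseudo-noisy subset."""
    clean_idx = np.where(per_sample_loss < threshold)[0]
    noisy_idx = np.where(per_sample_loss >= threshold)[0]
    return clean_idx, noisy_idx

def refine_labels(noisy_onehot, model_probs, w):
    """Refine pseudo-clean labels by linear interpolation between the given
    (possibly noisy) one-hot labels and the model's current predictions."""
    return w * noisy_onehot + (1.0 - w) * model_probs
```

In the full algorithm, the subset split and the model parameters are updated alternately, and the pseudo-noisy subset is treated as unlabeled data within the SSL stage.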



Author information

Corresponding author: Correspondence to Shihui Ying.

Ethics declarations

Conflicts of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work is supported by the National Key R&D Program of China (No. 2021YFA1003004) and the National Natural Science Foundation of China (No. 11971296).

Appendices

1.1 Proof of the robustness of JS loss

Lemma 1

For any x with ground-truth label \(y_x\) and any \(i \ne y_x\), we have \( {\mathcal {L}}_{JS}(f(x), e_i) \le \log 2 \), where the classifier \(f(\cdot )\) includes a softmax output layer.

Proof

Since \(f(\cdot )\) is a classifier with a softmax output layer, we have \(\sum \nolimits _{c=1}^C f_c(x) = 1\). Therefore,

$$\begin{aligned} {\mathcal {L}}_{JS} (f(x),e_i)&= \frac{1}{2} \sum \limits _{c=1}^C \bigg (e_{ic} \log \frac{2e_{ic}}{e_{ic}+f_c(x)} + f_c(x) \log \frac{2f_c(x)}{e_{ic}+f_c(x)} \bigg ) \\&= \frac{1}{2} \bigg ( \log \frac{2}{1+ f_i(x)} + \sum \limits _{c \ne i} f_c(x) \log 2 + f_i(x) \log \frac{2f_i(x)}{1+f_i(x)} \bigg ) \\&= \frac{1}{2} \bigg ( \log \frac{2}{1+ f_i(x)} + \sum \limits _{c \ne i} f_c(x) \log 2 + f_i(x) \log \Big(2-\frac{2}{1+f_i(x)}\Big)\bigg ) \\&\le \frac{1}{2} \Big(\log 2 + \sum \limits _{c \ne i} f_c(x) \log 2 +f_i(x) \log 2\Big) \\&= \log 2. \end{aligned}$$

This completes the proof. \(\square \)
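
As a quick numerical sanity check of Lemma 1 (not part of the paper), the snippet below draws random softmax outputs and verifies that the JS loss against any one-hot target \(e_i\) never exceeds \(\log 2\):

```python
import numpy as np

rng = np.random.default_rng(0)
C = 10

def js_loss(p, q, eps=1e-12):
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    m = 0.5 * (p + q)
    return 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))

for _ in range(1000):
    logits = rng.normal(size=C)
    f_x = np.exp(logits) / np.exp(logits).sum()      # softmax output f(x)
    e_i = np.eye(C)[rng.integers(C)]                 # arbitrary one-hot label e_i
    assert js_loss(f_x, e_i) <= np.log(2.0) + 1e-9   # Lemma 1: bounded by log 2
```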

Theorem 1

In a C-class classification task, for any softmax output \(f\), under symmetric noise with noise rate \(\eta < 1 - 1/C\), we have

$$\begin{aligned} {\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta } (f^*) - {\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta } (f) < \frac{1}{C}, \end{aligned}$$
(15)

where \(f^*\) is the global minimizer of \({\mathcal {R}}_{{\mathcal {L}}_{JS}}(f)\).

Proof

For any f,

$$\begin{aligned} {\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta }(f) &= {\mathbb {E}}_{x,{\hat{y}}_x}\, {\mathcal {L}}_{JS}(f(x), {\hat{y}}_x) \\ &= {\mathbb {E}}_{x}\, {\mathbb {E}}_{y_x|x}\, {\mathbb {E}}_{{\hat{y}}_x|x,y_x}\, {\mathcal {L}}_{JS}(f(x), {\hat{y}}_x) \\ &= {\mathbb {E}}_{x}\, {\mathbb {E}}_{y_x|x} \bigg [ (1-\eta )\, {\mathcal {L}}_{JS}(f(x),y_x)+\frac{\eta }{C-1} \sum \limits _{c \ne y_x}{\mathcal {L}}_{JS}(f(x), e_c)\bigg ] \\ &=(1-\eta )\, {\mathcal {R}}_{{\mathcal {L}}_{JS}}(f) + \frac{\eta }{C-1}\, {\mathbb {E}}_{x}\,{\mathbb {E}}_{y_x|x} \bigg (\sum \limits _{c=1}^C {\mathcal {L}}_{JS}(f(x),e_c)-{\mathcal {L}}_{JS}(f(x),y_x) \bigg ) \\ &= \bigg (1-\frac{\eta C}{C-1}\bigg )\, {\mathcal {R}}_{{\mathcal {L}}_{JS}}(f) + \frac{\eta }{C-1}\, {\mathbb {E}}_{x} \sum \limits _{c=1}^C {\mathcal {L}}_{JS}(f(x),e_c). \end{aligned}$$

Since \(f^*\) is the global minimizer of \({\mathcal {R}}_{{\mathcal {L}}_{JS}}(f)\) and \(\eta < 1 - 1/C\) (so that \(1-\frac{\eta C}{C-1} > 0\)),

$$\begin{aligned} {\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta }(f^*) -{\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta }(f) \le \Big(1-\frac{\eta C}{C-1}\Big)\big({\mathcal {R}}_{{\mathcal {L}}_{JS}}(f^*)-{\mathcal {R}}_{{\mathcal {L}}_{JS}}(f)\big) + \frac{\eta C}{C-1} < \frac{1}{C}. \end{aligned}$$

This completes the proof. \(\square \)
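
The decomposition used above, \({\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta }(f) = \big(1-\frac{\eta C}{C-1}\big){\mathcal {R}}_{{\mathcal {L}}_{JS}}(f) + \frac{\eta }{C-1}\,{\mathbb {E}}_x \sum _{c=1}^C {\mathcal {L}}_{JS}(f(x),e_c)\), can also be verified numerically. The snippet below (illustrative, not from the paper) uses synthetic Dirichlet vectors as stand-ins for the softmax outputs \(f(x)\):

```python
import numpy as np

rng = np.random.default_rng(1)
C, N, eta = 10, 500, 0.4                 # classes, samples, symmetric noise rate < 1 - 1/C

def js_loss(p, q, eps=1e-12):
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    m = 0.5 * (p + q)
    return 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))

probs = rng.dirichlet(np.ones(C), size=N)          # stand-in softmax outputs f(x)
labels = rng.integers(C, size=N)                   # ground-truth labels y_x
eye = np.eye(C)
L = np.array([[js_loss(p, eye[c]) for c in range(C)] for p in probs])   # L_JS(f(x), e_c)

R_clean = L[np.arange(N), labels].mean()           # clean risk R_{L_JS}(f)
# Noisy risk computed directly from the symmetric-noise label distribution.
R_noisy = np.mean((1 - eta) * L[np.arange(N), labels]
                  + eta / (C - 1) * (L.sum(axis=1) - L[np.arange(N), labels]))
# The decomposition derived in the proof of Theorem 1.
R_decomp = (1 - eta * C / (C - 1)) * R_clean + eta / (C - 1) * L.sum(axis=1).mean()
assert np.isclose(R_noisy, R_decomp)
```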

Theorem 2

In a C-class classification task under asymmetric noise with noise rates \(\eta _{y_xc}<1-\eta _{y_x}\), where \(\sum _{c \ne y_x} \eta _{y_xc}= \eta _{y_x}\), for any softmax output \(f\), if \({\mathcal {R}}_{{\mathcal {L}}_{JS}}(f^*) =0\) we have

$$\begin{aligned} {\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta } (f^*) - {\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta } (f) \le B, \end{aligned}$$
(16)

where \(B=C{\mathbb {E}}(1-\eta _{y_x}) \ge 0\), and \(f^*\) is the global minimizer of \({\mathcal {R}}_{{\mathcal {L}}_{JS}}(f)\).

Proof

$$\begin{aligned} {\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta }(f) &= {\mathbb {E}}_{x}\, {\mathbb {E}}_{y_x|x} \big [(1-\eta _{y_x})\,{\mathcal {L}}_{JS}(f(x),y_x)\big ] + {\mathbb {E}}_{x}\, {\mathbb {E}}_{y_x|x}\bigg [\sum \limits _{c \ne y_x} \eta _{y_xc}\,{\mathcal {L}}_{JS}(f(x),e_c) \bigg ] \\ &= {\mathbb {E}}_D \bigg [(1-\eta _{y_x}) \bigg ( \sum \limits _{c=1}^C {\mathcal {L}}_{JS}(f(x),e_c)-\sum \limits _{c \ne y_x} {\mathcal {L}}_{JS}(f(x),e_c)\bigg )\bigg ] + {\mathbb {E}}_D \bigg [ \sum \limits _{c \ne y_x} \eta _{y_xc}\, {\mathcal {L}}_{JS}(f(x),e_c) \bigg ] \\ &\le {\mathbb {E}}_D \bigg [(1-\eta _{y_x}) \bigg ( C-\sum \limits _{c \ne y_x} {\mathcal {L}}_{JS}(f(x),e_c)\bigg )\bigg ] + {\mathbb {E}}_D \bigg [ \sum \limits _{c \ne y_x} \eta _{y_xc}\, {\mathcal {L}}_{JS}(f(x),e_c) \bigg ] \\ &= C\, {\mathbb {E}}_D (1- \eta _{y_x}) - {\mathbb {E}}_D\bigg [\sum \limits _{c \ne y_x}(1-\eta _{y_x}-\eta _{y_xc})\,{\mathcal {L}}_{JS}(f(x),e_c) \bigg ]. \end{aligned}$$

Thus,

$$\begin{aligned} {\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta }(f^*) - {\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta }(f) &\le C\, {\mathbb {E}}_D(1-\eta _{y_x}) - {\mathbb {E}}_D \bigg [\sum \limits _{c \ne y_x} (1-\eta _{y_x}-\eta _{y_xc})\,{\mathcal {L}}_{JS}(f^*(x),e_c)\bigg ] \\ &\quad + {\mathbb {E}}_D\bigg [\sum \limits _{c \ne y_x} (1-\eta _{y_x}-\eta _{y_xc})\,{\mathcal {L}}_{JS}(f(x),e_c)\bigg ] \\ &= C\, {\mathbb {E}}_D(1-\eta _{y_x}) + {\mathbb {E}}_D \bigg [\sum \limits _{c \ne y_x} (1-\eta _{y_x}-\eta _{y_xc})\big ({\mathcal {L}}_{JS}(f(x),e_c)-{\mathcal {L}}_{JS}(f^*(x),e_c)\big )\bigg ]. \end{aligned}$$

From the assumption that \({\mathcal {R}}_{{\mathcal {L}}_{JS}}(f^*) =0\), we have \({\mathcal {L}}_{JS}(f^*(x),y_x)=0\), so \(f^*(x)=e_{y_x}\) and hence \({\mathcal {L}}_{JS}(f^*(x),e_i)= \log 2\) for any \(i \ne y_x\). From Lemma 1, \({\mathcal {L}}_{JS}(f(x), e_i) \le \log 2\) for \(i \ne y_x\), and since \(1-\eta _{y_x}-\eta _{y_xc} > 0\) by assumption, every term of the second expectation above is non-positive. Therefore,

$$\begin{aligned} {\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta }(f^*) - {\mathcal {R}}_{{\mathcal {L}}_{JS}}^{\eta }(f) \le C {\mathbb {E}}_D(1-\eta _{y_x}) \triangleq B, \end{aligned}$$

where \(B=C\,{\mathbb {E}}_D(1-\eta _{y_x}) \ge 0\). This completes the proof. \(\square \)
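
The bound \(B = C\,{\mathbb {E}}_D(1-\eta _{y_x})\) can likewise be checked on a toy asymmetric noise matrix. The snippet below (again illustrative, not from the paper) builds a row-stochastic noise matrix satisfying \(\eta _{y_xc} < 1-\eta _{y_x}\), takes \(f^*\) to be the perfect classifier so that \({\mathcal {R}}_{{\mathcal {L}}_{JS}}(f^*)=0\), and verifies the inequality of Theorem 2 against an arbitrary competitor \(f\):

```python
import numpy as np

rng = np.random.default_rng(2)
C, N = 4, 200

def js_loss(p, q, eps=1e-12):
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    m = 0.5 * (p + q)
    return 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))

# T[y, c] = P(noisy label = c | true label = y); off-diagonal entries (0.1) are
# below 1 - eta_y (0.7), as the theorem requires.
T = np.full((C, C), 0.1) + np.eye(C) * (1.0 - 0.1 * C)
labels = rng.integers(C, size=N)
eta_y = 1.0 - T[labels, labels]                    # per-sample noise rate eta_{y_x}
B = C * np.mean(1.0 - eta_y)                       # the bound of Theorem 2

eye = np.eye(C)
f_star = eye[labels]                               # perfect classifier: R_{L_JS}(f*) = 0
f_rand = rng.dirichlet(np.ones(C), size=N)         # an arbitrary competitor f

def noisy_risk(probs):
    # R^eta under T: expectation over each sample's noisy-label distribution.
    return np.mean([sum(T[y, c] * js_loss(p, eye[c]) for c in range(C))
                    for p, y in zip(probs, labels)])

assert noisy_risk(f_star) - noisy_risk(f_rand) <= B
```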

1.2 The value of hyperparameter \(\beta \)

Noise type    Dataset                              Noise rate              \(\beta \)
Symmetric     CIFAR-10                             0.2 / 0.4               0
Symmetric     CIFAR-10                             0.6 / 0.8               25
Symmetric     CIFAR-100                            0.2 / 0.4               25
Symmetric     CIFAR-100                            0.6 / 0.8               125
Asymmetric    CIFAR-10 / CIFAR-100 / Clothing1M    0.1 / 0.2 / 0.3 / 0.4   0
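
For convenience, the table above can be encoded as a small lookup helper; the function name beta_for is hypothetical and the values simply mirror the table:

```python
def beta_for(noise_type, dataset, noise_rate):
    """Return the hyperparameter beta for a given setting (values from the table above)."""
    if noise_type == "asymmetric":                   # CIFAR-10 / CIFAR-100 / Clothing1M, rates 0.1-0.4
        return 0
    if dataset == "CIFAR-10":
        return 0 if noise_rate <= 0.4 else 25        # symmetric 0.2/0.4 -> 0, 0.6/0.8 -> 25
    if dataset == "CIFAR-100":
        return 25 if noise_rate <= 0.4 else 125      # symmetric 0.2/0.4 -> 25, 0.6/0.8 -> 125
    raise ValueError("unsupported setting")

# e.g. beta_for("symmetric", "CIFAR-100", 0.8) -> 125
```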

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wen, Z., Xu, H. & Ying, S. JSMix: a holistic algorithm for learning with label noise. Neural Comput & Applic 35, 1519–1533 (2023). https://doi.org/10.1007/s00521-022-07770-9
