Abstract
Label noise has been broadly observed in real-world datasets. To mitigate the negative impact of overfitting to label noise in deep models, effective strategies (e.g., re-weighting or loss rectification) have been widely adopted in prevailing approaches and are typically learned under a meta-learning scenario. Despite the robustness to noise achieved by probabilistic meta-learning models, they usually suffer from model collapse, which degrades generalization performance. In this paper, we propose variational rectification inference (VRI) to formulate the adaptive loss rectification as an amortized variational inference problem and derive the evidence lower bound under the meta-learning framework. Specifically, VRI is constructed as a hierarchical Bayes model that treats the rectifying vector as a latent variable; this vector rectifies the loss of a noisy sample under an additional randomness regularization and is therefore more robust to label noise. To infer the rectifying vector, we approximate its conditional posterior with an amortized meta-network. By introducing the variational term in VRI, the conditional posterior is estimated accurately and avoids collapsing to a Dirac delta function, which significantly improves generalization performance. The elaborated meta-network and prior network adhere to the smoothness assumption, enabling the generation of reliable rectifying vectors. Given a set of clean meta-data, VRI can be efficiently meta-learned via bi-level optimization. Moreover, a theoretical analysis guarantees that the meta-network can be learned efficiently with our algorithm. Comprehensive comparison experiments and analyses validate its effectiveness for robust learning with noisy labels, particularly in the presence of open-set noise.
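As a concrete illustration of the rectification idea summarized above, the following is a minimal sketch, not the authors' released implementation: the module names, the per-sample statistics fed to the meta-network, and the Gaussian posterior with a standard-normal prior are all assumptions made for illustration. It scales each per-sample loss by a rectifying vector sampled (via the reparameterization trick) from an amortized meta-network, with a KL term playing the role of the randomness regularization.

# Minimal sketch (illustrative names; not the authors' released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AmortizedRectifier(nn.Module):
    """Amortized approximate posterior q_phi(v | x, y): maps simple per-sample
    statistics (here, the loss value and the prediction entropy) to the mean
    and log-variance of a Gaussian over the rectifying vector v."""
    def __init__(self, in_dim=2, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, 1)
        self.logvar = nn.Linear(hidden, 1)

    def forward(self, feats):
        h = self.body(feats)
        return self.mu(h), self.logvar(h)

def rectified_loss(logits, targets, rectifier, kl_weight=1e-3):
    """Per-sample cross-entropy scaled by a sampled rectifying vector in (0, 1),
    plus a KL term to a standard-normal prior (the randomness regularization)."""
    ce = F.cross_entropy(logits, targets, reduction="none")        # [B]
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)    # [B]
    feats = torch.stack([ce, entropy], dim=1).detach()             # [B, 2]
    mu, logvar = rectifier(feats)
    v = torch.sigmoid(mu + torch.randn_like(mu) * (0.5 * logvar).exp())  # reparameterization
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).mean()    # KL(q || N(0, 1))
    return (v.squeeze(1) * ce).mean() + kl_weight * kl

In the full method, the meta-network would additionally be updated through a one-step look-ahead on the clean meta-data (the bi-level optimization mentioned above); the sketch shows only the rectified training objective.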
Change history
25 September 2024
A Correction to this paper has been published: https://doi.org/10.1007/s11263-024-02242-0
Notes
We utilize a robust image augmentation policy, RandAugmentMC (Cubuk et al., 2020). During each training iteration, two strategies are randomly selected for the image transformation. Importantly, strong augmentation is applied exclusively to the (noisy) training data and not to the meta-data.
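As a rough sketch of this setup (using torchvision's generic RandAugment as a stand-in for RandAugmentMC, which is an assumption of the sketch), the strong transform is attached only to the noisy training split, while the meta-data keeps a weak transform:

# Strong augmentation for the (noisy) training data only; torchvision's
# RandAugment stands in for RandAugmentMC here (an assumption of this sketch).
from torchvision import transforms

weak = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

strong = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=10),  # two randomly selected ops per image
    transforms.ToTensor(),
])

# noisy_train_set = SomeNoisyDataset(..., transform=strong)  # strong aug: training data only
# meta_set        = SomeCleanDataset(..., transform=weak)    # meta data: weak aug only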
References
Arazo, E., Ortego, D., Albert, P., et al. (2019). Unsupervised label noise modeling and loss correction. In: ICML
Arpit, D., Jastrzkebski, S., Ballas, N., et al. (2017). A closer look at memorization in deep networks. In: ICML
Bai, Y., & Liu, T. (2021). Me-momentum: Extracting hard confident examples from noisily labeled data. In: ICCV
Bai, Y., Yang, E., Han, B., et al. (2021). Understanding and improving early stopping for learning with noisy labels. In: NeurIPS
Bao, F., Wu, G., Li, C., et al. (2021). Stability and generalization of bilevel programming in hyperparameter optimization. In: NeurIPS
Berthelot, D., Carlini, N., Goodfellow, I., et al. (2019). Mixmatch: A holistic approach to semi-supervised learning. In: NeurIPS
Bossard, L., Guillaumin, M., Van Gool, L. (2014). Food-101–mining discriminative components with random forests. In: ECCV
Chen, Y., Shen, X., Hu, S. X., et al. (2021). Boosting co-teaching with compression regularization for label noise. In: CVPR
Chen, Y., Hu, S. X., Shen, X., et al. (2022). Compressing features for learning with noisy labels. IEEE Transactions on Neural Networks and Learning Systems. https://doi.org/10.1109/TNNLS.2022.3186930
Cheng, D., Ning, Y., Wang, N., et al. (2022). Class-dependent label-noise learning with cycle-consistency regularization. Advances in Neural Information Processing Systems, 35, 11104–11116.
Cheng, H., Zhu, Z., Li, X., et al. (2021). Learning with instance-dependent label noise: A sample sieve approach. In: ICLR
Cubuk, E. D., Zoph, B., Shlens, J., et al. (2020). Randaugment: Practical automated data augmentation with a reduced search space. In: CVPR workshops, pp. 702–703
Cui, Y., Jia, M., Lin, T. Y., et al. (2019). Class-balanced loss based on effective number of samples. In: CVPR
Englesson, E., & Azizpour, H. (2021). Generalized Jensen-Shannon divergence loss for learning with noisy labels. In: NeurIPS
Fallah, A., Mokhtari, A., & Ozdaglar, A. (2020). On the convergence theory of gradient-based model-agnostic meta-learning algorithms. In: AISTATS
Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML
Franceschi, L., Frasconi, P., Salzo, S. et al. (2018). Bilevel programming for hyperparameter optimization and meta-learning. In: ICML
Fu, Z., Song, K., Zhou, L., et al. (2024). Noise-aware image captioning with progressively exploring mismatched words. In: AAAI, pp. 12091–12099
Ghosh, A., Kumar, H., Sastry, P. (2017). Robust loss functions under label noise for deep neural networks. In: AAAI
Goldberger, J., & Ben-Reuven, E. (2017). Training deep neural-networks using a noise adaptation layer. In: ICLR
Gudovskiy, D., Rigazio, L., Ishizaka, S., et al. (2021). Autodo: Robust autoaugment for biased data with label noise via scalable probabilistic implicit differentiation. In: CVPR
Han, B., Yao, J., Niu, G., et al. (2018a). Masking: A new perspective of noisy supervision. In: NeurIPS
Han, B., Yao, Q., Yu, X., et al. (2018b). Co-teaching: Robust training of deep neural networks with extremely noisy labels. In: NeurIPS
Han, J., Luo, P., & Wang, X. (2019). Deep self-learning from noisy labels. In: ICCV
He, K., Zhang, X., Ren, S., et al. (2016). Deep residual learning for image recognition. In: CVPR
Hendrycks, D., Mazeika, M., Wilson, D., et al. (2018). Using trusted data to train deep networks on labels corrupted by severe noise. In: NeurIPS
Higgins, I., Matthey, L., Pal, A., et al. (2017). beta-VAE: Learning basic visual concepts with a constrained variational framework. In: ICLR
Hospedales, T., Antoniou, A., Micaelli, P., et al. (2022). Meta-learning in neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9), 5149–5169.
Huang, H., Kang, H., Liu, S., et al. (2023). Paddles: Phase-amplitude spectrum disentangled early stopping for learning with noisy labels. In: ICCV
Iakovleva, E., Verbeek, J., & Alahari, K. (2020). Meta-learning with shared amortized variational inference. In: ICML
Iscen, A., Valmadre, J., Arnab, A., et al. (2022). Learning with neighbor consistency for noisy labels. In: CVPR
Jiang, L., Zhou, Z., Leung, T., et al. (2018). Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In: ICML
Kang, H., Liu, S., Huang, H., et al. (2023). Unleashing the potential of regularization strategies in learning with noisy labels. arXiv preprint arXiv:2307.05025
Kim, Y., Yun, J., Shon, H., et al. (2021). Joint negative and positive learning for noisy labels. In: CVPR
Kingma, D. P., & Welling, M. (2014). Auto-encoding variational bayes. In: ICLR
Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.
Kumar, M. P., Packer, B., Koller, D. (2010). Self-paced learning for latent variable models. In: NeurIPS
Kye, S. M., Choi, K., Yi, J., et al. (2022). Learning with noisy labels by efficient transition matrix estimation to combat label miscorrection. In: ECCV, Springer, pp. 717–738
Lee, K. H., He, X., Zhang, L., et al. (2018). Cleannet: Transfer learning for scalable image classifier training with label noise. In: CVPR
Li, J., Wong, Y., Zhao, Q., et al. (2019). Learning to learn from noisy labeled data. In: CVPR
Li, J., Socher, R. & Hoi, S. C. (2020). Dividemix: Learning with noisy labels as semi-supervised learning. In: ICLR
Li, J., Xiong, C., & Hoi, S. (2021). Mopro: Webly supervised learning with momentum prototypes. In: ICLR
Li, S., Xia, X., Ge, S., et al. (2022a). Selective-supervised contrastive learning with noisy labels. In: CVPR
Li, S., Xia, X., Zhang, H., et al. (2022). Estimating noise transition matrix with label correlations for noisy multi-label learning. Advances in Neural Information Processing Systems, 35, 24184–24198.
Liu, H., Zhong, Z., Sebe, N., et al. (2023). Mitigating robust overfitting via self-residual-calibration regularization. Artificial Intelligence, 317, 103877.
Liu, S., Niles-Weed, J., Razavian, N., et al. (2020). Early-learning regularization prevents memorization of noisy labels. In: NeurIPS
Liu, S., Zhu, Z., Qu, Q., et al. (2022). Robust training under label noise by over-parameterization. In: ICML
Liu, T., & Tao, D. (2015). Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3), 447–461.
Liu, Y., & Guo, H. (2020). Peer loss functions: Learning from noisy labels without knowing noise rates. In: ICML
Ma, X., Wang, Y., Houle, M. E., et al. (2018). Dimensionality-driven learning with noisy labels. In: ICML
Malach, E., & Shalev-Shwartz, S. (2017). Decoupling "when to update" from "how to update". In: NeurIPS
Murphy, K. P. (2023). Probabilistic machine learning: Advanced topics. MIT Press.
Nishi, K., Ding, Y., Rich, A., et al. (2021). Augmentation strategies for learning with noisy labels. In: CVPR
Ortego, D., Arazo, E., Albert, P., et al. (2021). Multi-objective interpolation training for robustness to label noise. In: CVPR
Pereyra, G., Tucker, G., Chorowski, J., et al. (2017). Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548
Pu, N., Zhong, Z., Sebe, N., et al. (2023). A memorizing and generalizing framework for lifelong person re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, 13567–13585.
Reed, S., Lee, H., Anguelov, D., et al. (2015). Training deep neural networks on noisy labels with bootstrapping. In: ICLR
Ren, M., Zeng, W., Yang, B., et al. (2018). Learning to reweight examples for robust deep learning. In: ICML
Sharma, K., Donmez, P., Luo, E., et al. (2020). Noiserank: Unsupervised label noise reduction with dependence models. In: ECCV
Shen, Y., & Sanghavi, S. (2019). Learning with bad training data via iterative trimmed loss minimization. In: ICML
Shen, Y., Liu, L., & Shao, L. (2019). Unsupervised binary representation learning with deep variational networks. International Journal of Computer Vision, 127(11), 1614–1628.
Shu, J., Xie, Q., Yi, L., et al. (2019). Meta-weight-net: Learning an explicit mapping for sample weighting. In: NeurIPS
Shu, J., Yuan, X., Meng, D., et al. (2023). Cmw-net: Learning a class-aware sample weighting mapping for robust deep learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10), 11521–11539.
Sohn, K., Berthelot, D., Carlini, N., et al. (2020). Fixmatch: Simplifying semi-supervised learning with consistency and confidence. In: NeurIPS
Song, H., Kim, M., & Lee, J. G. (2019). Selfie: Refurbishing unclean samples for robust deep learning. In: ICML
Sukhbaatar, S., Bruna, J., Paluri, M., et al. (2015). Training convolutional networks with noisy labels. In: ICLR
Sun, H., Guo, C., Wei, Q., et al. (2022). Learning to rectify for robust learning with noisy labels. Pattern Recognition, 124, 108467.
Sun, Z., Shen, F., Huang, D., et al. (2022b). Pnp: Robust learning from noisy labels by probabilistic noise prediction. In: CVPR, pp. 5311–5320
Tanno, R., Saeedi, A., Sankaranarayanan, S., et al. (2019). Learning from noisy labels by regularized estimation of annotator confusion. In: CVPR
Taraday, M. K., & Baskin, C. (2023). Enhanced meta label correction for coping with label corruption. In: ICCV, pp. 16295–16304
Vahdat, A. (2017). Toward robustness against label noise in training deep discriminative neural networks. In: NeurIPS
Virmaux, A., & Scaman, K. (2018). Lipschitz regularity of deep neural networks: Analysis and efficient estimation. In: NeurIPS
Wang, X., Kodirov, E., Hua, Y., et al. (2019). Improving MAE against CCE under label noise. arXiv preprint arXiv:1903.12141
Wang, Y., Kucukelbir, A., Blei, D. M. (2017). Robust probabilistic modeling with Bayesian data reweighting. In: ICML
Wang, Z., Hu, G., & Hu, Q. (2020). Training noise-robust deep neural networks via meta-learning. In: CVPR
Wei, H., Feng, L., Chen, X., et al. (2020). Combating noisy labels by agreement: A joint training method with co-regularization. In: CVPR
Wei, Q., Sun, H., Lu, X., et al. (2022). Self-filtering: A noise-aware sample selection for label noise with confidence penalization. In: ECCV
Wei, Q., Feng, L., Sun, H., et al. (2023). Fine-grained classification with noisy labels. In: CVPR
Wu, Y., Shu, J., Xie, Q., et al. (2021). Learning to purify noisy labels via meta soft label corrector. In: AAAI
Xia, X., Liu, T., Han, B., et al. (2020a). Robust early-learning: Hindering the memorization of noisy labels. In: ICLR
Xia, X., Liu, T., Han, B., et al. (2020b). Part-dependent label noise: Towards instance-dependent label noise. In: NeurIPS
Xia, X., Han, B., Zhan, Y., et al. (2023). Combating noisy labels with sample selection by mining high-discrepancy examples. In: ICCV
Xiao, T., Xia, T., Yang, Y., et al. (2015). Learning from massive noisy labeled data for image classification. In: CVPR
Xu, Y., Zhu, L., Jiang, L., et al. (2021a). Faster meta update strategy for noise-robust deep learning. In: CVPR
Xu, Y., Niu, X., Yang, J., et al. (2023). Usdnl: Uncertainty-based single dropout in noisy label learning. In: AAAI, pp. 10648–10656
Yang, Y., Jiang, N., Xu, Y., et al. (2024). Robust semi-supervised learning by wisely leveraging open-set data. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–15
Yao, Y., Liu, T., Han, B., et al. (2020). Dual t: Reducing estimation error for transition matrix in label-noise learning. In: NeurIPS
Yao, Y., Liu, T., Gong, M., et al. (2021). Instance-dependent label-noise learning under a structural causal model. Advances in Neural Information Processing Systems, 34, 4409–4420.
Yao, Y., Sun, Z., Zhang, C., et al. (2021b). Jo-src: A contrastive approach for combating noisy labels. In: CVPR, pp. 5192–5201
Yao, Y., Gong, M., Du, Y., et al. (2023). Which is better for learning with noisy labels: The semi-supervised method or modeling label noise? In: ICML
Yu, X., Han, B., Yao, J., et al. (2019). How does disagreement help generalization against label corruption? In: ICML
Yu, X., Jiang, Y., Shi, T., et al. (2023). How to prevent the continuous damage of noises to model training? In: CVPR
Yuan, S., Feng, L., & Liu, T. (2023). Late stopping: Avoiding confidently learning from mislabeled examples. In: ICCV
Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. In: ICML
Zagoruyko, S., & Komodakis, N. (2016). Wide residual networks. In: BMVC
Zhang, H., Cisse, M., Dauphin, Y. N., et al. (2018). mixup: Beyond empirical risk minimization. In: ICLR
Zhang, W., Wang, Y., & Qiao, Y. (2019). Metacleaner: Learning to hallucinate clean representations for noisy-labeled visual recognition. In: CVPR
Zhang, Y., Niu, G., Sugiyama, M. (2021a). Learning noise transition matrix from only noisy labels via total variation regularization. In: ICML
Zhang, Y., Zheng, S., Wu, P., et al. (2021b). Learning with feature-dependent label noise: A progressive approach. In: ICLR
Zhang, Z., & Pfister, T. (2021). Learning fast sample re-weighting without reward data. In: ICCV, pp. 725–734
Zhang, Z., & Sabuncu, M. R. (2018). Generalized cross entropy loss for training deep neural networks with noisy labels. In: NeurIPS
Zhao, Q., Shu, J., Yuan, X., et al. (2023). A probabilistic formulation for meta-weight-net. IEEE Transactions on Neural Networks and Learning Systems, 34(3), 1194–1208.
Zheng, G., Awadallah, A. H., & Dumais, S. (2021). Meta label correction for noisy label learning. In: AAAI
Zhou, X., Liu, X., Wang, C., et al. (2021). Learning with noisy labels via sparse regularization. In: ICCV
Zhu, J., Zhao, D., Zhang, B., et al. (2022). Disentangled inference for GANs with latently invertible autoencoder. International Journal of Computer Vision, 130(5), 1259–1276.
Zhu, Z., Liu, T., & Liu, Y. (2021). A second-order approach to learning with instance-dependent label noise. In: CVPR
Funding
This research was supported by Young Expert of Taishan Scholars in Shandong Province (No. tsqn202312026), Natural Science Foundation of China (No. 62106129, 62176139, 62276155), Natural Science Foundation of Shandong Province (No. ZR2021QF053, ZR2021ZD15, ZR2021MF040).
Author information
Authors and Affiliations
Contributions
H-Sun conceptualized the learning problem and provided the main idea. He also drafted the article. Q-Wei completed main experiments and provided the analysis of experimental results. L-Feng provided the theoretical guarantee for the learning algorithm. F-Liu and H-Fan contributed to participating in discussions of the algorithm and experimental designs. Y-Hu and Y-Yin provided funding supports, and Y-Hu approved the final version of the article.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Code availability
The code is now available at https://github.com/haolsun/VRI.
Additional information
Communicated by Hong Liu
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The original online version of this article was revised: The co-author’s affiliation has been corrected.
A Appendix
1.1 Derivations of the ELBO
For a single observation \((\textbf{x}, \textbf{y})\), the ELBO can be derived from the perspective of the KL divergence between the variational posterior \(q_{\phi }(\textbf{v}| \textbf{x}, \textbf{y})\) and the true posterior \(p(\textbf{v}| \textbf{x}, \textbf{y})\):
Specifically, we apply Bayes’ rule to derive Eq. (A1) as
Therefore, the ELBO for the log-likelihood of the predictive distribution in Eq. (3) can be written as follows
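Collecting the steps above, a sketch of the full chain reads as follows; this is a reconstruction in the notation above, not the exact displays of the published equations:

\begin{aligned}
D_{\mathrm{KL}}\left(q_{\phi}(\textbf{v}| \textbf{x}, \textbf{y})\,\Vert\, p(\textbf{v}| \textbf{x}, \textbf{y})\right)
&= \mathbb{E}_{q_{\phi}(\textbf{v}| \textbf{x}, \textbf{y})}\left[\log q_{\phi}(\textbf{v}| \textbf{x}, \textbf{y}) - \log p(\textbf{v}| \textbf{x}, \textbf{y})\right] \\
&= \mathbb{E}_{q_{\phi}}\left[\log q_{\phi}(\textbf{v}| \textbf{x}, \textbf{y}) - \log p(\textbf{y}| \textbf{x}, \textbf{v}) - \log p(\textbf{v}| \textbf{x})\right] + \log p(\textbf{y}| \textbf{x}),
\end{aligned}

where the second line uses Bayes' rule \(p(\textbf{v}| \textbf{x}, \textbf{y}) = p(\textbf{y}| \textbf{x}, \textbf{v})\, p(\textbf{v}| \textbf{x}) / p(\textbf{y}| \textbf{x})\). Rearranging and using the non-negativity of the KL divergence gives

\log p(\textbf{y}| \textbf{x}) \;\ge\; \mathbb{E}_{q_{\phi}(\textbf{v}| \textbf{x}, \textbf{y})}\left[\log p(\textbf{y}| \textbf{x}, \textbf{v})\right] - D_{\mathrm{KL}}\left(q_{\phi}(\textbf{v}| \textbf{x}, \textbf{y})\,\Vert\, p(\textbf{v}| \textbf{x})\right),

whose right-hand side is the ELBO on the log-likelihood of the predictive distribution.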

1.2 Proofs
Lemma 1 (Smoothness)
Proof
We begin by computing the derivative of the meta loss \(\widetilde{\mathcal {L}}^{emp}(\hat{\theta })\) w.r.t. the meta-network \(\phi \). By using Eq. (9), we have
To simplify the proof, we neglect the Monte Carlo estimation in Eq. (6) and treat it as a deterministic rectified vector in the following. This does not affect the result, since there ultimately exists a rectified vector that yields the expectation of the sampled losses. Taking the gradient w.r.t. \(\phi \) on both sides of Eq. (A4),

For the first term ❶ on the right-hand side, we obtain the following inequality w.r.t. its norm

since we assume \( \Vert \frac{\partial ^2 \mathcal {L}^{meta}(\hat{\theta })}{\partial \hat{\theta }^2} \Vert \le \ell \), \( \Vert \nabla _\theta L(\theta ) \Vert \le \tau \), \( \Vert \frac{\partial D_{\textrm{KL}}}{\partial V(\phi )} \Vert \le o\), and \( \Vert \frac{\partial V(\phi ) }{\partial \phi } \Vert \le \delta \).
For the second term ❷, we can also obtain

with the assumption \( \Vert \frac{\partial ^2 V(\phi ) }{\partial \phi ^2} \Vert \le \zeta \). Therefore, we have
Letting \(\hat{\ell } = \alpha (\tau +o)\left( \ell \alpha \delta ^2(\tau +o) + \tau \zeta \right) \), we conclude the proof (the resulting smoothness bound is restated after the proof).
\(\square \)
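For later reference, the smoothness property established by this proof, stated in the form such results usually take (a hedged restatement with the constants defined above), is: for any \(\phi_1, \phi_2\),

\left\| \nabla_{\phi}\, \mathcal{L}^{meta}\!\left(\hat{\theta}(\phi_1)\right) - \nabla_{\phi}\, \mathcal{L}^{meta}\!\left(\hat{\theta}(\phi_2)\right) \right\| \;\le\; \hat{\ell}\, \left\| \phi_1 - \phi_2 \right\|,

that is, the gradient of the meta loss w.r.t. the meta-network parameters is Lipschitz continuous with constant \(\hat{\ell} = \alpha(\tau+o)\left(\ell\alpha\delta^{2}(\tau+o) + \tau\zeta\right)\).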
Theorem 1 (Convergence Rate)
Proof
Consider

For ❸, by the Lipschitz smoothness of the meta loss function w.r.t. \(\theta \), we have
We first write \(\hat{\theta }^{(t+1)}(\phi ^{(t+1)})\) and \(\hat{\theta }^{(t)}(\phi ^{(t+1)})\) using Eq. (9). Then, using Eq. (12), we obtain
and
since \(\left\| \frac{\partial L(\theta )}{\partial \theta }\Big |_{\theta ^{(t)}}\right\| \le \tau \), \(\left\| \frac{\partial L_i^{meta} (\hat{\theta })}{\partial \hat{\theta }}\Big |_{\hat{\theta }^{(t)}}\right\| \le \tau \), and the output of \(V(\cdot )\) is bounded by the sigmoid function.
For ❹, since the gradient is computed from a mini-batch of training data drawn uniformly, we denote the bias of the stochastic gradient by \(\varepsilon ^{(t)} = \nabla \widetilde{\mathcal {L}}^{meta}\left( \hat{\theta }^{(t)}\left( \phi ^{(t)}\right) \right) - \nabla \mathcal {L}^{meta}\left( \hat{\theta }^{(t)}\left( \phi ^{(t)}\right) \right) \). We then observe that its expectation obeys \(\mathbb {E}[\varepsilon ^{(t)}] = 0\) and its variance obeys \(\mathbb {E}[\Vert \varepsilon ^{(t)}\Vert _2^2 ] \le \sigma ^2 \).
By the smoothness of \(\nabla \mathcal {L}^{meta}(\hat{\theta }^{(t)}(\phi ))\) w.r.t. \(\phi \) established in Lemma 1, we have
Thus, Eq. (A10) satisfies
We take the expectation w.r.t. \(\varepsilon ^{(t)}\) over Eq. (A15) and sum up T inequalities. By the property of the bias \(\varepsilon ^{(t)}\), we can obtain
Taking the total expectation and reordering the terms, we have
Let

Under the assumptions \(\eta _t =\min \{\frac{1}{\hat{\ell }},\frac{C}{\sigma \sqrt{T}}\}\) and \(\alpha _t=\min \{1,\frac{\kappa }{T}\}\), we have \(\eta _t-\frac{\hat{\ell }\eta _t^2}{2} \ge \eta _t - \frac{\eta _t}{2} = \frac{\eta _t}{2}\) and
Thus, we conclude our proof. \(\square \)
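Stated in the form such guarantees typically take (a hedged restatement consistent with the step sizes \(\eta_t\) and variance bound \(\sigma^2\) above, not the exact published display), the conclusion is:

\min_{0 \le t \le T}\ \mathbb{E}\!\left[\left\| \nabla_{\phi}\, \mathcal{L}^{meta}\!\left(\hat{\theta}^{(t)}\left(\phi^{(t)}\right)\right) \right\|_2^{2}\right] \;\le\; \mathcal{O}\!\left(\frac{C\sigma}{\sqrt{T}}\right),

i.e., the meta-network reaches a stationary point of the meta loss at the usual \(\mathcal{O}(1/\sqrt{T})\) rate for stochastic bi-level optimization.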
1.3 Algorithm for VRI Without the Meta Set
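One common way to run such a method without a curated meta set is to build a pseudo meta set from the noisy training data by small-loss selection. The sketch below illustrates that generic heuristic only; the function names and the selection ratio are assumptions, and this is not necessarily the exact procedure adopted by VRI.

# Sketch of the generic small-loss heuristic for building a pseudo meta set
# from noisy training data. Names and the selection ratio are illustrative.
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_pseudo_meta_set(model, loader, ratio=0.02, device="cpu"):
    """Return indices of the `ratio` fraction of samples with the smallest
    loss under the current model; these are treated as (pseudo) clean."""
    model.eval()
    losses, indices = [], []
    for x, y, idx in loader:                      # loader is assumed to yield sample indices
        logits = model(x.to(device))
        loss = F.cross_entropy(logits, y.to(device), reduction="none")
        losses.append(loss.cpu())
        indices.append(idx)
    losses = torch.cat(losses)
    indices = torch.cat(indices)
    k = max(1, int(ratio * losses.numel()))
    keep = torch.topk(-losses, k).indices         # k smallest losses
    return indices[keep]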
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sun, H., Wei, Q., Feng, L. et al. Variational Rectification Inference for Learning with Noisy Labels. Int J Comput Vis 133, 652–671 (2025). https://doi.org/10.1007/s11263-024-02205-5
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11263-024-02205-5