
Variational Rectification Inference for Learning with Noisy Labels

Published in: International Journal of Computer Vision

A Correction to this article was published on 25 September 2024

This article has been updated

Abstract

Label noise has been broadly observed in real-world datasets. To mitigate the negative impact of overfitting to label noise in deep models, effective strategies (e.g., re-weighting or loss rectification) have been broadly applied in prevailing approaches, and they are generally learned under a meta-learning scenario. Despite the robustness to noise achieved by probabilistic meta-learning models, they usually suffer from model collapse, which degrades generalization performance. In this paper, we propose variational rectification inference (VRI), which formulates the adaptive rectification of loss functions as an amortized variational inference problem and derives the evidence lower bound under the meta-learning framework. Specifically, VRI is constructed as a hierarchical Bayes model by treating the rectifying vector as a latent variable; this latent variable rectifies the loss of a noisy sample with extra randomness regularization and is therefore more robust to label noise. To infer the rectifying vector, we approximate its conditional posterior with an amortized meta-network. By introducing the variational term in VRI, the conditional posterior is estimated accurately and avoids collapsing to a Dirac delta function, which significantly improves generalization performance. The elaborated meta-network and prior network adhere to the smoothness assumption, enabling the generation of reliable rectifying vectors. Given a set of clean meta-data, VRI can be efficiently meta-learned within bi-level optimization. In addition, theoretical analysis guarantees that the meta-network can be efficiently learned with our algorithm. Comprehensive comparison experiments and analyses validate its effectiveness for robust learning with noisy labels, particularly in the presence of open-set noise.
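As a concrete illustration of the pipeline sketched above, the following minimal PyTorch snippet shows one plausible way to amortize the posterior over the rectifying vector with a small meta-network and use a sampled, sigmoid-bounded vector to rectify per-sample losses. It is only a sketch under assumed design choices: the module names (e.g., RectifierMetaNet), the feature/label conditioning, and the particular rectification form are illustrative and not the authors' released implementation at https://github.com/haolsun/VRI.

```python
# A minimal sketch (not the authors' released implementation) of the idea above:
# an amortized meta-network parameterizes q_phi(v | x, y) as a Gaussian over the
# rectifying vector v, and a sampled, sigmoid-bounded v rectifies the per-sample loss.
# Names (RectifierMetaNet, feat_dim, ...) and the rectification form are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RectifierMetaNet(nn.Module):
    """Amortized posterior over the rectifying vector, conditioned on (x, y)."""
    def __init__(self, feat_dim, num_classes, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim + num_classes, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, num_classes)      # posterior mean
        self.logvar = nn.Linear(hidden, num_classes)  # posterior log-variance

    def forward(self, feats, onehot_labels):
        h = self.encoder(torch.cat([feats, onehot_labels], dim=1))
        return self.mu(h), self.logvar(h)

def rectified_loss(logits, labels, feats, meta_net):
    """Cross-entropy rectified by a vector sampled from q_phi(v | x, y)."""
    onehot = F.one_hot(labels, num_classes=logits.size(1)).float()
    mu, logvar = meta_net(feats, onehot)
    v = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
    v = torch.sigmoid(v)                                  # bounded rectifying vector
    per_sample = -(v * onehot * F.log_softmax(logits, dim=1)).sum(dim=1)
    return per_sample.mean()
```

The sigmoid bound on the rectifying vector mirrors the boundedness assumption used later in the theoretical analysis; the extra sampling noise is what provides the randomness regularization mentioned above.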




Notes

  1. We utilize a robust image augmentation policy, RandAugmentMC (Cubuk et al., 2020). During each training iteration, two augmentation strategies are randomly selected for image transformation. Importantly, strong augmentation is applied exclusively to the (noisy) training data, not to the meta-data.
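For concreteness, the sketch below shows how such a split augmentation policy might be configured. torchvision's built-in RandAugment stands in for RandAugmentMC, and num_ops=2 mirrors the two randomly selected strategies; the CIFAR-style crop size, the magnitude, and the normalization statistics are assumptions rather than values taken from the paper.

```python
# A hedged sketch of the augmentation setup described in the note above.
# torchvision.transforms.RandAugment is used as a stand-in for RandAugmentMC.
from torchvision import transforms

# strong augmentation: applied only to the (noisy) training split
strong_train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=10),   # two random ops per image
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

# weak transform: used for the clean meta set (no strong augmentation)
meta_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
```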

References

  • Arazo, E., Ortego, D., Albert, P., et al. (2019). Unsupervised label noise modeling and loss correction. In: ICML

  • Arpit, D., Jastrzebski, S., Ballas, N., et al. (2017). A closer look at memorization in deep networks. In: ICML

  • Bai, Y., & Liu, T. (2021). Me-momentum: Extracting hard confident examples from noisily labeled data. In: ICCV

  • Bai, Y., Yang, E., Han, B., et al. (2021). Understanding and improving early stopping for learning with noisy labels. In: NeurIPS

  • Bao, F., Wu, G., Li, C., et al. (2021). Stability and generalization of bilevel programming in hyperparameter optimization. In: NeurIPS

  • Berthelot, D., Carlini, N., Goodfellow, I., et al. (2019). Mixmatch: A holistic approach to semi-supervised learning. NeurIPS

  • Bossard, L., Guillaumin, M., Van Gool, L. (2014). Food-101–mining discriminative components with random forests. In: ECCV

  • Chen, Y., Shen, X., Hu, S. X., et al. (2021). Boosting co-teaching with compression regularization for label noise. In: CVPR

  • Chen, Y., Hu, S. X., Shen, X., et al. (2022). Compressing features for learning with noisy labels. IEEE Transactions on Neural Networks and Learning Systems. https://doi.org/10.1109/TNNLS.2022.3186930

  • Cheng, D., Ning, Y., Wang, N., et al. (2022). Class-dependent label-noise learning with cycle-consistency regularization. Advances in Neural Information Processing Systems, 35, 11104–11116.

  • Cheng, H., Zhu, Z., Li, X., et al. (2021). Learning with instance-dependent label noise: A sample sieve approach. In: ICLR

  • Cubuk, E. D., Zoph, B., Shlens, J., et al. (2020). Randaugment: Practical automated data augmentation with a reduced search space. In: CVPR workshops, pp. 702–703

  • Cui, Y., Jia, M., Lin, T. Y., et al. (2019). Class-balanced loss based on effective number of samples. In: CVPR

  • Englesson, E. (2021). Generalized Jensen-Shannon divergence loss for learning with noisy labels. In: NeurIPS

  • Fallah, A., Mokhtari, A., & Ozdaglar, A. (2020). On the convergence theory of gradient-based model-agnostic meta-learning algorithms. In: AISTATS

  • Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML

  • Franceschi, L., Frasconi, P., Salzo, S. et al. (2018). Bilevel programming for hyperparameter optimization and meta-learning. In: ICML

  • Fu, Z., Song, K., Zhou, L., et al. (2024). Noise-aware image captioning with progressively exploring mismatched words. In: AAAI, pp. 12091–12099

  • Ghosh, A., Kumar, H., Sastry, P. (2017). Robust loss functions under label noise for deep neural networks. In: AAAI

  • Goldberger, J., & Ben-Reuven, E. (2017). Training deep neural-networks using a noise adaptation layer. In: ICLR

  • Gudovskiy, D., Rigazio, L., Ishizaka, S., et al. (2021). Autodo: Robust autoaugment for biased data with label noise via scalable probabilistic implicit differentiation. In: CVPR

  • Han, B., Yao, J., Niu, G., et al. (2018a). Masking: A new perspective of noisy supervision. In: NeurIPS

  • Han, B., Yao, Q., Yu, X., et al. (2018b). Co-teaching: Robust training of deep neural networks with extremely noisy labels. In: NeurIPS

  • Han, J., Luo, P., & Wang, X. (2019). Deep self-learning from noisy labels. In: ICCV

  • He, K., Zhang, X., Ren, S., et al. (2016). Deep residual learning for image recognition. In: CVPR

  • Hendrycks, D., Mazeika, M., Wilson, D., et al. (2018). Using trusted data to train deep networks on labels corrupted by severe noise. In: NeurIPS

  • Higgins, I., Matthey, L., Pal, A., et al. (2017) beta-vae: Learning basic visual concepts with a constrained variational framework. In: ICLR

  • Hospedales, T., Antoniou, A., Micaelli, P., et al. (2022). Meta-learning in neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9), 5149–5169.

  • Huang, H., Kang, H., Liu, S., et al. (2023). Paddles: Phase-amplitude spectrum disentangled early stopping for learning with noisy labels. In: ICCV

  • Iakovleva, E., Verbeek, J., & Alahari, K. (2020). Meta-learning with shared amortized variational inference. In: ICML

  • Iscen, A., Valmadre, J., Arnab, A., et al. (2022). Learning with neighbor consistency for noisy labels. In: CVPR

  • Jiang, L., Zhou, Z., Leung, T., et al. (2018). Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In: ICML

  • Kang, H., Liu, S., Huang, H., et al. (2023). Unleashing the potential of regularization strategies in learning with noisy labels. arXiv preprint arXiv:2307.05025

  • Kim, Y., Yun, J., Shon, H., et al. (2021). Joint negative and positive learning for noisy labels. In: CVPR

  • Kingma, D. P., & Welling, M. (2014). Auto-encoding variational bayes. In: ICLR

  • Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images

  • Kumar, M. P., Packer, B., Koller, D. (2010). Self-paced learning for latent variable models. In: NeurIPS

  • Kye, S. M., Choi, K., Yi, J., et al. (2022). Learning with noisy labels by efficient transition matrix estimation to combat label miscorrection. In: ECCV, Springer, pp. 717–738

  • Lee, K. H., He, X., Zhang, L., et al. (2018). Cleannet: Transfer learning for scalable image classifier training with label noise. In: CVPR

  • Li, J., Wong, Y., Zhao, Q., et al. (2019). Learning to learn from noisy labeled data. In: CVPR

  • Li, J., Socher, R. & Hoi, S. C. (2020). Dividemix: Learning with noisy labels as semi-supervised learning. In: ICLR

  • Li, J., Xiong, C., & Hoi, S. (2021). Mopro: Webly supervised learning with momentum prototypes. In: ICLR

  • Li, S., Xia, X., Ge, S., et al. (2022a). Selective-supervised contrastive learning with noisy labels. In: CVPR

  • Li, S., Xia, X., Zhang, H., et al. (2022). Estimating noise transition matrix with label correlations for noisy multi-label learning. Advances in Neural Information Processing Systems, 35, 24184–24198.

  • Liu, H., Zhong, Z., Sebe, N., et al. (2023). Mitigating robust overfitting via self-residual-calibration regularization. Artificial Intelligence, 317, 103877.

  • Liu, S., Niles-Weed, J., Razavian, N., et al. (2020). Early-learning regularization prevents memorization of noisy labels. In: NeurIPS

  • Liu, S., Zhu, Z., Qu, Q., et al. (2022). Robust training under label noise by over-parameterization. In: ICML

  • Liu, T., & Tao, D. (2015). Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3), 447–461.

  • Liu, Y., & Guo, H. (2020). Peer loss functions: Learning from noisy labels without knowing noise rates. In: ICML

  • Ma, X., Wang, Y., Houle, M. E., et al. (2018). Dimensionality-driven learning with noisy labels. In: ICML

  • Malach, E., & Shalev-Shwartz, S. (2017). Decoupling "when to update" from "how to update". In: NeurIPS

  • Murphy, K. P. (2023). Probabilistic machine learning: Advanced topics. MIT Press.

  • Nishi, K., Ding, Y., Rich, A., et al. (2021). Augmentation strategies for learning with noisy labels. In: CVPR

  • Ortego, D., Arazo, E., Albert, P., et al. (2021). Multi-objective interpolation training for robustness to label noise. In: CVPR

  • Pereyra, G., Tucker, G., Chorowski, J., et al. (2017). Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548

  • Pu, N., Zhong, Z., Sebe, N., et al. (2023). A memorizing and generalizing framework for lifelong person re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, 13567–13585.

  • Reed, S., Lee, H., Anguelov, D., et al. (2015). Training deep neural networks on noisy labels with bootstrapping. In: ICLR

  • Ren, M., Zeng, W., Yang, B., et al. (2018). Learning to reweight examples for robust deep learning. In: ICML

  • Sharma, K., Donmez, P., Luo, E., et al. (2020). Noiserank: Unsupervised label noise reduction with dependence models. In: ECCV

  • Shen, Y., & Sanghavi, S. (2019). Learning with bad training data via iterative trimmed loss minimization. In: ICML

  • Shen, Y., Liu, L., & Shao, L. (2019). Unsupervised binary representation learning with deep variational networks. International Journal of Computer Vision, 127(11), 1614–1628.

  • Shu, J., Xie, Q., Yi, L., et al. (2019). Meta-weight-net: Learning an explicit mapping for sample weighting. In: NeurIPS

  • Shu, J., Yuan, X., Meng, D., et al. (2023). Cmw-net: Learning a class-aware sample weighting mapping for robust deep learning. IEEE Transaction on Pattern Analysis and Machine Intelligence, 45(10), 11521–11539.

  • Sohn, K., Berthelot, D., Carlini, N., et al. (2020). Fixmatch: Simplifying semi-supervised learning with consistency and confidence. NeurIPS

  • Song, H., Kim, M., & Lee, J. G. (2019). Selfie: Refurbishing unclean samples for robust deep learning. In: ICML

  • Sukhbaatar, S., Bruna, J., Paluri, M., et al. (2015). Training convolutional networks with noisy labels. In: ICLR

  • Sun, H., Guo, C., Wei, Q., et al. (2022). Learning to rectify for robust learning with noisy labels. Pattern Recognition, 124, 108467.

  • Sun, Z., Shen, F., Huang, D., et al. (2022b). Pnp: Robust learning from noisy labels by probabilistic noise prediction. In: CVPR, pp. 5311–5320

  • Tanno, R., Saeedi, A., Sankaranarayanan, S., et al. (2019). Learning from noisy labels by regularized estimation of annotator confusion. In: CVPR

  • Taraday, M. K., & Baskin, C. (2023). Enhanced meta label correction for coping with label corruption. In: ICCV, pp. 16295–16304

  • Vahdat, A. (2017). Toward robustness against label noise in training deep discriminative neural networks. In: NeurIPS

  • Virmaux, A., & Scaman, K. (2018). Lipschitz regularity of deep neural networks: Analysis and efficient estimation. In: NeurIPS

  • Wang, X., Kodirov, E., Hua, Y., et al. (2019). Improving MAE against CCE under label noise. arXiv preprint arXiv:1903.12141

  • Wang, Y., Kucukelbir, A., Blei, D. M. (2017). Robust probabilistic modeling with Bayesian data reweighting. In: ICML

  • Wang, Z., Hu, G., & Hu, Q. (2020). Training noise-robust deep neural networks via meta-learning. In: CVPR

  • Wei, H., Feng, L., Chen, X., et al. (2020). Combating noisy labels by agreement: A joint training method with co-regularization. In: CVPR

  • Wei, Q., Sun, H., Lu, X., et al. (2022). Self-filtering: A noise-aware sample selection for label noise with confidence penalization. In: ECCV

  • Wei, Q., Feng, L., Sun, H., et al. (2023). Fine-grained classification with noisy labels. In: CVPR

  • Wu, Y., Shu, J., Xie, Q., et al. (2021). Learning to purify noisy labels via meta soft label corrector. In: AAAI

  • Xia, X., Liu, T., Han, B., et al. (2020a). Robust early-learning: Hindering the memorization of noisy labels. In: ICLR

  • Xia, X., Liu, T., Han, B., et al. (2020b). Part-dependent label noise: Towards instance-dependent label noise. In: NeurIPS

  • Xia, X., Han, B., Zhan, Y., et al. (2023). Combating noisy labels with sample selection by mining high-discrepancy examples. In: ICCV

  • Xiao, T., Xia, T., Yang, Y., et al. (2015). Learning from massive noisy labeled data for image classification. In: CVPR

  • Xu, Y., Zhu, L., Jiang, L., et al. (2021a). Faster meta update strategy for noise-robust deep learning. In: CVPR

  • Xu, Y., Zhu, L., Jiang, L., et al. (2021b). Faster meta update strategy for noise-robust deep learning. In: CVPR

  • Xu, Y., Niu, X., Yang, J., et al. (2023). Usdnl: Uncertainty-based single dropout in noisy label learning. In: AAAI, pp. 10648–10656

  • Yang, Y., Jiang, N., Xu, Y., et al. (2024). Robust semi-supervised learning by wisely leveraging open-set data. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–15

  • Yao, Y., Liu, T., Han, B., et al. (2020). Dual t: Reducing estimation error for transition matrix in label-noise learning. In: NeurIPS

  • Yao, Y., Liu, T., Gong, M., et al. (2021). Instance-dependent label-noise learning under a structural causal model. Advances in Neural Information Processing Systems, 34, 4409–4420.

  • Yao, Y., Sun, Z., Zhang, C., et al. (2021b). Jo-src: A contrastive approach for combating noisy labels. In: CVPR, pp. 5192–5201

  • Yao, Y., Gong, M., Du, Y., et al. (2023). Which is better for learning with noisy labels: The semi-supervised method or modeling label noise? In: ICML

  • Yu, X., Han, B., Yao, J., et al. (2019). How does disagreement help generalization against label corruption? In: ICML

  • Yu, X., Jiang, Y., Shi, T., et al. (2023). How to prevent the continuous damage of noises to model training? In: CVPR

  • Yuan, S., Feng, L., & Liu, T. (2023). Late stopping: Avoiding confidently learning from mislabeled examples. In: ICCV

  • Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. In: ICML

  • Zagoruyko, S., & Komodakis, N. (2016). Wide residual networks. In: BMVC

  • Zhang, H., Cisse, M., Dauphin, Y. N., et al. (2018). mixup: Beyond empirical risk minimization. In: ICLR

  • Zhang, W., Wang, Y., & Qiao, Y. (2019). Metacleaner: Learning to hallucinate clean representations for noisy-labeled visual recognition. In: CVPR

  • Zhang, Y., Niu, G., Sugiyama, M. (2021a). Learning noise transition matrix from only noisy labels via total variation regularization. In: ICML

  • Zhang, Y., Zheng, S., Wu, P., et al. (2021b). Learning with feature-dependent label noise: A progressive approach. In: ICLR

  • Zhang, Z., & Pfister, T. (2021). Learning fast sample re-weighting without reward data. In: ICCV, pp. 725–734

  • Zhang, Z., & Sabuncu, M. R. (2018). Generalized cross entropy loss for training deep neural networks with noisy labels. In: NeurIPS

  • Zhao, Q., Shu, J., Yuan, X., et al. (2023). A probabilistic formulation for meta-weight-net. IEEE Transactions on Neural Networks and Learning Systems, 34(3), 1194–1208.

  • Zheng, G., Awadallah, A. H., & Dumais, S. (2021). Meta label correction for noisy label learning. In: AAAI

  • Zhou, X., Liu, X., Wang, C., et al. (2021). Learning with noisy labels via sparse regularization. In: ICCV

  • Zhu, J., Zhao, D., Zhang, B., et al. (2022). Disentangled inference for GANs with latently invertible autoencoder. International Journal of Computer Vision, 130(5), 1259–1276.

  • Zhu, Z., Liu, T., & Liu, Y. (2021). A second-order approach to learning with instance-dependent label noise. In: CVPR


Funding

This research was supported by Young Expert of Taishan Scholars in Shandong Province (No. tsqn202312026), Natural Science Foundation of China (No. 62106129, 62176139, 62276155), Natural Science Foundation of Shandong Province (No. ZR2021QF053, ZR2021ZD15, ZR2021MF040).

Author information

Contributions

H-Sun conceptualized the learning problem, provided the main idea, and drafted the article. Q-Wei completed the main experiments and provided the analysis of the experimental results. L-Feng provided the theoretical guarantee for the learning algorithm. F-Liu and H-Fan participated in discussions of the algorithm and the experimental designs. Y-Hu and Y-Yin provided funding support, and Y-Hu approved the final version of the article.

Corresponding author

Correspondence to Yupeng Hu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Code availability

The code is now available at https://github.com/haolsun/VRI.

Additional information

Communicated by Hong Liu

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this article was revised: The co-author’s affiliation has been corrected.

A Appendix

1.1 Derivation of the ELBO

For a single observation \((\textbf{x}, \textbf{y} )\), the ELBO can be derived from the KL divergence between the variational posterior \(q_{\phi }(\textbf{v}| \textbf{x}, \textbf{y})\) and the true posterior \(p(\textbf{v}| \textbf{x}, \textbf{y})\):

$$\begin{aligned} & D_{\textrm{KL}}[q_{\phi }(\textbf{v}| \textbf{x}, \textbf{y}) || p(\textbf{v}| \textbf{x}, \textbf{y})] \nonumber \\ & = \mathbb {E}_{q_{\phi }(\textbf{v}| \textbf{x}, \textbf{y})} \left[ \log q_{\phi }(\textbf{v}| \textbf{x}, \textbf{y}) - \log p(\textbf{v}| \textbf{x}, \textbf{y})\right] \nonumber \\ & = \mathbb {E}_{q_{\phi }(\textbf{v}| \textbf{x}, \textbf{y})} \left[ \log q_{\phi }(\textbf{v}| \textbf{x}, \textbf{y}) - \log \frac{p(\textbf{v}| \textbf{x}, \textbf{y}) p(\textbf{x}, \textbf{y})}{p(\textbf{x}, \textbf{y})}\right] \nonumber \\ & = \log p(\textbf{y}| \textbf{x}) + \mathbb {E}_{q_{\phi }(\textbf{v}| \textbf{x}, \textbf{y})} \big [\log q_{\phi }(\textbf{v}| \textbf{x}, \textbf{y}) \nonumber \\ & \quad \quad \quad \quad \quad \quad \quad \quad - \log p(\textbf{y}| \textbf{x}, \textbf{v}) - \log p(\textbf{v}| \textbf{x}) \big ] \nonumber \\ & = \log p(\textbf{y}| \textbf{x}) - \mathbb {E}_{q_{\phi }(\textbf{v}| \textbf{x}, \textbf{y})} \left[ \log p(\textbf{y}| \textbf{x}, \textbf{v})\right] \nonumber \\ & \quad \quad \quad \quad \quad \quad \quad \quad +D_{\textrm{KL}}[q_{\phi }(\textbf{v}| \textbf{x}, \textbf{y}) || p(\textbf{v}| \textbf{x})] \nonumber \\ & \ge 0. \end{aligned}$$
(A1)

Specifically, we apply Bayes’ rule to derive Eq. (A1) as

$$\begin{aligned} \begin{aligned} p&(\textbf{v}| \textbf{x}, \textbf{y}) = \frac{p(\textbf{v}| \textbf{x}, \textbf{y}) p(\textbf{x}, \textbf{y})}{p(\textbf{x}, \textbf{y})} \\&= \frac{p(\textbf{y} | \textbf{x},\textbf{v} ) p(\textbf{x},\textbf{v})}{p(\textbf{x}, \textbf{y})} = \frac{p(\textbf{y} | \textbf{x},\textbf{v} ) p(\textbf{v}|\textbf{x})}{p(\textbf{y}| \textbf{x})}. \end{aligned} \end{aligned}$$
(A2)

Therefore, since the KL divergence in Eq. (A1) is non-negative, the ELBO for the log-likelihood of the predictive distribution in Eq. (3) can be written as

$$\begin{aligned} \log p(\textbf{y}| \textbf{x}) \ge \mathbb {E}_{q_{\phi }(\textbf{v}| \textbf{x}, \textbf{y})} \left[ \log p(\textbf{y}| \textbf{x}, \textbf{v})\right] - D_{\textrm{KL}}[q_{\phi }(\textbf{v}| \textbf{x}, \textbf{y}) || p(\textbf{v}| \textbf{x})]. \end{aligned}$$
(A3)
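To make this objective concrete, the sketch below assembles a negative-ELBO training loss under the assumption that both the amortized posterior \(q_{\phi }(\textbf{v}| \textbf{x}, \textbf{y})\) and the prior \(p(\textbf{v}| \textbf{x})\) are diagonal Gaussians produced by small networks; meta_net and prior_net are hypothetical modules returning (mu, logvar), and the rectified likelihood term reuses the illustrative form from the earlier sketch rather than the authors' exact parameterization.

```python
# A minimal sketch of the negative ELBO in Eq. (A3), assuming diagonal-Gaussian
# q_phi(v | x, y) and p(v | x); meta_net and prior_net are hypothetical modules
# returning (mu, logvar).
import torch
import torch.nn.functional as F

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Per-sample KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(dim=1)

def neg_elbo(logits, labels, feats, meta_net, prior_net, num_mc=1):
    """Monte Carlo estimate of -E_q[log p(y|x,v)] plus KL(q(v|x,y) || p(v|x))."""
    onehot = F.one_hot(labels, num_classes=logits.size(1)).float()
    mu_q, logvar_q = meta_net(feats, onehot)   # amortized posterior q_phi(v | x, y)
    mu_p, logvar_p = prior_net(feats)          # prior network p(v | x)
    nll = 0.0
    for _ in range(num_mc):                    # Monte Carlo samples of v
        v = torch.sigmoid(mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp())
        nll = nll - (v * onehot * F.log_softmax(logits, dim=1)).sum(dim=1)
    nll = nll / num_mc
    return (nll + gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)).mean()
```

The closed-form Gaussian KL keeps the variational term cheap, while the reparameterized samples of v keep the likelihood term differentiable w.r.t. the meta-network.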

1.2 Proof

Lemma 1 (Smoothness)

Proof

We begin by computing the derivative of the meta loss \(\mathcal {L}^{meta}(\hat{\theta })\) w.r.t. the meta-network parameters \(\phi \). Using Eq. (9), we have

$$\begin{aligned} \begin{aligned} \frac{\partial \mathcal {L}^{meta}(\hat{\theta })}{\partial \phi }&= \frac{\partial \mathcal {L}^{meta}(\hat{\theta })}{\partial \hat{\theta }} \frac{\partial \hat{\theta }}{\partial V(\phi )} \frac{\partial V(\phi ) }{\partial \phi } \\&=\alpha \frac{\partial \mathcal {L}^{meta}(\hat{\theta })}{\partial \hat{\theta }} \left( \nabla _\theta L(\theta ) + \frac{\partial D_{\textrm{KL}}}{\partial V(\phi )} \right) \frac{\partial V(\phi ) }{\partial \phi }. \end{aligned}\nonumber \\ \end{aligned}$$
(A4)

To simplify the proof, we neglect the Monte Carlo estimation in Eq. (6) and treat the rectifying vector as deterministic in what follows. This does not affect the result, since there ultimately exists a rectifying vector that attains the expectation of the sampled losses. Taking the gradient with respect to \(\phi \) on both sides of Eq. (A4),

(A5)

For the first term ❶ on the right-hand side, we can obtain the following inequality on its norm

(A6)

since we assume \( \Vert \frac{\partial ^2 \mathcal {L}^{meta}(\hat{\theta })}{\partial \hat{\theta }^2} \Vert \le \ell \), \( \Vert \nabla _\theta L(\theta ) \Vert \le \tau \), \( \Vert \frac{\partial D_{\textrm{KL}}}{\partial V(\phi )} \Vert \le o\), and \( \Vert \frac{\partial V(\phi ) }{\partial \phi } \Vert \le \delta \).

For the second term ❷, we can also obtain

(A7)

with the assumption \( \Vert \frac{\partial ^2 V(\phi ) }{\partial \phi ^2} \Vert \le \zeta \). Therefore, we have

$$\begin{aligned} \begin{aligned} \left\| \frac{\partial ^2 \mathcal {L}^{meta}(\hat{\theta })}{\partial \phi ^2} \right\| \le \alpha (\tau +o)\left( \ell \alpha \delta ^2(\tau +o) + \tau \zeta \right) . \end{aligned} \end{aligned}$$
(A8)

Letting \(\hat{\ell } = \alpha (\tau +o)\left( \ell \alpha \delta ^2(\tau +o) + \tau \zeta \right) \), we conclude the proof:

$$\begin{aligned} \Vert \nabla \mathcal {L}^{meta}(\hat{\theta }(\phi ^{(t+1)})) - \nabla \mathcal {L}^{meta}(\hat{\theta }(\phi ^{(t)})) \Vert \le \hat{\ell } \Vert \phi ^{(t+1)} - \phi ^{(t)} \Vert . \end{aligned}$$
(A9)

\(\square \)
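The chain rule in Eq. (A4) corresponds, in code, to differentiating the clean meta loss through a one-step pseudo-update of the classifier. The following hedged sketch illustrates that bi-level step with PyTorch higher-order gradients; model, meta_net, prior_net, neg_elbo, and meta_opt (an optimizer over the meta- and prior-network parameters) are assumed to exist, and the subsequent update of the classifier itself is omitted.

```python
# A hedged sketch of the bi-level step analyzed above: a one-step pseudo-update of the
# classifier with the rectified loss on noisy data, then a meta-gradient step on phi
# computed through that pseudo-update, mirroring the chain rule in Eq. (A4).
import torch
import torch.nn.functional as F
from torch.func import functional_call

def meta_step(model, meta_net, prior_net, meta_opt,
              x_noisy, y_noisy, x_meta, y_meta, alpha=0.01):
    # inner step: rectified loss on noisy data, kept differentiable w.r.t. phi
    params = dict(model.named_parameters())
    logits = functional_call(model, params, (x_noisy,))
    feats = logits.detach()        # simplification: condition the meta-net on logits
    inner_loss = neg_elbo(logits, y_noisy, feats, meta_net, prior_net)
    grads = torch.autograd.grad(inner_loss, list(params.values()), create_graph=True)
    pseudo = {k: p - alpha * g for (k, p), g in zip(params.items(), grads)}

    # outer step: clean meta loss at the pseudo-updated parameters, backprop to phi
    meta_logits = functional_call(model, pseudo, (x_meta,))
    meta_loss = F.cross_entropy(meta_logits, y_meta)
    meta_opt.zero_grad()
    meta_loss.backward()           # reaches phi through `pseudo` (create_graph=True)
    meta_opt.step()
    return meta_loss.item()
```

Here create_graph=True preserves the dependence of the pseudo-parameters on \(\phi \), so that meta_loss.backward() propagates through the factor \(\partial V(\phi ) / \partial \phi \) appearing in Eq. (A4).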

Theorem 1 (Convergence Rate)

Proof

Consider

(A10)

For ❸, by Lipschitz smoothness of the meta loss function for \(\theta \), we have

$$\begin{aligned} & \mathcal {L}^{meta}(\hat{\theta }^{(t+1)}(\phi ^{(t+1)}))- \mathcal {L}^{meta}(\hat{\theta }^{(t)}(\phi ^{(t+1)})) \nonumber \\ & \le \langle \nabla \mathcal {L}^{meta}(\hat{\theta }^{(t)}(\phi ^{(t+1)})), \hat{\theta }^{(t+1)}(\phi ^{(t+1)})-\hat{\theta }^{(t)}(\phi ^{(t+1)}) \rangle \nonumber \\ & \qquad +\frac{\ell }{2}\Vert \hat{\theta }^{(t+1)}(\phi ^{(t+1)})-\hat{\theta }^{(t)}(\phi ^{(t+1)})\Vert _2^2. \end{aligned}$$
(A11)

We first write \(\hat{\theta }^{(t+1)}(\phi ^{(t+1)})\) and \(\hat{\theta }^{(t)}(\phi ^{(t+1)})\) using Eq. (9). Then, with Eq. (12), we obtain

$$\begin{aligned} \begin{aligned} \hat{\theta }^{(t+1)}&(\phi ^{(t+1)}) - \hat{\theta }^{(t)}(\phi ^{(t+1)}) \\ &= - \alpha \nabla _\theta \mathcal {L}^{emp}(\hat{\theta }^{(t+1)}(\phi ^{(t+1)})). \end{aligned} \end{aligned}$$
(A12)

and

$$\begin{aligned} \begin{aligned} \Vert \mathcal {L}^{meta}&(\hat{\theta }^{(t+1)}(\phi ^{(t+1)}))- \mathcal {L}^{meta}(\hat{\theta }^{(t)}(\phi ^{(t+1)})) \Vert \\&\le \alpha _t \tau ^2+ \frac{\ell \alpha _t^2}{2} \tau ^2 = \alpha _t\tau ^2 (1+\frac{\alpha _t \ell }{2}), \end{aligned} \end{aligned}$$
(A13)

since \(\left\| \frac{\partial L(\theta )}{\partial \theta }\Big |_{\theta ^{(t)}}\right\| \le \tau \), \(\left\| \frac{\partial L_i^{meta} (\hat{\theta })}{\partial \hat{\theta }}\Big |_{\hat{\theta }^{(t)}}\right\| \le \tau \), and the output of \(V(\cdot )\) is bounded by the sigmoid function.

For ❹, since the gradient is computed from a mini-batch of training data drawn uniformly at random, we denote the error of the stochastic gradient by \(\varepsilon ^{(t)} = \nabla \widetilde{\mathcal {L}}^{meta}\left( \hat{\theta }^{(t)}\left( \phi ^{(t)}\right) \right) - \nabla \mathcal {L}^{meta}\left( \hat{\theta }^{(t)}\left( \phi ^{(t)}\right) \right) \), whose expectation satisfies \(\mathbb {E}[\varepsilon ^{(t)}] = 0\) and whose variance satisfies \(\mathbb {E}[\Vert \varepsilon ^{(t)}\Vert _2^2 ] \le \sigma ^2 \).

By smoothness of \(\nabla \mathcal {L}^{meta}(\hat{\theta }^{(t)}(\phi ))\) for \(\phi \) in Lemma 1, we have

$$\begin{aligned} & \mathcal {L}^{meta}(\hat{\theta }^{(t)}(\phi ^{(t+1)}))-\mathcal {L}^{meta}(\hat{\theta }^{(t)}(\phi ^{(t)})) \nonumber \\ & \le \langle \nabla \mathcal {L}^{meta}(\hat{\theta }^{(t)}(\phi ^{(t)})),\phi ^{(t+1)}-\phi ^{(t)} \rangle + \frac{\hat{\ell }}{2} \Vert \phi ^{(t+1)}-\phi ^{(t)}\Vert _2^2 \nonumber \\ & = \langle \nabla \mathcal {L}^{meta}(\hat{\theta }^{(t)}(\phi ^{(t)})), -\eta _t [\nabla \mathcal {L}^{meta}(\hat{\theta }^{(t)}(\phi ^{(t)}))+\varepsilon ^{(t)} ] \rangle \nonumber \\ & \quad + \frac{\hat{\ell } \eta _t^2}{2} \Vert \nabla \mathcal {L}^{meta}(\hat{\theta }^{(t)}(\phi ^{(t)}))+\varepsilon ^{(t)}\Vert _2^2 \nonumber \\ & = -(\eta _t-\frac{\hat{\ell } \eta _t^2}{2}) \Vert \nabla \mathcal {L}^{meta}(\hat{\theta }^{(t)}(\phi ^{(t)}))\Vert _2^2 + \frac{\hat{\ell } \eta _t^2}{2}\Vert \varepsilon ^{(t)}\Vert _2^2 \nonumber \\ & \quad - (\eta _t-\hat{\ell } \eta _t^2)\langle \nabla \mathcal {L}^{meta}(\hat{\theta }^{(t)}(\phi ^{(t)})),\varepsilon ^{(t)}\rangle . \end{aligned}$$
(A14)

Thus, Eq. (A10) satisfies

$$\begin{aligned} & \mathcal {L}^{meta}(\hat{\theta }^{(t+1)}(\phi ^{(t+1)}))-\mathcal {L}^{meta}(\hat{\theta }^{(t)}(\phi ^{(t)}))\nonumber \\ & \le \alpha _t\tau ^2 (1+\frac{\alpha _t \ell }{2}) -(\eta _t-\frac{\hat{\ell }\eta _t^2}{2}) \Vert \nabla \mathcal {L}^{meta}(\hat{\theta }^{(t)}(\phi ^{(t)}))\Vert _2^2 \nonumber \\ & \quad + \frac{\hat{\ell }\eta _t^2}{2}\Vert \varepsilon ^{(t)}\Vert _2^2 - (\eta _t-\hat{\ell } \eta _t^2)\langle \nabla \mathcal {L}^{meta}(\hat{\theta }^{(t)}(\phi ^{(t)})),\varepsilon ^{(t)}\rangle .\nonumber \\ \end{aligned}$$
(A15)

We take the expectation w.r.t. \(\varepsilon ^{(t)}\) on both sides of Eq. (A15) and sum the resulting T inequalities. By the properties of \(\varepsilon ^{(t)}\), we can obtain

$$\begin{aligned} \begin{aligned} \sum \limits _{t = 1}^T&\left( \mathop { \mathbb {E}}_{\varepsilon ^{(t)}} \mathcal {L}^{meta}(\hat{\theta }^{(t+1)}(\phi ^{(t+1)}))-\mathop { \mathbb {E}}_{\varepsilon ^{(t)}}\mathcal {L}^{meta}(\hat{\theta }^{(t)}(\phi ^{(t)}))\right) \\&\le \tau ^2 \sum \limits _{t = 1}^T \alpha _t(1+\frac{\alpha _t \ell }{2}) \\&\quad - \sum \limits _{t = 1}^T (\eta _t-\frac{\hat{\ell }\eta _t^2}{2})\mathop { \mathbb {E}}_{\varepsilon ^{(t)}} \left[ \Vert \nabla \mathcal {L}^{meta}(\hat{\theta }^{(t)}(\phi ^{(t)}))\Vert _2^2\right] \\&\quad + \frac{\hat{\ell } \sigma ^2}{2} \sum \limits _{t = 1}^T \eta _t^2. \end{aligned} \end{aligned}$$
(A16)

Taking the total expectation and reordering the terms, we have

$$\begin{aligned} \begin{aligned}&\frac{1}{T}\sum \limits _{t = 1}^T (\eta _t-\frac{\hat{\ell }\eta _t^2}{2}) \mathop { \mathbb {E}} \left[ \Vert \nabla \mathcal {L}^{meta}(\hat{\theta }^{(t)}(\phi ^{(t)}))\Vert _2^2 \right] \\&\le \frac{\mathcal {L}^{meta}(\hat{\theta }^{(0)}(\phi ^{(0)})) - \mathop { \mathbb {E}}\left[ \mathcal {L}^{meta}(\hat{\theta }^{(T+1)}(\phi ^{(T+1)}) ) \right] }{T} \\&+ \frac{\tau ^2 }{T} \sum \limits _{t = 1}^T \alpha _t(1+\frac{\alpha _t \ell }{2}) + \frac{\hat{\ell } \sigma ^2}{2T} \sum \limits _{t = 1}^T \eta _t^2. \end{aligned} \end{aligned}$$
(A17)

Let

$$\begin{aligned} E = \mathcal {L}^{meta}(\hat{\theta }^{(0)}(\phi ^{(0)})) - \mathop { \mathbb {E}}\left[ \mathcal {L}^{meta}(\hat{\theta }^{(T+1)}(\phi ^{(T+1)}) ) \right] . \end{aligned}$$
(A18)

With the assumption of \(\eta _t =\min \{\frac{1}{\hat{\ell }},\frac{C}{\sigma \sqrt{T}}\}\) and \(\alpha _t=\min \{1,\frac{\kappa }{T}\}\), we have \(\eta _t-\frac{\hat{\ell }\eta _t^2}{2} \ge \eta _t - \frac{\eta _t}{2} = \frac{\eta _t}{2}\) and

$$\begin{aligned} \begin{aligned}&\frac{1}{T}\sum \limits _{t = 1}^T \mathop { \mathbb {E}} \left[ \Vert \nabla \mathcal {L}^{meta}(\hat{\theta }^{(t)}(\phi ^{(t)}))\Vert _2^2 \right] \\&\le \frac{2E}{T\eta _1} + \frac{(2+\ell )\tau ^2 \alpha _1 }{\eta _1} + \hat{\ell } \sigma ^2 \eta _1 \\&= \frac{2E}{T} \max \{\hat{\ell }, \frac{\sigma \sqrt{T}}{C} \} + (2+\ell ) \tau ^2 \min \{1,\frac{\kappa }{T}\} \max \{\hat{\ell }, \frac{\sigma \sqrt{T}}{C} \} \\&\quad + \hat{\ell } \sigma ^2 \min \{\frac{1}{\hat{\ell }},\frac{C}{\sigma \sqrt{T}}\} \\&\le \frac{2\sigma E}{C\sqrt{T}} + \frac{ (2+\ell )\tau ^2 \kappa \sigma }{C \sqrt{T} } + \frac{C\hat{\ell } \sigma ^2}{\sigma \sqrt{T}} = \mathcal {O}(\frac{1}{\sqrt{T}}). \end{aligned} \end{aligned}$$
(A19)

Thus, we conclude our proof. \(\square \)

1.3 Algorithm for VRI Without the Meta Set

Algorithm 2
Learning without meta data

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sun, H., Wei, Q., Feng, L. et al. Variational Rectification Inference for Learning with Noisy Labels. Int J Comput Vis 133, 652–671 (2025). https://doi.org/10.1007/s11263-024-02205-5

