Abstract
Label noise has been broadly observed in real-world datasets. To mitigate the negative impact of overfitting to label noise in deep models, effective strategies (e.g., re-weighting or loss rectification) have been widely adopted in prevailing approaches and are typically learned under a meta-learning scenario. Despite the robustness to noise achieved by probabilistic meta-learning models, they usually suffer from model collapse, which degrades generalization performance. In this paper, we propose variational rectification inference (VRI) to formulate the adaptive loss rectification as an amortized variational inference problem and derive the evidence lower bound under the meta-learning framework. Specifically, VRI is constructed as a hierarchical Bayes model that treats the rectifying vector as a latent variable; this vector rectifies the loss of a noisy sample under an additional randomness regularization and is therefore more robust to label noise. To infer the rectifying vector, we approximate its conditional posterior with an amortized meta-network. By introducing the variational term in VRI, the conditional posterior is estimated accurately and avoids collapsing to a Dirac delta function, which significantly improves generalization performance. The elaborated meta-network and prior network adhere to the smoothness assumption, enabling the generation of reliable rectifying vectors. Given a set of clean meta-data, VRI can be efficiently meta-learned via bi-level optimization. Moreover, a theoretical analysis guarantees that the meta-network can be learned efficiently with our algorithm. Comprehensive comparison experiments and analyses validate its effectiveness for robust learning with noisy labels, particularly in the presence of open-set noise.
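As a concrete illustration of the rectification idea summarized above, the following is a minimal sketch, not the authors' released implementation: the module names, the per-sample statistics fed to the meta-network, and the Gaussian posterior with a standard-normal prior are all assumptions made for illustration. It scales each per-sample loss by a rectifying vector sampled (via the reparameterization trick) from an amortized meta-network, with a KL term playing the role of the randomness regularization.

# Minimal sketch (illustrative names; not the authors' released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AmortizedRectifier(nn.Module):
    """Amortized approximate posterior q_phi(v | x, y): maps simple per-sample
    statistics (here, the loss value and the prediction entropy) to the mean
    and log-variance of a Gaussian over the rectifying vector v."""
    def __init__(self, in_dim=2, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, 1)
        self.logvar = nn.Linear(hidden, 1)

    def forward(self, feats):
        h = self.body(feats)
        return self.mu(h), self.logvar(h)

def rectified_loss(logits, targets, rectifier, kl_weight=1e-3):
    """Per-sample cross-entropy scaled by a sampled rectifying vector in (0, 1),
    plus a KL term to a standard-normal prior (the randomness regularization)."""
    ce = F.cross_entropy(logits, targets, reduction="none")        # [B]
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)    # [B]
    feats = torch.stack([ce, entropy], dim=1).detach()             # [B, 2]
    mu, logvar = rectifier(feats)
    v = torch.sigmoid(mu + torch.randn_like(mu) * (0.5 * logvar).exp())  # reparameterization
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).mean()    # KL(q || N(0, 1))
    return (v.squeeze(1) * ce).mean() + kl_weight * kl

In the full method, the meta-network would additionally be updated through a one-step look-ahead on the clean meta-data (the bi-level optimization mentioned above); the sketch shows only the rectified training objective.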
Change history
25 September 2024
A Correction to this paper has been published: https://doi.org/10.1007/s11263-024-02242-0
Notes
We utilize a robust image augmentation policy, RandAugmentMC (Cubuk et al., 2020). During each training iteration, two strategies are randomly selected for the image transformation. Importantly, strong augmentation is applied exclusively to the (noisy) training data and not to the meta-data.
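As a rough sketch of this setup (using torchvision's generic RandAugment as a stand-in for RandAugmentMC, which is an assumption of the sketch), the strong transform is attached only to the noisy training split, while the meta-data keeps a weak transform:

# Strong augmentation for the (noisy) training data only; torchvision's
# RandAugment stands in for RandAugmentMC here (an assumption of this sketch).
from torchvision import transforms

weak = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

strong = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=10),  # two randomly selected ops per image
    transforms.ToTensor(),
])

# noisy_train_set = SomeNoisyDataset(..., transform=strong)  # strong aug: training data only
# meta_set        = SomeCleanDataset(..., transform=weak)    # meta data: weak aug only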
References
Arazo, E., Ortego, D., Albert, P., et al. (2019). Unsupervised label noise modeling and loss correction. In: ICML
Arpit, D., Jastrzkebski, S., Ballas, N., et al. (2017). A closer look at memorization in deep networks. In: ICML
Bai, Y., & Liu, T. (2021). Me-momentum: Extracting hard confident examples from noisily labeled data. In: ICCV
Bai, Y., Yang, E., Han, B., et al. (2021). Understanding and improving early stopping for learning with noisy labels. In: NeurIPS
Bao, F., Wu, G., Li, C., et al. (2021). Stability and generalization of bilevel programming in hyperparameter optimization. In: NeurIPS
Berthelot, D., Carlini, N., Goodfellow, I., et al. (2019). Mixmatch: A holistic approach to semi-supervised learning. In: NeurIPS
Bossard, L., Guillaumin, M., Van Gool, L. (2014). Food-101–mining discriminative components with random forests. In: ECCV
Chen, Y., Shen, X., Hu, S. X., et al. (2021). Boosting co-teaching with compression regularization for label noise. In: CVPR
Chen, Y., Hu, S. X., Shen, X., et al. (2022). Compressing features for learning with noisy labels. IEEE Transactions on Neural Networks and Learning Systems. https://doi.org/10.1109/TNNLS.2022.3186930
Cheng, D., Ning, Y., Wang, N., et al. (2022). Class-dependent label-noise learning with cycle-consistency regularization. Advances in Neural Information Processing Systems, 35, 11104–11116.
Cheng, H., Zhu, Z., Li, X., et al. (2021). Learning with instance-dependent label noise: A sample sieve approach. In: ICLR
Cubuk, E. D., Zoph, B., Shlens, J., et al. (2020). Randaugment: Practical automated data augmentation with a reduced search space. In: CVPR workshops, pp. 702–703
Cui, Y., Jia, M., Lin, T. Y., et al. (2019). Class-balanced loss based on effective number of samples. In: CVPR
Englesson, E., & Azizpour, H. (2021). Generalized Jensen-Shannon divergence loss for learning with noisy labels. In: NeurIPS
Fallah, A., Mokhtari, A., & Ozdaglar, A. (2020). On the convergence theory of gradient-based model-agnostic meta-learning algorithms. In: AISTATS
Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML
Franceschi, L., Frasconi, P., Salzo, S. et al. (2018). Bilevel programming for hyperparameter optimization and meta-learning. In: ICML
Fu, Z., Song, K., Zhou, L., et al. (2024). Noise-aware image captioning with progressively exploring mismatched words. In: AAAI, pp. 12091–12099
Ghosh, A., Kumar, H., Sastry, P. (2017). Robust loss functions under label noise for deep neural networks. In: AAAI
Goldberger, J., & Ben-Reuven, E. (2017). Training deep neural-networks using a noise adaptation layer. In: ICLR
Gudovskiy, D., Rigazio, L., Ishizaka, S., et al. (2021). Autodo: Robust autoaugment for biased data with label noise via scalable probabilistic implicit differentiation. In: CVPR
Han, B., Yao, J., Niu, G., et al. (2018a). Masking: A new perspective of noisy supervision. In: NeurIPS
Han, B., Yao, Q., Yu, X., et al. (2018b). Co-teaching: Robust training of deep neural networks with extremely noisy labels. In: NeurIPS
Han, J., Luo, P., & Wang, X. (2019). Deep self-learning from noisy labels. In: ICCV
He, K., Zhang, X., Ren, S., et al. (2016). Deep residual learning for image recognition. In: CVPR
Hendrycks, D., Mazeika, M., Wilson, D., et al. (2018). Using trusted data to train deep networks on labels corrupted by severe noise. In: NeurIPS
Higgins, I., Matthey, L., Pal, A., et al. (2017). beta-VAE: Learning basic visual concepts with a constrained variational framework. In: ICLR
Hospedales, T., Antoniou, A., Micaelli, P., et al. (2022). Meta-learning in neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9), 5149–5169.
Huang, H., Kang, H., Liu, S., et al. (2023). Paddles: Phase-amplitude spectrum disentangled early stopping for learning with noisy labels. In: ICCV
Iakovleva, E., Verbeek, J., & Alahari, K. (2020). Meta-learning with shared amortized variational inference. In: ICML
Iscen, A., Valmadre, J., Arnab, A., et al. (2022). Learning with neighbor consistency for noisy labels. In: CVPR
Jiang, L., Zhou, Z., Leung, T., et al. (2018). Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In: ICML
Kang, H., Liu, S., Huang, H., et al. (2023). Unleashing the potential of regularization strategies in learning with noisy labels. arXiv preprint arXiv:2307.05025
Kim, Y., Yun, J., Shon, H., et al. (2021). Joint negative and positive learning for noisy labels. In: CVPR
Kingma, D. P., & Welling, M. (2014). Auto-encoding variational bayes. In: ICLR
Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.
Kumar, M. P., Packer, B., Koller, D. (2010). Self-paced learning for latent variable models. In: NeurIPS
Kye, S. M., Choi, K., Yi, J., et al. (2022). Learning with noisy labels by efficient transition matrix estimation to combat label miscorrection. In: ECCV, Springer, pp. 717–738
Lee, K. H., He, X., Zhang, L., et al. (2018). Cleannet: Transfer learning for scalable image classifier training with label noise. In: CVPR
Li, J., Wong, Y., Zhao, Q., et al. (2019). Learning to learn from noisy labeled data. In: CVPR
Li, J., Socher, R. & Hoi, S. C. (2020). Dividemix: Learning with noisy labels as semi-supervised learning. In: ICLR
Li, J., Xiong, C., & Hoi, S. (2021). Mopro: Webly supervised learning with momentum prototypes. In: ICLR
Li, S., Xia, X., Ge, S., et al. (2022a). Selective-supervised contrastive learning with noisy labels. In: CVPR
Li, S., Xia, X., Zhang, H., et al. (2022). Estimating noise transition matrix with label correlations for noisy multi-label learning. Advances in Neural Information Processing Systems, 35, 24184–24198.
Liu, H., Zhong, Z., Sebe, N., et al. (2023). Mitigating robust overfitting via self-residual-calibration regularization. Artificial Intelligence, 317, 103877.
Liu, S., Niles-Weed, J., Razavian, N., et al. (2020). Early-learning regularization prevents memorization of noisy labels. In: NeurIPS
Liu, S., Zhu, Z., Qu, Q., et al. (2022). Robust training under label noise by over-parameterization. In: ICML
Liu, T., & Tao, D. (2015). Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3), 447–461.
Liu, Y., & Guo, H. (2020). Peer loss functions: Learning from noisy labels without knowing noise rates. In: ICML
Ma, X., Wang, Y., Houle, M. E., et al. (2018). Dimensionality-driven learning with noisy labels. In: ICML
Malach, E., & Shalev-Shwartz, S. (2017). Decoupling "when to update" from "how to update". In: NeurIPS
Murphy, K. P. (2023). Probabilistic machine learning: Advanced topics. MIT Press.
Nishi, K., Ding, Y., Rich, A., et al. (2021). Augmentation strategies for learning with noisy labels. In: CVPR
Ortego, D., Arazo, E., Albert, P., et al. (2021). Multi-objective interpolation training for robustness to label noise. In: CVPR
Pereyra, G., Tucker, G., Chorowski, J., et al. (2017). Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548
Pu, N., Zhong, Z., Sebe, N., et al. (2023). A memorizing and generalizing framework for lifelong person re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, 13567–13585.
Reed, S., Lee, H., Anguelov, D., et al. (2015). Training deep neural networks on noisy labels with bootstrapping. In: ICLR
Ren, M., Zeng, W., Yang, B., et al. (2018). Learning to reweight examples for robust deep learning. In: ICML
Sharma, K., Donmez, P., Luo, E., et al. (2020). Noiserank: Unsupervised label noise reduction with dependence models. In: ECCV
Shen, Y., & Sanghavi, S. (2019). Learning with bad training data via iterative trimmed loss minimization. In: ICML
Shen, Y., Liu, L., & Shao, L. (2019). Unsupervised binary representation learning with deep variational networks. International Journal of Computer Vision, 127(11), 1614–1628.
Shu, J., Xie, Q., Yi, L., et al. (2019). Meta-weight-net: Learning an explicit mapping for sample weighting. In: NeurIPS
Shu, J., Yuan, X., Meng, D., et al. (2023). Cmw-net: Learning a class-aware sample weighting mapping for robust deep learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10), 11521–11539.
Sohn, K., Berthelot, D., Carlini, N., et al. (2020). Fixmatch: Simplifying semi-supervised learning with consistency and confidence. In: NeurIPS
Song, H., Kim, M., & Lee, J. G. (2019). Selfie: Refurbishing unclean samples for robust deep learning. In: ICML
Sukhbaatar, S., Bruna, J., Paluri, M., et al. (2015). Training convolutional networks with noisy labels. In: ICLR
Sun, H., Guo, C., Wei, Q., et al. (2022). Learning to rectify for robust learning with noisy labels. Pattern Recognition, 124, 108467.
Sun, Z., Shen, F., Huang, D., et al. (2022b). Pnp: Robust learning from noisy labels by probabilistic noise prediction. In: CVPR, pp. 5311–5320
Tanno, R., Saeedi, A., Sankaranarayanan, S., et al. (2019). Learning from noisy labels by regularized estimation of annotator confusion. In: CVPR
Taraday, M. K., & Baskin, C. (2023). Enhanced meta label correction for coping with label corruption. In: ICCV, pp. 16295–16304
Vahdat, A. (2017). Toward robustness against label noise in training deep discriminative neural networks. In: NeurIPS
Virmaux, A., & Scaman, K. (2018). Lipschitz regularity of deep neural networks: Analysis and efficient estimation. In: NeurIPS
Wang, X., Kodirov, E., Hua, Y., et al. (2019). Improving MAE against CCE under label noise. arXiv preprint arXiv:1903.12141
Wang, Y., Kucukelbir, A., Blei, D. M. (2017). Robust probabilistic modeling with Bayesian data reweighting. In: ICML
Wang, Z., Hu, G., & Hu, Q. (2020). Training noise-robust deep neural networks via meta-learning. In: CVPR
Wei, H., Feng, L., Chen, X., et al. (2020). Combating noisy labels by agreement: A joint training method with co-regularization. In: CVPR
Wei, Q., Sun, H., Lu, X., et al. (2022). Self-filtering: A noise-aware sample selection for label noise with confidence penalization. In: ECCV
Wei, Q., Feng, L., Sun, H., et al. (2023). Fine-grained classification with noisy labels. In: CVPR
Wu, Y., Shu, J., Xie, Q., et al. (2021). Learning to purify noisy labels via meta soft label corrector. In: AAAI
Xia, X., Liu, T., Han, B., et al. (2020a). Robust early-learning: Hindering the memorization of noisy labels. In: ICLR
Xia, X., Liu, T., Han, B., et al. (2020b). Part-dependent label noise: Towards instance-dependent label noise. In: NeurIPS
Xia, X., Han, B., Zhan, Y., et al. (2023). Combating noisy labels with sample selection by mining high-discrepancy examples. In: ICCV
Xiao, T., Xia, T., Yang, Y., et al. (2015). Learning from massive noisy labeled data for image classification. In: CVPR
Xu, Y., Zhu, L., Jiang, L., et al. (2021a). Faster meta update strategy for noise-robust deep learning. In: CVPR
Xu, Y., Niu, X., Yang, J., et al. (2023). Usdnl: Uncertainty-based single dropout in noisy label learning. In: AAAI, pp. 10648–10656
Yang, Y., Jiang, N., Xu, Y., et al. (2024). Robust semi-supervised learning by wisely leveraging open-set data. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–15
Yao, Y., Liu, T., Han, B., et al. (2020). Dual t: Reducing estimation error for transition matrix in label-noise learning. In: NeurIPS
Yao, Y., Liu, T., Gong, M., et al. (2021). Instance-dependent label-noise learning under a structural causal model. Advances in Neural Information Processing Systems, 34, 4409–4420.
Yao, Y., Sun, Z., Zhang, C., et al. (2021b). Jo-src: A contrastive approach for combating noisy labels. In: CVPR, pp. 5192–5201
Yao, Y., Gong, M., Du, Y., et al. (2023). Which is better for learning with noisy labels: The semi-supervised method or modeling label noise? In: ICML
Yu, X., Han, B., Yao, J., et al. (2019). How does disagreement help generalization against label corruption? In: ICML
Yu, X., Jiang, Y., Shi, T., et al. (2023). How to prevent the continuous damage of noises to model training? In: CVPR
Yuan, S., Feng, L., & Liu, T. (2023). Late stopping: Avoiding confidently learning from mislabeled examples. In: ICCV
Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. In: ICML
Zagoruyko, S., & Komodakis, N. (2016). Wide residual networks. In: BMVC
Zhang, H., Cisse, M., Dauphin, Y. N., et al. (2018). mixup: Beyond empirical risk minimization. In: ICLR
Zhang, W., Wang, Y., & Qiao, Y. (2019). Metacleaner: Learning to hallucinate clean representations for noisy-labeled visual recognition. In: CVPR
Zhang, Y., Niu, G., Sugiyama, M. (2021a). Learning noise transition matrix from only noisy labels via total variation regularization. In: ICML
Zhang, Y., Zheng, S., Wu, P., et al. (2021b). Learning with feature-dependent label noise: A progressive approach. In: ICLR
Zhang, Z., & Pfister, T. (2021). Learning fast sample re-weighting without reward data. In: ICCV, pp. 725–734
Zhang, Z., & Sabuncu, M. R. (2018). Generalized cross entropy loss for training deep neural networks with noisy labels. In: NeurIPS
Zhao, Q., Shu, J., Yuan, X., et al. (2023). A probabilistic formulation for meta-weight-net. IEEE Transactions on Neural Networks and Learning Systems, 34(3), 1194–1208.
Zheng, G., Awadallah, A. H., & Dumais, S. (2021). Meta label correction for noisy label learning. In: AAAI
Zhou, X., Liu, X., Wang, C., et al. (2021). Learning with noisy labels via sparse regularization. In: ICCV
Zhu, J., Zhao, D., Zhang, B., et al. (2022). Disentangled inference for GANs with latently invertible autoencoder. International Journal of Computer Vision, 130(5), 1259–1276.
Zhu, Z., Liu, T., & Liu, Y. (2021). A second-order approach to learning with instance-dependent label noise. In: CVPR
Funding
This research was supported by Young Expert of Taishan Scholars in Shandong Province (No. tsqn202312026), Natural Science Foundation of China (No. 62106129, 62176139, 62276155), Natural Science Foundation of Shandong Province (No. ZR2021QF053, ZR2021ZD15, ZR2021MF040).
Author information
Authors and Affiliations
Contributions
H-Sun conceptualized the learning problem and provided the main idea. He also drafted the article. Q-Wei completed main experiments and provided the analysis of experimental results. L-Feng provided the theoretical guarantee for the learning algorithm. F-Liu and H-Fan contributed to participating in discussions of the algorithm and experimental designs. Y-Hu and Y-Yin provided funding supports, and Y-Hu approved the final version of the article.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Code availability
The code is now available at https://github.com/haolsun/VRI.
Additional information
Communicated by Hong Liu
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The original online version of this article was revised: The co-author’s affiliation has been corrected.
A Appendix
1.1 Derivations of the ELBO
For a single observation \((\textbf{x}, \textbf{y})\), the ELBO can be derived from the perspective of the KL divergence between the variational posterior \(q_{\phi }(\textbf{v}| \textbf{x}, \textbf{y})\) and the true posterior \(p(\textbf{v}| \textbf{x}, \textbf{y})\):
Specifically, we apply Bayes’ rule to derive Eq. (A1) as
Therefore, the ELBO for the log-likelihood of the predictive distribution in Eq. (3) can be written as follows
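Collecting the steps above, a sketch of the full chain reads as follows; this is a reconstruction in the notation above, not the exact displays of the published equations:

\begin{aligned}
D_{\mathrm{KL}}\left(q_{\phi}(\textbf{v}| \textbf{x}, \textbf{y})\,\Vert\, p(\textbf{v}| \textbf{x}, \textbf{y})\right)
&= \mathbb{E}_{q_{\phi}(\textbf{v}| \textbf{x}, \textbf{y})}\left[\log q_{\phi}(\textbf{v}| \textbf{x}, \textbf{y}) - \log p(\textbf{v}| \textbf{x}, \textbf{y})\right] \\
&= \mathbb{E}_{q_{\phi}}\left[\log q_{\phi}(\textbf{v}| \textbf{x}, \textbf{y}) - \log p(\textbf{y}| \textbf{x}, \textbf{v}) - \log p(\textbf{v}| \textbf{x})\right] + \log p(\textbf{y}| \textbf{x}),
\end{aligned}

where the second line uses Bayes' rule \(p(\textbf{v}| \textbf{x}, \textbf{y}) = p(\textbf{y}| \textbf{x}, \textbf{v})\, p(\textbf{v}| \textbf{x}) / p(\textbf{y}| \textbf{x})\). Rearranging and using the non-negativity of the KL divergence gives

\log p(\textbf{y}| \textbf{x}) \;\ge\; \mathbb{E}_{q_{\phi}(\textbf{v}| \textbf{x}, \textbf{y})}\left[\log p(\textbf{y}| \textbf{x}, \textbf{v})\right] - D_{\mathrm{KL}}\left(q_{\phi}(\textbf{v}| \textbf{x}, \textbf{y})\,\Vert\, p(\textbf{v}| \textbf{x})\right),

whose right-hand side is the ELBO on the log-likelihood of the predictive distribution.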

1.2 Proofs
Lemma 1 (Smoothness)
Proof
We begin by computing the derivative of the meta loss \(\widetilde{\mathcal {L}}^{emp}(\hat{\theta })\) w.r.t. the meta-network \(\phi \). By using Eq. (9), we have
To simplify the proof, we neglect the Monte Carlo estimation in Eq. (6) and treat it as a deterministic rectified vector in the following. This does not affect the result, since there ultimately exists a rectified vector that yields the expectation of the sampled losses. Taking the gradient w.r.t. \(\phi \) on both sides of Eq. (A4),

For the first term ❶ on the right-hand side, we obtain the following inequality w.r.t. its norm

since we assume \( \Vert \frac{\partial ^2 \mathcal {L}^{meta}(\hat{\theta })}{\partial \hat{\theta }^2} \Vert \le \ell \), \( \Vert \nabla _\theta L(\theta ) \Vert \le \tau \), \( \Vert \frac{\partial D_{\textrm{KL}}}{\partial V(\phi )} \Vert \le o\), and \( \Vert \frac{\partial V(\phi ) }{\partial \phi } \Vert \le \delta \).
For the second term ❷, we can also obtain

with the assumption \( \Vert \frac{\partial ^2 V(\phi ) }{\partial \phi ^2} \Vert \le \zeta \). Therefore, we have
Letting \(\hat{\ell } = \alpha (\tau +o)\left( \ell \alpha \delta ^2(\tau +o) + \tau \zeta \right) \), we conclude the proof (the resulting smoothness bound is restated after the proof).
\(\square \)
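For later reference, the smoothness property established by this proof, stated in the form such results usually take (a hedged restatement with the constants defined above), is: for any \(\phi_1, \phi_2\),

\left\| \nabla_{\phi}\, \mathcal{L}^{meta}\!\left(\hat{\theta}(\phi_1)\right) - \nabla_{\phi}\, \mathcal{L}^{meta}\!\left(\hat{\theta}(\phi_2)\right) \right\| \;\le\; \hat{\ell}\, \left\| \phi_1 - \phi_2 \right\|,

that is, the gradient of the meta loss w.r.t. the meta-network parameters is Lipschitz continuous with constant \(\hat{\ell} = \alpha(\tau+o)\left(\ell\alpha\delta^{2}(\tau+o) + \tau\zeta\right)\).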
Theorem 1 (Convergence Rate)
Proof
Consider

For ❸, by the Lipschitz smoothness of the meta loss function w.r.t. \(\theta \), we have
We first write \(\hat{\theta }^{(t+1)}(\phi ^{(t+1)})\) and \(\hat{\theta }^{(t)}(\phi ^{(t+1)})\) using Eq. (9). Then, using Eq. (12), we obtain
and
since \(\left\| \frac{\partial L(\theta )}{\partial \theta }\Big |_{\theta ^{(t)}}\right\| \le \tau \), \(\left\| \frac{\partial L_i^{meta} (\hat{\theta })}{\partial \hat{\theta }}\Big |_{\hat{\theta }^{(t)}}\right\| \le \tau \), and the output of \(V(\cdot )\) is bounded by the sigmoid function.
For ❹, since the gradient is computed from a mini-batch of training data drawn uniformly, we denote the bias of the stochastic gradient by \(\varepsilon ^{(t)} = \nabla \widetilde{\mathcal {L}}^{meta}\left( \hat{\theta }^{(t)}\left( \phi ^{(t)}\right) \right) - \nabla \mathcal {L}^{meta}\left( \hat{\theta }^{(t)}\left( \phi ^{(t)}\right) \right) \). We then observe that its expectation obeys \(\mathbb {E}[\varepsilon ^{(t)}] = 0\) and its variance obeys \(\mathbb {E}[\Vert \varepsilon ^{(t)}\Vert _2^2 ] \le \sigma ^2 \).
By the smoothness of \(\nabla \mathcal {L}^{meta}(\hat{\theta }^{(t)}(\phi ))\) w.r.t. \(\phi \) established in Lemma 1, we have
Thus, Eq. (A10) satisfies
We take the expectation w.r.t. \(\varepsilon ^{(t)}\) over Eq. (A15) and sum up T inequalities. By the property of the bias \(\varepsilon ^{(t)}\), we can obtain
Taking the total expectation and reordering the terms, we have
Let

Under the assumptions \(\eta _t =\min \{\frac{1}{\hat{\ell }},\frac{C}{\sigma \sqrt{T}}\}\) and \(\alpha _t=\min \{1,\frac{\kappa }{T}\}\), we have \(\eta _t-\frac{\hat{\ell }\eta _t^2}{2} \ge \eta _t - \frac{\eta _t}{2} = \frac{\eta _t}{2}\) and
Thus, we conclude our proof. \(\square \)
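Stated in the form such guarantees typically take (a hedged restatement consistent with the step sizes \(\eta_t\) and variance bound \(\sigma^2\) above, not the exact published display), the conclusion is:

\min_{0 \le t \le T}\ \mathbb{E}\!\left[\left\| \nabla_{\phi}\, \mathcal{L}^{meta}\!\left(\hat{\theta}^{(t)}\left(\phi^{(t)}\right)\right) \right\|_2^{2}\right] \;\le\; \mathcal{O}\!\left(\frac{C\sigma}{\sqrt{T}}\right),

i.e., the meta-network reaches a stationary point of the meta loss at the usual \(\mathcal{O}(1/\sqrt{T})\) rate for stochastic bi-level optimization.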
1.3 Algorithm for VRI Without the Meta Set
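One common way to run such a method without a curated meta set is to build a pseudo meta set from the noisy training data by small-loss selection. The sketch below illustrates that generic heuristic only; the function names and the selection ratio are assumptions, and this is not necessarily the exact procedure adopted by VRI.

# Sketch of the generic small-loss heuristic for building a pseudo meta set
# from noisy training data. Names and the selection ratio are illustrative.
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_pseudo_meta_set(model, loader, ratio=0.02, device="cpu"):
    """Return indices of the `ratio` fraction of samples with the smallest
    loss under the current model; these are treated as (pseudo) clean."""
    model.eval()
    losses, indices = [], []
    for x, y, idx in loader:                      # loader is assumed to yield sample indices
        logits = model(x.to(device))
        loss = F.cross_entropy(logits, y.to(device), reduction="none")
        losses.append(loss.cpu())
        indices.append(idx)
    losses = torch.cat(losses)
    indices = torch.cat(indices)
    k = max(1, int(ratio * losses.numel()))
    keep = torch.topk(-losses, k).indices         # k smallest losses
    return indices[keep]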
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sun, H., Wei, Q., Feng, L. et al. Variational Rectification Inference for Learning with Noisy Labels. Int J Comput Vis 133, 652–671 (2025). https://doi.org/10.1007/s11263-024-02205-5
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11263-024-02205-5