Continual variational dropout: a view of auxiliary local variables in continual learning

Abstract

The regularization/prior-based approach is one of the critical strategies in continual learning, owing to its mechanism for preserving learned knowledge and preventing forgetting. Without any retraining on previous data or extension of the network architecture, it works by constraining the weights that are important for previous tasks while learning the current one. However, the regularization/prior approach suffers from the problem that the weights can move far into a parameter region in which the model performs well on the latest task but poorly on earlier ones. To address this problem, we propose a novel solution that continually applies variational dropout (CVD), generating task-specific local variables that act as modifying factors on the global variables to fit each task. In particular, by imposing a variational distribution on the auxiliary local variables, which serve as multiplicative noise on the layers' inputs, the model keeps the global variables in a region that is good for all tasks and reduces forgetting. Furthermore, we obtain theoretical properties that are unavailable in existing methods: (1) uncorrelated likelihoods between different data instances reduce the high variance of stochastic gradient variational Bayes; (2) correlated pre-activations improve the representation ability for each task; and (3) data-dependent regularization keeps the global variables in a good region for all tasks. Throughout our extensive experiments, adding the local variables significantly enhances the performance of regularization/prior-based methods by considerable margins on numerous datasets. Specifically, it brings several standard baselines closer to state-of-the-art results.

Availability of data and materials

Not applicable.

Notes

  1. https://github.com/csm9493/UCL.

  2. https://github.com/yolky/gvcl.

  3. https://github.com/sangwon79/AGS-CL.

References

  • Ahn, H., Cha, S., Lee, D., & Moon, T. (2019). Uncertainty-based continual learning with adaptive regularization. In Advances in Neural Information Processing Systems (pp. 4392–4402).

  • Aljundi, R., Babiloni, F., Elhoseiny, M., Rohrbach, M., & Tuytelaars, T. (2018). Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 139–154).

  • Bach, T. X., Anh, N. D., Linh, N. V., & Than, K. (2023). Dynamic transformation of prior knowledge into Bayesian models for data streams. IEEE Transactions on Knowledge and Data Engineering, 35(4), 3742–3750.

  • Benzing, F. (2020). Understanding regularisation methods for continual learning. In Workshop of Advances in Neural Information Processing Systems.

  • Blundell, C., Cornebise, J., Kavukcuoglu, K., & Wierstra, D. (2015). Weight uncertainty in neural network. In International conference on machine learning (pp. 1613–1622). PMLR.

  • Boluki, S., Ardywibowo, R., Dadaneh, S. Z., Zhou, M., & Qian, X. (2020). Learnable Bernoulli dropout for Bayesian deep learning. In The International Conference on Artificial Intelligence and Statistics, AISTATS (pp. 3905–3916).

  • Cha, S., Hsu, H., Hwang, T., Calmon, F. P., & Moon, T. (2021). CPR: Classifier-projection regularization for continual learning. In 9th International Conference on Learning Representations, ICLR.

  • Deng, L. (2012). The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6), 141–142.

  • De Lange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., & Tuytelaars, T. (2021). A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence.

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. In International conference on learning representations.

  • Farquhar, S., & Gal, Y. (2018). A unifying Bayesian view of continual learning. In The Bayesian deep learning workshop at neural information processing systems.

  • Gal, Y., Hron, J., & Kendall, A. (2017). Concrete dropout. In Advances in Neural Information Processing Systems (pp. 3581–3590).

  • Ghahramani, Z., & Attias, H. (2000). Online variational Bayesian learning. In Slides from talk presented at NIPS workshop on online learning.

  • Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., & Bengio, Y. (2013). An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211.

  • Graves, A. (2011). Practical variational inference for neural networks. In Advances in Neural Information Processing Systems (pp. 2348–2356). Citeseer.

  • Ha, C., Tran, V.-D., Van, L. N., & Than, K. (2019). Eliminating overfitting of probabilistic topic models on short and noisy text: The role of dropout. International Journal of Approximate Reasoning, 112, 85–104.

  • Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F., Dorundo, E., Desai, R., Zhu, T., Parajuli, S., Guo, M., Song, D., Steinhardt, J., & Gilmer, J. (2021). The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 8340–8349).

  • Henning, C., Cervera, M., D’Angelo, F., Von Oswald, J., Traber, R., Ehret, B., Kobayashi, S., Grewe, B. F., & Sacramento, J. (2021). Posterior meta-replay for continual learning. In Advances in neural information processing systems (Vol. 34).

  • Jung, S., Ahn, H., Cha, S., & Moon, T. (2020). Continual learning with node-importance based adaptive group sparse regularization. In Advances in neural information processing systems.

  • Kingma, D. P., Salimans, T., & Welling, M. (2015). Variational dropout and the local reparameterization trick. Advances in Neural Information Processing Systems, 28, 2575–2583.

  • Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In: Bengio, Y., LeCun, Y. (eds.) 2nd international conference on learning representations, ICLR.

  • Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13), 3521–3526.

  • Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.

  • Li, Z., & Hoiem, D. (2017). Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12), 2935–2947.

  • Van Linh, N., Bach, T. X., & Than, K. (2022). A graph convolutional topic model for short and noisy text streams. Neurocomputing, 468, 345–359.

  • Liu, Y., Dong, W., Zhang, L., Gong, D., & Shi, Q. (2019). Variational Bayesian dropout with a hierarchical prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7124–7133).

  • Loo, N., Swaroop, S., & Turner, R. E. (2021). Generalized variational continual learning. In International conference on learning representations.

  • MacKay, D. J. C. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3), 448–472.

  • Mirzadeh, S., Farajtabar, M., Pascanu, R., & Ghasemzadeh, H. (2020). Understanding the role of training regimes in continual learning. In Advances in neural information processing systems.

  • Mirzadeh, S. I., Farajtabar, M., & Ghasemzadeh, H. (2020). Dropout as an implicit gating mechanism for continual learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops (pp. 232–233).

  • Molchanov, D., Ashukha, A., & Vetrov, D. (2017). Variational dropout sparsifies deep neural networks. In International conference on machine learning (pp. 2498–2507).

  • Murphy, K. P. (2012). Machine learning: A probabilistic perspective. Cambridge: MIT Press.

  • Neal, R. M. (1996). Bayesian learning for neural networks. Berlin: Springer.

  • Nguyen, T., Mai, T., Nguyen, N., Van, L. N., & Than, K. (2022b). Balancing stability and plasticity when learning topic models from short and noisy text streams. Neurocomputing, 505, 30–43.

  • Nguyen, S., Nguyen, D., Nguyen, K., Than, K., Bui, H., & Ho, N. (2021). Structured dropout variational inference for Bayesian neural networks. Advances in Neural Information Processing Systems, 34, 15188–15202.

  • Nguyen, H., Pham, H., Nguyen, S., Van Linh, N., & Than, K. (2022a). Adaptive infinite dropout for noisy and sparse data streams. Machine Learning, 111(8), 3025–3060.

  • Nguyen, C. V., Li, Y., Bui, T. D., & Turner, R. E. (2018). Variational continual learning. In International conference on learning representations.

  • Nguyen, V.-S., Nguyen, D.-T., Van, L.N., & Than, K. (2019). Infinite dropout for training Bayesian models from data streams. In IEEE international conference on big data (Big Data) (pp. 125–134). IEEE.

  • Oh, C., Adamczewski, K., & Park, M. (2020). Radial and directional posteriors for Bayesian deep learning. In The thirty-fourth conference on artificial intelligence, AAAI (pp. 5298–5305).

  • Paisley, J. W., Blei, D. M., & Jordan, M. I. (2012). Variational Bayesian inference with stochastic search. In Proceedings of the 29th international conference on machine learning, ICML.

  • Phan, H., Tuan, A. P., Nguyen, S., Linh, N. V., & Than, K. (2022). Reducing catastrophic forgetting in neural networks via Gaussian mixture approximation. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 106–117). Berlin: Springer.

  • Sato, M.-A. (2001). Online model selection based on the variational Bayes. Neural Computation, 13(7), 1649–1681.

  • Shi, G., Chen, J., Zhang, W., Zhan, L.-M., & Wu, X.-M. (2021). Overcoming catastrophic forgetting in incremental few-shot learning by finding flat minima. Advances in Neural Information Processing Systems, 34, 6747–6761.

  • Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929–1958.

  • Swaroop, S., Nguyen, C.V., Bui, T. D., & Turner, R. E. (2018). Improving and understanding variational continual learning. In NeurIPS Continual Learning Workshop.

  • Swiatkowski, J., Roth, K., Veeling, B., Tran, L., Dillon, J., Snoek, J., Mandt, S., Salimans, T., Jenatton, R., & Nowozin, S. (2020). The k-tied normal distribution: A compact parameterization of Gaussian mean field posteriors in Bayesian neural networks. In International conference on machine learning (pp. 9289–9299). PMLR.

  • Van, L.N., Hai, N.L., Pham, H., & Than, K. (2022). Auxiliary local variables for improving regularization/prior approach in continual learning. In Pacific-Asia conference on knowledge discovery and data mining (pp. 16–28). Berlin: Springer.

  • Van de Ven, G. M., & Tolias, A. S. (2019). Three scenarios for continual learning. In NeurIPS—Continual learning workshop.

  • Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 dataset.

  • Wei, C., Kakade, S., & Ma, T. (2020). The implicit and explicit regularization effects of dropout. In International conference on machine learning (pp. 10181–10192). PMLR.

  • Yin, D., Farajtabar, M., & Li, A. (2020). Sola: Continual learning with second-order loss approximation. In Workshop of advances in neural information processing systems.

  • Zenke, F., Poole, B., & Ganguli, S. (2017). Continual learning through synaptic intelligence. Proceedings of Machine Learning Research, 70, 3987.

Funding

 This research has been supported in part by the NSF grant CNS-1747798 to the IUCRC Center for Big Learning and the NSF grant # 2239570.

Author information

Authors and Affiliations

Authors

Contributions

The contributions of each author are presented as follows: NLH: Methodology, Software, Validation, Formal analysis, Visualization, Writing—original draft, Investigation. TN: Methodology, Software, Validation, Formal analysis, Writing—original draft, Investigation. LNV: Conceptualization, Methodology, Validation, Formal analysis, Writing—original draft, Visualization, Investigation, Project administration. THN: Methodology, Validation, Formal analysis, Writing—review, Visualization, Supervision. KT: Methodology, Validation, Formal analysis, Writing—review, Supervision, Funding acquisition.

Corresponding author

Correspondence to Linh Ngo Van.

Ethics declarations

Conflicts of interest

The authors declare that they have no competing interests.

Code availability

The implementation of CVD is available at https://github.com/nguyenvuthientrang/CVD.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Editor: Gustavo Batista.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A part of this work appears in Van et al. (2022).

Appendices

Appendix A: Auxiliary local variables for uncertainty regularized continual learning (UCL)

In this section, we review UCL (Ahn et al., 2019), one of the state-of-the-art methods for continual learning, and describe how to apply CVD to it. UCL uses the same likelihood term as VCL but reinterprets and improves the KL term of VCL. The KL term is rewritten as follows:

$$\begin{aligned} \frac{1}{2} \sum _{l=1}^L \left[ \left\| \frac{\varvec{\mu }_t^{(l)} - \varvec{\mu }_{t-1}^{(l)}}{\varvec{\sigma }_{t-1}^{(l)}}\right\| _2^2 + {\textbf{1}}^\intercal \left\{ \left( \frac{\varvec{\sigma }_{t}^{(l)}}{\varvec{\sigma }_{t-1}^{(l)}} \right) ^2 - \log \left( \frac{\varvec{\sigma }_{t}^{(l)}}{\varvec{\sigma }_{t-1}^{(l)}} \right) ^2 \right\} \right] \end{aligned}$$
(A1)

where l is the layer index of the neural network. UCL improves VCL by defining node importance and then adding two regularization terms: based on node importance, the first term limits the change of the weights connected to important nodes, while the second encourages the remaining weights to actively learn new tasks. In detail, UCL constrains the standard deviations of all weights connecting to the same node u in layer l to share the same value \(\varvec{\sigma }_{u}^{(l)}\) and then uses this parameter to measure node importance. Moreover, it modifies the KL term to freeze the weights related to important nodes:

$$\begin{aligned} KL&= \sum _{l=1}^{L}\Big [\Big ( \frac{1}{2}\Big \Vert \mathbf {\Lambda }^{(l)}\odot (\varvec{\mu }_{t}^{(l)}-\varvec{\mu }_{t-1}^{(l)})\Big \Vert _2^2 \nonumber \\&+ (\varvec{\sigma }_{\text {init}}^{(l)})^2 \Big \Vert \Big (\frac{\varvec{\mu }_{t-1}^{(l)}}{\varvec{\sigma }_{t-1}^{(l)}}\Big )^{2}\odot (\varvec{\mu }_{t}^{(l)}-\varvec{\mu }_{t-1}^{(l)}) \Big \Vert _1\Big )\nonumber \\&+ \frac{\beta }{2}{\textbf{1}}^\top \Big \{\Big (\frac{\varvec{\sigma }_t^{(l)}}{\varvec{\sigma }_{t-1}^{(l)}}\Big )^2-\log \Big (\frac{\varvec{\sigma }_t^{(l)}}{\varvec{\sigma }_{t-1}^{(l)}}\Big )^2 + (\varvec{\sigma }_t^{(l)})^2-\log (\varvec{\sigma }_t^{(l)})^2\Big \}\Big ] \end{aligned}$$
(A2)

where \(\varvec{\sigma }^{(l)}_{\text {init}}\) is the initial standard deviation hyperparameter for all weights on the l-th layer. The matrix \(\mathbf {\Lambda }^{(l)}_{uv}\triangleq \max \Big \{\frac{\varvec{\sigma }_{\text {init}}^{(l)}}{\varvec{\sigma }_{t-1,u}^{(l)}}, \frac{\varvec{\sigma }_{\text {init}}^{(l-1)}}{\varvec{\sigma }_{t-1,v}^{(l-1)}}\Big \}\) defines the regularization strength for the weight \(\varvec{\mu }_{t,uv}^{(l)}\).
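
To make the regularizer concrete, the following PyTorch sketch computes the per-layer KL term of Eq. (A2) under our own naming; the tensors mu_t, mu_prev, sigma_t, sigma_prev and the node-wise standard deviations are assumed to be stored per layer, and the function is an illustration rather than the authors' released implementation.

```python
import torch

def ucl_kl_layer(mu_t, mu_prev, sigma_t, sigma_prev,
                 sigma_node_prev_out, sigma_node_prev_in,
                 sigma_init_l, sigma_init_lm1, beta):
    """Hypothetical per-layer regularized KL term following Eq. (A2).

    mu_t, mu_prev        : (out, in) weight means at tasks t and t-1
    sigma_t, sigma_prev  : (out, in) weight std devs at tasks t and t-1
    sigma_node_prev_out  : (out,) node std devs of layer l at task t-1
    sigma_node_prev_in   : (in,)  node std devs of layer l-1 at task t-1
    sigma_init_l/lm1     : scalar initial std devs of layers l and l-1
    beta                 : weight of the variance-regularization term
    """
    # Lambda_uv = max(sigma_init^(l)/sigma_{t-1,u}^(l), sigma_init^(l-1)/sigma_{t-1,v}^(l-1))
    lam = torch.maximum(
        (sigma_init_l / sigma_node_prev_out).unsqueeze(1),    # (out, 1)
        (sigma_init_lm1 / sigma_node_prev_in).unsqueeze(0),   # (1, in)
    )
    delta = mu_t - mu_prev
    # Quadratic penalty on the change of important weights
    quad = 0.5 * (lam * delta).pow(2).sum()
    # L1 penalty weighted by (mu_{t-1}/sigma_{t-1})^2
    l1 = (sigma_init_l ** 2) * ((mu_prev / sigma_prev) ** 2 * delta).abs().sum()
    # Variance term (sigma_t/sigma_{t-1})^2 - log(...) + sigma_t^2 - log sigma_t^2
    ratio2 = (sigma_t / sigma_prev) ** 2
    var = 0.5 * beta * (ratio2 - ratio2.log()
                        + sigma_t.pow(2) - sigma_t.pow(2).log()).sum()
    return quad + l1 + var
```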

CVD for UCL

We also add the auxiliary local variables to the original model and then maximize the log-likelihood: \(\log p({\textbf{Y}}_t \vert {\textbf{X}}_t) = \sum _{i=1}^{N_t} \log p({\textbf{y}}_t^{(i)} \vert {\textbf{x}}_t^{(i)})\). Since this likelihood is intractable, we use mean-field variational inference with the variational distributions \(q_t({\varvec{\theta }})\) and \(q_t({{\textbf{s}}})\):

$$\begin{aligned}&\sum _{i=1}^{N_t} \log \int _{\varvec{\theta }} \int _{{\textbf{s}}} p({\textbf{y}}_t^{(i)} \vert {\textbf{s}}, \varvec{\theta }, {\textbf{x}}_t^{(i)}) p({\textbf{s}}) p(\varvec{\theta }) d\varvec{\theta } d{\textbf{s}} \nonumber \\&\quad = \sum _{i=1}^{N_t} \log \int _{\varvec{\theta }} \int _{{\textbf{s}}} \frac{p({\textbf{y}}_t^{(i)} \vert {\textbf{s}}, \varvec{\theta }, {\textbf{x}}_t^{(i)}) p({\textbf{s}}) q_{t-1}(\varvec{\theta })}{q_t({\textbf{s}}) q_t(\varvec{\theta })} q_t(\varvec{\theta }) q_t({\textbf{s}}) d\varvec{\theta } d{\textbf{s}} \nonumber \\&\quad \ge \sum _{i=1}^{N_t} E_{q_t({\varvec{\theta }}), q_t({{\textbf{s}}})} \left[ \log p({\textbf{y}}_t^{(i)} \vert {\textbf{s}}, \varvec{\theta }, {\textbf{x}}_t^{(i)}) \right] \nonumber \\&\quad - KL(q_t({\textbf{s}}) \Vert p({\textbf{s}})) - KL(q_t(\varvec{\theta }) \Vert q_{t-1}(\varvec{\theta })) \end{aligned}$$
(A3)

Note that the term \(KL(q_t(\varvec{\theta }) \Vert q_{t-1}(\varvec{\theta }))\) is where CVD for UCL differs from CVD for VCL: while CVD for VCL uses the KL term in Eq. (A1), CVD for UCL uses the KL term in Eq. (A2). The remaining terms are built exactly as in CVD for VCL.
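
As a rough illustration of how the bound in Eq. (A3) can be turned into a per-task training loss, the sketch below draws reparameterized samples of the global variables \(\varvec{\theta }\) and the local variables \({\textbf{s}}\) and combines the negative log-likelihood with the two KL terms. The helpers model(x, theta, s), q_theta, q_prev_theta, q_s, p_s and the way kl_weight scales the KL terms are our own assumptions, not the released CVD code.

```python
import torch.nn.functional as F
from torch.distributions import kl_divergence

def cvd_loss(model, x, y, q_theta, q_prev_theta, q_s, p_s, kl_weight):
    """Minimal sketch of the (negative) bound in Eq. (A3).

    q_theta, q_prev_theta : mean-field distributions over the global weights
                            for the current and the previous task
    q_s, p_s              : distributions over the auxiliary local variables s,
                            used as multiplicative noise on the layer inputs
    model(x, theta, s)    : forward pass with sampled weights and noise (assumed)
    """
    theta = q_theta.rsample()      # reparameterized sample of global variables
    s = q_s.rsample()              # reparameterized sample of local variables
    logits = model(x, theta, s)    # prediction with multiplicative input noise
    nll = F.cross_entropy(logits, y, reduction="sum")

    kl_s = kl_divergence(q_s, p_s).sum()                    # KL(q_t(s) || p(s))
    kl_theta = kl_divergence(q_theta, q_prev_theta).sum()   # KL(q_t(theta) || q_{t-1}(theta))
    # kl_weight plays the role of the KL_weight kappa in Appendix B; applying it
    # to both KL terms here is a simplification of ours.
    return nll + kl_weight * (kl_s + kl_theta)
```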

Appendix B: Architectures and settings

Split MNIST and Permuted MNIST

For Split MNIST, we use a fully connected neural network (FCNN) with two hidden layers and a multi-head output layer; Table 8 shows the details of the network for the Split MNIST dataset. For the Permuted MNIST dataset, we also use an FCNN, but with a single-head output layer; the architecture is shown in Table 9. A sketch of the multi-head layout is given after Table 9.

Table 8 Network architecture for Split MNIST
Table 9 Network architecture for Permuted MNIST
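
The multi-head layout referred to above can be sketched as follows; the hidden width and the task/class counts are illustrative, since Tables 8 and 9 give the exact configurations.

```python
import torch.nn as nn

class MultiHeadFCNN(nn.Module):
    """Two-hidden-layer FCNN with one output head per task (Split MNIST layout).

    The hidden width of 256 is a placeholder; see Table 8 for the actual sizes.
    """
    def __init__(self, in_dim=784, hidden=256, n_tasks=5, classes_per_task=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One linear head per task (multi-head); Permuted MNIST instead uses a
        # single shared head over all 10 classes.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, classes_per_task) for _ in range(n_tasks)]
        )

    def forward(self, x, task_id):
        return self.heads[task_id](self.body(x.flatten(1)))
```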

We tune the hyperparameters of both CVD and the combined methods. Moreover, we again use the parameter-initialization strategy from UCL's experiments (Ahn et al., 2019) at the beginning of the training process. The hyperparameter search grids of all methods are listed below (a sketch of how such grids can be iterated follows the list):

  • UCL—\(\beta :\) {0.0001; 0.001; 0.01; 0.02; 0.03}, \(\alpha :\) {0.01; 0.3; 5}

  • EWC—\(\lambda :\) {40; 400; 4000; 10000; 40000}

  • VCL—not needed

  • GVCL—\(\beta :\) {0.05; 0.1; 0.2}, \(\lambda :\) {1; 10; 100; 1000}

  • CVD—KL_weight \(\kappa\): {0.0001; 0.001; 0.01; 0.1; 1}
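
As mentioned above, one simple way to drive this tuning is to collect the grids into a dictionary and expand them with itertools.product; the values mirror the list above, while the dictionary layout and helper function are our own.

```python
from itertools import product

# Search grids for Split/Permuted MNIST, copied from the list above.
GRIDS = {
    "UCL":  {"beta": [0.0001, 0.001, 0.01, 0.02, 0.03], "alpha": [0.01, 0.3, 5]},
    "EWC":  {"lambda": [40, 400, 4000, 10000, 40000]},
    "GVCL": {"beta": [0.05, 0.1, 0.2], "lambda": [1, 10, 100, 1000]},
    "CVD":  {"kl_weight": [0.0001, 0.001, 0.01, 0.1, 1]},
}

def grid_configs(method):
    """Yield every hyperparameter combination for one method."""
    grid = GRIDS[method]
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

# Example: iterate over all CVD settings.
for cfg in grid_configs("CVD"):
    print(cfg)
```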

Split CIFAR-10/100 and Split CIFAR-100:

The details of the architecture used in the CIFAR experiments are shown in Table 10, and all the hyperparameters are listed below:

  • UCL—\(\beta :\) {0.0001; 0.0002; 0.001; 0.002}, \(\alpha :\) {0.01; 0.3; 5}, r :  {0.5; 0.125}, \(lr(\sigma ):\) {0.01; 0.02}

  • EWC—\(\lambda :\) {400; 1000; 4000; 10000; 25000; 40000}

  • VCL—not needed

  • GVCL—\(\beta :\) {0.05; 0.1; 0.2}, \(\lambda :\) {1; 10; 100; 1000}

  • AGS-CL—\(\lambda :\) {1.5; 100; 400; 1000; 7000; 10000}, \(\mu :\) {0.5; 10; 20}, \(\rho :\) {0.1; 0.2; 0.3; 0.4; 0.5}

  • CVD—KL_weight \(\kappa\): {0.0001; 0.001; 0.01; 0.1; 1}

Table 10 Network architecture for Split CIFAR-10/100

Split Omniglot:

The details of the architecture for the Omniglot dataset are given in Table 11. Since the number of classes differs for each task, we denote the number of classes of the \(i^{th}\) task as \(C_i\). All the hyperparameters are listed below:

  • UCL—\(\beta :\) {0.0001; 0.0002; 0.001; 0.002}, \(\alpha :\) {0.01; 0.3; 5}, r :  {0.5; 0.125}, \(lr(\sigma ):\) {0.01; 0.02}

  • EWC—\(\lambda :\) {4000; 10000; 25000; 40000; 100000}

  • VCL—not needed

  • GVCL—\(\beta :\) {0.05; 0.1; 0.2}, \(\lambda :\) {1; 10; 100; 1000}

  • AGS-CL—\(\lambda :\) {1.5; 100; 400; 1000; 7000; 10000}, \(\mu :\) {0.5; 10; 20}, \(\rho :\) {0.1; 0.2; 0.3; 0.4; 0.5}

  • CVD—KL_weight \(\kappa\): {0.0001; 0.001; 0.01; 0.1; 1}

Table 11 Network architecture for Split Omniglot

Split CUB-200:

We use AlexNet in this experiment; the details of the architecture are given in Table 12. All the hyperparameters are listed below:

  • GVCL—\(\beta :\) {0.05; 0.1; 0.2}, \(\lambda :\) {1; 10; 100; 1000}

  • AGS-CL—\(\lambda :\) {1.5; 100; 400; 1000; 7000; 10000}, \(\mu :\) {0.5; 10; 20}, \(\rho :\) {0.1; 0.2; 0.3; 0.4; 0.5}

  • CVD-KL_weight \(\kappa\): {0.0001; 0.001; 0.01; 0.1; 1}

Table 12 Network architecture for Split CUB-200

Split ImageNet-R:

In this experiment, we freeze the pretrained ViT backbone and add 4 dense layers on top to build the model; the details of the architecture are given in Table 13. All the hyperparameters are listed below (a sketch of this frozen-backbone setup follows Table 13):

  • GVCL—\(\beta :\) {0.001; 0.05; 0.1; 0.2}, \(\lambda :\) {1; 10; 50, 100; 1000}

  • AGS-CL—\(\lambda :\) {1.5; 50; 100; 400; 1000; 7000; 10000}, \(\mu :\) {0.5; 10; 20}, \(\rho :\) {0.1; 0.2; 0.3; 0.4; 0.5}

  • CVD-KL_weight \(\kappa\): {0.0001; 0.001; 0.01; 0.1; 1; 1.5}

Table 13 Network architecture for Split ImageNet-R
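
The frozen-backbone setup referred to above might look as follows; the torchvision vit_b_16 weights and the dense-layer widths are stand-ins of ours, since Table 13 specifies the architecture actually used.

```python
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

def build_imagenet_r_model(n_classes):
    """Sketch of a frozen-ViT model for Split ImageNet-R.

    The choice of vit_b_16 and the 512/256/128 widths are illustrative only.
    """
    backbone = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
    backbone.heads = nn.Identity()        # keep only the 768-d ViT features
    for p in backbone.parameters():       # freeze the pretrained backbone
        p.requires_grad_(False)

    head = nn.Sequential(                 # 4 dense layers on top of the features
        nn.Linear(768, 512), nn.ReLU(),
        nn.Linear(512, 256), nn.ReLU(),
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, n_classes),
    )
    return nn.Sequential(backbone, head)
```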

Appendix C: Supplement visualizations

Figures 15, 16, 17 and 18 are supplementary illustrations for the analysis in Sect. 4.2. As before, the charts show the test accuracy of each task after the model has been trained up to the task indicated on the horizontal axis. As can be seen, CVD allows AGS-CL, VCL, UCL, and GVCL to remain more stable across tasks, whereas the original methods degrade substantially. Based on these findings, CVD not only enhances performance on most tasks but also effectively mitigates the forgetting phenomenon.

Fig. 15 The change of accuracy through tasks on VCL

Fig. 16 The change of accuracy through tasks on UCL

Fig. 17 The change of accuracy through tasks on AGS-CL

Fig. 18 The change of accuracy through tasks on GVCL

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hai, N.L., Nguyen, T., Van, L.N. et al. Continual variational dropout: a view of auxiliary local variables in continual learning. Mach Learn 113, 281–323 (2024). https://doi.org/10.1007/s10994-023-06487-7
