Abstract
The regularization/prior-based approach is one of the key strategies in continual learning, owing to its mechanism for preserving learned knowledge and preventing forgetting. Without any retraining on previous data or extension of the network architecture, it works by constraining the important weights of previous tasks while learning the current task. However, this approach suffers from the weights drifting toward a parameter region in which the model achieves good performance on the latest task but poor performance on earlier ones. To address this problem, we propose a novel solution that continually applies variational dropout (CVD), generating task-specific local variables that act as modifying factors for the global variables to fit each task. In particular, by imposing a variational distribution on auxiliary local variables employed as multiplicative noise on the layers' inputs, the model keeps the global variables in a region that is good for all tasks and reduces the forgetting phenomenon. Furthermore, we obtain theoretical properties that are unavailable in existing methods: (1) uncorrelated likelihoods between different data instances reduce the high variance of stochastic gradient variational Bayes; (2) correlated pre-activations improve the representation ability for each task; and (3) data-dependent regularization keeps the global variables in a good region for all tasks. In extensive experiments, adding the local variables substantially enhances the performance of regularization/prior-based methods on numerous datasets. In particular, it brings several standard baselines close to state-of-the-art results.
Availability of data and materials
Not applicable.
References
Ahn, H., Cha, S., Lee, D., & Moon, T. (2019). Uncertainty-based continual learning with adaptive regularization. In Advances in Neural Information Processing Systems (pp. 4392–4402).
Aljundi, R., Babiloni, F., Elhoseiny, M., Rohrbach, M., & Tuytelaars, T. (2018). Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 139–154).
Bach, T. X., Anh, N. D., Linh, N. V., & Than, K. (2023). Dynamic transformation of prior knowledge into Bayesian models for data streams. IEEE Transactions on Knowledge and Data Engineering, 35(4), 3742–3750.
Benzing, F. (2020). Understanding regularisation methods for continual learning. In Workshop of Advances in Neural Information Processing Systems.
Blundell, C., Cornebise, J., Kavukcuoglu, K., & Wierstra, D. (2015). Weight uncertainty in neural network. In International conference on machine learning (pp. 1613–1622). PMLR.
Boluki, S., Ardywibowo, R., Dadaneh, S. Z., Zhou, M., & Qian, X. (2020). Learnable Bernoulli dropout for bayesian deep learning. In The International Conference on Artificial Intelligence and Statistics, AISTATS (pp. 3905–3916).
Cha, S., Hsu, H., Hwang, T., Calmon, F. P., & Moon, T. (2021). CPR: Classifier-projection regularization for continual learning. In 9th International Conference on Learning Representations, ICLR.
Deng, L. (2012). The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6), 141–142.
De Lange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., & Tuytelaars, T. (2021). A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. In International conference on learning representations
Farquhar, S., & Gal, Y. (2018). A unifying bayesian view of continual learning. In The Bayesian deep learning workshop at neural information processing systems
Gal, Y., Hron, J., & Kendall, A. (2017). Concrete dropout. In Advances in Neural Information Processing Systems (pp. 3581–3590).
Ghahramani, Z., & Attias, H. (2000). Online variational Bayesian learning. In Slides from talk presented at NIPS workshop on online learning.
Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., & Bengio, Y. (2013). An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211
Graves, A. (2011). Practical variational inference for neural networks. In Advances in Neural Information Processing Systems (pp. 2348–2356). Citeseer.
Ha, C., Tran, V.-D., Van, L. N., & Than, K. (2019). Eliminating overfitting of probabilistic topic models on short and noisy text: The role of dropout. International Journal of Approximate Reasoning, 112, 85–104.
Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F., Dorundo, E., Desai, R., Zhu, T., Parajuli, S., Guo, M., Song, D., Steinhardt, J., & Gilmer, J. (2021). The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 8340–8349).
Henning, C., Cervera, M., D’Angelo, F., Von Oswald, J., Traber, R., Ehret, B., Kobayashi, S., Grewe, B. F., & Sacramento, J. (2021). Posterior meta-replay for continual learning. In Advances in neural information processing systems (Vol. 34).
Jung, S., Ahn, H., Cha, S., & Moon, T. (2020). Continual learning with node-importance based adaptive group sparse regularization. In Advances in neural information processing systems
Kingma, D. P., Salimans, T., & Welling, M. (2015). Variational dropout and the local reparameterization trick. Advances in Neural Information Processing Systems, 28, 2575–2583.
Kingma, D. P., & Welling, M. (2014). Auto-encoding variational bayes. In: Bengio, Y., LeCun, Y. (eds.) 2nd international conference on learning representations, ICLR.
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13), 3521–3526.
Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.
Li, Z., & Hoiem, D. (2017). Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12), 2935–2947.
Van Linh, N., Bach, T. X., & Than, K. (2022). A graph convolutional topic model for short and noisy text streams. Neurocomputing, 468, 345–359.
Liu, Y., Dong, W., Zhang, L., Gong, D., & Shi, Q. (2019). Variational bayesian dropout with a hierarchical prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7124–7133).
Loo, N., Swaroop, S., & Turner, R. E. (2021). Generalized variational continual learning. In International conference on learning representation
MacKay, D. J. C. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3), 448–472.
Mirzadeh, S., Farajtabar, M., Pascanu, R., & Ghasemzadeh, H. (2020). Understanding the role of training regimes in continual learning. In Advances in neural information processing systems
Mirzadeh, S. I., Farajtabar, M., & Ghasemzadeh, H. (2020). Dropout as an implicit gating mechanism for continual learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops (pp. 232–233).
Molchanov, D., Ashukha, A., & Vetrov, D. (2017). Variational dropout sparsifies deep neural networks. In International conference on machine learning (pp. 2498–2507).
Murphy, K. P. (2012). Machine learning: A probabilistic perspective. Cambridge: MIT Press.
Neal, R. M. (1996). Bayesian learning for neural networks. Berlin: Springer.
Nguyen, T., Mai, T., Nguyen, N., Van, L. N., & Than, K. (2022b). Balancing stability and plasticity when learning topic models from short and noisy text streams. Neurocomputing, 505, 30–43.
Nguyen, S., Nguyen, D., Nguyen, K., Than, K., Bui, H., & Ho, N. (2021). Structured dropout variational inference for Bayesian neural networks. Advances in Neural Information Processing Systems, 34, 15188–15202.
Nguyen, H., Pham, H., Nguyen, S., Van Linh, N., & Than, K. (2022a). Adaptive infinite dropout for noisy and sparse data streams. Machine Learning, 111(8), 3025–3060.
Nguyen, C. V., Li, Y., Bui, T. D., & Turner, R. E. (2018). Variational continual learning. In International conference on learning representation.
Nguyen, V.-S., Nguyen, D.-T., Van, L.N., & Than, K. (2019). Infinite dropout for training bayesian models from data streams. In IEEE international conference on big data (Big Data) (pp. 125–134). IEEE
Oh, C., Adamczewski, K., & Park, M. (2020). Radial and directional posteriors for Bayesian deep learning. In The thirty-fourth conference on artificial intelligence, AAAI (pp. 5298–5305)
Paisley, J. W., Blei, D. M., & Jordan, M. I. (2012). Variational bayesian inference with stochastic search. In Proceedings of the 29th international conference on machine learning, ICML
Phan, H., Tuan, A. P., Nguyen, S., Linh, N. V., & Than, K. (2022). Reducing catastrophic forgetting in neural networks via Gaussian mixture approximation. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 106–117). Springer: Berlin
Sato, M.-A. (2001). Online model selection based on the variational bayes. Neural Computation, 13(7), 1649–1681.
Shi, G., Chen, J., Zhang, W., Zhan, L.-M., & Wu, X.-M. (2021). Overcoming catastrophic forgetting in incremental few-shot learning by finding flat minima. Advances in Neural Information Processing Systems, 34, 6747–6761.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929–1958.
Swaroop, S., Nguyen, C.V., Bui, T. D., & Turner, R. E. (2018). Improving and understanding variational continual learning. In NeurIPS Continual Learning Workshop.
Swiatkowski, J., Roth, K., Veeling, B., Tran, L., Dillon, J., Snoek, J., Mandt, S., Salimans, T., Jenatton, R., & Nowozin, S. (2020). The k-tied normal distribution: A compact parameterization of Gaussian mean field posteriors in Bayesian neural networks. In International conference on machine learning (pp. 9289–9299). PMLR.
Van, L.N., Hai, N.L., Pham, H., & Than, K. (2022). Auxiliary local variables for improving regularization/prior approach in continual learning. In Pacific-Asia conference on knowledge discovery and data mining (pp. 16–28). Springer: Berlin
Van de Ven, G. M., & Tolias, A. S. (2019). Three scenarios for continual learning. In NeurIPS—Continual learning workshop
Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 dataset.
Wei, C., Kakade, S., & Ma, T. (2020). The implicit and explicit regularization effects of dropout. In International conference on machine learning (pp. 10181–10192). PMLR.
Yin, D., Farajtabar, M., & Li, A. (2020). Sola: Continual learning with second-order loss approximation. In Workshop of advances in neural information processing systems
Zenke, F., Poole, B., & Ganguli, S. (2017). Continual learning through synaptic intelligence. Proceedings of Machine Learning Research, 70, 3987.
Funding
This research has been supported in part by the NSF grant CNS-1747798 to the IUCRC Center for Big Learning and the NSF grant # 2239570.
Author information
Authors and Affiliations
Contributions
The contributions of each author are presented as follows: NLH: Methodology, Software, Validation, Formal analysis, Visualization, Writing—original draft, Investigation. TN: Methodology, Software, Validation, Formal analysis, Writing—original draft, Investigation. LNV: Conceptualization, Methodology, Validation, Formal analysis, Writing—original draft, Visualization, Investigation, Project administration. THN: Methodology, Validation, Formal analysis, Writing—review, Visualization, Supervision. KT: Methodology, Validation, Formal analysis, Writing—review, Supervision, Funding acquisition.
Corresponding author
Ethics declarations
Conflicts of interest
The authors declare that they have no competing interests.
Code availability
The implementation of CVD can be found at https://github.com/nguyenvuthientrang/CVD.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Editor: Gustavo Batista.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
A part of this work appears in Van et al. (2022).
Appendices
Appendix A: Auxiliary local variables for uncertainty regularized continual learning (UCL)
In this section, we review UCL (Ahn et al., 2019), one of the state-of-the-art methods for continual learning, and show how to apply CVD to it. UCL uses the same likelihood term as VCL but reinterprets and improves the KL term of VCL. The KL term is rewritten as follows:
where l is the layer index of the neural network. UCL improves VCL by defining node importance and then adding two regularization terms. Based on the node importance, the first term limits the change of weights connected to important nodes, while the other makes weights more active in learning new tasks. In detail, UCL constrains the standard deviations of all weights connecting to the same node u in layer l to have the same value \(\varvec{\sigma }_{u}^{(l)}\) and then uses this parameter to measure node importance. Moreover, it modifies the KL term to freeze the weights related to important nodes:
where \(\varvec{\sigma }^{(l)}_{\text {init}}\) is the initial standard deviation hyperparameter for all weights on the l-th layer. The matrix \(\mathbf {\Lambda }^{(l)}_{uv}\triangleq \max \Big \{\frac{\varvec{\sigma }_{\text {init}}^{(l)}}{\varvec{\sigma }_{t-1,u}^{(l)}}, \frac{\varvec{\sigma }_{\text {init}}^{(l-1)}}{\varvec{\sigma }_{t-1,v}^{(l-1)}}\Big \}\) defines the regularization strength for the weight \(\varvec{\mu }_{t,uv}^{(l)}\).
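For concreteness, the per-weight strengths \(\mathbf {\Lambda }^{(l)}_{uv}\) follow directly from the node-wise standard deviations via the max rule above; the minimal sketch below is ours (function and argument names are illustrative, not taken from the UCL implementation):

```python
def reg_strength(sigma_out, sigma_in, sigma_init_out, sigma_init_in):
    """Per-weight regularization strengths for weights between layer l-1 and l:
    Lambda[u][v] = max(sigma_init_out / sigma_out[u], sigma_init_in / sigma_in[v]),
    where sigma_out/sigma_in are the previous task's node std devs."""
    return [[max(sigma_init_out / s_u, sigma_init_in / s_v)
             for s_v in sigma_in]
            for s_u in sigma_out]
```

A weight touching an important node (small std dev on either endpoint) thus receives a large strength and is effectively frozen.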
CVD for UCL
We also add auxiliary variables to the original model and then maximize the log likelihood: \(\log p({\textbf{Y}}_t \vert {\textbf{X}}_t) = \sum _{i=1}^{N_t} \log p({\textbf{y}}_t^{(i)} \vert {\textbf{x}}_t^{(i)})\). Due to the intractability of the likelihood, we use mean-field variational inference with variational distributions \(q_t({\varvec{\theta }}), q_t({{\textbf{s}}})\):
Note that the KL term \(KL(q_t(\varvec{\theta }) \Vert q_{t-1}(\varvec{\theta }))\) is where CVD for VCL and CVD for UCL differ: CVD for VCL uses the KL term in Eq. (A1), while CVD for UCL uses the one in Eq. (A2). The remaining terms are built as in CVD for VCL.
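The role of the auxiliary local variables can be sketched as multiplicative Gaussian noise applied to a layer's input before the linear map. The sketch below is illustrative only; the noise parameterization s ~ N(1, alpha) and all names are our assumptions, not the paper's code:

```python
import math
import random

def noisy_linear(x, weights, bias, alpha):
    """Scale each input coordinate by a task-specific multiplicative noise
    s ~ N(1, alpha) (the auxiliary local variable), then apply a linear map
    y_u = sum_v W[u][v] * (s_v * x_v) + b_u.  With alpha = 0 this reduces
    to an ordinary linear layer."""
    noisy = [x_v * (1.0 + math.sqrt(alpha) * random.gauss(0.0, 1.0)) for x_v in x]
    return [sum(w * n for w, n in zip(row, noisy)) + b
            for row, b in zip(weights, bias)]
```

In CVD the distribution of this noise is learned variationally per task, so the shared (global) weights need not move to fit each new task.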
Appendix B: Architectures and settings
Split MNIST and permuted MNIST
For Split MNIST, we use a fully-connected neural network (FCNN) with two hidden layers and a multi-head output layer. Table 8 shows the details of the network for the Split MNIST dataset. For the Permuted MNIST dataset, we also use an FCNN but with a single-head output layer; the architecture is shown in Table 9.
We tune the hyperparameters of both CVD and the combined methods. Moreover, we reuse the parameter-initialization strategy from UCL's experiments (Ahn et al., 2019) at the beginning of training. The hyperparameters of all methods are listed below:
- UCL—\(\beta :\) {0.0001; 0.001; 0.01; 0.02; 0.03}, \(\alpha :\) {0.01; 0.3; 5}
- EWC—\(\lambda :\) {40; 400; 4000; 10000; 40000}
- VCL—not needed
- GVCL—\(\beta :\) {0.05; 0.1; 0.2}, \(\lambda :\) {1; 10; 100; 1000}
- CVD—KL_weight \(\kappa\): {0.0001; 0.001; 0.01; 0.1; 1}
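For every method, the tuning above amounts to a grid search on validation accuracy over the listed values; a generic sketch (here `train_and_eval` is a placeholder for training a model with a given \(\kappa\) and returning its validation accuracy):

```python
def grid_search(train_and_eval, kappas=(0.0001, 0.001, 0.01, 0.1, 1.0)):
    """Return the KL weight kappa (and its score) that maximizes the
    validation accuracy reported by train_and_eval(kappa)."""
    best_kappa, best_acc = None, float("-inf")
    for kappa in kappas:
        acc = train_and_eval(kappa)
        if acc > best_acc:
            best_kappa, best_acc = kappa, acc
    return best_kappa, best_acc
```

The same loop applies to the other methods' grids (e.g., \(\beta\) and \(\lambda\) for GVCL) by iterating over the cross product of their value sets.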
Split CIFAR-10/100 and Split CIFAR-100:
The details of the architecture used in the CIFAR experiments are shown in Table 10, and the hyperparameters are listed below:
- UCL—\(\beta :\) {0.0001; 0.0002; 0.001; 0.002}, \(\alpha :\) {0.01; 0.3; 5}, r : {0.5; 0.125}, \(lr(\sigma ):\) {0.01; 0.02}
- EWC—\(\lambda :\) {400; 1000; 4000; 10000; 25000; 40000}
- VCL—not needed
- GVCL—\(\beta :\) {0.05; 0.1; 0.2}, \(\lambda :\) {1; 10; 100; 1000}
- AGS-CL—\(\lambda :\) {1.5; 100; 400; 1000; 7000; 10000}, \(\mu :\) {0.5; 10; 20}, \(\rho :\) {0.1; 0.2; 0.3; 0.4; 0.5}
- CVD—KL_weight \(\kappa\): {0.0001; 0.001; 0.01; 0.1; 1}
Split Omniglot:
The details of the architecture for the Omniglot dataset are given in Table 11. Since the number of classes differs across tasks, we denote the number of classes of the \(i^{th}\) task as \(C_i\). The hyperparameters are listed below:
- UCL—\(\beta :\) {0.0001; 0.0002; 0.001; 0.002}, \(\alpha :\) {0.01; 0.3; 5}, r : {0.5; 0.125}, \(lr(\sigma ):\) {0.01; 0.02}
- EWC—\(\lambda :\) {4000; 10000; 25000; 40000; 100000}
- VCL—not needed
- GVCL—\(\beta :\) {0.05; 0.1; 0.2}, \(\lambda :\) {1; 10; 100; 1000}
- AGS-CL—\(\lambda :\) {1.5; 100; 400; 1000; 7000; 10000}, \(\mu :\) {0.5; 10; 20}, \(\rho :\) {0.1; 0.2; 0.3; 0.4; 0.5}
- CVD—KL_weight \(\kappa\): {0.0001; 0.001; 0.01; 0.1; 1}
Split CUB-200:
We use AlexNet in this experiment; the details of the architecture are given in Table 12. The hyperparameters are listed below:
- GVCL—\(\beta :\) {0.05; 0.1; 0.2}, \(\lambda :\) {1; 10; 100; 1000}
- AGS-CL—\(\lambda :\) {1.5; 100; 400; 1000; 7000; 10000}, \(\mu :\) {0.5; 10; 20}, \(\rho :\) {0.1; 0.2; 0.3; 0.4; 0.5}
- CVD—KL_weight \(\kappa\): {0.0001; 0.001; 0.01; 0.1; 1}
Split ImageNet-R:
In this experiment, we freeze the pretrained ViT backbone and add four dense layers to build the model; the details of the architecture are given in Table 13. The hyperparameters are listed below:
- GVCL—\(\beta :\) {0.001; 0.05; 0.1; 0.2}, \(\lambda :\) {1; 10; 50; 100; 1000}
- AGS-CL—\(\lambda :\) {1.5; 50; 100; 400; 1000; 7000; 10000}, \(\mu :\) {0.5; 10; 20}, \(\rho :\) {0.1; 0.2; 0.3; 0.4; 0.5}
- CVD—KL_weight \(\kappa\): {0.0001; 0.001; 0.01; 0.1; 1; 1.5}
Appendix C: Supplement visualizations
Figures 15, 16, 17 and 18 supplement the analysis in Sect. 4.2. As before, each chart shows the test accuracy of a task as a function of the task the model was trained on (horizontal axis). CVD allows AGS-CL, VCL, UCL, and GVCL to remain more stable across tasks, whereas the original methods perform substantially worse. These findings indicate that CVD not only enhances performance on most tasks but also effectively mitigates forgetting.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hai, N.L., Nguyen, T., Van, L.N. et al. Continual variational dropout: a view of auxiliary local variables in continual learning. Mach Learn 113, 281–323 (2024). https://doi.org/10.1007/s10994-023-06487-7
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10994-023-06487-7