Abstract
The regularization/prior-based approach is one of the key strategies in continual learning, owing to its mechanism for preserving learned knowledge and preventing forgetting. Without any retraining on previous data or extension of the network architecture, it works by constraining the important weights of previous tasks while learning the current task. However, this approach suffers from the weights drifting toward a parameter region in which the model achieves good performance on the latest task but poor performance on earlier ones. To address this problem, we propose a novel solution that continually applies variational dropout (CVD), generating task-specific local variables that act as modifying factors for the global variables to fit each task. In particular, by imposing a variational distribution on auxiliary local variables employed as multiplicative noise on the layers' inputs, the model keeps the global variables in a region that is good for all tasks and reduces the forgetting phenomenon. Furthermore, we obtain theoretical properties that are unavailable in existing methods: (1) uncorrelated likelihoods between different data instances reduce the high variance of stochastic gradient variational Bayes; (2) correlated pre-activations improve the representation ability for each task; and (3) data-dependent regularization keeps the global variables in a good region for all tasks. In extensive experiments, adding the local variables substantially enhances the performance of regularization/prior-based methods on numerous datasets. In particular, it brings several standard baselines close to state-of-the-art results.
Availability of data and materials
Not applicable.
References
Ahn, H., Cha, S., Lee, D., & Moon, T. (2019). Uncertainty-based continual learning with adaptive regularization. In Advances in Neural Information Processing Systems (pp. 4392–4402).
Aljundi, R., Babiloni, F., Elhoseiny, M., Rohrbach, M., & Tuytelaars, T. (2018). Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 139–154).
Bach, T. X., Anh, N. D., Linh, N. V., & Than, K. (2023). Dynamic transformation of prior knowledge into Bayesian models for data streams. IEEE Transactions on Knowledge and Data Engineering, 35(4), 3742–3750.
Benzing, F. (2020). Understanding regularisation methods for continual learning. In Workshop of Advances in Neural Information Processing Systems.
Blundell, C., Cornebise, J., Kavukcuoglu, K., & Wierstra, D. (2015). Weight uncertainty in neural network. In International conference on machine learning (pp. 1613–1622). PMLR.
Boluki, S., Ardywibowo, R., Dadaneh, S. Z., Zhou, M., & Qian, X. (2020). Learnable Bernoulli dropout for bayesian deep learning. In The International Conference on Artificial Intelligence and Statistics, AISTATS (pp. 3905–3916).
Cha, S., Hsu, H., Hwang, T., Calmon, F. P., & Moon, T. (2021). CPR: Classifier-projection regularization for continual learning. In 9th International Conference on Learning Representations, ICLR.
Deng, L. (2012). The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6), 141–142.
De Lange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., & Tuytelaars, T. (2021). A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. In International conference on learning representations
Farquhar, S., & Gal, Y. (2018). A unifying bayesian view of continual learning. In The Bayesian deep learning workshop at neural information processing systems
Gal, Y., Hron, J., & Kendall, A. (2017). Concrete dropout. In Advances in Neural Information Processing Systems (pp. 3581–3590).
Ghahramani, Z., & Attias, H. (2000). Online variational Bayesian learning. In Slides from talk presented at NIPS workshop on online learning.
Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., & Bengio, Y. (2013). An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211
Graves, A. (2011). Practical variational inference for neural networks. In Advances in Neural Information Processing Systems (pp. 2348–2356). Citeseer.
Ha, C., Tran, V.-D., Van, L. N., & Than, K. (2019). Eliminating overfitting of probabilistic topic models on short and noisy text: The role of dropout. International Journal of Approximate Reasoning, 112, 85–104.
Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F., Dorundo, E., Desai, R., Zhu, T., Parajuli, S., Guo, M., Song, D., Steinhardt, J., & Gilmer, J. (2021). The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 8340–8349).
Henning, C., Cervera, M., D’Angelo, F., Von Oswald, J., Traber, R., Ehret, B., Kobayashi, S., Grewe, B. F., & Sacramento, J. (2021). Posterior meta-replay for continual learning. In Advances in neural information processing systems (Vol. 34).
Jung, S., Ahn, H., Cha, S., & Moon, T. (2020). Continual learning with node-importance based adaptive group sparse regularization. In Advances in neural information processing systems
Kingma, D. P., Salimans, T., & Welling, M. (2015). Variational dropout and the local reparameterization trick. Advances in Neural Information Processing Systems, 28, 2575–2583.
Kingma, D. P., & Welling, M. (2014). Auto-encoding variational bayes. In: Bengio, Y., LeCun, Y. (eds.) 2nd international conference on learning representations, ICLR.
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13), 3521–3526.
Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.
Li, Z., & Hoiem, D. (2017). Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12), 2935–2947.
Van Linh, N., Bach, T. X., & Than, K. (2022). A graph convolutional topic model for short and noisy text streams. Neurocomputing, 468, 345–359.
Liu, Y., Dong, W., Zhang, L., Gong, D., & Shi, Q. (2019). Variational bayesian dropout with a hierarchical prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7124–7133).
Loo, N., Swaroop, S., & Turner, R. E. (2021). Generalized variational continual learning. In International conference on learning representation
MacKay, D. J. C. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3), 448–472.
Mirzadeh, S., Farajtabar, M., Pascanu, R., & Ghasemzadeh, H. (2020). Understanding the role of training regimes in continual learning. In Advances in neural information processing systems
Mirzadeh, S. I., Farajtabar, M., & Ghasemzadeh, H. (2020). Dropout as an implicit gating mechanism for continual learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops (pp. 232–233).
Molchanov, D., Ashukha, A., & Vetrov, D. (2017). Variational dropout sparsifies deep neural networks. In International conference on machine learning (pp. 2498–2507).
Murphy, K. P. (2012). Machine learning: A probabilistic perspective. Cambridge: MIT Press.
Neal, R. M. (1996). Bayesian learning for neural networks. Berlin: Springer.
Nguyen, T., Mai, T., Nguyen, N., Van, L. N., & Than, K. (2022b). Balancing stability and plasticity when learning topic models from short and noisy text streams. Neurocomputing, 505, 30–43.
Nguyen, S., Nguyen, D., Nguyen, K., Than, K., Bui, H., & Ho, N. (2021). Structured dropout variational inference for Bayesian neural networks. Advances in Neural Information Processing Systems, 34, 15188–15202.
Nguyen, H., Pham, H., Nguyen, S., Van Linh, N., & Than, K. (2022a). Adaptive infinite dropout for noisy and sparse data streams. Machine Learning, 111(8), 3025–3060.
Nguyen, C. V., Li, Y., Bui, T. D., & Turner, R. E. (2018). Variational continual learning. In International conference on learning representation.
Nguyen, V.-S., Nguyen, D.-T., Van, L.N., & Than, K. (2019). Infinite dropout for training bayesian models from data streams. In IEEE international conference on big data (Big Data) (pp. 125–134). IEEE
Oh, C., Adamczewski, K., & Park, M. (2020). Radial and directional posteriors for Bayesian deep learning. In The thirty-fourth conference on artificial intelligence, AAAI (pp. 5298–5305)
Paisley, J. W., Blei, D. M., & Jordan, M. I. (2012). Variational bayesian inference with stochastic search. In Proceedings of the 29th international conference on machine learning, ICML
Phan, H., Tuan, A. P., Nguyen, S., Linh, N. V., & Than, K. (2022). Reducing catastrophic forgetting in neural networks via Gaussian mixture approximation. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 106–117). Springer: Berlin
Sato, M.-A. (2001). Online model selection based on the variational bayes. Neural Computation, 13(7), 1649–1681.
Shi, G., Chen, J., Zhang, W., Zhan, L.-M., & Wu, X.-M. (2021). Overcoming catastrophic forgetting in incremental few-shot learning by finding flat minima. Advances in Neural Information Processing Systems, 34, 6747–6761.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929–1958.
Swaroop, S., Nguyen, C.V., Bui, T. D., & Turner, R. E. (2018). Improving and understanding variational continual learning. In NeurIPS Continual Learning Workshop.
Swiatkowski, J., Roth, K., Veeling, B., Tran, L., Dillon, J., Snoek, J., Mandt, S., Salimans, T., Jenatton, R., & Nowozin, S. (2020). The k-tied normal distribution: A compact parameterization of Gaussian mean field posteriors in Bayesian neural networks. In International conference on machine learning (pp. 9289–9299). PMLR.
Van, L.N., Hai, N.L., Pham, H., & Than, K. (2022). Auxiliary local variables for improving regularization/prior approach in continual learning. In Pacific-Asia conference on knowledge discovery and data mining (pp. 16–28). Springer: Berlin
Van de Ven, G. M., & Tolias, A. S. (2019). Three scenarios for continual learning. In NeurIPS—Continual learning workshop
Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 dataset.
Wei, C., Kakade, S., & Ma, T. (2020). The implicit and explicit regularization effects of dropout. In International conference on machine learning (pp. 10181–10192). PMLR.
Yin, D., Farajtabar, M., & Li, A. (2020). Sola: Continual learning with second-order loss approximation. In Workshop of advances in neural information processing systems
Zenke, F., Poole, B., & Ganguli, S. (2017). Continual learning through synaptic intelligence. Proceedings of Machine Learning Research, 70, 3987.
Funding
This research has been supported in part by the NSF grant CNS-1747798 to the IUCRC Center for Big Learning and the NSF grant # 2239570.
Author information
Authors and Affiliations
Contributions
The contributions of each author are presented as follows: NLH: Methodology, Software, Validation, Formal analysis, Visualization, Writing—original draft, Investigation. TN: Methodology, Software, Validation, Formal analysis, Writing—original draft, Investigation. LNV: Conceptualization, Methodology, Validation, Formal analysis, Writing—original draft, Visualization, Investigation, Project administration. THN: Methodology, Validation, Formal analysis, Writing—review, Visualization, Supervision. KT: Methodology, Validation, Formal analysis, Writing—review, Supervision, Funding acquisition.
Corresponding author
Ethics declarations
Conflicts of interest
The authors declare that they have no competing interests.
Code availability
The implementation of CVD can be found at https://github.com/nguyenvuthientrang/CVD.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Editor: Gustavo Batista.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
A part of this work appears in Van et al. (2022).
Appendices
Appendix A: Auxiliary local variables for uncertainty regularized continual learning (UCL)
In this section, we review UCL (Ahn et al., 2019), one of the state-of-the-art methods for continual learning, and show how to apply CVD to it. UCL uses the same likelihood term as VCL but reinterprets and improves the KL term of VCL. The KL term is rewritten as follows:
where l is the layer index of the neural network. UCL improves VCL by defining node importance and then adding two regularization terms. Based on the node importance, the first term limits the change of weights connected to important nodes, while the other makes weights more active in learning new tasks. In detail, UCL constrains the standard deviations of all weights connecting to the same node u in layer l to have the same value \(\varvec{\sigma }_{u}^{(l)}\) and then uses this parameter to measure node importance. Moreover, it modifies the KL term to freeze the weights related to important nodes:
where \(\varvec{\sigma }^{(l)}_{\text {init}}\) is the initial standard deviation hyperparameter for all weights on the l-th layer. The matrix \(\mathbf {\Lambda }^{(l)}_{uv}\triangleq \max \Big \{\frac{\varvec{\sigma }_{\text {init}}^{(l)}}{\varvec{\sigma }_{t-1,u}^{(l)}}, \frac{\varvec{\sigma }_{\text {init}}^{(l-1)}}{\varvec{\sigma }_{t-1,v}^{(l-1)}}\Big \}\) defines the regularization strength for the weight \(\varvec{\mu }_{t,uv}^{(l)}\).
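For concreteness, the per-weight strengths \(\mathbf {\Lambda }^{(l)}_{uv}\) follow directly from the node-wise standard deviations via the max rule above; the minimal sketch below is ours (function and argument names are illustrative, not taken from the UCL implementation):

```python
def reg_strength(sigma_out, sigma_in, sigma_init_out, sigma_init_in):
    """Per-weight regularization strengths for weights between layer l-1 and l:
    Lambda[u][v] = max(sigma_init_out / sigma_out[u], sigma_init_in / sigma_in[v]),
    where sigma_out/sigma_in are the previous task's node std devs."""
    return [[max(sigma_init_out / s_u, sigma_init_in / s_v)
             for s_v in sigma_in]
            for s_u in sigma_out]
```

A weight touching an important node (small std dev on either endpoint) thus receives a large strength and is effectively frozen.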
CVD for UCL
We also add auxiliary variables to the original model and then maximize the log likelihood: \(\log p({\textbf{Y}}_t \vert {\textbf{X}}_t) = \sum _{i=1}^{N_t} \log p({\textbf{y}}_t^{(i)} \vert {\textbf{x}}_t^{(i)})\). Due to the intractability of the likelihood, we use mean-field variational inference with variational distributions \(q_t({\varvec{\theta }}), q_t({{\textbf{s}}})\):
Note that the KL term \(KL(q_t(\varvec{\theta }) \Vert q_{t-1}(\varvec{\theta }))\) is where CVD for VCL and CVD for UCL differ: CVD for VCL uses the KL term in Eq. (A1), while CVD for UCL uses the one in Eq. (A2). The remaining terms are built as in CVD for VCL.
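The role of the auxiliary local variables can be sketched as multiplicative Gaussian noise applied to a layer's input before the linear map. The sketch below is illustrative only; the noise parameterization s ~ N(1, alpha) and all names are our assumptions, not the paper's code:

```python
import math
import random

def noisy_linear(x, weights, bias, alpha):
    """Scale each input coordinate by a task-specific multiplicative noise
    s ~ N(1, alpha) (the auxiliary local variable), then apply a linear map
    y_u = sum_v W[u][v] * (s_v * x_v) + b_u.  With alpha = 0 this reduces
    to an ordinary linear layer."""
    noisy = [x_v * (1.0 + math.sqrt(alpha) * random.gauss(0.0, 1.0)) for x_v in x]
    return [sum(w * n for w, n in zip(row, noisy)) + b
            for row, b in zip(weights, bias)]
```

In CVD the distribution of this noise is learned variationally per task, so the shared (global) weights need not move to fit each new task.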
Appendix B: Architectures and settings
Split MNIST and permuted MNIST
For Split MNIST, we use a fully-connected neural network (FCNN) with two hidden layers and a multi-head output layer. Table 8 shows the details of the network for the Split MNIST dataset. For the Permuted MNIST dataset, we also use an FCNN but with a single-head output layer; the architecture is shown in Table 9.
We tune the hyperparameters of both CVD and the combined methods. Moreover, we reuse the parameter-initialization strategy from UCL's experiments (Ahn et al., 2019) at the beginning of training. The hyperparameters of all methods are listed below:
- UCL—\(\beta :\) {0.0001; 0.001; 0.01; 0.02; 0.03}, \(\alpha :\) {0.01; 0.3; 5}
- EWC—\(\lambda :\) {40; 400; 4000; 10000; 40000}
- VCL—not needed
- GVCL—\(\beta :\) {0.05; 0.1; 0.2}, \(\lambda :\) {1; 10; 100; 1000}
- CVD—KL_weight \(\kappa\): {0.0001; 0.001; 0.01; 0.1; 1}
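For every method, the tuning above amounts to a grid search on validation accuracy over the listed values; a generic sketch (here `train_and_eval` is a placeholder for training a model with a given \(\kappa\) and returning its validation accuracy):

```python
def grid_search(train_and_eval, kappas=(0.0001, 0.001, 0.01, 0.1, 1.0)):
    """Return the KL weight kappa (and its score) that maximizes the
    validation accuracy reported by train_and_eval(kappa)."""
    best_kappa, best_acc = None, float("-inf")
    for kappa in kappas:
        acc = train_and_eval(kappa)
        if acc > best_acc:
            best_kappa, best_acc = kappa, acc
    return best_kappa, best_acc
```

The same loop applies to the other methods' grids (e.g., \(\beta\) and \(\lambda\) for GVCL) by iterating over the cross product of their value sets.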
Split CIFAR-10/100 and Split CIFAR-100:
The details of the architecture used in the CIFAR experiments are shown in Table 10, and the hyperparameters are listed below:
- UCL—\(\beta :\) {0.0001; 0.0002; 0.001; 0.002}, \(\alpha :\) {0.01; 0.3; 5}, r : {0.5; 0.125}, \(lr(\sigma ):\) {0.01; 0.02}
- EWC—\(\lambda :\) {400; 1000; 4000; 10000; 25000; 40000}
- VCL—not needed
- GVCL—\(\beta :\) {0.05; 0.1; 0.2}, \(\lambda :\) {1; 10; 100; 1000}
- AGS-CL—\(\lambda :\) {1.5; 100; 400; 1000; 7000; 10000}, \(\mu :\) {0.5; 10; 20}, \(\rho :\) {0.1; 0.2; 0.3; 0.4; 0.5}
- CVD—KL_weight \(\kappa\): {0.0001; 0.001; 0.01; 0.1; 1}
Split Omniglot:
The details of the architecture for the Omniglot dataset are given in Table 11. Since the number of classes differs across tasks, we denote the number of classes of the \(i^{th}\) task as \(C_i\). The hyperparameters are listed below:
- UCL—\(\beta :\) {0.0001; 0.0002; 0.001; 0.002}, \(\alpha :\) {0.01; 0.3; 5}, r : {0.5; 0.125}, \(lr(\sigma ):\) {0.01; 0.02}
- EWC—\(\lambda :\) {4000; 10000; 25000; 40000; 100000}
- VCL—not needed
- GVCL—\(\beta :\) {0.05; 0.1; 0.2}, \(\lambda :\) {1; 10; 100; 1000}
- AGS-CL—\(\lambda :\) {1.5; 100; 400; 1000; 7000; 10000}, \(\mu :\) {0.5; 10; 20}, \(\rho :\) {0.1; 0.2; 0.3; 0.4; 0.5}
- CVD—KL_weight \(\kappa\): {0.0001; 0.001; 0.01; 0.1; 1}
Split CUB-200:
We use AlexNet in this experiment; the details of the architecture are given in Table 12. The hyperparameters are listed below:
- GVCL—\(\beta :\) {0.05; 0.1; 0.2}, \(\lambda :\) {1; 10; 100; 1000}
- AGS-CL—\(\lambda :\) {1.5; 100; 400; 1000; 7000; 10000}, \(\mu :\) {0.5; 10; 20}, \(\rho :\) {0.1; 0.2; 0.3; 0.4; 0.5}
- CVD—KL_weight \(\kappa\): {0.0001; 0.001; 0.01; 0.1; 1}
Split ImageNet-R:
In this experiment, we freeze the pretrained ViT backbone and add four dense layers to build the model; the details of the architecture are given in Table 13. The hyperparameters are listed below:
- GVCL—\(\beta :\) {0.001; 0.05; 0.1; 0.2}, \(\lambda :\) {1; 10; 50; 100; 1000}
- AGS-CL—\(\lambda :\) {1.5; 50; 100; 400; 1000; 7000; 10000}, \(\mu :\) {0.5; 10; 20}, \(\rho :\) {0.1; 0.2; 0.3; 0.4; 0.5}
- CVD—KL_weight \(\kappa\): {0.0001; 0.001; 0.01; 0.1; 1; 1.5}
Appendix C: Supplement visualizations
Figures 15, 16, 17 and 18 supplement the analysis in Sect. 4.2. As before, each chart shows the test accuracy of a task as a function of the task the model was trained on (horizontal axis). CVD allows AGS-CL, VCL, UCL, and GVCL to remain more stable across tasks, whereas the original methods perform substantially worse. These findings indicate that CVD not only enhances performance on most tasks but also effectively mitigates forgetting.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hai, N.L., Nguyen, T., Van, L.N. et al. Continual variational dropout: a view of auxiliary local variables in continual learning. Mach Learn 113, 281–323 (2024). https://doi.org/10.1007/s10994-023-06487-7
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10994-023-06487-7