Abstract
We study the problem of reducing the complexity of approximating models and consider methods based on the distillation of deep learning models. The concepts of trainer and student models are introduced; the student model is assumed to have fewer parameters than the trainer model. A Bayesian approach to student model selection is suggested: a method is proposed for assigning the prior distribution of the student parameters on the basis of the posterior distribution of the trainer model parameters. Since the trainer and student parameter spaces do not coincide, we propose a mechanism that reduces the trainer model parameter space to the student model parameter space by changing the trainer model structure. A theoretical analysis of the proposed reduction mechanism is carried out. Computational experiments are performed on synthetic data and on real data, with the FashionMNIST sample used as the real dataset.
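The pipeline the abstract describes (fit a trainer model, approximate the posterior over its parameters, reduce that posterior to the student's smaller parameter space, and use the reduced distribution as the student's prior) can be sketched on a toy logistic model. This is a minimal illustrative sketch, not the paper's method: the diagonal Laplace approximation of the trainer posterior and the variance-based choice of which trainer parameters to keep are assumptions made here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data (illustrative, not from the paper).
n, d_trainer, d_student = 200, 10, 4
X = rng.normal(size=(n, d_trainer))
w_true = rng.normal(size=d_trainer)
y = (X @ w_true > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 1. Fit the trainer (a logistic model) and approximate its posterior by a
#    diagonal Gaussian via the Laplace approximation at the optimum.
w_t = np.zeros(d_trainer)
for _ in range(500):
    p = sigmoid(X @ w_t)
    w_t -= 0.1 * (X.T @ (p - y) / n + 1e-2 * w_t)
p = sigmoid(X @ w_t)
hess_diag = (X**2 * (p * (1 - p))[:, None]).sum(axis=0) + 1e-2
post_var = 1.0 / hess_diag  # diagonal posterior variances

# 2. Reduce the trainer parameter space to the student dimension: keep the
#    parameters with the lowest posterior variance (treated here as most
#    relevant -- an assumption standing in for the paper's reduction).
keep = np.argsort(post_var)[:d_student]
prior_mean, prior_var = w_t[keep], post_var[keep]
X_s = X[:, keep]

# 3. Train the student with the reduced posterior as its Gaussian prior:
#    loss = cross-entropy + 0.5 * (w - prior_mean)^2 / prior_var.
w_s = np.zeros(d_student)
for _ in range(500):
    p_s = sigmoid(X_s @ w_s)
    grad = X_s.T @ (p_s - y) / n + (w_s - prior_mean) / prior_var / n
    w_s -= 0.1 * grad

acc = ((sigmoid(X_s @ w_s) > 0.5) == y).mean()
print(f"student accuracy with trainer-derived prior: {acc:.2f}")
```

The student here has fewer parameters than the trainer yet starts from an informed prior rather than a generic one, which is the core idea of the Bayesian assignment the abstract proposes.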
Funding
This paper contains results of the project “Mathematical Methods for Big Data Mining” carried out as part of the implementation of the Competence Center Program of the National Technological Initiative “Big Data Storage and Analysis Center” supported by the Ministry of Science and Higher Education of the Russian Federation under the Agreement of Lomonosov Moscow State University with the Fund for Support of Projects of the National Technology Initiative of December 11, 2018, no. 13/1251/2018. This work was supported by the Russian Foundation for Basic Research, projects nos. 19-07-01155 and 19-07-00875.
Additional information
Translated by V. Potapchouck
Cite this article
Grabovoy, A.V., Strijov, V.V. Bayesian Distillation of Deep Learning Models. Autom Remote Control 82, 1846–1856 (2021). https://doi.org/10.1134/S0005117921110023