
Bayesian Distillation of Deep Learning Models

Automation and Remote Control

Abstract

We study the problem of reducing the complexity of approximating models and consider methods based on the distillation of deep learning models. The concepts of a trainer model and a student model are introduced; it is assumed that the student model has fewer parameters than the trainer model. A Bayesian approach to student model selection is suggested. A method is proposed for assigning an a priori distribution of the student parameters based on the a posteriori distribution of the trainer model parameters. Since the trainer and student parameter spaces do not coincide, we propose a mechanism that reduces the trainer model parameter space to the student model parameter space by changing the trainer model structure. A theoretical analysis of the proposed reduction mechanism is carried out, and computational experiments are performed on synthetic and real data; the FashionMNIST sample is used as the real data.
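The abstract describes the pipeline only at a high level. The code below is a minimal sketch of that pipeline under strong simplifying assumptions, not the authors' implementation: the trainer's parameter posterior is approximated by a diagonal Gaussian whose variances come from a squared-gradient curvature proxy, the reduction to the student parameter space keeps the hidden units with the largest posterior relevance, and the reduced distribution enters student training as a quadratic (MAP-style) prior penalty. The networks, the relevance score, and the helper names diag_posterior and reduce_posterior are illustrative choices rather than the paper's notation.

```python
# Illustrative sketch of Bayesian distillation: a reduced trainer posterior
# serves as the student prior. Assumptions: diagonal Gaussian posterior,
# squared-gradient curvature proxy, relevance-based unit selection.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Synthetic binary classification data.
X = torch.randn(512, 20)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).long()

# Trainer: wider network; student: narrower network with fewer parameters.
trainer = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
student = nn.Sequential(nn.Linear(20, 8), nn.ReLU(), nn.Linear(8, 2))

# 1) Fit the trainer (maximum likelihood here, as a stand-in for a Bayesian fit).
opt = torch.optim.Adam(trainer.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    F.cross_entropy(trainer(X), y).backward()
    opt.step()

# 2) Diagonal Gaussian approximation of the trainer posterior:
#    means = fitted weights, variances = inverse of accumulated squared gradients.
def diag_posterior(model):
    sq_grads = [torch.zeros_like(p) for p in model.parameters()]
    for i in range(0, len(X), 64):
        model.zero_grad()
        F.cross_entropy(model(X[i:i + 64]), y[i:i + 64]).backward()
        for acc, p in zip(sq_grads, model.parameters()):
            acc += p.grad ** 2
    means = [p.detach().clone() for p in model.parameters()]
    variances = [1.0 / (g + 1e-3) for g in sq_grads]
    return means, variances

mu, var = diag_posterior(trainer)  # parameter order: [W1, b1, W2, b2]

# 3) Reduce the trainer posterior to the student parameter space by keeping the
#    hidden units with the largest posterior relevance |mean| / std.
def reduce_posterior(mu, var, keep_hidden):
    relevance = (mu[2].abs() / var[2].sqrt()).sum(dim=0)  # one score per hidden unit
    idx = relevance.topk(keep_hidden).indices
    return [
        (mu[0][idx], var[0][idx]),        # first-layer weights of kept units
        (mu[1][idx], var[1][idx]),        # first-layer biases of kept units
        (mu[2][:, idx], var[2][:, idx]),  # output weights from kept units
        (mu[3], var[3]),                  # output biases
    ]

prior = reduce_posterior(mu, var, keep_hidden=8)

# 4) Train the student with a data term plus a quadratic penalty toward the
#    reduced posterior, i.e. the MAP counterpart of a Gaussian prior.
opt = torch.optim.Adam(student.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = F.cross_entropy(student(X), y)
    for p, (m, v) in zip(student.parameters(), prior):
        loss = loss + 0.5 * ((p - m) ** 2 / v).sum() / len(X)
    loss.backward()
    opt.step()

print("student accuracy:", (student(X).argmax(dim=1) == y).float().mean().item())
```

In the paper the reduction is obtained by changing the trainer model structure and is analyzed theoretically; the unit-selection rule above merely stands in for that mechanism to keep the example self-contained.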

REFERENCES

  1. Krizhevsky, A., Sutskever, I., and Hinton, G., ImageNet classification with deep convolutional neural networks, Proc. 25th Int. Conf. Neural Inf. Process. Syst. (2012), vol. 1, pp. 1097–1105.

  2. Simonyan, K. and Zisserman, A., Very deep convolutional networks for large-scale image recognition, Int. Conf. Learn. Representations (San Diego, 2015).

  3. He, K., Ren, S., Sun, J., and Zhang, X., Deep residual learning for image recognition, Proc. IEEE Conf. Comput. Vision Pattern Recognit. (Las Vegas, 2016), pp. 770–778.

  4. Devlin, J., Chang, M., Lee, K., and Toutanova, K., BERT: pre-training of deep bidirectional transformers for language understanding, Proc. 2019 Conf. North Am. Ch. Assoc. Comput. Linguist.: Hum. Lang. Technol. (Minnesota, 2019), vol. 1, pp. 4171–4186.

  5. Vaswani, A., Gomez, A., Jones, L., Kaiser, L., Parmar, N., Polosukhin, I., Shazeer, N., and Uszkoreit, J., Attention is all you need, in Advances in Neural Information Processing Systems, 2017, vol. 5, pp. 6000–6010.

  6. Al-Rfou, R., Barua, A., Constant, N., Kale, M., Raffel, C., Roberts, A., Siddhant, A., and Xue, L., mT5: a massively multilingual pre-trained text-to-text transformer, Proc. 2021 Conf. North Am. Ch. Assoc. Comput. Linguist.: Hum. Lang. Technol. (2021), pp. 483–498.

  7. Brown, T. et al., GPT3: language models are few-shot learners, in Advances in Neural Information Processing Systems, 2020, vol. 33, pp. 1877–1901.

  8. Zheng, T., Liu, X., Qin, Z., and Ren, K., Adversarial attacks and defenses in deep learning, Engineering, 2020, vol. 6, pp. 346–360.

  9. Hinton, G., Dean, J., and Vinyals, O., Distilling the knowledge in a neural network, NIPS Deep Learn. Representation Learn. Workshop (2015).

  10. Vapnik, V. and Izmailov, R., Learning using privileged information: similarity control and knowledge transfer, J. Mach. Learn. Res., 2015, vol. 16, pp. 2023–2049.

  11. Lopez-Paz, D., Bottou, L., Scholkopf, B., and Vapnik, V., Unifying distillation and privileged information, Int. Conf. Learn. Representations (Puerto Rico, 2016).

  12. Burges, C., Cortes, C., and LeCun, Y., The MNIST Dataset of Handwritten Digits, 1998. http://yann.lecun.com/exdb/mnist/index.html.

  13. Huang, Z. and Wang, N., Like what you like: knowledge distill via neuron selectivity transfer, arXiv Preprint, 2019.

  14. Hinton, G., Krizhevsky, A., and Nair, V., CIFAR-10 (Canadian Institute for Advanced Research). http://www.cs.toronto.edu/~kriz/cifar.html.

  15. Deng, J. et al., Imagenet: a large-scale hierarchical image database, Proc. IEEE Conf. Comput. Vision Pattern Recognit. (Miami, 2009), pp. 248–255.

  16. LeCun, Y., Denker, J., and Solla, S., Optimal brain damage, Advances in Neural Information Processing Systems, 1989, vol. 2, pp. 598–605.

  17. Graves, A., Practical variational inference for neural networks, Advances in Neural Information Processing Systems, 2011, vol. 24, pp. 2348–2356.

  18. Grabovoy, A.V., Bakhteev, O.Y., and Strijov, V.V., Estimation of relevance for neural network parameters, Inf. Appl., 2019, vol. 13, no. 2, pp. 62–70.

  19. Rasul, K., Vollgraf, R., and Xiao, H., Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, arXiv Preprint, 2017.

Funding

This paper contains results of the project “Mathematical Methods for Big Data Mining” carried out as part of the implementation of the Competence Center Program of the National Technological Initiative “Big Data Storage and Analysis Center” supported by the Ministry of Science and Higher Education of the Russian Federation under the Agreement of Lomonosov Moscow State University with the Fund for Support of Projects of the National Technology Initiative of December 11, 2018, no. 13/1251/2018. This work was supported by the Russian Foundation for Basic Research, projects nos. 19-07-01155 and 19-07-00875.

Author information

Corresponding authors

Correspondence to A. V. Grabovoy or V. V. Strijov.

Additional information

Translated by V. Potapchouck

About this article

Cite this article

Grabovoy, A.V., Strijov, V.V. Bayesian Distillation of Deep Learning Models. Autom Remote Control 82, 1846–1856 (2021). https://doi.org/10.1134/S0005117921110023

