Abstract
We study the problem of reducing the complexity of approximating models and consider methods based on the distillation of deep learning models. The concepts of trainer and student models are introduced; the student model is assumed to have fewer parameters than the trainer model. A Bayesian approach to student model selection is suggested: a method is proposed for assigning the prior distribution of the student parameters on the basis of the posterior distribution of the trainer model parameters. Since the trainer and student parameter spaces do not coincide, we propose a mechanism that reduces the trainer model parameter space to the student model parameter space by changing the trainer model structure. A theoretical analysis of the proposed reduction mechanism is carried out. Computational experiments are performed on synthetic data and on real data, with the FashionMNIST sample used as the real dataset.
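The pipeline the abstract describes (fit a trainer model, approximate the posterior over its parameters, reduce that posterior to the student's smaller parameter space, and use the reduced distribution as the student's prior) can be sketched on a toy logistic model. This is a minimal illustrative sketch, not the paper's method: the diagonal Laplace approximation of the trainer posterior and the variance-based choice of which trainer parameters to keep are assumptions made here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data (illustrative, not from the paper).
n, d_trainer, d_student = 200, 10, 4
X = rng.normal(size=(n, d_trainer))
w_true = rng.normal(size=d_trainer)
y = (X @ w_true > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 1. Fit the trainer (a logistic model) and approximate its posterior by a
#    diagonal Gaussian via the Laplace approximation at the optimum.
w_t = np.zeros(d_trainer)
for _ in range(500):
    p = sigmoid(X @ w_t)
    w_t -= 0.1 * (X.T @ (p - y) / n + 1e-2 * w_t)
p = sigmoid(X @ w_t)
hess_diag = (X**2 * (p * (1 - p))[:, None]).sum(axis=0) + 1e-2
post_var = 1.0 / hess_diag  # diagonal posterior variances

# 2. Reduce the trainer parameter space to the student dimension: keep the
#    parameters with the lowest posterior variance (treated here as most
#    relevant -- an assumption standing in for the paper's reduction).
keep = np.argsort(post_var)[:d_student]
prior_mean, prior_var = w_t[keep], post_var[keep]
X_s = X[:, keep]

# 3. Train the student with the reduced posterior as its Gaussian prior:
#    loss = cross-entropy + 0.5 * (w - prior_mean)^2 / prior_var.
w_s = np.zeros(d_student)
for _ in range(500):
    p_s = sigmoid(X_s @ w_s)
    grad = X_s.T @ (p_s - y) / n + (w_s - prior_mean) / prior_var / n
    w_s -= 0.1 * grad

acc = ((sigmoid(X_s @ w_s) > 0.5) == y).mean()
print(f"student accuracy with trainer-derived prior: {acc:.2f}")
```

The student here has fewer parameters than the trainer yet starts from an informed prior rather than a generic one, which is the core idea of the Bayesian assignment the abstract proposes.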
Funding
This paper contains results of the project “Mathematical Methods for Big Data Mining” carried out as part of the implementation of the Competence Center Program of the National Technological Initiative “Big Data Storage and Analysis Center” supported by the Ministry of Science and Higher Education of the Russian Federation under the Agreement of Lomonosov Moscow State University with the Fund for Support of Projects of the National Technology Initiative of December 11, 2018, no. 13/1251/2018. This work was supported by the Russian Foundation for Basic Research, projects nos. 19-07-01155 and 19-07-00875.
Additional information
Translated by V. Potapchouck
Cite this article
Grabovoy, A.V., Strijov, V.V. Bayesian Distillation of Deep Learning Models. Autom Remote Control 82, 1846–1856 (2021). https://doi.org/10.1134/S0005117921110023