Abstract
The article deals with methods for reducing the complexity of approximating models. A probabilistic justification of the distillation and privileged learning methods is proposed. General conclusions are given for an arbitrary parametric function with a predetermined structure, and the theoretical basis is demonstrated for the special cases of linear and logistic regression. The proposed models are analyzed in a computational experiment on synthetic samples and on real data; the FashionMNIST and Twitter Sentiment Analysis datasets are used as the real data.
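For the logistic-regression special case mentioned above, the distillation setting can be sketched as follows: a teacher model is fitted to hard labels, its logits are softened by a temperature, and a student of the same parametric structure is trained against those soft targets instead of the labels. The sketch below is a minimal illustration under assumed choices (synthetic data, plain gradient descent, temperature T = 2); it is not the article's exact experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data (a stand-in for the synthetic samples).
n, d = 500, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true + 0.5 * rng.normal(size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, targets, lr=0.1, steps=2000):
    """Gradient descent on cross-entropy against (possibly soft) targets."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - targets) / len(targets)
    return w

# Teacher: ordinary logistic regression fitted to the hard labels.
w_teacher = fit_logreg(X, y)

# Soft targets: teacher logits softened by a temperature T > 1.
T = 2.0
soft_targets = sigmoid((X @ w_teacher) / T)

# Student: same model class, trained on the teacher's soft targets.
w_student = fit_logreg(X, soft_targets)

acc = np.mean((sigmoid(X @ w_student) > 0.5) == y)
print(f"student train accuracy: {acc:.3f}")
```

Because the soft targets carry the teacher's confidence rather than only the class label, the student's loss surface is smoothed, which is the mechanism the probabilistic interpretation formalizes.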
Funding
This article contains the results of the project “Mathematical Methods for Mining Big Data,” carried out as part of the implementation of the Competence Center Program of the National Technology Initiative “Big Data Storage and Analysis Center,” supported by the Ministry of Science and Higher Education of the Russian Federation under agreement no. 13/1251/2018 of December 11, 2018 between Lomonosov Moscow State University and the Foundation for Support of Projects of the National Technology Initiative. This work was supported by the Russian Foundation for Basic Research, projects nos. 19-07-01155, 19-07-00875, and 19-07-00885.
Translated by V. Potapchouck
Cite this article
Grabovoy, A.V., Strijov, V.V. Probabilistic Interpretation of the Distillation Problem. Autom Remote Control 83, 123–137 (2022). https://doi.org/10.1134/S000511792201009X