Abstract
Massive neural network models are often preferred over smaller models for their more favorable optimization landscapes during training. However, since the cost of evaluating a model grows with its size, it is desirable to obtain an equivalent compressed neural network model before deploying it for prediction. The best-studied tools for compressing neural networks produce models with broadly similar architectures, including the same depth as the original model. No guarantees have been available for obtaining compressed models with substantially reduced depth. In this paper, we present fundamental obstacles to any algorithm achieving depth compression of neural networks. In particular, we show that depth compression is as hard as learning the input distribution, ruling out guarantees for most existing approaches. Furthermore, even when the input distribution is of a known, simple form, we show that there are no local algorithms for depth compression.
Notes
1. In typical applications, f itself will be an approximation of some concept g known only through labeled examples, and the real goal is to find an approximation of g in \(\mathcal{H}\). To simplify our discussion, we will not attempt to find a compression of f that approximates this concept g better than f does itself, and so g can be safely ignored; a sketch of the resulting setup appears after these notes.
2. “Improper” in the sense of not belonging to the hypothesis class \(\mathcal{H}\).
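The compression goal in note 1 can be written as a one-line objective. This is a minimal sketch with assumed notation not fixed above: an input distribution \(\mathcal{D}\), a trained network f to be compressed, an accuracy parameter \(\varepsilon > 0\), and a hypothesis class \(\mathcal{H}\) of networks of substantially smaller depth than f:
\[
\text{find } h \in \mathcal{H} \quad \text{such that} \quad \mathbb{E}_{x \sim \mathcal{D}}\!\left[\bigl(h(x) - f(x)\bigr)^{2}\right] \le \varepsilon .
\]
On this reading, the abstract's hardness claim is that no algorithm can guarantee such an h without, in effect, learning \(\mathcal{D}\) itself.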
Cite this paper
Burstein, W., Wilmes, J. (2020). Obstacles to Depth Compression of Neural Networks. In: Farkaš, I., Masulli, P., Wermter, S. (eds.) Artificial Neural Networks and Machine Learning – ICANN 2020. Lecture Notes in Computer Science, vol. 12397. Springer, Cham. https://doi.org/10.1007/978-3-030-61616-8_9