Abstract
Once deployed in the field, Deep Neural Networks (DNNs) run on devices with widely different compute capabilities and whose computational load varies over time. Dynamic network architectures are one of the existing techniques developed to handle this varying computational load in real-time deployments. Here we introduce LeAF (Legacy Augmentation for Flexible inference), a novel paradigm that augments the key phases of a pre-trained DNN with alternative, trainable, shallow phases that can be executed in place of the original ones. At run time, LeAF allows the network architecture to change without any computational overhead, to effectively handle different loads. LeAF-ResNet50 has a storage overhead of less than 14% with respect to the legacy DNN; its accuracy ranges from the original 76.1% down to 64.8% while requiring from 4 down to 0.68 GFLOPs, in line with state-of-the-art results obtained with non-legacy and less flexible methods. We examine how LeAF's dynamic routing strategy affects accuracy and the use of the available computational resources as a function of the compute capability and load of the device, with particular attention to the case of an unpredictable batch size. We show that the optimal configurations for a given network can indeed vary with the system metric (such as latency or FLOPs), the batch size, and the compute capability of the machine.
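The mechanism described in the abstract can be pictured with a short PyTorch sketch: each augmented phase holds the frozen legacy stage together with a trainable shallow alternative, and a plain flag set before inference decides which path runs, so no gating computation is added to the forward pass. The class name `LeAFPhase`, the single conv-BN-ReLU shallow path, and the `use_shallow` flag are illustrative assumptions rather than the paper's actual implementation.

```python
import torch
import torch.nn as nn

class LeAFPhase(nn.Module):
    """One network phase with a legacy path and a shallow alternative.

    The legacy path is a stage of the pre-trained network, kept frozen; the
    shallow path is a small trainable replacement. Which path runs is fixed
    by a plain attribute before inference, so routing adds no per-sample
    compute at run time.
    """

    def __init__(self, legacy_phase: nn.Module, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.legacy = legacy_phase                 # pre-trained stage, frozen
        for p in self.legacy.parameters():
            p.requires_grad = False
        # Hypothetical shallow alternative: a single strided conv block.
        self.shallow = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.use_shallow = False                   # set once, before inference

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shallow(x) if self.use_shallow else self.legacy(x)
```

Selecting a configuration then amounts to setting `use_shallow` on the desired phases before running the model, which is why switching architectures carries no run-time overhead in this scheme.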
Notes
1. We note that if the number of configurations becomes large, it may be more efficient to randomly sample the configuration for each mini-batch (see the training sketch after these notes).
2. The compute cost can be derived analytically in the case of FLOPs, or measured experimentally in the case of latency, energy, or power consumption.
3. When the compute cost is measured in FLOPs, the batch size is normalized away. Other system cost metrics (e.g., latency) may depend on the batch size, as detailed in the Results section (see the timing sketch after these notes).
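A minimal sketch of how the per-mini-batch sampling mentioned in note 1 could look during training, assuming phases that expose a `use_shallow` flag as in the earlier sketch. The helper name `train_with_random_configs`, the plain cross-entropy loss, and the SGD settings are assumptions for illustration, not the training recipe used in the paper.

```python
import random
import torch
import torch.nn as nn

def train_with_random_configs(model, phases, configs, loader, epochs=1, lr=1e-3):
    """Train the shallow phases while sampling one configuration per mini-batch.

    `phases`  - list of LeAF-style modules exposing a `use_shallow` flag.
    `configs` - list of boolean tuples, one flag per phase; sampled uniformly
                at random here instead of being enumerated round-robin (note 1).
    """
    criterion = nn.CrossEntropyLoss()
    trainable = [p for p in model.parameters() if p.requires_grad]  # shallow phases only
    optimizer = torch.optim.SGD(trainable, lr=lr, momentum=0.9)

    for _ in range(epochs):
        for images, labels in loader:
            cfg = random.choice(configs)          # one random configuration per mini-batch
            for phase, use_shallow in zip(phases, cfg):
                phase.use_shallow = use_shallow
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```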
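Notes 2 and 3 distinguish costs that can be derived analytically (FLOPs, which are independent of the batch size once normalized per sample) from costs that must be measured on the target device, such as latency. The small timing sketch below illustrates the latter; the function name, batch sizes, and warm-up and iteration counts are arbitrary choices for illustration.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model, input_shape=(3, 224, 224), batch_sizes=(1, 8, 32), warmup=10, iters=50):
    """Measure average per-batch latency for several batch sizes on the current device.

    FLOPs per image do not depend on the batch size, but wall-clock latency
    does (notes 2-3), so it has to be measured on the target machine.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    results = {}
    for bs in batch_sizes:
        x = torch.randn(bs, *input_shape, device=device)
        for _ in range(warmup):                    # warm-up to exclude one-time costs
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        results[bs] = (time.perf_counter() - start) / iters * 1e3  # ms per batch
    return results
```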
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Clemons, J., Frosio, I., Shen, M., Alvarez, J.M., Keckler, S. (2023). Augmenting Legacy Networks for Flexible Inference. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol 13807. Springer, Cham. https://doi.org/10.1007/978-3-031-25082-8_6