
Tetris: A Heuristic Static Memory Management Framework for Uniform Memory Multicore Neural Network Accelerators

  • Regular Paper
  • Published:
Journal of Computer Science and Technology

Abstract

Uniform memory multicore neural network accelerators (UNNAs) provide enormous computing power for emerging neural network applications. Meanwhile, as neural network architectures grow deeper and wider, limited memory capacity has become a constraint on deploying models on UNNA platforms. Efficiently managing memory space and reducing workload footprints are therefore urgent and significant problems. In this paper, we propose Tetris, a heuristic static memory management framework for UNNA platforms. Tetris reconstructs the execution flows and synchronization relationships among cores to analyze each tensor's liveness interval. The memory management problem is then converted into a sequence permutation problem, and Tetris uses a genetic algorithm to explore the permutation space, optimizing the memory management strategy and reducing memory footprints. We evaluate several typical neural networks, and the experimental results demonstrate that Tetris outperforms state-of-the-art memory allocation methods, achieving average memory reduction ratios of 91.9% and 87.9% on a quad-core and a 16-core Cambricon-X platform, respectively.
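To make the idea concrete, the Python sketch below illustrates the scheme the abstract describes: each tensor has a liveness interval, a candidate permutation determines the order in which tensors are greedily assigned offsets, and an evolutionary search over permutations tries to minimize peak memory. This is a minimal illustration under assumed simplifications, not the paper's implementation: the tensor names, sizes, intervals, the greedy placement rule, and the swap-mutation-only search loop are hypothetical stand-ins for Tetris's actual liveness analysis and genetic algorithm.

import random

class Tensor:
    def __init__(self, name, size, start, end):
        self.name = name    # identifier (hypothetical)
        self.size = size    # buffer size in bytes
        self.start = start  # first step of the liveness interval
        self.end = end      # last step of the liveness interval

def overlaps(a, b):
    # Two tensors conflict only if their liveness intervals intersect.
    return a.start <= b.end and b.start <= a.end

def place(order):
    # Greedy placement: walk tensors in the given order and push each one
    # upward past every already-placed tensor it conflicts with in both
    # time (liveness) and address space. Returns the resulting peak memory.
    placed = []  # list of (tensor, offset) pairs
    peak = 0
    for t in order:
        offset = 0
        for other, o in sorted(placed, key=lambda p: p[1]):
            same_time = overlaps(t, other)
            same_space = not (offset + t.size <= o or offset >= o + other.size)
            if same_time and same_space:
                offset = o + other.size
        placed.append((t, offset))
        peak = max(peak, offset + t.size)
    return peak

def evolve(tensors, pop_size=30, generations=200, seed=0):
    # Simplified evolutionary search over permutations (swap mutations only),
    # standing in for the genetic algorithm described in the abstract.
    rng = random.Random(seed)
    population = [rng.sample(tensors, len(tensors)) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=place)              # smaller peak memory is fitter
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            child = list(rng.choice(survivors))
            i, j = rng.sample(range(len(child)), 2)
            child[i], child[j] = child[j], child[i]
            children.append(child)
        population = survivors + children
    best = min(population, key=place)
    return best, place(best)

if __name__ == "__main__":
    # Hypothetical per-core trace: sizes and liveness intervals are made up.
    trace = [
        Tensor("conv1_out", 512, 0, 2),
        Tensor("conv2_out", 256, 1, 3),
        Tensor("fc_in", 256, 2, 4),
        Tensor("fc_out", 128, 3, 5),
    ]
    order, peak = evolve(trace)
    print("best order:", [t.name for t in order], "| peak bytes:", peak)

In the paper's setting, the liveness intervals would come from the reconstructed multicore execution flows and synchronization relationships rather than from a hand-written trace, and the genetic operators are those of the Tetris framework itself.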



Author information


Corresponding author

Correspondence to Tian Zhi.

Supplementary Information

ESM 1

(PDF 107 kb)


About this article


Cite this article

Chen, XB., Qi, H., Peng, SH. et al. Tetris: A Heuristic Static Memory Management Framework for Uniform Memory Multicore Neural Network Accelerators. J. Comput. Sci. Technol. 37, 1255–1270 (2022). https://doi.org/10.1007/s11390-021-1213-3


  • DOI: https://doi.org/10.1007/s11390-021-1213-3
