Abstract
Convolutional neural networks (CNNs) are widely used in image processing, natural language processing, and many other fields. The large volume of memory accesses in CNNs is a major concern in CNN accelerator design, as it affects both performance and energy efficiency. With fast and low-cost memory access, Processing-In-Memory (PIM) systems are a feasible way to alleviate the memory concerns of CNNs. However, the distributed manner in which data are stored in PIM systems conflicts with the large amount of data reuse in CNN layers: nodes of a PIM system may need to share data with each other before processing a CNN layer, which introduces extra communication overhead. In this article, we propose DDAM, which maps CNNs onto PIM systems while reducing this communication overhead. First, a data transfer strategy is proposed to handle the data sharing requirement among PIM nodes by formulating it as a traveling salesman problem (TSP). Second, to improve data locality, a dynamic programming algorithm is proposed to partition the CNN and allocate a number of nodes to each part. Finally, an integer linear programming (ILP)-based mapping algorithm is proposed to map the partitioned CNN onto the PIM system. Experimental results show that, compared to the baselines, DDAM achieves 2.0× higher throughput while reducing energy cost by 37% on average.
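To illustrate the partition-and-allocate step in the abstract, the following is a minimal, hypothetical sketch (not the paper's actual DDAM algorithm): it uses dynamic programming to split a chain of CNN layers into contiguous groups and to assign a number of PIM nodes to each group, minimizing the slowest group's time, which bounds pipeline throughput. The layer costs, node budget, and ideal-speedup model are illustrative assumptions only.

```python
# Hypothetical sketch of DP-based layer partitioning and node allocation.
# Not the DDAM algorithm from the paper; costs and speedup model are assumed.

from functools import lru_cache

layer_cost = [4.0, 8.0, 6.0, 2.0, 3.0]   # assumed per-layer work (arbitrary units)
total_nodes = 8                           # assumed PIM node budget

def group_time(lo, hi, nodes):
    """Time for layers lo..hi-1 when spread over `nodes` nodes (ideal speedup assumed)."""
    return sum(layer_cost[lo:hi]) / nodes

@lru_cache(maxsize=None)
def best(lo, nodes_left):
    """Minimum achievable bottleneck time for layers lo.. onward using nodes_left nodes."""
    if lo == len(layer_cost):
        return 0.0
    result = float("inf")
    for hi in range(lo + 1, len(layer_cost) + 1):            # end of the next group
        # Reserve at least one node per remaining layer so later groups stay feasible.
        for n in range(1, nodes_left - (len(layer_cost) - hi) + 1):
            t = max(group_time(lo, hi, n), best(hi, nodes_left - n))
            result = min(result, t)
    return result

print(best(0, total_nodes))   # bottleneck time of the best partition/allocation
```

In the paper's full flow, a step like this would be preceded by the TSP-based data transfer ordering among nodes and followed by the ILP-based placement of the partitioned CNN onto physical PIM nodes.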