
DDAM: Data Distribution-Aware Mapping of CNNs on Processing-In-Memory Systems

Published: 19 March 2023

Abstract

Convolutional neural networks (CNNs) are widely used in image processing, natural language processing, and many other fields. The large volume of memory accesses generated by CNNs is a major concern in CNN accelerator design, as it directly affects performance and energy efficiency. With fast, low-cost memory access, Processing-In-Memory (PIM) systems are a feasible way to alleviate this memory concern. However, the distributed manner in which data are stored across PIM nodes conflicts with the extensive data reuse of CNN layers: nodes may need to share their data with one another before processing a layer, incurring extra communication overhead. In this article, we propose DDAM, which maps CNNs onto PIM systems while reducing this communication overhead. First, a data transfer strategy is proposed to handle the data-sharing requirement among PIM nodes by formulating it as a Traveling Salesman Problem (TSP). To improve data locality, a dynamic programming algorithm is proposed to partition the CNN and allocate a number of nodes to each part. Finally, an integer linear programming (ILP)-based mapping algorithm is proposed to map the partitioned CNN onto the PIM system. Experimental results show that, compared to the baselines, DDAM achieves 2.0× higher throughput while reducing energy cost by 37% on average.
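To make the partitioning step above concrete, the sketch below shows, in spirit only, how a dynamic program can split a chain of CNN layers into consecutive groups and allocate PIM nodes to each group so that the bottleneck group's cost is minimized. This is not the authors' implementation: the function names (`group_cost`, `partition_layers`) and the toy cost model are illustrative assumptions, not DDAM's actual formulation.

```python
# Hypothetical sketch of layer-chain partitioning by dynamic programming.
# The cost model is a stand-in for whatever compute/communication cost
# the paper's formulation actually optimizes.

def group_cost(layer_work, nodes):
    """Toy cost: the group's work is split across its nodes, plus a small
    per-node overhead standing in for PIM data-sharing cost."""
    return sum(layer_work) / nodes + 0.1 * nodes

def partition_layers(layer_work, total_nodes):
    """Split layers [0..L) into consecutive groups and allocate nodes so
    the slowest (bottleneck) group is as fast as possible."""
    L = len(layer_work)
    INF = float("inf")
    # best[i][n] = minimal bottleneck cost covering layers [0..i) with n nodes
    best = [[INF] * (total_nodes + 1) for _ in range(L + 1)]
    choice = {}
    best[0][0] = 0.0
    for i in range(1, L + 1):
        for n in range(1, total_nodes + 1):
            for j in range(i):              # previous cut position
                for m in range(1, n + 1):   # nodes given to layers [j..i)
                    if best[j][n - m] == INF:
                        continue
                    cost = max(best[j][n - m],
                               group_cost(layer_work[j:i], m))
                    if cost < best[i][n]:
                        best[i][n] = cost
                        choice[(i, n)] = (j, m)
    # Reconstruct the chosen partition and node allocation.
    groups, i, n = [], L, total_nodes
    while i > 0:
        j, m = choice[(i, n)]
        groups.append((list(range(j, i)), m))
        i, n = j, n - m
    return best[L][total_nodes], list(reversed(groups))

if __name__ == "__main__":
    # Eight layers with made-up work estimates, 16 PIM nodes to distribute.
    cost, groups = partition_layers([4, 8, 8, 6, 3, 3, 2, 1], 16)
    print(cost, groups)
```

The same skeleton could be extended with a TSP-style ordering of inter-node transfers (the paper's first step) by replacing the per-node overhead term with a route-dependent communication estimate.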



Published in

ACM Transactions on Design Automation of Electronic Systems, Volume 28, Issue 3
May 2023, 456 pages
ISSN: 1084-4309
EISSN: 1557-7309
DOI: 10.1145/3587887

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 19 March 2023
      • Online AM: 15 December 2022
      • Accepted: 30 November 2022
      • Revised: 13 September 2022
      • Received: 13 May 2022
Published in TODAES Volume 28, Issue 3


      Qualifiers

      • research-article
