research-article

Fusing In-storage and Near-storage Acceleration of Convolutional Neural Networks

Published: 14 November 2023

Abstract

Video analytics has a wide range of applications and has attracted much interest over the years. Although video analytics is both computationally and energy-intensive, it can benefit greatly from in- and near-memory compute. Moving compute closer to memory continues to improve performance and energy consumption and is seeing increasing adoption. Recent solid-state drives (SSDs) incorporate near-memory Field Programmable Gate Arrays (FPGAs) with shared access to the drive’s storage cells. These near-memory FPGAs can run operations required by video analytics pipelines, such as object detection and template matching, which are typically executed using Convolutional Neural Networks (CNNs). A CNN is composed of multiple individually processed layers that perform various image-processing tasks. When resources are limited, a layer may be partitioned into more manageable sub-layers. These sub-layers are then processed sequentially; however, some of them can be processed simultaneously. Moreover, the storage cells within FPGA-equipped SSDs can be augmented with in-storage compute to accelerate CNN workloads and exploit the parallelism within a CNN layer. To this end, we present our work, which leverages heterogeneous architectures to create an in/near-storage acceleration solution for video analytics. We designed a NAND flash accelerator and an FPGA accelerator, then mapped and evaluated several CNN benchmarks. We show how to utilize FPGAs, local DRAMs, and in-storage SSD compute to accelerate CNN workloads. Our work also demonstrates how to remove unnecessary memory transfers to save latency and energy.
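The abstract describes partitioning a CNN layer into sub-layers, some of which can be processed simultaneously. As a rough illustration of that idea (a minimal sketch, not the paper's implementation; all function names here are hypothetical), the following Python splits one convolution layer's kernels into channel-wise sub-layers and processes them concurrently. Because each output feature map depends only on the shared input, the sub-layers are independent and the fused result matches the sequential one.

```python
# Sketch: intra-layer parallelism in one CNN convolution layer.
# Output channels are independent, so kernels can be partitioned into
# "sub-layers" and dispatched concurrently (here to threads, standing in
# for independent in/near-storage compute units).
from concurrent.futures import ThreadPoolExecutor


def conv2d_single(inp, kernel):
    """Valid 2D convolution of one input feature map with one kernel."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(inp) - kh + 1, len(inp[0]) - kw + 1
    return [[sum(inp[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(ow)] for i in range(oh)]


def conv_layer(inp, kernels):
    """Sequential baseline: one output feature map per kernel."""
    return [conv2d_single(inp, k) for k in kernels]


def partitioned_conv_layer(inp, kernels, parts):
    """Split the layer's kernels into `parts` sub-layers and run them
    concurrently; concatenating the per-sub-layer outputs reproduces
    the sequential layer's result."""
    chunk = max(1, len(kernels) // parts)
    sublayers = [kernels[i:i + chunk] for i in range(0, len(kernels), chunk)]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        results = pool.map(lambda ks: conv_layer(inp, ks), sublayers)
    # map() preserves submission order, so channel order is unchanged.
    return [fmap for sub in results for fmap in sub]
```

In a real in/near-storage design the sub-layers would be mapped to NAND-level and FPGA-level compute rather than host threads, but the correctness argument is the same: channel-wise partitions of a convolution layer share inputs and write disjoint outputs.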



• Published in

  ACM Journal on Emerging Technologies in Computing Systems, Volume 20, Issue 1 (January 2024), 104 pages
  ISSN: 1550-4832
  EISSN: 1550-4840
  DOI: 10.1145/3613494
  • Editor: Ramesh Karri

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 14 November 2023
        • Online AM: 17 June 2023
        • Accepted: 31 March 2023
        • Revised: 3 November 2022
        • Received: 28 March 2022
