Abstract
Video analytics has a wide range of applications and has attracted much interest over the years. While it can be both computationally and energy intensive, video analytics can benefit greatly from in/near-memory compute. Moving compute closer to memory continues to improve performance and energy consumption and is seeing increasing adoption. Recent solid-state drives (SSDs) incorporate near-memory Field Programmable Gate Arrays (FPGAs) with shared access to the drive's storage cells. These near-memory FPGAs can run operations required by video analytics pipelines, such as object detection and template matching, which are typically implemented with Convolutional Neural Networks (CNNs). A CNN is composed of multiple individually processed layers that perform various image processing tasks. When resources are insufficient to process a layer at once, the layer may be partitioned into more manageable sub-layers. These sub-layers are typically processed sequentially; however, some of them can be processed simultaneously. Moreover, the storage cells within FPGA-equipped SSDs can be augmented with in-storage compute to accelerate CNN workloads and exploit the parallelism within a CNN layer. To this end, we present our work, which leverages heterogeneous architectures to create an in/near-storage acceleration solution for video analytics. We designed a NAND flash accelerator and an FPGA accelerator, then mapped and evaluated several CNN benchmarks. We show how to utilize FPGAs, local DRAMs, and in-memory SSD compute to accelerate CNN workloads. Our work also demonstrates how to remove unnecessary memory transfers to save latency and energy.
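To make the layer-partitioning idea concrete, the following is a minimal sketch (not the paper's implementation): it splits one convolution layer's filters into sub-layers along the output-channel dimension. Because each sub-layer depends only on the shared input feature map, the sub-layers can be processed sequentially on a single unit or simultaneously across units (e.g., FPGA kernels and in-storage compute). The function names `conv2d` and `run_partitioned` are illustrative assumptions, not names from this work.

```python
import numpy as np

def conv2d(inp, filters):
    """Naive valid convolution: inp (C,H,W), filters (K,C,R,S) -> (K,H-R+1,W-S+1)."""
    C, H, W = inp.shape
    K, _, R, S = filters.shape
    out = np.zeros((K, H - R + 1, W - S + 1))
    for k in range(K):
        for i in range(H - R + 1):
            for j in range(W - S + 1):
                out[k, i, j] = np.sum(inp[:, i:i + R, j:j + S] * filters[k])
    return out

def run_partitioned(inp, filters, num_sublayers):
    """Split the K output channels into sub-layers; each chunk is independent."""
    chunks = np.array_split(filters, num_sublayers, axis=0)
    # Each call below could run on a different accelerator in parallel.
    partial_outputs = [conv2d(inp, chunk) for chunk in chunks]
    return np.concatenate(partial_outputs, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    inp = rng.standard_normal((3, 16, 16))       # C=3 input feature map
    filters = rng.standard_normal((8, 3, 3, 3))  # K=8 output channels
    full = conv2d(inp, filters)
    tiled = run_partitioned(inp, filters, num_sublayers=4)
    assert np.allclose(full, tiled)              # same result either way
```

The sub-layer outputs are disjoint slices of the layer's output feature map, which is why processing them concurrently requires no synchronization beyond the final concatenation.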