Abstract
Video analytics has a wide range of applications and has attracted much interest over the years. While it can be both computationally and energy intensive, video analytics can benefit greatly from in/near-memory compute. Moving compute closer to memory continues to improve performance and energy consumption and is seeing increasing adoption. Recent solid-state drives (SSDs) incorporate near-memory Field Programmable Gate Arrays (FPGAs) with shared access to the drive's storage cells. These near-memory FPGAs can run operations required by video analytics pipelines, such as object detection and template matching, which are typically implemented with Convolutional Neural Networks (CNNs). A CNN is composed of multiple individually processed layers that perform various image processing tasks. When resources are insufficient to process a layer at once, the layer may be partitioned into more manageable sub-layers. These sub-layers are typically processed sequentially; however, some of them can be processed simultaneously. Moreover, the storage cells within FPGA-equipped SSDs can be augmented with in-storage compute to accelerate CNN workloads and exploit the parallelism within a CNN layer. To this end, we present our work, which leverages heterogeneous architectures to create an in/near-storage acceleration solution for video analytics. We designed a NAND flash accelerator and an FPGA accelerator, then mapped and evaluated several CNN benchmarks. We show how to utilize FPGAs, local DRAMs, and in-memory SSD compute to accelerate CNN workloads. Our work also demonstrates how to remove unnecessary memory transfers to save latency and energy.
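To make the layer-partitioning idea concrete, the following is a minimal sketch (not the paper's implementation): it splits one convolution layer's filters into sub-layers along the output-channel dimension. Because each sub-layer depends only on the shared input feature map, the sub-layers can be processed sequentially on a single unit or simultaneously across units (e.g., FPGA kernels and in-storage compute). The function names `conv2d` and `run_partitioned` are illustrative assumptions, not names from this work.

```python
import numpy as np

def conv2d(inp, filters):
    """Naive valid convolution: inp (C,H,W), filters (K,C,R,S) -> (K,H-R+1,W-S+1)."""
    C, H, W = inp.shape
    K, _, R, S = filters.shape
    out = np.zeros((K, H - R + 1, W - S + 1))
    for k in range(K):
        for i in range(H - R + 1):
            for j in range(W - S + 1):
                out[k, i, j] = np.sum(inp[:, i:i + R, j:j + S] * filters[k])
    return out

def run_partitioned(inp, filters, num_sublayers):
    """Split the K output channels into sub-layers; each chunk is independent."""
    chunks = np.array_split(filters, num_sublayers, axis=0)
    # Each call below could run on a different accelerator in parallel.
    partial_outputs = [conv2d(inp, chunk) for chunk in chunks]
    return np.concatenate(partial_outputs, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    inp = rng.standard_normal((3, 16, 16))       # C=3 input feature map
    filters = rng.standard_normal((8, 3, 3, 3))  # K=8 output channels
    full = conv2d(inp, filters)
    tiled = run_partitioned(inp, filters, num_sublayers=4)
    assert np.allclose(full, tiled)              # same result either way
```

The sub-layer outputs are disjoint slices of the layer's output feature map, which is why processing them concurrently requires no synchronization beyond the final concatenation.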