Research Article

Static Scheduling of Weight Programming for DNN Acceleration with Resource Constrained PIM

Published: 11 September 2024

Abstract

Most existing architectural studies of ReRAM-based processing-in-memory (PIM) DNN accelerators assume that all weights of a DNN can be mapped onto the crossbars at once. This assumption is over-idealized: due to technology limitations, the ReRAM crossbar resources available for computation are scarce, so weights must be programmed onto the crossbars multiple times during inference. In this article, we propose a static scheduling framework that generates a mapping between DNN weights and ReRAM cells with minimal runtime weight-programming cost. We first build a ReRAM crossbar programming-latency model that jointly considers the DNN weight patterns, the ReRAM programming operations, and the characteristics of the PIM architecture. This model then drives a search process that produces an optimized weight-to-OU (operation unit) mapping table with minimal online programming latency. Finally, an OU scheduler coordinates the activation sequences of the OUs in the crossbars so that the inference computation is performed correctly. Evaluation results show that the proposed framework significantly reduces both the weight-programming overhead and the overall inference latency for various DNN models and input datasets.
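As a rough illustration of the flow the abstract describes, the sketch below scores candidate weight-to-OU assignments with a programming-latency model and keeps the cheapest one. This is a minimal sketch, not the paper's implementation: the toy cost model, the `CROSSBAR_OUS` and `PROGRAM_CYCLES` constants, and the exhaustive search are all simplifying assumptions introduced here for illustration.

```python
from itertools import permutations

CROSSBAR_OUS = 4       # OUs available per crossbar (assumed, not from the paper)
PROGRAM_CYCLES = 100   # cost of reprogramming one OU (assumed, not from the paper)

def programming_latency(mapping, resident):
    """Toy latency model: every weight tile assigned to an OU that does not
    already hold that tile must be reprogrammed at runtime."""
    return sum(PROGRAM_CYCLES
               for ou, tile in mapping.items()
               if resident.get(ou) != tile)

def best_mapping(tiles, resident):
    """Exhaustively try assignments of weight tiles to OUs and return the
    one with the lowest online programming latency."""
    best, best_cost = None, float("inf")
    for ous in permutations(range(CROSSBAR_OUS), len(tiles)):
        mapping = dict(zip(ous, tiles))
        cost = programming_latency(mapping, resident)
        if cost < best_cost:
            best, best_cost = mapping, cost
    return best, best_cost

# OUs 0 and 2 already hold tiles "A" and "B"; mapping the next layer's tiles
# ["A", "B", "C"] should reuse them and reprogram only one OU.
resident = {0: "A", 2: "B"}
mapping, cost = best_mapping(["A", "B", "C"], resident)
print(mapping, cost)   # -> {0: 'A', 2: 'B', 1: 'C'} 100
```

A practical scheduler would replace the exhaustive search with a scalable heuristic and extend the cost model with the weight-pattern, programming-operation, and architecture terms the abstract describes.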




Published In

ACM Transactions on Embedded Computing Systems, Volume 23, Issue 6
November 2024
505 pages
EISSN: 1558-3465
DOI: 10.1145/3613645
Editor: Tulika Mitra

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 September 2024
Online AM: 11 August 2023
Accepted: 22 July 2023
Revised: 21 May 2023
Received: 26 January 2023
Published in TECS Volume 23, Issue 6


Author Tags

  1. PIM
  2. ReRAM
  3. resource-constrained
  4. DNN accelerator

Qualifiers

  • Research-article

Funding Sources

  • Natural Science Foundation of China
  • Taishan Scholars Program
  • Qilu Young Scholar Program of Shandong University
