DOI: 10.1145/3400302.3415640

ReTransformer: ReRAM-based processing-in-memory architecture for transformer acceleration

Published: 17 December 2020

Abstract

Transformer has emerged as a popular deep neural network (DNN) model for natural language processing (NLP) applications and has demonstrated excellent performance in neural machine translation, entity recognition, and other tasks. However, the scaled dot-product attention mechanism in its auto-regressive decoder becomes a performance bottleneck during inference. Transformer is also computationally and memory intensive, and therefore demands a hardware acceleration solution. Although researchers have successfully applied ReRAM-based processing-in-memory (PIM) to accelerate convolutional neural networks (CNNs) and recurrent neural networks (RNNs), the unique computation pattern of scaled dot-product attention in Transformer makes it difficult to apply these designs directly. Moreover, how to handle the intermediate results of matrix-matrix multiplication (MatMul) and how to pipeline Transformer at a finer granularity remain unsolved. In this work, we propose ReTransformer, a ReRAM-based PIM architecture for Transformer acceleration. ReTransformer not only accelerates the scaled dot-product attention of Transformer using ReRAM-based PIM but also eliminates some data dependencies by avoiding writing intermediate results, using the proposed matrix decomposition technique. Moreover, we propose a new sub-matrix pipeline design for multi-head self-attention. Experimental results show that, compared to GPU and PipeLayer, ReTransformer improves computing efficiency by 23.21× and 3.25×, respectively. The corresponding overall power consumption is reduced by 1086× and 2.82×, respectively.
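
For context, the sketch below is not taken from the paper; it shows the standard scaled dot-product attention of Vaswani et al. [28] in plain NumPy, with the function name and toy sizes chosen purely for illustration. It makes explicit the two MatMuls, QK^T and softmax(.)V, whose large intermediate results motivate ReTransformer's matrix decomposition and sub-matrix pipeline.

    # Illustrative sketch only (not the paper's design): standard scaled
    # dot-product attention (Vaswani et al. [28]) written in NumPy to expose
    # the two matrix-matrix multiplications the paper targets.
    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Q, K, V: (seq_len, d_k) matrices for a single attention head."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)               # first MatMul: (seq_len, seq_len) intermediate
        scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V                            # second MatMul: (seq_len, d_k) output

    # Toy example with hypothetical sizes: 4 tokens, head dimension 8.
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
    out = scaled_dot_product_attention(Q, K, V)       # shape (4, 8)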

References

[1]
Karim Ahmed, Nitish Shirish Keskar, and Richard Socher. 2017. Weighted Transformer Network for Machine Translation. arXiv:1711.02132
[2]
Diogo Brito, Taimur Gibran Rabuske, Jorge R. Fernandes, Paulo F. Flores, and José C. Monteiro. 2015. Quaternary Logic Lookup Table in Standard CMOS. IEEE Trans. Very Large Scale Integr. Syst. 23, 2 (2015), 306--316.
[3]
Meng-Fan Chang, Pi-Feng Chiu, and Shyh-Shyuan Sheu. 2011. Circuit design challenges in embedded memory and resistive RAM (RRAM) for mobile SoC and 3D-IC. In Proceedings of the 16th Asia South Pacific Design Automation Conference, ASP-DAC 2011, Yokohama, Japan, January 25--27, 2011. IEEE, 197--203.
[4]
Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. 2016. PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory. In 43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 18--22, 2016. IEEE Computer Society, 27--39.
[5]
Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv:1412.3555
[6]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805
[7]
Gaoming Du, Chao Tian, Zhenmin Li, Duoli Zhang, Yong-Sheng Yin, and Yiming Ouyang. 2019. Efficient Softmax Hardware Architecture for Deep Neural Networks. In Proceedings of the 2019 on Great Lakes Symposium on VLSI, GLSVLSI 2019, Tysons Corner, VA, USA, May 9--11, 2019. ACM, 75--80.
[8]
Rich Fackenthal, Makoto Kitagawa, Wataru Otsuka, Kirk Prall, Duane Mills, Keiichi Tsutsui, Johnny Javanifard, Kerry Tedrow, Tomohito Tsushima, Yoshiyuki Shibahara, and Glen Hush. 2014. 19.7 A 16Gb ReRAM with 200MB/s write and 1GB/s read in 27nm technology. In 2014 IEEE International Solid-State Circuits Conference, ISSCC 2014, Digest of Technical Papers, San Francisco, CA, USA, February 9--13, 2014. IEEE, 338--339.
[9]
Jiatao Gu, James Bradbury, Caiming Xiong, Victor O. K. Li, and Richard Socher. 2017. Non-Autoregressive Neural Machine Translation. arXiv:1711.02281
[10]
Saransh Gupta, Mohsen Imani, and Tajana Rosing. 2018. FELIX: fast and energy-efficient logic in memory. In Proceedings of the International Conference on Computer-Aided Design, ICCAD 2018, San Diego, CA, USA, November 05--08, 2018. ACM, 55.
[11]
Runze Han, Peng Huang, Yudi Zhao, Zhe Chen, Lifeng Liu, Xiaoyan Liu, and Jinfeng Kang. 2017. Demonstration of logic operations in high-performance RRAM crossbar array fabricated by atomic layer deposition technique. Nanoscale research letters 12, 1 (2017), 1--6.
[12]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27--30, 2016. IEEE Computer Society, 770--778.
[13]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735--1780.
[14]
Miao Hu, Hai Li, Qing Wu, Garrett S. Rose, and Yiran Chen. 2012. Memristor crossbar based hardware realization of BSB recall function. In The 2012 International Joint Conference on Neural Networks (IJCNN), Brisbane, Australia, June 10--15, 2012. IEEE, 1--7.
[15]
Shahar Kvatinsky, Dmitry Belousov, Slavik Liman, Guy Satat, Nimrod Wald, Eby G. Friedman, Avinoam Kolodny, and Uri C. Weiser. 2014. MAGIC - Memristor-Aided Logic. IEEE Trans. Circuits Syst. II Express Briefs 61-II, 11 (2014), 895--899.
[16]
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv:1909.11942
[17]
Bing Li, Bonan Yan, Chenchen Liu, and Hai (Helen) Li. 2019. Build reliable and efficient neuromorphic design with memristor technology. In Proceedings of the 24th Asia and South Pacific Design Automation Conference, ASPDAC 2019, Tokyo, Japan, January 21--24, 2019. ACM, 224--229.
[18]
Shuo Li, Nong Xiao, Peng Wang, Guangyu Sun, Xiaoyang Wang, Yiran Chen, Hai Helen Li, Jason Cong, and Tao Zhang. 2019. RC-NVM: Dual-Addressing Non-Volatile Memory Architecture Supporting Both Row and Column Memory Accesses. IEEE Trans. Computers 68, 2 (2019), 239--254.
[19]
Mengyun Liu, Lixue Xia, Yu Wang, and Krishnendu Chakrabarty. 2019. Fault tolerance in neuromorphic computing systems. In Proceedings of the 24th Asia and South Pacific Design Automation Conference, ASPDAC 2019, Tokyo, Japan, January 21--24, 2019. ACM, 216--223.
[20]
Yun Long, Taesik Na, and Saibal Mukhopadhyay. 2018. ReRAM-Based Processing-in-Memory Architecture for Recurrent Neural Network Acceleration. IEEE Trans. Very Large Scale Integr. Syst. 26, 12 (2018), 2781--2794.
[21]
Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26--30, 2010. ISCA, 1045--1048.
[22]
Xiaochen Peng, Shanshi Huang, Yandong Luo, Xiaoyu Sun, and Shimeng Yu. 2019. DNN+NeuroSim: An End-to-End Benchmarking Framework for Compute-in-Memory Accelerators with Versatile Device Technologies. In 2019 IEEE International Electron Devices Meeting (IEDM). IEEE, 32.5.1--32.5.4.
[23]
Xiaochen Peng, Rui Liu, and Shimeng Yu. 2019. Optimizing Weight Mapping and Data Flow for Convolutional Neural Networks on RRAM Based Processing-In-Memory Architecture. In IEEE International Symposium on Circuits and Systems, ISCAS 2019, Sapporo, Japan, May 26--29, 2019. IEEE, 1--5.
[24]
Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R. Stanley Williams, and Vivek Srikumar. 2016. ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. In 43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 18--22, 2016. IEEE Computer Society, 14--26.
[25]
Shyh-Shyuan Sheu, Meng-Fan Chang, Ku-Feng Lin, Che-Wei Wu, Yu-Sheng Chen, Pi-Feng Chiu, Chia-Chen Kuo, Yih-Shan Yang, Pei-Chia Chiang, Wen-Pin Lin, Che-He Lin, Heng-Yuan Lee, Peiyi Gu, Sumin Wang, Frederick T. Chen, Keng-Li Su, Chen-Hsin Lien, Kuo-Hsing Cheng, Hsin-Tun Wu, Tzu-Kun Ku, Ming-Jer Kao, and Ming-Jinn Tsai. 2011. A 4Mb embedded SLC resistive-RAM macro with 7.2ns read-write random-access time and 160ns MLC-access capability. In 2011 IEEE International Solid-State Circuits Conference, ISSCC 2011, Digest of Technical Papers, San Francisco, CA, USA, 20--24 February, 2011. IEEE, 200--202.
[26]
Saeideh Shirinzadeh, Mathias Soeken, Pierre-Emmanuel Gaillardon, Giovanni De Micheli, and Rolf Drechsler. 2017. Endurance management for resistive Logic-In-Memory computing architectures. In Design, Automation & Test in Europe Conference & Exhibition, DATE 2017, Lausanne, Switzerland, March 27--31, 2017. IEEE, 1092--1097.
[27]
Linghao Song, Xuehai Qian, Hai Li, and Yiran Chen. 2017. PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning. In 2017 IEEE International Symposium on High Performance Computer Architecture, HPCA 2017, Austin, TX, USA, February 4--8, 2017. IEEE Computer Society, 541--552.
[28]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4--9 December 2017, Long Beach, CA, USA. 5998--6008.
[29]
Lixue Xia, Peng Gu, Boxun Li, Tianqi Tang, Xiling Yin, Wenqin Huangfu, Shimeng Yu, Yu Cao, Yu Wang, and Huazhong Yang. 2016. Technological Exploration of RRAM Crossbar Array for Matrix-Vector Multiplication. J. Comput. Sci. Technol. 31, 1 (2016), 3--19.
[30]
Tong Xiao, Yinqiao Li, Jingbo Zhu, Zhengtao Yu, and Tongran Liu. 2019. Sharing Attention Weights for Fast Transformer. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10--16, 2019. ijcai.org, 5292--5298.
[31]
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8--14 December 2019, Vancouver, BC, Canada. 5754--5764.
[32]
Bo Yuan. 2016. Efficient hardware architecture of softmax layer in deep neural network. In 29th IEEE International System-on-Chip Conference, SOCC 2016, Seattle, WA, USA, September 6--9, 2016. IEEE, 323--326.
[33]
Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. 2019. Q8BERT: Quantized 8Bit BERT. arXiv:1910.06188
[34]
Biao Zhang, Deyi Xiong, and Jinsong Su. 2018. Accelerating Neural Transformer via an Average Attention Network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15--20, 2018, Volume 1: Long Papers. Association for Computational Linguistics, 1789--1798.

    Published In

    ICCAD '20: Proceedings of the 39th International Conference on Computer-Aided Design
    November 2020
    1396 pages
    ISBN:9781450380263
    DOI:10.1145/3400302
    • General Chair: Yuan Xie

    In-Cooperation

    • IEEE CAS
    • IEEE CEDA
    • IEEE CS

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. ReRAM
    2. processing-in-memory
    3. transformer

    Qualifiers

    • Research-article

    Conference

    ICCAD '20

    Acceptance Rates

    Overall Acceptance Rate 457 of 1,762 submissions, 26%
