DOI: 10.1145/3400302.3415640

ReTransformer: ReRAM-based processing-in-memory architecture for transformer acceleration

Published: 17 December 2020

Abstract

Transformer has emerged as a popular deep neural network (DNN) model for natural language processing (NLP) applications and has demonstrated excellent performance in neural machine translation, entity recognition, and other tasks. However, the scaled dot-product attention mechanism in its auto-regressive decoder becomes a performance bottleneck during inference. Transformer is also computationally and memory intensive, and therefore demands a hardware acceleration solution. Although researchers have successfully applied ReRAM-based processing-in-memory (PIM) to accelerate convolutional neural networks (CNNs) and recurrent neural networks (RNNs), the unique computation pattern of scaled dot-product attention in Transformer makes it difficult to apply these designs directly. Moreover, how to handle the intermediate results of matrix-matrix multiplication (MatMul) and how to pipeline Transformer at a finer granularity remain unsolved. In this work, we propose ReTransformer, a ReRAM-based PIM architecture for Transformer acceleration. ReTransformer not only accelerates the scaled dot-product attention of Transformer using ReRAM-based PIM but also eliminates some data dependencies by avoiding writing intermediate results, using the proposed matrix decomposition technique. Moreover, we propose a new sub-matrix pipeline design for multi-head self-attention. Experimental results show that, compared to GPU and PipeLayer, ReTransformer improves computing efficiency by 23.21× and 3.25×, respectively. The corresponding overall power consumption is reduced by 1086× and 2.82×, respectively.
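
For context, the sketch below is not taken from the paper; it shows the standard scaled dot-product attention of Vaswani et al. [28] in plain NumPy, with the function name and toy sizes chosen purely for illustration. It makes explicit the two MatMuls, QK^T and softmax(.)V, whose large intermediate results motivate ReTransformer's matrix decomposition and sub-matrix pipeline.

    # Illustrative sketch only (not the paper's design): standard scaled
    # dot-product attention (Vaswani et al. [28]) written in NumPy to expose
    # the two matrix-matrix multiplications the paper targets.
    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Q, K, V: (seq_len, d_k) matrices for a single attention head."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)               # first MatMul: (seq_len, seq_len) intermediate
        scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V                            # second MatMul: (seq_len, d_k) output

    # Toy example with hypothetical sizes: 4 tokens, head dimension 8.
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
    out = scaled_dot_product_attention(Q, K, V)       # shape (4, 8)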

References

[1]
Karim Ahmed, Nitish Shirish Keskar, and Richard Socher. 2017. Weighted Transformer Network for Machine Translation. arXiv:1711.02132
[2]
Diogo Brito, Taimur Gibran Rabuske, Jorge R. Fernandes, Paulo F. Flores, and José C. Monteiro. 2015. Quaternary Logic Lookup Table in Standard CMOS. IEEE Trans. Very Large Scale Integr. Syst. 23, 2 (2015), 306--316.
[3]
Meng-Fan Chang, Pi-Feng Chiu, and Shyh-Shyuan Sheu. 2011. Circuit design challenges in embedded memory and resistive RAM (RRAM) for mobile SoC and 3D-IC. In Proceedings of the 16th Asia South Pacific Design Automation Conference, ASP-DAC 2011, Yokohama, Japan, January 25--27, 2011. IEEE, 197--203.
[4]
Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. 2016. PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory. In 43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 18--22, 2016. IEEE Computer Society, 27--39.
[5]
Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv:1412.3555
[6]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805
[7]
Gaoming Du, Chao Tian, Zhenmin Li, Duoli Zhang, Yong-Sheng Yin, and Yiming Ouyang. 2019. Efficient Softmax Hardware Architecture for Deep Neural Networks. In Proceedings of the 2019 on Great Lakes Symposium on VLSI, GLSVLSI 2019, Tysons Corner, VA, USA, May 9--11, 2019. ACM, 75--80.
[8]
Rich Fackenthal, Makoto Kitagawa, Wataru Otsuka, Kirk Prall, Duane Mills, Keiichi Tsutsui, Johnny Javanifard, Kerry Tedrow, Tomohito Tsushima, Yoshiyuki Shibahara, and Glen Hush. 2014. 19.7 A 16Gb ReRAM with 200MB/s write and 1GB/s read in 27nm technology. In 2014 IEEE International Solid-State Circuits Conference, ISSCC 2014, Digest of Technical Papers, San Francisco, CA, USA, February 9--13, 2014. IEEE, 338--339.
[9]
Jiatao Gu, James Bradbury, Caiming Xiong, Victor O. K. Li, and Richard Socher. 2017. Non-Autoregressive Neural Machine Translation. arXiv:1711.02281
[10]
Saransh Gupta, Mohsen Imani, and Tajana Rosing. 2018. FELIX: fast and energy-efficient logic in memory. In Proceedings of the International Conference on Computer-Aided Design, ICCAD 2018, San Diego, CA, USA, November 05--08, 2018. ACM, 55.
[11]
Runze Han, Peng Huang, Yudi Zhao, Zhe Chen, Lifeng Liu, Xiaoyan Liu, and Jinfeng Kang. 2017. Demonstration of logic operations in high-performance RRAM crossbar array fabricated by atomic layer deposition technique. Nanoscale research letters 12, 1 (2017), 1--6.
[12]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27--30, 2016. IEEE Computer Society, 770--778.
[13]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735--1780.
[14]
Miao Hu, Hai Li, Qing Wu, Garrett S. Rose, and Yiran Chen. 2012. Memristor crossbar based hardware realization of BSB recall function. In The 2012 International Joint Conference on Neural Networks (IJCNN), Brisbane, Australia, June 10--15, 2012. IEEE, 1--7.
[15]
Shahar Kvatinsky, Dmitry Belousov, Slavik Liman, Guy Satat, Nimrod Wald, Eby G. Friedman, Avinoam Kolodny, and Uri C. Weiser. 2014. MAGIC - Memristor-Aided Logic. IEEE Trans. Circuits Syst. II Express Briefs 61-II, 11 (2014), 895--899.
[16]
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv:1909.11942
[17]
Bing Li, Bonan Yan, Chenchen Liu, and Hai (Helen) Li. 2019. Build reliable and efficient neuromorphic design with memristor technology. In Proceedings of the 24th Asia and South Pacific Design Automation Conference, ASPDAC 2019, Tokyo, Japan, January 21--24, 2019. ACM, 224--229.
[18]
Shuo Li, Nong Xiao, Peng Wang, Guangyu Sun, Xiaoyang Wang, Yiran Chen, Hai Helen Li, Jason Cong, and Tao Zhang. 2019. RC-NVM: Dual-Addressing Non-Volatile Memory Architecture Supporting Both Row and Column Memory Accesses. IEEE Trans. Computers 68, 2 (2019), 239--254.
[19]
Mengyun Liu, Lixue Xia, Yu Wang, and Krishnendu Chakrabarty. 2019. Fault tolerance in neuromorphic computing systems. In Proceedings of the 24th Asia and South Pacific Design Automation Conference, ASPDAC 2019, Tokyo, Japan, January 21--24, 2019. ACM, 216--223.
[20]
Yun Long, Taesik Na, and Saibal Mukhopadhyay. 2018. ReRAM-Based Processing-in-Memory Architecture for Recurrent Neural Network Acceleration. IEEE Trans. Very Large Scale Integr. Syst. 26, 12 (2018), 2781--2794.
[21]
Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26--30, 2010. ISCA, 1045--1048.
[22]
Xiaochen Peng, Shanshi Huang, Yandong Luo, Xiaoyu Sun, and Shimeng Yu. 2019. DNN+NeuroSim: An End-to-End Benchmarking Framework for Compute-in-Memory Accelerators with Versatile Device Technologies. In 2019 IEEE International Electron Devices Meeting (IEDM). IEEE, 32.5.1--32.5.4.
[23]
Xiaochen Peng, Rui Liu, and Shimeng Yu. 2019. Optimizing Weight Mapping and Data Flow for Convolutional Neural Networks on RRAM Based Processing-In-Memory Architecture. In IEEE International Symposium on Circuits and Systems, ISCAS 2019, Sapporo, Japan, May 26--29, 2019. IEEE, 1--5.
[24]
Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R. Stanley Williams, and Vivek Srikumar. 2016. ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. In 43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 18--22, 2016. IEEE Computer Society, 14--26.
[25]
Shyh-Shyuan Sheu, Meng-Fan Chang, Ku-Feng Lin, Che-Wei Wu, Yu-Sheng Chen, Pi-Feng Chiu, Chia-Chen Kuo, Yih-Shan Yang, Pei-Chia Chiang, Wen-Pin Lin, Che-He Lin, Heng-Yuan Lee, Peiyi Gu, Sumin Wang, Frederick T. Chen, Keng-Li Su, Chen-Hsin Lien, Kuo-Hsing Cheng, Hsin-Tun Wu, Tzu-Kun Ku, Ming-Jer Kao, and Ming-Jinn Tsai. 2011. A 4Mb embedded SLC resistive-RAM macro with 7.2ns read-write random-access time and 160ns MLC-access capability. In 2011 IEEE International Solid-State Circuits Conference, ISSCC 2011, Digest of Technical Papers, San Francisco, CA, USA, 20--24 February, 2011. IEEE, 200--202.
[26]
Saeideh Shirinzadeh, Mathias Soeken, Pierre-Emmanuel Gaillardon, Giovanni De Micheli, and Rolf Drechsler. 2017. Endurance management for resistive Logic-In-Memory computing architectures. In Design, Automation & Test in Europe Conference & Exhibition, DATE 2017, Lausanne, Switzerland, March 27--31, 2017. IEEE, 1092--1097.
[27]
Linghao Song, Xuehai Qian, Hai Li, and Yiran Chen. 2017. PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning. In 2017 IEEE International Symposium on High Performance Computer Architecture, HPCA 2017, Austin, TX, USA, February 4--8, 2017. IEEE Computer Society, 541--552.
[28]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4--9 December 2017, Long Beach, CA, USA. 5998--6008.
[29]
Lixue Xia, Peng Gu, Boxun Li, Tianqi Tang, Xiling Yin, Wenqin Huangfu, Shimeng Yu, Yu Cao, Yu Wang, and Huazhong Yang. 2016. Technological Exploration of RRAM Crossbar Array for Matrix-Vector Multiplication. J. Comput. Sci. Technol. 31, 1 (2016), 3--19.
[30]
Tong Xiao, Yinqiao Li, Jingbo Zhu, Zhengtao Yu, and Tongran Liu. 2019. Sharing Attention Weights for Fast Transformer. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10--16, 2019. ijcai.org, 5292--5298.
[31]
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8--14 December 2019, Vancouver, BC, Canada. 5754--5764.
[32]
Bo Yuan. 2016. Efficient hardware architecture of softmax layer in deep neural network. In 29th IEEE International System-on-Chip Conference, SOCC 2016, Seattle, WA, USA, September 6--9, 2016. IEEE, 323--326.
[33]
Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. 2019. Q8BERT: Quantized 8Bit BERT. arXiv:1910.06188
[34]
Biao Zhang, Deyi Xiong, and Jinsong Su. 2018. Accelerating Neural Transformer via an Average Attention Network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15--20, 2018, Volume 1: Long Papers. Association for Computational Linguistics, 1789--1798.

    Published In

    ICCAD '20: Proceedings of the 39th International Conference on Computer-Aided Design
    November 2020
    1396 pages
    ISBN:9781450380263
    DOI:10.1145/3400302
    • General Chair: Yuan Xie

    In-Cooperation

    • IEEE CAS
    • IEEE CEDA
    • IEEE CS

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. ReRAM
    2. processing-in-memory
    3. transformer

    Qualifiers

    • Research-article

    Conference

    ICCAD '20

    Acceptance Rates

    Overall Acceptance Rate 457 of 1,762 submissions, 26%
