
AttAcc! Unleashing the Power of PIM for Batched Transformer-based Generative Model Inference

Published: 27 April 2024
DOI: 10.1145/3620665.3640422

ABSTRACT

The Transformer-based generative model (TbGM), comprising summarization (Sum) and generation (Gen) stages, has demonstrated unprecedented generative performance across a wide range of applications. However, it also demands immense amounts of compute and memory resources. In particular, the Gen stages, consisting of the attention and fully-connected (FC) layers, dominate the overall execution time. Meanwhile, we reveal that the conventional GPU-based system used for TbGM inference cannot execute the attention layer efficiently, even with batching, due to various constraints. To address this inefficiency, we first propose AttAcc, a processing-in-memory (PIM) architecture for efficient execution of the attention layer. Then, for end-to-end acceleration of TbGM inference, we propose a novel heterogeneous system architecture, with accompanying optimizations, that strategically uses the xPU and PIM together: it leverages the high memory bandwidth of AttAcc for the attention layer and the powerful compute capability of the conventional system for the FC layer. Finally, we demonstrate that our GPU-PIM system outperforms a conventional system with the same memory capacity, improving the performance and energy efficiency of running a 175B-parameter TbGM by up to 2.81× and 2.67×, respectively.


Published in

ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2
April 2024, 1299 pages
ISBN: 9798400703850
DOI: 10.1145/3620665

Copyright © 2024 held by the owner/author(s). Publication rights licensed to ACM.


Publisher

Association for Computing Machinery, New York, NY, United States


      Qualifiers

      • research-article

Acceptance Rates

Overall Acceptance Rate: 535 of 2,713 submissions, 20%