ABSTRACT
Generalized Sparse Matrix-Matrix Multiplication (SpGEMM) is a critical kernel in domains like graph analytic and scientific computation. As a kind of classical special-purpose architecture, systolic arrays were first used for complex computing problems, e.g., matrix multiplication. However, classical systolic arrays are not efficient enough when handling sparse matrices due to the fact that the PEs containing zero-valued entries perform unnecessary operations that do not contribute to the result. Accordingly, in this paper, we propose Mentha, a framework that enables systolic arrays to accelerate sparse matrix computation by employing a sparse-packing algorithm suitable for various dataflow of systolic array. Firstly, Mentha supports both online and offline methods. By packing the rows or columns of the sparse matrix, the zero-valued items in the matrix are significantly reduced and the density of the matrix is improved. In addition, acceleration benefits can be obtained by the adaptation scheme even with limited resources. Moreover, we reconfigure PEs in systolic arrays at a low cost (1.28x in area, 1.21x in power) and find that our method outperforms TPU-like systolic arrays by 1.2~3.3x in terms of SpMM and 1.3~4.4x in terms of SpGEMM when dealing with moderately sparse matrices (sparsity < 0.9), while its performance is at least 9.7x better than cuSPARSE. Furthermore, experimental results show a FLOPs reduction of roughly 3.4x in the neural network.
- C Bagavathi and O Saraniya. 2019. Evolutionary Mapping Techniques for Systolic Computing System. Deep Learning and Parallel Computing Environment for Bioengineering Systems (2019).Google Scholar
- Shijie Cao, Chen Zhang, Zhuliang Yao, Wencong Xiao, Lanshun Nie, De-chen Zhan, Yunxin Liu, Ming Wu, and Lintao Zhang. 2019. Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA 2019, Seaside, CA, USA, February 24-26, 2019, Kia Bazargan and Stephen Neuendorffer (Eds.). ACM, 63–72.Google ScholarDigital Library
- NVIDIA Corporation. 2012. CUDA CUSPARSE Library.Google Scholar
- Baiyun Cui, Yingming Li, Ming Chen, and Zhongfei Zhang. 2019. Fine-tune BERT with Sparse Self-Attention Mechanism. In EMNLP/IJCNLP (1). Association for Computational Linguistics, 3546–3551.Google Scholar
- Timothy A. Davis and Yifan Hu. 2011. The university of Florida sparse matrix collection. ACM Trans. Math. Softw. 38, 1 (2011), 1:1–1:25.Google ScholarDigital Library
- Tim Dettmers and Luke Zettlemoyer. 2019. Sparse Networks from Scratch: Faster Training without Losing Performance. CoRR abs/1907.04840(2019).Google Scholar
- Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. 2020. Rigging the Lottery: Making All Tickets Winners. In ICML(Proceedings of Machine Learning Research, Vol. 119). PMLR, 2943–2952.Google Scholar
- Ashish Gondimalla, Noah Chesnut, Mithuna Thottethodi, and T. N. Vijaykumar. 2019. SparTen: A Sparse Tensor Accelerator for Convolutional Neural Networks. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2019, Columbus, OH, USA, October 12-16, 2019. ACM, 151–165.Google ScholarDigital Library
- Zhangxiaowen Gong, Houxiang Ji, Christopher Fletcher, Christopher Hughes, Sara Baghsorkhi, and Josep Torrellas. 2020. SAVE: Sparsity-Aware Vector Engine for Accelerating DNN Training and Inference on CPUs. 796–810. https://doi.org/10.1109/MICRO50266.2020.00070Google Scholar
- Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, and William (Bill) J. Dally. 2017. ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA. In FPGA. ACM, 75–84.Google ScholarDigital Library
- Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. EIE: Efficient Inference Engine on Compressed Deep Neural Network. In ISCA. IEEE Computer Society, 243–254.Google ScholarDigital Library
- Xin He, Subhankar Pal, Aporva Amarnath, Siying Feng, Dong-Hyeon Park, Austin Rovinski, Haojie Ye, Kuan-Yu Chen, Ronald G. Dreslinski, and Trevor N. Mudge. 2020. Sparse-TPU: adapting systolic arrays for sparse matrices. In ICS ’20: 2020 International Conference on Supercomputing, Barcelona Spain, June, 2020, Eduard Ayguadé, Wen-mei W. Hwu, Rosa M. Badia, and H. Peter Hofstee (Eds.). ACM, 19:1–19:12.Google ScholarDigital Library
- Kartik Hegde, Hadi Asghari Moghaddam, Michael Pellauer, Neal Clayton Crago, Aamer Jaleel, Edgar Solomonik, Joel S. Emer, and Christopher W. Fletcher. 2019. ExTensor: An Accelerator for Sparse Tensor Algebra. In MICRO. ACM, 319–333.Google Scholar
- Changwan Hong, Aravind Sukumaran-Rajam, Israt Nisa, Kunal Singh, and P. Sadayappan. 2019. Adaptive sparse tiling for sparse matrix multiplication. In PPoPP. ACM, 300–314.Google Scholar
- Norman P. Jouppi, Cliff Young, Nishant Patil, David A. Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA 2017, Toronto, ON, Canada, June 24-28, 2017. ACM.Google ScholarDigital Library
- H. T. Kung. 1982. Why Systolic Architectures?Computer 15, 1 (1982), 37–46.Google ScholarDigital Library
- H. T. Kung, Bradley McDanel, and Sai Qian Zhang. 2019. Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization. In ASPLOS. ACM, 821–834.Google Scholar
- Süreyya Emre Kurt, Aravind Sukumaran-Rajam, Fabrice Rastello, and P. Sadayappan. 2020. Efficient tiled sparse matrix multiplication through matrix signatures. In SC. IEEE/ACM, 87.Google Scholar
- Namhoon Lee, Thalaiyasingam Ajanthan, and Philip H. S. Torr. 2019. Snip: single-Shot Network Pruning based on Connection sensitivity. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net. https://openreview.net/forum?id=B1VZqjAcYXGoogle Scholar
- Fanrong Li, Gang Li, Zitao Mo, Xiangyu He, and Jian Cheng. 2020. FSA: A Fine-Grained Systolic Accelerator for Sparse CNNs. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 39, 11(2020), 3589–3600.Google ScholarCross Ref
- Zhi Gang Liu, Paul N. Whatmough, Yuhao Zhu, and Matthew Mattina. 2022. S2TA: Exploiting Structured Sparsity for Energy-Efficient Mobile CNN Acceleration. In IEEE International Symposium on High-Performance Computer Architecture, HPCA 2022, Seoul, South Korea, April 2-6, 2022. IEEE, 573–586.Google ScholarCross Ref
- Liqiang Lu, Yicheng Jin, Hangrui Bi, Zizhang Luo, Peng Li, Tao Wang, and Yun Liang. 2021. Sanger: A Co-Design Framework for Enabling Sparse Attention using Reconfigurable Architecture. In MICRO. ACM, 977–991.Google Scholar
- Mostafa Mahmoud, Isak Edo, Ali Hadi Zadeh, Omar Mohamed Awad, Gennady Pekhimenko, Jorge Albericio, and Andreas Moshovos. 2020. TensorDash: Exploiting Sparsity to Accelerate Deep Neural Network Training. In MICRO. IEEE, 781–795.Google Scholar
- Decebal Mocanu, Elena Mocanu, Peter Stone, Phuong Nguyen, Madeleine Gibescu, and Antonio Liotta. 2018. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications 9 (06 2018).Google Scholar
- Hesham Mostafa and Xin Wang. 2019. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA(Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). PMLR, 4646–4655.Google Scholar
- Subhankar Pal, Jonathan Beaumont, Dong-Hyeon Park, Aporva Amarnath, Siying Feng, Chaitali Chakrabarti, Hun-Seok Kim, David T. Blaauw, Trevor N. Mudge, and Ronald G. Dreslinski. 2018. OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator. In IEEE International Symposium on High Performance Computer Architecture, HPCA 2018, Vienna, Austria, February 24-28, 2018. IEEE Computer Society, 724–736.Google Scholar
- Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel S. Emer, Stephen W. Keckler, and William J. Dally. 2017. SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks. In ISCA. ACM, 27–40.Google ScholarDigital Library
- Eric Qin, Ananda Samajdar, Hyoukjun Kwon, Vineet Nadella, Sudarshan Srinivasan, Dipankar Das, Bharat Kaul, and Tushar Krishna. 2020. SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training. In HPCA.Google Scholar
- Mohammadreza Soltaniyeh, Richard Martin, and Santosh Nagarakatte. 2022. An Accelerator for Sparse Convolutional Neural Networks Leveraging Systolic General Matrix-Matrix Multiplication. ACM Transactions on Architecture and Code Optimization (04 2022).Google Scholar
- Nitish Kumar Srivastava, Hanchen Jin, Jie Liu, David H. Albonesi, and Zhiru Zhang. 2020. MatRaptor: A Sparse-Sparse Matrix Multiplication Accelerator Based on Row-Wise Product. In 53rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2020, Athens, Greece, October 17-21, 2020. IEEE, 766–780.Google Scholar
- Vinay Vashishtha, Manoj Vangala, and Lawrence T. Clark. 2017. ASAP7 predictive design kit development and cell design technology co-optimization: Invited paper. In ICCAD. IEEE, 992–998.Google Scholar
- Dingqing Yang, Amin Ghasemazar, Xiaowei Ren, Maximilian Golub, Guy Lemieux, and Mieszko Lis. 2020. Procrustes: a Dataflow and Accelerator for Sparse Deep Neural Network Training. In 53rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2020, Athens, Greece, October 17-21, 2020. IEEE, 711–724.Google ScholarCross Ref
- Zhekai Zhang, Hanrui Wang, Song Han, and William J. Dally. 2020. SpArch: Efficient Architecture for Sparse Matrix Multiplication. In HPCA. IEEE, 261–274.Google Scholar
- Guangxiang Zhao, Junyang Lin, Zhiyuan Zhang, Xuancheng Ren, Qi Su, and Xu Sun. 2019. Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection. CoRR abs/1912.11637(2019).Google Scholar
- Xuda Zhou, Zidong Du, Qi Guo, Shaoli Liu, Chengsi Liu, Chao Wang, Xuehai Zhou, Ling Li, Tianshi Chen, and Yunji Chen. 2018. Cambricon-S: Addressing Irregularity in Sparse Neural Networks through A Cooperative Software/Hardware Approach. In MICRO. IEEE Computer Society, 15–28.Google Scholar
- Maohua Zhu, Tao Zhang, Zhenyu Gu, and Yuan Xie. 2019. Sparse Tensor Core: Algorithm and Hardware Co-Design for Vector-wise Sparse Neural Networks on Modern GPUs. In MICRO. ACM, 359–371.Google Scholar
Index Terms
- Mentha: Enabling Sparse-Packing Computation on Systolic Arrays
Recommendations
Fault-Tolerant Matrix Triangularizations on Systolic Arrays
Examines the checksum methods of Abraham et al. for LU decomposition on multiprocessor arrays. Their methods are efficient for detecting a transient error, but expensive for correcting it due to the need for a computation rollback. The authors show how ...
A class of fault-tolerant systolic arrays for matrix multiplication
This paper presents a proposal for a systematic approach for designing one class of fault-tolerant systolic arrays with orthogonal interconnects and unidirectional data flow (OUSA) for multiplication of rectangular matrices. The method employs space-...
Comments