ABSTRACT
Sparse tensor algebra (SpTA) plays an increasingly important role in machine learning. However, because of the unstructured sparsity of SpTA, general-purpose processors (e.g., CPUs and GPUs) handle it inefficiently, leaving hardware resources underutilized. Sparse kernel accelerators, in turn, are optimized for specific tasks: their dedicated processing units and data paths cannot effectively support other SpTA tasks with different dataflows and varying sparsity, resulting in performance degradation. This paper proposes FEASTA, a Flexible and Efficient Accelerator for Sparse Tensor Algebra. To process general SpTA tasks with various sparsity efficiently, we design FEASTA at three levels. At the dataflow abstraction level, we apply Einstein summation to the sparse fiber-tree data structure to model the unified execution flow of general SpTA as joining and merging fiber trees. At the instruction set architecture (ISA) level, we propose a general SpTA ISA based on this execution flow; it includes distinct instruction types for dense and sparse data, achieving flexibility and efficiency at the instruction level. At the architecture level, we design an instruction-driven architecture of configurable, high-performance function units that supports the flexible and efficient ISA. Evaluations show that FEASTA achieves a 5.40× geometric-mean energy efficiency improvement over GPU across various workloads. FEASTA delivers 1.47× and 3.19× higher performance on sparse matrix multiplication kernels than a state-of-the-art sparse matrix accelerator and a CPU extension, respectively. Across diverse kernels, FEASTA achieves 1.69-12.70× energy efficiency over existing architectures.
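To make the "joining and merging fiber trees" abstraction concrete, the following is a minimal sketch (not FEASTA's actual ISA or implementation) of one level of a fiber tree as a sorted list of (coordinate, value) pairs. Elementwise multiplication of two sparse fibers produces output only where both have nonzeros (an intersection merge), while addition produces output where either does (a union merge); the function names are illustrative.

```python
def intersect_merge(a, b):
    """Two-pointer intersection merge: elementwise multiply of sparse fibers.

    a, b: lists of (coordinate, value) pairs sorted by coordinate.
    Output contains only coordinates present in both fibers.
    """
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        ca, va = a[i]
        cb, vb = b[j]
        if ca == cb:
            out.append((ca, va * vb))
            i += 1
            j += 1
        elif ca < cb:
            i += 1  # skip: no matching nonzero in b
        else:
            j += 1  # skip: no matching nonzero in a
    return out

def union_merge(a, b):
    """Two-pointer union merge: elementwise add of sparse fibers.

    Output contains every coordinate present in either fiber.
    """
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        ca, va = a[i]
        cb, vb = b[j]
        if ca == cb:
            out.append((ca, va + vb))
            i += 1
            j += 1
        elif ca < cb:
            out.append((ca, va))
            i += 1
        else:
            out.append((cb, vb))
            j += 1
    out.extend(a[i:])  # leftover tail of whichever fiber remains
    out.extend(b[j:])
    return out

x = [(0, 2.0), (3, 1.0), (7, 4.0)]
y = [(3, 5.0), (5, 6.0), (7, 0.5)]
print(intersect_merge(x, y))  # [(3, 5.0), (7, 2.0)]
print(union_merge(x, y))      # [(0, 2.0), (3, 6.0), (5, 6.0), (7, 4.5)]
```

Higher-order SpTA kernels (e.g., sparse matrix multiplication expressed in Einstein summation) compose such merges level by level down the fiber tree, which is why a small set of join/merge primitives can cover many sparse workloads.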