ABSTRACT
Sparse tensor algebra (SpTA) plays an increasingly important role in machine learning. However, because of the unstructured sparsity of SpTA, general-purpose processors (e.g., CPUs and GPUs) handle it inefficiently, leaving hardware resources underutilized. Sparse kernel accelerators, in turn, are optimized for specific tasks: their dedicated processing units and data paths cannot effectively support other SpTA tasks with different dataflows and varying sparsity, resulting in performance degradation. This paper proposes FEASTA, a Flexible and Efficient Accelerator for Sparse Tensor Algebra. To process general SpTA tasks with various sparsity efficiently, we design FEASTA at three levels. At the dataflow abstraction level, we apply Einstein summation to the sparse fiber-tree data structure to model the unified execution flow of general SpTA as joining and merging fiber trees. At the instruction set architecture (ISA) level, we propose a general SpTA ISA based on this execution flow; it includes distinct instruction types for dense and sparse data, achieving flexibility and efficiency at the instruction level. At the architecture level, we design an instruction-driven architecture of configurable, high-performance function units that supports the flexible and efficient ISA. Evaluations show that FEASTA achieves a 5.40× geometric-mean energy efficiency improvement over GPU across various workloads. FEASTA delivers 1.47× and 3.19× higher performance on sparse matrix multiplication kernels than a state-of-the-art sparse matrix accelerator and a CPU extension, respectively. Across diverse kernels, FEASTA achieves 1.69-12.70× energy efficiency over existing architectures.
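To make the "joining and merging fiber trees" abstraction concrete, the following is a minimal sketch (not FEASTA's actual ISA or implementation) of one level of a fiber tree as a sorted list of (coordinate, value) pairs. Elementwise multiplication of two sparse fibers produces output only where both have nonzeros (an intersection merge), while addition produces output where either does (a union merge); the function names are illustrative.

```python
def intersect_merge(a, b):
    """Two-pointer intersection merge: elementwise multiply of sparse fibers.

    a, b: lists of (coordinate, value) pairs sorted by coordinate.
    Output contains only coordinates present in both fibers.
    """
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        ca, va = a[i]
        cb, vb = b[j]
        if ca == cb:
            out.append((ca, va * vb))
            i += 1
            j += 1
        elif ca < cb:
            i += 1  # skip: no matching nonzero in b
        else:
            j += 1  # skip: no matching nonzero in a
    return out

def union_merge(a, b):
    """Two-pointer union merge: elementwise add of sparse fibers.

    Output contains every coordinate present in either fiber.
    """
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        ca, va = a[i]
        cb, vb = b[j]
        if ca == cb:
            out.append((ca, va + vb))
            i += 1
            j += 1
        elif ca < cb:
            out.append((ca, va))
            i += 1
        else:
            out.append((cb, vb))
            j += 1
    out.extend(a[i:])  # leftover tail of whichever fiber remains
    out.extend(b[j:])
    return out

x = [(0, 2.0), (3, 1.0), (7, 4.0)]
y = [(3, 5.0), (5, 6.0), (7, 0.5)]
print(intersect_merge(x, y))  # [(3, 5.0), (7, 2.0)]
print(union_merge(x, y))      # [(0, 2.0), (3, 6.0), (5, 6.0), (7, 4.5)]
```

Higher-order SpTA kernels (e.g., sparse matrix multiplication expressed in Einstein summation) compose such merges level by level down the fiber tree, which is why a small set of join/merge primitives can cover many sparse workloads.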