DOI: 10.1145/3524059.3532369
Research article · Public Access

Dense dynamic blocks: optimizing SpMM for processors with vector and matrix units using machine learning techniques

Published: 28 June 2022

Abstract

Recent processors have been augmented with matrix-multiply units that operate on small matrices, creating a functional-unit-rich environment. These units have been successfully employed for dense matrix operations such as those found in the Basic Linear Algebra Subprograms (BLAS). In this work, we exploit these new matrix-multiply facilities to speed up sparse matrix-dense matrix multiplication (SpMM) for highly sparse matrices.
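To make the target operation concrete, the following is a minimal sketch of SpMM, computing C = A·B with a sparse A stored in CSR form and a dense B. It is purely illustrative; the paper's kernels are hand-tuned for POWER10 rather than built on SciPy, and the shapes and density below are arbitrary assumptions.

```python
# Minimal SpMM sketch: C = A @ B, with sparse A (CSR) and dense B.
# Shapes and density are arbitrary assumptions, not from the paper.
import numpy as np
import scipy.sparse as sp

m, k, n = 10_000, 10_000, 32          # A is m x k, B is k x n
A = sp.random(m, k, density=1e-4, format="csr", dtype=np.float64)
B = np.random.rand(k, n)

C = A @ B                             # the SpMM operation itself
print(C.shape)                        # (10000, 32)
```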
SpMM is hard to optimize. Sparsity patterns lead to highly irregular memory access behavior. Additionally, each sparse matrix has unique characteristics, making it difficult to find a single SpMM strategy that works well for all sparse matrices. The addition of matrix-multiply units makes this even more challenging.
In this paper, we address these challenges. First, we design Dense Dynamic Blocks (DDB), a method that exploits the new matrix units. DDB has two specialized versions: DDB-MM and DDB-HYB. DDB-MM is a strategy that uses only the matrix-multiply facilities. DDB-HYB is a hybrid approach that maximizes floating-point throughput by using both vector and matrix units. Furthermore, we design SpMM-OPT, a prediction mechanism that identifies the best SpMM strategy for a given sparse matrix and dense matrix pair. SpMM-OPT selects among vector-unit-oriented, matrix-unit-oriented, and hybrid strategies for the highest floating-point throughput while taking cache optimizations into account.
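As a rough illustration of the dense-block idea (our reading of the abstract, not the paper's actual DDB algorithm), the sketch below packs the nonzero columns of each small row block of A into a dense tile and multiplies it against the matching rows of B with a dense micro-GEMM, the role a matrix unit such as POWER10's MMA plays in a real kernel. The tile size, the packing scheme, and the function name spmm_dense_blocks are assumptions.

```python
# Hedged sketch of the dense-block idea: pack the nonzero columns of a
# small row block of A into a dense tile, then run a dense micro-GEMM
# against the matching rows of B. Tile size and packing are assumptions.
import numpy as np
import scipy.sparse as sp

def spmm_dense_blocks(A_csr, B, row_tile=4):
    m = A_csr.shape[0]
    C = np.zeros((m, B.shape[1]), dtype=B.dtype)
    for r0 in range(0, m, row_tile):
        r1 = min(r0 + row_tile, m)
        block = A_csr[r0:r1]                 # small sparse row block
        cols = np.unique(block.indices)      # columns holding any nonzero
        if cols.size == 0:
            continue                         # all-zero block: nothing to do
        tile = block[:, cols].toarray()      # packed dense tile
        C[r0:r1] = tile @ B[cols]            # dense micro-GEMM step
    return C

A = sp.random(256, 256, density=0.02, format="csr", dtype=np.float64)
B = np.random.rand(256, 8)
assert np.allclose(spmm_dense_blocks(A, B), A @ B)
```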
We experiment with 440 matrices from the well-known SuiteSparse matrix collection on a POWER10 system with vector and matrix units. We show that DDB-MM and DDB-HYB achieve floating-point throughput of up to 1.1 and 2.5 TFLOP/s on a POWER10 single-chip module for double- and single-precision SpMM, respectively. Our analysis also shows that SpMM-OPT effectively chooses the best SpMM strategy and achieves an average speedup of up to 2X over an optimized CSR baseline.
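The strategy-selection step can be pictured as a lightweight classifier mapping cheap sparsity features of a matrix to the kernel expected to be fastest, in the spirit of SpMM-OPT. In the sketch below the feature set, the strategy labels, and the training data are fabricated for illustration; this page does not describe the paper's actual predictor.

```python
# Hedged sketch of SpMM-OPT-style strategy selection: a decision tree
# maps cheap sparsity features to a kernel choice. Features, labels,
# and training data are fabricated here purely for illustration.
import numpy as np
import scipy.sparse as sp
from sklearn.tree import DecisionTreeClassifier

def features(A_csr):
    nnz_per_row = np.diff(A_csr.indptr)
    m, k = A_csr.shape
    return [m,                                # number of rows
            A_csr.nnz / m,                    # mean nonzeros per row
            float(nnz_per_row.std()),         # row-length irregularity
            A_csr.nnz / (m * k)]              # overall density

# Toy training set: each matrix labeled with the strategy assumed fastest.
mats = [sp.random(512, 512, density=d, format="csr", random_state=i)
        for i, d in enumerate((1e-4, 1e-3, 1e-2, 5e-2))]
X = [features(A) for A in mats]
y = ["csr-vector", "csr-vector", "ddb-hyb", "ddb-mm"]

model = DecisionTreeClassifier(max_depth=3).fit(X, y)
test = sp.random(512, 512, density=2e-2, format="csr", random_state=42)
print("predicted strategy:", model.predict([features(test)])[0])
```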




Published In
ICS '22: Proceedings of the 36th ACM International Conference on Supercomputing
June 2022, 514 pages
ISBN: 9781450392815
DOI: 10.1145/3524059

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. IBM POWER10
2. SpMM
3. matrix-multiply assist
4. sparse matrix-matrix multiply

Qualifiers

• Research-article

Conference

ICS '22

Acceptance Rates

Overall Acceptance Rate: 629 of 2,180 submissions, 29%


Cited By

• (2025) LSSM-SpMM: A Long-Row Splitting and Short-Row Merging Approach for Parallel SpMM on PEZY-SC3s. Algorithms and Architectures for Parallel Processing, 78-97. DOI: 10.1007/978-981-96-1551-3_7. Online publication date: 17-Feb-2025.
• (2024) PyGim: An Efficient Graph Neural Network Library for Real Processing-In-Memory Architectures. Proceedings of the ACM on Measurement and Analysis of Computing Systems 8:3, 1-36. DOI: 10.1145/3700434. Online publication date: 13-Dec-2024.
• (2024) Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 1200-1217. DOI: 10.1145/3620665.3640427. Online publication date: 27-Apr-2024.
• (2024) HotTiles: Accelerating SpMM with Heterogeneous Accelerator Architectures. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 1012-1028. DOI: 10.1109/HPCA57654.2024.00081. Online publication date: 2-Mar-2024.
• (2024) HA-SpMM: A Hybrid Feature-Based Adaptive SpMM Algorithm on GPU. 2024 China Automation Congress (CAC), 1231-1236. DOI: 10.1109/CAC63892.2024.10864726. Online publication date: 1-Nov-2024.
• (2023) SPADE: A Flexible and Scalable Accelerator for SpMM and SDDMM. Proceedings of the 50th Annual International Symposium on Computer Architecture, 1-15. DOI: 10.1145/3579371.3589054. Online publication date: 17-Jun-2023.