ABSTRACT
Sparse matrix-vector multiplication (SpMV) is a vital computational primitive in modern workloads. SpMV is dominated by irregular memory accesses, which cause unnecessary data transfers, massive data movement, and redundant multiply-accumulate operations. We therefore propose a near spin-transfer torque magnetic random access memory (STT-MRAM) processing architecture built around three optimizations: (1) the NMP controller receives instructions over the AXI4 bus to carry out the SpMV operation, identifying valid data and encoding indices according to the kernel size; (2) the NMP controller applies high-level synthesis dataflow to a shared buffer, improving throughput without consuming bus bandwidth; and (3) configurable MACs in the NMP core eliminate the matching step entirely during multiplication. With these optimizations, the NMP architecture accesses the pipelined STT-MRAM at a read bandwidth of 26.7 GB/s. Simulation results show that this design achieves up to 66x and 28x speedups over state-of-the-art designs, and a 69x speedup over a baseline without sparse optimization.
Index Terms
- Toward Energy-Efficient Sparse Matrix-Vector Multiplication with near STT-MRAM Computing Architecture