ABSTRACT
Sparse matrix-vector multiplication (SpMV) is a vital computational primitive in modern workloads. SpMV is dominated by irregular memory accesses, which cause unnecessary data transfers, massive data movement, and redundant multiply-accumulate operations. We therefore propose a near spin-transfer torque magnetic random access memory (STT-MRAM) processing architecture built around three optimizations: (1) the NMP controller receives instructions over the AXI4 bus to carry out the SpMV operation, identifying valid data and encoding indices according to the kernel size; (2) the NMP controller applies high-level synthesis dataflow to a shared buffer, improving throughput without consuming bus bandwidth; and (3) configurable MACs in the NMP core eliminate the matching step entirely during multiplication. With these optimizations, the NMP architecture accesses the pipelined STT-MRAM at a read bandwidth of 26.7 GB/s. Simulation results show that this design achieves up to 66x and 28x speedups over state-of-the-art designs, and a 69x speedup over a baseline without sparse optimization.
Index Terms
- Toward Energy-Efficient Sparse Matrix-Vector Multiplication with near STT-MRAM Computing Architecture