research-article

Toward Energy-efficient STT-MRAM-based Near Memory Computing Architecture for Embedded Systems

Authors:

Weisheng Zhao

ACM Transactions on Embedded Computing Systems, Volume 23, Issue 3

Article No.: 37, Pages 1 - 24

https://doi.org/10.1145/3650729

Published: 25 April 2024

Abstract

Convolutional Neural Networks (CNNs) have significantly impacted embedded system applications across various domains. However, their adoption exacerbates the real-time processing and hardware resource constraints of embedded systems. To tackle these issues, we propose a spin-transfer torque magnetic random-access memory (STT-MRAM)-based near memory computing (NMC) design for embedded systems. We optimize this design from three aspects. First, a fast-pipelined STT-MRAM readout scheme provides higher memory bandwidth for the NMC design, enhancing real-time processing capability with a non-trivial area overhead. Second, a direct index compression format, in conjunction with a digital sparse matrix-vector multiplication (SpMV) accelerator, supports the various matrices found in practical applications and alleviates computing resource requirements. Third, custom NMC instructions and a stream converter for NMC systems dynamically adjust available hardware resources for better utilization. Experimental results demonstrate that the memory bandwidth of STT-MRAM reaches 26.7 GB/s. The digital SpMV accelerator improves energy consumption and latency by up to 64× and 1,120×, respectively, across matrices with sparsity ranging from 10% to 99.8%. Transmission of single-precision and double-precision elements increases by up to 8× and 9.6×, respectively. Furthermore, our design achieves a throughput of up to 15.9× over state-of-the-art designs.
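
To make the SpMV workload named in the abstract concrete, the following is a minimal software sketch. The abstract does not describe the direct index compression format itself, so this sketch substitutes the standard compressed sparse row (CSR) layout as a stand-in; it is not the paper's format or accelerator. The function names (`dense_to_csr`, `spmv_csr`, `sparsity`) and the example matrix are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def dense_to_csr(matrix: Matrix) -> Tuple[List[float], List[int], List[int]]:
    """Convert a dense matrix to CSR arrays (values, col_idx, row_ptr).

    CSR is used here as an assumed stand-in; it is not the paper's
    direct index compression format.
    """
    values: List[float] = []
    col_idx: List[int] = []
    row_ptr: List[int] = [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)   # keep only nonzero entries
                col_idx.append(j)  # remember the column of each nonzero
        row_ptr.append(len(values))  # running count marks the end of the row
    return values, col_idx, row_ptr


def spmv_csr(values: List[float], col_idx: List[int],
             row_ptr: List[int], x: List[float]) -> List[float]:
    """Compute y = A @ x, touching only the stored nonzeros of A."""
    y: List[float] = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def sparsity(matrix: Matrix) -> float:
    """Fraction of entries that are zero (0.998 corresponds to 99.8%)."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for v in row if v == 0)
    return zeros / total


# Hypothetical 3x3 example matrix and input vector.
A = [[2.0, 0.0, 0.0],
     [0.0, 0.0, 3.0],
     [1.0, 0.0, 4.0]]
x = [1.0, 2.0, 3.0]

values, col_idx, row_ptr = dense_to_csr(A)
y = spmv_csr(values, col_idx, row_ptr, x)
```

In this sketch the multiply-accumulate count equals the number of stored nonzeros rather than the full matrix size, which is the general reason sparse formats reduce compute and data movement as sparsity grows.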

Cited By

Zhang Y, Liang W, Xu W, Xu Z, Jia X. (2024). Cost Minimization of Digital Twin Placements in Mobile Edge Computing. ACM Transactions on Sensor Networks 20:3 (1-26). DOI: 10.1145/3658449. Online publication date: 6-May-2024
https://dl.acm.org/doi/10.1145/3658449

Index Terms

Toward Energy-efficient STT-MRAM-based Near Memory Computing Architecture for Embedded Systems
1. Computer systems organization
  1. Real-time systems
2. Hardware
  1. Integrated circuits
    1. Reconfigurable logic and FPGAs
      1. Hardware accelerators
    2. Semiconductor memory


Information

Published In

ACM Transactions on Embedded Computing Systems Volume 23, Issue 3

May 2024

452 pages

EISSN:1558-3465

DOI:10.1145/3613579

Editor:
Tulika Mitra
National University of Singapore, Singapore

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 25 April 2024

Online AM: 07 March 2024

Accepted: 28 February 2024

Revised: 19 January 2024

Received: 10 February 2023

Published in TECS Volume 23, Issue 3

Qualifiers

Research-article

Funding Sources

Tencent Foundation through the XPLORER PRIZE
National Key Research and Development Program of China
National Natural Science Foundation of China
Key Research and Development Program of Anhui Province

Bibliometrics & Citations

Article Metrics

Total Citations: 1
Total Downloads: 850
Downloads (Last 12 months): 850
Downloads (Last 6 weeks): 92

Reflects downloads up to 25 Feb 2025
