research-article

Toward Energy-efficient STT-MRAM-based Near Memory Computing Architecture for Embedded Systems

Published: 25 April 2024

Abstract

Convolutional Neural Networks (CNNs) have significantly impacted embedded system applications across various domains. However, their adoption exacerbates the real-time processing and hardware resource constraints that embedded systems already face. To tackle these issues, we propose a spin-transfer torque magnetic random-access memory (STT-MRAM)-based near memory computing (NMC) design for embedded systems. We optimize this design in three respects. First, a fast-pipelined STT-MRAM readout scheme provides higher memory bandwidth for the NMC design, enhancing real-time processing capability with a non-trivial area overhead. Second, a direct index compression format, used in conjunction with a digital sparse matrix-vector multiplication (SpMV) accelerator, supports the varied matrices found in practical applications and alleviates computing resource requirements. Third, custom NMC instructions and a stream converter for NMC systems dynamically adjust the available hardware resources for better utilization. Experimental results demonstrate that the memory bandwidth of the STT-MRAM reaches 26.7 GB/s. The digital SpMV accelerator improves energy consumption and latency by up to 64× and 1,120×, respectively, across matrices with sparsity ranging from 10% to 99.8%. The transmission of single-precision and double-precision elements increases by up to 8× and 9.6×, respectively. Furthermore, our design achieves up to 15.9× the throughput of state-of-the-art designs.
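
For readers unfamiliar with SpMV, the following minimal Python sketch shows the operation on a compressed matrix. The abstract does not describe the layout of the direct index compression format, so the sketch uses the widely known compressed sparse row (CSR) layout as a stand-in. The function names `dense_to_csr`, `spmv_csr`, and `sparsity` are hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def dense_to_csr(dense: Sequence[Sequence[float]]) -> Tuple[List[float], List[int], List[int]]:
    """Compress a dense matrix into CSR arrays (values, col_idx, row_ptr).

    CSR is used here as an assumed stand-in; it is not the paper's format.
    """
    values: List[float] = []
    col_idx: List[int] = []
    row_ptr: List[int] = [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)      # keep only the nonzero entries
                col_idx.append(j)     # remember the column of each nonzero
        row_ptr.append(len(values))   # end of this row in the values array
    return values, col_idx, row_ptr


def spmv_csr(values: Sequence[float], col_idx: Sequence[int],
             row_ptr: Sequence[int], x: Sequence[float]) -> List[float]:
    """Compute y = A @ x, touching only the stored nonzeros of A."""
    y: List[float] = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def sparsity(dense: Sequence[Sequence[float]]) -> float:
    """Fraction of entries that are zero (e.g. 0.998 for 99.8% sparsity)."""
    total = sum(len(row) for row in dense)
    zeros = sum(1 for row in dense for v in row if v == 0)
    return zeros / total


if __name__ == "__main__":
    A = [[0, 2, 0],
         [1, 0, 3],
         [0, 0, 0]]
    vals, cols, ptrs = dense_to_csr(A)
    print(spmv_csr(vals, cols, ptrs, [1, 2, 3]))
    print(sparsity(A))
```

The inner loop runs once per stored nonzero, so the work and the memory traffic shrink as sparsity grows. That is the general reason a compressed format can reduce computing resource requirements for highly sparse matrices.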

Cited By

  • (2024) Cost Minimization of Digital Twin Placements in Mobile Edge Computing. ACM Transactions on Sensor Networks 20, 3 (1–26). DOI: 10.1145/3658449. Online publication date: 6 May 2024.

        Published In

        ACM Transactions on Embedded Computing Systems, Volume 23, Issue 3
        May 2024
        452 pages
        EISSN: 1558-3465
        DOI: 10.1145/3613579
        Editor: Tulika Mitra

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 25 April 2024
        Online AM: 07 March 2024
        Accepted: 28 February 2024
        Revised: 19 January 2024
        Received: 10 February 2023
        Published in TECS Volume 23, Issue 3

        Author Tags

        1. Real-time processing
        2. compressed format
        3. computing resources
        4. digital accelerator
        5. throughput

        Qualifiers

        • Research-article

        Funding Sources

        • Tencent Foundation through the XPLORER PRIZE
        • National Key Research and Development Program of China
        • National Natural Science Foundation of China
        • Key Research and Development Program of Anhui Province
