DOI: 10.1145/3297858.3304049
research-article
Public Access

PUMA: A Programmable Ultra-efficient Memristor-based Accelerator for Machine Learning Inference

Published: 04 April 2019

Abstract

Memristor crossbars are circuits capable of performing analog matrix-vector multiplications, overcoming the fundamental energy efficiency limitations of digital logic. They have been shown to be effective in special-purpose accelerators for a limited set of neural network applications. We present the Programmable Ultra-efficient Memristor-based Accelerator (PUMA), which enhances memristor crossbars with general-purpose execution units to enable the acceleration of a wide variety of Machine Learning (ML) inference workloads. PUMA's microarchitecture techniques, exposed through a specialized Instruction Set Architecture (ISA), retain the efficiency of in-memory computing and analog circuitry without compromising programmability. We also present the PUMA compiler, which translates high-level code to the PUMA ISA. The compiler partitions the computational graph and optimizes instruction scheduling and register allocation to generate code for large and complex workloads to run on thousands of spatial cores. We have developed a detailed architecture simulator that incorporates the functionality, timing, and power models of PUMA's components to evaluate performance and energy consumption. A PUMA accelerator running at 1 GHz can reach area and power efficiency of 577 GOPS/s/mm² and 837 GOPS/s/W, respectively. Our evaluation of diverse ML applications from image recognition, machine translation, and language modelling (5M-800M synapses) shows that PUMA achieves up to 2,446× energy and 66× latency improvement for inference compared to state-of-the-art GPUs. Compared to an application-specific memristor-based accelerator, PUMA incurs small energy overheads at similar inference latency while providing added programmability.
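
To make the analog matrix-vector multiplication concrete, here is a minimal functional sketch in Python/NumPy. It is an illustration under stated assumptions, not PUMA's actual circuit or ISA model: the conductance range, ADC width, and all names below are hypothetical. Weights are mapped onto a bounded cell-conductance range using a differential pair of columns, the crossbar sums per-cell currents along each column per Kirchhoff's current law, and an ADC of limited resolution digitizes the analog result.

    import numpy as np

    def crossbar_mvm(weights, inputs, g_min=1e-6, g_max=1e-4, adc_bits=8):
        """Approximate one weight matrix-vector product on an analog crossbar.

        Hypothetical parameters: g_min/g_max bound the cell conductance
        range (siemens); adc_bits is the output converter resolution.
        """
        w_max = np.abs(weights).max()
        # Map signed weights to conductances in [g_min, g_max] using a
        # differential pair: one array for positive, one for negative parts.
        scale = (g_max - g_min) / w_max
        g_pos = g_min + np.clip(weights, 0, None) * scale
        g_neg = g_min + np.clip(-weights, 0, None) * scale
        # Kirchhoff's current law: each output current is the sum of
        # per-cell conductance * input-voltage products down one column.
        i_out = g_pos @ inputs - g_neg @ inputs
        # ADC of limited resolution: quantize the analog column currents.
        i_max = float(np.abs(i_out).max()) or 1.0
        levels = 2 ** (adc_bits - 1) - 1
        i_q = np.round(i_out / i_max * levels) / levels * i_max
        # Undo the weight-to-conductance scaling to recover the product.
        return i_q / scale

    rng = np.random.default_rng(0)
    W, x = 0.1 * rng.standard_normal((64, 64)), rng.standard_normal(64)
    print(np.allclose(crossbar_mvm(W, x), W @ x, atol=0.1))  # True within ADC error

The sketch models only a single crossbar operation; per the abstract, PUMA's contribution is surrounding such crossbars with general-purpose execution units, an ISA, and a compiler so that complete ML inference workloads can be mapped across thousands of spatial cores.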


Published In

ASPLOS '19: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems
April 2019
1126 pages
ISBN: 9781450362405
DOI: 10.1145/3297858


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. accelerators
  2. machine learning
  3. memristors
  4. neural networks

Qualifiers

  • Research-article


Conference

ASPLOS '19

Acceptance Rates

ASPLOS '19 Paper Acceptance Rate 74 of 351 submissions, 21%;
Overall Acceptance Rate 535 of 2,713 submissions, 20%

