research-article

CLU: A Near-Memory Accelerator Exploiting the Parallelism in Convolutional Neural Networks

Authors:
Palash Das

Indian Institute of Technology Guwahati, Guwahati, Assam

Indian Institute of Technology Guwahati, Guwahati, Assam
View Profile

,
Hemangee K. Kapoor

Indian Institute of Technology Guwahati, Guwahati, Assam

Indian Institute of Technology Guwahati, Guwahati, Assam
View Profile

ACM Journal on Emerging Technologies in Computing Systems Volume 17 Issue 2Article No.: 22pp 1–25https://doi.org/10.1145/3427472

Published:15 April 2021Publication History

ACM Journal on Emerging Technologies in Computing Systems

Abstract

Convolutional/Deep Neural Networks (CNNs/DNNs) are rapidly growing workloads for the emerging AI-based systems. The gap between the processing speed and the memory-access latency in multi-core systems affects the performance and energy efficiency of the CNN/DNN tasks. This article aims to alleviate this gap by providing a simple and yet efficient near-memory accelerator-based system that expedites the CNN inference. Towards this goal, we first design an efficient parallel algorithm to accelerate CNN/DNN tasks. The data is partitioned across the multiple memory channels (vaults) to assist in the execution of the parallel algorithm. Second, we design a hardware unit, namely the convolutional logic unit (CLU), which implements the parallel algorithm. To optimize the inference, the CLU is designed, and it works in three phases for layer-wise processing of data. Last, to harness the benefits of near-memory processing (NMP), we integrate homogeneous CLUs on the logic layer of the 3D memory, specifically the Hybrid Memory Cube (HMC). The combined effect of these results in a high-performing and energy-efficient system for CNNs/DNNs. The proposed system achieves a substantial gain in the performance and energy reduction compared to multi-core CPU- and GPU-based systems with a minimal area overhead of 2.37%.

References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. Tensorflow: A system for large-scale machine learning. In OSDI, 16, 265--283.Google ScholarDigital Library
Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2016. A scalable processing-in-memory accelerator for parallel graph processing. ACM SIGARCH Computer Architecture News 43, 3 (2016), 105--117.Google ScholarDigital Library
Junwhan Ahn, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture. In Proceedings of the 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), IEEE, 336--348.Google ScholarDigital Library
Shaahin Angizi, Zhezhi He, Farhana Parveen, and Deliang Fan. 2018. IMCE: Energy-efficient bit-wise in-memory convolution engine for deep neural network. In Proceedings of the 23rd Asia and South Pacific Design Automation Conference. IEEE Press, 111--116.Google ScholarDigital Library
Shaahin Angizi, Zhezhi He, Adnan Siraj Rakin, and Deliang Fan. 2018. CMP-PIM: An energy-efficient comparator-based processing-in-memory neural network accelerator. In Proceedings of the 55th Annual Design Automation Conference. ACM, 105.Google ScholarDigital Library
Erfan Azarkhish, Davide Rossi, Igor Loi, and Luca Benini. 2018. Neurostream: Scalable and energy efficient deep learning with smart memory cubes. IEEE Transactions on Parallel & Distributed Systems1 (2018), 420--434.Google Scholar
Srimat Chakradhar, Murugan Sankaradas, Venkata Jakkula, and Srihari Cadambi. 2010. A dynamically configurable coprocessor for convolutional neural networks. ACM SIGARCH Computer Architecture News 38, 3 (2010), 247--257.Google ScholarDigital Library
Xue-Wen Chen and Xiaotong Lin. 2014. Big data deep learning: Challenges and perspectives. IEEE Access 2 (2014), 514--525.Google ScholarCross Ref
Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 609--622.Google ScholarDigital Library
Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. 2017. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits 52, 1 (2017), 127--138.Google ScholarCross Ref
Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Andrew Ng. 2013. Deep learning with COTS HPC systems. In Proceedings of the International Conference on Machine Learning. 1337--1345.Google Scholar
Hybrid Memory Cube Consortium. 2013. Hybrid memory cube specification 1.0. Last Revision Jan (2013).Google Scholar
Francesco Conti and Luca Benini. 2015. A ultra-low-energy convolution engine for fast brain-inspired vision in multicore clusters. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition. EDA Consortium, 683--688.Google ScholarCross Ref
George E. Dahl, Tara N. Sainath, and Geoffrey E. Hinton. 2013. Improving deep neural networks for LVCSR using rectified linear units and dropout. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 8609--8613.Google Scholar
Palash Das and Hemangee K. Kapoor. 2018. Towards near-data processing of compare operations in 3D-stacked memory. In Proceedings of the 2018 Great Lakes Symposium on VLSI. ACM, 243--248.Google Scholar
P. Das and H. K. Kapoor. 2020. nZESPA: A near-3D-memory zero skipping parallel accelerator for CNNs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2020), 1--13.Google Scholar
Palash Das, Shivam Lakhotia, Prabodh Shetty, and Hemangee K. Kapoor. 2018. Towards near data processing of convolutional neural networks. In Proceedings of the 2018 31st International Conference on VLSI Design and 2018 17th International Conference on Embedded Systems (VLSID). IEEE, 380--385.Google Scholar
Li Du, Yuan Du, Yilei Li, Junjie Su, Yen-Cheng Kuan, Chun-Chen Liu, and Mau-Chung Frank Chang. 2018. A reconfigurable streaming deep convolutional neural network accelerator for internet of things. IEEE Transactions on Circuits and Systems I: Regular Papers 65, 1 (2018), 198--208.Google ScholarCross Ref
Yin Fan, Xiangju Lu, Dian Li, and Yuanliu Liu. 2016. Video-based emotion recognition using CNN-RNN and C3D hybrid networks. In Proceedings of the 18th ACM International Conference on Multimodal Interaction. ACM, 445--450.Google ScholarDigital Library
Amin Farmahini-Farahani, Jung Ho Ahn, Katherine Morrow, and Nam Sung Kim. 2015. NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules. In Proceedings of the 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). IEEE, 283--295.Google ScholarCross Ref
Mingyu Gao, Grant Ayers, and Christos Kozyrakis. 2015. Practical near-data processing for in-memory analytics frameworks. In Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT). IEEE, 113--124.Google ScholarDigital Library
Mingyu Gao and Christos Kozyrakis. 2016. HRL: Efficient and flexible reconfigurable logic for near-data processing. In Proceedings of the 2016 IEEE 22nd International Symposium on High Performance Computer Architecture (HPCA). IEEE, 126--137.Google ScholarCross Ref
Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. 2017. Tetris: Scalable and efficient neural network acceleration with 3D memory. ACM SIGOPS Operating Systems Review 51, 2 (2017), 751--764.Google ScholarCross Ref
Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. Deep learning with limited numerical precision. In Proceedings of the International Conference on Machine Learning. 1737--1746.Google Scholar
Raia Hadsell, Pierre Sermanet, Jan Ben, Ayse Erkan, Marco Scoffier, Koray Kavukcuoglu, Urs Muller, and Yann LeCun. 2009. Learning long-range vision for autonomous off-road driving. Journal of Field Robotics 26, 2 (2009), 120--144.Google ScholarDigital Library
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770--778.Google ScholarCross Ref
Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Conference on Information & Knowledge Management. ACM, 2333--2338.Google ScholarDigital Library
Joe Jeddeloh and Brent Keeth. 2012. Hybrid memory cube new DRAM architecture increases density and performance. In Proceedings of the 2012 Symposium on VLSI Technology (VLSIT). IEEE, 87--88.Google ScholarCross Ref
Zhihao Jia, Matei Zaharia, and Alex Aiken. 2018. Beyond data and model parallelism for deep neural networks. arXiv preprint arXiv:1807.05358 (2018).Google Scholar
Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture. 1--12.Google Scholar
Yi Kang, Wei Huang, Seung-Moon Yoo, Diana Keen, Zhenzhou Ge, Vinh Lam, Pratap Pattnaik, and Josep Torrellas. 2012. FlexRAM: Toward an advanced intelligent memory system. In Proceedings of the 2012 IEEE 30th International Conference on Computer Design (ICCD). IEEE, 5--14.Google ScholarDigital Library
Duckhwan Kim, Jaeha Kung, Sek Chai, Sudhakar Yalamanchili, and Saibal Mukhopadhyay. 2016. Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory. In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). IEEE, 380--392.Google ScholarDigital Library
Duckhwan Kim, Taesik Na, Sudhakar Yalamanchili, and Saibal Mukhopadhyay. 2018. Deeptrain: A programmable embedded platform for training deep neural networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37, 11 (2018), 2360--2370.Google ScholarCross Ref
Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997 (2014).Google Scholar
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097--1105.Google ScholarDigital Library
Jinho Lee, Jongwook Chung, Jung Ho Ahn, and Kiyoung Choi. 2017. Excavating the hidden parallelism inside DRAM architectures with buffered compares. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25, 6 (2017), 1793--1806.Google ScholarDigital Library
Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen, and Norman P. Jouppi. 2009. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 469--480.Google Scholar
Shaoli Liu, Zidong Du, Jinhua Tao, Dong Han, Tao Luo, Yuan Xie, Yunji Chen, and Tianshi Chen. 2016. Cambricon: An instruction set architecture for neural networks. In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). IEEE, 393--405.Google ScholarDigital Library
Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3431--3440.Google ScholarCross Ref
N. Manohar, Y. H. Sharath Kumar, Radhika Rani, and G. Hemantha Kumar. 2019. Convolutional neural network with SVM for classification of animal images. In Emerging Research in Electronics, Computer Science and Technology. Springer, 527--537.Google Scholar
J. Murphy. 2017. Deep learning benchmarks of NVIDIA Tesla P100 PCIe Tesla K80 and Tesla M40 GPUs.Google Scholar
Andreas Nowatzyk, Fong Pong, and Ashley Saulsbury. 1996. Missing the memory wall: The case for processor/memory integration. In Proceedings of the 1996 23rd Annual International Symposium on Computer Architecture. IEEE, 90--90.Google Scholar
Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally. 2017. SCNN: An accelerator for compressed-sparse convolutional neural networks. In ACM SIGARCH Computer Architecture News, Vol. 45. ACM, 27--40.Google Scholar
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS’17).Google Scholar
J. Thomas Pawlowski. 2011. Hybrid memory cube (HMC). In Proceedings of the 2011 IEEE Hot Chips 23 Symposium (HCS). IEEE, 1--24.Google ScholarCross Ref
Maurice Peemen, Arnaud A. A. Setio, Bart Mesman, and Henk Corporaal. 2013. Memory-centric accelerator design for convolutional neural Networks. In Proceedings of the ICCD, vol. 2013. 13--19.Google Scholar
Matthew Pickett. 2010. The Materials Science of Titanium Dioxide Memristors. Ph.D. Dissertation. UC Berkeley.Google Scholar
Seth H. Pugsley, Jeffrey Jestes, Huihui Zhang, Rajeev Balasubramonian, Vijayalakshmi Srinivasan, Alper Buyuktosunoglu, Al Davis, and Feifei Li. 2014. NDC: Analyzing the impact of 3D-stacked memory+logic devices on mapreduce workloads. In Proceedings of the 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 190--200.Google ScholarCross Ref
Kiran Puttaswamy and Gabriel H. Loh. 2006. Thermal analysis of a 3D die-stacked high-performance microprocessor. In Proceedings of GLSVLSI. ACM, 19--24.Google Scholar
Rajat Raina, Anand Madhavan, and Andrew Y. Ng. 2009. Large-scale deep unsupervised learning using graphics processors. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 873--880.Google Scholar
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 234--241.Google ScholarCross Ref
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211--252.Google ScholarDigital Library
Murugan Sankaradas, Venkata Jakkula, Srihari Cadambi, Srimat Chakradhar, Igor Durdanovic, Eric Cosatto, and Hans Peter Graf. 2009. A massively parallel coprocessor for convolutional neural networks. In Proceedings of the 20th IEEE International Conference on Application-specific Systems, Architectures and Processors, 2009 (ASAP 2009).. IEEE, 53--60.Google ScholarDigital Library
Michael Schaffner, Frank K. Gürkaynak, Aljoscha Smolic, and Luca Benini. 2015. DRAM or no-DRAM?: Exploring linear solver architectures for image domain warping in 28 nm CMOS. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition. EDA Consortium, 707--712.Google ScholarCross Ref
Vivek Seshadri, Kevin Hsieh, Amirali Boroum, Donghyuk Lee, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons, and Todd C. Mowry. 2015. Fast bulk bitwise AND and OR in DRAM. IEEE Computer Architecture Letters 14, 2 (2015), 127--131.Google ScholarDigital Library
Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R. Stanley Williams, and Vivek Srikumar. 2016. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. ACM SIGARCH Computer Architecture News 44, 3 (2016), 14--26.Google ScholarDigital Library
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).Google Scholar
Vinay Sriram, David Cox, Kuen Hung Tsoi, and Wayne Luk. 2010. Towards an embedded biologically-inspired machine vision processor. In Proceedings of the 2010 International Conference on Field-Programmable Technology (FPT). IEEE, 273--278.Google ScholarCross Ref
JEDEC Standard. 2013. High bandwidth memory (HBM) DRAM. JESD235 (2013).Google Scholar
Yichuan Tang. 2013. Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239 (2013).Google Scholar
Sam Likun Xi, Oreoluwa Babarinsa, Manos Athanassoulis, and Stratos Idreos. 2015. Beyond the wall: Near-data processing for databases. In Proceedings of the 11th International Workshop on Data Management on New Hardware. ACM, 2.Google ScholarDigital Library
Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. 2017. Comparative study of CNN and RNN for natural language processing. arXiv preprint arXiv:1702.01923 (2017).Google Scholar
Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision. Springer, 818--833.Google Scholar
Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 161--170.Google ScholarDigital Library
Qiuling Zhu, Berkin Akin, H. Ekin Sumbul, Fazle Sadi, James C. Hoe, Larry Pileggi, and Franz Franchetti. 2013. A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing. In Proceedings of the 2013 IEEE International 3D Systems Integration Conference (3DIC). IEEE, 1--7.Google ScholarCross Ref

Index Terms

CLU: A Near-Memory Accelerator Exploiting the Parallelism in Convolutional Neural Networks
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Neural networks
2. Hardware
  1. Very large scale integration design
    1. Application-specific VLSI designs
      1. Application specific integrated circuits

Recommendations

Toward standardized near-data processing with unrestricted data placement for GPUs
SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

3D-stacked memory devices with processing logic can help alleviate the memory bandwidth bottleneck in GPUs. However, in order for such Near-Data Processing (NDP) memory stacks to be used for different GPU architectures, it is desirable to standardize ...
Read More
Towards Near-Data Processing of Compare Operations in 3D-Stacked Memory
GLSVLSI '18: Proceedings of the 2018 on Great Lakes Symposium on VLSI

The gap between the processing speed and memory access speed of the modern multi-core systems has become a bottleneck for the emerging data-intensive workloads. In this scenario, it has become a smarter idea to move some amount of computation closer to ...
Read More
Memory-system requirements for convolutional neural networks
MEMSYS '18: Proceedings of the International Symposium on Memory Systems

Energy efficiency of the underlying memory systems is a huge issue in most neural network accelerator designs. It is imperative to understand the characteristics and behavior of data in the algorithm of neural networks to gain an insight into their ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Journal on Emerging Technologies in Computing Systems Volume 17, Issue 2
Hardware and Algorithms for Efficient Machine Learning
April 2021
360 pages
ISSN:1550-4832
EISSN:1550-4840
DOI:10.1145/3446841
Editor:
Ramesh Karri
Polytechnic Institute of New York University, USA
Issue’s Table of Contents
Copyright © 2021 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States

Journal Family
ACM Journals for the Design of Smart and Connected Systems
Publication History
- Published: 15 April 2021
- Accepted: 1 September 2020
- Revised: 1 August 2020
- Received: 1 May 2020
Published in jetc Volume 17, Issue 2

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
3D-stacked memory
Convolutional neural networks
near-data processing
near-memory accelerators
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 4
  Total Citations
  View Citations
- 280
  Total Downloads
- Downloads (Last 12 months)47
- Downloads (Last 6 weeks)10
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format .

View HTML Format

CLU: A Near-Memory Accelerator Exploiting the Parallelism in Convolutional Neural Networks

ACM Journal on Emerging Technologies in Computing Systems

Abstract

References

Cited By

Index Terms

Recommendations

Toward standardized near-data processing with unrestricted data placement for GPUs

Towards Near-Data Processing of Compare Operations in 3D-Stacked Memory

Memory-system requirements for convolutional neural networks

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Journal Family

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

HTML Format

Caption

CLU: A Near-Memory Accelerator Exploiting the Parallelism in Convolutional Neural Networks

ACM Journal on Emerging Technologies in Computing Systems

Abstract

References

Cited By

Index Terms

Recommendations

Toward standardized near-data processing with unrestricted data placement for GPUs

Towards Near-Data Processing of Compare Operations in 3D-Stacked Memory

Memory-system requirements for convolutional neural networks

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Journal Family

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

HTML Format

Share this Publication link

Share on Social Media