DOI: 10.1145/3352460.3358305

ZCOMP: Reducing DNN Cross-Layer Memory Footprint Using Vector Extensions

Published: 12 October 2019

Abstract

Deep Neural Networks (DNNs) are becoming the prevalent approach in computer vision, machine learning, natural language processing, and speech recognition applications. Although DNNs are perceived as compute-intensive tasks, they also exert intense pressure on the capacity and bandwidth of the memory hierarchy, primarily due to the large intermediate data communicated across network layers. Prior work on hardware DNN accelerators leverages this cross-layer data sparsity via fully customized datapaths. However, dynamically compressing/expanding such data is a challenging task for general-purpose multiprocessors with virtual memory and hardware-managed coherent cache hierarchies.
In this paper, we observe that DNN intermediate data is either streamed sequentially or reshaped with a regular transformation between layers. Hence, accesses to this data can tolerate sequential or block-sequential compression/expansion without requiring random element retrieval. Based on this insight, we propose ZCOMP, a CPU vector ISA extension tailored for DNN cross-layer communication. ZCOMP compactly represents zero-value compression/expansion and fully automates metadata generation, storage, and retrieval, which eliminates the need for several extra instruction executions and registers. ZCOMP can be targeted at both inference and training to dynamically compress/expand cross-layer data before it is written to memory. Our evaluations on individual layers and end-to-end DNNs demonstrate that ZCOMP offers substantial data traffic reductions, both on-chip across the cache hierarchy and off-chip to DRAM, as well as performance improvements over no compression and existing AVX512 compression approaches.
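
As a rough illustration of the mechanism the abstract describes, the sketch below shows block-sequential zero-value compression/expansion built from existing AVX-512 intrinsics, i.e., the kind of software baseline ZCOMP is compared against, in which the per-vector bitmask metadata must be generated, stored, and reloaded with explicit instructions. The function names and the side-array metadata layout are illustrative assumptions, not taken from the paper.

    /* A minimal sketch, assuming AVX-512F hardware
       (compile with, e.g., gcc -O2 -mavx512f -mpopcnt).
       Nonzero floats are packed contiguously; one 16-bit mask per vector
       is kept in a separate metadata array so expansion can reinsert zeros. */
    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Compress n floats (n a multiple of 16); returns the packed count. */
    static size_t compress_zeros(const float *in, size_t n,
                                 float *out, uint16_t *masks) {
        size_t packed = 0;
        for (size_t i = 0; i < n; i += 16) {
            __m512 v = _mm512_loadu_ps(in + i);
            /* Bitmask of nonzero lanes: this is the metadata that ZCOMP
               would generate and store automatically. */
            __mmask16 k = _mm512_cmp_ps_mask(v, _mm512_setzero_ps(),
                                             _CMP_NEQ_OQ);
            masks[i / 16] = (uint16_t)k;
            /* Store only the nonzero lanes, contiguously. */
            _mm512_mask_compressstoreu_ps(out + packed, k, v);
            packed += (size_t)_mm_popcnt_u32(k);
        }
        return packed;
    }

    /* Expansion: reload the masks and reinsert zeros into n output floats. */
    static void expand_zeros(const float *in, const uint16_t *masks,
                             float *out, size_t n) {
        size_t consumed = 0;
        for (size_t i = 0; i < n; i += 16) {
            __mmask16 k = masks[i / 16];
            /* Scatter the packed values back into their original lanes,
               zero-filling the rest. */
            __m512 v = _mm512_maskz_expandloadu_ps(k, in + consumed);
            _mm512_storeu_ps(out + i, v);
            consumed += (size_t)_mm_popcnt_u32(k);
        }
    }

The explicit mask stores, mask reloads, and popcount bookkeeping above are precisely the per-vector overheads that ZCOMP's proposed instructions fold into the compression/expansion operations themselves.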

References

[1]
[n. d.]. CLOUD TPU. https://cloud.google.com/tpu/.
[2]
[n. d.]. DeepBench, Baidu. https://github.com/baidu-research/DeepBench.
[3]
[n. d.]. Intel Intrinsics Guide. https://software.intel.com/sites/landingpage/IntrinsicsGuide/.
[4]
[n. d.]. Intel Math Kernel Library (MKL). https://software.intel.com/en-us/mkl.
[5]
2018. Intel Architecture Instruction Set Extensions and Future Features Programming Reference. https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf.
[6]
2018. NVIDIA TURING GPU ARCHITECTURE. https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf.
[7]
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. [n. d.]. Tensorflow: a system for large-scale machine learning.
[8]
Vahideh Aklaghi, Amir Yazdanbakhsh, Kambiz Samadi, Rajesh K. Gupta, and Hadi Esmaeilzadeh. 2018. Snapea: Predictive early activation for reducing computation in deep convolutional neural networks. In ACM/IEEE Int. Symposium on Computer Architecture.
[9]
Alaa R Alameldeen and Rajat Agarwal. 2018. Opportunistic Compression for Direct-Mapped DRAM Caches. In Proceedings of the 4th Annual International Symposium on Memory Systems. 129--136.
[10]
Alaa R Alameldeen and David A Wood. 2004. Adaptive Cache Compression for High-Performance Processors. In Proceedings of the 31st ACM/IEEE Annual International Symposium on Computer Architecture, Munich, Germany. 212--223.
[11]
Alaa R Alameldeen and David A Wood. 2004. Frequent Pattern Compression" A Significance-Based Compression Scheme for L2 Caches. University of Wisconsin Department of Computer Sciences Technical Report 1500 (April 2004).
[12]
Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, and Andreas Moshovos. 2016. Cnvlutin: Ineffectual-neuron-free deep neural network computing. In ACM SIGARCH Computer Architecture News, Vol. 44. IEEE Press, 1--13.
[13]
Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder. 2016. Fused-layer CNN accelerators. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 22.
[14]
A Arelakis and Per Stenstrom. 2014. SC2: A Statistical Compression Cache Scheme. In Proceedings of the 41st ACM/IEEE Annual International Symposium on Computer Architecture, Minneapolis, MN. 145--156.
[15]
Trevor E. Carlson, Wim Heirman, Stijn Eyerman, Ibrahim Hur, and Lieven Eeckhout. 2014. An Evaluation of High-Level Mechanistic Core Models. ACM Transactions on Architecture and Code Optimization (TACO), Article 5 (2014), 23 pages. https://doi.org/10.1145/2629677
[16]
Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174 (2016).
[17]
X Chen, Yang L, R.P Dick, L Shang, and H Lekstsas. 2010. C-Pack: A High-Performance Microprocessor Cache Compression Algorithm. IEEE Transactions on VLSI Systems 18, 8 (2010), 1196--1208.
[18]
Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. 2014. Dadiannao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 609--622.
[19]
Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In ACM SIGARCH Computer Architecture News, Vol. 44. IEEE Press, 367--379.
[20]
Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).
[21]
Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. 2016. Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory. In ACM SIGARCH Computer Architecture News, Vol. 44. IEEE Press, 27--39.
[22]
Julien Dusser, Thomas Piquet, and Andre Seznec. 2009. Zero-Content Augmented Caches. In Proceedings of 23rd International Conference on Supercomputing. 46--55.
[23]
Agner Fog. 2018. Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs. (September 2018).
[24]
Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Logan Adams, Mahdi Ghandi, et al. 2018. A configurable cloud-scale DNN processor for real-time AI. In Proceedings of the 45th Annual International Symposium on Computer Architecture. IEEE Press, 1--14.
[25]
Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. 2017. Tetris: Scalable and efficient neural network acceleration with 3d memory. ACM SIGOPS Operating Systems Review 51, 2 (2017), 751--764.
[26]
Jayesh Gaur, Alaa R Alameldeen, and Sreenivas Subramoney. 2016. Base-Victim Compression: An Opportunistic Cache Compression Architecture. In Proceedings of the 43rd ACM/IEEE Annual International Symposium on Computer Architecture, Seoul, South Korea. 317--328. https://doi.org/10.1109/ISCA.2016.36
[27]
Evangelos Georganas, Sasikanth Avancha, Kunal Banerjee, Dhiraj Kalamkar, Greg Henry, Hans Pabst, and Alexander Heinecke. 2018. Anatomy of high-performance deep learning convolutions on SIMD architectures. arXiv preprint arXiv:1808.05567 (2018).
[28]
Erik G Hallnor and Steven K Reinhardt. 2005. A Unified Compressed Memory Hierarchy. In Proceedings of the 32nd ACM/IEEE Annual International Symposium on Computer Archtitecture, Madison, WI. 201--212.
[29]
Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. 2016. EIE: efficient inference engine on compressed deep neural network. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on. IEEE, 243--254.
[30]
Song Han, Huizi Mao, and William J Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015).
[31]
Kim Hazelwood, Sarah Bird, David Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy, Bill Jia, Yangqing Jia, Aditya Kalro, et al. 2018. Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In High Performance Computer Architecture (HPCA), 2018 IEEE International Symposium on. IEEE, 620--629.
[32]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770--778.
[33]
Parker Hill, Animesh Jain, Mason Hill, Babak Zamirai, Chang-Hong Hsu, Michael A Laurenzano, Scott Mahlke, Lingjia Tang, and Jason Mars. 2017. Deftnn: Addressing bottlenecks for dnn execution on gpus via synapse vector elimination and near-compute data fission. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 786--799.
[34]
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia. ACM, 675--678.
[35]
Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. 2017. In-datacenter performance analysis of a tensor processing unit. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on. IEEE, 1--12.
[36]
Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor M Aamodt, and Andreas Moshovos. 2016. Stripes: Bit-serial deep neural network computing. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE, 1--12.
[37]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097--1105.
[38]
Y. Kwon and M. Rhu. 2018. A Case for Memory-Centric HPC System Architecture for Training Deep Neural Networks. IEEE Computer Architecture Letters 17, 2 (July 2018), 134--138.
[39]
Yizhi Liu, Yao Wang, Ruofei Yu, Mu Li, Vin Sharma, and Yida Wang. 2018. Optimizing CNN Model Inference on CPUs. arXiv preprint arXiv:1809.02697 (2018).
[40]
Eriko Nurvitadhi, Ganesh Venkatesh, Jaewoong Sim, Debbie Marr, Randy Huang, Jason Ong Gee Hock, Yeong Tat Liew, Krishnan Srivatsan, Duncan Moss, Suchit Subhaschandra, et al. 2017. Can fpgas beat gpus in accelerating next-generation deep neural networks?. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 5--14.
[41]
Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W Keckler, and William J Dally. 2017. Scnn: An accelerator for compressed-sparse convolutional neural networks. In ACM SIGARCH Computer Architecture News, Vol. 45. ACM, 27--40.
[42]
Jongsoo Park, Maxim Naumov, Protonu Basu, Summer Deng, Aravind Kalaiah, Daya Khudia, James Law, Parth Malani, Andrey Malevich, Satish Nadathur, et al. 2018. Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications. arXiv preprint arXiv:1811.09886 (2018).
[43]
Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Phillip B Kozuch, Michael A adn Gibbons, and Todd C Mowry. 2012. Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Minneapolis, MN. 51--63.
[44]
Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, José Miguel Hernández-Lobato, Gu-Yeon Wei, and David Brooks. 2016. Minerva: Enabling low-power, highly-accurate deep neural network accelerators. In ACM SIGARCH Computer Architecture News, Vol. 44. IEEE Press, 267--278.
[45]
Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W Keckler. 2016. vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 18.
[46]
Minsoo Rhu, Mike O'Connor, Niladrish Chatterjee, Jeff Pool, Youngeun Kwon, and Stephen W Keckler. 2018. Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks. In High Performance Computer Architecture (HPCA), 2018 IEEE International Symposium on. IEEE, 78--91.
[47]
Somayeh Sardashti, Andre Seznec, and David A Wood. 2014. Skewed Compressed Caches. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. 331--342.
[48]
Somayeh Sardashti and David A Wood. 2013. Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching. In Proceedings of the 46th Annual International Symposium on Microarchitecture, Davis, CA.
[49]
Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R Stanley Williams, and Vivek Srikumar. 2016. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. ACM SIGARCH Computer Architecture News 44, 3 (2016), 14--26.
[50]
Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kyung Kim, Chenkai Shao, Asit Mishra, and Hadi Esmaeilzadeh. 2016. From high-level deep neural models to FPGAs. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 17.
[51]
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[52]
Linpeng Tang, Yida Wang, Theodore L Willke, and Kai Li. 2018. Scheduling Computation Graphs of Deep Learning Models on Manycore CPUs. arXiv preprint arXiv:1807.09667 (2018).
[53]
Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. 2011. Improving the speed of neural networks on CPUs. In Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011.
[54]
Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke. 2017. Scalpel: Customizing dnn pruning to the underlying hardware parallelism. In ACM SIGARCH Computer Architecture News, Vol. 45. ACM, 548--560.
[55]
Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. 2016. Cambricon-x: An accelerator for sparse neural networks. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 20.


Published In

MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture
October 2019
1104 pages
ISBN:9781450369381
DOI:10.1145/3352460

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. CPU
  2. Deep learning
  3. ISA
  4. compression
  5. memory system
  6. sparsity


Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%


