DOI: 10.1145/3297858.3304041

Bit-Tactical: A Software/Hardware Approach to Exploiting Value and Bit Sparsity in Neural Networks

Published: 04 April 2019

Abstract

Weight and activation sparsity can be leveraged in hardware to boost the performance and energy efficiency of Deep Neural Networks during inference. Fully capitalizing on sparsity requires re-scheduling and mapping the execution stream to deliver non-zero weight/activation pairs to multiplier units for maximal utilization and reuse. However, permitting arbitrary value re-scheduling in memory space and in time places a considerable burden on hardware to perform dynamic, run-time routing and matching of values, and incurs significant energy inefficiencies. Bit-Tactical (TCL) is a neural network accelerator where the responsibility for exploiting weight sparsity is shared between a novel static scheduling middleware and a co-designed hardware front-end with a lightweight sparse shuffling network comprising two (2- to 8-input) multiplexers per activation input. We empirically motivate two back-end designs chosen to target bit-sparsity in activations, rather than value-sparsity, with two benefits: a) we avoid handling the dynamically sparse whole-value activation stream, and b) we uncover more ineffectual work. Across a variety of neural networks, TCL outperforms other state-of-the-art accelerators that target weight and activation sparsity, the dynamic precision requirements of activations, or their bit-level sparsity.
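To make the division of labor concrete, below is a minimal Python sketch, not the authors' implementation, of the two ideas the abstract describes: a static lookahead/lookaside weight scheduler that promotes non-zero weights into slots a small per-lane multiplexer could reach, and a count of effectual activation bits illustrating why bit sparsity exposes more ineffectual work than whole-value sparsity. The window sizes, the names schedule_weights and effectual_bits, and the wrap-around lane indexing are illustrative assumptions, not details taken from the paper.

# Illustrative sketch only; parameter names and window shapes are assumptions.
LOOKAHEAD = 2   # assumed: how many time steps ahead a weight may be promoted from
LOOKASIDE = 1   # assumed: how many neighbouring lanes a weight may be taken from

def schedule_weights(weight_lanes):
    """weight_lanes: list of per-lane weight streams (lists of equal length).
    Returns, for each time step and lane, either None (idle) or a
    (src_lane, src_step, value) triple naming the weight that lane's
    multiplexer would select that cycle."""
    lanes = len(weight_lanes)
    steps = len(weight_lanes[0])
    consumed = [[False] * steps for _ in range(lanes)]
    schedule = []
    for t in range(steps):
        row = []
        for lane in range(lanes):
            pick = None
            # Slots this lane's small mux could reach: its own current slot,
            # then slots up to LOOKAHEAD steps ahead in nearby lanes.
            candidates = [(lane, t)]
            for dt in range(1, LOOKAHEAD + 1):
                for dl in range(-LOOKASIDE, LOOKASIDE + 1):
                    sl, st = (lane + dl) % lanes, t + dt
                    if st < steps:
                        candidates.append((sl, st))
            for sl, st in candidates:
                if not consumed[sl][st] and weight_lanes[sl][st] != 0:
                    pick = (sl, st, weight_lanes[sl][st])
                    consumed[sl][st] = True
                    break
            row.append(pick)
        schedule.append(row)
    return schedule

def effectual_bits(activation, bits=16):
    """Count the non-zero bits of a non-negative fixed-point activation:
    with bit-serial arithmetic only these positions require work."""
    return bin(activation & ((1 << bits) - 1)).count("1")

if __name__ == "__main__":
    # Toy example: two weight lanes with many zeros (e.g. after pruning).
    lanes = [[3, 0, 0, 7],
             [0, 5, 0, 0]]
    for t, row in enumerate(schedule_weights(lanes)):
        print(f"step {t}: {row}")
    # Whole-value sparsity sees one non-zero activation; bit sparsity sees
    # that only 2 of 16 bit positions are effectual.
    print("effectual bits of 0x0104:", effectual_bits(0x0104))

Running the toy example compacts the three non-zero weights of the two sparse lanes into the first two schedule steps, leaving the remaining steps idle, while the bit count shows an activation that whole-value sparsity would treat as fully effectual but that bit-serial hardware could process in just two terms.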





      Published In

      ASPLOS '19: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems
      April 2019
      1126 pages
      ISBN:9781450362405
      DOI:10.1145/3297858


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. deep learning acceleration
      2. sparsity

      Qualifiers

      • Research-article

      Conference

      ASPLOS '19

      Acceptance Rates

ASPLOS '19 paper acceptance rate: 74 of 351 submissions (21%)
Overall acceptance rate: 535 of 2,713 submissions (20%)


      Article Metrics

• Downloads (last 12 months): 200
• Downloads (last 6 weeks): 9
      Reflects downloads up to 17 Feb 2025


      Cited By

• (2025) Bit-Sparsity Aware Acceleration With Compact CSD Code on Generic Matrix Multiplication. IEEE Transactions on Computers, 74:2 (414-426). DOI: 10.1109/TC.2024.3483632. Online publication date: Feb-2025.
• (2024) Mentor: A Memory-Efficient Sparse-dense Matrix Multiplication Accelerator Based on Column-Wise Product. ACM Transactions on Architecture and Code Optimization, 21:4 (1-25). DOI: 10.1145/3688612. Online publication date: 20-Nov-2024.
• (2024) Dyn-Bitpool: A Two-sided Sparse CIM Accelerator Featuring a Balanced Workload Scheme and High CIM Macro Utilization. Proceedings of the 61st ACM/IEEE Design Automation Conference (1-6). DOI: 10.1145/3649329.3655690. Online publication date: 23-Jun-2024.
• (2024) Commercial Evaluation of Zero-Skipping MAC Design for Bit Sparsity Exploitation in DL Inference. 2024 IFIP/IEEE 32nd International Conference on Very Large Scale Integration (VLSI-SoC) (1-4). DOI: 10.1109/VLSI-SoC62099.2024.10767792. Online publication date: 6-Oct-2024.
• (2024) General Purpose Deep Learning Accelerator Based on Bit Interleaving. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 43:5 (1470-1483). DOI: 10.1109/TCAD.2023.3342728. Online publication date: May-2024.
• (2024) A Precision-Scalable Deep Neural Network Accelerator With Activation Sparsity Exploitation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 43:1 (263-276). DOI: 10.1109/TCAD.2023.3310916. Online publication date: Jan-2024.
• (2024) 3A-ReRAM: Adaptive Activation Accumulation in ReRAM-Based CNN Accelerator. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 43:1 (176-188). DOI: 10.1109/TCAD.2023.3297968. Online publication date: Jan-2024.
• (2024) Bit-Balance: Model-Hardware Codesign for Accelerating NNs by Exploiting Bit-Level Sparsity. IEEE Transactions on Computers, 73:1 (152-163). DOI: 10.1109/TC.2023.3324477. Online publication date: 1-Jan-2024.
• (2024) ERA-BS: Boosting the Efficiency of ReRAM-Based PIM Accelerator With Fine-Grained Bit-Level Sparsity. IEEE Transactions on Computers, 73:9 (2320-2334). DOI: 10.1109/TC.2023.3290869. Online publication date: Sep-2024.
• (2024) VSPIM: SRAM Processing-in-Memory DNN Acceleration via Vector-Scalar Operations. IEEE Transactions on Computers, 73:10 (2378-2390). DOI: 10.1109/TC.2023.3285095. Online publication date: Oct-2024.
