Abstract
Neural network processors and accelerators are domain-specific architectures deployed to meet the high computational demands of deep learning algorithms. This article proposes TCX, a new instruction set extension for tensor computing that augments Reduced Instruction Set Computer (RISC) instructions with variable-length tensor extensions. It features a multi-dimensional register file, dimension registers, and fully generic tensor instructions. It can be integrated seamlessly into existing RISC Instruction Set Architectures and provides software compatibility across scalable hardware implementations. We present a tensor accelerator implementation of the tensor extensions built on an out-of-order RISC microarchitecture. The tensor accelerator scales from several hundred to tens of thousands of computation units. An optimized register-renaming mechanism allows many physical tensor registers without requiring architectural support for large tensor register names. We describe new tensor load and store instructions that reduce bandwidth requirements by using the tensor dimension registers. Implementations may balance data bandwidth against computation utilization for different types of tensor computation, such as element-wise, depthwise, and matrix multiplication. We characterize the computation precision of tensor operations to balance area, generality, and accuracy loss for several well-known neural networks. The TCX processor runs at 1 GHz and sustains 8.2 tera-operations per second (TOPS) using a 4,096 multiply-accumulate compute unit. It occupies 12.8 mm² and dissipates 0.46 W/TOPS in TSMC 28-nm technology.
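As a rough illustration of the programming model the abstract outlines (operand shapes held in dimension registers, so one generic opcode serves any tensor size), the following C sketch models what a TCX-style tensor tile load and matrix-multiply instruction might compute. All identifiers here (`dim_reg_t`, `tcx_load_tile`, `tcx_matmul`) are hypothetical illustrations under assumed semantics, not the actual TCX ISA encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical dimension register: the shape of a 2-D tensor operand,
 * held separately from the instruction encoding. */
typedef struct {
    size_t rows;
    size_t cols;
} dim_reg_t;

/* Hypothetical tile load: one descriptor moves a rows x cols tile out of
 * a larger row-major tensor with line stride `pitch`, replacing
 * rows*cols scalar loads -- one way dimension registers could reduce
 * instruction and address bandwidth, as the abstract claims. */
static void tcx_load_tile(int8_t *dst, const int8_t *src,
                          dim_reg_t d, size_t pitch)
{
    for (size_t i = 0; i < d.rows; i++)
        for (size_t j = 0; j < d.cols; j++)
            dst[i * d.cols + j] = src[i * pitch + j];
}

/* Hypothetical generic matrix-multiply instruction: C += A * B, with
 * shapes taken from dimension registers (da.cols must equal B's row
 * count), so the same opcode covers any problem size -- the
 * "variable length" property of the extension. */
static void tcx_matmul(const int8_t *a, const int8_t *b, int32_t *c,
                       dim_reg_t da, dim_reg_t db)
{
    for (size_t i = 0; i < da.rows; i++)           /* M */
        for (size_t j = 0; j < db.cols; j++)       /* N */
            for (size_t k = 0; k < da.cols; k++)   /* K */
                c[i * db.cols + j] +=
                    (int32_t)a[i * da.cols + k] * b[k * db.cols + j];
}

int main(void)
{
    int8_t  a[2 * 3] = {1, 2, 3, 4, 5, 6};   /* 2x3 operand */
    int8_t  b[3 * 2] = {1, 0, 0, 1, 1, 1};   /* 3x2 operand */
    int32_t c[2 * 2] = {0};                  /* 2x2 accumulator */

    dim_reg_t da = {2, 3}, db = {3, 2};
    tcx_matmul(a, b, c, da, db);  /* c = {{4, 5}, {10, 11}} */
    return 0;
}
```

Note the int8 inputs with int32 accumulation: a common precision choice for inference accelerators and consistent with the abstract's concern for balancing area, generality, and accuracy loss, though the actual TCX precision characterization is given in the paper itself.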