TCX: A RISC Style Tensor Computing Extension and a Programmable Tensor Processor

Published: 19 April 2023

Abstract

Neural network processors and accelerators are domain-specific architectures deployed to meet the high computational requirements of deep learning algorithms. This article proposes TCX, a new instruction set extension for tensor computing that augments Reduced Instruction Set Computer (RISC) instructions with variable-length tensor extensions. It features a multi-dimensional register file, dimension registers, and fully generic tensor instructions. It can be seamlessly integrated into existing RISC Instruction Set Architectures and provides software compatibility across scalable hardware implementations. We present a tensor accelerator implementation of the tensor extensions using an out-of-order RISC microarchitecture. The tensor accelerator scales from several hundred to tens of thousands of computation units. An optimized register renaming mechanism is described that allows many physical tensor registers without requiring architectural support for long tensor register names. We describe new tensor load and store instructions that reduce bandwidth requirements by using tensor dimension registers. Implementations may balance data bandwidth and computation utilization for different types of tensor computations, such as element-wise, depthwise, and matrix multiplication. We characterize the computation precision of tensor operations to balance area, generality, and accuracy loss for several well-known neural networks. The TCX processor runs at 1 GHz and sustains 8.2 tera-operations per second (TOPS) with a 4,096 multiply-accumulate (MAC) compute unit. It occupies 12.8 mm² and dissipates 0.46 W/TOPS in TSMC 28-nm technology.
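As a quick consistency check on the headline throughput (assuming the usual convention that one multiply-accumulate counts as two operations, a multiply and an add), the 8.2 TOPS figure follows directly from the stated configuration:

$$4096~\text{MACs} \times 2~\tfrac{\text{ops}}{\text{MAC}} \times 1~\text{GHz} = 8.192 \times 10^{12}~\tfrac{\text{ops}}{\text{s}} \approx 8.2~\text{TOPS}.$$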
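To make the programming model concrete, below is a minimal, self-contained C sketch of how the dimension registers and generic tensor instructions described in the abstract might be exercised for a small tiled matrix multiply. This is a software model under assumed semantics, not the actual TCX ISA: all names (tcx_set_dim, tcx_load, tcx_mac, TDIM) are hypothetical.

```c
/* Illustrative software model (not the actual TCX ISA): sketches how a
 * dimension register plus generic tensor load/MAC instructions, as
 * described in the abstract, might be used for a tiled matrix multiply.
 * All names and tile sizes here are hypothetical. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TDIM 4                                    /* modeled tile edge (assumption) */

typedef struct { int rows, cols; }        dim_reg_t;  /* dimension register */
typedef struct { int32_t v[TDIM][TDIM]; } treg_t;     /* tensor register    */

static dim_reg_t DR;                      /* modeled architectural dimension register */

static void tcx_set_dim(int r, int c) { DR.rows = r; DR.cols = c; }

/* Tensor load: moves exactly DR.rows x DR.cols elements using a row
 * stride, so only the useful sub-tile consumes memory bandwidth. */
static void tcx_load(treg_t *t, const int32_t *mem, int stride) {
    memset(t, 0, sizeof *t);
    for (int i = 0; i < DR.rows; ++i)
        for (int j = 0; j < DR.cols; ++j)
            t->v[i][j] = mem[i * stride + j];
}

/* Generic tensor MAC: acc += a * b (matrix product over the tile). */
static void tcx_mac(treg_t *acc, const treg_t *a, const treg_t *b) {
    for (int i = 0; i < TDIM; ++i)
        for (int j = 0; j < TDIM; ++j)
            for (int k = 0; k < TDIM; ++k)
                acc->v[i][j] += a->v[i][k] * b->v[k][j];
}

int main(void) {
    int32_t A[TDIM * TDIM], B[TDIM * TDIM];
    for (int i = 0; i < TDIM * TDIM; ++i) { A[i] = i; B[i] = (i % 3) - 1; }

    treg_t ta, tb, tc = {0};
    tcx_set_dim(TDIM, TDIM);              /* describe operand shape once */
    tcx_load(&ta, A, TDIM);
    tcx_load(&tb, B, TDIM);
    tcx_mac(&tc, &ta, &tb);

    printf("C[0][0] = %d\n", tc.v[0][0]); /* prints -1 for this data */
    return 0;
}
```

The point of the dimension register in this sketch is that a single load instruction transfers exactly the DR.rows × DR.cols elements the computation needs, which is one plausible reading of how the abstract's tensor loads can trim bandwidth relative to fixed-shape vector loads.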


Published in

ACM Transactions on Embedded Computing Systems, Volume 22, Issue 3
May 2023, 546 pages
ISSN: 1539-9087
EISSN: 1558-3465
DOI: 10.1145/3592782
Editor: Tulika Mitra

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 19 April 2023
• Online AM: 18 October 2022
• Accepted: 10 October 2022
• Revised: 31 August 2022
• Received: 12 February 2022

Published in TECS Volume 22, Issue 3
