Abstract
Neural network processors and accelerators are domain-specific architectures deployed to meet the high computational demands of deep learning algorithms. This article proposes TCX, a new instruction set extension for tensor computing that augments Reduced Instruction Set Computer (RISC) instructions with variable-length tensor extensions. It features a multi-dimensional register file, dimension registers, and fully generic tensor instructions. It can be integrated seamlessly into existing RISC Instruction Set Architectures and provides software compatibility across scalable hardware implementations. We present a tensor accelerator implementation of the tensor extensions built on an out-of-order RISC microarchitecture. The tensor accelerator scales from several hundred to tens of thousands of computation units. An optimized register-renaming mechanism allows many physical tensor registers without requiring architectural support for large tensor register names. We describe new tensor load and store instructions that reduce bandwidth requirements by using the tensor dimension registers. Implementations may balance data bandwidth against computation utilization for different types of tensor computation, such as element-wise, depthwise, and matrix multiplication. We characterize the computation precision of tensor operations to balance area, generality, and accuracy loss for several well-known neural networks. The TCX processor runs at 1 GHz and sustains 8.2 tera-operations per second (TOPS) using a 4,096 multiply-accumulate compute unit. It occupies 12.8 mm² and dissipates 0.46 W/TOPS in TSMC 28-nm technology.
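As a rough illustration of the programming model the abstract outlines (operand shapes held in dimension registers, so one generic opcode serves any tensor size), the following C sketch models what a TCX-style tensor tile load and matrix-multiply instruction might compute. All identifiers here (`dim_reg_t`, `tcx_load_tile`, `tcx_matmul`) are hypothetical illustrations under assumed semantics, not the actual TCX ISA encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical dimension register: the shape of a 2-D tensor operand,
 * held separately from the instruction encoding. */
typedef struct {
    size_t rows;
    size_t cols;
} dim_reg_t;

/* Hypothetical tile load: one descriptor moves a rows x cols tile out of
 * a larger row-major tensor with line stride `pitch`, replacing
 * rows*cols scalar loads -- one way dimension registers could reduce
 * instruction and address bandwidth, as the abstract claims. */
static void tcx_load_tile(int8_t *dst, const int8_t *src,
                          dim_reg_t d, size_t pitch)
{
    for (size_t i = 0; i < d.rows; i++)
        for (size_t j = 0; j < d.cols; j++)
            dst[i * d.cols + j] = src[i * pitch + j];
}

/* Hypothetical generic matrix-multiply instruction: C += A * B, with
 * shapes taken from dimension registers (da.cols must equal B's row
 * count), so the same opcode covers any problem size -- the
 * "variable length" property of the extension. */
static void tcx_matmul(const int8_t *a, const int8_t *b, int32_t *c,
                       dim_reg_t da, dim_reg_t db)
{
    for (size_t i = 0; i < da.rows; i++)           /* M */
        for (size_t j = 0; j < db.cols; j++)       /* N */
            for (size_t k = 0; k < da.cols; k++)   /* K */
                c[i * db.cols + j] +=
                    (int32_t)a[i * da.cols + k] * b[k * db.cols + j];
}

int main(void)
{
    int8_t  a[2 * 3] = {1, 2, 3, 4, 5, 6};   /* 2x3 operand */
    int8_t  b[3 * 2] = {1, 0, 0, 1, 1, 1};   /* 3x2 operand */
    int32_t c[2 * 2] = {0};                  /* 2x2 accumulator */

    dim_reg_t da = {2, 3}, db = {3, 2};
    tcx_matmul(a, b, c, da, db);  /* c = {{4, 5}, {10, 11}} */
    return 0;
}
```

Note the int8 inputs with int32 accumulation: a common precision choice for inference accelerators and consistent with the abstract's concern for balancing area, generality, and accuracy loss, though the actual TCX precision characterization is given in the paper itself.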