Abstract:
This paper presents a deep learning processor that supports both inference and training for entire convolutional neural networks (CNNs) of any size. The proposed design enables on-chip training for applications that demand high security and privacy. Techniques across design abstraction levels are applied to improve energy efficiency. Re-arranging the weights in the filters reduces processing latency by 88%. Integrating fixed-point and floating-point arithmetic reduces the multiplier area by 56.8%, yielding a unified processing element (PE) with 33% less area. In the low-precision mode, clock gating and data gating reduce the power of the PE cluster by 62%. The max-pooling and ReLU modules are co-designed to reduce memory usage by 75%. A modified softmax function reduces area by 78%. Fabricated in 40 nm CMOS, the chip consumes 18.7 mW and 64.5 mW for inference and training, respectively, at 82 MHz from a 0.6 V supply. It achieves an energy efficiency of 2.25 TOPS/W, 2.67 times higher than state-of-the-art learning processors, and a 2×10⁵ times higher energy efficiency in training than a high-end CPU.
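The abstract does not detail the max-pooling/ReLU co-design, but one property such a fused module can exploit is that ReLU, being monotonic, commutes with max pooling: applying ReLU after pooling gives the same result as pooling the activated map, so only the smaller pooled map needs to be buffered. A minimal sketch of this identity (the function names and 2×2 pooling window are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def relu(x):
    # Elementwise rectified linear unit
    return np.maximum(x, 0.0)

def maxpool2x2(x):
    # Non-overlapping 2x2 max pooling on an (H, W) map with even dimensions
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.random.randn(8, 8)
# Since ReLU is monotonic, ReLU(maxpool(x)) == maxpool(ReLU(x));
# a fused module can therefore pool first and store 4x fewer values
# before activation, consistent with the reported memory reduction.
assert np.allclose(relu(maxpool2x2(x)), maxpool2x2(relu(x)))
```

Because the two orderings are equivalent, hardware is free to place the cheaper operation wherever it minimizes intermediate storage.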
Published in: 2019 IEEE Asian Solid-State Circuits Conference (A-SSCC)
Date of Conference: 04-06 November 2019
Date Added to IEEE Xplore: 06 April 2020