Abstract:
General convolution acceleration, such as Winograd and FFT, is a promising direction for addressing the computational complexity of current convolutional neural networks (CNNs). However, the flexibility of these CNNs means that such schemes always introduce massive redundant computations, which damage the acceleration effect. In this article, a two-stage splitting method for arbitrarily sized tensors and filters and a unified hardware architecture using layer-adaptive allocated Winograd units are proposed, achieving effective redundancy elimination within a unified architecture. First, a tensor adaptive presplitting method is proposed to divide the original tensors to match the rule of Winograd. Furthermore, a Winograd-based extended splitting scheme is designed to reduce the redundant calculations, yielding a 30.6%–75% reduction in multiplication operations in convolutional layers. Finally, a unified hardware architecture with a layer-adaptive allocation method is proposed to evaluate and select the optimal Winograd F(m, r) units and input/output parallelisms. This architecture is evaluated on the Xilinx XCVU9P platform and achieves 1.97/1.23/1.60/1.25 GOPS/DSP for AlexNet, VGG16, modified VGG16, and ResNet18, respectively. It achieves up to 5.81× improvement in DSP efficiency compared with previous FPGA-based designs.
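For readers unfamiliar with the F(m, r) notation, the sketch below (not the authors' implementation) shows the smallest common instance, Winograd F(2, 3): it computes m = 2 outputs of an r = 3-tap convolution with 4 multiplications instead of the 6 a direct sliding window needs, which is the kind of multiplication saving the abstract quantifies.

```python
def winograd_f23(d, g):
    """Winograd F(2, 3): 4 input samples d, 3 filter taps g -> 2 outputs."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform (can be precomputed once per filter).
    G0 = g0
    G1 = (g0 + g1 + g2) / 2
    G2 = (g0 - g1 + g2) / 2
    G3 = g2
    # 4 multiplications, vs. 6 for the direct method.
    m1 = (d0 - d2) * G0
    m2 = (d1 + d2) * G1
    m3 = (d2 - d1) * G2
    m4 = (d1 - d3) * G3
    # Output transform: additions only.
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct(d, g):
    """Reference: direct (correlation-style) convolution, 6 multiplications."""
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]

d, g = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0]
assert winograd_f23(d, g) == direct(d, g)  # both give [4.5, 6.0]
```

Larger F(m, r) tiles save more multiplications per output but require inputs whose sizes match the tile, which is why the paper's splitting method divides arbitrarily sized tensors to fit the Winograd rule.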
Published in: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (Volume: 43, Issue: 3, March 2024)