Abstract:
A high degree of sparsity in machine learning (ML) models has been highlighted as a significant opportunity to improve energy and delay efficiency by skipping the computation of zero elements in operands. Despite this potential, the unstructured positions of zeros and the wide range of sparsity levels make it challenging to exploit this property in hardware implementations, which are often built on regular structures. To address these challenges, this article presents a low-power and high-performance AI accelerator, called FreFlex, based on sparsity-adaptive dynamic frequency modulation (SA-DFM) combined with the proposed processing element (PE) in a 2-D systolic array. The sparsity of each layer is determined by counting zero elements in its output while the layer is being computed. The clock frequency is then optimally modulated based on the sparsity level of the previous layer’s output, which becomes the input of the next layer. The power slack left unused due to sparsity is exploited to boost performance while fully using the power budget. The proposed technique achieves up to 1.8× performance improvement by exploiting sparsity, while incurring less than 7% power overhead even when there is no sparsity. The silicon prototype, fabricated in a 65-nm CMOS node, demonstrates 0.6–1.0-TOPS/W efficiency for convolution and attention computations, with a performance density of 160 GOPS/mm² at a maximum frequency of 1.1 GHz.
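To make the SA-DFM control loop concrete, below is a minimal behavioral sketch of the idea described in the abstract: zeros are counted in the current layer's output, and the clock for the next layer is raised when sparsity frees up power slack. The frequency table, thresholds, and function names are illustrative assumptions for exposition, not values from the paper.

```python
import numpy as np

# Hypothetical sparsity-to-frequency table (illustrative values only):
# higher sparsity leaves power slack, allowing a higher clock within
# the same power budget.
FREQ_TABLE_MHZ = [
    (0.00, 600),   # dense layer: baseline frequency
    (0.25, 750),
    (0.50, 900),
    (0.75, 1100),  # highly sparse layer: boost toward f_max
]

def measure_sparsity(layer_output: np.ndarray) -> float:
    """Fraction of zero elements in the current layer's output,
    which becomes the next layer's input."""
    return float(np.count_nonzero(layer_output == 0)) / layer_output.size

def select_frequency_mhz(sparsity: float) -> int:
    """Pick the highest table frequency whose sparsity threshold is met
    (a simple stand-in for the SA-DFM decision)."""
    freq = FREQ_TABLE_MHZ[0][1]
    for threshold, f in FREQ_TABLE_MHZ:
        if sparsity >= threshold:
            freq = f
    return freq

# Example: modulate the clock layer by layer.
rng = np.random.default_rng(0)
prev_output = rng.standard_normal((64, 64))
prev_output[prev_output < 0] = 0.0   # ReLU-like activation sparsity
s = measure_sparsity(prev_output)
print(f"sparsity={s:.2f} -> next-layer clock={select_frequency_mhz(s)} MHz")
```

In the silicon implementation this decision would be made by on-chip zero counters and a clock generator rather than software; the sketch only captures the per-layer measure-then-modulate sequence.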
Published in: IEEE Journal of Solid-State Circuits (Volume: 59, Issue: 3, March 2024)