Abstract:
In this article, we present a hardware architecture optimized for sparse and dense matrix processing in TensorFlow Lite and compatible with embedded-heterogeneous devices...Show MoreMetadata
Abstract:
In this article, we present a hardware architecture optimized for sparse and dense matrix processing in TensorFlow Lite and compatible with embedded-heterogeneous devices that integrate central processing unit and field-programmable gate array (FPGA) resources. The fused architecture for dense and sparse matrices design offers multiple configuration options that tradeoff parallelism and complexity, and uses a dataflow model to create four stages that read, compute, scale, and write results. All stages are designed to support TensorFlow Lite operations including asymmetric quantized activations, column-major matrix write, per-filter/per-axis bias values, and current scaling specifications. The configurable accelerator is integrated with the TensorFlow Lite inference engine running on the ARMv8 processor. We compare performance/power/energy with the state-of-the-art RUY software multiplication library showing up to 18× acceleration and 48× in dense and sparse modes, respectively. The sparse mode benefits from structural pruning to fully utilize the digital signal processing blocks present in the FPGA device.
Published in: IEEE Micro ( Volume: 42, Issue: 6, 01 Nov.-Dec. 2022)