Systematic realization of a fully connected deep and convolutional neural network architecture on a field programmable gate array

https://doi.org/10.1016/j.compeleceng.2021.107628

Highlights

  • A methodological approach for mapping DNN and CNN inference unit on an FPGA.

  • Serial processing for DNN and parallel systolic array processing for CNN.

  • Detailed illustration of the VLSI modules used to build the inference architecture.

  • Choice of distributed/block on-chip memory for the inference architectures and comparison with related works.

Abstract

A detailed methodology for implementing a fully connected (FC) deep neural network (DNN) and convolutional neural network (CNN) inference system on a field-programmable gate array (FPGA) is presented. Minimal computational units are used for the DNN. For the CNN, a systolic array (SA) architecture with parallel processing capability is used. Algorithmic analysis determines the optimum memory requirement for the fixed-point trained parameters. The size of the trained parameters and the available memory on the target FPGA device govern the choice of on-chip memory. Experimental results indicate that choosing block over distributed memory saves 62% of the look-up tables (LUTs) for the DNN ([784-512-512-10]), and choosing distributed over block memory saves 30% of the block random access memory (BRAM) for the LeNet-5 CNN unit. This study provides insights for developing FPGA-based digital systems for applications requiring DNNs and CNNs.

Introduction

In recent years, a considerable amount of work has employed neural networks (NNs) to address a wide range of applications in signal and image/video processing, big data, the Internet of Things (IoT), and biomedicine [1], [2], [3], [4], because (1) NNs are excellent at solving various nonlinear problems, and (2) considerable advances have been made in learning methods and optimization algorithms [5], [6]. NNs are used to develop smart machines that predict outcomes for both real-time and non-real-time applications. The rise of research on better NN algorithms over the last decade has been enabled by advances in Very Large Scale Integration (VLSI) technology, which has produced fast central processing units (CPUs), graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs) [7], [8]. The computational complexity, in terms of the number of operations and the achievable speed, depends on the NN architecture and the hardware platform used.

Broadly, two categories of research are carried out: (1) developing better optimization algorithms for state-of-the-art problems within NN frameworks, and (2) developing appropriate VLSI architectures for NN algorithms using a high-level language (HLL) or a hardware description language (HDL) [8]. Several existing tools convert an algorithmic design to the architectural level [9], [10]. High-level synthesis (HLS) is a common approach, widely adopted in the VLSI industry, in which HLS tools map an algorithm to its corresponding register-transfer level (RTL) architecture on the target hardware. However, such tools act as a black box, and the resulting architecture may not be optimal for the targeted application. The other approach is to use an HDL, which offers the flexibility to manually design an optimized architecture as per the requirements. In this work, we have adopted the latter approach and used Verilog HDL for our analysis.

Convolutional neural networks (CNNs) are frequently used in various image-based applications because their initial layers prove to be good feature discriminators, creating distinguishable feature maps from the raw input data [11], [12]. A systolic array (SA) is made up of multiple pipelined processing elements (PEs), which support parallel processing with a rhythmic dataflow [13]. SAs are suited to applications requiring multiply-and-accumulate (MAC) operations and have therefore been widely adopted in digital signal processing (DSP) [14]. Recently, the use of SA structures for implementing CNN accelerators on FPGAs has gained significant attention [15], [16]. With such modular structures involving multiple PEs, the convolution (CONV) operations in a CNN are performed at a higher throughput [17].
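To make the PE-level dataflow concrete, the following Verilog sketch shows one possible weight-stationary PE for such an SA. The module name, port names, and bit widths are illustrative assumptions, not the exact PE used in this work.

```verilog
// Hypothetical weight-stationary processing element (PE) sketch.
// Each clock cycle the PE multiplies the incoming activation by its
// locally stored weight, adds the partial sum arriving from the
// neighbouring PE, and forwards both the activation and the new
// partial sum to the next PE in the array.
module pe #(
    parameter DATA_W = 8,    // activation / weight width (assumed)
    parameter ACC_W  = 24    // partial-sum width (assumed)
) (
    input  wire                     clk,
    input  wire                     rst,
    input  wire                     load_w,   // latch a new stationary weight
    input  wire signed [DATA_W-1:0] w_in,
    input  wire signed [DATA_W-1:0] act_in,
    input  wire signed [ACC_W-1:0]  psum_in,
    output reg  signed [DATA_W-1:0] act_out,
    output reg  signed [ACC_W-1:0]  psum_out
);
    reg signed [DATA_W-1:0] weight;

    always @(posedge clk) begin
        if (rst) begin
            weight   <= {DATA_W{1'b0}};
            act_out  <= {DATA_W{1'b0}};
            psum_out <= {ACC_W{1'b0}};
        end else begin
            if (load_w)
                weight <= w_in;
            act_out  <= act_in;                    // pass activation to the neighbour
            psum_out <= psum_in + act_in * weight; // MAC, pass partial sum onward
        end
    end
endmodule
```

Tiling many such PEs in a grid yields the rhythmic, pipelined dataflow that gives SA-based CONV engines their throughput advantage.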

In this work, we elaborate the steps to conceive and map DNN and CNN inference units from the algorithmic to the architectural level. Algorithmic-level optimization is carried out to determine the optimum number of bits needed for representing the trained parameters. At the architectural level, a MAC unit and a PE block are used as the basic computational units for the DNN and CNN structures, respectively. The objective of this paper is to illustrate a step-by-step process for translating the algorithm of the design into its corresponding architecture using HDL. The major contributions of the work are as follows:

  • A systematic way to map an FC neural network inference unit using Verilog HDL, considering the basic computational units such as the MAC and PE blocks for the DNN and CNN, is presented. The migration from algorithm to architecture is described while considering relevant optimization.

  • A detailed illustration of the VLSI architecture of the inference system and of the control scheme for scheduling the feedforward process is presented. For the DNN, a serial implementation using minimal resources is shown, whereas for the CNN, a parallel implementation using an SA structure is exhibited. Other NN architectures can be built by following the approach discussed in this work.

  • The resource utilization for the DNN and CNN architectures on the target FPGA device is computed. The choice of selecting distributed versus block on-chip memory for storing the trained parameters is justified based on the implementation results. The results are compared with similar works in the literature.

This analysis is performed on the Modified National Institute of Standards and Technology (MNIST) database, which is well suited for machine learning experiments. The analysis also applies to larger image datasets, because DNNs and CNNs perform better with more data samples and are powerful models for predicting complex input data with non-linear relationships. CNNs are a special type of DNN that excel at image processing tasks. In this study, the workflow for the LeNet-5 CNN is shown. Depending on the specific image processing application, a different CNN architecture may need to be chosen; a larger dataset suits a bigger CNN model with more parameters. In that case, the target FPGA should be chosen according to the available on-chip memory.

A flowchart illustrating the major steps in mapping the neural network inference system from the algorithmic to the architectural level is shown in Fig. 1. For the DNN, a sequentially operated MAC unit is considered, which is suitable for applications with a limited area budget and modest speed requirements. For the CNN, PEs are used in an SA framework suitable for high-speed applications. The rectified linear unit (ReLU) has shown remarkable performance in DNN frameworks in terms of faster training and better accuracy compared with the standard sigmoid function [5]. The mathematical simplicity of the ReLU activation function further reduces the hardware requirement.
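Because ReLU of a two's-complement value only requires inspecting the sign bit, its hardware cost is essentially that of a multiplexer. A minimal Verilog sketch is given below; the module name and data width are assumptions for illustration, not the exact module of this design.

```verilog
// Illustrative ReLU activation for signed fixed-point data:
// output the input when it is non-negative, zero otherwise.
// The sign bit alone selects between the two cases, so no
// arithmetic hardware is required.
module relu #(
    parameter N = 16               // data width (assumed)
) (
    input  wire signed [N-1:0] din,
    output wire signed [N-1:0] dout
);
    assign dout = din[N-1] ? {N{1'b0}} : din;
endmodule
```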

The rest of the article is organized as follows. Algorithmic level analysis such as training and optimization of trained parameters for the DNN and CNN are presented in Section 2. VLSI architectures involved in constructing the DNN and CNN are elaborated in Section 3. Evaluation of the estimated time of computation and the implementation results are discussed and compared with related works in Section 4. Finally, Section 5 concludes the article.

Section snippets

Description of DNN

A DNN is a neural network (NN) with two or more hidden layers, which addresses a wide range of problems related to classification and regression. The general structure of an FC DNN classifier is shown in Fig. 2(a), consisting of input, output, and hidden layers. The number of neurons in the input layer is equal to the dimension of the feature vector (FV). The number of classes to be predicted determines the number of neurons in the output layer. The trained parameters of the NN are known as weights and biases.

DNN

The system-level diagram of the DNN inference unit is shown in Fig. 4. The implemented design is synchronous: the inputs to the system are clk and state_reset, and the output out_class represents the winner neuron. The MAC unit takes the FV and the trained parameters as inputs. The bias term of the neuron being computed initializes the MAC output. The product of each connected weight with the corresponding previous-layer neuron output (the FV in the case of the first hidden layer) is then accumulated into this sum.
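A minimal Verilog sketch of such a serial MAC unit is given below, with illustrative port names and bit widths (it does not reproduce the exact interface of Fig. 4): the accumulator is first loaded with the bias of the neuron being evaluated and then adds one weight-activation product per clock cycle.

```verilog
// Illustrative serial MAC unit: one multiply-accumulate per cycle.
// 'init' loads the bias of the neuron being evaluated; thereafter
// each enabled cycle adds in_a * in_w to the running sum.
module mac_serial #(
    parameter N_FV  = 8,            // feature-vector / activation width (assumed)
    parameter N_WB  = 8,            // weight width (assumed)
    parameter N_ACC = 26            // accumulator width with guard bits (assumed)
) (
    input  wire                    clk,
    input  wire                    init,   // load bias, start a new neuron
    input  wire                    en,     // accumulate when high
    input  wire signed [N_FV-1:0]  in_a,   // activation / FV element
    input  wire signed [N_WB-1:0]  in_w,   // weight
    input  wire signed [N_ACC-1:0] bias,   // bias of the current neuron
    output reg  signed [N_ACC-1:0] acc     // accumulated pre-activation
);
    wire signed [N_FV+N_WB-1:0] prod = in_a * in_w;

    always @(posedge clk) begin
        if (init)
            acc <= bias;
        else if (en)
            acc <= acc + prod;
    end
endmodule
```

One such unit, time-shared across all neurons of the network, is what keeps the serial DNN implementation small at the cost of latency.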

Computations involved in DNN

The area consumed by the DNN architecture is dominated by the number of memory bits required to store the trained parameters in on-chip memory. The MAC unit consists of a multiplier, an adder, intermediate registers, and a multiplexer (MUX), which together occupy a smaller portion of the entire design. The bit width of the multiplier output, $N_{mult_{out}}$, is estimated by Eq. (6):

$$N_{mult_{out}} = N_{FV} + N_{W_b} \tag{6}$$

Here, $N_{FV}$ and $N_{W_b}$ are the numbers of bits representing the FV and the trained parameter values, respectively.
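As a concrete instance of Eq. (6) under assumed 8-bit representations (not necessarily the widths finally adopted in this work), the multiplier output is 16 bits, and accumulating the 784 first-layer products of the [784-512-512-10] network calls for about $\lceil \log_2 784 \rceil = 10$ additional guard bits in the accumulator:

```verilog
// Illustrative width derivation for the MAC datapath (assumed 8-bit data,
// first layer of the [784-512-512-10] DNN on MNIST).
module mac_widths;
    localparam N_FV      = 8;                   // feature-vector element width (assumed)
    localparam N_WB      = 8;                   // weight width (assumed)
    localparam N_MULTOUT = N_FV + N_WB;         // Eq. (6): multiplier output = 16 bits
    localparam N_TERMS   = 784;                 // products accumulated per first-layer neuron
    localparam N_GUARD   = $clog2(N_TERMS);     // 10 guard bits against overflow
    localparam N_ACC     = N_MULTOUT + N_GUARD; // accumulator width = 26 bits

    initial $display("multiplier = %0d bits, accumulator = %0d bits", N_MULTOUT, N_ACC);
endmodule
```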

Conclusion

This paper has established a methodological approach for mapping FC DNN and CNN inference units on an FPGA using on-chip memory resources. The MNIST dataset is used for the analysis. The FC DNN has been implemented on an Artix-7 7a200tfbg676-2 FPGA using minimal resources consisting of a single MAC unit. The LeNet-5 CNN inference unit is implemented in an SA fashion, exploiting its parallel processing capability, on a Kintex UltraScale xcku035sfva784-1LV FPGA. Optimizing the number of bits required for representing the trained parameters reduces the on-chip memory footprint.

CRediT authorship contribution statement

Anand Kumar Mukhopadhyay: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing – original draft, Writing – review & editing, Visualization, Supervision, Project administration. Sampurna Majumder: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing – original draft, Writing – review & editing, Visualization. Indrajit Chakrabarti: Resources, Writing – review & editing, Project administration.

Declaration of Competing Interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.compeleceng.2021.107628.

Anand Kumar Mukhopadhyay received the B.Tech. in electronics and communication engineering from West Bengal University of Technology in 2012, M.Tech. in VLSI design and embedded systems from National Institute of Technology Rourkela in 2014, and Ph.D. from the Electronics and Electrical Communication Engineering department, Indian Institute of Technology Kharagpur in 2021. His research interests include artificial intelligence, digital VLSI architectures, and signal processing applications.

References (25)

  • Sharma, H., et al. From high-level deep neural models to FPGAs

  • Krizhevsky, A., et al. ImageNet classification with deep convolutional neural networks

Cited by (6)

  • StrokeViT with AutoML for brain stroke classification

    2023, Engineering Applications of Artificial Intelligence

    Citation excerpt: "The time to treatment can further be reduced by portable, non-invasive stroke detection devices or embedded systems that can help to classify normal, hemorrhage, or ischemic stroke. In this regard, Mukhopadhyay et al. (2022) have demonstrated a systematic way to map a CNN from the algorithmic to the architectural level. At the architectural level, Field Programmable Gate Arrays (FPGA) are considered the most suitable platforms for CNN due to their reconfigurability, resource optimization, and low power consumption."

  • Design of an experimental setup for the implementation of CNNs in APSoCs

    2023, 1st IEEE Colombian Caribbean Conference, C3 2023


Sampurna Majumder received the B.E. in electronics and telecommunication engineering from Indian Institute of Engineering Science and Technology Shibpur in 2017 and M.Tech. in microelectronics and VLSI from Indian Institute of Technology Kharagpur in 2020. Her research interests include VLSI design, machine learning, and hardware architecture.

Indrajit Chakrabarti received the B.E. and M.E. degrees in electronics and telecommunication engineering from Jadavpur University, India, in 1987 and 1990, respectively, and the Ph.D. degree from the Indian Institute of Technology (IIT) Kharagpur in 1997. He is currently working as a Professor with the Department of Electronics and Communication Engineering, IIT Kharagpur. His research interests include VLSI architectures for image and video processing, digital signal processing, error control coding, and wireless communication.

This paper was recommended for publication by associate editor Huimin Lu.
