# ApproxTrain: Fast Simulation of Approximate Multipliers for DNN Training and Inference

Jing Gong\*, Hassaan Saadat\*, Hasindu Gamaarachchi<sup>‡\*</sup>, Haris Javaid<sup>§</sup>,

Xiaobo Sharon Hu<sup>†</sup> and Sri Parameswaran\*

\*School of Computer Science and Engineering, UNSW Sydney, Kensington NSW 2052 Australia

<sup>‡</sup>Garvan Institute of Medical Research, Darlinghurst NSW 2010 Australia <sup>§</sup>AMD, Singapore

<sup>†</sup>Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556 USA

**Abstract:** Edge training of Deep Neural Networks (DNNs) is a desirable goal for continuous learning; however, it is hindered by the enormous computational power required by training. Hardware approximate multipliers have shown their effectiveness for gaining resource-efficiency in DNN inference accelerators; however, training with approximate multipliers is largely unexplored. To build resource-efficient accelerators with approximate multipliers supporting DNN training, a thorough evaluation of training convergence and accuracy for different DNN architectures and different approximate multipliers is needed. This paper presents ApproxTrain<sup>1</sup>, an open-source framework that allows fast evaluation of DNN training and inference using simulated approximate multipliers. ApproxTrain is as user-friendly as TensorFlow (TF) and requires only a high-level description of a DNN architecture along with C/C++ functional models of the approximate multiplier. We improve the speed of the simulation at the multiplier level by using a novel LUT-based approximate floating-point (FP) multiplier simulator on GPU (AMSim). Additionally, a novel flow is presented to seamlessly convert C/C++ functional models of approximate FP multipliers into AMSim. ApproxTrain leverages CUDA and efficiently integrates AMSim into the TensorFlow library, in order to overcome the absence of native hardware approximate multipliers in commercial GPUs. We use ApproxTrain to evaluate the convergence and accuracy of DNN training with approximate multipliers for small and large datasets (including ImageNet) using LeNet and ResNet architectures. The evaluations demonstrate similar convergence behavior and negligible change in test accuracy compared to FP32 and bfloat16 multipliers. Compared to CPU-based approximate multiplier simulations in training and inference, the GPU-accelerated ApproxTrain is more than 2500x faster. Based on highly optimized closed-source cuDNN/cuBLAS libraries with native hardware multipliers, the original TensorFlow is, on average, only 8x faster than ApproxTrain.

# I. INTRODUCTION

The *training* phase in deep learning is significantly more computationally demanding than the *inference* phase. Recent works have shown the importance of moving training to the edge to perform continuous learning [5], though such deployments are scarce due to the high training cost. Thus, training has so far been largely relegated to high-performance computers and the cloud. To support training at the edge, besides the accuracy of the learning model, energy/power and area efficiency are paramount.

An efficient DNN system could be realized through two distinct yet complementary approaches: (1) by exploring and finding efficient DNN hardware implementation schemes for both training and inference; and (2) by exploring and finding the most suitable DNN architecture for the given problem. This paper presents ApproxTrain, a framework that allows fast evaluation of DNN training when using a variety of simulated approximate multipliers. ApproxTrain facilitates thorough evaluations of training of many different DNNs with differing approximate multipliers, in a user-friendly manner with practically feasible runtimes, to find suitable approximate multipliers that can be integrated into edge devices for continuous learning.

<sup>1</sup>https://github.com/AaronJing/ApproxTrain

Multipliers are one of the most compute-intensive hardware elements within a deep learning system [12]. Thus, using approximate multipliers is a promising scheme for efficient implementation of DNN systems [35], and the efficacy of various approximate multiplier designs in DNN inference in terms of model accuracy and implementation efficiency has been demonstrated in recent years [33]. However, DNN training using approximate multipliers is largely unexplored. A major obstacle in the exploration and evaluation of approximate multipliers in DNN training is the absence of native hardware support for customized approximate floating-point (FP) multipliers on commercial processing units (CPUs, GPUs, NPUs). Hence, software simulation of approximate multiplications is needed for handling DNN training with approximate multipliers.

Thorough and effective evaluation of approximate multiplier designs in DNN training has two practical requirements: (1) run-time of training simulation should be sufficiently fast and practically feasible; and (2) the ability to describe DNN architectures at a high level so that various DNN architectures can be quickly constructed and evaluated. However, resorting to software simulation makes it challenging to meet the above requirements. Firstly, since training is time-consuming, inefficient simulation may further slow down training (up to orders of magnitude), making its run-time practically infeasible.

Secondly, the standard frameworks, such as TensorFlow and PyTorch, that allow a high-level description of DNN architectures and take advantage of the computational power of commercial GPUs/TPUs/NPUs, are not equipped to utilize approximate multipliers, since they invoke the native multipliers (built into the hardware) of these commercial platforms. Moreover, these deep learning frameworks, although open-source themselves, are based on the closed-source cuDNN and cuBLAS libraries [2] at their backend. Hence, any modification of these high-level frameworks to incorporate fast approximate multiplier simulation while preserving their flexibility requires specialized parallel programming skills, making it a daunting task for a regular DNN developer or approximate multiplier designer.

ApproxTrain overcomes the above challenges through three key contributions. First, to improve the speed of ApproxTrain at the multiplier level, we use a mantissa lookup table-based approach for functional simulation of approximate multipliers. Our mantissa-lookup approach is based on the observation that only the mantissa is approximated in most state-of-the-art approximate multipliers [30], [29]. We propose to compute the sign and exponent using the standard approach and to compute the mantissa using the lookup table, which allows much smaller lookup tables than constructing a lookup table for the full bit-width of the operands. The alternative method of simulating approximate FP multipliers by bit manipulation causes significant GPU instruction overhead and, more importantly, results in variable performance for different multipliers. The proposed mantissa lookup table-based approach also makes the speed independent of the type of approximate multiplier.

Any CPU-based training mechanism can be painfully slow compared to standard GPU implementations. Therefore, our second contribution, aiming to improve system-level performance, is the development of GPU-accelerated custom CUDA kernels as an alternative to the non-modifiable closed-source cuDNN and cuBLAS libraries. Our GPU-accelerated custom CUDA kernels exploit various architectural features of the GPU (such as fine-grained and coarse-grained parallelism, memory coalescing, and on-chip shared memory). To reduce the number of kernel invocations (thus improving performance) and the necessary memory footprint during the weight gradient computation, we exploit the nature of the dilation step to efficiently integrate it into the image to column (IM2COL) kernel.

Third, to preserve the flexibility and ease of use of a high-level framework (TensorFlow), we create custom approximate TF *ops* (explained in Section III) and seamlessly integrate them into TensorFlow. These custom TF *ops* support different types of DNN layers with approximate multiplication. The GPU-accelerated custom CUDA kernels (described above) are used to implement our custom approximate TF *ops*. ApproxTrain requires only a standard TensorFlow-based implementation of the DNN architecture, along with C/C++-based functional models of the approximate multipliers.

Overall, ApproxTrain is a DNN framework that: (1) allows fast DNN training evaluations with approximate multipliers; (2) is similar to, and as user-friendly as, popular high-level frameworks such as TensorFlow; and (3) simplifies the daunting task of integrating approximate multiplier simulation in TensorFlow (or similar frameworks) to make it transparent to the DNN architecture/multiplier designer. To this end, the paper makes the following key **contributions**.

- A novel flow is presented to seamlessly transform C/C++ functional models of any approximate FP multiplier provided by designers into a lookup table (LUT)-based simulator (AMSim). This LUT-based AMSim requires negligible GPU memory compared to a whole-LUT method that stores all multiplication results in a LUT.
- We present ApproxTrain, a TensorFlow-based framework, that allows fast DNN training evaluations with AMSim (simulated approximate FP multipliers). ApproxTrain is as user-friendly as the TensorFlow framework and requires only a high-level description of a DNN architecture along with C/C++ functional model of the approximate multiplier. ApproxTrain also supports inference using approximate multipliers.
- Since the closed-source cuDNN and cuBLAS libraries cannot be modified to integrate functional models of approximate multipliers into ApproxTrain, we developed GPU-accelerated custom CUDA kernels.
- To explore prospects of DNN training with approximate multipliers, we use ApproxTrain to evaluate DNN training with approximate multipliers in terms of training convergence and accuracy results. Our experiments with small and large datasets (including ImageNet) using popular neural network architectures demonstrate successful convergence with negligible change in test accuracy.
- The ApproxTrain framework, as well as the developed custom CUDA libraries, are released as an open-source repository [17].

There has been one earlier attempt at such a framework. However, that framework only supports DNN inference with 8-bit approximate integer multipliers [33]. Thus, to the best of our knowledge, there is no general framework that supports fast and user-friendly DNN training as well as inference with approximate FP multipliers, i.e., in both the forward and backpropagation phases of training.

#### Paper Organization:

Section II presents the motivation of using approximate multipliers in training. The relevant background and related work for this paper are discussed in Section III and Section IV, respectively. Section V describes the novel LUT-based approximate FP multiplier simulation, and Section VI describes the integration of the novel simulation into the presented framework ApproxTrain. Experimental evaluation setup using ApproxTrain is elaborated in Section VII. The training convergence and accuracy results using approximate multipliers are presented in Section VIII, and the evaluation of the run-time performance of ApproxTrain is presented in Section IX. Finally, Section X concludes the paper.

# II. MOTIVATION: THE PROMISE OF APPROXIMATE MULTIPLIERS IN DNN TRAINING

Fig. 1. Comparison of resource-efficiency (higher is better) of IEEE FP32, IEEE FP16, bfloat16 and approximate multipliers (AFM32 and AFM16). All area and power values are normalized with area and power of FP32, respectively. The multipliers are single cycle designs, and logic synthesis is done using Cadence RC compiler for TSMC 45nm cell library at 1GHz.

Fig. 2. Forward propagation and backpropagation between DNN layers.

The training phase of a DNN requires a large dynamic range to represent intermediate data, such as gradients [22]. Therefore, the resource-hungry floating-point format (which offers a large dynamic range) is widely used for DNN training. Until a few years ago, the IEEE single-precision floating-point format (FP32: 8-bit exponent, 23-bit mantissa) was used in training to achieve the best possible accuracy [24]. Recently, the Brain Floating Point format (bfloat16) was introduced for multiplication; it has a similar dynamic range (8-bit exponent) but lower precision (7-bit mantissa) compared to FP32 [34]. The bfloat16 format is currently supported by Google, Nvidia and Intel in their latest TPUs, GPUs and NPUs [3], [4], [34].
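The relation between FP32 and bfloat16 described above can be made concrete with a few lines of bit manipulation. The following is a minimal sketch, assuming conversion by simple truncation of the low 16 bits (real hardware typically rounds instead); the helper names `f32_bits`, `bits_f32`, `to_bfloat16` and `fields` are illustrative and not part of ApproxTrain.

```python
import struct


def f32_bits(x: float) -> int:
    """Return the IEEE-754 single-precision bit pattern of x as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(bits: int) -> float:
    """Interpret a 32-bit integer as an IEEE-754 single-precision value."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def to_bfloat16(x: float) -> float:
    """Keep the sign, the 8 exponent bits and the top 7 mantissa bits.

    Assumption: the low 16 mantissa bits are truncated (no rounding).
    """
    return bits_f32(f32_bits(x) & 0xFFFF0000)


def fields(x: float):
    """Split an FP32 value into (sign, biased exponent, 23-bit mantissa)."""
    b = f32_bits(x)
    return b >> 31, (b >> 23) & 0xFF, b & 0x7FFFFF


if __name__ == "__main__":
    x = 3.14159
    y = to_bfloat16(x)
    print(fields(x), fields(y), y)
```

The exponent field is untouched by the conversion, which is why bfloat16 retains the dynamic range of FP32 while giving up mantissa precision.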

A comparison of resource-efficiency (higher is better) of IEEE FP32, FP16, and bfloat16 multipliers with the 32-bit and 16-bit versions of an approximate FP multiplier from the literature (AFM32 and AFM16) [29] is shown in Figure 1. It can be observed that the approximate multipliers are more power-efficient and more area-efficient than the FP32, FP16, and bfloat16 multipliers. Hence, we expect approximate hardware multipliers to enable more resource-efficient training than FP32, FP16 or bfloat16 formats. However, before any custom neural accelerator (equipped with approximate hardware multipliers) can be built to exploit this promise, a thorough evaluation of training convergence and accuracy results for different DNN architectures is needed. This necessitates a fast evaluation framework for DNN training using approximate multipliers, which is the focus of this paper.

# III. BACKGROUND

## A. Deep Neural Networks

**DNN structure:** A DNN consists of an input layer, multiple hidden layers, and an output layer, as shown in Figure 2. Each layer is composed of multiple parallel neurons; for instance, the hidden layer in Figure 2 has four parallel neurons. Each parallel neuron contributes to a weighted output that acts as an input for the next layer. Inputs and weights are subjected to linear algebra operations in each layer, followed by nonlinear activation functions. Table I shows common types of layers used in some common DNN architectures. The dense layer is used by all architectures listed, since it can serve as a classification layer at the output of most DNN architectures. It can also be used in MLPs (multi-layer perceptrons) as feedforward layers. The core of a dense layer is matrix-vector multiplication. Convolution and depthwise convolution are widely used in image classification networks, such as ResNet and MobileNets in Table I. Convolution consists of multiplication-intensive operations. Pooling layers are responsible for down-sampling, which reduces the size of the feature maps, and do not involve multiplications. The MultiHeadAttention layer has shown extraordinary performance in language and image tasks; a well-known example is the Transformer, as shown in Table I. The MultiHeadAttention layer involves matrix multiplication under the hood.

Fig. 3. DNN training algorithm with two stages: forward propagation and backpropagation.

TABLE I. SOME POPULAR NEURAL NETWORK ARCHITECTURES AND TYPES OF CONSTITUENT LAYERS

| DNN architecture | Dense | Convolution | Depthwise Convolution | Pooling | MultiHead Attention |
|------------------|-------|-------------|-----------------------|---------|---------------------|
| LeNet-300-100    | ✓     | -           | -                     | -       | -                   |
| ResNet           | ✓     | ✓           | -                     | ✓       | -                   |
| MobileNets       | ✓     | ✓           | ✓                     | ✓       | -                   |
| Transformer      | ✓     | -           | -                     | -       | ✓                   |

**DNN Training:** Training is an iterative process that finds optimal parameters (weights) to reduce the difference between the model prediction and the dataset labels. Initially, the weights are randomly generated based on a particular distribution. Then, a series of forward propagation and backpropagation phases are executed until no further accuracy improvement occurs. Training is highly resource-hungry and involves a large number of multiplications.

The training procedure is shown as a flow diagram in Figure 3. Activations from layer $l-1$ are multiplied with the weights of the current layer $l$, and the results are accumulated to become the activations for the next layer. After forward propagation, an error is calculated to reflect the difference between the predicted result and the label, known as the propagation error. The propagation error is propagated backward to calculate gradients for the weights in each layer so that the DNN model can achieve better predictions. As shown in Figure 3, $Errors^{l+1}$ is propagated backwards to layer $l$. To get $WeightsGradient^l$, multiply-accumulation is performed on $Errors^{l+1}$ and $Activations^{l-1}$. Then, $Weights^l$ can be updated with the calculated gradient. Similarly, $Errors^l$ (the preceding layer gradient) is computed as the multiply-accumulation between $Weights^l$ and the $Errors^{l+1}$ backpropagated from the succeeding layer $l+1$.
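The three multiply-accumulate steps described above can be sketched for a single dense layer as follows. This is a minimal illustration, assuming plain Python lists, no activation function and no bias; the function names and the `mul` parameter are hypothetical. Passing the scalar multiplier in as `mul` shows where an approximate multiplier would take the place of the exact one in both forward propagation and backpropagation.

```python
from typing import Callable, List

Mul = Callable[[float, float], float]
Vector = List[float]
Matrix = List[List[float]]


def exact_mul(a: float, b: float) -> float:
    """Reference multiplier; an approximate multiplier model could replace it."""
    return a * b


def dense_forward(weights: Matrix, activations_prev: Vector, mul: Mul) -> Vector:
    """Forward pass: multiply-accumulate of Weights^l and Activations^(l-1)."""
    return [sum(mul(w, a) for w, a in zip(row, activations_prev)) for row in weights]


def weights_gradient(errors_next: Vector, activations_prev: Vector, mul: Mul) -> Matrix:
    """WeightsGradient^l: product of Errors^(l+1) and Activations^(l-1)."""
    return [[mul(e, a) for a in activations_prev] for e in errors_next]


def errors_prev(weights: Matrix, errors_next: Vector, mul: Mul) -> Vector:
    """Errors^l: multiply-accumulate of Weights^l (transposed) and Errors^(l+1)."""
    n_out, n_in = len(weights), len(weights[0])
    return [sum(mul(weights[i][j], errors_next[i]) for i in range(n_out))
            for j in range(n_in)]


if __name__ == "__main__":
    W = [[1.0, 2.0], [3.0, 4.0]]
    a_prev = [1.0, 1.0]
    err_next = [1.0, 2.0]
    print(dense_forward(W, a_prev, exact_mul))
    print(weights_gradient(err_next, a_prev, exact_mul))
    print(errors_prev(W, err_next, exact_mul))
```

Every scalar product in all three functions goes through `mul`, so the number of multiplications per training step is roughly three times that of inference alone for such a layer.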

**Training Accuracy & Test Accuracy:** The dataset for a neural network is divided into two subsets: a training set and a test set. The training set is used to train the neural network. Training accuracy is the classification accuracy calculated using the training set and is mostly used for monitoring training convergence. The test accuracy is one metric to evaluate the performance of the trained neural network when classifying the 'unseen' test dataset.

**TensorFlow** (**TF**): TensorFlow, one of the most popular deep learning libraries, has been used to develop many DNNs across different application domains. TensorFlow provides highly abstracted *Ops* such as Conv2D (2D convolution, also known as the Conv2D layer) and Dense (also known as the fully connected layer) that are commonly used across different architectures and applications, allowing users to build models easily. In TensorFlow, DNN architectures are represented as graphs, and the *Ops* are nodes that take one or more tensors (multi-dimensional arrays) as inputs and perform computations on those tensors. Every *Op* has a Compute method that defines the mathematical operation to be performed on the tensors. An *Op* typically has a corresponding gradient *Op*, which is used in backpropagation during the training phase. The backend of TensorFlow utilizes the cuBLAS and cuDNN libraries developed by NVIDIA for GPU acceleration. Both closed-source libraries provide highly optimized low-level primitives for linear algebra and DNNs.

## B. Approximate Computing and Approximate Multipliers

Many modern compute-intensive applications, including machine learning and DNN algorithms, are error-resilient: they can tolerate errors in underlying computations with negligible degradation in the final output quality [26]. Approximate computing aims to exploit this property by trading off exactness in underlying computations for disproportionate gains in resource efficiency.

Hardware approximate multipliers (or approximate arithmetic units, in general) are resource-efficient computation units in which the hardware logic circuit is simplified, such that they become faster, smaller, and/or less power/energy-hungry while producing slightly erroneous outputs when compared to an exact multiplier. Depending on the data type, approximate multipliers can be classified into approximate integer multipliers and approximate FP multipliers. Approximate integer multipliers have been demonstrated to be effective in the inference phase of DNNs. However, DNN training using approximate multipliers is largely unexplored. More details on approximate multipliers can be found in [29].
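To illustrate what "slightly erroneous outputs" means in practice, the sketch below models Mitchell's classic logarithmic multiplication, which approximates $\log_2(1+x) \approx x$ so that the mantissa product reduces to an addition. It is chosen here only as a well-known example of an approximate multiplier and is not claimed to match any specific design evaluated in this paper; the function name `mitchell_mul` is illustrative.

```python
import math


def mitchell_mul(a: float, b: float) -> float:
    """Approximate a * b using Mitchell's logarithmic approximation.

    Each operand is written as 2^e * (1 + x) with 0 <= x < 1, and
    log2(1 + x) is approximated by x, so the two fractions are added
    instead of multiplied.
    """
    if a == 0.0 or b == 0.0:
        return 0.0
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    ma, ea = math.frexp(abs(a))  # abs(a) = ma * 2^ea with 0.5 <= ma < 1
    mb, eb = math.frexp(abs(b))
    xa, xb = 2.0 * ma - 1.0, 2.0 * mb - 1.0  # fractions in [0, 1)
    ea, eb = ea - 1, eb - 1                  # exponents for the (1 + x) form
    s = xa + xb
    if s < 1.0:
        return sign * math.ldexp(1.0 + s, ea + eb)
    # The fraction sum overflowed: carry into the exponent.
    return sign * math.ldexp(s, ea + eb + 1)


if __name__ == "__main__":
    for a, b in [(2.0, 3.0), (1.5, 1.5), (-1.5, 1.5), (3.3, 7.7)]:
        approx, exact = mitchell_mul(a, b), a * b
        print(a, b, approx, exact, (exact - approx) / exact)
```

The approximation never overestimates the magnitude of the product, and its relative error peaks at 1/9 (about 11.1%) when both fractions equal 0.5, as in the `1.5 * 1.5` case.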

# IV. RELATED WORK

Approximate hardware for DNNs typically refers to using approximate arithmetic units for DNN inference. In the computation of DNNs, multiplications dominate the operations and consume the most power and area. Thus, utilizing approximate multipliers could improve inference efficiency and has been extensively studied [29], [19], [18], [31]. For example, in [29], Saadat et al. replaced accurate multipliers with minimally biased multipliers in AlexNet during the inference stage. Other works [19], [18], [31] all show significant energy savings when using approximate multipliers with minimal accuracy degradation. These works use only a limited variety of DNNs, support only inference, and do not report the run-time of simulations. TFapprox [33] enables fast and flexible simulation of approximate multipliers in DNN inference thanks to TensorFlow integration and GPU acceleration. However, TFapprox is limited to 8-bit integer multiplications and supports inference only.

Fig. 4. DNN forward and backpropagation using approximate multipliers.

Most of these efficiency gains were made for DNN inference. Few works focus on training DNNs with approximate multipliers. In [19], Kim et al. claimed that approximate multipliers are not suitable for training because the DaDianNao [8] project failed to train a DNN with a fixed-point data type. The 16-bit fixed-point format has a limited dynamic range that prevents some necessary small gradients from being represented [36]. In [14], Hammad et al. attempted to train a DNN with approximate multipliers first and further improved convergence with accurate multipliers. However, only VGGnet was evaluated; thus, its feasibility is not shown on a wide variety of neural networks. The work in [9] first attempted to train neural networks with logarithm-based approximate multipliers. However, the evaluated model is simple: only a fully connected layer was considered. A fully connected layer is well known to be redundant; for example, in [15], Han et al. achieved a compression rate of 40 times for LeNet-300-100 (MLP), with parameters reducing from 1070 KB to 27 KB. Therefore, the convergence could well have been caused by redundancy. Given these concerns, it is challenging to show the efficacy of the work in [9].

To overcome the above limitations, we first present a novel LUT-based approximate FP multiplier simulator on GPU (AMSim) that could efficiently simulate any type of approximate FP multipliers. Then, we integrate the AMSim into the ApproxTrain framework, supporting training and inference. Several neural network architectures with convolution and dense layers were evaluated, including training and testing on MNIST, CIFAR10, and ImageNet datasets.

# V. AM SIMULATION: EFFICIENT LOOKUP TABLE-BASED APPROXIMATE FP MULTIPLIER SIMULATION

As explained in Section III-A, a DNN consists of different types of layers, several of which involve multiplications. The computations involved in forward propagation and backpropagation of these layers were depicted graphically in Figure 3.



Fig. 5. Overview of creation of custom ops for ApproxTrain.

To enable simulation of approximate multipliers in DNN training, all multiplications in the forward and backpropagation should be replaced by approximate multiplications, computed by AMSim as depicted in Figure 4. However, there is no native hardware support for approximate FP multipliers on commercial processing units; hence, software simulation of approximate FP multipliers is needed.

Efficiently simulating approximate FP multipliers in software is particularly challenging. Firstly, approximate FP multipliers have differing computation procedures, so direct C/C++ simulation (bit manipulation) cannot guarantee consistent low overhead independent of the type of approximate FP multiplier. Secondly, multiplier designers may find it challenging to optimize multipliers in order to improve simulation speed. Furthermore, GPU-based simulation must be utilized to efficiently couple approximate FP multipliers with training and inference algorithms. Therefore, this section presents a novel flow for seamlessly converting the C/C++ simulation implementation into an optimal LUT-based approximate FP simulator on GPU, AMSim.

As depicted in the red dashed box in Figure 5, LUT generation (see Section V-A) takes user-defined multiplier C/C++ code as input and generates the mantissa product LUT. This generation step needs to be run only once for a given approximate FP multiplier, and the LUT is written to a binary file; users can then load the LUT into AMSim (see Section V-B) at run-time. This flow is designed based on the following key observations: (1) mantissa multiplication contributes 91.10% of the area and 92.70% of the power in the circuit of an accurate FP multiplier [28]; thus, most AM designs [29], [30], [20] target optimizing the mantissa multiplication stage and keep the existing computation of exponent and sign unchanged; (2) different designs have differing approximate mantissa multiplication procedures, making it challenging to develop an efficient approximate FP multiplier simulator that fits all designs. Mantissa multiplications are therefore simulated by LUTs (see Lookup Table Generation below), and the whole procedure for approximate FP multiplication involves three steps (see AMSim below): (1) retrieve the mantissa multiplication result from the LUT; (2) compute the sign and exponent; (3) concatenate the sign, exponent, and mantissa multiplication result to obtain the final approximate multiplication result.

## A. Lookup Table Generation

In [33], the authors store all approximate integer multiplication results in a LUT, which occupies only 128 kB of GPU memory. However, this solution is not practical for approximate FP multipliers. The de-facto industrial standard for FP is 16 bits (bfloat16 and FP16), and storing the entire result of multiplication in a LUT would require 8.6 GB of memory, which is too costly for a GPU. We propose to store only the mantissa multiplication results in LUTs, based on the observations above. In the case of bfloat16, there are 7 mantissa bits, resulting in $2^7 \times 2^7 \times 4$ bytes (each entry is stored as 4 bytes<sup>2</sup>) = 65,536 bytes (about 65.5 kB), which is negligible compared to the 8 GB of memory in the GTX1080. Algorithm 1 takes the bit-width of the mantissa $M$ and the approximate FP multiplication C/C++ code $approx\_mul$ to generate the mantissa multiplication LUT. In lines 2-4 of Algorithm 1, two FP numbers A and B are initialized with arbitrary signs and exponents, since the mantissa product is independent of signs and exponents. It should be noted that the exponents of A and B and the exponent of their product must not be special cases (0, Inf, and NaN). Otherwise, the carry from the mantissa multiplication cannot be detected; the detailed conditions are presented in line 4. In our AMSim, the carry from the mantissa multiplication is used to adjust the exponent. Lines 5-16 of Algorithm 1 capture the nested loop used to generate all possible mantissa combinations. The mantissas of A and B are populated by the nested loop indices in line 7. The populated A and B are then passed into the user-defined C/C++ function, $approx\_mul$, which generates an approximate FP product C. Lines 9-13 of Algorithm 1 describe how to detect the carry without knowing any details about how the hardware or simulation is implemented. The unnormalized exponent ($un\_normalized\_exp$) of C is calculated in line 9 and is compared with the real exponent returned by the user-defined C/C++ function ($approx\_mul$) in lines 11-13 to set the carry. In AMSim, the carry bit needs to be retained in order to adjust the exponent if the real exponent of C is greater than the unnormalized exponent of C. Finally, the carry bit and the mantissa result are stored in the same LUT entry ($mntmult\_lut$) in line 14. A script is provided for multiplier designers to generate LUTs, on the condition that a C/C++ approximate multiplication function is properly implemented by the designer.

<sup>2</sup>Storing it as 4 bytes eliminates the shift operation after retrieving from LUTs in AMSim, which further accelerates AMSim.

Fig. 6. GEMM performance comparison for direct C Simulation and AMSim for multipliers REALM16 [30], AFM16 [29] and MIT16 [25]. Note, FP32 is the time used by native hardware.

**Algorithm 1** Approximate Mantissa Multiplications Lookup Table Generation

**input:** $M$, $approx\_mul$ $\triangleright$ $M \in [1,11]$ is the bit-width of the mantissa; $approx\_mul$ is the approximate FP multiplication C code, which takes two FP32 numbers as inputs and outputs their approximate product as FP32.

**output:** $mntmult\_lut$ $\triangleright$ $mntmult\_lut$ is the mantissa multiplications lookup table. It has $2^{2M}$ entries, and each entry is 4 bytes.

```
 1: function Approximate Mantissa Multiplications Lookup Table Generation
 2:     A ← empty FP32; B ← empty FP32
 3:     Sign(A) ← 0 or 1; Sign(B) ← 0 or 1
 4:     Exponent(A) ← N; Exponent(B) ← K;  ∀ N, K ∈ [1, 254]; ∀ (N + K − 127) ∈ [1, 254]
 5:     for k = 0 to 2^M − 1 do
 6:         for j = 0 to 2^M − 1 do
 7:             Mantissa(A) ← k; Mantissa(B) ← j
 8:             C ← approx_mul(A, B)
 9:             un_normalized_exp ← Exp(A) + Exp(B) − 127
10:             Carry ← 0
11:             if un_normalized_exp < Exponent(C) then
12:                 Carry ← 1
13:             end if
14:             mntmult_lut[k × 2^M + j] ← (Carry ≪ 23) | Mantissa(C)
15:         end for
16:     end for
17: end function
```
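Algorithm 1 can be prototyped on the CPU in a few lines of Python. The sketch below is illustrative only and is not the generation script shipped with ApproxTrain. It assumes that the loop indices are placed in the top $M$ bits of the 23-bit FP32 mantissa field, fixes both input exponents to the biased value 127, and uses a hypothetical stand-in for the designer's `approx_mul` (an exact product truncated to $M$ mantissa bits). The helper `lut_size_bytes` reproduces the memory figures quoted in the text.

```python
import struct


def f32_bits(x: float) -> int:
    """IEEE-754 single-precision bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(bits: int) -> float:
    """FP32 value represented by a 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def make_truncating_mul(M: int):
    """Hypothetical stand-in for a designer-provided approx_mul:
    exact product with the mantissa truncated to M bits."""
    keep = 0xFFFFFFFF ^ ((1 << (23 - M)) - 1)

    def approx_mul(a: float, b: float) -> float:
        return bits_f32(f32_bits(a * b) & keep)

    return approx_mul


def generate_lut(M: int, approx_mul) -> list:
    """Build the mantissa product LUT with 2^(2M) entries (Algorithm 1)."""
    exp_a = exp_b = 127  # biased exponents; N + K - 127 stays inside [1, 254]
    shift = 23 - M       # assumption: indices occupy the top M mantissa bits
    lut = [0] * (1 << (2 * M))
    for k in range(1 << M):
        for j in range(1 << M):
            a = bits_f32((exp_a << 23) | (k << shift))
            b = bits_f32((exp_b << 23) | (j << shift))
            c = f32_bits(approx_mul(a, b))
            un_normalized_exp = exp_a + exp_b - 127
            # A larger real exponent means the mantissa product carried.
            carry = 1 if un_normalized_exp < ((c >> 23) & 0xFF) else 0
            lut[(k << M) + j] = (carry << 23) | (c & 0x7FFFFF)
    return lut


def lut_size_bytes(M: int, entry_bytes: int = 4) -> int:
    """Memory needed by the mantissa LUT for an M-bit mantissa."""
    return (1 << (2 * M)) * entry_bytes


if __name__ == "__main__":
    lut = generate_lut(7, make_truncating_mul(7))
    print(len(lut), lut_size_bytes(7), lut_size_bytes(11))
```

Because the carry is inferred purely from the exponent of the returned product, the generator treats `approx_mul` as a black box, which is what allows the same flow to serve different multiplier designs.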

## B. AMSim

AMSim simulates approximate FP multipliers on the GPU; it is composed of the three steps listed at the start of Section V, and Algorithm 2 elaborates this mechanism in detail. Algorithm 2 takes two FP numbers $a$ and $b$ and the mantissa product LUT, and outputs the approximate product $c$ of $a$ and $b$. In line 7, the mantissas of $a$ and $b$ are extracted; then, in line 8, the LUT index is computed by concatenating the two mantissas. In lines 9-10 of Algorithm 2, the mantissa multiplication result and the carry are separated. In line 11, the sign of the approximate multiplication output $c$ is computed as the XOR (exclusive-or) of the signs of $a$ and $b$. Exceptional cases (0 and Infinity) and normal cases are handled in lines 12 to 20 of Algorithm 2. If the biased exponent of $c$ is not greater than zero, or one of the inputs ($a$ or $b$) is zero (line 13), then $c$ is zero. When the biased exponent of $c$ is greater than or equal to 255 (line 15), $c$ overflows and results in Infinity. In line 18 of Algorithm 2, the biased exponent is adjusted based on the carry in the normal case. As a final step (line 19), the sign, exponent, and mantissa are concatenated to form $c$.

**Algorithm 2** Approximate FP Multiplication Simulator (AMSim)

**input:** $a$, $b$, $mntmult\_lut$ $\triangleright$ $a$ and $b$ are FP inputs to the simulation; $mntmult\_lut$ is the mantissa product lookup table.

**output:** $c$ $\triangleright$ $c$ is the approximate product of $a$ and $b$.

```
 1: global variables
 2:     M, Mantissa Bit-width.
 3:     M_MASK, Mantissa Mask.
 4:     E_MASK, Exponent Mask.
 5: end global variables
 6: function Approximate FP Multiplication Simulation
 7:     Amnt ← M_MASK & a; Bmnt ← M_MASK & b
 8:     Mntmult ← mntmult_lut[(Amnt ≫ (23 − M × 2)) + (Bmnt ≫ (23 − M))]
 9:     Carry ← Mntmult & 0x00800000
10:     Mntmult ← Mntmult & 0xFF7FFFFF
11:     Sign ← (a ⊕ b) & S_MASK
12:     Exp ← (((a & E_MASK) + (b & E_MASK)) ≫ 23) − 127
13:     if Exp ≤ 0 or (a & E_MASK) == 0 or (b & E_MASK) == 0 then
14:         c ← 0
15:     else if Exp ≥ 255 then
16:         c ← INFINITY
17:     else
18:         Exp ← Exp + Carry
19:         c ← Sign | (Exp ≪ 23) | Mntmult
20:     end if
21: end function
```

Fig. 7. ApproxTrain: From end-user Perspective.

AMSim is implemented as an inline device function and compiled as part of the CUDA kernel. The LUT is retrieved from texture memory on the GPU, an approach similar to that described in [33]. Texture memory has its own dedicated texture cache, connected to the L2 cache, so LUT accesses do not disturb the DNN workload in the L1 cache; thus, this approach reduces memory transaction overhead. Note that, although 16-bit FP is used as the example, our approach enables simulation of generic (1, e, m) FP approximate multiplication: the number of mantissa bits m can be selected from 1 (16 bytes) to 11 (16.8 MB, 1.6% of total GTX1080 memory), supporting a wide range of precisions. Additionally, the number of exponent bits e can be varied from 1 to 8, provided that a proper exponent casting function is given.
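The lookup-based multiplication of Algorithm 2 can likewise be prototyped on the CPU. The following Python sketch is illustrative and is not the CUDA device function used in ApproxTrain. To stay self-contained it builds its LUT with a hypothetical helper `build_lut` that uses integer arithmetic for an exact product truncated to $M$ mantissa bits, rather than a designer-supplied multiplier. Two details are assumptions of this sketch: the carry bit is shifted down to 0 or 1 before being added to the exponent, and an extra overflow check is applied after that addition. NaN and Infinity inputs are not handled.

```python
import struct

S_MASK = 0x80000000  # sign bit of FP32
E_MASK = 0x7F800000  # exponent field of FP32


def f32_bits(x: float) -> int:
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(bits: int) -> float:
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def build_lut(M: int) -> list:
    """Hypothetical LUT: exact mantissa product truncated to M bits."""
    keep = ((1 << M) - 1) << (23 - M)
    lut = []
    for ma in range(1 << M):
        for mb in range(1 << M):
            p = ((1 << M) + ma) * ((1 << M) + mb)  # (1.ma) * (1.mb), 2M fraction bits
            carry = p >> (2 * M + 1)               # 1 if the product is >= 2.0
            frac_bits = 2 * M + carry
            frac = p - (1 << frac_bits)            # drop the implicit leading one
            lut.append((carry << 23) | ((frac << (23 - frac_bits)) & keep))
    return lut


def amsim(a: float, b: float, lut: list, M: int) -> float:
    """Approximate a * b with a mantissa lookup plus exact sign/exponent."""
    ab, bb = f32_bits(a), f32_bits(b)
    m_mask = ((1 << M) - 1) << (23 - M)            # top M mantissa bits
    a_mnt, b_mnt = ab & m_mask, bb & m_mask
    entry = lut[(a_mnt >> (23 - 2 * M)) + (b_mnt >> (23 - M))]
    carry = (entry & 0x00800000) >> 23             # assumption: used as 0 or 1
    mnt = entry & 0xFF7FFFFF
    sign = (ab ^ bb) & S_MASK
    exp = (((ab & E_MASK) + (bb & E_MASK)) >> 23) - 127
    if exp <= 0 or (ab & E_MASK) == 0 or (bb & E_MASK) == 0:
        return 0.0
    if exp >= 255:
        return bits_f32(sign | E_MASK)             # signed infinity (assumption)
    exp += carry
    if exp >= 255:                                 # extra guard, not in Algorithm 2
        return bits_f32(sign | E_MASK)
    return bits_f32(sign | (exp << 23) | mnt)


if __name__ == "__main__":
    lut = build_lut(7)
    for a, b in [(1.5, 2.0), (1.5, 1.5), (-3.0, 0.5), (3.14159, 2.71828)]:
        print(a, b, amsim(a, b, lut, 7), a * b)
```

The multiplier-specific behaviour lives entirely in the table, so the per-multiplication work (two masks, one lookup, one integer addition) is the same for any design, which is consistent with the design-independent simulation speed reported for AMSim.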

# C. GEMM performance evaluation

As shown in Figure 6, we evaluate AMSim and direct C simulation with different approximate multipliers in GEMM (general matrix multiplication, see Section VI-D). The two input matrices to GEMM are both 8000 by 8000, and the experiment is performed on a GTX1080 GPU. Our approach, AMSim, is consistently 2x slower than native hardware FP32 for REALM16, AFM16, and MIT16, whereas the direct C simulations incur performance overheads between 4.6x and 78.2x.

In the following section, we present the integration of the proposed AMSim into the framework, ApproxTrain, to enable DNN training/inference using approximate FP Multipliers.

# VI. APPROXTRAIN

ApproxTrain integrates AMSim into TensorFlow, so that different DNN architectures can be efficiently constructed and evaluated using high-level APIs. In ApproxTrain, we create custom TF *ops* (see Section III for an explanation of *ops*) to support different types of DNN layers with approximate multiplication. To equip our custom *ops* with AMSim, we developed GPU-accelerated custom CUDA kernels for the implementation of our custom TF *ops*. These custom CUDA kernels are needed because the standard *ops* available in the open-source TF library use closed-source cuDNN/cuBLAS libraries in the backend that cannot be modified. Thus, ApproxTrain enables fast evaluation of training/inference of different DNN architectures using different approximate multiplier designs.

In the following subsections, we first present an overview of ApproxTrain, followed by a detailed description of our approach to create two custom TF *ops* and the underlying CUDA kernels.

# A. ApproxTrain: Framework Overview

An overview of the ApproxTrain use-case is shown in Figure 7. In addition to the normal design flow of TF, a user simply needs to: (1) provide functional models of the approximate multipliers in C/C++; and (2) replace the standard layer *ops* in the DNN architecture with the approximate versions from ApproxTrain. Example code snippets for such a replacement are shown in Listing 1 and Listing 2. After importing the compiled library of the custom *ops* from ApproxTrain, the DENSE *op* (fully connected) and CONV2D *op* (convolutional layer) are simply replaced with their approximate versions AMDENSE and AMCONV2D, respectively.

```
from tensorflow.keras import layers, models

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3)))
model.add(layers.Dense(32))
```

Listing 1. DNN model using standard TensorFlow ops for convolutional and dense layers.

```
from tensorflow.keras import models
from python.keras.layers.am_convolutional import AMCONV2D
from python.keras.layers.am_dense import AMDENSE

model = models.Sequential()
model.add(AMCONV2D(32, (3, 3)))
model.add(AMDENSE(32))
```

Listing 2. DNN model using ApproxTrain for approximate convolutional and dense layers.

Figure 5 depicts an overview of how our custom ops are created and compiled in ApproxTrain. The main component is the approximate operator C++ class (inside the blue dashed box), which performs multiple operations such as input validation, serializing tensors to linear arrays, memory allocation, and the computations themselves. The computational part of the approximate operator C++ class includes functions that calculate feedforward propagation and backpropagation by invoking our custom CUDA kernels or CPU kernels<sup>3</sup>. The custom CUDA kernels (explained in Section VI-D) are responsible for linear algebra operations and data rearrangement and are equipped to use AMSim. AMSim is implemented as a device function that runs on the GPU. As stated before, the custom CUDA kernels are written from scratch because the closed-source cuDNN and cuBLAS libraries cannot be modified to use approximate multipliers.

All CUDA kernels are compiled by NVCC, and the C++ operator class is compiled with g++. Then, the compiled C++ object files are linked with the compiled CUDA kernel objects to form the approximate operator shared library. This approximate operator run-time library is then enclosed in a Python wrapper, which is registered into the standard TF library. Note that the compilation steps above only need to be done once. Instead of replacing the corresponding original operators in TensorFlow, the new approximate operators are kept alongside the original ones. Given user-defined approximate multiplier C code, LUTs can be obtained by Lookup Table Generation (explained in Section V-A), as depicted in Figure 5. The obtained LUTs are loaded into the approximate operator run-time library at run-time to simulate different functional models of approximate multipliers. The Python wrappers have the same parameters as the original operators, so users simply need to change the names of the original operators to the approximate ones to simulate the approximate multipliers, as demonstrated in the code listings above.

We have currently added two operators to our framework: AMDENSE (approximate Dense layer) and AMCONV2D (approximate Conv2D layer). These two operators allow us to cover a large portion of DNN architectures.

# B. AMCONV2D: Conv. Layer Custom Op with Approximation

This subsection explains the approach used to realize the forward and backward propagation in the custom approximate operator AMCONV2D.

**Forward propagation:** Forward propagation takes activations  $A^{l-1}$  and  $QuantizedWeights^{l}$  to compute output activations  $A^l$  (Figure 3). We use the IM2COL+GEMM approach [7], [13] to perform forward propagation, because this approach exposes fine-grained parallelism suitable for GPU acceleration. Figure 8 (a) illustrates this approach. In Figure 8 (a),  $A^{l-1}$  is the input to the IM2COL operation. The output of IM2COL and  $QuantizedWeights^{l}$  are then subjected to the GEMM operation. The GEMM kernel contains AMSim, which can either invoke native hardware multiplications (\* operator) or perform approximate multiplications using LUTs. Algorithm 3 describes our approach for forward propagation on GPU. First, the sizes of the GPU global memory arrays are computed (line 2 in Algorithm 3) and the arrays are allocated (line 3). Then, the IM2COL kernel is invoked on the GPU (line 4 in Algorithm 3), followed by the GEMM kernel (line 5). Details of these GPU kernels are discussed in Section VI-D.

# **Algorithm 3** Approximate Forward propagation

**input:**  $A^{l-1}$ ,  $W^l$ ,  $S$ ,  $P$ ,  $LUT$  ▷  $A^{l-1}$ : activation from layer l-1;  $W^l$ : weight from layer l; S: stride; P: padding; LUT: mantissa product LUT

**output:**  $A^l$  ▷  $A^l$ : activation from layer l

- 1: function Approximate Forward Propagation
- 2:  $PSize, ColSize \leftarrow calculate\_sizes(A^{l-1}, W^l, P, S)$  ▷ PSize: the size of padding; ColSize: the size of Im2Col results
- 3:  $allocate\_GPU\_memory(PSize, ColSize, LUT)$  ▷ For  $A^{l-1}$ , Columns (output of IM2COL),  $W^l$ ,  $A^l$  and LUT
- 4:  $Columns \leftarrow IM2COL\_kernel(A^{l-1}, PSize, ColSize)$
- 5:  $A^l \leftarrow GEMM\_kernel(Columns, W^l, LUT...)$  ▷ ... refers to m, n, k, lda, ldb and ldc, which are omitted for simplicity;  $A^l$  is the output activation
- 6: end function
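The IM2COL+GEMM flow of Algorithm 3 can be sketched on the CPU in plain Python as follows. This is a simplified illustration, not the CUDA implementation: it assumes a single channel, no padding, and hypothetical function names, and the `mul` argument is a placeholder for where an AMSim-style multiply would be plugged in.

```python
from operator import mul as exact_mul

def im2col(image, kh, kw, stride=1):
    """Flatten every kh x kw window of a 2-D image into one row."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(0, h - kh + 1, stride):
        for j in range(0, w - kw + 1, stride):
            rows.append([image[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows

def gemm(a, b, mul=exact_mul):
    """C = A x B with a pluggable scalar multiply; additions stay exact."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(mul(a[r][t], b[t][c]) for t in range(k))
             for c in range(m)] for r in range(n)]

def conv2d_forward(image, kernel, stride=1, mul=exact_mul):
    """Convolution (cross-correlation) expressed as IM2COL followed by GEMM."""
    kh, kw = len(kernel), len(kernel[0])
    columns = im2col(image, kh, kw, stride)
    # The kernel becomes a (kh*kw) x 1 matrix, flattened in window order.
    w_col = [[kernel[di][dj]] for di in range(kh) for dj in range(kw)]
    flat = [row[0] for row in gemm(columns, w_col, mul)]
    out_w = (len(image[0]) - kw) // stride + 1
    return [flat[i:i + out_w] for i in range(0, len(flat), out_w)]
```

For example, `conv2d_forward([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1, 0], [0, 1]])` sums each 2x2 window's main diagonal. Because every multiplication happens inside `gemm`, replacing `mul` is enough to change the multiplier for the whole layer.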

<sup>3</sup>The CPU implementation was used for validating our GPU implementation and as a benchmark, but could also be used by a user who does not have GPU access, at the cost of higher run-time.



Fig. 8. Forward propagation and back-propagation implementation overview.

Although not shown in Algorithm 3 for simplicity, a loop that invokes the kernels iteratively on tiles of the array  $A^{l-1}$  is implemented so that our framework can train large architectures on large datasets. This is necessary because the CUDA grid (group of blocks of CUDA threads) dimension along the y-axis is limited to 65535 [1], and thus large input data cannot fit into the GPU grid in its entirety.
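The tiling loop can be pictured with the short Python sketch below. Which array dimension is mapped to the grid's y-axis, and the names `tile_ranges` and `run_tiled`, are assumptions made for illustration; `kernel` stands in for one launch of the IM2COL and GEMM kernels on a tile.

```python
MAX_GRID_Y = 65535  # CUDA limit on the grid's y-dimension

def tile_ranges(n_rows, max_rows=MAX_GRID_Y):
    """Split [0, n_rows) into consecutive half-open ranges of at most max_rows."""
    return [(start, min(start + max_rows, n_rows))
            for start in range(0, n_rows, max_rows)]

def run_tiled(rows, kernel, max_rows=MAX_GRID_Y):
    """Apply `kernel` to one tile at a time and concatenate the results."""
    out = []
    for start, stop in tile_ranges(len(rows), max_rows):
        out.extend(kernel(rows[start:stop]))
    return out
```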

**Backpropagation:** The backpropagation involves two gradient computations: weights gradient and gradient for the preceding layer (labeled as  $Errors^l$  in Figure 3). Algorithm 4 elaborates our backpropagation approach for efficient GPU implementation. Similar to Algorithm 3, here too, we first calculate the GPU array sizes and allocate them in lines 2-3 of Algorithm 4. Lines 4-5 are the invocation of the kernels for the weights gradient (explained below), and lines 6-8 are for the preceding layer gradient (explained below). Note that for backpropagation, we also implemented tiling (as we have explained for forward propagation) despite not being shown in Algorithm 4.

1) Weights Gradient Computation: We restructured the weights gradient computation to exploit the IM2COL+GEMM approach as illustrated in Figure 8 (b). We first subject  $Errors^{l+1}$  to dilation (inserting zeros between elements based on the stride parameter). Then, this  $DilatedErrors^{l+1}$  is fed to GEMM (Figure 8 (b)).

As opposed to forward propagation, mapping backpropagation to GEMM along with IM2COL in a way that efficiently exploits the GPU architecture is challenging. A naive method of mapping the weights gradient computation to GEMM would be to implement a separate GPU kernel that performs the dilation operation and to invoke it before the GEMM kernel. However, this naive method would be inefficient for two reasons.

# **Algorithm 4** Approximate Backpropagation

**input:**  $a^{l-1}$ ,  $W^l$ ,  $Error^{l+1}$ ,  $Stride$  ▷  $a^{l-1}$ : activation from layer l-1;  $W^l$ : weight from layer l

**output:**  $W^{l\prime}$ ,  $Errors^l$  ▷  $W^{l\prime}$ : gradient of  $W^l$

- 1: function APPROXIMATE BACKPROPAGATION
- 2:  $PSize, ColSize \leftarrow calculate\_sizes(a^{l-1}, W^l, Errors^{l+1})$  ▷ PSize: the size of padding; ColSize: the size of Im2Col results
- 3:  $allocate\_GPU\_memory(PSize, ColSize, LUT)$  ▷ For  $a^{l-1}$ , Columns,  $DilatedError^{l+1}$ ,  $W^{l\prime}$  and  $Errors^l$
- 4:  $Columns_{a^{l-1}} \leftarrow IM2COL\_Weight\_Kernel(a^{l-1}, ColSize, PSize)$
- 5:  $W^{l\prime} \leftarrow GEMM\_Kernel(Columns_{a^{l-1}}, Error^{l+1}, LUT...)$
- 6:  $Columns_{PDError^{l+1}} \leftarrow IM2COL\_PLG\_Kernel(Error^{l+1}, ColSize, PSize)$  ▷ IM2COL kernel for preceding layer gradient (PLG);  $PDError^{l+1}$  is the  $PaddedDilatedError^{l+1}$
- 7:  $(W^l)_r^T \leftarrow Reverse\_Transpose\_kernel(W^l)$
- 8:  $Errors^l \leftarrow GEMM\_Kernel(Columns_{PDError^{l+1}}, (W^l)_r^T, LUT...)$
- 9: end function

First, invoking an additional kernel adds unnecessary performance overhead. Second, a dilated array would require several times the memory of the original array (depending on the *stride* value), consequently reducing the number of non-zero elements that can be stored in the GPU global memory and thus requiring more tiling (similar to the tiling explained for forward propagation above). Instead of such a naive approach, we implicitly perform this dilation inside the IM2COL\_Weight\_Kernel (a modified IM2COL kernel, line 4 of Algorithm 4) by skipping the elements in  $A^{l-1}$  that would be multiplied by zeros if the  $Error^{l+1}$  array were dilated.
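The equivalence between explicit dilation and element skipping can be checked with a small Python sketch. It assumes a single channel and no padding, uses hypothetical function names, and works directly on 2-D lists rather than going through IM2COL and GEMM, so it only illustrates the indexing idea.

```python
def dilate(err, stride):
    """Insert (stride - 1) zeros between neighbouring elements of err."""
    h, w = len(err), len(err[0])
    out = [[0.0] * ((w - 1) * stride + 1) for _ in range((h - 1) * stride + 1)]
    for i in range(h):
        for j in range(w):
            out[i * stride][j * stride] = err[i][j]
    return out

def weight_grad_explicit(act, err, kh, kw, stride):
    """Naive variant: materialise the dilated error, then correlate."""
    d = dilate(err, stride)
    return [[sum(act[di + i][dj + j] * d[i][j]
                 for i in range(len(d)) for j in range(len(d[0])))
             for dj in range(kw)] for di in range(kh)]

def weight_grad_implicit(act, err, kh, kw, stride):
    """Skipping variant: read only the activations that meet a non-zero error."""
    return [[sum(act[di + i * stride][dj + j * stride] * err[i][j]
                 for i in range(len(err)) for j in range(len(err[0])))
             for dj in range(kw)] for di in range(kh)]
```

Both functions return the same gradient, but the implicit variant never allocates the dilated array and performs no multiplications by the inserted zeros.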

2) Preceding Layer Gradient: We also restructure the computation of the preceding layer gradient to exploit the IM2COL+GEMM approach, as shown in Figure 8 (c). For this, we first subject  $Errors^{l+1}$  to dilation (inserting zeros between elements as explained before), followed by padding (inserting zeros around the image along the height and width dimensions). This  $PaddedDilatedErrors^{l+1}$  is the input to IM2COL as shown in Figure 8 (c). Then, we subject  $QuantizedWeights^l$  to transposition and reversal of elements. This  $TransposedReversedQuantizedWeights^l$  is fed as the input to the GEMM operation, as shown in Figure 8 (c), along with the output of IM2COL.

Exploiting GEMM in preceding layer gradient ( $Errors^l$  in Figure 8 (c)) of AMCONV2D for efficient execution on the GPU is even more non-trivial since both transposition and reversal of elements in  $Weights^l$  are involved, in addition to padding and dilation of  $Errors^{l+1}$ .

Instead of having a separate kernel for dilating  $Errors^{l+1}$ , which would cause additional kernel invocation overhead, we integrate the dilation operation into the IM2COL\_PLG\_Kernel (a modified IM2COL kernel that performs padding and dilation), where each thread copies a zero into the IM2COL results if the current pixel is at a dilated position. Unfortunately, unlike in backpropagation for the weights gradient, where the second operand to GEMM requires dilation, here the dilation must be performed on the input to IM2COL. Thus, we cannot simply skip elements as is done for the weights gradient computation, even though this requires more GPU memory.

Transposition and the reversal of  $QuantizedWeights^l$  can be implicitly done inside the GEMM kernel by manipulating the array index when accessing the second operand for GEMM. However, this would be highly inefficient because the global memory access pattern would not enable memory coalescing. Thus, here it is better to sacrifice some time to invoke a separate kernel that solely performs the reversal and transposition of  $QuantizedWeights^l$ , so that more time can be saved during the memory accesses of GEMM operation.
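The preceding layer gradient computation can be sketched in Python as below. The sketch assumes a single channel (so the transposition, which swaps channel dimensions, has no effect and only the reversal remains), a square kernel, and no forward padding; the function names are hypothetical, and a direct window sum replaces the IM2COL and GEMM kernels.

```python
def pad_dilate(err, stride, pad):
    """Dilate err by `stride`, then surround it with `pad` zeros on each side."""
    h, w = len(err), len(err[0])
    dh, dw = (h - 1) * stride + 1, (w - 1) * stride + 1
    out = [[0.0] * (dw + 2 * pad) for _ in range(dh + 2 * pad)]
    for i in range(h):
        for j in range(w):
            out[pad + i * stride][pad + j * stride] = err[i][j]
    return out

def reverse_kernel(kernel):
    """Reverse the element order along both height and width."""
    return [row[::-1] for row in kernel[::-1]]

def input_grad(err, kernel, stride):
    """Correlate the padded, dilated error with the reversed kernel."""
    k = len(kernel)                      # square kernel assumed
    pd = pad_dilate(err, stride, k - 1)
    rk = reverse_kernel(kernel)
    out_h, out_w = len(pd) - k + 1, len(pd[0]) - k + 1
    return [[sum(pd[i + di][j + dj] * rk[di][dj]
                 for di in range(k) for dj in range(k))
             for j in range(out_w)] for i in range(out_h)]
```

With a single non-zero error element, the result is simply the kernel placed at the matching input position, which is what scattering the error back through the forward convolution would produce.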

Since AMCONV2D is implemented by the GEMM approach, all multiplications are done in the GEMM kernel; thus, we replace accurate multiplication in GEMM with the AMSim device function to enable simulation.

# C. AMDENSE: Dense Layer Custom Op with Approximation

Unlike in the convolution layers, in the dense layer, each neuron receives input from all neurons in the preceding layer (see Figure 9). Like AMCONV2D described above, forward propagation and backpropagation of AMDENSE need to be implemented to realize training. Compared to AMCONV2D, AMDENSE accounts for a small proportion of the total computation and thus contributes only a tiny fraction of the total training time. Thus, CUDA optimization efforts are not as crucial as for AMCONV2D.

**Forward propagation:** Forward propagation can be mapped to a matrix-vector multiplication where the weights in the dense layer form a 2-dimensional matrix, and the activations from the preceding layer form a 1-dimensional vector. This is shown using a simplified example in Figure 9, where the dense layer output is computed as:  $\begin{pmatrix} o_1 \\ o_2 \end{pmatrix} = \begin{pmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} w_{11}x_1 + w_{12}x_2 + w_{13}x_3 \\ w_{21}x_1 + w_{22}x_2 + w_{23}x_3 \end{pmatrix}$ ; where x is the activations and w is the weights. We implemented a separate matrix-vector multiplication CUDA kernel for this rather than using the previously used GEMM kernel, because shared memory-based tiling is superfluous for a 1-D vector.

**Backpropagation:** Similar to AMCONV2D, backpropagation in AMDENSE also involves computations of weights gradient and preceding layer gradient.

- 1) Weights Gradient Computations: The gradient of weights in the AMDENSE layer l is computed as  $\delta_{out}a_{in}^T$  where  $a_{in}$  is the activation from preceding layer l-1 and  $\delta_{out}$  is the error backpropagated from succeeding layer l+1. The gradient of weight in Figure 9 is computed as  $\begin{pmatrix} w_{11}' & w_{12}' & w_{13}' \\ w_{21}' & w_{22}' & w_{23}' \end{pmatrix} = \begin{pmatrix} o_1' \\ o_2' \end{pmatrix} \begin{pmatrix} x_1 & x_2 & x_3 \end{pmatrix}$ . The same matrix-vector multiplication CUDA kernel is used here.
- 2) Preceding Layer Gradient Computations: The gradient of input in the AMDENSE layer l is calculated as  $(w)^T \delta_{out}$  where  $(w)^T$  is the transpose of the weights in the layer l. The gradient of input of the given example in Figure 9 can be computed as  $\begin{pmatrix} x_1' \\ x_2' \\ x_3' \end{pmatrix} = \begin{pmatrix} w_{11} & w_{21} \\ w_{12} & w_{22} \\ w_{13} & w_{23} \end{pmatrix} \begin{pmatrix} o_1' \\ o_2' \end{pmatrix}$ . For the computation of the preceding layer gradient, we use the same matrix-vector kernel used for forward propagation. The transposition of the vector is implicitly handled because the elements are stored linearly in memory anyway.

We replace accurate multiplications in matrix-vector kernel with AMSim, considering that the matrix-vector kernel contains all multiplications of AMDENSE.
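The three AMDENSE computations for the Figure 9 example can be written out as a short Python sketch. The function names are hypothetical, and the `mul` argument is a placeholder for the point where AMSim would replace the exact multiplication.

```python
from operator import mul as exact_mul

def matvec(w, x, mul=exact_mul):
    """y = W x with a pluggable scalar multiply; additions stay exact."""
    return [sum(mul(w_ij, x_j) for w_ij, x_j in zip(row, x)) for row in w]

def dense_forward(w, x, mul=exact_mul):
    """Forward propagation: o = W x."""
    return matvec(w, x, mul)

def dense_weight_grad(delta_out, a_in, mul=exact_mul):
    """Weights gradient: outer product delta_out * a_in^T."""
    return [[mul(d, a) for a in a_in] for d in delta_out]

def dense_input_grad(w, delta_out, mul=exact_mul):
    """Preceding layer gradient: W^T * delta_out."""
    w_t = [list(col) for col in zip(*w)]
    return matvec(w_t, delta_out, mul)
```

For a 2x3 weight matrix, `dense_forward` returns 2 outputs, `dense_weight_grad` returns a 2x3 matrix matching the weights, and `dense_input_grad` returns 3 values matching the inputs.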

# D. Other Custom CUDA Kernels

The custom AM ops described above utilize several custom CUDA kernels. We developed these kernels to replace the kernels offered by the closed-source cuDNN and cuBLAS libraries. These custom CUDA kernels (which involve multiplication) are equipped to use AMSim to perform multiplication. In simple terms, these kernels can call the approximate multiplier function with the two operands as arguments instead of using the '\*' operator to multiply the two operands. A brief description of these custom kernels is given below.

Fig. 9. AMDENSE implementation illustration.

**GEMM kernel:** The GEMM kernel is a highly optimized kernel that uses a 2-D threading indexing model with 16x16 as the CUDA thread block size. 16x16 tiles of the input matrices are fetched to fast GPU shared memory (on-chip SRAM) from global memory to be used for repeated memory accesses.
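The blocking pattern of the GEMM kernel can be mimicked on the CPU with the Python sketch below. It is only an analogy: the tile copies stand in for the shared-memory loads, the loops over output elements stand in for the threads of one block, and the function name and `mul` placeholder are assumptions.

```python
from operator import mul as exact_mul

def tiled_gemm(a, b, tile=16, mul=exact_mul):
    """C = A x B computed tile by tile, with a pluggable scalar multiply."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for r0 in range(0, n, tile):              # one output tile per "thread block"
        for c0 in range(0, m, tile):
            for k0 in range(0, k, tile):      # march along the shared dimension
                # Stand-ins for the tiles fetched into shared memory.
                a_tile = [row[k0:k0 + tile] for row in a[r0:r0 + tile]]
                b_tile = [row[c0:c0 + tile] for row in b[k0:k0 + tile]]
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        acc = c[r0 + i][c0 + j]
                        for t, a_val in enumerate(a_row):
                            acc += mul(a_val, b_tile[t][j])
                        c[r0 + i][c0 + j] = acc
    return c
```

Each element of a tile is loaded once and reused for every output element of the tile, which is the reuse that motivates staging tiles in shared memory.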

IM2COL kernels: There are three separate IM2COL kernels, as we mentioned before: 1. IM2COL (line 4 of the Algorithm 3) for forward propagation 2. IM2COL\_Weight\_Kernel (line 4 of the Algorithm 4) for weights gradient and 3. IM2COL\_PLG\_Kernel (line 6 of the Algorithm 4) for preceding layer gradient. IM2COLs mentioned above are implemented by utilizing a 1-D threading indexing model with 256 as the CUDA thread block size.

IM2COL: Each thread in IM2COL first locates the element position of  $A^l$  (the output of forward propagation), then locates the patch's element position (a flattened window) corresponding to  $A^l$ . The above two steps are needed to copy input data into the correct output position. Finally, the element in the input is located and copied to the IM2COL output.

*IM2COL\_Weight\_Kernel:* The IM2COL\_Weight\_Kernel first locates the element position of  $WeightsGradient^l$  rather than  $A^l$  in forward propagation, since its output is the  $WeightsGradient^l$ . Then, the IM2COL\_Weight\_Kernel locates the element position in the patch related to  $WeightsGradient^l$ . Finally, the IM2COL\_Weight\_Kernel locates the element in  $A^{l-1}$  and copies it to the IM2COL output; note that skipping elements is performed here if stride is greater than 1.

*IM2COL\_PLG\_Kernel:* Similar to IM2COL, IM2COL\_PLG\_Kernel first locates the element position of the preceding layer gradient ( $Errors^{l}$  in Figure 8 (c)) and then locates the element position in the patch. After the above two steps, the element position of the input is located. However, this element position is computed based on the size of  $PaddedDilatedErrors^{l+1}$  rather than  $Errors^{l+1}$  (note that the input data to IM2COL\_PLG\_Kernel is still  $Errors^{l+1}$ , but the size of the input data is set to that of  $PaddedDilatedErrors^{l+1}$ ); thus, an additional check is implemented for each thread to determine whether the current position is a dilated position. The native IM2COL can handle padding, but the padding size is computed differently from that in forward propagation and weights gradient computation; the details are omitted here.

**Transpose-And-Reverse Kernel:** The TransposeAndReverse Kernel is a custom CUDA kernel that swaps dimensions of the data and reverses the element order. It uses a 2-D threading indexing model with 32x32 as the CUDA block size. It first gets the index of a pixel along the height and width dimensions of the input in order to reverse the elements. Then, it gets the index of the dimension that is to be swapped. Finally, the kernel swaps and reverses elements by manipulating these indices. This kernel improves spatial locality by rearranging the data order; thus, when the GEMM kernel loads data into shared memory, memory coalescing occurs.

**Matrix-Vector Multiplication Kernel:** The matrix-vector multiplication custom CUDA kernel is implemented using a 1-D threading model with 1024 threads in each block. Each thread performs n multiplications (n depends on the length of the vector).

# VII. EXPERIMENT SETUP

We use the presented ApproxTrain to perform a series of *training* and *inference* experiments for image classification. The experiments use various DNN architectures and datasets on different platforms. The purpose of these experimental evaluations is two-fold: (1) evaluate the efficacy of approximate multipliers in DNN training, i.e., the effect on training convergence and accuracy (Section VIII); and (2) evaluate the timing performance of ApproxTrain with different DNN architectures/datasets/platforms (Section IX). In the experiments, the framework inputs are the neural network architecture, dataset, and multiplier type. The outputs are the timing performance numbers and the validation and test accuracy of the classification task. The details of the datasets, neural network architectures, and other settings used in our experiments are given below.

**Datasets:** Three popular image-classification datasets are used in our experiments: MNIST [11], CIFAR-10 [21] and ImageNet [10]. MNIST is a dataset of hand-written digits consisting of 60,000 training and 10,000 test samples. Each sample is a 32×32 gray image. In the CIFAR-10 dataset, there are ten classes with 6000 32×32 coloured images per class. The dataset is divided into 50,000 training samples and 10,000 test samples. ImageNet contains 1.2 million training images, spanning 1000 object classes. MNIST and CIFAR-10 are usually considered small datasets, while ImageNet is one of the largest datasets available for image classification.

**Neural Network Architectures:** Five neural network architectures are used in our experiments: LeNet-300-100, LeNet-5 [23], and ResNet-18/34/50 [16]. LeNet-300-100 is a multilayer perceptron (MLP), while LeNet-5 is a convolutional neural network (CNN) with two convolution layers and three dense layers. ResNet is a deep convolutional network whose complexity can be adjusted by adding or removing building blocks [16]. ResNet-18/34/50 contain 18, 34 and 50 layers, respectively. The various combinations of datasets and architectures used in our experiments are listed in the first and second columns of Table III.

TABLE II
DATA-TYPES AND MULTIPLIERS USED IN EXPERIMENTS.

| Multiplier/<br>Datatype | Bit-width (s,e,m) | Description                             |
|-------------------------|-------------------|-----------------------------------------|
| FP32                    | (1,8,23)          | IEEE 754 standard format                |
| bfloat16                | (1,8,7)           | Brain Floating Point format [34]        |
| AFM32                   | (1,8,23)          | 32-bit version of approx. mult AFM [29] |
| AFM16                   | (1,8,7)           | 16-bit version of approx. mult AFM [29] |

**Datatype:** Our experiments use floating-point format instead of integer/fixed-point because training typically requires a higher dynamic range. We keep the sign (s) as 1-bit and the exponent (e) as 8-bit (similar to FP32 and bfloat16 [34]). The number of mantissa bits (m) is varied in different experiments to achieve different bit-widths. The details of various data types/multipliers and their bit widths are listed in Table II. Since exponents are the same in all formats, type-conversion is simply a matter of bit-truncation or bit-extension. All accumulation operations are performed in FP32 to realize the industry de-facto standard of mixed-precision training when lower bit-widths are used for multiplication [6].
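Because all formats share the 8-bit exponent, the conversion from FP32 to a (1, 8, m) format can be illustrated by masking off the low mantissa bits of the FP32 word. The Python sketch below does exactly that; the function name is hypothetical, and the result is kept in an FP32 container with the dropped bits set to zero, which corresponds to the bit-extension direction.

```python
import struct

def truncate_mantissa(x, m):
    """Keep the sign, the 8 exponent bits and the top m mantissa bits of an FP32 value."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    mask = (0xFFFFFFFF << (23 - m)) & 0xFFFFFFFF   # zero the low (23 - m) bits
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]
```

With m = 7 this gives the bfloat16-style layout of Table II: `truncate_mantissa(1 + 2**-8, 7)` drops the eighth mantissa bit, whereas `truncate_mantissa(1 + 2**-7, 7)` is unchanged.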

**Experiment platforms:** Three types of platforms are used for the experiments. System-I is equipped with a single NVIDIA V100 GPU and 12 core Intel Xeon Scalable (Cascade Lake) processor (24 CPUs per core). System-II is equipped with GTX1080 and i7 6600 CPU. System-I and System-II are used for run-time performance benchmarking (Section IX). In addition, to run the training convergence test for large datasets (Section VIII), we used another system which is a two-node cluster equipped with 8 V100 GPUs and two full Intel Xeon Scalable 'Cascade Lake' cores. To realize a multi-GPU (distributed) training environment on the high-end cluster, TensorFlow wrapped by Horovod is used. The operating system used on all systems is Ubuntu 18.04.

**Implementation Details:** The presented framework is integrated into TensorFlow 2. The tested TensorFlow version is 2.3.0, which requires CUDA 10.1 and cuDNN 7.6.5. The custom CUDA kernels are compiled with NVCC provided by CUDA 10.1, whereas the supporting C/C++ files are compiled with gcc-8/g++-8.

# VIII. RESULTS: TRAINING ACCURACY EVALUATION

In this section, we present the evaluation of training accuracy and convergence using the approximate multipliers [29] with ApproxTrain. For the following experiments, AFM32 and AFM16 [29] are used as representative approximate multipliers, whereas the FP32 and bfloat16 formats are used as the baselines. Figure 1 depicts the area and power efficiency of AFM16 and AFM32. In comparison with the FP32 multiplier, the AFM32 is 12x smaller and 24x more energy efficient, while the AFM16 is about 20x smaller and 50x more energy efficient. The different combinations of dataset/NN-architectures used in the experiments are listed as the title of each graph in Figure 10 (and also listed in the first two columns of Table III). For example, in Figure 10 (a), the MNIST dataset is used with the LeNet-300-100 architecture.



Fig. 10. Training curves for evaluated datasets and architectures using FP32, bfloat16 and approximate multipliers (AFM32 and AFM16). The convergence behaviour and convergence rate for AFM32 and AFM16 are similar to those for FP32 and bfloat16.

# A. Training Convergence and Test Accuracy

The training accuracy and convergence are depicted in Figure 10, where the training accuracy (y-axis) is plotted against training epochs (x-axis) for the four multipliers listed in Table II. The weights and parameters for the NN architectures are randomly initialized; however, for a given NN/dataset combination, the same random seed is used for all four multipliers (for a fair comparison among different multipliers). The training is run for several epochs until the validation accuracy stabilizes. The training converges in 20 or fewer epochs for the two LeNets, whereas it stabilizes in around 100 epochs for the CIFAR-10/ResNet combinations. From Figure 10, we observe that the training-accuracy plots for AFM32 and AFM16 closely follow the plots for FP32 and bfloat16. The observation applies to both the small datasets and the large dataset (ImageNet). In other words, training converges with approximate multipliers (AFM32 and AFM16), and the convergence behavior and convergence rate are similar to those for FP32 and bfloat16. Note that, as shown in Figure 1, AFM32 and AFM16 are much smaller and more power-efficient than FP32 and bfloat16 multipliers.

The final test accuracy results for the six dataset/architecture combinations are reported in Table III, for 32-bit and 16-bit formats. The third and sixth columns, presenting results for FP32 and bfloat16, are considered as baselines for 32-bit and 16-bit formats, respectively. The difference in test accuracy between the AFMs and the corresponding baselines is listed in the fifth and eighth columns. From Table III, for both data formats, we observe that the test accuracy for all dataset/architecture combinations using approximate multipliers is very similar to the baseline (accuracy degradation is within 0.10%). Note that such accuracy differences also exist between the widely adopted FP32 and bfloat16 formats (columns 3 and 6 of Table III). Therefore, we argue that such degradation is acceptable. In fact, in most cases, the accuracy for approximate multipliers is slightly better than the baselines (highlighted in blue in the table). A reason for this is that the error injected by erroneous approximate multiplications (AFMs) in training can be considered as stochastic noise, which is a type of regularization [27].

TABLE III
TEST ACCURACY RESULTS FOR TRAINING WITH DIFFERENT MULTIPLIERS.
ALL RESULTS ARE IN PERCENTAGE(%).

|               |                | 32-bit multipliers |       |       | 16-bit multipliers |       |       |
|---------------|----------------|--------------------|-------|-------|--------------------|-------|-------|
| Dataset       | Neural Network | FP32               | AFM32 | diff  | bfloat16           | AFM16 | diff  |
| Small Dataset |                |                    |       |       |                    |       |       |
| MNIST         | LeNet-300-100  | 96.90              | 97.10 | 0.20  | 96.70              | 96.80 | 0.10  |
| MNIST         | LeNet-5        | 98.30              | 98.30 | 0.00  | 98.30              | 98.30 | 0.00  |
| CIFAR10       | ResNet18       | 93.22              | 93.23 | 0.01  | 93.48              | 93.40 | -0.08 |
| CIFAR10       | ResNet34       | 93.51              | 93.57 | 0.06  | 93.73              | 93.85 | 0.12  |
| CIFAR10       | ResNet50       | 93.54              | 93.48 | -0.06 | 93.45              | 93.62 | 0.17  |
| Large Dataset |                |                    |       |       |                    |       |       |
| ImageNet      | ResNet50       | 73.10              | 73.00 | -0.10 | 73.10              | 73.10 | 0.00  |

# B. Cross-format Test Accuracy

For the ImageNet dataset, we perform another experiment where we evaluate the test accuracy with a multiplier that is different from the one used for training. In other words, we train the neural network using one multiplier type and test it using another. The purpose of this experiment is to observe if any drastic over-fitting occurs w.r.t. the used multiplier type.

TABLE IV
CROSS FORMAT TESTING FOR RESNET50-IMAGENET. ALL RESULTS ARE IN PERCENTAGE(%).

|                   |             | used for testing |           |           |           |
|-------------------|-------------|------------------|-----------|-----------|-----------|
|                   | Multipliers | FP32             | AFM32     | bfloat16  | AFM16     |
| used for training | FP32        | **73.10**        | 73.10     | 73.10     | 73.00     |
|                   | AFM32       | 73.00            | **73.00** | 73.00     | 73.10     |
|                   | bfloat16    | 73.00            | 73.00     | **73.10** | 73.00     |
|                   | AFM16       |                  |           |           | **73.10** |



Fig. 11. MNIST CNN pruning results with different multipliers.

The results of the experiment are listed in Table IV. The multipliers used for training are listed along the second column, while the multipliers listed across the second row are used for testing. The numbers on the diagonal (highlighted in bold) are the test results when the same multiplier is used for training and testing, and they serve as the baseline for each row. The rest of the entries in Table IV are the test-accuracy results when a different multiplier is used for testing. We observe that the difference in accuracy is within 0.10%, which we deem acceptable, as discussed in the previous subsection. Therefore, this experiment demonstrates that we may safely train and deploy a neural network with different multiplier types (including approximate multipliers), as the network does not drastically over-fit to the given data/multiplier type.

# C. Approximate Multiplier on top of Pruning

We also performed an experiment that couples pruning with the use of approximate multipliers in training. Pruning is a mechanism for efficient inference, and it involves repeated training effort. Thus, it is beneficial to improve the training efficiency and to demonstrate that our framework enables hardware/algorithm co-design. The pruning code/algorithm is implemented following the official TensorFlow example without any custom modifications. The pruning schedule is polynomial decay. The initial sparsity is set to 70%, and the final sparsity is set to different levels to find the optimal sparsity. First, a CNN with two convolution layers and three dense layers is pre-trained for 20 epochs. Then, the weights of the pre-trained CNN are loaded into a new model that is to be pruned. After every pruning step, the model is retrained for another two epochs to refine accuracy. In Figure 11, the red horizontal dash-dot line represents the baseline for all the other experiments. The orange, blue and purple curves show the pruned test accuracy against sparsity for FP32, bfloat16 and AFM16, respectively. Overall, these curves decline slowly from 70% to 80% sparsity and drop rapidly after 80% sparsity. We observe that all three curves are above the baseline from 70% to 80%. This is caused by the sparse weights acting like a dropout layer, providing extra regularization that helps the model generalize. The 83% sparsity level is optimal for pruned bfloat16 and pruned AFM16, as their accuracy at that level is higher than or equal to the baseline. However, FP32 drops below the baseline at 83% sparsity. Additionally, the curves of AFM16 and bfloat16 are consistently above the baseline, demonstrating that AFM16 could act as a drop-in replacement for the native bfloat16 multiplier. In this experiment, we successfully coupled approximate multiplier designs with a pruning algorithm, thus highlighting the flexibility of our framework.
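To make the polynomial-decay pruning schedule concrete, the following is a minimal sketch of how the target sparsity could ramp from the 70% initial level to one of the final levels explored (83% is used here). The function name, the exponent of 3, and the step counts are illustrative assumptions, not values taken from the experiment.

```python
def polynomial_decay_sparsity(step, initial_sparsity, final_sparsity,
                              begin_step, end_step, power=3.0):
    """Target sparsity at `step` under a polynomial-decay schedule.

    Hypothetical helper: the exponent and step range are assumptions.
    """
    if end_step <= begin_step:
        raise ValueError("end_step must be greater than begin_step")
    # Fraction of the schedule completed, clamped to [0, 1].
    progress = (step - begin_step) / (end_step - begin_step)
    progress = min(max(progress, 0.0), 1.0)
    # Sparsity rises quickly at first, then flattens near the final level.
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** power


# Ramp from 70% to 83% sparsity over an assumed 1000 pruning steps.
schedule = [polynomial_decay_sparsity(s, 0.70, 0.83, 0, 1000)
            for s in range(0, 1001, 250)]
for step, sparsity in zip(range(0, 1001, 250), schedule):
    print(f"step {step:4d}: target sparsity {sparsity:.3f}")
```

With this form, most of the sparsity increase happens early in the schedule, which leaves the later retraining steps to recover accuracy at a nearly fixed sparsity.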

# IX. RESULTS: RUNTIME PERFORMANCE EVALUATION

As discussed in Section I, the aim of ApproxTrain is to perform DNN training with approximate multiplier simulation in practically feasible run-times. In this section, we present results for a detailed evaluation of the timing performance of ApproxTrain for *training* as well as for *inference*.

The overall timing performance is evaluated by recording the average time for DNNs to train/infer one batch. The results are listed in Table V and Table VI. For these evaluations, the training and inference experiments are run on two platforms: System-I and System-II (described in Section VII). For both training and inference, and for both platforms, Tables V and VI present four types of run-time measurements. These are:

- 1) *TFnG* run-time for training/inference performed using standard TensorFlow with cuDNN/cuBLAS libraries on GPU with native hardware multiplier (FP32);
- 2) *ATnG* run-time for training/inference performed using ApproxTrain with custom CUDA kernels (described in Section VI-D) on GPU with native hardware multiplier (FP32);
- 3) *ATxG* run-time for training/inference performed using ApproxTrain with custom CUDA kernels on GPU with AMSim (16-bit FP datatype (1, 8, 7) in Table II); and,
- 4) *ATxC* run-time for training/inference performed using ApproxTrain with custom CUDA kernels on CPU with direct C/C++ simulation of approx. multiplier.

The *TFnG* values, i.e., the run-times of standard TensorFlow with native hardware multiplication supported by GPU, are considered as the baseline in the following discussion. *Note that we did not perform run-time evaluation experiments with bfloat16 since the available hardware did not natively support bfloat16. Meanwhile, 16-bit FP datatype (1, 8, 7) as shown in Table II is used in AMSim, considering it equivalent to the industry de-facto standard for training/inference.* 
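As an illustration of the per-batch timing described above, the following is a minimal sketch of how an average time per batch and a speed ratio could be measured. The helper names, the warm-up count, and the dummy workload are assumptions made for this example. It is not the measurement code used for Tables V and VI.

```python
import time


def average_batch_time(run_batch, num_batches=20, warmup=3):
    """Mean wall-clock seconds per call of `run_batch` (hypothetical helper)."""
    if num_batches <= 0:
        raise ValueError("num_batches must be positive")
    # Warm-up calls absorb one-off costs such as tracing and allocation.
    for _ in range(warmup):
        run_batch()
    start = time.perf_counter()
    for _ in range(num_batches):
        run_batch()
    return (time.perf_counter() - start) / num_batches


def speed_ratio(slower_time, faster_time):
    """Ratio of two per-batch times, e.g. ATxG/TFnG for a slow-down."""
    return slower_time / faster_time


def dummy_batch():
    # Stand-in workload; replace with one training or inference step.
    return sum(i * i for i in range(10_000))


per_batch = average_batch_time(dummy_batch)
print(f"average time per batch: {per_batch * 1e3:.3f} ms")
```

When the timed step runs on a GPU, `run_batch` should block until the results are available (for example by copying an output back to the host), since GPU work is typically launched asynchronously and would otherwise be under-counted.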

1) Custom CUDA kernels in ApproxTrain vs optimized cuDNN/cuBLAS in TensorFlow: For this comparison, we use ApproxTrain with the '\*' operator for multiplication, which invokes the native hardware multiplier on GPU instead of an approximate multiplier simulation model. This comparison demonstrates the performance of custom kernels on GPUs with native multiplier hardware. Therefore, in columns 4 & 11 of Tables V and VI, ATnG refers to the time with our custom CUDA

TABLE V. Training run-time results on System-I and System-II.

|          |                | System-I (V100 GPU) |  |  |  |  |  |  | System-II (GTX1080 GPU) |  |  |  |  |  |  |
|----------|----------------|---------------------|--|--|--|--|--|--|-------------------------|--|--|--|--|--|--|
|          |                | Actual time per batch |  |  |  | Speed ratio |  |  | Actual time per batch |  |  |  | Speed ratio |  |  |
| Dataset  | Neural network | TF with native mult., GPU (TFnG) | AT with native mult., GPU (ATnG) | AT with AFM, GPU (ATxG) | AT with AFM, CPU (ATxC) | ATnG/TFnG (slower) | ATxG/TFnG (slower) | ATxC/ATxG (faster) | TF with native mult., GPU (TFnG) | AT with native mult., GPU (ATnG) | AT with AFM, GPU (ATxG) | AT with AFM, CPU (ATxC) | ATnG/TFnG (slower) | ATxG/TFnG (slower) | ATxC/ATxG (faster) |
| MNIST    | LeNet-300-100  | 2.0 ms | 3 ms | 3 ms | 3 s | <b>1.3×</b> | <b>1.6×</b> | <b>884×</b> | 0.9 ms | 2 ms | 3 ms | 3 s | <b>2.3×</b> | <b>3.2×</b> | <b>1078×</b> |
| MNIST    | LeNet-5        | 3 ms | 7 ms | 13 ms | 23 s | <b>2.3×</b> | <b>4.2×</b> | <b>1798×</b> | 2 ms | 9 ms | 16 ms | 22 s | <b>4.4×</b> | <b>8.3×</b> | <b>1374×</b> |
| CIFAR10  | ResNet18       | 13 ms | 49 ms | 178 ms | 736 s | <b>3.7×</b> | <b>13.5×</b> | <b>4132×</b> | 27 ms | 120 ms | 357 ms | 672 s | <b>4.4×</b> | <b>13.1×</b> | <b>1882×</b> |
| CIFAR10  | ResNet34       | 23 ms | 90 ms | 338 ms | 1376 s | <b>4.0×</b> | <b>15.0×</b> | <b>4072×</b> | 49 ms | 221 ms | 682 ms | 1280 s | <b>4.5×</b> | <b>13.8×</b> | <b>1877×</b> |
| CIFAR10  | ResNet50       | 44 ms | 154 ms | 478 ms | 1632 s | <b>3.5×</b> | <b>10.8×</b> | <b>3417×</b> | 93 ms | 366 ms | 960 ms | 1376 s | <b>3.9×</b> | <b>10.3×</b> | <b>1433×</b> |
| ImageNet | ResNet50       | 114 ms | 460 ms | 1464 ms | 4896 s | <b>4.0×</b> | <b>12.8×</b> | <b>3343×</b> | 267 ms | 1186 ms | 3091 ms | 4864 s | <b>4.3×</b> | <b>11.6×</b> | <b>1574×</b> |

TABLE VI. Inference run-time results on System-I and System-II.

|          |                | System-I (V100 GPU) |  |  |  |  |  |  | System-II (GTX1080 GPU) |  |  |  |  |  |  |
|----------|----------------|---------------------|--|--|--|--|--|--|-------------------------|--|--|--|--|--|--|
|          |                | Actual time per batch |  |  |  | Speed ratio |  |  | Actual time per batch |  |  |  | Speed ratio |  |  |
| Dataset  | Neural network | TF with native mult., GPU (TFnG) | AT with native mult., GPU (ATnG) | AT with AFM, GPU (ATxG) | AT with AFM, CPU (ATxC) | ATnG/TFnG (slower) | ATxG/TFnG (slower) | ATxC/ATxG (faster) | TF with native mult., GPU (TFnG) | AT with native mult., GPU (ATnG) | AT with AFM, GPU (ATxG) | AT with AFM, CPU (ATxC) | ATnG/TFnG (slower) | ATxG/TFnG (slower) | ATxC/ATxG (faster) |
| MNIST    | LeNet-300-100  | 1 ms | 1 ms | 2 ms | 1 s | <b>1.2×</b> | <b>1.5×</b> | <b>609×</b> | 0.869 ms | 1 ms | 1 ms | 857 ms | <b>1.3×</b> | <b>1.4×</b> | <b>697×</b> |
| MNIST    | LeNet-5        | 2 ms | 3 ms | 4 ms | 8 s | <b>1.7×</b> | <b>2.1×</b> | <b>1780×</b> | 1 ms | 3 ms | 4 ms | 7 s | <b>2.5×</b> | <b>3.6×</b> | <b>1815×</b> |
| CIFAR10  | ResNet18       | 5 ms | 15 ms | 56 ms | 320 s | <b>3.0×</b> | <b>11.3×</b> | <b>5743×</b> | 8 ms | 39 ms | 113 ms | 352 s | <b>4.8×</b> | <b>13.7×</b> | <b>3102×</b> |
| CIFAR10  | ResNet34       | 9 ms | 25 ms | 107 ms | 576 s | <b>2.9×</b> | <b>12.2×</b> | <b>5405×</b> | 15 ms | 68 ms | 217 ms | 640 s | <b>4.5×</b> | <b>14.3×</b> | <b>2952×</b> |
| CIFAR10  | ResNet50       | 14 ms | 36 ms | 131 ms | 544 s | <b>2.5×</b> | <b>9.2×</b> | <b>4154×</b> | 26 ms | 93 ms | 265 ms | 512 s | <b>3.6×</b> | <b>10.1×</b> | <b>1934×</b> |
| ImageNet | ResNet50       | 35 ms | 110 ms | 398 ms | 1580 s | <b>3.2×</b> | <b>11.4×</b> | <b>3993×</b> | 74 ms | 301 ms | 855 ms | 1568 s | <b>4×</b> | <b>11.5×</b> | <b>1833×</b> |



Fig. 12. Inference performance comparison for ApproxTrain and TFapprox.

kernels in ApproxTrain, as opposed to columns 3 & 10, which refer to the time taken by cuDNN/cuBLAS-based TensorFlow.

The slow-down (speed-ratio) of ATnG compared with TFnG is highlighted in bold (black) in Tables V and VI. In the training phase, ApproxTrain with native multiplication is $1\times$ to $5\times$ slower than standard TensorFlow for the various datasets. Note that the closed-source cuDNN and cuBLAS libraries have been optimized by teams of several hundred professionals within Nvidia for over a decade. Thus, we believe that a slow-down of less than $5\times$ is reasonable. Nonetheless, since our framework is open-source, the research community may contribute further optimizations.

2) ApproxTrain perf. with approx. multiplier simulation: We compare the run-times of ApproxTrain with approximate multiplier simulation on GPU (ATxG) against standard TF with native multiplication on GPU (TFnG). The slow-down (speed-ratio) of this comparison is highlighted in bold (blue) in Tables V and VI. The slow-down is around $2\times$ for the smallest dataset/architecture, whereas the slow-down for ImageNet is about $13\times$. This comparison demonstrates the performance penalty of approximate multiplication simulation plus the use of a custom CUDA library. Essentially, the difference between the bold-black and bold-blue slow-downs is due to the overhead of the approximate multiplier simulation. In general, the slow-down numbers for System-II are slightly lower than for System-I for training and inference, since the V100 GPU has Tensor Cores, which are faster than the architecture in the GTX1080.

The prior inference-only framework TFapprox [33] has shown a 10× slow-down for small dataset-architectures, despite supporting only an 8-bit integer datatype. ApproxTrain supports floating-point training and inference with just $7.32\times$ slow-down ($7.32\times$ is the geometric mean of all experiments containing both AMDENSE and AMCONV2D operators). Although a 16-bit FP datatype (bfloat16) is used to benchmark ApproxTrain, we compare the inference performance of the approximate CONV2D operators of ApproxTrain and TFapprox. We reproduce the TFapprox project [32] and benchmark both ApproxTrain and TFapprox with an identical measurement procedure on System-II (GTX1080 GPU). As shown in Figure 12, similar inference performance can be observed for ApproxTrain and TFapprox across four different dataset-architectures containing intensive approximate CONV2D operations. Note that TFapprox supports only 8-bit integer inference for the approximate CONV2D operator, while ApproxTrain enables generic (1, e, m) FP training and inference for both AMCONV2D and AMDENSE operators.

3) ApproxTrain GPU performance vs CPU-based approximate multiplier simulations: We compare the performance of ApproxTrain on GPU (ATxG) against the run-times of approximate multiplier training/inference on CPU (ATxC). The speed-ups of these comparisons are highlighted in bold (green) in Tables V and VI. We observe that for training on System-I, the geometric mean speed-up is more than 2500×. Similarly, for inference, the speed-up is more than 2869×. The speed-ups for System-II are slightly lower, as its GPU is less powerful while its CPUs have similar performance to those of System-I. Thus, ApproxTrain offers a fast and easy solution for testing approximate multipliers and DNNs compared to naive simulations on CPU.
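The aggregate figures quoted in this section can be recomputed from the speed-ratio columns of Tables V and VI. The sketch below transcribes those columns and takes geometric means. The variable names are illustrative, and pooling all 24 ATxG/TFnG entries for the overall slow-down is an assumption about how the $7.32\times$ figure was obtained.

```python
from statistics import geometric_mean

# Speed ratios transcribed from Table V (training) and Table VI (inference).
# Row order: LeNet-300-100, LeNet-5, ResNet18, ResNet34,
# ResNet50 (CIFAR10), ResNet50 (ImageNet).
ATXG_OVER_TFNG = {
    ("training", "System-I"): [1.6, 4.2, 13.5, 15.0, 10.8, 12.8],
    ("training", "System-II"): [3.2, 8.3, 13.1, 13.8, 10.3, 11.6],
    ("inference", "System-I"): [1.5, 2.1, 11.3, 12.2, 9.2, 11.4],
    ("inference", "System-II"): [1.4, 3.6, 13.7, 14.3, 10.1, 11.5],
}
ATXC_OVER_ATXG_SYSTEM_I = {
    "training": [884, 1798, 4132, 4072, 3417, 3343],
    "inference": [609, 1780, 5743, 5405, 4154, 3993],
}

# Overall slow-down of simulated approximate multiplication vs. TensorFlow,
# pooled over both phases and both systems (assumed aggregation).
all_slowdowns = [r for ratios in ATXG_OVER_TFNG.values() for r in ratios]
overall_slowdown = geometric_mean(all_slowdowns)

# GPU-over-CPU speed-up on System-I, per phase.
speedup_training = geometric_mean(ATXC_OVER_ATXG_SYSTEM_I["training"])
speedup_inference = geometric_mean(ATXC_OVER_ATXG_SYSTEM_I["inference"])

print(f"overall ATxG/TFnG slow-down: {overall_slowdown:.2f}x")
print(f"System-I training speed-up:  {speedup_training:.0f}x")
print(f"System-I inference speed-up: {speedup_inference:.0f}x")
```

A geometric mean is the natural aggregate here because the entries are ratios: it weights a 2× and a 0.5× change symmetrically, whereas an arithmetic mean would be dominated by the largest networks.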

#### X. CONCLUSIONS

This paper proposed a framework (ApproxTrain) to perform training and inference with approximate FP multipliers through simulation. First, a novel flow is proposed to effortlessly convert C/C++-based functional models of the approximate multipliers into an optimal AMSim. Then, this AMSim is integrated into ApproxTrain (an extension of TensorFlow), leveraging CUDA to speed up the simulation. ApproxTrain allows researchers to flexibly evaluate and explore their approximate multiplier designs in various DNNs. Our evaluations show that DNNs trained with approximate multipliers (AFM) converge as well as those trained with FP32 and bfloat16 multipliers. The GPU run-time shows a significant speed-up over CPU run-times, making such evaluation practically feasible. ApproxTrain is released as open-source [17] for further contributions from the research community.

#### XI. ACKNOWLEDGEMENT

This research/project was undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI), which is supported by the Australian Government. We acknowledge Yikai Wang for monitoring experiments.

#### REFERENCES

- [1] CUDA C++ Programming Guide. Accessed: 2022-04-03.
- [2] Deep learning frameworks. Accessed: 2022-04-07.
- [3] NVIDIA TENSOR CORES Unprecedented Acceleration for HPC and AI.
- [4] Alberto Villarreal Cueva. Intel® Deep Learning Boost New Deep Learning Instruction bfloat16, June 2020.
- [5] M. Aledhari, R. Razzak, R. M. Parizi, and F. Saeed. Federated Learning: A Survey on Enabling Technologies, Protocols, and Applications. *IEEE Access*, 8:140699–140725, 2020.
- [6] Amulya Vishwanath. Video Series: Mixed-Precision Training Techniques Using Tensor Cores for Deep Learning, January 2019.
- [7] K. Chellapilla, S. Puri, and P. Simard. High Performance Convolutional Neural Networks for Document Processing. In *Tenth International Workshop on Frontiers in Handwriting Recognition*, Université de Rennes, 2006.
- [8] Y. Chen et al. DaDianNao: A Machine-Learning Supercomputer. In *2014 47th Annual IEEE/ACM International Symposium on Microarchitecture*, pages 609–622. IEEE, December 2014.
- [9] T. Cheng, Y. Masuda, J. Chen, J. Yu, and M. Hashimoto. Logarithm-approximate floating-point multiplier is applicable to power-efficient neural network training. *Integration*, 74:19–31, September 2020.
- [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In *CVPR09*, 2009.
- [11] L. Deng. The MNIST database of handwritten digit images for machine learning research. *IEEE Signal Processing Magazine*, 29(6):141–142, 2012.
- [12] S. Jain et al. Compensated-DNN: Energy efficient low-precision deep neural networks by compensating quantization errors. In *2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)*, pages 1–6, 2018.

- [13] S. Chetlur et al. cuDNN: Efficient primitives for deep learning. *CoRR*, abs/1410.0759, 2014.
- [14] I. Hammad, K. El-Sankary, and J. Gu. Deep Learning Training with Simulated Approximate Multipliers. In *2019 IEEE International Conference on Robotics and Biomimetics (ROBIO)*, pages 47–51, Dali, China, December 2019.
- [15] S. Han, H. Mao, and W. J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. *International Conference on Learning Representations (ICLR)*, 2016.
- [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In *2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, pages 770–778, June 2016.
- [17] H. Saadat, J. Gong, and H. Gamaarachchi. ApproxTrain. https://github.com/AaronJing/ApproxTrain, 2022.
- [18] H. Kim. A low-cost compensated approximate multiplier for Bfloat16 data processing on convolutional neural network inference. *ETRI Journal*, 43(4):684–693, 2021.
- [19] M. S. Kim, A. A. D. Barrio, L. T. Oliveira, R. Hermida, and N. Bagherzadeh. Efficient Mitchell's Approximate Log Multipliers for Convolutional Neural Networks. *IEEE Transactions on Computers*, 68(5):660–675, May 2019.
- [20] Min Soo Kim, Alberto A. Del Barrio, HyunJin Kim, and Nader Bagherzadeh. The Effects of Approximate Multiplication on Convolutional Neural Networks. *IEEE Transactions on Emerging Topics in Computing*, pages 1–1, 2021. arXiv: 2007.10500.
- [21] A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. page 60.
- [22] O. Kuchaiev et al. Mixed-Precision Training for NLP and Speech Recognition with OpenSeq2Seq. *arXiv:1805.10387 [cs]*, November 2018.
- [23] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. *Proceedings of the IEEE*, 86(11):2278–2324, November 1998.
- [24] P. Micikevicius et al. Mixed Precision Training. In *ICLR 2018*, October 2017.
- [25] J. N. Mitchell. Computer Multiplication and Division Using Binary Logarithms. *IRE Transactions on Electronic Computers*, EC-11(4):512–517, August 1962.
- [26] S. Mittal. A Survey of Techniques for Approximate Computing. *ACM Computing Surveys*, 48(4):1–33, March 2016.
- [27] H. Noh, T. You, J. Mun, and B. Han. Regularizing Deep Neural Networks by Noise: Its Interpretation and Optimization. *arXiv:1710.05179 [cs]*, November 2017.
- [28] H. Saadat. Design and optimization of approximate multipliers and dividers for integer and floating-point arithmetic. PhD thesis, School of Computer Science and Engineering, University of New South Wales, 2021.
- [29] H. Saadat, H. Bokhari, and S. Parameswaran. Minimally Biased Multipliers for Approximate Integer and Floating-Point Multiplication. *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, 37(11):2623–2635, November 2018.
- [30] Hassaan Saadat, Haris Javaid, Aleksandar Ignjatovic, and Sri Parameswaran. REALM: Reduced-Error Approximate Log-based Integer Multiplier. In *2020 Design, Automation & Test in Europe Conference & Exhibition (DATE)*, pages 1366–1371, March 2020. ISSN: 1558-1101.
- [31] K. Shirane, T. Yamamoto, and H. Tomiyama. A design methodology for approximate multipliers in convolutional neural networks: A case of MNIST. *International Journal of Reconfigurable and Embedded Systems (IJRES)*, 10(1):1, March 2021.
- [32] Z. Vasicek and V. Mrazek. TFApprox. https://github.com/ehw-fit/tf-approximate, 2020.
- [33] F. Vaverka, V. Mrazek, Z. Vasicek, and L. Sekanina. TFApprox: Towards a Fast Emulation of DNN Approximate Hardware Accelerators on GPU. In *2020 Design, Automation & Test in Europe Conference & Exhibition (DATE)*, pages 294–297, March 2020.
- [34] ShiBo Wang and Pankaj Kanwar. BFloat16: The secret to high performance on Cloud TPUs, August 2019. Library Catalog: cloud.google.com.
- [35] G. Zervakis et al. Approximate computing for ML: State-of-the-art, challenges and visions. In *Proceedings of the 26th Asia and South Pacific Design Automation Conference*, pages 189–196, 2021.
- [36] X. Zhang et al. Fixed-Point Back-Propagation Training. In *2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, pages 2327–2335, 2020.