# TRIM: A Design Space Exploration Model for Deep Neural Networks Inference and Training Accelerators

Yangjie Qi, Student Member, IEEE, Shuo Zhang, Student Member, IEEE, and Tarek M. Taha, Member, IEEE

Abstract—There is increasing demand for specialized hardware for training deep neural networks (DNNs), both in edge/IoT environments and in high-performance computing systems. The design space of such hardware is very large due to the wide range of processing architectures, deep neural network configurations, and dataflow options. This makes developing deep neural network processors quite complex, especially for training. We present TRIM, an infrastructure to help hardware architects explore the design space of deep neural network accelerators for both inference and training in the early design stages. The model evaluates designs at the whole-network level, considering both inter-layer and intra-layer activities. Given applications, essential hardware specifications, and a design goal, TRIM can quickly explore different hardware design options, select the optimal dataflow, and guide new hardware architecture design. We validated TRIM against FPGA-based implementations of deep neural network accelerators and ASIC-based architectures. We also show how to use TRIM to explore the design space through several case studies. Experimental results show that TRIM is a powerful tool for rapidly exploring the design space of DNN architectures and for evaluating hardware choices that lead to efficient inference and training designs.

*Index Terms*—DNN model, inference, training, accelerator, design space exploration.

## I. INTRODUCTION

Deep Neural Networks (DNNs) [1] are being used in a wide variety of application domains, including computer vision, natural language processing, and big data analysis. The success of DNNs has also led to intensive studies of DNN designs for different scenarios, from the data center to IoT. Most of these studies focus on DNN inference accelerators. However, recent DNN use cases show the need for various types of DNN training accelerators.

For very large deep learning models, which are generally processed in data centers, training time and energy costs are becoming critical limiting factors. For example, GPT-3, which has 175 billion parameters, would take 355 years and \$4,600,000 to train using one Tesla V100 cloud instance [2]. Therefore, further investigation into more optimized DNN training accelerators for cluster environments is needed. On the other hand, training DNNs on mobile devices is also gaining attention through Federated Learning from Google [3]. In this case, users do not share their private data with the cloud/server and instead train a network on their local system. A number of federated learning applications have already been built and deployed on cell phones [3]. Therefore, an energy-efficient DNN accelerator chip would greatly improve the user experience. Moreover, real-time training is becoming increasingly important, since many edge devices, such as IoT devices [4] and robots [5], collect new data on their own. The input data of these devices are usually sequential and change over time. Thus, they require processing hardware that is low-power and capable of online learning.

This paper was produced by the Parallel Cognitive Systems Laboratory at the University of Dayton, Dayton, OH.

Different scenarios place significantly different requirements on DNN hardware accelerators, including performance, power, latency, chip area, and flexibility. The design space for such DNN accelerators is complicated by the wide range of network architectures, numerous possible dataflows, and various hardware choices. Investigating this large hardware design space is currently an ad hoc and laborious process that requires significant time and expense to evaluate different design options. Many accelerators have been proposed for deep learning. These include systems for inference [6]-[9] and on-chip training [10], [11]. They share a similar architecture: a specialized memory hierarchy connected to an array of multiply-accumulate units. The differences between the designs come from their dataflows, which define the method of data partitioning and the order of computations. These accelerators are generally co-designed with their dataflows to satisfy data caching and data transfer bandwidth requirements.

Several groups have proposed analytical models to help explore the design space of DNN inference accelerators. Timeloop [12] analyzes data movements and memory access patterns to estimate the performance of a DNN inference accelerator. MAESTRO [13] uses data-centric directives and a co-designed analytical cost model to explore the design space of DNN inference accelerators. These two models give time and energy estimates based on the individual layers of a DNN. They only consider intra-layer (single-layer level) workloads, and not inter-layer (cross-layer level) workloads. This means these models cannot achieve optimal results at the network level. AutoDNNChip [14] recognized the importance of cross-level optimization and used a graph-based representation to help predict the performance of accelerators. However, its scope is also limited to DNN inference accelerators. The comparison between TRIM and the above analytical models is shown in Table I. TRIM is the only analytical model that supports all three phases of training tasks. It is also the only analytical model that considers both intra-layer and inter-layer workloads.

| Model       | Intra-layer: FW | Intra-layer: BW | Intra-layer: WG & Update | Intra-layer: Pooling | Inter-layer: DP<sup>1</sup> | Inter-layer: DD<sup>2</sup> |
|-------------|-----------------|-----------------|--------------------------|----------------------|-----------------------------|-----------------------------|
| TRIM        | Y               | Y               | Y                        | Y                    | Y                           | Y                           |
| Timeloop    | Y               | N               | N                        | N                    | N                           | N                           |
| MAESTRO     | Y               | N               | N                        | Y                    | N                           | N                           |
| AutoDNNChip | Y               | N               | N                        | Y                    | N                           | N                           |

TABLE I: Comparison of TRIM with other analytical models

<sup>1</sup> DP: Data Preprocessing <sup>2</sup> DD: Data Dependency (Activation Caching)

Our review of the literature shows a lack of analytical models and design space exploration tools designed for DNN training accelerators. On the one hand, compared with DNN inference accelerators, training accelerators must be able to process more complicated intra-layer patterns. For example, the kernel height/width in the inference phase is mainly 3, 5, 7, or 11. In contrast, in the weight gradient phase, the kernel height/width can be up to 220 (for AlexNet layer 1). On the other hand, inter-layer optimizations play a more critical role in the training phase, and the training inter-layer workloads are more complex than their inference counterparts. These training workloads require more memory and significantly more data movement energy than inference tasks, and they must be considered in the model to ensure that the architectures are capable of processing training tasks and to obtain network-level optimal results (Section III-C).

In this paper, we present TRIM (TRaining archItecture Model for deep networks), an infrastructure to help hardware architects explore the design space of DNN accelerators for training and inference. It considers both the intra-layer and the inter-layer workloads of DNNs. Given application and hardware specifications, TRIM quickly examines all possible dataflows and estimates time, energy, and area. TRIM can also be used to compare different hardware design options, optimize existing architectures, and guide new hardware architecture designs. The key contributions of this paper can be summarized as follows:

1) TRIM provides an analytical model to estimate the performance and energy of various DNN hardware architectures. TRIM utilizes a very flexible hardware template, which can model a wide range of architectures. TRIM explores the design space of data partition and reuse strategies for each hardware architecture and estimates the optimal performance and energy. This exploration guarantees fair comparisons between different architectures.

2) TRIM supports both inference and training of DNN accelerators. To the best of our knowledge, TRIM is the first infrastructure that can model and explore the design space of DNN architectures for both training and inference. Furthermore, to accurately model training architectures, TRIM explores the design space of the DNN training accelerators at the network level, considering both intra-layer and inter-layer activities, and finds the optimal design choices.

3) We demonstrate how to utilize TRIM to explore the design space of FPGA-based architectures and ASIC-based architectures through two case studies. The results show the pros and cons of different hardware choices and lead to efficient training architecture designs.

Fig. 1: Overall view of TRIM model

## II. TRIM OVERVIEW

TRIM provides a systematic framework to predict the time, energy, and area of various hardware architectures running different DNN applications for both inference and training. As shown in Fig. 1, TRIM takes four inputs: 1) a task description, which consists of the DNN network model and corresponding parameters, such as batch size; 2) hardware parameters, which define the system hierarchy and the specifications of each hardware component; 3) mapping constraints, which constrain the data partitioning and the order of computation in the system; 4) a design goal, such as the highest throughput or the lowest energy consumption.

TRIM consists of five components: the task analyst, the TRIM designer, the TRIM mapper, the TRIM evaluator, and the TRIM explorer. The task analyst uses the task description to generate intra-layer workloads and inter-layer workloads. The TRIM designer generates various hardware descriptions based on the hardware parameters. Each hardware description represents a specific hardware architecture, and all hardware descriptions together constitute the hardware architecture space. For a given hardware description, the TRIM mapper generates a mapspace for each intra-layer workload. A mapspace consists of multiple mappings, and each mapping defines a specific way in which data is partitioned, staged, and computed across the hardware architecture. The TRIM evaluator estimates the performance and energy of each mapping in the mapspace. Using these estimates, the TRIM explorer finds the optimal mapping in each mapspace according to the design goal. By combining the optimal mappings with the inter-layer workloads, the TRIM evaluator can estimate the performance and energy of each architecture in the architecture space. Finally, the TRIM explorer selects the optimal architecture according to the design goal.

```text
network_parameters = {
    processing_type = 'Training'
    input_shape = (224,224,3)
    output_shape = 1000
    batch_size = 64
}

network_model = {
    x = conv2d(in_shape=input_shape,
               out_channel=64, kernel_size=(11,11),
               padding=(2,2), stride=(4,4),
               activation='ReLU')
    x = pool2d(in_shape=x.shape,
               kernel_size=(3,3), stride=(2,2))
    x = fc(in_shape=x.shape,
           out_channel=output_shape,
           activation='Sigmoid')
}
```

Fig. 2: An example of the task description

In the remainder of the paper, we present the task analyst and its inputs and outputs in Section III. Section IV introduces the TRIM designer. The mapping, the mapspace, and our method to prune the mapspace are described in Section V. The TRIM evaluator and explorer are described in Section VI. We validate TRIM with FPGA designs and use TRIM to design an FPGA-based DNN accelerator in Section VII. Section VIII presents case studies that use TRIM to explore the design space of ASIC-based DNN accelerators, and Section IX concludes the paper.

## III. TASK ANALYST

The task analyst goes through the task description and generates workloads. There are two types of workloads: intra-layer workloads and inter-layer workloads. The intra-layer workloads describe each layer's primary computation operations, such as the 2D convolution in the convolutional (CONV) layer, the 2D pooling in the pooling (POOL) layer, and the matrix-to-matrix multiplication in the fully-connected (FC) layer. The inherent parallelism of the intra-layer workloads makes it possible to achieve high performance and energy efficiency on parallel computing architectures. The inter-layer workloads consist of data preprocessing and intermediate activation caching. They offer few opportunities for parallel computing but significantly affect the performance and energy consumption of a task.

#### A. Task Description

The task description used in our paper is a simplified TensorFlow-like description, which consists of the network parameters and the network model, so it should be easy for TensorFlow users to develop a TRIM model from their code. The network parameters comprise the input shape, output shape, batch size, and processing type (inference or training). The network model is described as multiple layers connected in a specific order, where each layer is defined by a layer type and its corresponding parameters. The CONV layer is defined by its input shape, output channels, kernel size, padding size, stride size, and activation type. The FC layer is described by its input shape, output channels, and activation type. The parameters of the POOL layer are its input shape, kernel size, and stride size. Fig. 2 shows the task description of a three-layer network as an example.

In the case of inference, the task analyst generates one intra-layer workload for each layer. For training, however, the number of workloads generated depends on the layer type. For the CONV layer and the FC layer, three workloads are generated, which correspond to the forward propagation (FW), backpropagation (BW), and weight gradient and update (WG) phases. The only exception is the first layer of the network, which does not have a BW phase. For the POOL layer, only two workloads are generated, as no WG phase is needed. For AlexNet [15], which has five CONV layers, three FC layers, and three POOL layers, the task analyst generates $5+3+3=11$ intra-layer workloads for the inference task and $(5+3) \times 3 + 3 \times 2 - 1 = 29$ intra-layer workloads for the training task.
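The counting rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the text: the function name, the layer-type strings, and the list encoding of AlexNet are assumptions made for the example and are not part of TRIM.

```python
# Number of training phases per layer type, as described in the text:
# CONV and FC layers have FW, BW, and WG; POOL layers have only FW and BW.
PHASES_PER_LAYER = {"conv": 3, "fc": 3, "pool": 2}


def count_intra_layer_workloads(layers, training):
    """Count intra-layer workloads for a list of layer-type strings.

    `layers` is given in network order; this encoding is hypothetical.
    """
    if not layers:
        return 0
    if not training:
        return len(layers)  # one workload per layer for inference
    total = sum(PHASES_PER_LAYER[kind] for kind in layers)
    return total - 1  # the first layer of the network has no BW phase


# AlexNet as counted in the text: five CONV, three POOL, three FC layers.
alexnet = ["conv", "pool", "conv", "pool", "conv", "conv", "conv", "pool",
           "fc", "fc", "fc"]
inference_count = count_intra_layer_workloads(alexnet, training=False)
training_count = count_intra_layer_workloads(alexnet, training=True)
print(inference_count, training_count)  # 11 29
```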

#### B. Intra-layer Workloads

The intra-layer workload is represented as a nested loop. As shown in Fig. 3, the operation in the CONV layer is described by the height and width of the filters (R, S), the height and width of the outputs (E, F), the input channel size (C), the output channel size (M), and the batch size (N), and these seven parameters define the seven loops of the nest. As the multiply-accumulate (MAC) operations occur only in the innermost loop, the loops can be nested in any order. The seven parameters, along with the stride sizes (U, V), define the computations and the shapes of the inputs, filters, and outputs. Fully connected layers and recurrent layers can be represented in the same format, as their main computations are matrix-matrix and matrix-vector multiplications. Matrix-matrix multiplications can be defined by setting R, S, E, and F to 1. Matrix-vector multiplications can be represented by setting R, S, E, F, and N to 1. The POOL layer is also supported and is evaluated in the experiments. We consider it an intra-layer workload since it can be described by a nested loop similar to that of the CONV layer. We assume the normalization layer is processed by the CPU or a separate coprocessor, so it can be modeled as a constant delay in TRIM. Some special connections, such as the residual link, are split into an intra-layer workload and an inter-layer workload; these are considered and evaluated in the ResNet-IM experiments shown in Figs. 16 to 18.
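To make the special cases concrete, the following pure-Python sketch runs the seven-loop nest and then reuses it for a matrix-vector product by setting N, R, S, E, and F to 1. The function name and the nested-list tensor layout are assumptions chosen for illustration; they do not describe TRIM's internal representation.

```python
def conv_loop_nest(inp, filt, N, M, C, R, S, E, F, U=1, V=1):
    """Direct seven-loop convolution on nested lists.

    inp is indexed as inp[n][p][q][c], filt as filt[r][s][c][m],
    and the result as out[n][e][f][m]. U and V are the strides.
    """
    out = [[[[0 for _ in range(M)] for _ in range(F)] for _ in range(E)]
           for _ in range(N)]
    for n in range(N):                      # batch size
        for m in range(M):                  # output channel
            for c in range(C):              # input channel
                for r in range(R):          # filter height
                    for s in range(S):      # filter width
                        for e in range(E):  # output height
                            for f in range(F):  # output width
                                p = e * U + r
                                q = f * V + s
                                out[n][e][f][m] += (
                                    inp[n][p][q][c] * filt[r][s][c][m])
    return out


# Matrix-vector multiplication as a special case: N = R = S = E = F = 1.
C, M = 3, 2
vector = [[[[1, 2, 3]]]]                  # shape (1, 1, 1, C)
matrix = [[[[1, 0], [0, 1], [1, 1]]]]     # shape (1, 1, C, M)
result = conv_loop_nest(vector, matrix, N=1, M=M, C=C, R=1, S=1, E=1, F=1)
mat_vec = result[0][0][0]
print(mat_vec)  # [4, 5]
```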

#### C. Inter-layer Workloads

There are two types of inter-layer workloads: data preprocessing and intermediate activation caching.

**Data preprocessing** is executed before each intra-layer workload. For DNN inference tasks, as shown in Eq. 1, only padding operations are needed, and they are usually ignored by inference exploration tools such as [12]. However, these operations should be taken into consideration, as they consume both time and energy. This is especially true for training tasks, where data preprocessing is executed in every iteration. Eq. 1, Eq. 2, and Eq. 3 show the different data preprocessing operations for the FW, WG, and BW phases of the CONV layer, respectively.

```text
for n = 0:N                  # batch size
  for m = 0:M                # out channel
    for c = 0:C              # in channel
      for r = 0:R            # filter height
        for s = 0:S          # filter width
          for e = 0:E        # out height
            for f = 0:F      # out width
              p = e * U + r  # U and V are the strides
              q = f * V + s
              out[n,e,f,m] += input[n,p,q,c] * filter[r,s,c,m]
end
```

Fig. 3: Intra-layer workload of TRIM using loop nest format

Fig. 4: Inter-layer data dependency of DNN

Furthermore, preprocessing operations such as padding and upsampling generate zeros. These zeros are predictable and can be used to estimate the time and energy consumption of architectures that have an early zero-detection mechanism. The data preprocessing workloads are represented by their operation type, input shape, and output shape.

$$Y = padding(X) * W \tag{1}$$

$$dW = padding(X) * upsampling(dY) \tag{2}$$

$$dX = padding(upsampling(dY)) * rot180(W^T) \tag{3}$$
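Because the zeros introduced by padding and upsampling are predictable, their share of the preprocessed tensor can be computed from shapes alone. The sketch below assumes that upsampling inserts `stride - 1` zeros between neighbouring elements and that padding adds `pad` zeros on every side; these definitions and all function names are assumptions made for illustration, not TRIM's implementation.

```python
def padded_shape(height, width, pad):
    """Shape after adding `pad` zeros on each side (assumed definition)."""
    return height + 2 * pad, width + 2 * pad


def upsampled_shape(height, width, stride):
    """Shape after inserting stride - 1 zeros between neighbouring elements."""
    return (height - 1) * stride + 1, (width - 1) * stride + 1


def zero_fraction(original_shape, new_shape):
    """Fraction of the preprocessed tensor that consists of inserted zeros."""
    original = original_shape[0] * original_shape[1]
    new = new_shape[0] * new_shape[1]
    return (new - original) / new


# First CONV layer of Fig. 2: 224x224 input, padding 2, stride 4,
# which produces a 55x55 output and therefore a 55x55 gradient dY.
padded = padded_shape(224, 224, pad=2)
upsampled = upsampled_shape(55, 55, stride=4)
print(padded, upsampled)
print(round(zero_fraction((55, 55), upsampled), 4))
```

Under these assumptions, the upsampled gradient of the first layer is 217 wide, which is of the same order as the weight gradient kernel size of about 220 mentioned in the introduction, and most of its entries are zeros.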

**Intermediate activation caching** is only considered in the training task. As shown in Fig. 4, after we compute the first forward layer (FW1) of the network, the activations $x^1$ need to be cached and used later as the inputs of WG3. Similarly, the activations $x^2$ and $x^3$ need to be cached and used as the inputs of WG2 and WG1, respectively. After all the FW computations are processed, we compute the first backward phase (BW1) and obtain the gradient errors (dy). The sizes of these intermediate activations and the timestamps at which they are created and deleted constitute the intermediate activation caching workloads. Data caching consumes both memory resources and energy. The TRIM mapper considers the memory used by intermediate activations when validating a mapping, as introduced in Section V. The TRIM evaluator computes the energy consumption of the intermediate activations and adds it to the overall energy consumption of the architecture.
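Since each caching workload is described by a size and by creation and deletion timestamps, the peak memory it occupies can be found with a simple sweep over those events. The following is a minimal sketch under assumed inputs: the tuple layout, the step numbering, and the activation sizes are hypothetical and are chosen only to mirror the three-layer example of Fig. 4.

```python
def peak_cached_bytes(activations):
    """Return the peak memory held by cached activations.

    `activations` is a list of (size_bytes, created_step, deleted_step)
    tuples; this encoding is a hypothetical illustration.
    """
    events = []
    for size, created, deleted in activations:
        events.append((created, size))    # allocation
        events.append((deleted, -size))   # release
    # At equal timestamps, releases (negative deltas) are processed first.
    events.sort(key=lambda event: (event[0], event[1]))
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak


# Three-layer example: x1, x2, x3 are created by FW1..FW3 (steps 0..2)
# and released after WG3, WG2, WG1 respectively (steps 6, 5, 4).
cached = [(100, 0, 6), (50, 1, 5), (25, 2, 4)]
peak = peak_cached_bytes(cached)
print(peak)  # 175
```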

TABLE II: Architecture Parameters of TRIM

| Type        | Architecture Parameters                          |
|-------------|--------------------------------------------------|
| System      | # of levels; data precision                      |
| Computation | # of PEs                                         |
| Memory      | type; size; usage (inputs/filters/outputs/share) |
| Routing     | topology; routing size                           |

## IV. DESIGNER

The TRIM designer uses the architecture parameters to generate multiple hardware descriptions. Each hardware description represents a specific hardware organization. We use a very flexible hardware architecture template, which can be viewed as a tree graph. The main memory is the root, the global buffer (Gbuffer) is an intermediate node, the network-on-chip (NoC) links are the branches, and the register files (RFs) and processing engines (PEs) are the leaves. There may be multiple memory or routing levels between the main memory and the global buffer. It is also possible to have multiple buffers in parallel as different intermediate nodes.

Table II shows the architecture parameters that can be varied during the design space exploration. Each parameter can be assigned multiple values. The designer generates various hardware descriptions by combining the different values of each parameter. Table III shows one hardware description generated by the designer to model Eyeriss [9], a popular inference accelerator. As defined by the number of levels, it has five hardware levels, which are named processing engine (PE), scratchpads (SP), network-on-chip (NoC), global buffer (Gbuffer), and off-chip memory. The first level is the computation level, which defines the total number of PEs in the system. The second, fourth, and fifth levels are memory levels, which differ in memory type and memory size. Their usage is set to shared, which means that inputs, filters, and outputs share the memory. It is also possible to specify separate memories for inputs, filters, and outputs. In that case, three separate memories need to be defined at the same hardware level. The third level is the routing level, which defines the topology and size of the routing network.
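Combining the candidate values of every parameter amounts to taking a Cartesian product. The sketch below illustrates this with `itertools.product`; the parameter names and candidate values are hypothetical and are not taken from TRIM's actual input format.

```python
from itertools import product


def generate_hardware_descriptions(parameter_space):
    """Enumerate every combination of the candidate parameter values.

    `parameter_space` maps a parameter name to a list of candidate values.
    """
    names = list(parameter_space)
    value_lists = [parameter_space[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# Hypothetical parameter space loosely following Table II.
parameter_space = {
    "num_pes": [128, 256],
    "scratchpad_bytes": [256, 520],
    "gbuf_kbytes": [64, 108, 256],
}
descriptions = generate_hardware_descriptions(parameter_space)
print(len(descriptions))  # 12
print(descriptions[0])
```

The number of generated descriptions grows multiplicatively with the number of values per parameter, which is one reason each description is later evaluated only with its best mapping rather than with every mapping.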

Using the different parameter values, the designer generates various hardware architectures. However, we cannot compare those architectures directly, as their performance and energy consumption depend not only on the hardware architecture but also on the mapping. A mapping describes how a workload is executed on a hardware architecture. For a given hardware architecture, millions of possible mappings exist. Therefore, to compare the performance and energy consumption of different hardware architectures fairly, we need to ensure that each architecture is evaluated with its optimal mapping.

## V. MAPPER

The mapper consists of a mapping constructor, a mapping validator, and a mapspace pruner, as shown in Fig. 5. The mapping constructor takes the intra-layer workloads and a hardware organization and creates the possible mappings. Each mapping uses a nested loop format, which describes how data is moved, staged, and computed in the given hardware. All possible mappings together form the mapspace. The mapping validator computes the size of each hardware component required by the mapping, adds the size required by the intermediate activation caching workloads, and compares the result with the size constraints listed in the hardware organization. A mapping that satisfies the size constraints of the hardware organization is called a valid mapping. All valid mappings together form the valid mapspace. The number of mappings in the valid mapspace varies from zero to several million. When a mapspace has millions of mappings, exploring it exhaustively is too time-consuming. Dataflow constraints and utilization constraints are therefore used to prune the valid mapspace. Finally, the pruned mapspace is passed to the TRIM evaluator and explored by the TRIM explorer.

| Level | Type        | Name     | Parameter      | Value          |
|-------|-------------|----------|----------------|----------------|
|       | System      | Arch-1   | # of levels    | 5              |
|       | System      | Arch-1   | data precision | fixed 16       |
| 1     | Computation | PE       | # of PEs       | 256            |
| 2     | Memory      | SP       | memory type    | scratchpad     |
| 2     | Memory      | SP       | memory size    | 520 bytes      |
| 2     | Memory      | SP       | usage          | shared         |
| 3     | Routing     | NoC      | topology       | 2-Level Bus    |
| 3     | Routing     | NoC      | routing size   | $16 \times 16$ |
| 4     | Memory      | Gbuf     | memory type    | SRAM           |
| 4     | Memory      | Gbuf     | memory size    | 108 K          |
| 4     | Memory      | Gbuf     | usage          | shared         |
| 5     | Memory      | Off-chip | memory type    | DRAM           |
| 5     | Memory      | Off-chip | memory size    | N/A            |
| 5     | Memory      | Off-chip | usage          | shared         |

TABLE III: An Example of Eyeriss Hardware Description

Fig. 5: Overview of TRIM mapper

#### A. Mapping

A mapping describes how the data is partitioned, moved, staged, and computed for a workload across a system hierarchy. More specifically, it shows how an intra-layer workload is projected onto the hardware organization. To explain this concept, we refer to the example in Fig. 6, which shows two possible mappings for a specific workload. Fig. 6a shows a vector-matrix multiplication workload, which is a particular case of the workload shown in Fig. 3 with M set to 32, C set to 16, and all other parameters (N, R, S, E, and F) set to 1. Fig. 6b and Fig. 6c present two mappings in loop nest format for the workload in Fig. 6a. Fig. 6d and Fig. 6e show how the two mappings are visualized on the architecture described in Table III.

Both mappings split the workload into five sub-mappings that correspond to the hardware levels shown in Table III. There are three types of sub-mappings: temporal, spatial, and computational. A temporal sub-mapping is defined by *for* loops and is used for memory levels, while a spatial sub-mapping is defined by *parallel for* loops and is used for routing networks. The innermost level is the computational level, which consists of the MAC operations. The loops of the temporal and spatial sub-mappings belong to the same dimensions as the loops in the workload. As shown in Fig. 6b and Fig. 6c, the loops c4, c3, c2, and c1 correspond to the loop c in the workload (Fig. 6a), and the product of the loop bounds of c1, c2, c3, and c4 is equal to the loop bound of the loop c in the workload.
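The product constraint on the loop bounds can be checked mechanically. The sketch below does so for the workload of Fig. 6a (M = 32, C = 16). Only the off-chip bounds (4 for M and 1 for C) come from the text; the bounds of the other levels, the level names, and the dictionary encoding are hypothetical values chosen so that the products match.

```python
from math import prod


def covers_workload(mapping, workload):
    """Check that per-level loop bounds multiply to the workload bounds.

    `mapping` maps a level name to {dimension: loop bound}; a dimension
    that is missing at a level is treated as having a bound of 1.
    """
    return all(
        prod(level.get(dim, 1) for level in mapping.values()) == bound
        for dim, bound in workload.items()
    )


workload = {"M": 32, "C": 16}
mapping = {
    "off_chip": {"M": 4, "C": 1},    # bounds stated in the text for Fig. 6b
    "gbuf": {"M": 2, "C": 2},        # hypothetical
    "noc": {"M": 4, "C": 4},         # hypothetical spatial (parallel) level
    "scratchpad": {"M": 1, "C": 2},  # hypothetical
}
print(covers_workload(mapping, workload))  # True
```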

Each sub-mapping has specific loop bounds and a specific loop order. The loop bounds in a sub-mapping determine how the inputs, filters, and outputs are partitioned at the current hardware level. Take the off-chip memory level of the two mappings (Fig. 6b and Fig. 6c, lines 2 to 4) as an example. The inputs and filters are initially stored in the off-chip memory. Assume the size of the global buffer is limited, so we cannot load all the data from the off-chip memory to the global buffer at once. Thus, we must partition the data and deliver the blocks in order. In Fig. 6d, the inputs are not partitioned, while the filters are partitioned into four blocks along the M dimension. The partitioning is determined by the loop bounds shown in Fig. 6b, lines 3 and 4, where the loop bound is 4 for the M dimension loop and 1 for the C dimension loop. In Fig. 6e, the inputs are partitioned into two blocks, and the filters are again partitioned into four blocks, but with different shapes. The difference comes from the loop bounds shown in Fig. 6c, lines 3 and 4, where the loop bounds are 2 for both the M and C dimension loops.

The order of the loop dimensions determines the order of data transfers and whether the outputs, inputs, and weights need to be loaded back and sent to the inner hardware level again. As shown in Fig. 6d and Fig. 6e, the data are partitioned in the same way at the global buffer level, because the loop bounds of the Gbuf level are the same in both mappings. However, as the order of their loop dimensions is different, as shown in Fig. 6b and Fig. 6c, lines 6 and 7, the data movement and staging are different. In the first iteration, input block-1 and filter block-1 are sent to the NoC level for both mappings. However, for the mapping in Fig. 6b, the partial sum results do not need to be loaded back, as the partial sum results of the next iteration belong to the same data block. In contrast, for the mapping in Fig. 6c, the partial sum results must be loaded back at the end of the first iteration, as a different partial sum block is computed in the next iteration. In the second iteration, for the mapping in Fig. 6b, input block-2 and filter block-2 (Fig. 6d) are sent to the NoC level. In contrast, for the mapping in Fig. 6c, only filter block-2 (Fig. 6e) needs to be delivered, as the inputs used for this iteration are the same as in the previous one. The order of the loop dimensions only affects temporal sub-mappings. For spatial sub-mappings, the order of the loop dimensions is not important, as the data are delivered concurrently to parallel components.

Fig. 6: An example workload and two possible mappings. (a) is the example workload; (b) and (c) are two example mappings that project the workload onto the hardware defined in Table III; (d) and (e) show the dataflow based on the mappings in (b) and (c), respectively.
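The effect of loop order on data movement can be reproduced with a small counting model. The sketch below assumes a simplified setting that is not stated in the paper: the inner level holds exactly one block of each tensor, and a block is transferred whenever its index differs from the block used in the previous iteration. The function name and the tensor-to-dimension table are likewise assumptions for illustration.

```python
from itertools import product

# Which loop dimensions select the block of each tensor in a
# vector-matrix multiplication (inputs: C, filters: M and C, outputs: M).
TENSOR_DIMS = {"inputs": ("c",), "filters": ("m", "c"), "outputs": ("m",)}


def count_block_transfers(loop_order, bounds):
    """Count block transfers for one temporal level.

    `loop_order` lists the dimensions from outermost to innermost.
    """
    transfers = dict.fromkeys(TENSOR_DIMS, 0)
    previous = dict.fromkeys(TENSOR_DIMS)  # no block is resident at first
    for index in product(*(range(bounds[dim]) for dim in loop_order)):
        point = dict(zip(loop_order, index))
        for tensor, dims in TENSOR_DIMS.items():
            block = tuple(point[dim] for dim in dims)
            if block != previous[tensor]:
                transfers[tensor] += 1
                previous[tensor] = block
    return transfers


bounds = {"m": 2, "c": 2}
m_outer = count_block_transfers(("m", "c"), bounds)
c_outer = count_block_transfers(("c", "m"), bounds)
print(m_outer)  # {'inputs': 4, 'filters': 4, 'outputs': 2}
print(c_outer)  # {'inputs': 2, 'filters': 4, 'outputs': 4}
```

With M as the outer loop, the output block stays in place across consecutive iterations while the inputs change, which matches the behavior described for Fig. 6b. With C as the outer loop, the inputs are reused and the partial sums are swapped, as described for Fig. 6c.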

#### B. Mapping Constructor and Pruner

The mapping constructor generates all possible mappings for each intra-layer workload by factoring the value of each loop bound of the workload across the system levels. Advanced users can also provide custom factor lists, which force TRIM to follow their factorization when constructing mappings. The mapping constructor also enumerates all possible orders of the seven loops inside each level. Furthermore, inputs, weights, or outputs may bypass levels that offer little reuse. For a specific intra-layer workload, the number of possible mappings is fixed and is determined by the hardware description. Generally speaking, the mapping constructor generates more mappings for a system with more hierarchy levels, more parallel PEs, or a larger memory size.
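Factoring a loop bound across the system levels is equivalent to enumerating its ordered factorizations. The following sketch shows one straightforward recursive way to do this; it is an illustration of the idea rather than TRIM's actual constructor, and it ignores bypassing and user-supplied factor lists.

```python
from math import factorial


def factorizations(bound, levels):
    """List every ordered way to write `bound` as a product of `levels` factors."""
    if levels == 1:
        return [(bound,)]
    result = []
    for factor in range(1, bound + 1):
        if bound % factor == 0:
            for rest in factorizations(bound // factor, levels - 1):
                result.append((factor,) + rest)
    return result


# Workload of Fig. 6a (M = 32, C = 16) split across four loop levels.
m_splits = factorizations(32, 4)
c_splits = factorizations(16, 4)
bound_combinations = len(m_splits) * len(c_splits)
loop_orders_per_level = factorial(7)  # orders of seven loops at one level
print(len(m_splits), len(c_splits), bound_combinations)  # 56 35 1960
print(loop_orders_per_level)  # 5040
```

Even this small two-dimensional workload yields 1960 combinations of loop bounds before loop orders are considered, which illustrates why the mapspace of a full seven-dimensional CONV layer can reach millions of mappings.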

After a mapping is generated, it is used to compute the required size of each hardware component. The memory required by the inter-layer workloads is also computed and added to the memory required by the mapping. The results are checked against the size constraints listed in the hardware organization. If the hardware resources needed by the mapping, such as the number of PEs and the memory capacity, do not exceed the amount provided by the hardware organization, we consider the mapping valid. All invalid mappings are discarded, and all valid mappings together form the valid mapspace. The number of valid mappings for each layer can vary from zero to several million. A mapspace of this size can be explored exhaustively, but doing so requires significant search time. We offer two optional methods to prune the valid mapspace and improve the speed of exploration: the dataflow constraint method and the utilization constraint method.
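The validation step can be sketched as a comparison of required and available resources. In the example below, the resource names, the dictionary encoding, and the choice of charging the cached activations to a single user-selected memory level are all assumptions made for illustration; the paper does not specify these details.

```python
def is_valid_mapping(required, capacity, cached_activation_bytes=0,
                     cache_level="gbuf_bytes"):
    """Check a mapping's resource needs against the hardware organization.

    `required` and `capacity` map resource names to amounts. The memory
    used by cached activations is added to `cache_level` (an assumption).
    """
    needed = dict(required)
    needed[cache_level] = needed.get(cache_level, 0) + cached_activation_bytes
    return all(needed[name] <= capacity[name] for name in needed)


# Hypothetical capacities loosely based on Table III.
capacity = {"pes": 256, "scratchpad_bytes": 520, "gbuf_bytes": 108 * 1024}
required = {"pes": 224, "scratchpad_bytes": 448, "gbuf_bytes": 60 * 1024}

print(is_valid_mapping(required, capacity))                    # True
print(is_valid_mapping(required, capacity, 64 * 1024))         # False
```

The second call shows how a mapping that fits on its own can become invalid once the memory reserved for intermediate activations is taken into account.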

Dataflow constraint methods are primarily designed for modeling architectures that have a co-designed dataflow. Most architects develop a dataflow strategy for their architecture to achieve efficiency; for example, Eyeriss was designed with a row-stationary dataflow in mind. By default, TRIM explores the entire design space with no constraints on the dataflow (this includes commonly used dataflows such as row stationary and weight stationary, along with any unique ones). To make TRIM consider only a certain dataflow, we can pin some of the for loops in Fig. 3 to specific positions. The dataflow constraints (the user inputs shown in Fig. 1) are used to select the valid mappings that comply with specific dataflow rules, and they can therefore also be used to prune a valid mapspace. However, adding more dataflow constraints may exclude optimal mappings in the early stages and lead to an inefficient design. In our architecture design space explorations, we do not add any dataflow constraints, to ensure that TRIM can find the optimal solution.

Utilization constraint methods were developed based on our exploration experience using TRIM. We found that mappings with a high PE utilization, defined as the number of active PEs divided by the total number of PEs, are generally faster. Mappings with a high utilization of the memory closest to the PEs are usually more energy-efficient. Thus, we prune the mapspace by setting utilization constraints. For explorations where the design goal is high throughput, we set the PE-level utilization constraint to 0.75, so mappings whose PE utilization is less than 0.75 are discarded from the mapspace. When searching for energy-efficient mappings, we set the utilization constraint of the scratchpads/registers to 0.5, so mappings whose scratchpad/register utilization is less than 0.5 are removed. These two thresholds were selected based on our exploration experience. We explored architecture design spaces with and without these constraints and obtained the same optimal mappings.
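The utilization-based pruning described above can be sketched in a few lines of Python. This is a minimal illustration rather than TRIM's implementation: the `Mapping` record, its field names, and the `prune_mapspace` function are hypothetical, and only the two thresholds (0.75 and 0.5) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """Hypothetical summary of one valid mapping (not TRIM's internal format)."""
    name: str
    active_pes: int
    total_pes: int
    scratchpad_used_bytes: int
    scratchpad_total_bytes: int

    @property
    def pe_utilization(self) -> float:
        # Ratio of active PEs to the total number of PEs.
        return self.active_pes / self.total_pes

    @property
    def scratchpad_utilization(self) -> float:
        # Fraction of the scratchpad/register capacity that the mapping uses.
        return self.scratchpad_used_bytes / self.scratchpad_total_bytes


def prune_mapspace(mappings, goal):
    """Discard mappings whose utilization is below the threshold for the goal."""
    if goal == "throughput":
        return [m for m in mappings if m.pe_utilization >= 0.75]
    if goal == "energy":
        return [m for m in mappings if m.scratchpad_utilization >= 0.5]
    raise ValueError(f"unknown design goal: {goal}")


# Illustrative mapspace with made-up numbers.
mapspace = [
    Mapping("m1", active_pes=256, total_pes=256,
            scratchpad_used_bytes=128, scratchpad_total_bytes=512),
    Mapping("m2", active_pes=192, total_pes=256,
            scratchpad_used_bytes=256, scratchpad_total_bytes=512),
    Mapping("m3", active_pes=128, total_pes=256,
            scratchpad_used_bytes=512, scratchpad_total_bytes=512),
]

fast = [m.name for m in prune_mapspace(mapspace, "throughput")]   # m1, m2
frugal = [m.name for m in prune_mapspace(mapspace, "energy")]     # m2, m3
```

Because the filter only removes candidates, an exhaustive search over the pruned mapspace still returns the same optimum whenever the best mapping satisfies the threshold.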

#### VI. EVALUATOR AND EXPLORER

The TRIM evaluator takes mappings and inter-layer workloads as inputs to estimate the performance, energy, and area, as shown in Fig. 1. It consists of an activity analyst, a performance model, an energy model, and an area model. The last TRIM component is the TRIM explorer, which coordinates all TRIM components to explore the design space and select the optimal architecture.

#### A. Activity Analyst

The activity analyst generates the number of MACs, memory accesses, and NoC activities for mappings. The number of MACs is computed as the product of all loop bounds in a mapping. The memory accesses can be counted by using a simulator. However, a simulator would be extremely slow and cannot be used in large design space explorations. As the data transfer and computation in the workload executed are deterministic, we can use a mathematical method to compute the memory accesses quickly and accurately.
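As a small illustration of the MAC count, the sketch below multiplies the loop bounds of a tiled loop nest. The list-of-tuples representation and the example bounds are assumptions made for this example; they are not TRIM's mapping format.

```python
from math import prod


def count_macs(loops):
    """Number of MACs for a mapping: the product of all loop bounds.

    `loops` is a list of (dimension, bound) pairs; a dimension may appear
    several times when it is tiled across hardware levels.
    """
    return prod(bound for _, bound in loops)


# Illustrative tiled loop nest: M is split as 6 x 16 and F as 5 x 11.
example_loops = [
    ("N", 1), ("M", 6), ("M", 16), ("C", 3),
    ("R", 11), ("S", 11), ("E", 55), ("F", 5), ("F", 11),
]

macs = count_macs(example_loops)  # 105415200
```

Tiling changes how the bounds are distributed across hardware levels but not their product, so every valid mapping of the same layer yields the same MAC count.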

The memory accesses are computed using the loop dimensions and bounds belonging to the current hardware level, examined in order from the innermost loop to the outermost. If the loop being examined has one of the filter dimensions ('R', 'S', 'M', 'C') or one of the output dimensions ('N', 'M', 'E', 'F'), the memory accesses are computed as the product of the current loop bound, all unvisited loop bounds, and the corresponding filter or output data package size. The memory accesses of inputs are more complex, as consecutive data packages may overlap, in which case only part of the data package needs to be transferred. We therefore first compute the overlap size of two consecutive iterations in each loop and then calculate the input memory accesses.

The NoC activities are generated based on the dimensions and bounds of the *parallel for* loops, which define how the input, filter, and output data are transferred in the NoC. If the *parallel for* loop dimension corresponds to batch size (N), output height (E), or output width (F), each node in the NoC gets different input data and the same filter data, and the computation results do not need to be accumulated. In this case, the filter data are counted as multicast activities, while the inputs and outputs are counted as normal data transfer activities. If the *parallel for* loop dimension corresponds to filter height (R), filter width (S), or input channel (C), each node has different input data and filter data, while the computation results belong to the same outputs and need to be accumulated. In this case, the inputs and filters are counted as normal data transfer activities, while the outputs are counted as data transfer with accumulation activities. If the *parallel for* loop dimension corresponds to output channel (M), each node gets the same input data and different filter data, and the computation results do not need to be accumulated. Thus, the input data are counted as multicast activities, and the filter data and output data are counted as normal data transfer activities.
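The classification can be summarized as a lookup from the *parallel for* dimension to an activity type per data type. The sketch below is illustrative only: the `Transfer` enum and `classify_noc_activity` function are hypothetical names. It assumes that data shared by all nodes is counted as multicast, so in the output-channel (M) case, where every node receives the same inputs, the inputs are treated as the multicast traffic.

```python
from enum import Enum


class Transfer(Enum):
    NORMAL = "normal data transfer"
    MULTICAST = "multicast"
    ACCUMULATE = "data transfer with accumulation"


def classify_noc_activity(parallel_dim: str) -> dict:
    """Return the NoC activity type of each data type for one parallel-for dimension."""
    if parallel_dim in ("N", "E", "F"):
        # Different inputs per node, shared filters, independent outputs.
        return {"inputs": Transfer.NORMAL,
                "filters": Transfer.MULTICAST,
                "outputs": Transfer.NORMAL}
    if parallel_dim in ("R", "S", "C"):
        # Different inputs and filters per node; partial sums are accumulated.
        return {"inputs": Transfer.NORMAL,
                "filters": Transfer.NORMAL,
                "outputs": Transfer.ACCUMULATE}
    if parallel_dim == "M":
        # Shared inputs (assumed multicast), different filters, independent outputs.
        return {"inputs": Transfer.MULTICAST,
                "filters": Transfer.NORMAL,
                "outputs": Transfer.NORMAL}
    raise ValueError(f"unknown loop dimension: {parallel_dim}")


activity_for_c = classify_noc_activity("C")
```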

## B. Performance, Energy, and Area Estimations

The execution cycles of each intra-layer workload (mapping) are estimated by first computing the execution cycles needed at each hardware level. For the PE level, the execution cycles are calculated as the number of MACs divided by the number of PEs and the number of pipeline stages of the PEs. For the memory and NoC levels, the execution cycles are computed as the data transfer size divided by the interface bandwidth. We assume that all hardware levels operate as a pipeline. Thus, the overall time of an intra-layer workload is the maximum of the execution cycles across the hardware levels. The execution cycles of a data preprocessing workload are estimated as the size of the output data divided by the memory bandwidth. No extra time is needed for intermediate activation caching workloads, as their time can be overlapped with the intra-layer workload. The overall performance is computed as the sum of the execution cycles of all intra-layer workloads and data preprocessing workloads.
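The cycle estimate can be illustrated with the following sketch. The function names and all numeric values are hypothetical. Two assumptions are made here that the text does not specify: the PE-level estimate divides the MAC count by the product of the number of PEs and the number of pipeline stages, and fractional cycles are rounded up.

```python
from math import ceil


def pe_cycles(num_macs, num_pes, pipeline_stages=1):
    # Assumed reading: MACs / (PEs * pipeline stages), rounded up.
    return ceil(num_macs / (num_pes * pipeline_stages))


def transfer_cycles(transfer_bytes, bandwidth_bytes_per_cycle):
    # Memory and NoC levels: data transfer size divided by interface bandwidth.
    return ceil(transfer_bytes / bandwidth_bytes_per_cycle)


def intra_layer_cycles(level_cycles):
    # All levels run as a pipeline, so the slowest level sets the pace.
    return max(level_cycles.values())


def total_cycles(intra_layer, preprocessing):
    # Caching of intermediate activations is overlapped, so it adds no cycles.
    return sum(intra_layer) + sum(preprocessing)


# Illustrative numbers for one layer on a 256-PE design.
layer = {
    "PE": pe_cycles(1_048_576, 256),      # 4096 cycles
    "NoC": transfer_cycles(32_768, 16),   # 2048 cycles
    "DRAM": transfer_cycles(65_536, 8),   # 8192 cycles
}
bottleneck = intra_layer_cycles(layer)    # 8192: this layer is DRAM-bound

network = total_cycles([bottleneck, bottleneck], [transfer_cycles(4096, 8)])
```

In this example the DRAM interface is the bottleneck, so adding PEs alone would not shorten the layer's execution time.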

TRIM has an embedded energy and area model based on 65 nm technology. We utilize microarchitecture parameters such as bit width and PE pipeline stages; memory type, memory size, number of memory ports, wire length for NoC, etc., to compute each hardware component's energy and area. For the memory components, the model is based on CACTI [16]. To improve the accuracy of our energy and area model, we also used the energy and area data from published papers [9], [17]. Both on-chip memory and off-chip memory are considered.

| Algorithm 1: Design Space Exploration using TRIM |
|--------------------------------------------------|
| Input: design goal, task description, hardware parameters, mapping constraints |
| Output: optimal architecture |
| intra-layer workloads, inter-layer workloads $\leftarrow$ TaskAnalyst(task description); |
| Architecture Space $\leftarrow$ Designer(hardware parameters); |
| for each hardware description $\in$ Architecture Space do |
| $\quad$ for each intra-layer workload do |
| $\quad\quad$ MapSpace $\leftarrow$ Mapper(hardware organization, intra-layer workloads, mapping constraints); |
| $\quad\quad$ for each mapping $\in$ MapSpace do |
| $\quad\quad\quad$ Performance, Energy, Area $\leftarrow$ Evaluator(mapping); |
| $\quad\quad\quad$ Evaluate design goal; |
| $\quad\quad\quad$ Update optimalMappings[intra-layer]; |
| $\quad\quad$ end |
| $\quad$ end |
| $\quad$ Performance, Energy, Area $\leftarrow$ Evaluator(optimalMappings, inter-layer workloads); |
| $\quad$ Evaluate design goal; |
| $\quad$ Update optimal Architecture; |
| end |

The energy model can be replaced by other activity-count-based energy models, such as Accelergy [18]. The dynamic energy consumption is computed as the activity counts multiplied by the energy per activity. Both inter-layer and intra-layer workloads are considered. We also consider the network's sparsity, but only the zeros generated by padding and upsampling. Zeros that come from the input data and from ReLU cannot be predicted at hardware design time. The static energy mainly comes from caching the intermediate activations. We obtain the caching time of each intermediate activation from the estimated performance and compute the energy using the memory static power from the energy model. The overall energy consumption is the sum of the dynamic and static energy. The area is computed by adding the area of each hardware component listed in the hardware organization.
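An activity-count-based energy estimate of this kind can be sketched as follows. All names, per-activity energies, and counts are hypothetical placeholders, not values from TRIM's 65 nm model. For simplicity, energy is expressed in pJ, caching time in cycles, and static power in pJ per cycle.

```python
def dynamic_energy(activity_counts, energy_per_activity):
    # Sum over activity types of (count x energy per activity).
    return sum(count * energy_per_activity[name]
               for name, count in activity_counts.items())


def static_energy(static_power_per_cycle, caching_cycles):
    # Static energy of holding each cached intermediate activation in memory.
    return static_power_per_cycle * sum(caching_cycles)


def total_energy(activity_counts, energy_per_activity,
                 static_power_per_cycle, caching_cycles):
    return (dynamic_energy(activity_counts, energy_per_activity)
            + static_energy(static_power_per_cycle, caching_cycles))


# Made-up example values (pJ per activity, pJ per cycle).
counts = {"mac": 1000, "register": 3000, "dram": 10}
pj_per_activity = {"mac": 1.0, "register": 1.0, "dram": 200.0}

dyn = dynamic_energy(counts, pj_per_activity)          # 6000.0 pJ
stat = static_energy(0.5, [1000, 3000])                # 2000.0 pJ
total = total_energy(counts, pj_per_activity, 0.5, [1000, 3000])
```

Since the model depends only on activity counts and per-activity costs, swapping in another cost table, such as one produced by a different estimator, would leave the rest of the computation unchanged.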

## C. TRIM Explorer

The TRIM Explorer is the controller of the overall TRIM model. It utilizes different TRIM components to select the optimal architecture and corresponding optimal mappings based on the design goal, as shown in Algorithm 1. The task analyst generates the intra-layer workloads and inter-layer workloads. The designer creates the architecture space based on the hardware parameters. For each hardware description in the architecture space, we can find the global optimal mappings for each intra-layer workload with the exhaustive search method. This was used in all our case studies. Combining the optimal mappings and inter-layer workloads, we get the best performance and lowest energy consumption that the architecture can achieve. Using this data, the explorer can fairly compare different architectures and select the optimal one.
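The control flow of Algorithm 1 can be sketched in Python as below. This is an illustration under stated assumptions, not TRIM's code: the mapper, evaluators, and cost function are passed in as hypothetical callables, and the toy stand-ins at the bottom use made-up workloads in which a mapping is simply a number of active PEs.

```python
def explore(architecture_space, intra_layer_workloads, inter_layer_workloads,
            mapper, evaluate_mapping, evaluate_architecture, cost):
    """Exhaustive search: best mapping per layer, then best architecture."""
    best = None  # (cost, hardware, optimal mappings)
    for hardware in architecture_space:
        optimal_mappings = {}
        for workload in intra_layer_workloads:
            mapspace = mapper(hardware, workload)
            if not mapspace:
                break  # no valid mapping: this hardware cannot run the layer
            optimal_mappings[workload] = min(
                mapspace,
                key=lambda m: cost(evaluate_mapping(hardware, workload, m)),
            )
        else:
            # Every layer has a mapping; add the inter-layer workloads.
            arch_cost = cost(evaluate_architecture(
                hardware, optimal_mappings, inter_layer_workloads))
            if best is None or arch_cost < best[0]:
                best = (arch_cost, hardware, optimal_mappings)
    return best


# Toy stand-ins (hypothetical): hardware = number of PEs, mapping = active PEs.
MACS = {"conv1": 1024, "conv2": 4096}
INTER_LAYER_CYCLES = 64


def toy_mapper(hardware, workload):
    return [p for p in (4, 8, 16) if p <= hardware]


def toy_evaluate_mapping(hardware, workload, active_pes):
    return {"cycles": MACS[workload] / active_pes}


def toy_evaluate_architecture(hardware, mappings, inter_layer_cycles):
    cycles = sum(MACS[w] / p for w, p in mappings.items())
    return {"cycles": cycles + inter_layer_cycles}


result = explore([8, 16], list(MACS), INTER_LAYER_CYCLES,
                 toy_mapper, toy_evaluate_mapping, toy_evaluate_architecture,
                 cost=lambda metrics: metrics["cycles"])
```

Passing the design goal as a `cost` callable keeps the search loop unchanged whether the goal is throughput, energy, or the energy-delay product.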



Fig. 7: Proposed FPGA Design

TABLE IV: FPGA Exploration Hardware Setup

|            | FPGA-1 | FPGA-2 | FPGA-3 | FPGA-4 | FPGA-5 |
|------------|--------|--------|--------|--------|--------|
| PE         | 8      | 16     | 32     | 64     | 128    |
| Cache (KB) | 20     | 24     | 32     | 48     | 80     |

#### VII. FPGA VALIDATION AND CASE STUDY

## A. Experimental Setup

To validate TRIM, we used it to explore the design space of an FPGA-based training architecture. We selected AlexNet (AlexNet-Cifar) [19], VGG-11 (VGG-Cifar) [20], and ResNet-20 (ResNet-Cifar) [21] and trained them with the CIFAR-10 dataset [22] to evaluate our system. The PYNQ-Z1 board was used as the FPGA platform. The board has a dual-core Cortex-A9 processor, a high-performance FPGA chip (Xilinx ZYNQ XC7Z020), and 512 MB DDR3 memory. To estimate time, energy, and area accurately, we adjusted the model parameters based on the FPGA frequency, energy, and resource utilization. These data were collected by implementing and characterizing several essential components, such as MAC units and DMA channels, on the FPGA.

Fig. 7 shows the FPGA-based architecture we designed and implemented, which has both inference and training capabilities. The architecture has 32 parallel PEs and a 32 KB cache. A DMA controller was used for high-performance data transfers between the accelerator cache and the DDR3 memory. The processor was used for dataflow control. The FPGA results achieved the same level of accuracy as the benchmarks trained on our PC with TensorFlow. To measure the real FPGA performance, we measured time using the Python time module. The energy was computed based on the FPGA power reported by Vivado and the measured time.
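The measurement procedure amounts to timing a run on the host and multiplying by the reported power. A minimal sketch is shown below; the choice of `time.perf_counter`, the `measure_energy` helper, the 2.0 W power figure, and the placeholder workload are all assumptions for illustration, since the text only states that Python's timing facilities and the Vivado power report were used.

```python
import time


def measure_energy(run, power_watts):
    """Time one call of `run` and return (elapsed seconds, energy in joules)."""
    start = time.perf_counter()
    run()
    elapsed = time.perf_counter() - start
    # Energy = power reported by the tool x measured execution time.
    return elapsed, power_watts * elapsed


def placeholder_workload():
    # Stand-in for launching one training phase on the accelerator.
    return sum(i * i for i in range(10_000))


elapsed_s, energy_j = measure_energy(placeholder_workload, power_watts=2.0)
```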

## B. Validate TRIM with Proposed FPGA Design

Fig. 8 shows the TRIM prediction divided by the real value for the time and energy of each phase in AlexNet-Cifar on the FPGA. The overall time modeling error from TRIM was less than 10% for each phase, while the energy modeling error was less than 20%. Fig. 9 shows the normalized time and energy of training our three benchmarks. The results show that TRIM accurately predicts training time and energy. As an early-stage model, TRIM offers good accuracy on both time and energy estimations.

## C. Explore Different FPGA Designs

We varied the hardware parameters and configured five different FPGA hardware design points as shown in Table IV.



Fig. 8: Time accuracy for TRIM across all phases in AlexNet-Cifar



Fig. 9: Normalized time (left) and energy (right) comparison between TRIM and proposed FPGA



Fig. 10: Normalized time for selected hardware architectures using their own highest throughput mapping



Fig. 11: Normalized energy for selected hardware architectures using their own highest throughput mapping

For each design in Table IV, we searched for the mapping that provided the highest throughput. Fig. 10 and Fig. 11 show the normalized time and energy for our five different FPGA configurations with highest throughput for each of the three networks examined. As we increased the hardware resources used, the time needed to compute each network decreased. The only exception is with FPGA-5 training ResNet-20, where the number of PEs doubled compared to FPGA-4, but the throughput did not increase. This is because ResNet-20 has more layers, but each layer has far fewer parameters than the other two networks. This means that there is less scope for data parallelism.

Fig. 12: FPGA hardware utilization rate

TABLE V: FPGA Resources Results Generated by TRIM

|      | FPGA-1 | FPGA-2 | FPGA-3 | FPGA-4 | FPGA-5 |
|------|--------|--------|--------|--------|--------|
| LUT  | 9200   | 16400  | 30800  | 59600  | 117200 |
| FF   | 6300   | 11100  | 20700  | 39900  | 78300  |
| BRAM | 24     | 40     | 72     | 136    | 264    |
| DSP  | 40     | 80     | 160    | 320    | 640    |

For AlexNet, FPGA-5 and FPGA-4 consumed almost the same training energy, while FPGA-5 achieved a $1.38 \times$ speedup. For VGG-11, FPGA-4 achieved the best energy efficiency, while FPGA-5 consumed $1.02 \times$ more energy to achieve a $1.31 \times$ speedup over FPGA-4. For ResNet-20, FPGA-3 achieved the best energy efficiency, while FPGA-4 consumed $1.04 \times$ more energy to achieve a $1.32 \times$ speedup.

Fig. 12 shows the utilization rate of PEs and cache for each FPGA architecture training different networks. As FPGA-1 to FPGA-3 have a limited number of parallel PEs, the utilization rate of PEs is high for all three networks. FPGA-4 and FPGA-5 show higher PE utilization for AlexNet and VGG-11 than for ResNet-20, which explains why FPGA-5 does not achieve a higher training speedup for ResNet-20. The cache utilization for ResNet-20 is much lower than for the other two networks, because ResNet-20 has far fewer parameters per layer and therefore requires much less memory space.

## D. Validate TRIM with Different FPGA Designs

Table V shows the resources needed by the different FPGA designs, as predicted by TRIM. FPGA-4 and FPGA-5 required 320 and 640 DSP units, respectively, while only 220 DSP units were available on our PYNQ-Z1 board. Thus, we could not implement FPGA-4 and FPGA-5 on the PYNQ-Z1 board. We implemented FPGA-1 to FPGA-3 on the PYNQ-Z1 board to validate TRIM.
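The feasibility argument can be reproduced directly from Table V and the DSP count stated above. The sketch below is only an illustration of that check; the dictionary layout and the `fits` helper are hypothetical, and only the DSP limit is checked because it is the only board capacity given in the text.

```python
# Resource predictions copied from Table V.
TABLE_V = {
    "FPGA-1": {"LUT": 9200, "FF": 6300, "BRAM": 24, "DSP": 40},
    "FPGA-2": {"LUT": 16400, "FF": 11100, "BRAM": 40, "DSP": 80},
    "FPGA-3": {"LUT": 30800, "FF": 20700, "BRAM": 72, "DSP": 160},
    "FPGA-4": {"LUT": 59600, "FF": 39900, "BRAM": 136, "DSP": 320},
    "FPGA-5": {"LUT": 117200, "FF": 78300, "BRAM": 264, "DSP": 640},
}

# DSP units available on the PYNQ-Z1 board, as stated in the text.
AVAILABLE = {"DSP": 220}


def fits(required, available):
    """True if every listed resource requirement is within the available amount."""
    return all(required[name] <= limit for name, limit in available.items())


implementable = [name for name, res in TABLE_V.items() if fits(res, AVAILABLE)]
# implementable == ['FPGA-1', 'FPGA-2', 'FPGA-3']
```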

Fig. 13a and Fig. 13b show that TRIM accurately predicted the FPGA time and energy. As shown in Fig. 13c, TRIM predicted LUT and FF usage within 5% error and predicted BRAM and DSP usage accurately. Overall, TRIM accurately predicted the time, energy, and area.



Fig. 13: Validate selected FPGA architecture



Fig. 14: Overview of proposed spatial architecture

#### VIII. ASIC VALIDATION AND CASE STUDY

### A. Spatial architecture and TRIM validation

This section shows how TRIM is used to explore the design space of spatial architectures. Fig. 14 shows an overview of the spatial architectures we are examining. These are widely utilized as DNN accelerators, especially for inference. They typically consist of tens to hundreds of simple PEs that communicate with each other through a NoC. These architectures offer both high throughput and energy efficiency on parallel computing tasks, such as convolution and matrix multiplication. Several groups [6], [7] have designed unique spatial architectures to process the inference phase of DNNs efficiently.

Eyeriss [23] is a typical spatial architecture that is designed for the inference phase of DNNs. The original Eyeriss chip has a 108 KB on-chip global buffer (SRAM) and 168 PEs. Each PE has 512 bytes of registers. It utilizes a 16 bit data format, as this is enough for the inference phases of the deep networks considered.

We validated our model against Eyeriss, as the hardware properties of this accelerator have been published. We selected the five convolutional layers listed in their paper for inference (Eyeriss does not do training). Fig. 15a compares the runtimes predicted by TRIM to the times listed in [9]. Although we overestimate the performance of Eyeriss, the most significant difference is for CONV1, with an error of about 17%. Fig. 15b shows that the power is underestimated by about 20%, which was also the case with our FPGA power estimations.

We designed and modeled a spatial architecture with DNN training capability. We started with the Eyeriss hardware configuration and modified it to have training capability. We increased the data format to 32 bits, as fewer bits would cause the training accuracy to drop. We added extra functional units to each PE to enable training: a transpose unit and a derivative unit based on the Eyeriss activation unit. We chose 256 PEs, 1024 bytes of registers per PE, and a 256 KB global buffer as our baseline hardware configuration.

Fig. 15: Validate with Eyeriss inference accelerator

Eyeriss uses a row-stationary dataflow, which is one of the most energy-efficient dataflows for inference. There are no published studies on which type of dataflow is best for training. In this study, we did not impose any mapping (dataflow) constraints and gave the model full freedom to explore all possible dataflows. Thus, TRIM can go through millions of valid mappings (dataflows) to find the mapping with the lowest energy consumption.

Five benchmarks were selected: 1) AlexNet training ImageNet (AlexNet-IM) [15], 2) AlexNet training Cifar-10 (AlexNet-Cifar) [19], 3) VGG-11 training ImageNet (VGG-IM) [20], 4) ResNet-18 training ImageNet (ResNet-IM) [21], and 5) MobileNet training ImageNet (MobileNet-IM) [24].

Below, we look at two case studies. In the first case, we use the baseline hardware configuration and study the effect on energy efficiency of sparsity circuits, different network models, datasets, and batch size. In the second case study, we use optimal options from case study I and vary the hardware configuration (size of register per PE and global buffer) to find optimal energy efficient hardware configurations for 256, 512, and 1024 PEs architectures.

## B. Case Study I: Energy Efficiency Analysis for a Training Architecture

1) Utilize the sparsity of DNNs: In DNN inference accelerator designs, data sparsity is utilized to achieve better energy efficiency. The most commonly used method is to add circuits that skip operations with zero operands, including multiplication and addition by zero, which we call zero-skipping circuits. For the inference phase of DNNs, the zeros come from two sources: padding zeros in convolution computations and zero values in the data. For the training of DNNs, there is one more source: upsampling zeros, which are generated by the upsampling operations shown in Eqs. 2 and 3.

Fig. 16: Energy breakdown for different applications across each hardware level

Fig. 17: Energy breakdown for different applications across each phase

Fig. 18: Overall normalized energy for different networks

We estimated the performance and energy of our baseline architecture with and without zero-skipping circuits. The zero-skipping circuits were implemented between the global buffer and the register files and are applied each time data is read from the global buffer into the register files. We only considered the padding zeros and upsampling zeros, as they are determined by the network model and can be calculated without knowing the values of the input data. Note that because we did not consider the zero values in the data, the energy efficiency of the architecture with zero-skipping circuits is actually underestimated.
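To illustrate why these zeros can be counted from the network model alone, the sketch below computes the fraction of zeros in a padded and in an upsampled feature map from their shapes. It rests on two assumptions that are not taken from this section: padding adds `pad` rows and columns of zeros on every side, and upsampling inserts `stride - 1` zeros between neighboring elements (the exact operations are defined by Eqs. 2 and 3). The function names are hypothetical.

```python
def padding_zero_fraction(height, width, pad):
    """Fraction of zeros in a feature map after symmetric zero padding."""
    padded = (height + 2 * pad) * (width + 2 * pad)
    return 1 - (height * width) / padded


def upsampling_zero_fraction(height, width, stride):
    """Fraction of zeros after inserting (stride - 1) zeros between elements.

    Assumed form of upsampling; see the lead-in for the caveat.
    """
    up_height = (height - 1) * stride + 1
    up_width = (width - 1) * stride + 1
    return 1 - (height * width) / (up_height * up_width)


# A 4x4 map padded by 1 becomes 6x6 with 20 zeros out of 36 elements.
pad_fraction = padding_zero_fraction(4, 4, pad=1)

# A 3x3 map upsampled with stride 2 becomes 5x5 with 16 zeros out of 25.
up_fraction = upsampling_zero_fraction(3, 3, stride=2)
```

Both fractions depend only on layer shapes and strides, which is why they can be included in an early-stage model, unlike data-dependent zeros.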

Figs. 16 and 17 show the normalized energy per operation for our modeled architectures with and without zero-skipping circuits for training. The architecture with zero-skipping circuits showed better energy efficiency for all four benchmarks, with the highest improvement being $1.4 \times$ for AlexNet. As shown in Fig. 16, the energy savings mainly come from the ALU and register files. Fig. 17 indicates that the savings primarily come from the weight gradient phase, which means that they mainly result from the upsampling of the propagation error. Thus, the architecture with zero-skipping circuits achieves the largest improvement when training AlexNet, as it has the most upsampling operations among the four benchmarks. As the zero-skipping circuits improve energy efficiency with little area overhead, we included them in the baseline architectures used in the following studies.

2) Compare different applications: Figs. 16 and 17 show that the same architecture consumed different energy per operation while training various networks and datasets. Fig. 16 breaks down the energy per operation for different hardware components, while Fig. 17 breaks down the energy per operation for different phases of training. Fig. 18 shows the normalized network energy for training one batch of 64 images broken down by hardware component. All the benchmarks are explored using our baseline hardware and best energy efficiency mapping criteria.

The different energy efficiencies come from the different reuse opportunities of the input feature maps, filters, and output feature maps offered by the training datasets, network models, and types of convolution operations. AlexNet-IM and AlexNet-Cifar use the same neural network architecture to train on different datasets. AlexNet-IM consumes more overall energy because of the larger image sizes, but consumes $1.7 \times$ less energy per operation because the larger image sizes provide more filter reuse opportunities. In addition to datasets, different network architecture options lead to different energy efficiencies per operation. Among the four selected networks, AlexNet-IM training shows $1.41 \times$ and $1.37 \times$ better energy efficiency per operation than VGG-IM and ResNet-IM, respectively. This is because AlexNet uses larger filter sizes, which can lead to more reuse of input feature maps and output feature maps. MobileNet shows the highest energy per operation, as it uses depthwise convolution and pointwise convolution operations, which have far fewer data reuse opportunities than the 2D convolution used in the other networks. However, these convolution methods significantly reduce the number of MAC operations, leading to higher overall energy efficiency than ResNet-IM and VGG-IM, as shown in Fig. 18.

3) Batch size variation: As shown in Figs. 19 and 20, training with a larger batch size achieved better energy efficiency for the same architecture. The differences come from the mappings utilized: data with a larger batch size offers more reuse opportunities for input feature maps and output feature maps. TRIM can optimize the mapping to maximize the benefit of those reuse opportunities. Training with batch size 16 shows $3.1 \times$ higher energy efficiency than training with batch size 1. The improvement comes from reducing DRAM accesses and increasing accesses to the global buffer and the Network-on-Chip. However, training with batch size 128 shows negligible additional energy efficiency over training with batch size 64, even though it doubled the reuse opportunities of input and output feature maps. The bottleneck is the available hardware resources, which limit the reuse at each hardware level. Fig. 20 shows that the energy efficiency gain came evenly from each phase, which means that batch processing can also be used in inference-only accelerators to improve energy efficiency.

Fig. 19: Energy breakdown for different batch size training AlexNet-IM across each hardware level

Fig. 20: Energy breakdown for different batch size training AlexNet-IM across each phase

Overall, the network models, datasets, and batch size choices significantly affect the energy consumption for a given hardware architecture. In other words, if we want to compare the energy efficiency among different hardware architectures fairly, we must keep the network model, dataset, and batch size the same.

## C. Case Study II: Optimize hardware architecture through design space exploration

In this study, we build on the results from case study I and explore optimal architectures for the different networks. Based on the results of case study I, we implemented zero-skipping circuits and set the batch size to 64. We then varied the size of the register file per PE and global buffer to explore the design space for 256, 512, and 1024 PEs architectures. This was done for the AlexNet-IM, AlexNet-Cifar, and ResNet-IM benchmarks. The design goal was set to find the lowest energy-delay product options.

Fig. 21 shows the exploration results for AlexNet-Cifar with the design goal set as the lowest energy-delay product (EDP). As shown in Fig. 21a to Fig. 21c, the EDP decreased as hardware resources increased. Additional on-chip memory resources allow better caching of network parameters and intermediate activations, thus reducing the overall energy for off-chip memory accesses. We also found that with more PEs, processing was faster while the overall energy did not increase, so additional PEs lowered the EDP. A designer therefore has to trade off a lower EDP against the additional chip area needed by the PEs and memory.
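The EDP design goal is simply the product of energy and execution time, and the lowest-EDP candidate is selected. The sketch below illustrates this with made-up candidates; the energy and delay values are hypothetical and are not results from Fig. 21.

```python
def edp(energy, delay):
    """Energy-delay product: lower is better."""
    return energy * delay


# Hypothetical candidates (arbitrary but consistent units).
candidates = [
    {"pes": 256, "energy": 2.0, "delay": 4.0},    # EDP 8.0
    {"pes": 512, "energy": 2.0, "delay": 2.0},    # EDP 4.0
    {"pes": 1024, "energy": 2.1, "delay": 1.0},   # EDP 2.1
]

best = min(candidates, key=lambda c: edp(c["energy"], c["delay"]))
```

In this toy example the largest design wins on EDP despite its slightly higher energy, because its delay reduction outweighs the energy increase; the area cost is not part of the metric and has to be weighed separately.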

Fig. 21d to Fig. 21f show that for architectures with the same number of PEs, the larger the on-chip memory size (global buffer and register files), the lower the energy consumption. However, for architectures with different numbers of PEs but the same memory size, the change in energy consumption is less pronounced. This indicates that the memory configuration is critical for the energy efficiency of the hardware. Fig. 21g to Fig. 21i show the throughput for different architectures. For architectures with the same number of PEs, the processing time fluctuates as the memory size increases. This is because our design goal is the lowest EDP, so TRIM traded off processing time against energy to achieve it. Comparing the architectures with different numbers of PEs, we notice that the number of PEs is the key to achieving better performance. In the design space, the slowest architecture with 1024 PEs is $1.85 \times$ faster than the fastest architecture with 512 PEs.

It is important to emphasize that an architecture's performance and energy efficiency do not rely only on the hardware resources available, but also on the mapping utilized. In our experiments, TRIM explored the mapping space and picked the mapping that gave the lowest EDP for a given architecture. Simply scaling up the number of PEs without selecting a proper mapping would not significantly increase performance and energy efficiency.

Fig. 22 shows the number of active PEs for an architecture with 1024 PEs, a 1 MB global buffer, and different register file sizes per PE. These results are for training AlexNet-Cifar. The number of active PEs differs across architectures and network layers, which indicates that different mappings are used for specific architectures and network layers. Even when the same number of PEs is active, the mappings may differ in terms of memory usage. TRIM automatically explores the mapping space and finds the optimal approach to process the benchmark for different architectures based on the design goal.

Fig. 21 shows the design space exploration results for the AlexNet-Cifar benchmark. We have done a similar set of explorations for two other benchmarks: AlexNet-IM and ResNet-IM. In our design space explorations, we notice that we can achieve a smaller energy-delay product with more hardware resources. For instance, increasing the number of PEs produced significant speedups, while increasing the on-chip memory capacity allowed better energy efficiency. However, more hardware resources also mean a larger chip size. TRIM can quickly explore different hardware architectures and mappings to generate accurate performance and energy estimates. This allows an architect to make hardware tradeoff decisions in the early stages of design.

#### IX. CONCLUSION

We propose TRIM, an infrastructure for modeling and exploring the design space of DNN accelerators for both inference and training tasks. TRIM considers both intra-layer and inter-layer workloads to generate estimates of performance, energy, and area. By considering inter-layer dependencies, TRIM can ensure that the proposed hardware architectures are capable of processing all operations required for inference and training of the DNN. In addition, the results are more reliable, as they include the memory resources needed for caching the inter-layer data and the energy consumption of the inter-layer operations and data movements. We validated TRIM with several FPGA designs and an ASIC-based inference accelerator (Eyeriss). We then demonstrated the usage of TRIM through case studies.

Fig. 21: Design space exploration results for AlexNet-Cifar with the design goal set as the lowest EDP.

Fig. 22: Active PE heatmap of the 1024 PE architecture for AlexNet-Cifar (1024 KB global buffer)

Future work includes extending TRIM to explore more state-of-the-art deep networks, such as Transformer [25] and EfficientNet [26]. TRIM currently supports fixed-precision quantization techniques. More quantization techniques that require co-designed hardware, such as mixed-precision quantization [27], will also be added in the future. In recent years, neural architecture search (NAS) has used machine learning models to design the network architecture and to trade off accuracy against efficiency. We will explore the possibility of combining TRIM with NAS to enable deep network and hardware co-design.

Overall, TRIM is a powerful tool for exploring the pros and cons in the hardware design space of DNN training accelerators for both FPGA and ASIC design. To the best of our knowledge, TRIM is also the first model which can explore the design space of training and inference DNN architectures.

#### References

- [1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," *Nature*, vol. 521, pp. 436–444, 5 2015. [Online]. Available: http://www.nature.com/articles/nature14539
- [2] C. Li, "Openai's gpt-3 language model: A technical overview," 2020. [Online]. Available: https://lambdalabs.com/blog/demystifying-gpt-3/
- [3] A. Hard, K. Rao, R. Mathews, S. Ramaswamy, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, and D. Ramage, "Federated learning for mobile keyboard prediction," *arXiv*, 11 2018. [Online]. Available: http://arxiv.org/abs/1811.03604
- [4] N. Kukreja, A. Shilova, O. Beaumont, J. Huckelheim, N. Ferrier, P. Hovland, and G. Gorman, "Training on the edge: The why and the how." IEEE, 5 2019, pp. 899–903. [Online]. Available: https://ieeexplore.ieee.org/document/8778327/
- [5] H. Su, W. Qi, Y. Hu, H. R. Karimi, G. Ferrigno, and E. D. Momi, "An incremental learning framework for human-like redundancy optimization of anthropomorphic manipulators," *IEEE Transactions on Industrial Informatics*, pp. 1–1, 2020. [Online]. Available: https://ieeexplore.ieee.org/document/9252139/
- [6] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, "Flexflow: A flexible dataflow accelerator architecture for convolutional neural networks." IEEE, 2 2017, pp. 553–564. [Online]. Available: http://ieeexplore.ieee.org/document/7920855/
- [7] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernandez-Lobato, G.-Y. Wei, and D. Brooks, "Minerva: Enabling low-power, highly-accurate deep neural network accelerators." IEEE, 6 2016, pp. 267–278. [Online]. Available: https://ieeexplore.ieee.org/document/7551399/
- [8] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "Eie: Efficient inference engine on compressed deep neural network," vol. 16. IEEE, 6 2016, pp. 243–254. [Online]. Available: http://ieeexplore.ieee.org/document/7551397/
- [9] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," *IEEE Journal of Solid-State Circuits*, vol. 52, pp. 127–138, 1 2017. [Online]. Available: http://ieeexplore.ieee.org/document/7738524/
- [10] T. Luo, S. Liu, L. Li, Y. Wang, S. Zhang, T. Chen, Z. Xu, O. Temam, and Y. Chen, "Dadiannao: A neural network supercomputer," *IEEE Transactions on Computers*, vol. 66, pp. 73–88, 1 2017. [Online]. Available: http://ieeexplore.ieee.org/document/7480791/
- [11] F. Schuiki, M. Schaffner, F. K. Gurkaynak, and L. Benini, "A scalable near-memory architecture for training deep neural networks on large in-memory datasets," *IEEE Transactions on Computers*, vol. 68, pp. 484–497, 4 2019. [Online]. Available: https://ieeexplore.ieee.org/document/8502059/
- [12] A. Parashar, P. Raina, Y. S. Shao, Y.-H. Chen, V. A. Ying, A. Mukkara, R. Venkatesan, B. Khailany, S. W. Keckler, and J. Emer, "Timeloop: A systematic approach to dnn accelerator evaluation." IEEE, 3 2019, pp. 304–315. [Online]. Available: https://ieeexplore.ieee.org/document/8695666/
- [13] H. Kwon, P. Chatarasi, V. Sarkar, T. Krishna, M. Pellauer, and A. Parashar, "Maestro: A data-centric approach to understand reuse, performance, and hardware cost of dnn mappings," *IEEE Micro*, vol. 40, pp. 20–29, 5 2020.
- [14] P. Xu, X. Zhang, C. Hao, Y. Zhao, Y. Zhang, Y. Wang, C. Li, Z. Guan, D. Chen, and Y. Lin, "Autodnnchip: An automated dnn chip predictor and builder for both fpgas and asics," *FPGA 2020 - 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, pp. 40–50, 2 2020. [Online]. Available: https://doi.org/10.1145/3373087.3375306
- [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," *Communications of the ACM*, vol. 60, pp. 84–90, 5 2017. [Online]. Available: https://dl.acm.org/doi/10.1145/3065386
- [16] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, "Cacti 6.0: A tool to model large caches," *HP laboratories*, vol. 27, p. 28, 2009.
- [17] S. Venkataramani, A. Ranjan, S. Banerjee, D. Das, S. Avancha, A. Jagannathan, A. Durg, D. Nagaraj, B. Kaul, P. Dubey *et al.*, "Scaledeep: A scalable compute architecture for learning and evaluating deep networks," in *Proceedings of the 44th Annual International Symposium on Computer Architecture*, 2017, pp. 13–26.
- [18] Y. N. Wu, J. S. Emer, and V. Sze, "Accelergy: An architecture-level energy estimation methodology for accelerator designs," vol. 2019-Novem. IEEE, 11 2019, pp. 1–8. [Online]. Available: https://ieeexplore.ieee.org/document/8942149/

- [19] Z. Luo, "Using alexnet train cifair-10," 2017. [Online]. Available: https://github.com/icpm/pytorch-cifar10/blob/master/models/AlexNet.py
- [20] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 9 2014. [Online]. Available: http://arxiv.org/abs/1409.1556
- [21] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," vol. 2016-Decem. IEEE, 6 2016, pp. 770–778. [Online]. Available: http://ieeexplore.ieee.org/document/7780459/
- [22] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," 2009, CIFAR-10 and CIFAR-100 datasets.
- [23] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks." IEEE, 6 2016, pp. 367–379. [Online]. Available: http://ieeexplore.ieee.org/document/7551407/
- [24] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "Mobilenets: Efficient convolutional neural networks for mobile vision applications," *arXiv*, 4 2017. [Online]. Available: http://arxiv.org/abs/1704.04861
- [25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," *NIPS*, pp. 6000–6010, 2017.
- [26] M. Tan and Q. Le, "Efficientnet: Rethinking model scaling for convolutional neural networks." PMLR, 5 2019, pp. 6105–6114. [Online]. Available: https://proceedings.mlr.press/v97/tan19a.html
- [27] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu, "Mixed precision training," 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings, 10 2017. [Online]. Available: https://arxiv.org/abs/1710.03740v3



Yangjie Qi (S'14) received the B.S. degree (2012) from the Department of Electrical Engineering, Anhui University, Hefei, China, and the M.S. degree (2015) from the Department of Electrical and Computer Engineering, University of Dayton, Dayton, OH. He is currently pursuing his Ph.D. at the University of Dayton. His Ph.D. research focuses on low-power, high-performance architectures for deep learning. He is a student member of the IEEE.



Shuo Zhang (S'21) received the B.E. degree (2014) from Nanjing University of Science and Technology, Nanjing, China, and the M.S. degree (2016) from the University of Dayton, Dayton, OH. He is currently working on his Ph.D. in electrical engineering at the University of Dayton. His research interests include the design and analysis of hardware architectures for deep learning and applications on the hardware. He is a student member of the IEEE.



**Tarek M. Taha** (S'96–M'03) is a Professor in the Electrical and Computer Engineering Department at the University of Dayton. He received the BS degree from DePauw University, Greencastle, Indiana, in 1996 and the BSEE, MSEE, and PhD degrees in electrical engineering from the Georgia Institute of Technology, Atlanta, in 1996, 1998, and 2002, respectively. His research interests include cognitive computing architectures and high performance modeling. He received the NSF CAREER Award in 2007. His research is supported by multiple agencies and companies including the National Science Foundation, the Air Force Research Laboratory, and the National Aeronautics and Space Administration. He is a member of the IEEE and the IEEE Computer Society.