3.1 Overview
A Tensor Slice is to deep learning what a DSP slice is to digital signal processing. DSP slices support the most common DSP operations, like the MAC operation, along with additions and multiplications. Similarly, Tensor Slices support the most common machine learning operations, like matrix-matrix multiplication and matrix-vector multiplication, along with element-wise matrix addition, subtraction and multiplication. The matrix-matrix and matrix-vector multiplication operations are pervasive in DL layers like fully-connected, convolution and recurrent layers. Element-wise (also referred to as Eltwise) matrix addition and subtraction are commonly found in layers like normalization, residual add and weight update. Eltwise matrix multiplication is used in layers like dropout. The Tensor Slice also supports bias pre-loading and tiling.
Figure 1 shows a logical block diagram of the Tensor Slice. The slice interfaces with the FPGA interconnect through a connection box (for inputs) and a switch box (for outputs), similar to other blocks on modern FPGAs. The slice has a 50% sparsely populated local input crossbar, which makes the input pins of the slice swappable and hence increases the routability of the FPGA. The total number of input pins (including clock and reset) and output pins on the Tensor Slice are 310 and 298, respectively.
The core of the Tensor Slice is a 2D array of 16 processing elements (PEs) along with control logic. Each PE comprises a multiplier and an adder, which can act as an accumulator when a MAC operation is desired. The control logic comprises input logic, output logic and muxing logic. The input logic sets up the input data correctly (e.g. appropriately delays it for systolic computation) to be processed by the PEs. The output logic selects the output data appropriately from the PEs and shifts it out. The muxing logic selects between the various modes of operation of the slice.
The Tensor Slice supports four precisions natively: 8-bit fixed-point (int8), 16-bit fixed-point (int16), IEEE half-precision (fp16), and Brain Floating Point (bf16) [15]. These are the most commonly used precisions in DL inference and training. In int8 mode, all multiplications happen in int8 but accumulations are done in 32-bit fixed-point (int32). In int16 mode, all multiplications happen in int16 but accumulations are done in 48-bit fixed-point (int48). In the fp16 and bf16 modes, all multiplications happen in fp16 and bf16, respectively, but accumulations are done in IEEE single precision (fp32).
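To make these precision rules concrete, the following minimal Python/numpy sketch mimics the int8 behavior; this is a functional illustration only, not the slice's logic. The int16 and floating-point modes behave analogously, accumulating in int48 and fp32, respectively.

```python
import numpy as np

def mac_int8(acc: np.int32, a: np.int8, b: np.int8) -> np.int32:
    # Widen the operands first so the product and running sum use int32,
    # mirroring the slice's int8-multiply / int32-accumulate behavior.
    return acc + np.int32(a) * np.int32(b)

acc = np.int32(0)
for a, b in [(127, 127), (-128, -128), (100, -100)]:
    acc = mac_int8(acc, np.int8(a), np.int8(b))
print(acc)  # 22513: well outside the int8 range, safely held in int32
```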
There are two primary modes of operation of the Tensor Slice: Tensor mode and Individual PE mode. In the Tensor mode, the slice operates on matrix inputs, whereas in Individual PE mode, it operates on scalar inputs. There are five sub-modes of the Tensor mode: Matrix-Matrix Multiplication, Matrix-Vector Multiplication, Eltwise Addition, Eltwise Subtraction, and Eltwise Multiplication. There are two sub-modes of the Individual PE mode: Multiplier and MAC. All the modes and sub-modes supported by the Tensor Slice are shown in Figure 2. The mode of operation of the slice is dynamically selectable. That is, the mode bits can be changed at run-time without requiring reconfiguration of the FPGA.
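For reference, the control-pin encodings described in this section and detailed in Sections 3.3 and 3.4 can be summarized as follows. This is a reader-side summary in Python constants, not part of the design itself:

```python
# Summary of the control-pin encodings used throughout this section.
MODE_TENSOR = 0          # 'mode' pin: Tensor mode
MODE_INDIVIDUAL_PE = 1   # 'mode' pin: Individual PE mode

DTYPE = {"int8": 0b00, "int16": 0b01, "fp16": 0b10, "bf16": 0b11}

OP = {  # Tensor-mode sub-modes, selected by the 'op' pins (Section 3.3)
    "matrix_matrix_mult": 0b000,
    "eltwise_mult":       0b001,
    "eltwise_add":        0b010,
    "eltwise_sub":        0b011,
    "matrix_vector_mult": 0b100,
}
```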
3.2 Processing Element
For this section, we refer to each processing element (PE) in the Tensor Slice as a physical PE. We refer to the logic/circuitry required to process one matrix element as a logical or functional PE. There are 16 physical PEs in the slice. In 16-bit precision modes, the slice needs to process 16 matrix elements, so there is a one-to-one correspondence between a physical PE in the slice and a logical PE required in 16-bit precision modes. For example, logical PE00 is physical PE00 of the slice, logical PE01 is physical PE01 of the slice, and so on up to PE33. However, in 8-bit precision mode, the slice processes 64 matrix elements, so it needs 64 logical PEs. Because of hardware sharing, each physical PE in the slice acts as 4 logical 8-bit PEs. So, physical PE00 in the slice maps to logical 8-bit PE00, PE01, PE02 and PE03; physical PE01 maps to logical 8-bit PE04, PE05, PE06 and PE07; and so on. Figure 3 shows the diagram of one physical processing element (PE) configured for 8-bit precision operation (int8) as 4 logical PEs, and for 16-bit precision operation (int16/fp16/bf16) as 1 logical PE.
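The physical-to-logical mapping can be stated compactly. The sketch below is illustrative Python, using flat indices (physical PE00 = 0 through PE33 = 15, logical 8-bit PEs numbered 0 through 63):

```python
def logical_pes(phys_pe: int, dtype: str) -> list[int]:
    """Logical PE indices implemented by physical PE `phys_pe` (0..15).

    16-bit modes map one-to-one; in int8 mode each physical PE acts as four
    logical 8-bit PEs (physical PE 0 -> logical 0..3, PE 1 -> 4..7, ...).
    """
    return [4 * phys_pe + k for k in range(4)] if dtype == "int8" else [phys_pe]

assert logical_pes(0, "int8") == [0, 1, 2, 3]
assert logical_pes(1, "int8") == [4, 5, 6, 7]
assert logical_pes(5, "fp16") == [5]
```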
Each PE consists of registers for shifting input data and a MAC. The MAC is shown in Figure 4. The figure also shows the multiplexing in the MAC required for the Individual PE mode. Logically, the MAC consists of a multiplier and an adder. But to enable hardware sharing between the integer and floating-point modes, the MAC contains multiple small adders and multipliers which are combined to form larger adders and multipliers, along with multiplexing logic, floating-point logic (aligning, normalizing, rounding, etc.) and pipelining registers. There are four 8-bit multipliers and sixteen 8-bit adders in the MAC block.
When operating in the int8 mode (Figure 5(a)), four int8 multiplications and four int32 additions are required. The four 8-bit multipliers are used directly, and the four int32 additions are performed by combining the 8-bit adders. When operating in the int16 mode (Figure 5(b)), one int16 multiplication and one int48 addition are required. The multiplication uses the four 8-bit multipliers along with ten 8-bit adders to add the partial sums. The int48 addition is performed by combining six 8-bit adders.
In the floating-point modes, the floating-point logic reuses the 8-bit multipliers and 8-bit adders as required. In fp16 mode (Figure 5(c)), one fp16 multiplication and one fp32 addition are required. The fp16 multiplication logic needs an 11-bit multiplication (for the mantissas), for which it uses the four 8-bit multipliers and eight 8-bit adders (to add partial sums). It also needs a 5-bit addition (for the exponents), for which it uses one 8-bit adder. In bf16 mode (Figure 5(d)), one bf16 multiplication and one fp32 addition are required. The bf16 multiplication logic needs an 8-bit multiplication (for the mantissas), for which it uses one 8-bit multiplier. It also needs an 8-bit addition (for the exponents), for which it uses one 8-bit adder. The fp32 addition (required by both the fp16 and bf16 modes) uses the same hardware in both modes. Our implementation of the fp32 adder needs one 24-bit addition and three 8-bit additions across its various stages; for this, it uses six 8-bit adders. Some 8-bit adders stay unused in the floating-point modes.
3.3 Tensor Mode
The I/O (input/output) pins on the Tensor Slice in Tensor mode are shown in Table 1. The Tensor mode is configured by setting the mode input to 0. When configured to use int8 precision (dtype = 00), the Tensor Slice acts on 8 \(\times\) 8 matrix operands and generates an 8 \(\times\) 8 matrix result. In int16 (dtype = 01), fp16 (dtype = 10) and bf16 (dtype = 11) precisions, the Tensor Slice acts on 4 \(\times\) 4 matrix operands and generates a 4 \(\times\) 4 matrix result. We observed that with these operand sizes, the I/O pins of the Tensor Slice are fully utilized in each mode. Also, there are ample opportunities to share hardware between the 4 \(\times\) 4 fp16/bf16/int16 and the 8 \(\times\) 8 int8 arrays of processing elements.
The Tensor Slice performs a tensor operation over multiple clock cycles. The start input is asserted to begin the operation. The input matrices/vector would typically be stored in RAM blocks, with some control logic implemented in soft logic reading the RAM blocks to feed the inputs to the slice. Alternatively, inputs may be generated by some upstream logic (e.g. hardware for the previous layer of a neural network) and fed directly into the slice without being stored in a RAM block. As the input matrices/vector are fed into the slice, control logic inside the slice orchestrates the data and applies the right data elements at the right time to specific PEs. When the output data is available in the PEs, it is sent out on c_data and flags. If out_ctrl is 0, the output data is automatically shifted out cycle-by-cycle as it becomes ready; the user can instead control when to shift it out by setting out_ctrl to 1. The output c_data_available is asserted when valid output data is present on c_data and flags. flags contains the logical OR of the exception flags from the PEs in a column and is only valid for floating-point precisions. The output data can be stored in a RAM block, or directly fed to downstream logic (e.g. hardware for the next layer of a neural network) as it is generated by the slice. When the entire operation is done, the slice asserts the done signal.
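As an illustration of this sequencing, here is a sketch of how soft logic might drive the slice. It is written against a hypothetical cycle-level model `ts` whose attribute names mirror the pins described above; `ts.step()` is an assumed clock-advance method, not part of the design:

```python
# `ts` is an assumed cycle-accurate model object whose attributes mirror the
# slice's pins; ts.step() advances one clock. Illustrative only.
def run_tensor_op(ts, a_columns, b_rows):
    ts.out_ctrl = 0                # let results shift out as soon as ready
    ts.start = 1                   # begin the tensor operation
    results, cycle = [], 0
    while not ts.done:             # 'done' marks the end of the operation
        if cycle < len(a_columns):
            ts.a_data = a_columns[cycle]  # one column of matrix A per cycle
            ts.b_data = b_rows[cycle]     # one row of matrix B per cycle
        ts.step()
        if ts.c_data_available:    # c_data / flags hold valid outputs
            results.append((ts.c_data, ts.flags))
        cycle += 1
    return results
```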
Although the sizes of the matrix operations performed by the Tensor Slice are 4 \(\times\) 4 and 8 \(\times\) 8, Tensor Slices can be chained to perform larger matrix operations; Section 3.3.4 provides details about this. Similarly, the Tensor Slice can support non-square inputs as well. For this purpose, there are validity masks for the inputs, provided through the valid_mask_a_rows, valid_mask_a_cols_b_rows and valid_mask_b_cols pins on the slice. For example, when multiplying a 6 \(\times\) 4 matrix with a 4 \(\times\) 7 matrix in int8 mode, the values of these inputs can be 8'b0011_1111, 8'b0000_1111 and 8'b0111_1111, respectively.
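Each mask simply has one bit per row or column, with the low bits set for the valid ones. A small helper reproducing the example above (assuming 8-bit-wide masks, as in the int8 case):

```python
def validity_mask(n_valid: int) -> str:
    """8-bit mask with one bit per row/column; the low n_valid bits are set."""
    return "8'b" + format((1 << n_valid) - 1, "09_b")

# Reproduces the 6x4 times 4x7 int8 example above:
assert validity_mask(6) == "8'b0011_1111"   # valid_mask_a_rows
assert validity_mask(4) == "8'b0000_1111"   # valid_mask_a_cols_b_rows
assert validity_mask(7) == "8'b0111_1111"   # valid_mask_b_cols
```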
In Tensor mode, bias and tiling support can be enabled. For bias (controlled using preload), the Tensor Slice supports pre-loading the PEs with an input matrix, which is effectively added to the result of the subsequent matrix operation. For tiling (controlled using accumulate), the Tensor Slice supports not resetting the results in the PEs before starting another operation. This is useful for tiled or blocked matrix multiplications, where the partial sums need to be accumulated across tiles or blocks; a sketch of both features follows.
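The two features compose naturally. Here is a functional numpy sketch (fp16 operands with fp32 accumulation, assumed slice-sized 4 \(\times\) 4 tiles) of a bias-preloaded, K-tiled multiplication as the slice would see it across successive operations:

```python
import numpy as np

def tiled_matmul_with_bias(A, B, bias, tile=4):
    acc = bias.astype(np.float32)          # preload: PEs start from the bias
    for k in range(0, A.shape[1], tile):   # one slice-sized tile per operation
        # accumulate=1 for each tile: results in the PEs are not reset
        acc += A[:, k:k+tile].astype(np.float32) @ B[k:k+tile, :].astype(np.float32)
    return acc

A = np.random.rand(4, 8).astype(np.float16)
B = np.random.rand(8, 4).astype(np.float16)
bias = np.random.rand(4, 4).astype(np.float16)
ref = bias.astype(np.float32) + A.astype(np.float32) @ B.astype(np.float32)
assert np.allclose(tiled_matmul_with_bias(A, B, bias), ref, rtol=1e-4)
```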
3.3.1 Matrix-Matrix Multiplication Mode.
The matrix-matrix multiplication mode is enabled when op = 000. The matrix-matrix multiplication operation in the Tensor Slice is done systolically. Only the PEs along the left column and the top row of the 2D PE array receive external data; the other PEs receive data from neighboring PEs. The elements of the first input matrix (matrix A) move from left to right, and the elements of the second input matrix (matrix B) move from top to bottom. The result is calculated during the shifting process, and it stays in the respective PE until its computation is done. After that, the resulting matrix (matrix C) is shifted out column-wise, left to right, in a pipelined fashion. While the results are being shifted out, another tensor operation can be started on the Tensor Slice.
Elements of one operand matrix are applied column-wise on the input a_data. Elements of the second operand matrix are applied row-wise on the input b_data. In one cycle, 8 int8 elements or 4 int16/fp16/bf16 elements are applied on a_data, and the same number of elements are applied on b_data. The output data is available on c_data and flags. In one cycle, results from one column of PEs are shifted out (see Section 3.3.5 for more details). Only 128 bits of c_data and 4 bits of flags are used in this mode.
Figure 6(a) shows the systolic setup of data from matrix A (left-to-right); see the path from a_data to the PEs through the flip-flops and the A-mux. The muxing required for chaining (the A-mux), which selects between a_data and a_data_in, is discussed later in Section 3.3.4. Figure 6(b) shows the same logic but for data from matrix B (top-to-bottom). Figure 6(c) shows the movement of matrix A elements (in red) and matrix B elements (in yellow) through the PEs. Figure 6(d) shows the shifting out of the results (i.e. the data for matrix C).
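For intuition, the dataflow of Figure 6 can be captured in a small reference model. This is a behavioral sketch, not the slice's implementation: the time-skewed edge feeds stand in for the input-delay flip-flops, each PE holds one A value, one B value and its own C accumulator, and the shifting out of results is omitted:

```python
import numpy as np

def systolic_matmul(A, B):
    """Output-stationary systolic model of an n x n PE array (n = 4 or 8)."""
    n = A.shape[0]
    a_reg = np.zeros((n, n), dtype=np.int64)  # A value currently in each PE
    b_reg = np.zeros((n, n), dtype=np.int64)  # B value currently in each PE
    C = np.zeros((n, n), dtype=np.int64)      # each PE accumulates its own C
    for t in range(3 * n - 2):                # enough cycles to drain the array
        for i in reversed(range(n)):          # high-to-low so each PE reads its
            for j in reversed(range(n)):      # neighbor's value from last cycle
                a_in = A[i, t - i] if 0 <= t - i < n else 0  # skewed left feed
                b_in = B[t - j, j] if 0 <= t - j < n else 0  # skewed top feed
                a_reg[i, j] = a_reg[i, j - 1] if j > 0 else a_in
                b_reg[i, j] = b_reg[i - 1, j] if i > 0 else b_in
        C += a_reg * b_reg                    # one MAC per PE per cycle
    return C

A = np.random.randint(-128, 128, (4, 4))
B = np.random.randint(-128, 128, (4, 4))
assert (systolic_matmul(A, B) == A @ B).all()
```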
The matrix-matrix multiplication operation in Tensor mode is the most compute-intensive operation performed by the Tensor Slice. When using 16-bit precisions (int16, fp16, bf16), the slice performs 16 MAC operations per cycle, so its math throughput is 16 MACs/clock. When using 8-bit precision (int8), the slice's math throughput is 64 MACs/clock. To stay fed with data, the slice reads 8 16-bit elements every clock cycle in 16-bit precision modes and 16 int8 elements every clock cycle in int8 mode. So, the on-chip memory bandwidth requirement of the Tensor Slice is 16 bytes/clock.
3.3.2 Matrix-Vector Multiplication (matvec) Mode.
The matrix-vector multiplication mode is enabled when op = 100. The matrix-vector multiplication operation in the Tensor Slice is also done systolically. The elements of the matrix move from left to right, and the elements of the vector move from top to bottom. The result is calculated during the shifting process, and it stays in the respective PE until its computation is done. After that, the resulting vector is shifted out column-wise in one cycle.
Elements of the input matrix are applied column-wise on the input a_data. Elements of the input vector are applied to b_data. Note that only 8 bits of b_data are used for int8 precision, and 16 bits of b_data are used for int16/fp16/bf16 precisions. Since a vector has only one column, only the PEs in one column of the 2D PE array are utilized. We identify an opportunity to improve the utilization of the PEs by observing that many I/O pins on the Tensor Slice are required only in matrix-matrix mode and are unused in matrix-vector mode (and in the eltwise modes as well). We add multiplexers in front of the PEs in the third column and expose them through these already-existing, otherwise-unused I/O pins, so that these PEs can also be loaded directly from outside (instead of getting data from the PEs to their left). Through this set of wires (called second_a_data), we can now feed another matrix in matrix-vector mode. This is shown in Figure 7. This is a slight deviation from a pure systolic design, in which only the PEs on the periphery read/write data from outside. However, the overhead of adding this feature is low, and the utilization of the slice doubles in matrix-vector mode. More multiplexers could be added to PEs in other columns and rows to further increase the utilization of the Tensor Slice in matrix-vector mode. However, this would require new I/O pins to be added to the Tensor Slice. I/O pins on a block in the FPGA fabric are costly in terms of area (a larger local crossbar) and routing (more congestion). Adding multiplexers also increases the combinational delays of timing paths going through them. So, the cost-benefit tradeoff needs to be studied carefully before adding multiplexers to more columns and rows.
The second vector can be fed from the bits of b_data that are unused in this mode. With this enhancement, two independent matrix-vector products can be calculated at the same time in the slice. Some other unused I/Os are used for validity masks and for reading out the output results. The following is the mapping of I/O pins used for reading a second matrix and a second vector, and for outputting a second result, in the Matrix-Vector Multiplication mode:
• second_a_data is mapped to a_data_in
• second_valid_mask_a_rows is mapped to valid_mask_b_cols
• second_valid_mask_a_cols_b_rows is mapped to b_data[23:16]
• num_rows_matrix is mapped to final_op_size
• num_cols_matrix is mapped to b_data[31:24]
• second_b_data is mapped to b_data[47:32]
• second_c_data is mapped to {c_data[159:128], b_data_out[63:48], b_data_out[31:16], a_data_out[63:0]}
• second_flags is mapped to flags[7:4]
num_rows_matrix specifies the number of rows of the matrix, and num_cols_matrix specifies the number of columns of the matrix (and hence the number of elements in the vector). These are used inside the Tensor Slice to determine, from the elapsed cycles, when to start shifting out the results and when to assert the done signal.
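Functionally, the enhanced matrix-vector mode computes two independent products at once. As a sketch (numpy, fp16 operands with fp32 accumulation, pin names as described above):

```python
import numpy as np

# The two products below proceed concurrently in the slice: the first through
# PE column 1 (fed on a_data / b_data), the second through the multiplexed PE
# column 3 (fed on second_a_data / second_b_data). Functional view only.
M1 = np.random.rand(4, 4).astype(np.float16)  # matrix 1, fed on a_data
v1 = np.random.rand(4).astype(np.float16)     # vector 1, fed on b_data[15:0]
M2 = np.random.rand(4, 4).astype(np.float16)  # matrix 2, fed on second_a_data
v2 = np.random.rand(4).astype(np.float16)     # vector 2, fed on second_b_data

r1 = M1.astype(np.float32) @ v1.astype(np.float32)  # read out on c_data
r2 = M2.astype(np.float32) @ v2.astype(np.float32)  # read out on second_c_data
```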
In matrix-vector multiplication mode, when using 16-bit precisions, the slice performs eight MAC operations in one cycle (four in column 1 and four in column 3), so its math throughput is 8 MACs/clock. When using 8-bit precision, the slice's math throughput is 16 MACs/clock. To stay fed with data, the slice reads 10 16-bit elements every clock cycle in 16-bit precision modes, for an on-chip memory bandwidth requirement of 20 bytes/clock. Similarly, it reads 18 int8 elements every clock cycle in int8 mode, for an on-chip memory bandwidth requirement of 18 bytes/clock.
3.3.3 Eltwise Modes.
Element-wise matrix operations are supported by the slice as well, and are performed by selecting the appropriate settings of the op pins (op = 001 => eltwise multiplication; op = 010 => eltwise addition; op = 011 => eltwise subtraction). For the eltwise operations, the elements of the first matrix move from left to right and the elements of the second matrix move from top to bottom. The result calculation happens after all inputs have reached their respective locations in the PE array. We observe that this method of moving data through the PEs increases the number of cycles required for an eltwise operation.
The enhancement used in matrix-vector mode to increase the utilization of the PEs can be extended to reduce the cycles required in eltwise mode by 2\(\times\). Instead of only feeding data into the 2D PE array from the left column and top row, additional PEs internal to the array can be fed, without adding any extra I/O cost. For matrix-vector multiplication mode, the third column was exposed on existing I/Os; in eltwise mode, the third row is exposed as well. This enables loading two columns of matrix A and two rows of matrix B at the same time, doubling the loading speed without adding any I/Os. The cost is a few multiplexers. I/Os unused in matrix-matrix multiplication mode are used for reading out the output results. This is also shown in Figure 7. The following list shows the mapping of I/O pins used for feeding the additional inputs and reading out the additional results in the eltwise modes:
• second_a_data is mapped to a_data_in
• second_b_data is mapped to b_data_in
• second_c_data is mapped to a_data_out[63:0]
• second_flags is mapped to flags[7:4]
3.3.4 Chaining.
Multiple Tensor Slices can be chained to perform operations on larger matrices. This is useful in matrix-matrix and matrix-vector multiplication operations. Figure 8 shows a logical view of four Tensor Slices chained in the x and y directions to perform a larger matrix-matrix multiplication operation (e.g. an 8 \(\times\) 8 matrix multiplied with an 8 \(\times\) 8 matrix using four slices in fp16 mode). The signals a_data_in and a_data_out are used to chain the inputs from matrix A along the x direction. The signals b_data_in and b_data_out are used to chain the inputs from matrix B along the y direction. Only the Tensor Slices at the periphery are fed inputs; inputs flow through the slices via the chains. The c_data signal contains the output of the Tensor Slice. It can be chained with the outputs of neighboring Tensor Slices using soft logic, or directly consumed from each Tensor Slice block, depending on the requirements of the user's design.
Note that the figure shows the logical connectivity of the slices in the x (horizontal) and y (vertical) directions. Physically, these slices can be anywhere on the FPGA; for example, four Tensor Slices in one grid column of the FPGA could be connected to perform a larger matrix operation. The inputs x_loc and y_loc are used to specify the logical location of each slice. Note that x_loc and y_loc neither determine nor are related to the physical location of a slice in the FPGA grid. These signals are decoded internally to select the correct input port(s) whose data should feed the PEs. For example, the top-left slice in the logical grid of slices has {x_loc, y_loc} = 00, implying that this slice should use the input data received on its a_data and b_data ports. The bottom-right slice in the logical grid has {x_loc, y_loc} = 11, implying that this slice should use the inputs received on a_data_in and b_data_in from its logical neighbors to the left and top, respectively. Not only do different slices in a logical grid receive data from different ports, they also receive data at different times. For example, the inputs going into the slice with {x_loc, y_loc} = 11 are delayed with respect to the inputs going into the slice with {x_loc, y_loc} = 00. x_loc and y_loc are also used by the control logic in the slice to sample the incoming data at the appropriate time.
The input final_op_size is used to specify the overall size of the matrix operation being performed. For the example shown in Figure 8, assuming int16 operation, final_op_size is set to 8, because four slices are connected together and each slice performs a 4 \(\times\) 4 matrix operation. This signal is used by the control logic in the slice to determine when the computation is finished and when to start shifting out the result.
Consider a matrix-matrix multiplication where an M \(\times\) K matrix is multiplied with a K \(\times\) N matrix. For large values of M, the Tensor Slices are chained in the y (logically vertical) direction. For large values of N, the Tensor Slices are chained in the x (logically horizontal) direction. For large values of K, instead of chaining, a larger number of cycles is typically used to accumulate the results. In other words, the M and N dimensions are handled by using more hardware (more “space”), whereas the K dimension is handled by using more cycles (more “time”). An advantage of mapping the K dimension onto “time” is that the extended-precision intermediate results never need to move. The same concept applies to matrix-vector multiplication, except that there is no need for chaining in the x (logically horizontal) direction; it only makes sense to chain Tensor Slices in the y (logically vertical) direction.
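The space/time mapping can be summarized with a short sketch (illustrative Python; S is the per-slice operand size, e.g. 4 in fp16 mode). The i and j loops correspond to slices chained in the y and x directions, while the k loop runs on the same hardware over time with accumulate enabled:

```python
import numpy as np

def chained_matmul(A, B, S=4):
    """M x K @ K x N using S x S slice-sized tiles: M, N in space; K in time."""
    M, K = A.shape
    N = B.shape[1]
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, S):          # rows handled by slices chained along y
        for j in range(0, N, S):      # cols handled by slices chained along x
            for k in range(0, K, S):  # K handled by accumulating over cycles
                C[i:i+S, j:j+S] += (A[i:i+S, k:k+S].astype(np.float32)
                                    @ B[k:k+S, j:j+S].astype(np.float32))
    return C

# Figure 8's example: an 8 x 8 by 8 x 8 product on a 2 x 2 logical grid of
# slices, with K accumulated in place (final_op_size = 8).
A = np.random.rand(8, 8).astype(np.float16)
B = np.random.rand(8, 8).astype(np.float16)
assert np.allclose(chained_matmul(A, B),
                   A.astype(np.float32) @ B.astype(np.float32), rtol=1e-4)
```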
3.3.5 Rounding.
As mentioned above, accumulations in the Tensor Slice are done at a higher precision than the multiplications. In other words, the results have higher precision than the operands. When no_rounding is set to 1, the outputs are shifted out in the higher precision without being rounded to the input precision. But when no_rounding is set to 0 by the user, the outputs are rounded to the input precision before being shifted out. Convergent rounding, or “round half to even” [31], is used to round the results. When rounded results are shifted out, fewer cycles may be needed, depending on the precision. For example, in the matrix-matrix multiplication int8 mode, if rounding is disabled, the output from 8 PEs (one column of PEs for int8 precision) is 8 * 32 = 256 bits. The c_data signal is 128 bits wide, so it takes 16 cycles to shift out the data of all 8 columns. However, if rounding is enabled, the output from 8 PEs is 8 * 8 = 64 bits, so it takes 8 cycles to shift out the data of all 8 columns. In the matrix-matrix multiplication fp16 mode, if rounding is disabled, the output from 4 PEs (one column of PEs for fp16 precision) is 4 * 32 = 128 bits, so it takes 4 cycles to shift out the data of all 4 columns. If rounding is enabled, the output from 4 PEs is 4 * 16 = 64 bits, so in this case as well, it takes 4 cycles to shift out the data of all 4 columns.
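For the fixed-point modes, the rounding step amounts to dropping LSBs with ties broken toward even. A small sketch of generic convergent rounding on Python integers (ignoring the slice's exact output field selection and any saturation):

```python
def round_half_to_even(value: int, drop_bits: int) -> int:
    """Drop `drop_bits` LSBs from a fixed-point value, rounding ties to even."""
    if drop_bits == 0:
        return value
    floor = value >> drop_bits                 # arithmetic shift: works for < 0
    rem = value & ((1 << drop_bits) - 1)       # the bits being discarded
    half = 1 << (drop_bits - 1)
    if rem > half or (rem == half and floor & 1):  # above half, or tie at odd
        floor += 1
    return floor

assert round_half_to_even(0b1010_1000, 4) == 10   # 10.5   ties to even -> 10
assert round_half_to_even(0b1011_1000, 4) == 12   # 11.5   ties to even -> 12
assert round_half_to_even(0b1011_1001, 4) == 12   # 11.5625 rounds up   -> 12
```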
3.4 Individual PE Mode
When the mode input pin is set to 1, the Tensor Slice changes to Individual PE mode. The main goal of providing this mode is to reduce the impact of adding Tensor Slices to an FPGA on non-DL applications and to improve utilization. In this mode, the slice is fractured such that the inputs and outputs of individual PEs are exposed on the pins of the slice, enabling the PEs to be used like mini-DSP slices. Each PE can be separately and dynamically configured into one of two sub-modes: Multiplier or MAC. Furthermore, all four precisions (int8, int16, fp16 and bf16) are available and can be dynamically selected. In int8 mode, each PE can be configured as two 8-bit multipliers or one 8-bit MAC with 32-bit accumulation. In int16 mode, each PE can be configured as one 16-bit multiplier. In fp16 and bf16 modes, each PE can be configured as one 16-bit multiplier, one 16-bit adder, or one 16-bit MAC with fp32 accumulation. Note that because of the large delay to access the PEs in the Tensor Slice (due to the local input crossbar), the Individual PE mode will not be as performant as, for example, a DSP-slice-based multiplication or MAC.
There is a limitation to this mode. The number of inputs and outputs on the slice (the I/O footprint of the slice) is governed by the Tensor mode (310 inputs, including clock and reset, and 298 outputs). Based on that, only 8 of the 16 PEs can be exposed. We could add inputs and outputs to the slice to expose all 16 PEs in Individual PE mode, but that would worsen the I/O footprint of an already large slice. Increasing the number of I/Os can lead to more routing congestion, a higher channel-width requirement, and a larger Tensor Slice area.
The inputs and outputs of an exposed PE are:
• direct_in_a[15:0] and direct_in_b[15:0] (operand inputs)
• direct_mode (Multiplier or MAC)
• direct_dtype[1:0] (int8, int16, fp16 or bf16)
• direct_out[31:0] (result output)
• direct_flags[3:0] (exception flags for floating-point modes)
Each exposed PE's inputs and outputs are mapped onto the top-level inputs and outputs of the slice (shown in Table 1). The full mapping of all inputs and outputs to the various PEs is not significant for this paper, but here is an example of the pin mapping for exposed PE #1:
• direct_in_a[15:0] is mapped to {valid_mask_b_cols, final_op_size}
• direct_in_b[15:0] is mapped to a_data[31:16]
• direct_mode[1:0] is mapped to x_loc[3:2]
• direct_dtype is mapped to accumulate
• direct_out[31:0] is mapped to c_data[31:0]
• direct_flags[3:0] is mapped to b_data_out[7:4]