# Accelerate Three-Dimensional Generative Adversarial Networks Using Fast Algorithm

Ziqi Su<sup>1</sup>, Wendong Mao<sup>1</sup>, Zhongfeng Wang<sup>1</sup>, Jun Lin<sup>1</sup>, Wenqiang Wang<sup>2</sup>, and Haitao Sun<sup>2</sup>

<sup>1</sup>School of Electronic Science and Engineering, Nanjing University, P. R. China

<sup>2</sup>SenseTime Research

Email: {zqsu,wdmao}@smail.nju.edu.cn, {zfwang,jlin}@nju.edu.cn, {wangwenqiang, sunhaitao}@powertensors.ai

Abstract—Three-dimensional generative adversarial networks (3D-GAN) have attracted widespread attention in three-dimensional (3D) visual tasks. 3D deconvolution (DeConv), as an important computation of 3D-GAN, significantly increases computational complexity compared with 2D DeConv, and has become a bottleneck for the acceleration of 3D-GAN. Previous accelerators suffer from several problems, such as large memory requirements and resource underutilization. To handle these issues, a fast algorithm for 3D DeConv (F3DC) is proposed in this paper. F3DC applies a fast algorithm to reduce the number of multiplications and achieves a significant algorithmic strength reduction. Besides, F3DC removes the extra memory requirement for overlapped partial sums and avoids computational imbalance to fully utilize resources. Moreover, we design an F3DC-based hardware architecture, which consists of four fast processing units (FPUs). Each FPU includes a pre-process module, an EWMM module and a post-process module for the F3DC transformation. By implementing our design on the Xilinx VC709 platform for 3D-GAN, we achieve a throughput of up to 1700 GOPS and a 4× computational efficiency improvement compared with prior works.

Index Terms—Three-Dimensional Generative Neural Networks, Deconvolution, Transposed Convolution, Hardware Architecture, Fast Algorithm.

#### I. INTRODUCTION

With the development of deep learning, three-dimensional (3D) deep neural networks have been widely used in many visual fields.
Among them, three-dimensional generative adversarial networks (3D-GAN) have been used for various visual tasks, such as 3D object recognition and reconstruction [1], [2], 3D model generation [1], medical image analysis [3]–[5], and human action recognition [6]. A 3D-GAN usually contains operations such as 3D convolution and 3D deconvolution (DeConv), the latter also called transposed convolution.

Many works [7], [8] have investigated the acceleration of 3D convolution for 3D convolutional neural networks (CNNs). For example, [7] adopted the Winograd algorithm to propose a template-based methodology for 2D and 3D CNNs and designed corresponding accelerators. [8] introduced a fast FFT-based algorithm and an F3D-based hardware architecture for 3D CNNs, which significantly reduced the computational complexity of 3D convolution operations. Even though the basic computation schemes of 3D convolution and 3D DeConv are similar, 3D DeConv needs to insert zeros into the original input feature maps before performing normal convolution. As a result, directly applying the methods of 3D convolution to DeConv leads to severe hardware underutilization.

Besides, some works [9]–[11] proposed efficient solutions to accelerate 2D DeConv. However, 3D DeConv has higher computational complexity than 2D DeConv, which makes it hard to accelerate 3D DeConv with 2D methods. For instance, [9] presented a novel Wino-transCONV dataflow to accelerate 2D DeConv, but 3D DeConv is more complicated than 2D DeConv, which makes it hard to employ this dataflow directly. [10] applied the input-oriented mapping (IOM) method to accelerate 3D DeConv. However, the overlapped partial sums generated by the IOM method grow with the kernel size, which leads to increased computing resources and a complex dataflow. [11] accelerated 3D DeConv by taking advantage of the fixed sparsity pattern of intermediate data tiles, but this resulted in unbalanced computations.

To tackle these problems, we propose a fast algorithm named F3DC to accelerate 3D DeConv. Specifically, the contributions of this paper are summarized as follows.

- By investigating the mathematical formulation of 3D DeConv, we propose an efficient computation method, namely F3DC, for 3D DeConv based on a fast transformation algorithm [12], which greatly reduces the computational complexity and simplifies the computation flow.
- Based on F3DC, we develop an efficient architecture to implement 3D DeConv layers. A fast processing unit and a fast processing array are designed to implement the F3DC transformation and to improve parallelism, respectively.
- The proposed design is implemented on the Xilinx VC709 platform and achieves a computational throughput of 1700 GOPS and up to 4× improvement in computational efficiency compared with prior works.

This work was supported in part by the National Natural Science Foundation of China under Grant 62174084, 62104097 and in part by the High-Level Personnel Project of Jiangsu Province under Grant JSSCBS20210034, the Key Research Plan of Jiangsu Province of China under Grant BE2019003-4. (Corresponding author: Zhongfeng Wang; Jun Lin.)

#### II. BACKGROUND

Compared with 3D convolution, 3D DeConv needs to insert zeros before performing normal convolution. Fig. 1 shows the process of 3D DeConv. Unlike convolution, DeConv is used to expand the input feature maps. For the convolution process, the relationship between the input feature map and the output feature map is given by Eq. (1):

$$o = \left\lfloor \frac{i + 2p - k}{s} \right\rfloor + 1,\tag{1}$$

where $i$ and $o$ represent the size of the input and output feature map, respectively, and $k$ is the kernel size of the convolution. $p$ and $s$ denote the padding and stride.

Fig. 1. Illustration of 3D deconvolution.

As illustrated in Fig. 1, multiple parameters are introduced to describe 3D DeConv. The relationship among these parameters is given by the following Eq.
(2):

$$\begin{aligned}
i_{t} &= o + (s - 1)(o - 1),\\
p_{t} &= \frac{(k - 1) + (k - 2p - 1)}{2} = k - p - 1,\\
k_{t} &= k,\\
s_{t} &= 1,\\
O_{t} &= i = \frac{i_{t} + 2p_{t} - k_{t}}{s_{t}} + 1,
\end{aligned}\tag{2}$$

where $i_t$ represents the size after inner zero insertion, and $p_t$, $k_t$ and $s_t$ denote the padding, kernel size and stride for DeConv, respectively. $O_t$ is the size of the DeConv output. The inserted zeros occupy more than seventy-five percent of the data in 3D DeConv, which results in a large number of invalid operations. Avoiding these invalid operations is necessary to reduce computational complexity.

Some works [9], [10] focused on accelerating transposed convolution. For example, Wang *et al.* [10] introduced the IOM method to accelerate 3D DeConv. However, the overlapped partial sums of the IOM lead to increased computing resources and a complex dataflow. Di *et al.* [9] presented the novel Wino-transCONV dataflow and a corresponding hardware architecture design, but the data rearrangement process becomes more complicated for 3D DeConv. Hence, we propose F3DC to handle the inserted zeros and reduce computational complexity.

#### III. F3DC ALGORITHM

#### A. Computational Procedure

Since 3D DeConv has higher computational complexity than 2D DeConv, it is necessary to exploit fast algorithms to reduce the algorithmic complexity. Hence, we design F3DC, a computationally efficient 3D DeConv algorithm based on the fast transformation algorithm (FTA) [12]. FTA transforms 2D DeConv into matrix multiplications by the following Eq. (3):

$$\mathbf{Y} = \mathbf{A}^{\mathrm{T}} \left[ \left( \mathbf{H} \cdot \mathbf{g} \cdot \mathbf{H}^{\mathrm{T}} \right) \odot \left( \mathbf{P}^{\mathrm{T}} \cdot \mathbf{d} \cdot \mathbf{P} \right) \right] \mathbf{A}, \tag{3}$$

where $\mathbf{g}$ is a $k \times k$ 2D DeConv kernel and $\mathbf{d}$ is an $I_r \times I_r$ 2D input tile. However, 3D DeConv has an extra dimension compared with 2D DeConv.
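To make the size relations of Eq. (1) and Eq. (2) concrete, the following minimal sketch evaluates them for a few cubic feature maps and counts how much of the zero-inserted, padded volume is actually zero. The function name `deconv_zero_insertion_stats` and the example parameters ($k=4$, $s=2$, $p=1$, inputs of side 4 to 32) are illustrative assumptions, not values taken from the authors' implementation.

```python
def deconv_zero_insertion_stats(n, k, s, p):
    """Evaluate Eq. (2) for a cubic DeConv input of side n.

    n plays the role of `o` in Eq. (1) and Eq. (2): the DeConv input is the
    output of the corresponding convolution.
    Returns (i_t, p_t, O_t, zero_fraction).
    """
    i_t = n + (s - 1) * (n - 1)   # side after inserting s-1 zeros between samples
    p_t = k - p - 1               # padding of the equivalent stride-1 convolution
    o_t = i_t + 2 * p_t - k + 1   # Eq. (2) with k_t = k and s_t = 1
    padded = i_t + 2 * p_t        # side of the volume the kernel slides over
    zero_fraction = 1.0 - n**3 / padded**3  # inserted plus padding zeros
    return i_t, p_t, o_t, zero_fraction


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only for illustration.
    for n in (4, 8, 16, 32):
        i_t, p_t, o_t, zeros = deconv_zero_insertion_stats(n, k=4, s=2, p=1)
        print(f"n={n:2d}: i_t={i_t}, p_t={p_t}, O_t={o_t}, zeros={zeros:.1%}")
```

Under these assumptions each listed size doubles the feature-map side, and the share of zeros (inserted plus padding) stays well above 75%, which is the overhead a zero-free method avoids.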
This extra dimension means that 2D methods cannot directly exploit the acceleration potential of 3D DeConv. To tackle this problem, F3DC is developed specifically to accelerate 3D DeConv; it avoids zero insertion and further reduces computational complexity. The computation process of F3DC is shown in Eq. (4). During this process, multiply-and-accumulate (MAC) operations are transformed into element-wise multiplication (EWMM) by the transformation matrices, which reduces the number of multiplications and lowers the computational complexity.

$$\mathbf{Y} = \left\{ \mathbf{A}^{\mathrm{T}} \left\{ \left[ \left( \mathbf{H} \cdot \mathbf{g} \cdot \mathbf{H}^{\mathrm{T}} \right)^{\mathrm{R}} \cdot \mathbf{H}^{\mathrm{T}} \right] \odot \left[ \left( \mathbf{P}^{\mathrm{T}} \cdot \mathbf{d} \cdot \mathbf{P} \right)^{\mathrm{R}} \cdot \mathbf{P} \right] \right\} \mathbf{A} \right\}^{\mathrm{CR}} \mathbf{A}. \tag{4}$$

In Eq. (4), $\odot$ represents EWMM, $\mathbf{g}$ is a $k \times k \times k$ 3D DeConv kernel and $\mathbf{d}$ is an $I_r \times I_r \times I_r$ 3D input tile. $\mathbf{H}$ denotes an $E_r \times k$ matrix to pre-process 3D kernels and $\mathbf{P}^{\mathrm{T}}$ denotes an $E_r \times I_r$ matrix to pre-process input tiles. $\mathbf{A}^{\mathrm{T}}$ represents an $O_r \times E_r$ post-processing matrix to obtain the final output tile, where $I_r = \lceil (k+r \times s-1)/s \rceil$, $E_r = k+(r-1) \times s$, and $O_r = s \times r$. $r$ is the order of the transformation. $\mathrm{R}$ and $\mathrm{CR}$ denote clockwise and counterclockwise rotation, respectively. Fig. 2 shows the whole process of F3DC. The computation of F3DC consists of the following four procedures.

**Pre-process**: As shown in Fig. 2, the DeConv kernel and input tiles are first sliced for matrix multiplication by $\mathbf{H}$ and $\mathbf{P}^{\mathrm{T}}$, and then a 90-degree clockwise rotation about the vertical axis is performed to involve the depth dimension before the slicing and matrix multiplication are repeated.
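For concreteness, the tile sizes defined for Eq. (4) can be evaluated numerically. The sketch below computes $I_r$, $E_r$ and $O_r$ and reports $E_r^3 / O_r^3$, the number of element-wise multiplications per output element, which is the quantity tabulated in Table I and expressed by Eq. (6). The function names `f3dc_tile_sizes` and `mults_per_output` are hypothetical helpers introduced only for this illustration.

```python
def f3dc_tile_sizes(k, s, r):
    """Tile sizes of an r-order F3DC for kernel size k and stride s."""
    i_r = -(-(k + r * s - 1) // s)  # ceil((k + r*s - 1) / s): input tile side
    e_r = k + (r - 1) * s           # side of the transformed cube used in EWMM
    o_r = s * r                     # output tile side
    return i_r, e_r, o_r


def mults_per_output(k, s, r):
    """EWMM multiplications per output element: E_r^3 / O_r^3."""
    _, e_r, o_r = f3dc_tile_sizes(k, s, r)
    return e_r**3 / o_r**3


if __name__ == "__main__":
    # Kernel sizes and stride taken from Table I; r = 3 as in T_3.
    for k in (3, 4, 5, 9):
        i_r, e_r, o_r = f3dc_tile_sizes(k, s=2, r=3)
        ratio = mults_per_output(k, s=2, r=3)
        print(f"k={k}: I_r={i_r}, E_r={e_r}, O_r={o_r}, mults/output={ratio:.3f}")
```

For $k=4$, $s=2$, $r=3$ this gives $I_r=5$, $E_r=8$ and $O_r=6$, matching the $T_3(6^3,4^3)$ example and the $8\times8\times8$ multipliers per EWMM module described in Section IV.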
Through the above process, the DeConv kernel and the input tiles are transformed from $k \times k \times k$ and $I_r \times I_r \times I_r$, respectively, to $E_r \times E_r \times E_r$. Regarding the rotation: because a matrix cannot be multiplied directly with a 3D cube, the cubes need to be sliced for matrix multiplication, which distinguishes the 2D and 3D DeConv computation methods. Besides, 3D DeConv needs to exploit the extra dimension for acceleration compared with 2D computation; as a result, rotation is a necessary procedure for 3D DeConv.

**EWMM**: As presented in Fig. 2, the two transformed cubes undergo EWMM to produce the resulting cube for post-processing. EWMM performs $E_r \times E_r \times E_r$ multiplications for each transformation.

Fig. 2. An example of the F3DC computational procedure, where the DeConv kernel size is 4 and the stride is 2; r represents the order of F3DC.

**Post-process**: As presented in Fig. 2, the results of EWMM are sliced and multiplied with the post-processing matrix $\mathbf{A}^{\mathrm{T}}$, then a 90-degree counterclockwise rotation about the vertical axis is conducted before the slicing and matrix multiplication are repeated. The opposite direction of rotation is due to the 3D DeConv data arrangement. The result size of the post-process is $O_r \times O_r \times O_r$.

**Accumulation and Splicing**: the result tiles are accumulated channel-wise and spliced to form the output feature maps.

The transformation matrices for different kernels can be found in [12]. Here, we use $T_r(o_t^3, k_t^3)$ to denote an r-order F3DC, where $o_t^3$ and $k_t^3$ represent the size of the output and kernel, respectively. Considering that $4\times 4\times 4$ is a common size for 3D DeConv kernels, the transformation matrix of a typical 3-order 3D DeConv $T_3(6^3,4^3)$ is presented in Eq. (5) for clarity.

#### B. Complexity Analysis

Table I presents a comparison between F3DC, the zero-inserting method (ZIM), and the Winograd-based method [11] in terms of algorithmic strength reduction for 3D DeConv. A Winograd-based algorithm is adopted in [11] to compute 3D DeConv on zero-inserted feature maps. That method only removes zeros at the edges, so it insufficiently exploits the acceleration potential of the fast algorithm and cannot reach the optimal speedup. The arithmetic complexity of r-order F3DC is denoted as $\mu[T_r(O_r^3, k^3)]$ and is computed as:

$$\mu \left[ T_r \left( O_r^3, k^3 \right) \right] = \frac{[k + (r - 1) \times s]^3}{(r \times s)^3}. \tag{6}$$

As shown in Table I, F3DC achieves a $27 \times$ reduction in multiplications per output compared with ZIM for $k=4$, $s=2$, which significantly improves computational efficiency for 3D DeConv.

TABLE I
ALGORITHMIC COMPLEXITY COMPARISON\*

| Method | k = 3, s = 2 | k = 4, s = 2 | k = 5, s = 2 | k = 9, s = 2 |
|------------------------------------|-------|------|--------|--------|
| ZIM | 27 | 64 | 125 | 729 |
| Winograd-based [11] | 3.375 | 8 | 15.625 | 91.125 |
| $T_3\left(O_3^3,k^3\right)$ (Ours) | 1.59 | 2.37 | 3.375 | 10.17 |

\* The algorithmic complexity indicates the average number of multiplications required to obtain one result.

#### IV. THE PROPOSED ARCHITECTURE AND DATAFLOW

#### A. Architecture Overview

As shown in Fig. 3, the proposed architecture consists of three data buffers (input buffer, kernel buffer, output buffer), a fast processing array (FPA) composed of fast processing units (FPUs), and two accumulators. The FPA consists of four FPUs forming a $2 \times 2$ array. Each FPU includes a pre-process module, an EWMM module and a post-process module. The pre-process module executes the transformation of the 3D DeConv kernel and input tiles by the matrix of Eq. (5). Each EWMM module includes 512 ($8\times8\times8$) multipliers, and the four EWMM modules consume 2048 DSP resources. The post-process module also uses the matrix in Eq.
(5) to obtain output tiles before channel-wise summation. The accumulators accumulate the output tiles from different channels.

Fig. 3. Overview of the proposed hardware architecture.

#### B. FPU Module

As shown in Fig. 4, an FPU is designed for the matrix transformations and EWMM in F3DC, exploiting the simple coefficients in the matrices and the regular computation of EWMM. The pre-process module implements the matrix multiplications responsible for transforming the kernel and input tiles. Since the entries of the transformation matrix are mainly 1, -1, 1/2 and -1/2, multiplication by them can be easily implemented with inversion or shift operations, which greatly simplifies the hardware design. The post-process module is similar to the pre-process module in its transformation matrix. Besides, the computation procedure of F3DC suits a variety of convolution kernels with little adjustment to the transformation circuits and the number of DSPs in the EWMM module. The specific circuit for $T_3(6^3, 4^3)$ is shown in Fig. 4.

Fig. 4. Overall architecture of the FPU and the detailed transformation circuit of each module. (a) Input transformation circuit in the pre-process module. (b) Weight transformation circuit in the pre-process module. (c) EWMM circuit for multiplication. (d) Output transformation circuit in the post-process module.

#### C. FPA Module

To improve the throughput of the architecture, we design the FPA, which consists of four FPUs and two accumulators. The four FPUs form a $2\times 2$ array to increase parallelism by computing two input channels and two output channels at the same time. The FPUs in each row share the same tile from the same input channel, and the FPUs in each column compute the same output channel through an accumulator. Compared with a non-parallel design, the FPA executes more computations per memory access and improves data utilization.

#### D. F3DC Dataflow

Because it may not be feasible to load all data on chip, especially for 3D neural networks, we apply a weight-stationary dataflow [13] with a loop order of output channel, depth tile, height tile, width tile and input channel. For each input-channel iteration, the procedure of Fig. 2 is computed. Besides, we unroll the output-channel and input-channel loops to increase parallelism.

#### V. EXPERIMENTAL RESULTS

#### A. Experimental Setup

We choose the 3D-GAN [1] model as the benchmark to evaluate our design. The inputs and weights are quantized to 16 bits and 8 bits, respectively. We implement all modules of the architecture in Verilog HDL, and evaluate our architecture on the Xilinx VC709 board at a frequency of 150 MHz. Implementation results are reported by Xilinx Vivado 2020.1.

#### B. Performance Analysis

As illustrated in Table II, we implement a computationally efficient F3DC-based accelerator, which not only executes zero-free operations, but also further reduces the computational complexity. The proposed design achieves a performance density of 0.83 GOPS/DSP, more than 4× that of [9] (0.19 GOPS/DSP). Even though [9] uses a fast algorithm to accelerate DeConv computation, the rearranged filters result in computational imbalance and reduce computation efficiency. [10] applies the IOM method to 3D DeConv, whose intermediate results increase storage overhead. Our method simplifies the dataflow and avoids the storage of intermediate results. [11] uses the Winograd algorithm on the zero-inserted input feature maps to accelerate 3D DeConv, and uses the sparsity of intermediate results to avoid some redundant computations. However, [11] has complex pre-processing parameters, which increase hardware overhead. The processing parameters of our method are simplified to save hardware resources. In brief, our design significantly improves computational efficiency while reducing hardware complexity.
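The performance-density figures quoted in this comparison are simply throughput divided by DSP count. As a quick check, the sketch below recomputes them from the throughput and DSP numbers reported in Table II; the dictionary layout and the helper name `performance_density` are illustrative choices, and the entry for [10] uses the normalized 450 GOPS figure from the table.

```python
# Throughput (GOPS) and DSP counts as reported in Table II.
designs = {
    "[8]": {"gops": 864.1, "dsp": 1536},
    "[9]": {"gops": 482.4, "dsp": 2520},
    "[10]": {"gops": 450.0, "dsp": 2304},  # normalized (zero-free) throughput
    "Ours": {"gops": 1700.0, "dsp": 2048},
}


def performance_density(gops, dsp):
    """Performance density in GOPS per DSP slice."""
    return gops / dsp


density = {name: performance_density(**d) for name, d in designs.items()}

if __name__ == "__main__":
    for name, value in density.items():
        gain = density["Ours"] / value
        print(f"{name:>5}: {value:.2f} GOPS/DSP (ours is {gain:.2f}x)")
```

Rounded to two decimals, the results reproduce the last row of Table II, and the ratios against the two 3D-GAN accelerators [9] and [10] both exceed 4.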
TABLE II
COMPARISON WITH OTHER WORKS

| Works | [8] | [9] | [10] | Ours |
|--------------------------------|--------------|---------------|--------------|--------------|
| Platform | Xilinx VC709 | Xilinx ZCU102 | Xilinx VC709 | Xilinx VC709 |
| Model | C3D | 3D-GAN | 3D-GAN | 3D-GAN |
| Clock (MHz) | 200 | 200 | 200 | 150 |
| BRAMs Used | 1071 | - | 712 | 1470 |
| Flip-Flops Used | 265750 | - | 566182 | 212195 |
| LUTs Used | 257210 | - | 292292 | 192342 |
| DSPs Used | 1536 | 2520 | 2304 | 2048 |
| Performance (GOPS) | 864.1 | 482.4 | 450\*(3600) | 1700 |
| Performance Density (GOPS/DSP) | 0.56 | 0.19 | 0.20 | 0.83 |

\* Performance is normalized by removing zero-related computations.

#### VI. CONCLUSION

In this paper, we first introduce F3DC, a fast algorithm for 3D DeConv, capable of reducing the computational complexity and eliminating invalid operations related to inserted zeros. Furthermore, an efficient hardware architecture is proposed to implement the F3DC-based acceleration of 3D-GAN. Finally, we evaluate our architecture by implementing the 3D-GAN model on the Xilinx VC709 platform. The experimental results demonstrate that the proposed architecture achieves a throughput of 1700 GOPS, which significantly surpasses prior works.

#### REFERENCES

- [1] J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum, "Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling," *CoRR*, vol. abs/1610.07584, 2016. [Online]. Available: http://arxiv.org/abs/1610.07584
- [2] B. Yang, H. Wen, S. Wang, R. Clark, A. Markham, and N. Trigoni, "3d object reconstruction from a single depth view with adversarial learning," in *2017 IEEE International Conference on Computer Vision Workshops (ICCVW)*, 2017, pp. 679–688.
- [3] M. D. Cirillo, D. Abramian, and A. Eklund, "Vox2vox: 3d-gan for brain tumour segmentation," *CoRR*, vol. abs/2003.13653, 2020. [Online]. Available: https://arxiv.org/abs/2003.13653
- [4] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, "3d u-net: Learning dense volumetric segmentation from sparse annotation," in *Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016*, S. Ourselin, L. Joskowicz, M. R. Sabuncu, G. Unal, and W. Wells, Eds. Springer International Publishing, pp. 424–432.
- [5] X. Huang, J. Shan, and V. Vaidya, "Lung nodule detection in ct using 3d convolutional neural networks," in *2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017)*, 2017, pp. 379–383.
- [6] S. Ji, W. Xu, M. Yang, and K. Yu, "3d convolutional neural networks for human action recognition," *IEEE Transactions on Pattern Analysis and Machine Intelligence*, vol. 35, no. 1, pp. 221–231, 2013.
- [7] J. Shen, Y. Huang, Z. Wang, Y. Qiao, M. Wen, and C. Zhang, "Towards a uniform template-based architecture for accelerating 2d and 3d cnns on fpga," in *Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, ser. FPGA '18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 97–106. [Online]. Available: https://doi.org/10.1145/3174243.3174257
- [8] C. Fang, L. He, H. Wang, J. Wei, and Z. Wang, "Accelerating 3d convolutional neural networks using 3d fast fourier transform," in *2021 IEEE International Symposium on Circuits and Systems (ISCAS)*. IEEE, 2021, pp. 1–5.
- [9] X. Di, H. Yang, Z. Huang, N. Mao, Y. Jia, and Y. Zheng, "Exploring resource-efficient acceleration algorithm for transposed convolution of gans on fpga," in *2019 International Conference on Field-Programmable Technology (ICFPT)*, 2019, pp. 19–27.
- [10] D. Wang, J. Shen, M. Wen, and C. Zhang, "Towards a uniform architecture for the efficient implementation of 2d and 3d deconvolutional neural networks on fpgas," in *2019 IEEE International Symposium on Circuits and Systems (ISCAS)*, 2019, pp. 1–5.
- [11] J. Shen, D. Wang, Y. Huang, M. Wen, and C. Zhang, "Scale-out acceleration for 3d cnn-based lung nodule segmentation on a multi-fpga system," in *Proceedings of the 56th Annual Design Automation Conference 2019*, ser. DAC '19. New York, NY, USA: Association for Computing Machinery, 2019. [Online]. Available: https://doi.org/10.1145/3316781.3317906
- [12] W. Mao, P. Yang, and Z. Wang, "Fta-gan: A computation-efficient accelerator for gans with fast transformation algorithm," *IEEE Transactions on Neural Networks and Learning Systems*, pp. 1–15, 2021.
- [13] R. Xu, S. Ma, Y. Wang, and Y. Guo, "Cmsa: Configurable multi-directional systolic array for convolutional neural networks," in *2020 IEEE 38th International Conference on Computer Design (ICCD)*, 2020, pp. 494–497.