# arXiv:2210.07803v1 [eess.SP] 14 Oct 2022

# An Efficient FPGA Accelerator for Point Cloud

Zilun Wang, Wendong Mao, Peixiang Yang, Zhongfeng Wang, and Jun Lin

School of Electronic Science and Engineering, Nanjing University, P. R. China Email: {zlwang,wdmao,pxyang}@smail.nju.edu.cn, {zfwang,jlin}@nju.edu.cn

Abstract-Deep learning-based point cloud processing plays an important role in various vision tasks, such as autonomous driving, virtual reality (VR), and augmented reality (AR). The submanifold sparse convolutional network (SSCN) has been widely used for point clouds because of the visual results it achieves. However, existing convolutional neural network accelerators suffer from non-trivial performance degradation when employed to accelerate SSCN, because of the extreme and unstructured sparsity and the complex computational dependency between the sparsity of the central activation and that of its neighbors. In this paper, we propose a high-performance FPGA-based accelerator for SSCN. Firstly, we develop a zero removing strategy to remove the coarse-grained redundant regions, thus significantly improving computational efficiency. Secondly, we propose a concise encoding scheme to obtain the matching information for efficient point-wise multiplications. Thirdly, we develop a sparse data matching unit and a computing core based on the proposed encoding scheme, which can convert the irregular sparse operations into regular multiply-accumulate operations. Finally, an efficient hardware architecture for the submanifold sparse convolutional layer is developed and implemented on the Xilinx ZCU102 field-programmable gate array board, where the 3D submanifold sparse U-Net is taken as the benchmark. The experimental results demonstrate that our design improves computational efficiency and improves the power efficiency by 51 times compared to GPU.

*Index Terms*—Point cloud, submanifold sparse convolution, hardware architecture

### I. INTRODUCTION

A three-dimensional (3D) point cloud is inherently sparse data acquired from 3D sensors and can provide rich geometric, shape, and scale information [1]. Compared with two-dimensional (2D) RGB images, a 3D point cloud better preserves the original geometric information in 3D space for deep learning-based vision tasks. However, the biggest challenge of computing on a 3D point cloud comes from its extremely sparse nature. Moreover, the sparsity of a point cloud is fundamentally different from that in traditional convolutional neural networks (CNNs). For CNNs, the sparsity is usually caused by the activation functions, whereas for a point cloud the sparsity reflects the 3D composition of the real world. Reducing the redundant computation caused by high sparsity therefore becomes the key to processing point clouds. Prior works have proposed deep learning-based



Fig. 1. An example of the point cloud application [2].

methods. For instance, [3]–[5] projected the 3D point cloud into 2D to compress the data dimensions and reduce computational complexity, then applied 2D CNNs on the projected point cloud. [6]–[8] directly applied Multi-Layer Perceptron (MLP) operations to the original points to extract semantic features from the sparse point cloud, without voxelizing the point cloud into 3D grids. [9]–[12] converted the point cloud into a sparse discrete representation, then applied modified 3D CNNs for different tasks. Furthermore, the authors of [12] proposed submanifold sparse convolution (Sub-Conv) to reduce the memory and computational costs of computing on the point cloud by restricting the convolution to nonzero activations. The submanifold sparse convolutional network (SSCN) [12] achieves remarkable results compared to other deep learning-based methods [13]. Consequently, SSCN plays an important role in point cloud-based deep learning applications, motivating its deployment on resource-constrained edge devices and the design of dedicated accelerators.

Several dedicated hardware accelerators [14]–[16] have been proposed to accelerate CNNs. Eyeriss [14] presented a general dataflow to minimize data movement. GoSPA [16] proposed an intersection method to optimize the dataflow when both the activations and the weights are sparse. However, when these CNN accelerators are directly used for SSCN, they suffer from severe performance degradation because they cannot perform the matching operation of explicitly determining each nonzero activation and searching its nonzero neighbors, which is the core operation of the Sub-Conv layer. Therefore, a dedicated accelerator for SSCN is highly desired to promote its deployment.

Several works have presented solutions for point cloud-based networks. [17] and [18] introduced ASIC-based accelerators for PointNet++ and proposed optimization schemes for the neighbor point search. [19] designed a low-power FPGA-based accelerator, which optimized the nonlinear implementations in PointNet. The above works are based on the PointNet and PointNet++ networks, and thus cannot be directly applied

This work was supported in part by the National Natural Science Foundation of China under Grant 62174084, 62104097 and in part by the High-Level Personnel Project of Jiangsu Province under Grant JSSCBS20210034, the Key Research Plan of Jiangsu Province of China under Grant BE2019003-4. (Corresponding author: Zhongfeng Wang; Jun Lin.)

to the acceleration of SSCN. PointAcc [20] proposed an ASIC-based accelerator that unified diverse mapping operations into a multiply-accumulate operation through coordinate transformation to be compatible with different point cloud networks. Other hardware solutions such as GPUs can be deployed to accelerate the point cloud networks. However, GPUs are not suitable for resource-constrained edge devices because of their high power consumption, and the matching operation also limits their performance. Concentrating on the SSCN, we propose an efficient FPGA-based SSCN accelerator, ESCA, to support the matching operation and corresponding computations. This work makes the following contributions:

- A tile-based zero removing strategy is proposed to improve computational efficiency. The strategy reduces the processing time of the sparse information significantly, which also alleviates the computational load imbalance.
- An encoding scheme is introduced to efficiently support the matching operation. Based on the above scheme, a matching method is proposed to execute the matching operation for each nonzero activation, which solves the problems of explicit representation in the matching operation.
- A dedicated SSCN accelerator is proposed to support the matching operation and corresponding computations. The proposed design is implemented in the Xilinx ZCU102 platform and achieves significant improvement in terms of GOPS and power efficiency compared with GPU.

### II. BACKGROUND

The computation rules of Sub-Conv are fundamentally different from those of traditional convolution. Fig. 2(a) shows the results of traditional convolution for sparse features, and Fig. 2(b) shows the matching process of Sub-Conv. In traditional convolution, the input feature map is traversed by a kernel, and multiply-accumulate operations are performed in order. Even if the feature map is sparse, once the convolution parameters, such as stride and kernel size, are determined, the computation rules and correspondences in the convolution are explicitly determined. As a result, the sparse data in the output feature map dilates [12], so traditional convolution is not suitable for point cloud-based computation.

For Sub-Conv [12], the fields of the feature map involved in the convolution operations are strictly limited to the neighbors of the nonzero activations, and the output feature map maintains the same sparsity as the input feature map. As shown in Fig. 2(b), five nonzero activations mean that this feature map only needs to perform five convolution operations with the corresponding kernel, and the positions are strictly limited to the fields where the central activation is nonzero. Because the Sub-Conv layer keeps the same sparsity pattern between the input feature map and the output feature map, it shows satisfying visual results when applied to point clouds with high sparsity.
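The difference between the two rules can be illustrated with a minimal 2D NumPy sketch. It is not the accelerator's implementation: the function names, the 6 × 6 map, and the all-ones kernel are assumptions chosen for illustration, and a site is treated as active simply when its value is nonzero.

```python
import numpy as np

def dense_conv2d(x, w):
    """Traditional 'same' convolution (zero padding, stride 1) on a 2D map."""
    k = w.shape[0]
    xp = np.pad(x, k // 2)
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * w)
    return out

def sub_conv2d(x, w):
    """Sub-Conv sketch: evaluate the kernel only where the centre is nonzero."""
    k = w.shape[0]
    xp = np.pad(x, k // 2)
    out = np.zeros_like(x, dtype=float)
    for i, j in zip(*np.nonzero(x)):  # one convolution per nonzero activation
        out[i, j] = np.sum(xp[i:i + k, j:j + k] * w)
    return out

# Hypothetical feature map with five nonzero activations.
x = np.zeros((6, 6))
x[1, 1], x[1, 2], x[2, 2], x[4, 3], x[4, 4] = 1.0, 2.0, 3.0, 4.0, 5.0
w = np.ones((3, 3))

dense_out = dense_conv2d(x, w)
sub_out = sub_conv2d(x, w)
print("nonzeros in input:      ", np.count_nonzero(x))
print("nonzeros after dense:   ", np.count_nonzero(dense_out))
print("nonzeros after Sub-Conv:", np.count_nonzero(sub_out))
```

In this sketch the dense output has more nonzero sites than the input (the dilation described above), while the Sub-Conv output keeps exactly the input's sparsity pattern and agrees with the dense result at the active sites.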

However, because the restricted computation pattern of the Sub-Conv layer leads to irregular sparse matching operations, traditional convolution accelerators suffer from performance



Fig. 2. Illustration of traditional convolution and Sub-Conv. (a) Traditional convolution: the feature map is traversed by the kernel, and the sparsity in the output feature map dilates. (b) Sub-Conv: The kernel only calculates with the fields where the center activation is non-zero.

degradation when they are directly applied to it [20]. Therefore, efficient accelerators for SSCN are urgently needed, and the bottleneck lies in the extreme and unstructured sparsity, and the complex computational dependency between the sparsity of the central activation and the neighborhood ones.

### III. EFFICIENT DESIGN FOR SUBMANIFOLD SPARSE CONVOLUTIONAL NETWORK

### A. Tile-based Zero Removing Strategy

A voxelized point cloud is extremely sparse. Processing the original feature map directly results in large memory overhead and computation cost, and dramatically reduces computational efficiency. The ShapeNet dataset [21], for example, has nearly 99.9% sparsity, resulting in many regions without nonzero activations. Since the computation depends on the sparsity of the central activation, removing the all-zero regions has no effect on the result. To tackle this problem, we propose an effective tile-based zero removing strategy to remove the coarse-grained redundant sparse regions. As illustrated in Fig.



Fig. 3. The process of zero removing strategy. (a) The original input feature map is first divided into tiles of fixed size. (b) The fully sparse tiles of the input are removed, keeping only tiles containing nonzero activations. (c) Due to the nature of Sub-Conv, the removal of fully sparse tiles does not affect the output.

3(a), the original 3D feature map is divided into tiles of size $N \times M \times L$, where N, M and L are configurable parameters, and the sparsity in each tile is detected. If all the activations are zero in the tile, the tile is fully sparse and will be removed from the original feature map as shown in Fig. 3(b). Because the fully sparse tile is irrelevant to the computation of the Sub-Conv, the output feature map, as depicted in Fig. 3(c), still maintains the same sparsity. Then the processed feature map is only composed of active tiles, which contain at least one nonzero activation, and will be sequentially matched and computed. With this zero removing strategy, the time overhead when processing sparse information is significantly reduced, and the problem of computational imbalance is also alleviated.
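The tile detection step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `active_tile_indices`, the 16 × 16 × 16 toy map, and the requirement that the map dimensions are multiples of the tile size are not taken from the paper.

```python
import numpy as np

def active_tile_indices(fmap, tile):
    """Return the indices of tiles holding at least one nonzero activation,
    together with the total number of tiles."""
    n, m, l = tile
    X, Y, Z = fmap.shape
    assert X % n == 0 and Y % m == 0 and Z % l == 0, "map must tile evenly"
    # Split every axis into (tile index, offset inside tile).
    blocks = fmap.reshape(X // n, n, Y // m, m, Z // l, l)
    # A tile is active if any of its N x M x L sites is nonzero.
    occupied = (blocks != 0).any(axis=(1, 3, 5))
    return np.argwhere(occupied), occupied.size

# Hypothetical sparse map: three nonzero activations falling in two tiles.
fmap = np.zeros((16, 16, 16))
fmap[0, 0, 0] = 1.0
fmap[1, 2, 3] = 2.0
fmap[15, 15, 15] = 3.0

idx, total = active_tile_indices(fmap, (8, 8, 8))
removing_ratio = 1 - len(idx) / total
print(idx.tolist(), total, removing_ratio)
```

Only the tiles listed in `idx` would be passed on to matching and computation; the remaining tiles are skipped entirely.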

### B. Matching Operation and Encoding Scheme

The matching operation is the procedure that locates each nonzero activation and searches for its nonzero neighbors, and it is crucial for the computation of SSCN. To support it, position information recording the geometric distribution of nonzero activations is required. Thus, an encoding scheme is proposed, which encodes the feature map into two types of data: index mask and valid data.

**Index Mask.** The index mask explicitly represents the sparsity distribution of the feature map and is dynamically traversed during computation. The relationship between features, masks, and nonzero activations is shown in Fig. 4. Each mask is a one-bit signal, where 0 indicates a zero activation and 1 indicates a nonzero activation, and it is stored in the mask buffer. Because the mask directly reflects the sparse distribution of the point cloud, the relationship between the input feature map and the matching operation can be established explicitly.

**Valid Data.** Valid data are the nonzero activations and the corresponding weights, as shown in Fig. 4. As valid data, the activations and weights are stored in the corresponding buffers, and can be read from the buffers under the guidance of the index mask. Thus, the matching operation can be performed through the interaction between the index mask and the valid data.



Fig. 4. Composition of the index mask and the valid data.
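The encoding scheme of Section III-B can be sketched as follows for the activation side only. The names `encode` and `decode` are hypothetical, and the storage order of the valid data (NumPy's row-major scan) is an assumption, since the buffer layout is not specified at this level of detail.

```python
import numpy as np

def encode(fmap):
    """Split a feature map into a one-bit index mask and the valid data."""
    mask = (fmap != 0).astype(np.uint8)  # 1 where the activation is nonzero
    valid = fmap[fmap != 0]              # nonzero activations, row-major order (assumed)
    return mask, valid

def decode(mask, valid):
    """Rebuild the dense feature map from the mask and the valid data."""
    out = np.zeros(mask.shape, dtype=valid.dtype)
    out[mask.astype(bool)] = valid
    return out

# Hypothetical 2D feature map.
fmap = np.array([[0, 7, 0, 0],
                 [0, 0, 0, 3],
                 [5, 0, 0, 0]], dtype=np.int16)

mask, valid = encode(fmap)
restored = decode(mask, valid)
print(mask)
print(valid)
```

The mask costs one bit per site, while only the nonzero activations occupy activation-buffer entries, so the pair carries the same information as the dense map.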

### C. Sparse Data Matching Unit

The matching operation and the composition of match group are elaborated in Fig. 5. A **match group** contains the nonzero



Fig. 5. Illustration of the matching operation and match group.



Fig. 6. Description of the SDMU. The Acc in the state index generator corresponds to the accumulation operation.

activations and corresponding weights for each convolution calculation based on the central nonzero activation. Each set of elements in a match group is called a match. Thus, after determining the match groups for all nonzero activations, the matching operation is completed for one feature map. Meanwhile, the computation of the Sub-Conv layer is decomposed into point-wise multiply-accumulate operations for each match group.

To support the matching operation and search all match groups efficiently for the Sub-Conv layer, we propose the sparse data matching unit (SDMU), which is shown in Fig. 6. The mask judger and the decoder perform the matching operation and generate the match groups from the buffers. For the convolution with the kernel size of $K \times K \times K$, the index masks of each column are read sequentially, so the parallelism of the decoder in SDMU is $K^2$, which corresponds to the number of columns. Then the FIFO group stores the match groups in column order. Finally, the multiplexer (MUX) selects matches from the FIFO group and sends them to the computing core for point-wise multiply-accumulation.

To coordinate the computation rules, activations and the ones that are in their neighbor field need to be explicitly acquired at the same time. Therefore, $K^3$ masks are required for determination. This area is called the sparse receptive field (SRF). For each nonzero activation, the matching operation and the acquisition of the match group are limited to the SRF.

The process of the matching operation is described in Fig. 7(a). The example is presented in 2D and can be extended directly to 3D. The kernel size is $3^2$, so the parallelism is 3. The matching method consists of four steps: read masks, judge state, generate state index, and fetch activations.



Fig. 7. Examples of the matching steps in the SDMU. (a) The process of obtaining match groups through masks. (b) Pipeline representation when executing the matching operation.

**Read masks:** The index mask is read from the mask buffer for each SRF and sent to the mask judger.

**Judge state:** The mask judger decides from the mask whether to perform the convolution for the SRF. If the center mask corresponds to a nonzero activation, then this SRF is active, and the match group is fetched from the buffers according to the generate state index step and the fetch activations step. Otherwise, it is non-active and the fetch activations step will be skipped.

**Generate state index:** In this step, the relative position of nonzero activations is generated for each SRF and is called the state index. It can be regarded as an array (A, B). The index A records the nonzero activations accumulated in each column and is accumulated across SRFs. The index B represents the number of activations in each column for each SRF if the state is active; otherwise, index B equals 0. Thus, the index A marks the highest address of the activation in the activation buffer for each match group, and the index B corresponds to the address length of the activations involved in the computation in each column.

**Fetch activations:** The address fragment for nonzero activations of each column can be represented by (A, A-B). It is generated in the address generator and contains addresses for all activations in each match group. Then the corresponding activations are read from the activation buffer. If the mask of the central site is zero, which indicates the matching operation will not be implemented, the fetch activations step for this SRF will be skipped accordingly.

These matching steps are executed in a pipeline, as shown in Fig. 7(b). Since weights and activations have a positional correspondence in each match group, the weights that need to participate in the computation can also be obtained by state index synchronously, and the corresponding activations and weights are concatenated when read from buffers. In summary, the state index obtained by traversing the index mask can establish a matching relationship with valid data, through which the match group can be collected.
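The four steps can be sketched in software for the 2D case with $K = 3$. This is one possible reading of the state index, not the SDMU itself: each column keeps its own compact activation buffer, A is taken as a running count of nonzeros up to the bottom edge of the SRF (used here as an exclusive upper address), and B is the count inside the SRF, so the slice `[A - B, A)` holds the activations of that column. All function and variable names are hypothetical.

```python
import numpy as np

K = 3  # kernel size of the 2D walkthrough

def build_buffers(fmap):
    """Encode the map as an index mask, per-column activation buffers,
    and a running nonzero count per column."""
    mask = fmap != 0
    # Nonzero activations of every column, stored top to bottom.
    act_buf = [fmap[mask[:, c], c] for c in range(fmap.shape[1])]
    # prefix[r, c] = number of nonzeros in column c located above row r.
    prefix = np.vstack([np.zeros((1, fmap.shape[1]), dtype=int),
                        np.cumsum(mask, axis=0)])
    return mask, act_buf, prefix

def match_group(i, j, mask, act_buf, prefix, weights):
    """Return the (activation, weight) matches of the SRF centred at (i, j)."""
    H, W = mask.shape
    if not mask[i, j]:                 # judge state: non-active SRF is skipped
        return []
    r = K // 2
    lo, hi = max(i - r, 0), min(i + r + 1, H)
    matches = []
    for c in range(max(j - r, 0), min(j + r + 1, W)):
        A = prefix[hi, c]              # state index A (exclusive upper address)
        B = A - prefix[lo, c]          # state index B (nonzeros inside the SRF)
        acts = act_buf[c][A - B:A]     # fetch activations by address range
        rows = np.nonzero(mask[lo:hi, c])[0] + lo
        for a, row in zip(acts, rows): # pair each activation with its weight
            matches.append((a, weights[row - i + r, c - j + r]))
    return matches

# Hypothetical feature map and kernel.
fmap = np.array([[0, 2, 0, 0, 0],
                 [1, 3, 0, 0, 0],
                 [0, 0, 0, 4, 0],
                 [0, 0, 5, 0, 0],
                 [0, 0, 0, 0, 6]], dtype=float)
weights = np.arange(1, 10, dtype=float).reshape(3, 3)

mask, act_buf, prefix = build_buffers(fmap)
outputs = {}
for i, j in zip(*np.nonzero(fmap)):
    group = match_group(i, j, mask, act_buf, prefix, weights)
    outputs[(int(i), int(j))] = sum(a * w for a, w in group)
print(outputs)
```

Summing the products of each match group gives the Sub-Conv output at the corresponding active site, which is the point-wise multiply-accumulate work handed to the computing core.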

In the matching steps, parallel processing is performed along the column dimension in every SRF to maintain the synchronization of explicit representations of each match group. Therefore, after obtaining the match group from the $K^2$ columns, a number decided by the kernel size, a FIFO group is applied to store them. The FIFO group consists of $K^2$ identical FIFOs, and each FIFO stores the matches belonging to one column. In each cycle, the controller in the decoder selects a match from a FIFO based on the calculation order, and the MUX sends it to the computing core.

### D. Computing Core

Since the sparse data are already transformed into match groups in the SDMU, the computing core (CC) is designed to implement dense point-wise multiply-accumulate operations. The CC contains a computing array and an accumulator. In each cycle, the input to the computing array is a match belonging to a match group. In order to improve throughput, the computing array is divided into m + 1 computing units (CUs), each of which performs the computation of n+1 input channels (ICs), and the output of each CU is the partial sum of the corresponding output channel (OC), so the total parallelism of the computing array is (m + 1)(n + 1).

Fig. 8(b) illustrates the inputs and outputs of the computing array. The activations of the n + 1 ICs are broadcast to all CUs. $A_{[n]}$ represents the activations belonging to IC n, and $W_{[n][m]}$ represents the weights belonging to IC n and OC m. For example, the result of CU m is equal to the partial sum of the n + 1 ICs on the $m^{th}$ OC.

The detailed structure of the computing unit is shown in Fig. 8(c). The partial sum of nonzero activations for different OCs can be obtained through the computing array, then the partial sum is sent to the accumulator and the output of each SRF is obtained.
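A behavioral sketch of the computing array and accumulator is given below. It assumes 16 input channels and 16 output channels, the parallelism reported in Section III-E; the function names and the random integer data are illustrative assumptions, and no fixed-point or pipeline behavior is modeled.

```python
import numpy as np

def computing_array(a, w):
    """One cycle of the array: `a` holds the n+1 input-channel activations of a
    match and `w` is its (n+1, m+1) weight block. Each CU (one column of `w`)
    produces the partial sum of one output channel."""
    return np.array([np.dot(a, w[:, oc]) for oc in range(w.shape[1])])

def computing_core(group):
    """Accumulate the partial sums of all matches in one match group."""
    acc = None
    for a, w in group:
        partial = computing_array(a, w)
        acc = partial if acc is None else acc + partial
    return acc

# Hypothetical match group with three matches.
rng = np.random.default_rng(0)
N_IC, N_OC = 16, 16
group = [(rng.integers(-8, 8, N_IC), rng.integers(-8, 8, (N_IC, N_OC)))
         for _ in range(3)]

out = computing_core(group)
print(out.shape)
```

Each call to `computing_array` corresponds to (m + 1)(n + 1) multiply-accumulate operations, and the accumulator output is the result for one SRF across all output channels.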



Fig. 8. Illustration of loop unrolling and the composition of computing array. (a) The process of loop unrolling. (b) The description of the computing array. (c) The structure of the computing unit in the computing array.

The details of the loops are shown in Fig. 8(a). Each active tile is traversed in turn. The obtained data are fed to the CC in



Fig. 9. Description of overall hardware architecture.

the order of matched nonzero activations and weights, and the IC and OC dimensions are completed sequentially according to the parallelism of the proposed computing array. Finally, the partial sum of each match group is accumulated to obtain the outputs corresponding to nonzero activations. The SDMU and CC are executed in pipeline to increase resource utilization and the system throughput.

### E. Overall Hardware Architecture

The overall hardware architecture is shown in Fig. 9, mainly containing a main controller, an SDMU, a CC, and corresponding buffers on the on-chip logic.

**Main Controller**. The main controller is responsible for ensuring that the SDMU and the CC are executed in the right order.

**SDMU**. In the SDMU, the mask judger and the decoder perform the matching operation. The obtained match groups are stored in the corresponding FIFOs and read out under the control of the FIFO group and the MUX, and the matched data are sent to the computing array in order.

**CC.** In the computing array of CC, computation is performed in the IC and OC dimensions, and the partial sum is generated in the OC dimension. Then the partial sum is accumulated in the accumulator and finally sent to the output buffer. In our structure, the parallelism is set to 16 both in the OC and IC dimensions.

There are four **buffers** to store data, whose basic unit is block RAM. The mask buffer stores the mask, while the activation buffer and weight buffer store activations and weights respectively. The output buffer stores the outputs and sends them to the off-chip DRAM.

### IV. EXPERIMENTAL RESULTS

### A. Experimental Setup

We adopt the 3D submanifold sparse U-Net (SS U-Net) [12] to evaluate our ESCA. SS U-Net can perform the semantic segmentation task of the point cloud with satisfactory visual results. The pre-trained network parameters are quantized to 8 bits, and the activations are quantized to 16 bits. The kernel size of the Sub-Conv in the SS U-Net is $3 \times 3 \times 3$, so the parallelism of SDMU and the number of FIFOs in the FIFO group are set

to  $3^2$ . The whole system is implemented with Vivado Design Suite. The performance of the GPU baseline is measured by NVIDIA System Management Interface.

### B. Analysis of Zero Removing Strategy

We evaluate the zero removing strategy on two representative point cloud datasets, the ShapeNet dataset [21] and the NYU Depth dataset (v2) [22]. The feature maps are normalized to the size of $192 \times 192 \times 192$ after voxelization. We test the effect of different tiling sizes on the sparsity and the number of remaining active tiles. The experimental results are shown in Table I. With different tiling sizes, this strategy achieves up to 99.82% zero reduction on ShapeNet [21], and up to 99.85% on NYU [22]. A more fine-grained tile size increases the removing ratio of zeros, but it also increases the computational complexity. We use the tile size of $8 \times 8 \times 8$.

TABLE I Analysis of Zero Removing Strategy

| Dataset       | Tile Size                | Active Tiles | All Tiles | Removing Ratio |
|---------------|--------------------------|--------------|-----------|----------------|
| ShapeNet [21] | $4 \times 4 \times 4$    | 198          | 110592    | 99.82%         |
|               | $8 \times 8 \times 8$    | 42           | 13824     | 99.69%         |
|               | $12 \times 12 \times 12$ | 23           | 4096      | 99.43%         |
|               | $16 \times 16 \times 16$ | 14           | 1728      | 99.18%         |
| NYU [22]      | $4 \times 4 \times 4$    | 161          | 110592    | 99.85%         |
|               | $8 \times 8 \times 8$    | 33           | 13824     | 99.76%         |
|               | $12 \times 12 \times 12$ | 19           | 4096      | 99.53%         |
|               | $16 \times 16 \times 16$ | 9            | 1728      | 99.48%         |
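The entries of Table I can be cross-checked with a short calculation. The sketch assumes that the removing ratio is the fraction of tiles removed, which is an interpretation rather than a formula stated in the text; under that assumption the tile counts match exactly and the ratios agree with the reported values to within about 0.01 percentage points.

```python
SIDE = 192  # feature map is 192 x 192 x 192 after voxelization

# (active tiles, reported removing ratio in %) per cubic tile edge, from Table I.
reported = {
    "ShapeNet": {4: (198, 99.82), 8: (42, 99.69), 12: (23, 99.43), 16: (14, 99.18)},
    "NYU":      {4: (161, 99.85), 8: (33, 99.76), 12: (19, 99.53), 16: (9, 99.48)},
}

def all_tiles(edge):
    """Number of cubic tiles of the given edge length in the feature map."""
    return (SIDE // edge) ** 3

def tile_removing_ratio(active, edge):
    """Percentage of tiles that are fully sparse (assumed definition)."""
    return 100.0 * (1 - active / all_tiles(edge))

for dataset, rows in reported.items():
    for edge, (active, ratio) in rows.items():
        computed = tile_removing_ratio(active, edge)
        print(f"{dataset:8s} {edge:2d}^3  all={all_tiles(edge):6d}  "
              f"computed={computed:.3f}%  reported={ratio:.2f}%")
```

The calculation also shows the trade-off noted above: halving the tile edge multiplies the number of tiles to inspect by eight, while the removing ratio improves only slightly.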

### C. Results Comparison

The proposed ESCA architecture is implemented on the Zynq UltraScale+ ZCU102 FPGA at 270 MHz. The hardware resource utilization is reported in Table II.

TABLE II FPGA FREQUENCY AND RESOURCE UTILIZATION

| Frequency (MHz) | LUT           | FF            | BRAM           | DSP          |
|-----------------|---------------|---------------|----------------|--------------|
| 270             | 17614 (6.43%) | 12142 (2.22%) | 365.5 (40.08%) | 256 (10.16%) |

ESCA is compared with a Tesla P100 GPU and an Intel Xeon Gold 6148 CPU, which are existing hardware acceleration solutions for SSCN. As shown in Fig. 10, our ESCA outperforms the CPU and GPU implementations by around 8.41 times and 1.89 times in terms of speedup. Since the computation of SSCN depends on the sparsity of the center activation and its neighbors, and the GPU and CPU cannot recognize this correspondence, they perform a large number of redundant computations. In ESCA, by contrast, the matching operation is executed efficiently. The detailed comparisons between GPU and our design are summarized in Table III. Our design achieves 17.73 GOPS and 5.14 GOPS/W in terms of performance and power efficiency, which outperforms GPU by around 1.88 times and 51 times. Note that the GOPS is effective performance



Fig. 10. Comparison with CPU and GPU in terms of time consumption when processing a Sub-Conv layer.

TABLE III COMPARISON WITH OTHER IMPLEMENTATIONS FOR POINT CLOUD

|                              | GPU        | [19]        | ours       |
|------------------------------|------------|-------------|------------|
| Device                       | Tesla P100 | Zynq XC7Z045 | Zynq ZCU102 |
| Frequency (MHz)              | -          | 100         | 270        |
| Model                        | SS U-Net   | O-Pointnet  | SS U-Net   |
| Precision                    | FP32       | INT16       | INT8/INT16 |
| Power (W)                    | 90.56      | 2.15        | 3.45       |
| Performance<br>(GOPS)        | 9.40       | 1.21        | 17.73      |
| Power Efficiency<br>(GOPS/W) | 0.10       | 0.56        | 5.14       |

containing only non-zero multiply-accumulate operations for a fair and clear comparison with other implementations.

To further evaluate the performance of ESCA, it is also compared with an FPGA-based accelerator [19], which targets the optimized PointNet (O-Pointnet) and leverages the MLP operations for point clouds. Compared with [19], our accelerator has a significant improvement in both performance and power efficiency as shown in Table III.

To sum up, the higher performance of ESCA comes from two aspects. On one hand, the zero removing strategy and encoding scheme optimize the data structure to facilitate the matching operation. On the other hand, the on-chip logic efficiently performs the matching operation and multiply-accumulate computations through the SDMU and CC.

### V. CONCLUSION

In this paper, we present ESCA, an efficient FPGA-based accelerator that supports SSCN. A zero removing strategy is introduced to remove the coarse-grained redundant regions and an encoding scheme is proposed to simplify the matching operation. Based on the encoding scheme, the sparse data matching unit (SDMU) and the computing core (CC) are developed. The 3D submanifold sparse U-Net is used as the benchmark, and the proposed design is implemented on the Xilinx ZCU102. The experimental results show that our work outperforms the GPU by around 1.88 times and 51 times in terms of performance and power efficiency, respectively.

### REFERENCES

- [1] Y. Guo, H. Wang, Q. Hu, H. Liu, L. Liu, and M. Bennamoun, "Deep learning for 3d point clouds: A survey," *IEEE transactions on pattern analysis and machine intelligence*, vol. 43, no. 12, pp. 4338–4364, 2020.
- [2] S. Shi, X. Wang, and H. Li, "Pointrcnn: 3d object proposal generation and detection from point cloud," in *Proceedings of the IEEE/CVF conference on computer vision and pattern recognition*, 2019, pp. 770–779.

- [3] F. J. Lawin, M. Danelljan, P. Tosteberg, G. Bhat, F. S. Khan, and M. Felsberg, "Deep projective 3d semantic segmentation," in *International Conference on Computer Analysis of Images and Patterns*. Springer, 2017, pp. 95–107.
- [4] A. Boulch, B. Le Saux, and N. Audebert, "Unstructured point cloud semantic labeling using deep segmentation networks." *3DOR@Eurographics*, vol. 3, 2017.
- [5] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, "Multi-view convolutional neural networks for 3d shape recognition," in *Proceedings of the IEEE international conference on computer vision*, 2015, pp. 945–953.
- [6] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "Pointnet: Deep learning on point sets for 3d classification and segmentation," in *Proceedings of the IEEE conference on computer vision and pattern recognition*, 2017, pp. 652–660.
- [7] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, "Pointnet++: Deep hierarchical feature learning on point sets in a metric space," *Advances in neural information processing systems*, vol. 30, 2017.
- [8] H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas, "Kpconv: Flexible and deformable convolution for point clouds," in *Proceedings of the IEEE/CVF international conference on computer vision*, 2019, pp. 6411–6420.
- [9] J. Huang and S. You, "Point cloud labeling using 3d convolutional neural network," in 2016 23rd International Conference on Pattern Recognition (ICPR). IEEE, 2016, pp. 2670–2675.
- [10] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, "Sparsity invariant cnns," in 2017 international conference on 3D Vision (3DV). IEEE, 2017, pp. 11–20.
- [11] L. Tchapmi, C. Choy, I. Armeni, J. Gwak, and S. Savarese, "Segcloud: Semantic segmentation of 3d point clouds," in 2017 international conference on 3D vision (3DV). IEEE, 2017, pp. 537–547.
- [12] B. Graham, M. Engelcke, and L. Van Der Maaten, "3d semantic segmentation with submanifold sparse convolutional networks," in *Proceedings of the IEEE conference on computer vision and pattern recognition*, 2018, pp. 9224–9232.
- [13] M. Najibi, G. Lai, A. Kundu, Z. Lu, V. Rathod, T. Funkhouser, C. Pantofaru, D. Ross, L. S. Davis, and A. Fathi, "Dops: Learning to detect 3d objects and predict their 3d shapes," in *Proceedings of the IEEE/CVF conference on computer vision and pattern recognition*, 2020, pp. 11913–11922.
- [14] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," ACM SIGARCH computer architecture news, vol. 44, no. 3, pp. 367–379, 2016.
- [15] C. Zhu, K. Huang, S. Yang, Z. Zhu, H. Zhang, and H. Shen, "An efficient hardware accelerator for structured sparse convolutional neural networks on fpgas," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 28, no. 9, pp. 1953–1965, 2020.
- [16] C. Deng, Y. Sui, S. Liao, X. Qian, and B. Yuan, "Gospa: an energy-efficient high-performance globally optimized sparse convolutional neural network accelerator," in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2021, pp. 1110–1123.
- [17] B. Liu, X. Chen, Y. Han, J. Li, H. Xu, and X. Li, "Accelerating dnn-based 3d point cloud processing for mobile computing," *Science China Information Sciences*, vol. 62, no. 11, pp. 1–11, 2019.
- [18] Y. Feng, B. Tian, T. Xu, P. Whatmough, and Y. Zhu, "Mesorasi: Architecture support for point cloud analytics via delayed-aggregation," in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 1037–1050.
- [19] X. Zheng, M. Zhu, Y. Xu, and Y. Li, "An fpga based parallel implementation for point cloud neural network," in 2019 IEEE 13th International Conference on ASIC (ASICON), 2019, pp. 1–4.
- [20] Y. Lin, Z. Zhang, H. Tang, H. Wang, and S. Han, "Pointacc: Efficient point cloud accelerator," in *MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture*, 2021, pp. 449–461.
- [21] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su *et al.*, "Shapenet: An information-rich 3d model repository," *arXiv preprint arXiv:1512.03012*, 2015.
- [22] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from rgbd images," in *European conference on computer vision*. Springer, 2012, pp. 746–760.