research-article

Open access

Enhancing the Encoding Process in Point Cloud Completion

Authors:

Kaiyue Zhou,

Zelong Tan,

Shu Yang,

Shengjin WangAuthors Info & Claims

ICCPR '24: Proceedings of the 2024 13th International Conference on Computing and Pattern Recognition

Pages 59 - 65

https://doi.org/10.1145/3704323.3704330

Published: 07 January 2025 Publication History

PDF eReader

Abstract

Point cloud completion aims to reconstruct a complete 3D shape from sparse or incomplete point clouds. Existing methods focus more on the decoder design, generating more reasonable details from a latent code embedded by a relatively simple encoder with an aggregation operation. However, such simple encoding might ignore the important points in a local region. In this work, we propose to apply hierarchical self-distillation (HSD) as an alternative encoding plan to the conventional multi-scale grouping (MSG), to comprehensively generate more representative code for completion tasks. Different from commonly applied multi-scale grouping methods, our HSD approach learns the dissociation of high-level features among different encoding scales, which is particularly suitable for encoding the information of incomplete point clouds. Experiments show that our approach outperforms state-of-the-art learning-based methods by at least 1% by incorporating a self-aware loop within the network’s knowledge path.

1 Introduction

Figure 1:

Studies on point clouds have started emerging in recent years, and the categories are divided primarily into reconstruction and understanding. PointNet [9] and PointNet++ [10] established the rudimentary foundation for point cloud understanding, while PCN [25] and PU-Net [24] formulated the pipeline for point cloud reconstruction (including completion and upsampling). Their common property is that the point cloud (either partial or complete) should always be encoded as a latent code with high-level information before the subsequent understanding or reconstruction modules. These pioneering efforts for encoding primarily apply simple aggregation operations. This is because of the order invariant prerequisite for encoding orderless point clouds, which is generally accomplished by a symmetric function for aggregation, e.g., max pooling, summation, or a global set abstraction layer. As such, the information loss [2] inevitably prevents these methods from better shape awareness or point recovery.

Generally, point cloud completion serves as a pre-processing step for the downstream tasks, including autonomous driving, virtual reality (VR), augmented reality (AR), industrial manufacturing, and robotic gripping, which might be primarily subject to local geometric details of the recovered object. In autonomous driving, it enhances the perception systems of self-driving cars by accurately reconstructing occluded or incomplete 3D data, ensuring safer navigation. In VR and AR, it enables more realistic and immersive experiences by reconstructing detailed virtual environments. In healthcare, it aids in creating complete 3D models from partial scans, facilitating better diagnosis and treatment planning. Additionally, in robotics, it improves the ability of robots to understand and interact with their environment, enabling precise manipulation and obstacle avoidance. These applications demonstrate the significant potential of deep learning-based point cloud completion in advancing technology and improving efficiency and safety in various domains.

In deep learning-based point cloud processing algorithms, including both understanding [10, 16] and reconstruction [11, 20, 24, 25], multi-scale grouping (MSG) and k-nearest neighbors grouping (kNN) are widely adopted techniques, to extract features in Euclidean or topological spaces. Nevertheless, we argue that naively encoding a point cloud, especially a partial cloud, can result in the loss of high-level shape details. In real-world scanning, the quality of a point cloud varies remarkably, which can be sparse, discontinuous, or incomplete. This significantly increases the need for encoding a consistent and robust representation from point clouds with missing regions.

In this paper, we aim to explore the importance of the encoding step in the point cloud completion task by proposing a hierarchical structure to be the encoder for preserving more geometrical details. Worth mentioning, our scheme can be extended to other point cloud tasks other than completion. Our intuition is that a simple encoding procedure, especially for a partial point cloud, is insufficient for recovering shape details. Specifically, to avoid a single point with a higher weight selected by the symmetric function dominating the representative role of a shape, we leverage multiple encoders that focus on different perceptual scales to form a more stable representation. This scheme, as demonstrated in our experiments, achieves better reconstruction and understanding performances when attached to the state-of-the-art (SOTA) methods. More importantly, it can be adopted for more point cloud networks as a plug-and-play module. An overall illustration is demonstrated in Figure 1, where a vanilla encoder constructed with a set abstraction (SA) layer and our hierarchical self-distillation (HSD) based encoder are compared. Similar to MSG, the proposed HSD also encodes features based on a hierarchical structure. However, our method differentiates the features across layers instead of concatenating them.

Therefore, we summarize our contributions as follows:

•

We propose a hierarchical structure concentrating on encoding a more representative latent code for point clouds. This plug-and-play module can be integrated into most point cloud processing networks as an encoder.

•

Experiments on the PCN dataset indicate that the proposed encoding module can simply elevate the baseline SOTA models, further improving the performance of the completion task.

2 Related Work

Figure 2:

2.1 Point Cloud Completion

2.1.1 Supervised.

As the first work of point cloud completion, PCN [25] tackles the problem by explicitly learning the mapping between the sparse input and the dense target in a coarse-to-fine manner. After that, a tremendous number of works followed this strategy. SnowflakeNet [20] implements the point-wise splitting module to learn the offsets between the upsampled points and ground truth points. A global attention module is leveraged to refine the points in each individual generation step. In the same manner, PMP-Net [17] and PMP-Net++ [18] gradually move points along the shortest paths to the target, learning the geometric relationships between the source and target point clouds. Another branch in this category is based on implicitly generating the points. FoldingNet [23] and AtlasNet [5] propose folding the 2D grids into various surfaces representing the particular parts of a shape. TopNet [13] proposes a tree decoder to generate points for representing various geometric structures. However, these works have focused heavily on the decoder design, such that the importance of the encoder has been seriously under-explored. For instance, SnowflakeNet applies the encoder in the same way as PointNet++ [10]. We argue that for partial or incomplete input, the encoder should be considered more carefully.

2.1.2 Un-supervised or self-supervised.

Recently, communities have tended to compensate for sparse point clouds without ground truth coordinates for supervision. ACL-SPC [6] proposed the first self-supervised scheme for point cloud completion. It employs a learnable generator to form a complete target, and a sampler is subsequently utilized to generate a set of synthetic partial point clouds as the input for further closed loops. Without the explicit viewpoint prior, VAPCNet [4] considers viewpoint representations as samples in contrastive learning, where a rotated viewpoint of the current scan is regarded as the positive sample and other novel viewpoints as negative ones. By dividing the partial point cloud into different partitions, P2C [3] first produces an output to match an unseen partition. Further constraints are applied, such as prediction re-encoding, to force the latent code to be consistent with the partial code. This finding inspires us to believe that the encoding from a partial point cloud can be insufficient for representing the latent code of that complete counterpart.

2.2 Point Cloud Encoding

Initially, PointNet [9] trivially encodes features with a global aggregation, which lacks the local awareness of each point. Hierarchy-based feature aggregation was first proposed by PointNet++ [10] and DGCNN [16], i.e., multi-scale grouping (MSG) and kNN, respectively, aiming at extracting more representative local features for classification and segmentation tasks. Such techniques are thereafter commonly utilized in reconstruction tasks, e.g., PU-Net [24] for upsampling, and SnowflakeNet [20] for completion. On the other track, Point Transformer [27] has led a trend of applying self-attention on point clouds, such as PU-Transformer [11] for upsampling and PointAttN [7] for completion. It first encodes local features by the subtractive relations between a query point and it’s neighbors, and then utilize transformer to facilitate the point-wise correspondence between these localized features.

In addition to these foundational techniques widely adopted across various tasks, point cloud understanding methods generally place greater emphasis on encoding diverse geometric representations. CurveNet calculates hypothetical curves based on point-wise features [21]. Similarly, representations such as triangular and umbrella surfaces with orientations are introduced by RepSurf [12]. PointMLP provides evidence that simple residual MLPs are can be also effective for encoding point clouds [8]. Kernel-based methods are also well-studied. KPConv adapts extra kernel points to establish convolution to local geometry [14]. PointConv treats convolution kernels as nonlinear functions to approximate weight and density of the local coordinates [19].

In general, the task of point cloud completion necessitates a more meaningful embedding, given that encoding a partial point cloud can potentially result in less-determinant or even incorrect representations. Motivated by this, our objective is to design a comprehensive point cloud encoder tailored specifically for completion tasks.

3 Method

The encoding step in the network forward procedure is commonly overlooked in traditional point cloud completion methods, which tend to encode with fundamental architectures such as PointNet++ [10] and DGCNN [16]. Thereafter, the back propagation step during the backward process gives rise to holistic optimization via the Chamfer distance (CD) loss to constrain the geometric features in a reasonable domain. In this work, we argue that such simple encoding can be further improved and is important for the point cloud completion task. Therefore, inspired by PointHSD [28, 29] in joint learning for simultaneous point cloud completion and understanding, we introduce the hierarchical self-distillation-based encoder during the forward step to enhance the self-recognition capacity in a self-supervised manner. The difference lies in the fact that PointHSD was proposed to be a post-encoder to further regularize the optimization for the task of sparse point cloud completion and understanding, while our method applies HSD as a comprehensive encoder as an alternative to PointNet++-like encoders to understand local details. The overall architecture is illustrated in Figure 2. Unlike PointHSD, which encodes information for both completion and understanding, our method applies HSD solely for the completion task, demonstrating its effectiveness across various point cloud processing methods. Comparing with MSG that concatenates features in a additive manner, our proposed HSD formulates features in a subtractive way.

3.1 Hierarchical Self-Distillation-based Encoder

Table 1:

Model	Average	Plane	Cabinet	Car	Chair	Lamp	Couch	Table	Boat
FoldingNet [23]	14.31	9.49	15.80	12.61	15.55	16.41	15.97	13.65	14.99
TopNet [13]	12.15	7.61	13.31	10.90	13.82	14.44	14.78	11.22	11.12
AtlasNet [5]	10.85	6.37	11.94	10.10	12.06	12.37	12.99	10.33	10.61
PCN [25]	9.64	5.50	22.70	10.63	8.70	11.00	11.34	11.68	8.59
GRNet [22]	8.83	6.45	10.37	9.45	9.41	7.96	10.51	8.44	8.04
CDN [15]	8.51	4.79	9.97	8.31	9.49	8.94	10.69	7.81	8.05
PMP-Net [17]	8.73	5.65	11.24	9.64	9.51	6.95	10.83	8.72	7.25
NSFA [26]	8.06	4.76	10.18	8.63	8.53	7.03	10.53	7.35	7.48
SnowflakeNet [20]	7.21	4.29	9.16	8.08	7.89	6.07	9.23	6.55	6.40
w. MSG	7.142	4.212	9.243	8.207	7.636	6.131	9.192	6.514	6.401
w. HSD	7.139	4.189	9.242	8.144	7.701	6.102	9.193	6.530	6.396

Table 1: Performance on the PCN dataset in terms of L2-CD (× 1000).

Conventionally, various aggregation operations are leveraged to encode a global or local shape, e.g., pooling and set abstraction. Hence, these methods inevitably ignore particular low-level geometric features. This phenomenon can be further amplified, especially when the input is incomplete. To alleviate the feature degradation caused by aggregation, we borrow the idea of PointHSD [28, 29], which learns the distinction of different local features by maximizing the mutual information I(Z_l;Z_L) among them, where Z_l and Z_L denote the intermediate and deepest latent representations.

Such subtractive operation is on the contrary of the additive MSG encoding strategy used in PointNet++ [10]. However, both of them aim to formulate the representation that encodes the most local details. Specifically, we set the number of neighbors in each layer of the hierarchical encoder to be monotonically increasing as illustrated in Figure 2, which is 8, 16, and 24, respectively. The physical intuition behind it is straightforward: for an incomplete point cloud with largely missing regions, a larger perceptual field corresponds to a richer representation of local neighbor information. Therefore, I(Z_l;Z_L) endows smoothness to the probabilistic distribution using information from the teacher (the last layer), functioning as a feature regularization of the students (the intermediate layers), to alleviate overfitting problem of the entire model.

As the proposed HSD can be integrated into any hierarchy-based structure, it can be substituted for encoders in various point cloud completion methods. Specifically, we apply PointHSD to SnowflakeNet [20] to substitute for its set abstraction modules.

Although only geometric priors are available, it is still possible to represent an object with its latent code. Therefore, we propose to map each code from different aggregation operations at layer l to a probabilistic space by the softmax activation to formulate y_l, such that the distribution y_L in the the deepest layer functions as a fake objective. y_l serves as the supervision to former distributions. Let \(\mathcal {L}_\mathit {KL}\) be the Kullback-Leibler (KL) divergence, the self-distillation loss is practically implemented as:

\begin{equation}\mathcal {L}_\mathit {dis} = \sum _{l=1}^{L-1} \mathcal {L}_\mathit {KL}(y_l, y_L), \end{equation}

(1)

where y_l is the prediction and we set the total layer number L = 3. Note that in PointHSD [28], the predicted label distributions y_l are also compared with the ground truth distribution y_gt, which is unavailable in the completion-only task. Similar to PointHSD, knowledge can therefore flow back via Equation 1 during the forward training step to provide stronger supervision.

3.2 Reconstruction

To validate the effectiveness of the proposed HSD-based encoder, we keep the decoders of the baseline models unchanged. In detail, three splitting-based deconvolution (SPD) modules are integrated to recover a point cloud from coarse to fine. SPD smoothly rearranges the generated points based upon each point in a coarse version, rather than simply shuffling the high-level representations like PU-Net [24] or grid composition like PCN [25]. In general, the network aims to reconstruct the full point cloud \(Q \in \mathbb {R}^N\) from the partial input \(P^{\prime } \in \mathbb {R}^{N^{\prime }}\), where N and N′ are the number of points in the ground truth and partial input, respectively. Specifically, the reconstruction loss is formulated as:

\begin{equation}\mathcal {L}_\mathit {CD} (P,Q) = \frac{1}{|P|} \sum _{p \in P} \underset {q \in Q}{\min } {||p-q||}_2^2 + \frac{1}{|Q|} \sum _{q \in Q} \underset {p \in P}{\min } {||q-p||}_2^2, \end{equation}

(2)

where \({||\cdot ||}_2^2\) denotes the L2-norm of CD. Here, we omit the subscripts in Eq. 2 for simplicity.

3.3 Optimization

Facilitated by self-distillation (Sec. 3.1) and reconstruction (Sec. 3.2), the overall procedure can be optimized by:

\begin{equation}\mathcal {L} = \lambda \sum _{i=1}^{3} {\mathcal {L}_\mathit {CD}}(P_i,Q_i) + \mathcal {L}_\mathit {dis}, P_i \subset P, Q_i \subset Q, \end{equation}

(3)

where λ is set to 1000 empirically; P_i is the completed prediction after the ith SPD in the decoder, matching the size of Q_i; and Q₃ = Q denotes the complete ground truth.

4 Experiment

Figure 3:

4.1 Dataset

Our data are based on a commonly used benchmark, namely PCN [25], a subset derived from ShapeNet [1]. The PCN dataset contains 8 objects, consisting of 28,974 training and 1,200 testing samples in total. We follow the same split and preparation settings as in previous works [17, 18, 20] for fair comparisons. There are 8 different views of an object for training and only 1 view for testing.

4.2 Implementation Detail

Our networks are trained on a server deployed with 10 Nvidia RTX 2080Ti GPUs unless otherwise stated. The batch sizes are 240 for training and 16 for testing. The upsampling ratio of the SnowflakeNet baseline is (1, 4, 8) and remains the same for our advanced variant. Following the criterion of the community, we report the Chamfer distance (CD) as the evaluation metric, i.e., a lower value denotes better performance. Our source Code is available at https://github.com/ky-zhou/PointHSD.

4.3 Quantitative Results

Table 1 shows comparisons of previous works and our different variants based on SnowflakeNet. We directly refer to the results reported in the existing works if not stated otherwise. Next, we conduct experiments for variants using MSG and HSD as substitutes of encoder. It can be observed that 5 out of the 8 individual objects outperform the baseline model with respect to reconstruction error.

Furthermore, the MSG-based encoder outperforms the vanilla SnowflakeNet, indicating that both MSG and HSD improve the encoding capacity through the use of more perceptual fields or information dissociation based on scale. In this scenario, HSD behaves analogously to MSG. However, MSG concatenates features to a larger dimension, thereby increasing the GPU memory required for computation. In contrast, our method calculates the feature differences among multiple layers, which minimally increases the computational cost.

Meanwhile, the self-looping knowledge path facilitates the network’s capacity to transfer information from the deepest layers back to earlier ones, thereby functioning as a stronger regularization term (not only for joint completion and classification tasks [28, 29], but also for the classification-only task). This mechanism is predicated on the notion that a larger perceptual field for an incomplete object inherently encompasses greater informational content.

4.4 Qualitative Visualization

In Figure 3, we present visualizations comparing the outcomes of the original SnowflakeNet and our proposed method. Specifically, we intentionally select two objects with intricate shapes from each category within the PCN dataset [25] to clearly illustrate the distinctions between the two methods. The results demonstrate the effectiveness of the proposed encoding strategy. For example, the gaps or holes present in airplanes and cabinets are more clearly revealed by the incorporation of HSD.

4.5 Ablation Study

Table 2:

One step of PMP-Net	CD
W/o. HSD	12.620
w. HSD	12.596

Table 2: Performance on the PCN dataset in terms of L2-CD × 1000 for one step of PMP-Net. The encoding strategy with HSD slightly outperforms the original setting. However, this indicates that HSD might be inappropriate to such structures, where the encoding processes are isolated among compression and recovery.

The ablation study is naturally available in Table 1, where the model SnowflakeNet can be regarded as the baseline. It can be clearly observed that the proposed HSD encoding module is able to encode more powerful representations than the baseline, resulting in improved reconstruction capacity in the decoding process. Meanwhile, HSD exhibits competitive performance when compared with MSG, underscoring its comparable efficacy in encoding for completion tasks. This parallel effectiveness positions HSD as a noteworthy alternative, emphasizing its potential in tasks related to point cloud completion.

Although SnowflakeNet and PMP-Net are both completion networks, they have fundamental structural differences. Originally, SnowflakeNet behaved as a single autoencoder that consists of an encoder and a decoder such that the completed points are generated in one step directly. In contrast, PMP-Net is composed of three cascaded autoencoders that rearrange the points in three steps continuously. In such cases, HSD cannot be applied to the three encoders to make the latent code consistent, as they encode significantly different shapes. Therefore, we evaluate one step from PMP-Net with the proposed HSD encoding strategy. The results for the PCN dataset are shown in Table 2. More discussion will be explicitly described in the following section.

4.6 Discussion

Figure 4:

As discussed in Section 4.5, the direct application of PMP-Net with an HSD is inadequate. The large difference among low-level shapes leads to messy latent codes, which cannot be accurately perceived by the network. See Figure 4 for details. This suggests the proposed should be applied to hierarchy-based encoders, rather than the cascaded encoding-decoding streams like PMP-Net. This observation also constrains the potential application of proposed method to complicated network structures that encompass more than one encoder, since the dissociated information pattern may become entangled among different encoding stages.

Moreover, the improvement in our classifier-free network is not as pronounced as that achieved by classifier-based methods [28, 29], indicating that encoding incomplete point clouds remains a significant challenge in the field.

Finally, similar to MSG, HSD still requires aggregation to encode the hierarchical features into a latent code, which suffers from information loss principally [2]. As a result, learning the lost information due to aggregation will be a key focus in our future work. The correspondence between points and the global code can be established if a differentiable sampling is leveraged, which has the potential to benefit processes in both seed generation and point upsampling. We plan to explore the information in addition to the aggregated features in the future, which requires more comprehensive point-wise feature encoding.

5 Conclusion

In this paper, we emphasize the importance of encoding in the task of point cloud completion by substituting the vanilla feature extraction module with HSD. The applied HSD aims to transfer the rich knowledge accumulated in the deepest aggregation back to former ones. Our experiment shows that HSD can improve the completion performance by just boosting the hierarchy-based encoder, therefore, HSD is a promising alternative to MSG. Also, we demonstrate the enhancement on point cloud completion by simply improving the encoder, highlighting the potential to apply more advanced encoding strategies in this domain. In future, instead of just learning the dissociation from the latent codes extracted by various scales, we plan to make the sampled points through encoding not only differentiable with respect to the encoded representation, but also with respect to the input.

References

[1]

Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. 2015. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:https://arXiv.org/abs/1512.03012 (2015).

Abstract

1 Introduction

2 Related Work

2.1 Point Cloud Completion

2.1.1 Supervised.

2.1.2 Un-supervised or self-supervised.

2.2 Point Cloud Encoding

3 Method

3.1 Hierarchical Self-Distillation-based Encoder

3.2 Reconstruction

3.3 Optimization

4 Experiment

4.1 Dataset

4.2 Implementation Detail

4.3 Quantitative Results

4.4 Qualitative Visualization

4.5 Ablation Study

4.6 Discussion

5 Conclusion

References

Index Terms

Recommendations

A Padded Encoding Scheme to Accelerate Scans by Leveraging Skew

SPIHT Algorithm Combined with Huffman Encoding

JPEG optimization using an entropy-constrained quantization framework

Comments

Information

Published In

Publisher

Publication History

Check for updates

Author Tags

Qualifiers

Funding Sources

Conference

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

View options

PDF

eReader

Login options

Full Access

Share

Share this Publication link

Share on social media

Affiliations