1 Introduction
Studies on point clouds have emerged rapidly in recent years, falling primarily into two categories: reconstruction and understanding. PointNet [9] and PointNet++ [10] established the foundation for point cloud understanding, while PCN [25] and PU-Net [24] formulated the pipeline for point cloud reconstruction (including completion and upsampling). Their common property is that the point cloud (either partial or complete) is always encoded into a latent code carrying high-level information before the subsequent understanding or reconstruction modules. These pioneering encoding efforts primarily apply simple aggregation operations, because encoding orderless point clouds requires order invariance, which is generally achieved by a symmetric function for aggregation, e.g., max pooling, summation, or a global set abstraction layer. As such, the resulting information loss [2] inevitably prevents these methods from achieving better shape awareness or point recovery.
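To make this order-invariance requirement concrete, the following minimal PyTorch sketch (our own illustration, not code from any cited work) shows that a symmetric aggregation such as max pooling yields the same latent code for any ordering of the input points:

import torch

# Toy point cloud: N = 4 points with C = 3 per-point features.
points = torch.tensor([[0.1, 0.2, 0.3],
                       [0.9, 0.1, 0.4],
                       [0.5, 0.7, 0.2],
                       [0.3, 0.8, 0.6]])

# Max pooling over the point dimension is a symmetric function:
# permuting the rows leaves the aggregated code unchanged.
code = points.max(dim=0).values

perm = torch.randperm(points.shape[0])
code_permuted = points[perm].max(dim=0).values

assert torch.equal(code, code_permuted)  # order invariance holds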
Generally, point cloud completion serves as a pre-processing step for downstream tasks, including autonomous driving, virtual reality (VR), augmented reality (AR), industrial manufacturing, and robotic grasping, most of which depend heavily on the local geometric details of the recovered object. In autonomous driving, it enhances the perception systems of self-driving cars by accurately reconstructing occluded or incomplete 3D data, ensuring safer navigation. In VR and AR, it enables more realistic and immersive experiences by reconstructing detailed virtual environments. In healthcare, it aids in creating complete 3D models from partial scans, facilitating better diagnosis and treatment planning. Additionally, in robotics, it improves the ability of robots to understand and interact with their environment, enabling precise manipulation and obstacle avoidance. These applications demonstrate the significant potential of deep learning-based point cloud completion for advancing technology and improving efficiency and safety across various domains.
In deep learning-based point cloud processing algorithms, covering both understanding [10, 16] and reconstruction [11, 20, 24, 25], multi-scale grouping (MSG) and k-nearest neighbors (kNN) grouping are widely adopted techniques for extracting features in Euclidean or topological spaces. Nevertheless, we argue that naively encoding a point cloud, especially a partial one, can result in the loss of high-level shape details. In real-world scanning, the quality of a point cloud varies remarkably: it can be sparse, discontinuous, or incomplete. This significantly increases the need for encoding a consistent and robust representation from point clouds with missing regions.
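For reference, the kNN grouping mentioned above can be sketched in a few lines of PyTorch; the helper name and tensor shapes here are illustrative assumptions rather than code from the cited methods:

import torch

def knn_group(xyz: torch.Tensor, k: int) -> torch.Tensor:
    """Group each point with its k nearest neighbors in Euclidean space.

    xyz: (N, 3) point coordinates; returns (N, k, 3) local neighborhoods.
    """
    dist = torch.cdist(xyz, xyz)                # (N, N) pairwise distances
    idx = dist.topk(k, largest=False).indices   # (N, k) nearest indices
    return xyz[idx]                             # (N, k, 3)

xyz = torch.rand(2048, 3)
neighborhoods = knn_group(xyz, k=16)            # one perceptual scale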
In this paper, we explore the importance of the encoding step in the point cloud completion task by proposing a hierarchical structure as the encoder to preserve more geometric details. Notably, our scheme can be extended to point cloud tasks other than completion. Our intuition is that a simple encoding procedure, especially for a partial point cloud, is insufficient for recovering shape details. Specifically, to avoid a single point with a higher weight, selected by the symmetric function, dominating the representation of a shape, we leverage multiple encoders that focus on different perceptual scales to form a more stable representation. This scheme, as demonstrated in our experiments, achieves better reconstruction and understanding performance when attached to state-of-the-art (SOTA) methods. More importantly, it can be adopted in more point cloud networks as a plug-and-play module. An overall illustration is provided in Figure 1, where a vanilla encoder constructed with a set abstraction (SA) layer is compared with our hierarchical self-distillation (HSD) based encoder. Similar to MSG, the proposed HSD also encodes features based on a hierarchical structure. However, our method differentiates the features across layers instead of concatenating them.
Therefore, we summarize our contributions as follows:
• We propose a hierarchical structure that concentrates on encoding a more representative latent code for point clouds. This plug-and-play module can be integrated into most point cloud processing networks as an encoder.
• Experiments on the PCN dataset indicate that the proposed encoding module readily elevates the baseline SOTA models, further improving the performance of the completion task.
3 Method
The encoding step in the forward procedure of the network is commonly overlooked in traditional point cloud completion methods, which tend to encode with fundamental architectures such as PointNet++ [10] and DGCNN [16]. Thereafter, the back-propagation step gives rise to holistic optimization via the Chamfer distance (CD) loss, constraining the geometric features to a reasonable domain. In this work, we argue that such simple encoding can be further improved and is important for the point cloud completion task. Therefore, inspired by PointHSD [28, 29] in joint learning for simultaneous point cloud completion and understanding, we introduce a hierarchical self-distillation-based encoder in the forward step to enhance the self-recognition capacity in a self-supervised manner. The difference lies in the fact that PointHSD was proposed as a post-encoder to further regularize the optimization for sparse point cloud completion and understanding, whereas our method applies HSD as a comprehensive encoder, an alternative to PointNet++-like encoders, to capture local details. The overall architecture is illustrated in Figure 2. Unlike PointHSD, which encodes information for both completion and understanding, our method applies HSD solely for the completion task, demonstrating its effectiveness across various point cloud processing methods. Compared with MSG, which concatenates features in an additive manner, our proposed HSD formulates features in a subtractive way.
3.1 Hierarchical Self-Distillation-based Encoder
Conventionally, various aggregation operations are leveraged to encode a global or local shape, e.g., pooling and set abstraction. These methods inevitably ignore particular low-level geometric features, a phenomenon that is further amplified when the input is incomplete. To alleviate the feature degradation caused by aggregation, we borrow the idea of PointHSD [28, 29], which learns the distinction between different local features by maximizing the mutual information \(I(Z_l; Z_L)\) among them, where \(Z_l\) and \(Z_L\) denote the intermediate and deepest latent representations, respectively.
Such a subtractive operation is the opposite of the additive MSG encoding strategy used in PointNet++ [10]; however, both aim to formulate the representation that encodes the most local details. Specifically, we set the number of neighbors in each layer of the hierarchical encoder to be monotonically increasing, as illustrated in Figure 2: 8, 16, and 24, respectively. The physical intuition behind this is straightforward: for an incomplete point cloud with large missing regions, a larger perceptual field corresponds to a richer representation of local neighbor information. Therefore, \(I(Z_l; Z_L)\) endows the probabilistic distribution with smoothness using information from the teacher (the last layer), functioning as a feature regularizer for the students (the intermediate layers) to alleviate overfitting of the entire model.
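A minimal sketch of such a hierarchical encoder is shown below, assuming a PointNet-style shared MLP and symmetric max pooling per layer; the feature width and MLP depth are our own illustrative choices, while the neighbor counts 8, 16, and 24 follow the description above:

import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Hierarchical encoder with a monotonically growing perceptual field."""

    def __init__(self, ks=(8, 16, 24), dim=128):
        super().__init__()
        self.ks = ks
        # One shared point-wise MLP per hierarchy level (illustrative sizes).
        self.mlps = nn.ModuleList([
            nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in ks
        ])

    def forward(self, xyz):                              # xyz: (N, 3)
        codes = []
        for k, mlp in zip(self.ks, self.mlps):
            # Group k nearest neighbors (k = 8, 16, 24 across layers).
            idx = torch.cdist(xyz, xyz).topk(k, largest=False).indices
            grouped = xyz[idx]                           # (N, k, 3)
            # Symmetric pooling over the neighborhood, then over all points.
            local = mlp(grouped).max(dim=1).values       # (N, dim)
            codes.append(local.max(dim=0).values)        # Z_l: (dim,)
        return codes                                     # [Z_1, ..., Z_L]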
As the proposed HSD can be integrated into any hierarchy-based structure, it can substitute for the encoders of various point cloud completion methods. Specifically, we apply PointHSD to SnowflakeNet [20] to substitute for its set abstraction modules.
Although only geometric priors are available, it is still possible to represent an object with its latent code. Therefore, we propose to map each code from the different aggregation operations at layer \(l\) to a probabilistic space via the softmax activation to formulate \(y_l\), such that the distribution \(y_L\) in the deepest layer functions as a fake objective: \(y_L\) serves as the supervision for the former distributions. Let \(\mathcal{L}_\mathit{KL}\) be the Kullback-Leibler (KL) divergence; the self-distillation loss is practically implemented as:
\[\mathcal{L}_\mathit{SD} = \sum_{l=1}^{L-1} \mathcal{L}_\mathit{KL}\left(y_L \,\Vert\, y_l\right), \quad (1)\]
where \(y_l\) is the prediction and we set the total layer number \(L = 3\). Note that in PointHSD [28], the predicted label distributions \(y_l\) are also compared with the ground-truth distribution \(y_{gt}\), which is unavailable in the completion-only task. Similar to PointHSD, knowledge can therefore flow back via Equation 1 during the forward training step to provide stronger supervision.
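The PyTorch sketch below shows one plausible reading of Equation 1, with softmax mapping the latent codes to distributions and the deepest layer acting as a detached teacher; the KL direction and the sum reduction are our assumptions:

import torch.nn.functional as F

def self_distillation_loss(codes):
    """Eq. (1) sketch: distill the deepest distribution y_L into each
    intermediate y_l (total layer number L = 3 in our setting)."""
    y_teacher = F.softmax(codes[-1], dim=-1).detach()    # y_L, no gradient
    loss = codes[0].new_zeros(())
    for z in codes[:-1]:                                 # students, l < L
        log_y_l = F.log_softmax(z, dim=-1)               # log y_l
        # F.kl_div(input, target) computes KL(target || input-distribution).
        loss = loss + F.kl_div(log_y_l, y_teacher, reduction="sum")
    return loss

Detaching the teacher ensures the gradient only regularizes the student layers, matching the regularization role described above.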
3.2 Reconstruction
To validate the effectiveness of the proposed HSD-based encoder, we keep the decoders of the baseline models unchanged. In detail, three splitting-based deconvolution (SPD) modules are integrated to recover a point cloud from coarse to fine. SPD smoothly rearranges the generated points around each point of the coarse version, rather than simply shuffling high-level representations like PU-Net [24] or composing grids like PCN [25]. In general, the network aims to reconstruct the full point cloud \(Q \in \mathbb{R}^{N \times 3}\) from the partial input \(P^{\prime} \in \mathbb{R}^{N^{\prime} \times 3}\), where \(N\) and \(N^{\prime}\) are the numbers of points in the ground truth and the partial input, respectively. Specifically, the reconstruction loss is formulated as:
\[\mathcal{L}_\mathit{CD}(P, Q) = \frac{1}{|P|} \sum_{p \in P} \min_{q \in Q} \|p - q\|_2^2 + \frac{1}{|Q|} \sum_{q \in Q} \min_{p \in P} \|q - p\|_2^2, \quad (2)\]
where \({\|\cdot\|}_2^2\) denotes the squared L2-norm, i.e., we use the L2 version of the CD. Here, we omit the stage subscripts in Eq. 2 for simplicity.
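For completeness, a brute-force sketch of the squared-L2 Chamfer distance in Eq. 2 follows; real completion codebases typically use an optimized CUDA kernel, so this is only a readable reference:

import torch

def chamfer_distance_l2(pred, gt):
    """Squared-L2 Chamfer distance between a predicted cloud (M, 3)
    and the ground truth (N, 3), mirroring Eq. (2)."""
    dist = torch.cdist(pred, gt).pow(2)         # (M, N) squared distances
    # Nearest ground-truth point per prediction, and vice versa.
    return dist.min(dim=1).values.mean() + dist.min(dim=0).values.mean()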
3.3 Optimization
Facilitated by self-distillation (Sec. 3.1) and reconstruction (Sec. 3.2), the overall procedure can be optimized by:
\[\mathcal{L} = \sum_{i=1}^{3} \mathcal{L}_\mathit{CD}(P_i, Q_i) + \lambda\, \mathcal{L}_\mathit{SD}, \quad (3)\]
where \(\lambda\) is set to 1000 empirically; \(P_i\) is the completed prediction after the \(i\)th SPD in the decoder, matching the size of \(Q_i\); and \(Q_3 = Q\) denotes the complete ground truth.
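Putting the pieces together, the overall objective of Eq. 3 could be sketched as below, reusing the chamfer_distance_l2 and self_distillation_loss sketches above; how the downsampled ground truths \(Q_i\) are produced is outside the scope of this sketch:

def total_loss(preds, gts, codes, lam=1000.0):
    """Eq. (3) sketch: multi-stage CD terms plus the weighted SD term.

    preds: SPD outputs [P_1, P_2, P_3]; gts: matching-size [Q_1, Q_2, Q_3]
    with Q_3 = Q; codes: per-layer latent codes from the encoder.
    """
    cd = sum(chamfer_distance_l2(p, q) for p, q in zip(preds, gts))
    return cd + lam * self_distillation_loss(codes)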
5 Conclusion
In this paper, we emphasize the importance of encoding in the point cloud completion task by substituting the vanilla feature extraction module with HSD. The applied HSD aims to transfer the rich knowledge accumulated in the deepest aggregation back to the former layers. Our experiments show that HSD can improve completion performance by boosting only the hierarchy-based encoder; HSD is therefore a promising alternative to MSG. We also demonstrate that point cloud completion can be enhanced simply by improving the encoder, highlighting the potential of applying more advanced encoding strategies in this domain. In future work, instead of merely learning the dissociation among the latent codes extracted at various scales, we plan to make the points sampled during encoding differentiable not only with respect to the encoded representation but also with respect to the input.