3D Shape Analysis and Applications from Partial Point Cloud Data

MOHAMMADI, SEYED SABER
2023-03-27

Abstract

This thesis focuses on the use of deep learning models for Three-Dimensional (3D) shape classification and for shape completion in robotic grasping. Unlike 2D images, 3D data provides valuable information about the shape, scale, and geometry of objects and their surrounding environments. Point Cloud Data (PCD) is a popular format for representing 3D data due to its simplicity and its availability from a wide range of sensors. While significant progress has been made in applying deep learning methods to 3D shape analysis with PCDs, the majority of these works do not accurately reflect real-world challenges because they assume that a complete 3D scan of an object is available. Such scans are typically obtained by placing multiple cameras (or moving a single camera) around the object, capturing a PCD from each camera view, and combining the individual single-view PCDs into one complete scan. In many real-world situations, however, only partial scans of objects are available. Consider, for example, an autonomous agent exploring an environment: it must classify objects as soon as it sees them, either because of time constraints or because its motion is restricted to a specific path. Methods that assume complete scans may therefore suffer a performance drop in such scenarios. Moreover, assembling a complete PCD requires registration, that is, estimating the relative transformations between two or more single-view PCDs so that they can be aligned, and accurate relative poses are difficult to obtain in practice. These observations motivated us to focus this thesis on analysing 3D shapes using only partial PCDs, in order to better address real-world challenges.

Given a set of partial PCDs of an object captured from multiple viewpoints, we first address the problem of efficiently aggregating such multi-view information to improve the accuracy of 3D shape classification. We propose a novel multi-view shape classifier, PointView-GCN, which aggregates multiple single-view shape features using multi-level Graph Convolutional Networks (GCNs). Our approach produces a more descriptive global shape feature that improves classification accuracy. To facilitate this study, we also created two novel single-view PCD datasets for model training and method evaluation.

We then address a more challenging scenario for shape classification in which only a partial PCD of an object, captured from a single viewpoint, is available. We propose a novel method that uses a generative model (a Conditional Variational Auto-Encoder, or CVAE) to "hallucinate" the shape of the unseen parts of the object, and then combines the generated views with a multi-level GCN to improve classification accuracy. Experiments on our single-view datasets show that the proposed method outperforms the best single-view PCD-based methods.

To further improve the accuracy of our single-view shape classifier, we introduce a Teacher-Student architecture in which the multi-view Teacher combines shape features from 3D scans of an object taken from multiple viewpoints, whereas the single-view Student only has access to the shape feature of a single-view PCD. During training, the Student learns from the Teacher and thus gains access to information that is unavailable at test time, allowing it to make more accurate predictions. Our experiments and analysis show that the Teacher network creates an aggregated global shape feature that performs comparably to state-of-the-art models on synthetic and real-world datasets, and that retraining the Teacher's backbone network outperforms the best state-of-the-art method on 3D datasets.
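As an illustration of the training scheme just described, the following is a minimal sketch of feature-level Teacher-Student distillation in PyTorch. All names (teacher_net, student_net, classifier) and the choice of losses (cross-entropy plus a mean-squared feature-imitation term) are illustrative assumptions, not the exact architecture or objectives used in the thesis.

    import torch
    import torch.nn.functional as F

    def distillation_step(teacher_net, student_net, classifier,
                          views, labels, alpha=0.5):
        # views: (B, V, N, 3) batch of V single-view PCDs per object.
        # The frozen Teacher aggregates all V views into one global
        # shape feature; the Student sees only a single view.
        with torch.no_grad():
            t_feat = teacher_net(views)              # (B, D) multi-view feature
        s_feat = student_net(views[:, 0])            # (B, D) single-view feature
        logits = classifier(s_feat)
        loss_cls = F.cross_entropy(logits, labels)   # classification task loss
        loss_kd = F.mse_loss(s_feat, t_feat)         # imitate the Teacher's feature
        return loss_cls + alpha * loss_kd

At test time only student_net and classifier are needed, so the extra multi-view supervision adds no cost at inference.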
Finally, we demonstrate the application of single-view shape analysis to robotic grasping. Grasping from a single-view PCD is challenging because the partial observation of the object leads to inaccurate grasp poses. We therefore design a pipeline that first completes the partial PCD, using a novel shape completion network based on a Transformer encoder-decoder with an Offset-Attention layer. The completed PCDs allow more accurate grasp pose candidates to be generated. Experiments show that our method outperforms state-of-the-art methods on PCD completion tasks and greatly improves the grasping success rate in real-world scenarios.
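For concreteness, here is a minimal sketch of an offset-attention block in PyTorch, in the spirit of the Point Cloud Transformer: self-attention over per-point features, with the residual applied to the offset between the input and the attended features. The layer sizes and normalisation choices are assumptions for illustration; the thesis's completion network embeds such layers in a full Transformer encoder-decoder.

    import torch
    import torch.nn as nn

    class OffsetAttention(nn.Module):
        # Self-attention over points; the residual is taken on the
        # offset (input minus attended features) after a linear-BN-ReLU.
        def __init__(self, dim):
            super().__init__()
            self.q = nn.Linear(dim, dim // 4, bias=False)
            self.k = nn.Linear(dim, dim // 4, bias=False)
            self.v = nn.Linear(dim, dim, bias=False)
            self.lbr = nn.Sequential(
                nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU())

        def forward(self, x):                        # x: (B, N, dim)
            q, k, v = self.q(x), self.k(x), self.v(x)
            scale = q.shape[-1] ** 0.5
            attn = torch.softmax(q @ k.transpose(1, 2) / scale, dim=-1)
            xa = attn @ v                            # attended features (B, N, dim)
            offset = (x - xa).reshape(-1, x.shape[-1])
            return x + self.lbr(offset).reshape(x.shape)

Note that the original offset-attention formulation normalises the attention map with a softmax over one axis and an L1 norm over the other; standard scaled softmax is used here for simplicity.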
Files in this item:

File: phdunige_4298449.pdf
Access: open access
Type: Doctoral thesis
Size: 40.26 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11567/1109463