
1 Introduction

In this paper, we propose a 3D CNN framework for classifying 3D objects. Our framework consumes point cloud data directly and does not require conversion to other 3D representations. We propose a novel, margin-based method to represent the point cloud data: a minimum-volume bounding box is constructed around the point cloud, and the resulting representation is fed to our model. Humans are generally good at classifying and recognizing objects, but doing so algorithmically is a much harder task for computers. With the advent of affordable 3D scanners, 3D representations of data have become ubiquitous; however, indexing this data requires semantic understanding of the objects involved. Traditional methods tackle 3D object classification using various handcrafted geometric features, and the results vary according to the features selected. Deep learning techniques have become pervasive and have found wide application in image classification, image-caption generation, video classification, action recognition in videos, etc. These methods and frameworks have been extended to 3D data as well: some frameworks consider 3D CAD models, while others use meshed representations as their input. We propose a deep learning framework based on 3D CNNs that classifies 3D objects directly from point cloud data, towards which we make the following contributions:

  1.

    We propose 3D-GCNN, a 3D CNN based deep learning model that consumes point clouds and converts them into a novel margin-based occupancy grid representation.

  2.

    We propose a 3D CNN architecture that has fewer trainable parameters yet still achieves convergence and high accuracy.

  3.

    We demonstrate our proposed framework on the ModelNet10 and ModelNet40 datasets and show that we achieve results better than or comparable to existing state-of-the-art methods.

2 Related Work

The task of 3D object classification can be divided into two categories - methods that use hand-crafted features and methods that employ deep learning. Among the hand-crafted approaches, many have been proposed; they mainly utilize various geometric features such as spin images [4], 3D shape context [1], point-feature histograms [7] and viewpoint feature histograms [8]. Authors in [2] propose to use metric tensors and Christoffel symbols as novel geometric features and use a simple SVM classifier to perform 3D object categorization. Deep learning techniques, on the contrary, do not need any hand-crafted features. Authors in [11] propose a deep learning architecture in which meshed models are represented as binary tensors, with 1 indicating that the mesh occupies a voxel and 0 indicating its absence. Authors in [9] propose an architecture where 3D models are represented as panoramic images and a variation of the CNN is proposed that learns from such representations. All the works mentioned above either operate on 3D meshes or CAD models, or involve representing 3D shapes in more manageable ways before feeding them to deep learning frameworks. However, with the advent of inexpensive 2.5D scanners (e.g., Microsoft Kinect), the most direct representation of 3D models is point cloud data. Following this line of thought, authors in [6] propose a deep learning framework that consumes point clouds directly for 3D object recognition and segmentation. Taking this approach further, authors in [3] propose a novel method to represent point clouds using a density occupancy grid representation. Inspired by these works, we propose a refined approach to construct density occupancy grids and feed this information to our deep learning framework. Extensive experiments are performed on the ModelNet10 and ModelNet40 datasets to show that our framework outperforms or performs comparably to existing state-of-the-art methods.

3 Proposed Methodology

In this section we explain in detail the proposed methodology for 3D object classification. The proposed framework is depicted in Fig. 1.

3.1 Preprocessing Input Data

The proposed framework is evaluated on the well-known ModelNet dataset provided by [11]. The dataset consists of 662 object categories with a total of 127,915 models. Subsets of these models, called ModelNet10 and ModelNet40, are benchmark datasets for 3D object classification and recognition tasks. The dataset consists of CAD models in mesh representation, so we propose a pre-processing step to convert the CAD models into point clouds. The CAD models are sampled to obtain a sparse point cloud representation that can be fed directly to our 3D-GCNN model. The conversion performs random mesh sampling using the barycentric coordinate system of the triangle meshes: points are generated inside each triangle as a linear combination of its vertices weighted by random numbers. Consider the three vertices of a triangle, P, Q and R. We draw two random numbers u and v uniformly from [0, 1] (reflecting the pair as u ← 1 - u, v ← 1 - v whenever u + v > 1 so that all weights remain non-negative) and compute the third weight as

$$\begin{aligned} w = 1 - (u+v) \end{aligned}$$
(1)

The resulting barycentric weights are multiplied with the coordinates of the corresponding vertices and summed, so the sampled point is given by \(uP + vQ + wR\). In this manner each CAD model is sampled to obtain 2048 points.
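
As an illustration, the following NumPy sketch shows one way this random mesh sampling step could be implemented. The area-weighted choice of triangles and the helper name `sample_mesh` are our assumptions for the sake of a runnable example and are not prescribed by the text above.

```python
import numpy as np

def sample_mesh(vertices, faces, n_points=2048):
    """Sample a sparse point cloud from a triangle mesh via barycentric weights.

    A sketch under assumed inputs: vertices is a (V, 3) float array,
    faces is an (F, 3) integer array of vertex indices.
    """
    # Pick triangles with probability proportional to their area so the cloud
    # is not biased towards regions tessellated with many small faces.
    tris = vertices[faces]                        # (F, 3, 3): P, Q, R per face
    p, q, r = tris[:, 0], tris[:, 1], tris[:, 2]
    areas = 0.5 * np.linalg.norm(np.cross(q - p, r - p), axis=1)
    idx = np.random.choice(len(faces), size=n_points, p=areas / areas.sum())

    # Draw u, v uniformly in [0, 1]; reflect the pair when u + v > 1 so that
    # all weights stay non-negative, then set w = 1 - (u + v) as in Eq. (1).
    u = np.random.rand(n_points, 1)
    v = np.random.rand(n_points, 1)
    flip = (u + v) > 1.0
    u[flip], v[flip] = 1.0 - u[flip], 1.0 - v[flip]
    w = 1.0 - (u + v)

    # Barycentric combination of the triangle vertices: uP + vQ + wR.
    return u * p[idx] + v * q[idx] + w * r[idx]
```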

3.2 Margin Based Occupancy Grid

In this section we explain the refined occupancy grid approach that we have formulated to represent the input point cloud data. Occupancy grids are a data structure [10] that provides an efficient and compact representation of the volumetric space [3]. Occupancy grids offer good 3D shape cues that can be learned for various tasks. They can be formed using two approaches -

  1.

    Binary Grids - A binary tensor is defined whose entries are 1 or 0 depending on whether the corresponding voxel is occupied.

  2.

    Density Grids - Each voxel of the grid stores the density of the points that fall inside it.

In our formulation we use density grids; however, unlike the work in [3], where the density grids have a fixed leaf and grid size, we propose a margin-based density grid formulation with a variable voxel leaf size. The procedure to form the occupancy grids can be summarized as follows:

  1.

    We calculate the centroid of the 3D model and draw a margin based on the farthest and nearest points with respect to the centroid. All the points are translated with respect to this margin to form the bounding box.

  2.

    The three principal axes are segmented equally with respect to the bounding box limits. This allows for a variable voxel leaf size, which is set based on the largest segmented axis.

  3.

    The binning of the points is done using the search and sort technique [5]. Each point is mapped to a voxel, and voxels beyond the specified grid size are clipped to maintain uniformity.

    The size of the voxel grid is heuristically set to \(15\times 15\times 15\) as opposed to \(30\times 30\times 30\) as in [3]. This modification is in accordance with our cascaded 3D CNN architecture, which accepts low-resolution point clouds and can be trained on a CPU. Figure 2 shows the conversion of CAD models to occupancy grids.
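
A minimal NumPy sketch of this margin-based voxelization is given below. The centroid-centred margin and the normalization of voxel counts into a density value are our reading of the procedure and may differ in detail from the actual implementation.

```python
import numpy as np

def density_occupancy_grid(points, grid_size=15):
    """Map a point cloud of shape (N, 3) to a margin-based density occupancy grid."""
    # Centre the cloud at its centroid; the margin (bounding box) is derived
    # from the extreme points along each axis.
    centred = points - points.mean(axis=0)
    lo, hi = centred.min(axis=0), centred.max(axis=0)

    # Segment the axes equally within the bounding-box limits; the variable
    # leaf size is driven by the largest extent.
    leaf = (hi - lo).max() / grid_size
    idx = np.floor((centred - lo) / leaf).astype(int)
    # Clip indices falling outside the fixed grid to maintain uniformity.
    idx = np.clip(idx, 0, grid_size - 1)

    # Accumulate point counts per voxel and normalise to a density in [0, 1].
    grid = np.zeros((grid_size, grid_size, grid_size), dtype=np.float32)
    np.add.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)
    return grid / max(grid.max(), 1.0)
```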

Fig. 1.

An overview of the proposed 3D classification framework.

Fig. 2.

The CAD models which are in mesh representation are first converted to low resolution point cloud data using random mesh sampling technique (Left). The point clouds are later mapped to voxels (Right) of variable leaf size by forming density occupancy grids (Middle).

Fig. 3.

Our network architecture, which we call 3D-GCNN, consists of cascaded 3D CNNs with fewer trainable parameters and can be trained on local CPU machines with performance comparable to existing state-of-the-art methods.

3.3 Network Architecture

In this section we describe our 3D CNN architecture, which consumes low-resolution point cloud data in the form of density occupancy grids and performs classification. CNNs have had tremendous success in various image-related tasks, and the same idea has been extended to 3D object recognition and classification [3, 6]. The architecture is shown in Fig. 3. Our network, which we refer to as 3D-GCNN, consists of 2 convolutional layers, a max-pooling layer, a dropout layer and, finally, dense layers with a softmax to obtain classification scores. The 3D CNN architecture is designed to have fewer trainable parameters and to accept low-resolution point clouds so that it can be trained and tested on a local machine with no GPU compute while still achieving performance comparable to state-of-the-art work. The main layers along with their parameters are described below:

  1.

    Input Layer - The input layer of the 3D-GCNN accepts a low-resolution point cloud and outputs the density-occupancy-grid-based voxels of size \(15\times 15\times 15\). This serves as the input to the first convolutional layer.

  2.

    Convolutional Layer 1 - The first convolutional layer has 32 filters, each of size \(3\times 3\times 3\), with stride 1. The output of this layer is \(13\times 13\times 13\times 32\). This layer is followed by a ReLU activation given by \(f(x_i) = \max (0,x_i)\). The output activation maps are computed as

    $$\begin{aligned} f_1 = \max (0, W_1^T * x_i + B_1) \end{aligned}$$
    (2)
  3.

    Convolutional Layer 2 - The second convolutional layer also has 32 filters of size \(3\times 3\times 3\) with a stride of 1. The output of this layer is \(11\times 11\times 11\times 32\). The output activation maps are again computed using Eq. 2.

  4.

    Max-Pooling Layer - The max-pooling layer acts as a global feature aggregator: the features extracted by the convolutional layers are pooled here. The kernel size of the max-pooling layer is \(2\times 2\times 2\) with a stride of 2, and the output is \(5\times 5\times 5\times 32\). We apply dropout with a rate of 0.5 after the max-pooling layer to avoid overfitting.

  5.

    Dense Layers and Softmax Classification - The activation maps are flattened before being fed to the dense layers. We use 128 hidden units with the ReLU activation function. The output layer is a softmax classifier consisting of K units, where K is the number of object classes. The 3D-GCNN is trained with a categorical cross-entropy loss function.
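
As a concrete illustration, the layer stack described above can be written down in a few lines of Keras-style code. The choice of Keras and the helper name `build_3d_gcnn` are our assumptions; the layer sizes follow the text.

```python
from tensorflow.keras import layers, models

def build_3d_gcnn(num_classes):
    """A sketch of the 3D-GCNN layer stack described in the text."""
    return models.Sequential([
        # Input: 15x15x15 density occupancy grid with a single channel.
        layers.Conv3D(32, kernel_size=3, strides=1, activation="relu",
                      input_shape=(15, 15, 15, 1)),              # -> 13x13x13x32
        layers.Conv3D(32, kernel_size=3, strides=1,
                      activation="relu"),                         # -> 11x11x11x32
        layers.MaxPooling3D(pool_size=2, strides=2),              # -> 5x5x5x32
        layers.Dropout(0.5),                                      # regularization
        layers.Flatten(),                                         # 4000 features
        layers.Dense(128, activation="relu"),                     # hidden layer
        layers.Dense(num_classes, activation="softmax"),          # K-way output
    ])
```

With \(K = 10\) output classes this stack has 541,994 trainable parameters, matching the count reported in the experiments section.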

The weights are refined by experimenting with the Adadelta and Adam optimizers. The Adadelta optimizer was set with a learning rate of 1.0 and a decay factor of 0.95, with no learning-rate decay, while for the Adam optimizer the learning rate was set to 0.001 with exponential decay rates of 0.9 and 0.999 for the first- and second-moment estimates, respectively. From our experiments we found that Adadelta generalized well, while using Adam allowed the network to converge quickly.
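
For completeness, a hedged sketch of the corresponding training configuration follows; Keras is an assumed framework, and the reported "momentum" is interpreted as Adadelta's decay factor.

```python
from tensorflow.keras import optimizers

# Optimizer settings as described above (our interpretation, not the authors' code).
adadelta = optimizers.Adadelta(learning_rate=1.0, rho=0.95)   # no learning-rate decay
adam = optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

model = build_3d_gcnn(num_classes=10)          # e.g. ModelNet10
model.compile(optimizer=adadelta,              # or `adam` for faster convergence
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```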

Fig. 4.

A visualization of intermediate activation maps learning the features of a chair model.

4 Experiments and Results

In this section we describe the various experiments conducted on the ModelNet dataset. We use the same train/test split provided with ModelNet. We further divide the training set into training and cross-validation sets of 80% and 20%, respectively, for both the ModelNet10 and ModelNet40 datasets. The network was trained for 120 epochs on both datasets before the model converged. The experiments done on a local machine with no GPU support took about 2 h 40 min to fully train for both datasets, while the same model migrated to an Nvidia Quadro K5000 GPU machine (Ubuntu 16.04, 64 GB RAM, Intel Xeon processor) took about 20 and 40 min for the ModelNet10 and ModelNet40 datasets, respectively. The per-category accuracy of our model is comparable to state-of-the-art methods even without data augmentation. Our 3D-GCNN architecture achieves a per-category test accuracy of 89.1% and 83.5% on the ModelNet10 and ModelNet40 datasets, respectively. Table 1 summarizes the comparison of our model with existing methods. From the table it is clear that our proposed framework outperforms the methods in [3, 9, 11] and is comparable to the work of [6]. The lower accuracy can be attributed to the way CNNs extract features and perform classification: the presence of similar-looking objects such as desk and table affects the performance of the model. Authors in [3] show that neither making the network deeper nor employing a balanced dataset led to any increase in performance; thus our 3D-GCNN model, with relatively few trainable parameters (541,994), performs better than deeper network models.

Table 1. A comparison with state-of-the-art methods suggests that our framework performs comparably to existing methods.

5 Conclusion

In this paper we propose 3D-GCNN, a 3D CNN based object classification framework that uses margin-based density occupancy grids with variable leaf size to consume point clouds directly as input and perform 3D object classification. Extensive experiments performed on standard datasets show that our framework performs comparably to existing state-of-the-art methods.