Abstract
In this paper we propose to solve the problem of 3D Object Classification on point cloud data. We propose a 3D CNN architecture which we call 3D-GCNN that consumes point cloud data directly and performs classification. We present a novel method to represent the point cloud data using a margin based density occupancy grids which creates a minimum volume bounding box around the point cloud data. This information is fed to our proposed 3D CNN model which has far lesser trainable parameters and ensures convergence. We demonstrate our results on the ModelNet 10 and ModelNet 40 models and show we achieve better or comparable performance compared to other methods.
You have full access to this open access chapter, Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
In this paper, we propose a 3D CNN framework to classify 3D objects. Our framework is able to consume point cloud data directly and doesn’t require conversion to other 3D representations. We propose a novel method to represent the point cloud data based on a margin-based method that constructs a minimum volume bounding box around the point cloud which is then fed to our model. Human beings are generally better at classifying and recognizing objects, however it is much harder task for computers to do it algorithmically. With the advent of affordable 3D scanners, 3D representation of data has become ubiquitous, however indexing of this data requires semantic understanding of the objects involved. Traditional methods use various geometric handcrafted features to tackle the problem of 3D object classification, the results vary according to the features that are selected. Deep learning techniques have become pervasive and have found immense applications in the field of image classification, image-caption generation, video classification, action recognition in videos etc. These methods and frameworks have been extended to 3D data as well, some frameworks consider 3D CAD models, while some use meshed representations as their input data. We propose a deep learning framework based on 3D CNNs to classify 3D objects directly based on point cloud data, towards this we make the following contributions:
-
1.
We propose 3D-GCNN, a 3D CNN based deep learning model that consumes point clouds and converts them into novel margin-based occupancy grid representation.
-
2.
We propose a 3D CNN architecture which has less trainable parameters but still achieves convergence and high accuracy.
-
3.
We demonstrate our proposed framework on ModelNet10 and ModelNet40 dataset and show that we achieve better or comparable results to existing state of the art methods.
2 Related Work
The task of 3D object classification can be divided into two categories - techniques that use hand-crafted features and techniques that employ deep learning techniques. For the hand-crafted techniques many approaches have been proposed, they mainly utilize various geometric features like spin images [4], 3D shape context [1], point-feature histograms [7] and viewpoint feature histograms [8]. Authors in [2] propose to use Metric tensors and Christoffel symbols as novel geometric features and use a simple SVM classifier to perform 3D object categorization. On the contrary, deep learning techniques do not need any hand-crafted features, authors in [11] propose a deep learning architecture where the meshed models are represented as a binary tensors with 1 representing the mesh inside a voxel grid and 0 representing its absence. Authors in literature [9] propose an architecture where 3D models are represented as panoramic images and a variation of CNN is proposed that learns from such representations. All the works mentioned in the literature either work on 3D meshes, CAD models or involves representing 3D shapes in more manageable ways and then feeding them to deep learning frameworks. However, with the advent of inexpensive 2.5D scanners (Microsoft Kinect) the most direct representation of 3D models is point cloud data, following this line of thought, authors in [6] propose a deep learning framework that consumes point clouds directly for 3D object recognition and segmentation. Taking this approach further authors in [3] propose a novel method to represent point clouds using density occupancy grid representation. Inspired by these works we propose a refined approach to construct density occupancy grids and use this information to feed to our deep learning framework. Extensive experiments are performed on the ModelNet 10 and ModelNet 40 datasets to show how our framework outperforms/performs comparable to existing state-of-the-art methods.
3 Proposed Methodology
In this section we explain in detail the proposed methodology for 3D object classification. The proposed framework is depicted in Fig. 1.
3.1 Preprocessing Input Data
The proposed framework is experimented on the famous ModelNet dataset provided by [11]. The dataset consists of 662 object categories with a total of 127,915 models. A subset of these models called ModelNet10 and ModelNet 40 are benchmark datasets for 3D object classification and recognition tasks. The dataset consists of CAD models in mesh representation, we propose a pre-processing step to convert the CAD models into point clouds. The CAD models are sampled to obtain a sparse point cloud representation which can directly be fed to our 3D-GCNN Model. The conversion process involves performing a random mesh sampling which uses barycentric co-ordinate system for the triangle meshes. The idea is to generate point clouds inside the triangle using its vertices and a linear combination of three random numbers. Consider the three vertices of the triangle P, Q and R, we generate three random numbers u, v, w whose value lies in between [0, 1] and is given by,
The value obtained from this equation is multiplied with the coordinates of the vertices of the triangle. In this manner the CAD models are sampled to have 2048 points.
3.2 Margin Based Occupancy Grid
In this section we explain the refined occupancy grid approach that we have formulated to represent the input point cloud data. Occupancy grids are a type of data structure [10] that allows us to obtain an efficient and compact representation of the volumetric space [3]. The occupancy grids provide good 3D shape cues that can be learned for various tasks. The occupancy grids can be formed using two approaches -
-
1.
Binary Grids - In this approach a binary tensor is defined which has either 1 or 0 based on whether the grid is occupied.
-
2.
Density Grids - In this approach the grids are defined by the density of the points present in the voxel.
In our formulation we use the density grids but unlike in the work [3] where the density grids are fixed with fixed leaf and grid size, we propose a margin based density grid formulation with variable voxel leaf size. The procedure to form occupancy grids can be summarized as follows:
-
1.
We calculate the centroid of the 3D model and draw a margin based on the farthest and nearest points with respect to the centroid. All the points are translated about this margin to form the bounding boxes.
-
2.
The 3 principal axes are equally segmented with respect to the bounding box limits. This allows for a variable voxel leaf-size which can be set based on the largest axis segmented.
-
3.
The binning of the points is done based on the search and sort technique [5]. Each point is mapped to a voxel and voxels beyond the specific grid size are clipped to maintain uniformity.
The size of the voxel is heuristically set to \(15\times 15\times 15\) as opposed to \(30\times 30\times 30\) as in [3]. This modification is in accordance to our cascaded 3D CNN architecture that accepts low resolution point cloud and can be trained on a CPU. Figure 1 shows conversion of CAD models to occupancy grids.
3.3 Network Architecture
In this section we describe our 3D CNN architecture that consumes low resolution point cloud data in the form of density occupancy grids and performs classification. CNNs have had tremendous success in various image-related tasks. The same idea has been extended for 3D object recognition and classification [3, 6].The architecture is shown in Fig. 2. Our network architecture which we refer to as 3D-GCNN consists of 2 convolutional layers, a max-pooling layer, a dropout layer and finally dense layers and softmax to obtain classification scores. The 3D CNN architecture is designed to have lesser trainable parameters and to accept low-resolution point cloud so that it can be trained/tested on a local machine with no GPU compute while still achieving comparable performance to state-of-the-art work. The main layers along with the parameters are described below:
-
1.
Input Layer - The input layer of the 3D-GCNN accepts low resolution point cloud and outputs the density occupancy grid based voxels of size \(15\times 15\times 15\). This acts as an input layer to the first convolutional layer.
-
2.
Convolutional Layer 1 - The first convolutional layer has 32 filters each of size \(3\times 3\times 3\) and stride 1. The output of this layer is \(13\times 13\times 13\times 32\). This layer is followed by a ReLU layer given by \(f(x_i) = max(0,x_i)\). The output activation maps are computed as follows,
$$\begin{aligned} f1 = max(0,W^T_1*x_i+B_1) \end{aligned}$$(2) -
3.
Convolutional Layer 2 - The second convolutional layer has 32 filters again of \(3\times 3\times 3\) size and stride of 1. The output of this layer is \(11\times 11\times 11\times 32\). The output activation maps are again computed using the Eq. 2.
-
4.
Max-Pooling Layer - The max-pooling layer acts as a global feature aggregator. The features extracted from the convolutional layers are pooled here. The kernel size of the max-pooling layer is \(2\times 2\times 2\) with a stride of 2 and the output being \(5\times 5\times 5\times 32\). We use a dropout of 0.5 to avoid overfitting after the max-pooling layer.
-
5.
Dense Layers and Softmax Classification - The activation maps are flattened before feeding them to the dense layers. We use 128 hidden units with ReLU activation function. The output layer is softmax classification which consists of K units, where K represents the different object classes. The 3D-GCNN is subjected to a categorical cross-entropy loss function.
The weights are refined by experimenting with Adadelta and Adam Optimizers. The Adadelta optimizer was set with a learning rate of 1.0 and momentum of 0.95 with no decay rate for the learning rate, while with the Adam optimizer the learning rate was set to 0.001 with bias correction of 0.9 and 0.999 respectively. From our experiments it was found that Adadelta generalized well while using Adam allowed the network to converge quickly.
4 Experiments and Results
In this section we describe the various experiments conducted on ModelNet dataset. We use the same train/test split as given by the ModelNet. We first divide the training set into a train/cross-validation set which is 80% and 20% respectively. This is true for both ModelNet 10 and ModelNet 40 datasets. The network was trained for 120 epochs on both datasets before the model converged. The experiments done on a local machine with no GPU support took about 2 h 40 min to fully train for both the datasets, while the same model migrated to a Nvidia Quadro K5000 GPU with Ubuntu OS 16.04, 64 GB RAM and Intel Xeon processor took about 20 and 40 min for ModelNet 10 and ModelNet 40 datasets respectively. The per-category accuracy of our model is still comparable to state-of-the-art methods even without data augmentation. Our 3D-GCNN architecture achieves a per-category test accuracy of 89.1% and 83.5% on ModelNet 10 and ModelNet 40 datasets respectively. Table 1 summarizes the comparison of our model with existing methods. From the table it is clear that our proposed framework outperforms the methods in [3, 9, 11] and is comparable to the work of [6]. The lower accuracy can be attributed to the way CNNs extract features and perform classification, the presence of similar looking objects like desk and table affect the performance of the model. Authors in [3] show that making the network deeper or employing a balanced dataset led to any increase in the performance, thus our 3D-GCNN model with relatively lesser trainable parameters 541,994 performs better than deeper network models.
5 Conclusion
In this paper we propose 3D-GCNN, a 3D CNN based object classification framework that uses margin based density occupancy grids with variable leaf size to directly consume point clouds as input data and perform 3D object classification. Extensive experiments performed on standard datasets show that our framework performs comparable to existing state-of-the-art methods.
References
Frome, A., Huber, D., Kolluri, R., Bülow, T., Malik, J.: Recognizing objects in range data using regional point descriptors. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3023, pp. 224–237. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24672-5_18
Ganihar, S.A., Joshi, S., Shetty, S., Mudenagudi, U.: Metric tensor and Christoffel symbols based 3D object categorization. In: ACM SIGGRAPH 2014 Posters, SIGGRAPH 2014, pp. 38:1–38:1. ACM, New York (2014). https://doi.org/10.1145/2614217.2630582
Garcia-Garcia, A., Gomez-Donoso, F., Rodrguez, J., Orts, S., Cazorla, M., Azorin-Lopez, J.: PointNet: a 3D convolutional neural network for real-time object class recognition, pp. 1578–1584, July 2016. https://doi.org/10.1109/IJCNN.2016.7727386
Johnson, A.E., Hebert, M.: Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Trans. Pattern Anal. Mach. Intell. 21(5), 433–449 (1999). https://doi.org/10.1109/34.765655
Leimer, K.: External sorting of point clouds, September 2013. https://www.cg.tuwien.ac.at/research/publications/2013/leimer-2013-esopc/
Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: deep learning on point sets for 3D classification and segmentation. CoRR abs/1612.00593 (2016). http://arxiv.org/abs/1612.00593
Rusu, R.B., Blodow, N., Beetz, M.: Fast point feature histograms (FPFH) for 3D registration. In: Proceedings of the 2009 IEEE International Conference on Robotics and Automation, ICRA 2009, Piscataway, NJ, USA, pp. 1848–1853. IEEE Press (2009). http://dl.acm.org/citation.cfm?id=1703435.1703733
Rusu, R.B., Bradski, G.R., Thibaux, R., Hsu, J.M.: Fast 3D recognition and pose using the viewpoint feature histogram. In: 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2155–2162 (2010)
Shi, B., Bai, S., Zhou, Z., Bai, X.: DeepPano: deep panoramic representation for 3-D shape recognition. IEEE Signal Process. Lett. 22(12), 2339–2343 (2015). https://doi.org/10.1109/LSP.2015.2480802
Thrun, S.: Learning occupancy grid maps with forward sensor models. Auton. Robot. 15(2), 111–127 (2003). https://doi.org/10.1023/A:1025584807625
Wu, Z., et al.: 3D ShapeNets: a deep representation for volumetric shapes. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1912–1920, June 2015. https://doi.org/10.1109/CVPR.2015.7298801
Acknowledgement
This work was partly carried out under Department of Science and Technology (DST) sponsored Indian Heritage in Digital Space (IHDS).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Tigadoli, R., Tabib, R.A., Jamadandi, A., Mudenagudi, U. (2019). 3D-GCNN - 3D Object Classification Using 3D Grid Convolutional Neural Networks. In: Deka, B., Maji, P., Mitra, S., Bhattacharyya, D., Bora, P., Pal, S. (eds) Pattern Recognition and Machine Intelligence. PReMI 2019. Lecture Notes in Computer Science(), vol 11941. Springer, Cham. https://doi.org/10.1007/978-3-030-34869-4_30
Download citation
DOI: https://doi.org/10.1007/978-3-030-34869-4_30
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34868-7
Online ISBN: 978-3-030-34869-4
eBook Packages: Computer ScienceComputer Science (R0)