1 Introduction

With the development of computer graphics, 3D shapes play an important role in many domains with a wide range of applications, making accurate 3D shape classification and retrieval necessary. In recent years, with remarkable advances in deep learning, various network structures have been proposed for 3D shape classification and retrieval, such as 3D ShapeNets [35], PointNet [3], and RotationNet [12]. Notably, view-based methods have achieved the best performance so far. Using deep learning to extract view descriptors typically means exploiting well-established models such as VGGNet [28], GoogLeNet [32], and ResNet [9]. Although deep learning methods for 2D views have been well investigated [19, 36], the structural relationships and local details of 3D shapes remain underexplored. Directly deploying these methods to non-rigid 3D shape classification and retrieval may lead to poor performance because non-rigid 3D shapes are more complex and changeable. Compared with rigid 3D shapes such as desks, chairs, and beds, the structure of non-rigid 3D shapes such as ants, cats, and humans is more complex. In addition, non-rigid 3D shapes undergo a variety of postural changes, so that different shapes can appear very similar in certain postures. Therefore, the classification and retrieval of non-rigid 3D shapes are more difficult and challenging.

Fig. 1. A classification and retrieval framework for non-rigid 3D shapes based on fused views.

To tackle this issue, we propose an FVCNN framework. It contains three functional modules: a projection module, a feature coding module, and a descriptor generation module, as shown in Fig. 1. First, the projection module follows the principles of the human visual system, constructing an efficient coordinate system to project the vertices in the observable region of the 3D shape onto a 2D plane. Then, a feature coding module extracts the NS and SR features as the pixel values of the 2D plane, so that two kinds of views are generated. Since the views contain the local details and structural relationships of the 3D shape, they can describe it comprehensively if their content features are explored efficiently. Finally, we use a CNNs fusion module to extract the features of the two views and fuse them to further extract deep fusion features as the 3D shape descriptors. We evaluate the proposed method on SHREC, and the experimental results show that it outperforms state-of-the-art methods.

The main contributions of this paper are as follows:

  • We propose a projection and feature encoding module to generate the NS and SR views, which contain the local details and structural relationships of the 3D shape, allowing the content features to be explored efficiently such that the views can comprehensively describe the 3D shapes.

  • We develop a CNNs fusion module to extract features from the views, fuse them, and extract deep fusion features as the 3D shape descriptors. Deep fusion features improve the expressive power of the shape descriptor and overcome the limitations of a single feature.

2 Related Works

In earlier research on 3D shape classification and retrieval, the features of 3D shapes can be divided into five categories: statistical features [21, 30], view features [4, 24], topological features [2, 16], function transformation features [5, 17], and fusion features [27, 40]. The key issues with these features are their weak descriptive ability and the expensive computation required to obtain them. It has been generally realized that obtaining efficient features is the key to the classification and retrieval of 3D shapes. Recently, deep learning has been applied in many fields and achieved satisfying results. The deep feature of a 3D shape is a comprehensive one that integrates the characteristics of all aspects of the object [8, 15, 18, 22, 33, 37, 39, 41]. Introducing deep learning into 3D shape classification and retrieval has therefore become a hot research topic. There are two categories of approaches for CNN-based 3D shape classification: voxel-based and 2D image-based.

  (1) Voxel-based approaches

    Charles et al. [3] introduced a hierarchical neural network called PointNet++, which applies PointNet recursively on a nested partition of the input point set. By exploiting metric space distances, the network learns local features with increasing contextual scales. Experimental results show that PointNet++ can learn deep point set features efficiently and robustly. Luciano et al. [41] brought forth a deep learning framework for efficient 3D shape classification based on geodesic moments. It uses a two-layer stacked sparse autoencoder to learn deep features from geodesic moments by training the hidden layers individually in an unsupervised fashion, followed by a softmax classifier. Ren et al. [23] developed a 2D multilayer dense representation (MDR) of 3D volumetric data to extract a concise and informative shape description, and designed a novel adversarial network to jointly train a CNN, a recurrent neural network (RNN), and an adversarial discriminator. The method improved the efficiency and effectiveness of 3D volumetric data processing.

  (2) 2D image-based approaches

    Bai et al. [1] presented GIFT, a real-time 3D shape retrieval engine based on projection images of 3D shapes, which combines GPU acceleration with an inverted file (twice). As a result, the method achieves very high time efficiency: every retrieval task can be finished within one second. Sinha et al. [29] converted the 3D shape into a geometry image so that standard CNNs can be used to learn 3D shapes directly: by projecting and cutting the spherically parameterized shape, the original 3D shape is transformed into a flat, regular geometry image from which the shape descriptor is extracted by CNNs. Shi et al. [26] introduced DeepPano, a rotation-invariant deep representation for 3D shape classification and retrieval. A variant of CNN is specifically designed to learn the deep representation directly from panoramic views; unlike a typical CNN, a row-wise max-pooling layer is inserted between the convolution and fully-connected layers, making the learned representation invariant to rotation around the principal axis. Su et al. [31] proposed a CNN architecture that combines information from multiple views of a 3D shape into a single compact shape descriptor, offering better recognition performance; the same architecture can also recognize human-drawn sketches accurately.

3 Methodology

3.1 Overview

Fig. 2. Schematic diagram of visual view generation.

In the human visual system, people recognize an object by observing its local details and structural relationships. Motivated by this observation, we propose a novel method to generate views. The views are formed by projecting the vertices in the visible area of the 3D shape onto a 2D plane, where the object features are encoded as pixel values, as shown in Fig. 2. We extract features from the views and then fuse them to extract deep fusion features as the descriptors of a 3D shape through the CNNs fusion module. The architecture of the method contains the following main steps, as illustrated in Fig. 1.

Step 1: Projection module for non-rigid 3D shape

To ensure the consistency of the extracted features, the 3D shape is preprocessed by the method described in [34], which eliminates the influence of rotation and translation. The 3D shape is then enclosed by a sphere. A projection module similar to a visual imaging system is established, and part of the vertices of the 3D shape are projected onto the 2D plane.
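The paper relies on [34] for this normalization; as a generic, hedged stand-in (not the cited method), the preprocessing could look like the NumPy sketch below, where the function name normalize_shape and the PCA-based alignment are our own assumptions.

```python
import numpy as np

def normalize_shape(V, r=1.0):
    """Generic pose normalization (a stand-in for the preprocessing of [34]).

    V : (N, 3) vertex coordinates of the mesh.
    Returns vertices centered at the centroid, PCA-aligned (dominant axis on Z,
    matching the major axis of Sect. 3.2), and scaled into a sphere of radius r.
    """
    C = V - V.mean(axis=0)                              # translate centroid to the origin
    # Rotate so the principal axes become the coordinate axes.
    _, _, Vt = np.linalg.svd(C, full_matrices=False)    # rows of Vt: principal directions
    A = C @ Vt.T                                        # columns ordered by decreasing variance
    A = A[:, [2, 1, 0]]                                 # put the dominant axis on Z
    return A * (r / np.linalg.norm(A, axis=1).max())    # fit inside the radius-r sphere
```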

Step 2: Feature coding module for the views

In this step, we propose a feature coding module. The NS and SR features of the 3D shape are coded as pixel values on the view plane, from which the views are generated.

Step 3: CNNs fusion module for non-rigid 3D shape

A feature extractor combining view-pooling and CNNs is developed to extract the 3D shape descriptors. The module is iteratively updated by training until the number of iterations reaches a given threshold or the module converges.

We will describe the design and analysis details for each key part of the model in the following sections.

Fig. 3. Schematic diagram of the projection module.

3.2 Projection Module for Non-rigid 3D Shape

We first define a sphere of radius r that encloses the 3D shape and establish a coordinate system, as shown in Fig. 3. The center of the sphere is the coordinate origin O(0,0,0), which is also the centroid of the 3D shape, and the Z axis is the major axis of the coordinate system. The view plane is set up perpendicular to the Z axis, with size \(h \times h\) and center O'(0,0,d). From the viewpoint V, we observe the 3D shape through the view plane, onto which part of the vertices are projected. The coordinate of the viewpoint is \(V(0,0,\alpha )\), where \(\alpha = dr/(r-h/2)\) according to the theory of similar triangles. Using the projection function \(F_{pro}:R^{3} \rightarrow R^{3}\) in Eq. 1, we calculate the coordinates of the point p' on the view plane as \(p'=F_{pro}(p)\), where \(p(p_{x},p_{y},p_{z})\) is a vertex of the 3D shape and \(p'(p'_{x},p'_{y},p'_{z})\) is its projection.

$$\begin{aligned} F_{pro}(p)=p'=\left( \frac{(\alpha -d)p_{x}}{\alpha -p_{z}},\ \frac{(\alpha -d)p_{y}}{\alpha -p_{z}},\ d\right) \end{aligned}$$
(1)
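For illustration, a minimal NumPy sketch of the perspective projection determined by this geometry is given below; the function name project_vertices is ours, and the paper's MATLAB implementation may differ in details such as which vertices are kept.

```python
import numpy as np

def project_vertices(P, d, r, h):
    """Project 3D vertices onto the view plane z = d as seen from V(0, 0, alpha).

    P : (N, 3) array of vertex coordinates (shape assumed centered at the origin).
    Returns the (N, 3) projected points p' with p'_z = d.
    """
    alpha = d * r / (r - h / 2.0)          # viewpoint height from similar triangles
    px, py, pz = P[:, 0], P[:, 1], P[:, 2]
    scale = (alpha - d) / (alpha - pz)     # ray from V through p meets z = d at this scale
    return np.stack([px * scale, py * scale, np.full_like(px, d)], axis=1)
```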

3.3 Feature Coding Module for the Views

Discrete Grid Division of the View Plane.

We set \(S_{h}\) and \(S_{v}\) as the horizontal and vertical step lengths, so that \(n_{h}=h/S_{h}\) and \(n_{v}=h/S_{v}\) are the numbers of divisions of the view plane. The 2D plane is then divided into the areas \(A_{ij},i=1,2,\ldots ,n_{h},j=1,2,\ldots ,n_{v}\), defined by

$$\begin{aligned} A_{ij}=\left\{ (x,y,z)\left| \begin{aligned} (i-1)S_{h}\le x< iS_{h}\\ (j-1)S_{v}\le y < jS_{v}\\ z=d \quad \quad \quad \quad \end{aligned} \right. \right\} \end{aligned}$$
(2)

The center point \((x^{*}_{i},y^{*}_{j},d)\) of the area \(A_{ij}\) can be calculated as

$$\begin{aligned} \left\{ \begin{aligned} x^{*}_{i}=(i-1)S_{h}+ \frac{1}{2} S_{h}\\ y^{*}_{j}=(j-1)S_{v}+ \frac{1}{2} S_{v} \end{aligned} \right. \end{aligned}$$
(3)

In order to simplify the expression, a local filter function can be defined as

$$\begin{aligned} I_{A_{ij}}(p)=\left\{ \begin{aligned} 1,&\quad F_{pro}(p)\in A_{ij}\\ 0,&\quad \text {otherwise} \end{aligned} \right. \end{aligned}$$
(4)

The pixel value at the center point of each \(A_{ij}\) is finally calculated as

$$\begin{aligned}&F_{NS}(x^{*}_{i},y^{*}_{j})= \mathop {avg}\limits _{p}(H_{NS}(p)I_{A_{ij}}(p)) \end{aligned}$$
(5)
$$\begin{aligned}&F_{SR}(x^{*}_{i},y^{*}_{j})= \max _{p}(H_{SR}(p)I_{A_{ij}}(p)) \end{aligned}$$
(6)

We thus obtain two categories of views: the NS view, with pixels \((x^{*}_{i},y^{*}_{j},F_{NS}(x^{*}_{i},y^{*}_{j}))\), and the SR view, with pixels \((x^{*}_{i},y^{*}_{j},F_{SR}(x^{*}_{i},y^{*}_{j}))\). \(H_{NS}(p)\) and \(H_{SR}(p)\) are the feature coding functions described in the following section.
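As a concrete reading of Eqs. 2-6, the following NumPy sketch rasterizes per-vertex feature values into an \(n_{h} \times n_{v}\) view; the coordinate shift onto the plane and the handling of empty cells are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def render_view(P_proj, feat, h, Sh, Sv, reduce="avg"):
    """Rasterize per-vertex features into a view image.

    P_proj : (N, 3) projected points on the plane z = d (e.g. from project_vertices).
    feat   : (N,) per-vertex feature values, H_NS(p) or H_SR(p).
    reduce : "avg" for the NS view (Eq. 5) or "max" for the SR view (Eq. 6).
    Cells that receive no projected vertex are left at 0 (an assumption).
    """
    nh, nv = int(h / Sh), int(h / Sv)
    view = np.zeros((nh, nv))
    # Shift coordinates so the plane spans [0, h) x [0, h) before binning
    # (the plane is centered at O', so this offset is an assumption).
    i = np.clip(((P_proj[:, 0] + h / 2) / Sh).astype(int), 0, nh - 1)
    j = np.clip(((P_proj[:, 1] + h / 2) / Sv).astype(int), 0, nv - 1)
    for a in range(nh):
        for b in range(nv):
            mask = (i == a) & (j == b)          # vertices projected into A_ab
            if mask.any():
                view[a, b] = feat[mask].mean() if reduce == "avg" else feat[mask].max()
    return view
```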

Feature Coding for the Views. We design two coding functions to encode the NS and SR features of the non-rigid 3D shape as pixel values. The views developed this way contain not only the local shape features of the observed object but also the positional relationships between those features. The pixel values reflect geometric properties such as structural relationships, local details, and topological structure of the 3D shape, so the view-based features are assembled into comprehensive descriptors of the 3D shape.

  (1) NS features as the pixel values for the views

The NS features [38] of a 3D shape describe its local structure and details. A view whose pixel values are constructed from NS features is called an NS view here. By optimizing the heat kernel signature (HKS) features, we obtain the NS features at the vertices of the 3D shape according to

$$\begin{aligned} H_{NS}(p)=F\left[ \frac{d}{d\tau }\log K_{\beta ^{\tau }}(p)\right] \quad \text {and} \quad K_{\beta ^{\tau }}(p)=\sum _{i\ge 0} e^{-\varLambda _{i}\beta ^{\tau }}\varPhi ^{2}_{i}(p) \end{aligned}$$
(7)

where \(\varLambda _{i}\) and \(\varPhi _{i}\) are the eigenvalues and eigenfunctions of the discrete Laplace-Beltrami operator, \(\beta \) is a constant, and \(\tau \in [lb(t_{min}), lb(t_{max})]\) in which \(t_{min}\) and \(t_{max}\) are the critical time values beyond which NS features of the 3D shape no longer change.
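The sketch below computes a scale-invariant HKS-style feature of this form from precomputed Laplace-Beltrami eigenpairs; the sampling of \(\tau \), the use of the Fourier amplitude, and the choice of which component to keep follow common practice for such features and are assumptions of this illustration, not specifics of [38].

```python
import numpy as np

def ns_feature(Lam, Phi, beta=2.0, taus=np.linspace(-16.0, 8.0, 100)):
    """Scale-invariant HKS-style feature per vertex (illustration of Eq. 7).

    Lam : (k,) non-negative eigenvalues of the discrete Laplace-Beltrami operator.
    Phi : (N, k) corresponding eigenfunctions sampled at the N vertices.
    Returns an (N,) scalar feature (one Fourier-amplitude component); which
    component(s) to keep is an assumption of this sketch.
    """
    t = beta ** taus                                  # logarithmically sampled times
    # K[v, s] = sum_i exp(-Lam_i * t_s) * Phi_i(v)^2  (heat kernel signature)
    K = (Phi ** 2) @ np.exp(-np.outer(Lam, t))
    logK = np.log(K + 1e-12)
    dlogK = np.diff(logK, axis=1)                     # discrete d/dtau of log K
    amp = np.abs(np.fft.rfft(dlogK, axis=1))          # Fourier amplitude over tau
    return amp[:, 1]                                  # one illustrative component
```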

As shown in Fig. 4, vertices with the same color have similar NS features. The NS features are isometry-invariant and robust to small perturbations such as minor topological changes or noise.

Fig. 4. The NS features of the man models.

Fig. 5. The NS features of the house and man models.

As shown in Fig. 5, (a) and (b), as well as (c) and (d), are the same 3D models at different scales, yet their NS features are similar; although the scale changes, the NS features remain robust. For different types of 3D shapes, such as (a) and (c) or (b) and (d), the NS features are distinctly different.

  (2) SR features as the pixel values for the views

The minimum enclosing sphere of the 3D shape is adopted, as shown in Fig. 6. According to Eq. 8, we obtain the pixel value \(H_{SR}\), which describes the global structural features at the vertex p of the 3D shape.

Fig. 6. The model for extracting the SR features.

$$\begin{aligned} H_{SR}(p)=(\cos \theta + \cos \varphi )\,dis(Op) \end{aligned}$$
(8)

where \((\theta ,\varphi ,r)\) are the spherical coordinates of the vertex p of the 3D shape, and dis(Op) is the distance between the point O and the point p.
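A small NumPy sketch of Eq. 8 follows; the convention used to recover \((\theta ,\varphi )\) from Cartesian coordinates is an assumption, since the paper does not spell it out.

```python
import numpy as np

def sr_feature(P):
    """SR feature per vertex (Eq. 8): (cos(theta) + cos(phi)) * dis(O, p).

    P : (N, 3) vertex coordinates with the centroid O at the origin.
    theta is taken as the polar angle from the Z axis and phi as the azimuth
    in the XY plane (a conventional choice, not stated in the paper).
    """
    dist = np.linalg.norm(P, axis=1)                        # dis(O, p)
    safe = np.maximum(dist, 1e-12)                          # avoid division by zero at O
    theta = np.arccos(np.clip(P[:, 2] / safe, -1.0, 1.0))   # polar angle
    phi = np.arctan2(P[:, 1], P[:, 0])                      # azimuth
    return (np.cos(theta) + np.cos(phi)) * dist
```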

3.4 CNNs Fusion Module for Non-rigid 3D Shape

In this section, we develop two CNNs: one based on traditional networks (CNNs-T) and one based on ResNet (CNNs-R), as shown in Figs. 7 and 8, respectively. For each CNN, the input consists of the two categories of views obtained by the previously defined modules.

Motivated by [10], we define a composite function of three consecutive operations for each block of each CNN: batch normalization (BN) [11], followed by a rectified linear unit (ReLU) [6] and a \(3 \times 3\) convolution (Conv). The two kinds of views are fed through \(CNN_{1}\) to extract view-specific features, which are then fused at the view-pooling layer and passed to \(CNN_{2}\) for further feature extraction.
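As a sketch, this composite function could be written as follows (the paper's implementation is in MATLAB; the PyTorch module below is illustrative, with channel counts left as parameters).

```python
import torch.nn as nn

class BnReluConv(nn.Module):
    """Composite function of one block: BN -> ReLU -> 3x3 Conv (pre-activation style)."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        # Padding of 1 keeps the feature-map size fixed, as stated in Sect. 3.4.
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        return self.conv(self.relu(self.bn(x)))
```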

Fig. 7. The convolutional neural networks based on traditional networks (CNNs-T).

CNNs-T. The training process of CNNs-T is illustrated in Fig. 7. Traditional convolutional feed-forward networks use the output of the \(l_{th}\) layer as the input to the \((l+1)_{th}\) layer [13].

Fig. 8. The convolutional neural networks based on ResNet (CNNs-R).

CNNs-R. The training process of CNNs-R is illustrated in Fig. 8. In ResNet [9], a skip connection is added that bypasses the non-linear transformation with an identity mapping: the output of the \(l_{th}\) layer is used as the input to both the \((l+1)_{th}\) and \((l+2)_{th}\) layers. The advantage of ResNet is that the gradient can flow directly from later layers to earlier layers through the identity mapping.

View-Pooling Layer. View-pooling layers are closely related to max-pooling layers, and the only difference is that the pooling operations are carried out in three dimensions.
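One common realization of such a layer, following the multi-view setting of [31], is an element-wise maximum taken across the view dimension of the stacked feature maps; whether additional spatial pooling is applied here is not stated, so the sketch below reflects only our reading.

```python
import torch

def view_pool(features):
    """Element-wise max across views.

    features : tensor of shape (batch, views, channels, height, width),
               e.g. the CNN_1 feature maps of the NS and SR views stacked along dim 1
               with torch.stack([f_ns, f_sr], dim=1).
    Returns a (batch, channels, height, width) tensor that is fed to CNN_2.
    """
    pooled, _ = features.max(dim=1)
    return pooled
```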

Implementation Details. \(CNN_{1}\) has five blocks, each with the same number of layers. For each \(3\times 3\) convolutional layer, every side of the input is zero-padded by one pixel to keep the feature-map size fixed. At the end of \(CNN_{1}\), view-pooling is performed and \(CNN_{2}\) is attached, so the network forms three parts. At the end of \(CNN_{2}\), two fully-connected layers and a softmax classifier are used. The numbers of feature maps for the successive blocks are 32, 32, 64, 64, 64, 128, 128, and 256.
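Read this way, with the eight channel counts split as five blocks (32, 32, 64, 64, 64) in \(CNN_{1}\) and three blocks (128, 128, 256) in \(CNN_{2}\), a CNNs-T-style network could be sketched as below; the weight sharing across views, the spatial pooling between blocks, and the fully-connected widths are all assumptions of this sketch rather than details from the paper.

```python
import torch
import torch.nn as nn

class BnReluConv(nn.Module):
    def __init__(self, cin, cout):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(cin), nn.ReLU(inplace=True),
            nn.Conv2d(cin, cout, kernel_size=3, padding=1, bias=False))

    def forward(self, x):
        return self.body(x)

class FusionViewCNN(nn.Module):
    """CNNs-T-style sketch: CNN_1 per view -> view-pooling -> CNN_2 -> FC -> softmax."""

    def __init__(self, num_classes=30, in_channels=1):
        super().__init__()
        widths1, widths2 = [32, 32, 64, 64, 64], [128, 128, 256]
        blocks, c = [], in_channels
        for w in widths1:
            blocks += [BnReluConv(c, w), nn.MaxPool2d(2)]   # spatial pooling is assumed
            c = w
        self.cnn1 = nn.Sequential(*blocks)                  # shared across both views (assumed)
        blocks2 = []
        for w in widths2:
            blocks2 += [BnReluConv(c, w)]
            c = w
        self.cnn2 = nn.Sequential(*blocks2, nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Sequential(nn.Linear(c, 256), nn.ReLU(inplace=True),
                                  nn.Linear(256, num_classes))   # FC widths assumed

    def forward(self, ns_view, sr_view):
        f = torch.stack([self.cnn1(ns_view), self.cnn1(sr_view)], dim=1)
        fused, _ = f.max(dim=1)                             # view-pooling (element-wise max)
        return self.head(self.cnn2(fused))                  # logits; softmax applied in the loss
```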

In our experiments, we use the above two network structures to extract the descriptors of the 3D shapes and implement the classification and retrieval of 3D shapes.

4 Experiment Result and Analysis

All algorithms proposed in this work are implemented and tested using MATLAB 2017b on a PC with the following specifications: CPU: Intel(R) Core(TM) i9-7960X 2.80 GHz; GPU: NVIDIA GeForce GTX 1080 Ti; RAM: 16 GB DDR4; OS: Windows 10 SP1 (64-bit).

4.1 Dataset

We evaluate our method on the SHREC [14] database of watertight meshes. SHREC contains 600 3D shapes from 30 categories, of which 480 are used for training and 120 for testing. Thirty randomly selected shapes, one from each of the 30 categories, are shown in Fig. 9.

Fig. 9. 30 selected 3D shapes from the SHREC database.

4.2 The NS and SR Features of the 3D Shapes

Fig. 10. The NS features (a) and SR features (b) of the selected 3D shapes.

Based on the feature coding module in Sect. 3.3, we extract the NS and SR features of the 30 selected 3D shapes. Color is used to represent the NS and SR features of the 3D shapes, as shown in Fig. 10(a) and (b). These color patterns reflect the local details and structural relations of the 3D shapes, and the similarity between the NS (or SR) features of different categories is low.

4.3 The Examples of NS and SR Views for Non-rigid 3D Shapes

Based on the projection and feature coding modules described in Sect. 3, we extract the views of the 3D shapes from the 30 categories to analyze their expressive capability. As shown in Fig. 11(a) and (b), the NS and SR views correspond to the 3D shapes in Fig. 9. The views reflect the local details and structural relations of the 3D shapes well, and the similarities between views of different categories are low. Through these two kinds of views, the 3D shapes can be described efficiently, and the similarities and differences between shapes can be distinguished.

Fig. 11. The NS views (a) and SR views (b) of the selected 3D shapes.

4.4 Non-rigid 3D Shape Classification and Retrieval Efficiency Analysis

Fig. 12. Comparison of parameter efficiency (a) and view efficiency (b).

The Comparison of CNNs-T and CNNs-R. The results in Fig. 12(a) show that CNNs-R uses parameters more efficiently, consistently achieving lower top-1 errors than CNNs-T with the same number of parameters. Moreover, CNNs-R also exploits the view features more effectively, delivering better accuracy with the same views (e.g., 83.76% vs. 66.35%, 89.29% vs. 75.68%, 97.44% vs. 78.41%), as shown in Fig. 12(b).

The Retrieval Results of CNNs-T and CNNs-R. In this experiment, 3D man and ant shapes are chosen as the query shapes. We compare the 3D shape retrieval performance of CNNs-R and CNNs-T; the results are shown in Fig. 13. In Fig. 13(a) and (b), all retrieved shapes are relevant: although the 3D ant shape is complex and exhibits many forms of deformation, CNNs-R delivers good retrieval performance. In contrast, the results of CNNs-T contain one irrelevant retrieval for the 3D man shape and two for the 3D ant shape, as shown in Fig. 13(c) and (d). These results verify the superior performance of the proposed CNNs-R.

Fig. 13. The retrieval results.

Fig. 14. Precision-recall curves among different algorithms on the SHREC-11 dataset.

Table 1. Comparison results among different algorithms on the SHREC dataset.

Comparative Analysis with State-of-the-Art Methods. We now compare our methods with state-of-the-art approaches, including Zer [20], LFD [4], SN [35], Conf [7], Sph [25], and Geometry Image [29]. The non-rigid 3D shape classification and retrieval results are summarized in Table 1 and Fig. 14. Both CNNs-T and CNNs-R perform well: CNNs-T achieves a classification accuracy of 82.7% and a retrieval MAP of 76%, the latter 4% higher than Geometry Image [29]. CNNs-R has the best performance of all, with a classification accuracy of 97.4% and a retrieval MAP of 81%, which are 0.8% and 9% higher than Geometry Image [29], respectively.

5 Conclusion

In this paper, we put forward an FVCNN framework for classifying and retrieving non-rigid 3D shapes. First, we propose a projection module to transform the non-rigid 3D shape onto a 2D view plane and a feature coding module to extract its NS and SR features. The NS and SR views are then generated by using the NS and SR features as pixel values, respectively, which express the 3D shapes efficiently. Finally, we propose a CNNs fusion module to extract the view-based features and fuse them to obtain deep fusion features as the 3D shape descriptors. The proposed neural network architecture outperforms more traditional, non-learning-based approaches, but there is still much room for improvement.

In the future, we wish to build upon these insights for generative models of 3D shapes with encoded views instead of traditional images. Another direction is to integrate the discriminative power of view-based approaches with the robustness of approaches that reason more locally about geometry.