1 Introduction

With 3D objects widely used in a variety of fields, such as 3D games, movie entertainment, and multimedia, efficient and concise 3D shape analysis is becoming more and more important. Retrieving non-rigid 3D objects has long been a popular and difficult topic, and a large number of 3D retrieval methods have been proposed; the authors of [1] present a review of the existing methods. Most methods represent a 3D shape as a shape descriptor. Shape descriptors are crucial for 3D object representation: they not only need to describe the intrinsic attributes of shapes, but also need to discriminate well between different models.

In the past decade, a large number of shape descriptors have been proposed, such as the D2 shape distribution [2], statistical moments [3, 4], the Fourier descriptor [5], and the Eigenvalue Descriptor [6]. These descriptors can effectively express 3D shapes; however, they are relatively sensitive to non-rigid deformation and topological changes, and are therefore not suitable for non-rigid 3D shape retrieval.

Intrinsic descriptors with isometric invariance have been proposed for non-rigid 3D models. Methods based on the Laplace-Beltrami operator are the most common, including the Global Point Signature (GPS) [7], the Heat Kernel Signature (HKS) [8], the Wave Kernel Signature (WKS) [9], and the scale-invariant Heat Kernel Signature (SIHKS) [10].

In the aforementioned methods, the shape descriptors are hand-crafted rather than learned from data, and they are often not robust enough to deal with deformations of 3D models and geometric noise. In [11], the authors applied the standard Bag-of-Features (BoF) model to construct a global shape descriptor: a set of HKSs is clustered using k-means to form a dictionary of words, and spatially-close words are combined into a histogram as the shape descriptor. In [12], the standard and spatial BoF descriptors are combined for shape retrieval. Unlike the conventional BoF methods, the authors of [13] propose a supervised BoF approach with mean pooling to form a discriminative shape descriptor. The authors of [14] adopt three levels for extracting high-level features: they combine three representative intrinsic features, the SIHKS, the shape diameter function (SDF) and the averaged geodesic distance (AGD); a position-independent BoF is then extracted at the mid-level; finally, a stack of restricted Boltzmann machines is used for high-level feature learning. The methods of [15,16,17] optimize histogram-based HKS descriptors with auto-encoders. In [14], low-level features are extracted first, then neural networks are used to maximize the inter-class margin and minimize the intra-class distance. The method of [18] combines the SIHKS and WKS, obtains the shape descriptor by weighted pooling, and linearly maps it into a subspace using large margin nearest neighbor.

However, these methods share some common problems. Most high-level features are based on linear mappings. Although some methods extract high-level features by deep learning, they have too few training samples, and in their experiments the training set and test set always contain the same categories.

2 Overview of Our Method

In this paper, we propose a novel non-rigid 3D object retrieval method that improves both retrieval performance and time consumption. Inspired by the mean pooling proposed in [13] and the weighted pooling proposed in [18], we propose a simple and efficient method to reduce the multi-scale descriptor to a compact descriptor, which we call the sum pooling descriptor. We use sum pooling instead of weighted pooling because the difference between the two becomes very small as the number of points increases, and it can be ignored once the network optimization is applied; we therefore use the simplest way to solve this problem. This operation can be regarded as summing the energy at each scale over every vertex on the mesh, so it describes the transformation of the total energy. We test this approach on HKS and WKS respectively.

Then we train a discriminative, deformation-invariant shape descriptor via a two-stream network, known as a Siamese network in the fields of computer vision and image processing. Figure 1 shows the pipeline of our proposed method, where the red arrows indicate the training process and the blue arrows indicate the test process. It includes three steps:

Fig. 1. The detailed framework of our method. We preprocess the 3D shapes from the dataset and extract the point descriptors for each shape. The sum pooling step transforms the point descriptors into compact descriptors, each represented by a vector. Finally, the high-level descriptor is trained by the Siamese network. (Color figure online)

  • Multi-scale feature extraction. We extract two kinds of multi-scale features, HKS and WKS, from the 3D models to describe each non-rigid shape.

  • Low-level feature extraction. We apply the sum pooling operation to the multi-scale features and obtain a succinct and compact representation of the 3D shape; each model is represented as a single vector.

  • High-level feature extraction. The low-level features are mapped into a subspace by a two-stream network designed for the 3D shape descriptor.

The reasons for choosing the two-stream network architecture for high-level feature extraction are as follows:

  • The number of 3D models in one dataset is usually only a few tens or hundreds. The two-stream architecture takes pairs of samples as input, which greatly increases the number of training samples and improves the network's ability to fit.

  • The nonlinear mapping of the neural network can project the low-level descriptors into a subspace better suited to the Euclidean distance.

  • By comparing samples from different categories in the loss function, the inter-class margin can be maximized and the intra-class distance minimized.

3 Method

In this section, we describe the details of our method. In the first part, we describe the extraction of the multi-scale features, including HKS and WKS. The sum pooling operation and the design of the Siamese network are introduced in the second and third parts.

3.1 Multi-scale Feature Extraction

Heat Kernel Signature.

The Heat Kernel Signature (HKS) [8] has been widely used in 3D shape analysis due to its capability of capturing significant intrinsic geometric properties, such as isometric invariance and robustness to geometric transformations. In HKS, a 3D model is represented as a Riemannian manifold \( M \). The heat diffusion process on \( M \) is governed by the heat equation

$$ \Delta_{M} u(x,t) = - \frac{\partial u(x,t)}{\partial t} $$
(1)

where \( \Delta_{M} \) is the Laplace-Beltrami operator (LBO) on the Riemannian manifold \( M \), and \( k_{t} (x,y):\Re^{ + } \times {\rm M} \times {\rm M} \to \Re \) is the heat kernel function on \( M \), which describes the amount of heat transferred between points x and y at a given time t. The eigen-decomposition of the heat kernel is

$$ k_{t} (x,y) = \sum\nolimits_{k = 0}^{\infty } {e^{ - \lambda_{k} t} } \phi_{k} (x)\phi_{k} (y) $$
(2)

where \( \lambda_{k} \) and \( \phi_{k} \) are the k-th eigenvalue and the k-th eigenfunction of the LBO respectively. The quantity \( k_{t} \left( {x,x} \right) \) represents the heat value at point x from time \( t_{0} \) to time t, where the unit heat source \( u_{0} \) is concentrated at x on the model surface \( S \). Thus the HKS can be represented as

$$ k_{t} (x,x) = \sum\limits_{k = 0}^{K - 1} {e^{{ - \lambda_{k} t}} } \phi_{k} (x)^{2} $$
(3)

Note that we no longer use all the eigenfunctions, but only the first K eigenfunctions of the LBO. The HKS values at different diffusion times are concatenated to obtain the descriptor of point x

$$ HKS(x) = (k_{{t_{1} }} (x,x),k_{{t_{2} }} (x,x), \ldots ,k_{{t_{\tau } }} (x,x)) $$
(4)

where HKS(x) is the local descriptor of point x on a model, represented as a T-dimensional vector. For a 3D model with n points \( X = \{ x_{1} ,x_{2} , \ldots ,x_{n} \} \), we finally get a global descriptor of size \( n \times T \):

$$ HKS = (HKS(x_{1} ),HKS(x_{2} ), \ldots ,HKS(x_{n} )) $$
(5)
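
To make the construction concrete, the following sketch computes the HKS from a precomputed LBO eigen-decomposition. It is a minimal NumPy illustration, not our implementation: the log-spaced time sampling follows the common choice in [8], and all function and variable names are ours.

```python
import numpy as np

def heat_kernel_signature(evals, evecs, num_times=100):
    """evals: (K,) ascending LBO eigenvalues; evecs: (n, K) eigenfunctions
    sampled at the n vertices. Returns an (n, T) matrix whose rows are the
    per-vertex HKS vectors of Eqs. (4)-(5)."""
    # Log-spaced diffusion times over the spectral range, as suggested in [8];
    # evals[0] is ~0, so the range is set from the second eigenvalue.
    t_min = 4.0 * np.log(10.0) / evals[-1]
    t_max = 4.0 * np.log(10.0) / evals[1]
    ts = np.exp(np.linspace(np.log(t_min), np.log(t_max), num_times))
    # k_t(x, x) = sum_k exp(-lambda_k * t) * phi_k(x)^2   (Eq. 3)
    return (evecs ** 2) @ np.exp(-np.outer(evals, ts))
```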

Wave Kernel Signature.

The Wave Kernel Signature (WKS) [9] reflects the probability distributions of quantum particles with different energies. The energy of a quantum particle on a model surface is governed by the wave function \( \psi (x,t) \), which is a solution of the Schrödinger equation

$$ \frac{\partial \psi (x,t)}{\partial t} = i\Delta \psi (x,t) $$
(6)

Similar to HKS, WKS also depends on the eigen-decomposition of the LBO. Given an energy probability distribution \( f_{E} \) with expectation value E, the wave function can be written as

$$ \psi_{E} (x,t) = \sum\limits_{k = 0}^{\infty } {e^{{i\lambda_{k} t}} } \phi_{k} (x)f_{E} (\lambda_{k} ) $$
(7)

The probability of measuring the particle at point x is \( |\psi_{E} (x,t)|^{2} \). Since the time parameter has no intuitive interpretation for shapes, the WKS is defined as the average probability over time:

$$ p(x) = \mathop {\lim }\limits_{T \to \infty } \frac{1}{T}\int_{0}^{T} {|\psi_{E} (x,t)|^{2} \,dt} = \sum\limits_{k = 1}^{\infty } {\phi_{k} } (x)^{2} f_{E} (\lambda_{k} )^{2} $$
(8)

Descriptors at different scales can therefore be obtained by choosing different distributions \( f_{E} \). The multi-scale WKS descriptor for a set of energy distributions \( E = \{ e_{1} , \ldots ,e_{q} \} \) is

$$ WKS(E,x) = (p_{{e_{1} }} (x), \ldots ,p_{{e_{q} }} (x))^{T} $$
(9)
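
A corresponding WKS sketch is given below, again assuming the LBO eigenpairs are already available. The log-normal energy band \( f_{E} \), its bandwidth, and the normalization follow the usual choices in [9]; the names and the variance default (matching Sect. 4.2) are illustrative.

```python
import numpy as np

def wave_kernel_signature(evals, evecs, num_energies=100, variance=6.0):
    """evals: (K,) ascending LBO eigenvalues; evecs: (n, K) eigenfunctions.
    Returns an (n, Q) matrix whose rows are the per-vertex WKS of Eq. (9)."""
    log_ev = np.log(np.maximum(evals, 1e-6))     # guard the ~0 eigenvalue
    # Energy levels e sampled uniformly in log-eigenvalue space [9]
    es = np.linspace(log_ev[1], log_ev[-1], num_energies)
    sigma = variance * (es[1] - es[0])           # bandwidth of f_E
    # f_E(lambda_k)^2 for a log-normal band around each energy level
    f2 = np.exp(-(es[None, :] - log_ev[:, None]) ** 2 / (2.0 * sigma ** 2))
    wks = (evecs ** 2) @ f2                      # sum_k phi_k(x)^2 f_E(lambda_k)^2
    return wks / f2.sum(axis=0)                  # normalize each energy level
```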

3.2 Sum Pooling Descriptor

For retrieval, a concise and compact descriptor is necessary, and it must discriminate well between different non-rigid 3D shapes. Both HKS and WKS are multi-scale point descriptors and are not directly suitable for retrieval. To form a compact descriptor, we sum the energy values of all points at each scale; this describes how the heat values on the shape surface change over time. For a given shape \( S \) with \( V \) points and a multi-scale descriptor with \( N \) scales, we compute a sum pooling HKS or WKS over all point descriptors \( d(x) \) computed at the points x of \( S \). Thus, our shape descriptor is defined as

$$ y(S) = \sum\limits_{x \in S} {d(x)/V} $$
(10)

where \( x \in S \) is a point of shape \( S \), \( V \) is the number of points, and \( d(x) \) is the descriptor of point x.
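
Equation (10) amounts to a single averaging step over the vertex axis of the point-descriptor matrix. A minimal sketch, with illustrative names, is:

```python
import numpy as np

def sum_pooling(point_desc):
    """point_desc: (V, N) per-vertex multi-scale descriptors (HKS or WKS).
    Returns the N-dimensional sum pooling shape descriptor of Eq. (10)."""
    return point_desc.mean(axis=0)   # sum over the V vertices, divided by V

# Hypothetical stacked variant (SPHWKS): concatenate the two pooled vectors.
# sphwks = np.concatenate([sum_pooling(hks), sum_pooling(wks)])
```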

3.3 Siamese Network Architecture

As shown in Fig. 2, in order to map the descriptors into the Euclidean projection subspace that best maximizes the inter-class margin and minimizes the intra-class distance, we apply a two-stream network architecture. Figure 3 shows that the Siamese network realizes the two-stream architecture by sharing network weights. It was first proposed for signature verification [19]. Later, various improvements to Siamese networks were applied in many tasks, including face verification [20, 21], ground-to-aerial image matching [22], local patch descriptor learning [23, 24] and stereo matching [25]. In this work, we design a novel Siamese network to learn a concise and compact descriptor for 3D shape retrieval.

Fig. 2. The Siamese network maps features into a high-dimensional subspace, narrowing the differences between samples from the same class and expanding the distance between samples from different classes.

Fig. 3. Siamese network architecture.

Note that the Siamese network is a metric learning algorithm, while our aim is to obtain a compact descriptor. Since the neural network has a hierarchical structure and the distance between two samples is measured at the last layer in Euclidean space, we extract the last layer of the network as the descriptor after training is finished.

Network Architecture.

Figure 3 shows the architecture of our network. Here \( X_{1} \) and \( X_{2} \) are a pair of input samples. If they come from the same category then Y = 1 and we call them a "positive pair"; otherwise Y = 0 (a "negative pair"). W denotes the shared weights of the network, and the function \( E_{W} \) measures the Euclidean distance between the network outputs of \( X_{1} \) and \( X_{2} \). It is defined as

$$ E_{W} (X_{1} ,X_{2} ) = ||G_{W} (X_{1} ) - G_{W} (X_{2} )||_{2} $$
(11)

where \( G_{W} (X_{1} ) \) and \( G_{W} (X_{2} ) \) are the high-level vectors after the network mapping, and \( ||G_{W} (X_{1} ) - G_{W} (X_{2} )||_{2} \) is the Euclidean distance between the two branches' latent representations.

Considering that the input dimension is small, we do not incorporate a regularization term into the loss function; instead, we apply the ReLU activation function directly in the first layer of the network. This not only plays the role of a regularization term, but we also find that it increases the speed of convergence by nearly ten times.
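
For concreteness, a minimal PyTorch sketch of the shared branch \( G_{W} \) and the distance \( E_{W} \) of Eq. (11) is shown below. The layer sizes follow the 150-516-256-150 configuration reported later in Sect. 4.2, but the module and variable names are illustrative, not our implementation.

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """Shared branch G_W; applying the same module to both inputs of a
    pair realizes the weight sharing of the Siamese architecture."""
    def __init__(self, dims=(150, 516, 256, 150)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dims[0], dims[1]),
            nn.ReLU(),                    # ReLU only after the first layer
            nn.Linear(dims[1], dims[2]),
            nn.Linear(dims[2], dims[3]),
        )

    def forward(self, x):
        return self.net(x)

def euclidean_distance(g1, g2):
    # E_W(X1, X2) = ||G_W(X1) - G_W(X2)||_2   (Eq. 11)
    return torch.norm(g1 - g2, p=2, dim=1)
```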

Loss.

We apply a margin contrastive loss function, consisting of a squared error loss and a hinge loss, as shown below

$$ loss(W) = YE_{W}^{2} + (1 - Y)\;\hbox{max} (0,\varepsilon - E_{W} ) $$
(12)

where Y is the binary label of the pair indicating whether the two samples are of the same category, and \( \varepsilon \) is a distance margin.
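
A direct transcription of Eq. (12) into PyTorch might look as follows; the batching and variable names are assumptions for the example.

```python
import torch

def contrastive_loss(e_w, y, margin=1.0):
    """e_w: (B,) distances E_W for a batch of pairs; y: (B,) labels,
    1 for positive pairs and 0 for negative pairs (Eq. 12)."""
    pos = y * e_w.pow(2)                                  # pull positives together
    neg = (1 - y) * torch.clamp(margin - e_w, min=0.0)    # push negatives apart
    return (pos + neg).mean()
```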

4 Experiments

4.1 Dataset and Evaluation Metric

Dataset.

We evaluated our approach on the dataset of SHREC'14 Shape Retrieval of Non-Rigid 3D Human Models [26]. The "Real" dataset contains real scans of 40 subjects (20 male and 20 female persons), each in ten different poses, for a total of 400 models. The "Synthetic" dataset includes 300 models in 15 classes (5 males, 5 females and 5 children), each class having a unique shape in 20 different poses. Compared with other 3D datasets, the datasets we adopt are composed entirely of human body models rather than different kinds of objects, which makes them particularly challenging.

Evaluation Metrics.

We chose several evaluation metrics commonly used in 3D retrieval to evaluate the performance of our proposed descriptors.

  • Nearest Neighbor (NN): The percentage of queries whose closest match belongs to the same class as the query (a computation sketch for NN and FT follows this list).

  • First Tier (FT): The percentage of shapes from the same class as the query that appear in the top K results, where K is the size of the query's class.

  • Second Tier (ST): The percentage of shapes from the same class as the query that appear in the top 2K results.

  • E-measure (EM): A combination of the precision and recall over the top 32 retrieved results.

  • Discounted Cumulative Gain (DCG): A logarithmic statistic that weighs correct results near the front of the ranked list more heavily than those toward the end.

  • Mean Average Precision (mAP): The average of the precision values computed at each relevant object in the retrieved list.
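
As an illustration of how the first two metrics are obtained from a pairwise distance matrix, consider the sketch below. It is a plain NumPy reading of the definitions above, not the evaluation code used in our experiments.

```python
import numpy as np

def nn_and_ft(dist, labels):
    """dist: (n, n) pairwise descriptor distances; labels: (n,) class ids.
    Returns NN and FT averaged over all queries (query itself excluded)."""
    n = dist.shape[0]
    nn_hits, ft_scores = [], []
    for q in range(n):
        order = np.argsort(dist[q])
        order = order[order != q]            # drop the query itself
        same = labels[order] == labels[q]
        k = int(same.sum())                  # K = class size minus the query
        nn_hits.append(same[0])              # NN: is the top match correct?
        ft_scores.append(same[:k].mean())    # FT: recall within the top K
    return float(np.mean(nn_hits)), float(np.mean(ft_scores))
```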

4.2 Implementation Details

In this subsection, we introduce our experimental settings. The LBO is limited to the first 100 eigenfunctions. For HKS, we choose 100 time scales. For WKS, the variance of the WKS Gaussian is set to 6 and the dimension is 100. We down-sample all shapes to 10K vertices and 20K faces with MeshLab [27].

We train a Siamese network whose branches are 3-layer fully connected networks with sizes 150-516-256-150. We oversample the positive pairs to ensure that the numbers of the two kinds of input pairs are the same. The margin \( \varepsilon \) in the loss function is set to 1.0.
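
The balanced pair generation can be sketched as follows; this is a hypothetical illustration of the oversampling described above, with all function and parameter names invented for the example.

```python
import itertools, random

def make_pairs(labels, rng=random.Random(0)):
    """Hypothetical sketch: enumerate all pairs, then oversample the
    positive pairs until they match the number of negative pairs."""
    idx = list(range(len(labels)))
    pos = [(i, j) for i, j in itertools.combinations(idx, 2)
           if labels[i] == labels[j]]
    neg = [(i, j) for i, j in itertools.combinations(idx, 2)
           if labels[i] != labels[j]]
    pos = [rng.choice(pos) for _ in range(len(neg))]   # oversample positives
    pairs = ([(i, j, 1) for i, j in pos] +             # label 1: same class
             [(i, j, 0) for i, j in neg])              # label 0: different
    rng.shuffle(pairs)
    return pairs
```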

4.3 Experiment 1

In this experiment, we divide the dataset into a training set and a test set following [13]: we select 4 samples of each class as the training set and use the rest as the test set.

We test three low-level descriptors, the sum pooling HKS (SPHKS), the sum pooling WKS (SPWKS) and the stacked sum pooling HKS and WKS (SPHWKS), together with their corresponding network-trained descriptors. We run each test in this paper 20 times with different training/test splits and take the average as the final result.

Combined with the Siamese network, our descriptors obtain significant improvements (Tables 1 and 2). On the more challenging Real dataset, our approach achieves better results, which shows that our method is particularly effective on data with noise and topology changes. The performance of the stacked descriptor is worse than that of SPHKS; however, after the network mapping, its performance is significantly improved.

Table 1. Shape descriptors evaluation on the SHREC’14 “Synthetic” dataset in experiment 1. We test the sum pooling descriptor (SPHKS, SPWKS, SPHWKS) with its corresponding descriptor trained by network.
Table 2. Shape descriptors evaluation on the SHREC’14 “Real” dataset in experiment 1.

We also compare the P-R curves (Fig. 4) and the mean average precision (mAP) shown in Table 3.

Fig. 4. Comparison of the performance of the three descriptors before and after network mapping on the SHREC'14 Synthetic dataset (left) and Real dataset (right) in experiment 1.

Table 3. Comparison of different retrieval methods. The last six records are the results of our method. And the top three scores of each column are highlighted.

We compare our method with several state-of-the-art methods introduced in [30], including ISPM, DBN, HAPT, R-BIHDM, Shape Google, unsupervised DL and supervised DL. We also compare it with recent learning-based methods, including CSD+LMNN [18], DLSD [28] and RMVM [29]. Although our approach is not the best, methods such as CSD+LMNN require the training set and test set to contain the same classes, which is not reasonable in practical applications.

Comparing the P-R curves and the mAP results, it is clear that the precision of our method decreases slowly as the recall rate increases, so it is more stable than other approaches. The mAP of our proposed method is also superior to that of most existing methods.

4.4 Experiment 2

We conduct a further, more interesting test: we choose some classes as the training set and the remaining classes as the test set. One of the reasons we chose a two-stream architecture is to weaken the labels, so that the network learns to find the differences between classes rather than only distinguishing between known classes. It is not difficult to see that our method is more suitable for this experiment than for experiment 1. In both experiments, the number of negative sample pairs is sufficient; however, in experiment 1 the Synthetic and Real datasets yield only 240 and 640 positive sample pairs respectively, whereas in this experiment we can obtain 2800 and 2000 sample pairs for the Synthetic and Real datasets.

We choose half of the classes (20 classes, 200 models in the "Real" dataset; 7 classes, 140 models in the "Synthetic" dataset) as the training set and the rest as the test set. The experimental results (Tables 4 and 5) show that our method is more effective on the noisy dataset, which again demonstrates the advantages of the neural network.

Table 4. Shape descriptors evaluation on the SHREC'14 "Real" dataset in experiment 2.
Table 5. Shape descriptors evaluation on the SHREC’14 “Synthetic” dataset in experiment 2.

Obviously, as shown in Tables 4 and 5, even though the test classes are not in the training set, the results of our experiments are still significant. Our method remains more effective on the SHREC'14 Real dataset. The SHREC'14 Synthetic dataset has little noise, so the sum pooling descriptors already have enough power to distinguish shapes from different classes.

The Precision-Recall plots (Fig. 5) and the mAP values (Table 6) show the results of the different descriptors. The network-optimized descriptors are more stable. Although the stacked HKS and WKS descriptor does not perform well on its own, it obtains remarkable results once trained by the Siamese network.

Fig. 5. Comparison of the performance of the three descriptors before and after network mapping on the SHREC'14 Synthetic dataset (left) and Real dataset (right) in experiment 2.

Table 6. The mean average precision evaluation on the SHREC'14 dataset in experiment 2.

We visualize the trained SPHWKS descriptor. From Fig. 6, we find that the untrained descriptors are difficult to distinguish visually, while after training there is a significant difference between classes.

Fig. 6. Shape descriptor visualization. We choose three classes (left, middle, right); the models from the different classes are chosen to have the same posture. From top to bottom: the human models, the sum pooling descriptors and the trained sum pooling descriptors.

4.5 Time Consumption

We test our method on a PC with an Intel(R) Core(TM) i5-4690 CPU and 8.0 GB RAM. It takes approximately 2.7 s per shape to compute the LBO eigenvalues and eigenvectors, and a total of 2.3 s to compute the low-level descriptor for one shape, of which the sum pooling step needs only 0.4 s. The time consumed in high-level descriptor extraction is mainly in the training stage: it takes about 15 min to train a network.

5 Conclusion

We proposed a novel 3D shape descriptor based on a Siamese network. We first extracted a compact low-level descriptor by sum pooling the point descriptors HKS and WKS; HKS is more sensitive to global information, while WKS is more effective for detail information. We stacked HKS and WKS and further optimized the result through the Siamese network, obtaining the final descriptor from the last layer of the network. We tested our proposed method on two challenging datasets, the Real and Synthetic datasets of the SHREC'14 Human benchmark. The experimental results showed that our proposed method is very effective for noisy and hard-to-distinguish data, and its performance surpasses existing methods in many respects.