
1 Introduction

With the development of new equipment and new methods, the virtual reproduction of the surrounding 3D real world has come back to people’s attention. The application areas include robotics, the entertainment industry, archeology, the vehicle industry, etc. A considerable amount of work in these areas has been devoted to achieving high-quality 3D reconstruction regardless of running time. However, in the last decade, the interest in real-time 3D reconstruction systems has increased dramatically. Current real-time reconstruction techniques are limited either by the capturing environment or by the quality of the reconstructed surface. This work aims to propose a novel, efficient algorithm without compromising the quality and robustness of the reconstruction scheme.

Manual modeling is widely used in the film and game industries, where the physical accuracy of the 3D models is of less interest. However, the approach is quite tedious and time-consuming, which makes it inapplicable to large-scale scene modeling. Active 3D laser scanning is an alternative for acquiring highly precise 3D models. Multiple scans, sometimes even hundreds from many different directions, are usually required to obtain complete surface information of the modeled object [2]. However, the limited range and the highly controlled illumination conditions pose a major challenge for the further application of active scanning. The development of time-of-flight (ToF) cameras overcomes the shortcomings of active laser scanning at the cost of low accuracy. With the advent of modern cameras and the rapid development of multi-view geometry, image-based modeling has become the most promising alternative to active scanning. Multi-view 3D reconstruction can be split into two steps:

  1. Camera calibration,

  2. 3D modeling from calibrated images.

In this work, it is assumed that the images have already been calibrated. The first step towards 3D geometry data acquisition is dense or sparse correspondence matching among neighboring images. After correspondence matching, a 3D point cloud with normal information can be generated as described in [4]. The benefit of using sparse feature points instead of a dense disparity map is the quick evaluation of the 3D point cloud, but the accuracy of the output of Poisson surface reconstruction [3] is dramatically affected. To optimize the Poisson surface mesh, various passive visual cues are explored and applied. The rest of this work is organized as follows: in Sect. 2, we briefly review the related work. In Sect. 3, we introduce the proposed mesh optimization method based on various visual cues. In Sect. 4, we present the experimental results with real objects and evaluate the reconstruction quality against ground-truth data. Conclusions are drawn in Sect. 5.

2 Related Work

Multi-view 3D surface reconstruction can be categorized into variational and non-variational approaches. In the variational category, surface reconstruction is formulated as an energy minimization problem. Furukawa et al. [1] proposed an accurate modeling approach that uses local photometric consistency and global visibility constraints to extract quasi-dense rectangular patches in 3D space. An iterative deformation algorithm was applied to the generated triangle mesh, and the vertices were updated with the following rule [1],

$$\begin{aligned} \partial (\mathbf {v})=-w(\mathbf {v})(\triangledown f(\mathbf {v})\cdot \mathbf {N}(\mathbf {v}))\mathbf {N}(\mathbf {v})+(\triangledown E_R\cdot \mathbf {N}(\mathbf {v}))\mathbf {N}(\mathbf {v})+(-\beta _2\triangle \mathbf {v}+ \beta _{3}\triangle ^{2}\mathbf {v}), \end{aligned}$$
(1)

where the derivative is taken along the surface normal, \(f(\mathbf {v})\) is a scalar photometric discrepancy function and \(w(\mathbf {v})\) is an adaptive weight. \(E_R\) corresponds to the silhouette consistency term, and \(\beta _2\), \(\beta _3\) are constants to avoid oscillations. This approach requires on average more than 200 iterations to reach a satisfactory surface mesh [1].

Another variational approach is based on Poisson surface reconstruction. When the oriented point cloud is accurate, Poisson surface reconstruction delivers very precise surface data, and the octree data structure guarantees efficiency in execution time and memory consumption without compromising the resolution of the surface. However, this approach does not exploit the visual cues from the objects.

The most recent variational approach, which inspires the algorithms proposed in this work, is total variation reconstruction [2], where the energy minimization problem is formulated in a convex form. Consequently, the solution is the global minimum, in contrast to other approaches. Passive visual cues are also incorporated in this approach. However, the use of voxel grids limits the resolution of the surface mesh.

3 Proposed Mesh Optimization Method

As discussed earlier, the focus of this work is on a robust and efficient image-based modeling technique. Mesh optimization with visual cues is applied to the Poisson surface mesh of the sparse or quasi-dense oriented point cloud, which is generated by the method proposed in [4]. Optimizing the Poisson-based mesh poses several challenges:

  1. The initial Poisson surface mesh can be very noisy because of the inaccurate and incomplete oriented point cloud.

  2. The topology of the triangle meshes should be preserved during the movement of the vertices.

  3. The silhouette cue alone cannot identify the concavities on the surface.

  4. The photometric cue is not applicable to homogeneous and textureless objects.

  5. The whole process should be efficient with regard to time cost.

In the following, we address these issues in detail.

3.1 One-Ring Neighborhood

Within the mesh topology, every vertex \(\mathbf {v}_i\) is connected to multiple vertices, called the one-ring neighborhood \(Nei(\mathbf {v}_i)=\{\mathbf {v}_{i1}, \mathbf {v}_{i2}, \dots \}\), as illustrated in Fig. 1. The one-ring neighborhood consists of the set of faces surrounding the vertex. The normal vector of the center vertex can be estimated by averaging the normal vectors of the enclosing faces. Moving a vertex along its normal direction does not change the topology of the triangle meshes; the adjustment of the vertices causes the surface mesh to extend or contract.
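As a minimal sketch of this normal estimate, the following pure-Python helper averages the cross-product normals of all one-ring faces of a vertex; the flat vertex-list/face-index mesh representation and the helper names are our assumptions for illustration, not part of the original method.

```python
import math

def face_normal(a, b, c):
    """Unnormalized normal of triangle (a, b, c) via the cross product."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    return [u[1]*v[2] - u[2]*v[1],
            u[2]*v[0] - u[0]*v[2],
            u[0]*v[1] - u[1]*v[0]]

def vertex_normal(vertices, faces, vi):
    """Average the normals of all faces containing vertex vi (its one-ring)."""
    n = [0.0, 0.0, 0.0]
    for f in faces:
        if vi in f:
            fn = face_normal(*(vertices[j] for j in f))
            n = [n[k] + fn[k] for k in range(3)]
    length = math.sqrt(sum(c * c for c in n))
    return [c / length for c in n]

# Tiny example: a flat fan of triangles around vertex 0 in the z=0 plane.
verts = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0)]
faces = [(0, 1, 2), (0, 2, 3), (0, 3, 4), (0, 4, 1)]
print(vertex_normal(verts, faces, 0))  # -> [0.0, 0.0, 1.0]
```

For a planar one-ring, as here, the averaged normal is exactly the plane normal; on a curved mesh it gives the smoothed per-vertex direction used for the vertex movement described above.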

Fig. 1. One-ring neighborhood

Fig. 2. Silhouette mapping functions along the normal vector \(\mathbf {N}(\mathbf {v})\)

3.2 Silhouette Consistency

Let \(M_i\) be the binary mask of image i, which segments the object from the background and is known as the silhouette. Furthermore, let \(O \subset \mathbb {R}^{3}\) be the object of interest. The projection of a surface \(S \subset O\) to the corresponding view \(P_i\) fulfills the silhouette consistency condition if it satisfies the property:

$$\begin{aligned} P_i\cdot S \in M_i. \end{aligned}$$
(2)
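The consistency test of Eq. 2 can be sketched as follows: a surface point is consistent with view i if its projection falls on a foreground pixel of the binary mask \(M_i\). The 3x4 projection matrix and the toy mask grid below are illustrative assumptions.

```python
def project(P, X):
    """Project 3D point X=(x,y,z) with a 3x4 matrix P; return pixel (u, v)."""
    Xh = list(X) + [1.0]  # homogeneous coordinates
    u, v, w = (sum(P[r][c] * Xh[c] for c in range(4)) for r in range(3))
    return (u / w, v / w)

def silhouette_consistent(P, mask, X):
    """True if the projection of X lies on a foreground ('1') pixel of the mask."""
    u, v = project(P, X)
    col, row = int(round(u)), int(round(v))
    if 0 <= row < len(mask) and 0 <= col < len(mask[0]):
        return mask[row][col] == 1
    return False  # projects outside the image -> silhouette violated

# Toy setup: orthographic-like projection dropping z, 4x4 mask with a 2x2 foreground.
P = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 1]]
mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
print(silhouette_consistent(P, mask, (1.0, 2.0, 5.0)))  # -> True
print(silhouette_consistent(P, mask, (3.0, 0.0, 5.0)))  # -> False
```

In the full method this test is repeated for every view i, and a vertex satisfies Eq. 2 only if all views agree.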

Although Eq. 2 is not invertible, the largest possible surface, defined as the “visual hull” [5], can still be determined. To formalize this as an energy minimization problem, the total energy is expressed as the sum of the energy contributions of the individual vertices \(\mathbf {v}\):

$$\begin{aligned} E(S) = \sum _{i}E_i(\mathbf {v}), \end{aligned}$$
(3)
$$\begin{aligned} E_i(\mathbf {v}) = \sum _{p \in \mathbf {N}(\mathbf {v})}\left\{ \rho (p)_{f}\left[ 1-u(p)\right] + \rho (p)_{b}u(p) \right\} , \end{aligned}$$
(4)

where p is a discrete sample along the normal vector \(\mathbf {N}(\mathbf {v})\) of vertex \(\mathbf {v}\), and \(\rho (p)_{f}\) is the silhouette foreground mapping function, which is ‘1’ inside the foreground and ‘0’ outside it. If other segmentation techniques are used, the score is a value between 0 and 1. Conversely, \(\rho (p)_{b}\) is ‘1’ inside the background and ‘0’ outside it, as shown in Fig. 2. In Eq. 4, u(p) is the indicator function, which is ‘1’ inside the object and ‘0’ outside it. The optimization algorithm used to minimize Eq. 4 is given in Algorithm 1. The basic idea is to first move all vertices lying outside the surface, i.e., not satisfying Eq. 2, inwards until Eq. 2 holds for every view. After that, all vertices lie on or inside the surface of the object. In the second step, all vertices are moved outwards as long as Eq. 2 remains true. The adaptive step size \(\triangle p\) between adjacent sample points along the normal vector is determined by the average edge length in the one-ring neighborhood, which in turn preserves the topology of the triangle meshes.

Algorithm 1
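The two-pass idea of Algorithm 1 can be sketched in one dimension along a single vertex normal. The function `inside_all_silhouettes` abstracts the per-view mask test of Eq. 2 and, like the other names here, is an illustrative assumption rather than the paper’s implementation.

```python
def optimize_vertex(pos, normal, inside_all_silhouettes, step, max_iter=100):
    """Return the outermost position along `normal` that is silhouette-consistent."""
    # Pass 1: move inwards (against the normal) until Eq. 2 holds in every view.
    it = 0
    while not inside_all_silhouettes(pos) and it < max_iter:
        pos = pos - step * normal
        it += 1
    # Pass 2: move outwards while the next step would still be consistent.
    while inside_all_silhouettes(pos + step * normal) and it < max_iter:
        pos = pos + step * normal
        it += 1
    return pos

# Toy 1-D "object": consistent wherever |x| <= 2; the normal points in +x.
consistent = lambda x: abs(x) <= 2
print(optimize_vertex(5.0, 1.0, consistent, step=1.0))  # -> 2.0 (moved inwards)
print(optimize_vertex(0.0, 1.0, consistent, step=1.0))  # -> 2.0 (moved outwards)
```

Both an exterior vertex (pass 1) and an interior vertex (pass 2) converge to the visual-hull boundary; in the full algorithm `step` would be the adaptive \(\triangle p\) derived from the one-ring edge lengths.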

3.3 Photometric Consistency

Photometric consistency is a good cue for handling concavities on the surface. As before, a photometric consistency map function \(\rho (p)\) along the normal vector \(\mathbf {N}(\mathbf {v})\), depicted in Fig. 3, is applied to formulate the energy minimization function. The energy contribution of each vertex is given as

$$\begin{aligned} E(\mathbf {v}_i)=\sum _{p \in \mathbf {N}(\mathbf {v}_i)}\rho (p)\cdot \triangledown u(p), \end{aligned}$$
(5)

where u(p) is again the object indicator function, i.e., ‘1’ inside the object and ‘0’ outside it.

Fig. 3. The mapping function along the normal vector based on photometric consistency

Photometric consistency relies on the Lambertian assumption that the appearance of a 3D vertex \(\mathbf {v}\) is the same in different views. In a realistic capturing setup, however, such a prerequisite is not guaranteed and is also hard to realize. For that reason, most image-based reconstruction methods neglect these effects and try to compensate for the color variations using more sophisticated normalized comparison criteria, such as the NCC (normalized cross-correlation) scheme. To estimate the photometric consistency score of the current sample point, a reference camera \(P_i\) is determined whose principal plane tends to be perpendicular to the normal vector \(\mathbf {N}(\mathbf {v})\). Since the nonlinear distortion in the reference view is relatively small, a square window around the projected sample point in the reference view is selected and split into two triangles T for photometric consistency evaluation. Using barycentric coordinates, each discrete pixel within the square window is unprojected to a 3D point inside the square patch. These unprojected 3D points \(p_s\) are then reprojected as \(p_j\) into the compared views \(P_j\) with sub-pixel accuracy. The NCC between the reference image \(I_i\) and a comparison image \(I_j\) can now be computed as:

$$\begin{aligned} \phi NCC^{i,j}=\frac{1}{n}\sum _{s\in T}\frac{(I_i(p_i^s)-\mu _i)(I_j(p_j^s)-\mu _j)}{\sigma _i \sigma _j}, \end{aligned}$$
(6)

where

$$\begin{aligned} p_i^{0\dots n-1}=\mathbf {P}_i\, p_s^{0 \dots n-1},\;\;\; \mu _i= \frac{1}{n}\sum _{s=0}^{n-1}I_i(p_i^s),\;\;\;\sigma _i=\sqrt{\frac{1}{n}\sum _{s}(I_i(p_i^s)-\mu _i)^2}. \end{aligned}$$
(7)
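Eqs. 6 and 7 can be sketched as follows, assuming the patch samples from the reference and the comparison view have already been gathered into two equal-length intensity lists; the function name `ncc` is ours, not from the paper.

```python
import math

def ncc(samples_i, samples_j):
    """Normalized cross-correlation (Eqs. 6-7) of two intensity sample lists."""
    n = len(samples_i)
    mu_i = sum(samples_i) / n
    mu_j = sum(samples_j) / n
    sigma_i = math.sqrt(sum((x - mu_i) ** 2 for x in samples_i) / n)
    sigma_j = math.sqrt(sum((x - mu_j) ** 2 for x in samples_j) / n)
    return sum((a - mu_i) * (b - mu_j)
               for a, b in zip(samples_i, samples_j)) / (n * sigma_i * sigma_j)

a = [10.0, 20.0, 30.0, 40.0]
print(round(ncc(a, a), 6))                     # -> 1.0 (identical patches)
print(round(ncc(a, [2 * x + 5 for x in a]), 6))  # -> 1.0 (invariant to gain/bias)
print(round(ncc(a, list(reversed(a))), 6))     # -> -1.0 (anti-correlated)
```

The gain/bias invariance shown in the second call is exactly why NCC is preferred over raw intensity differences under non-Lambertian color variations.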

Classical photometric consistency estimation generally yields noisy measurements due to homogeneity or repetition of the texture pattern, which can result in a noisy reconstruction. For that reason, a more elaborate voting scheme is applied to increase the accuracy of the photometric consistency computation [2]. The idea is to evaluate the contribution of each camera as shown in Eq. 8. A vote is accepted only if the optimum is reached at the current sample point. This methodology leads to a considerable increase in the precision of the corresponding photometric consistency map function.

$$\begin{aligned} VOTE^{i,j} =\left\{ \begin{matrix} 1, \;\;\;\;\phi NCC^{i,j} \ge 0.9,\\ 0, \;\;\;\;\phi NCC^{i,j} < 0.9,\\ \end{matrix}\right. \end{aligned}$$
$$\begin{aligned} Score = \sum _j VOTE^{i,j}. \end{aligned}$$
(8)

The photometric consistency map function \(\rho (p)\) illustrated in Fig. 3 is computed with Eq. 9 at the discrete sample points \(p_s\) along the normal vector \(\mathbf {N}(\mathbf {v}_i)\); the highlighted positions in Fig. 3 indicate the adapted vertices on the surface. Since the photometric consistency evaluation is independent for each vertex, the process can be parallelized to run faster.

$$\begin{aligned} \rho (p_s, \mathbf {v}_i) = 1-\frac{Score}{Number\;of\;visible\;views} \end{aligned}$$
(9)
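The voting threshold of Eq. 8 and the map value of Eq. 9 reduce to a few lines; the helper `rho` below is an illustrative stand-in that takes the per-view NCC scores for one sample point directly.

```python
def rho(ncc_scores, threshold=0.9):
    """rho(p_s, v_i) = 1 - (accepted votes) / (number of visible views), Eqs. 8-9."""
    votes = sum(1 for s in ncc_scores if s >= threshold)  # VOTE of Eq. 8
    return 1.0 - votes / len(ncc_scores)

# Sample point seen by 4 views; 3 of them agree photometrically.
print(rho([0.95, 0.92, 0.91, 0.3]))   # -> 0.25
# Perfect agreement drives rho to 0, marking a point on the surface.
print(rho([0.99, 0.97, 0.95, 0.93]))  # -> 0.0
```

The map function thus approaches 0 where the visible views photometrically agree, which is exactly the minimum sought by the energy in Eq. 5.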

3.4 A Combinatorial Consistency

As shown in Fig. 4, the silhouette consistency optimization restores the object within a confined region, and the contour of the object can be precisely reconstructed, at the cost of losing details and concavities on the surface. The photometric consistency optimization, in contrast, works well for recovering the structural details on the surface; but if some part of the surface has already been lost during Poisson surface reconstruction, it can never be recovered.

Fig. 4. Visual effects when applying silhouette consistency and photometric consistency separately. The 1st row: silhouette consistency only. The 2nd row: photometric consistency only.

Furthermore, the two kinds of optimization cannot simply be applied one after the other: if the photometric cue is applied first, the subsequent silhouette-based optimization will fill all the concavities, while the photo-consistency based approach is useless when operated on vertices far away from the object surface. To overcome this problem, the silhouette-based optimization is modified to incorporate a multi-view photometric term during optimization.

The goal is to minimize the same energy function, Eq. 3, as in the silhouette-based optimization, whereas the mapping functions \(\rho (p_s)_{f}\) and \(\rho (p_s)_{b}\) are realized differently in this approach. Previously, only the silhouette information was considered to determine the mapping functions. Here, the state of each discrete sample point \(p_s\) in 3D space, i.e., whether it is inside or outside the object, is determined by measuring the photometric consistency score along the visual rays and exploiting the silhouette-consistent shape. In detail, after the initial judgement that the sample point \(p_s\) is inside the object, the modified photometric consistency optimization follows. As described in Sect. 3.3, the NCC score is calculated at the discrete sample points \(p_s^t\) under a voting scheme. The sample point position \(p_s^{t_{max}}\) that corresponds to the maximum voted NCC score is considered the searched point on the surface. If the updated point position \(p_s^{t_{max}}\) is not between the reference camera center and the sample point \(p_s\), the point \(p_s\) is considered outside the surface and the concavity is recovered. Accordingly, the mapping functions \(\rho (p_s)_f\) and \(\rho (p_s)_b\) are updated. A point \(p_s\) marked as exterior is moved further inwards along the normal vector. The whole process runs iteratively until the total energy converges.

Adaptation functions, smoothing and decimation are also used within the optimization approach, which ensures that the topology of the triangle meshes is preserved. However, smoothing has to be stopped when the triangle meshes approach the ground truth surface, where the energy function is close to zero. The whole process is summarized in Algorithm 2.

Algorithm 2
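A hedged 1-D sketch of the interior/exterior test described above: a sample point \(p_s\), initially judged inside by the silhouettes, is re-examined by searching the maximum-vote NCC position \(t_{max}\) along the viewing ray; if that position does not lie between the reference camera center and \(p_s\), the point is reclassified as exterior (a recovered concavity). The `vote_score` function is an illustrative stand-in for the real NCC voting.

```python
def classify_sample(cam_t, sample_t, candidate_ts, vote_score):
    """Return 'interior' or 'exterior' for a sample at ray parameter sample_t."""
    t_max = max(candidate_ts, key=vote_score)  # max-voted surface position
    lo, hi = sorted((cam_t, sample_t))
    # Surface between camera and sample -> the sample lies behind it (interior).
    return 'interior' if lo <= t_max <= hi else 'exterior'

# Toy ray: votes peak at t=4 (the true surface); camera at t=0.
votes = lambda t: -abs(t - 4)
print(classify_sample(0, 6, [1, 2, 3, 4, 5, 6], votes))  # -> interior
print(classify_sample(0, 3, [1, 2, 3, 4, 5, 6], votes))  # -> exterior
```

The sample at t=3 sits in front of the photometric surface at t=4, so it is marked exterior and the mapping functions \(\rho (p_s)_f\), \(\rho (p_s)_b\) are updated accordingly.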

4 Experiments and Results

In our experiments, we used four different datasets to evaluate the performance of the proposed reconstruction approach. The two test datasets templeRing and dinoRing are standard benchmark objects provided by [6]. Furthermore, to evaluate the reconstruction quality quantitatively, two more datasets, totenKopf and teddy, with ground truth data are tested as well. The ground truth data were captured by a Steinbichler COMET L3D 5M structured-light 3D scanner [7] with a measurement deviation of \(10\,\upmu \mathrm{m}\) for a measurement field of 100 mm. Each test object is scanned multiple times from different angles to provide an accurate 3D mesh in high resolution. The objects under observation all exhibit challenging properties such as concavities, homogeneity, shadowing and texturelessness. The captured ground truth of the objects is shown in the second row of Fig. 5(c) and (d). The projection matrices of the datasets totenKopf and teddy are acquired by camera calibration with a known calibration object. A summary of the available datasets is shown in Table 1.

The datasets templeRing and teddy are rich in texture, which makes it easy to find features in the images. As expected, the point cloud generated by the framework proposed in [4] covers most of the surface of the object. Simple averaging of the visible visual rays is used to estimate the normal vector of a vertex, since sufficient visibility in multiple views is available. Thus, the Poisson surface mesh is close to the ground truth data, and with the combinatorial consistency optimization it is adapted further towards the ground truth.

Fig. 5. Experimental datasets. TempleRing/dinoRing: The 1st row - Selected views of the objects. The 2nd row - Oriented point clouds. The 3rd row - Initial Poisson surface. The 4th row - Optimized surface. Teddy/totenKopf: The 1st row - Selected views of the objects. The 2nd row - Ground truth data. The 3rd row - Oriented point clouds. The 4th row - Initial Poisson surface. The 5th row - Optimized surface.

The datasets dinoRing and totenKopf are textureless and homogeneous, which makes it difficult to extract features from the images. The generated point cloud is almost empty around homogeneous, textureless areas. Even with robust normal vector estimation using PCA, the initial Poisson surface mesh is noisy due to the sparsity of the point cloud. The silhouette cue helps to recover the missing regions of the Poisson surface, and the photometric cue helps to recover the details on the surface. However, the areas near the teeth of the totenKopf were not fully reconstructed due to the narrow indentation. To capture such narrow indentations, the window size of the photometric consistency optimization plays a vital role: the smaller the window size, the better narrow concavities are reconstructed. The average number of iterations required for optimization is about 50.

Table 1. Datasets
Fig. 6. Accuracy measurement for the “teddy” dataset.

Fig. 7. Accuracy measurement for the “totenKopf” dataset.

Fig. 8. Completeness measurement for the “teddy” and “totenKopf” datasets.

There are no surface meshes reconstructed at the bottom of the objects, as no images were captured of that region; therefore, in our reconstructed surfaces, vertices close to the bottom will be far away from the ground truth. To evaluate the accuracy and completeness of the reconstructed results, the methods introduced in [6] are applied. The distance from the vertices of the reconstructed meshes to the nearest vertices of the ground truth meshes is calculated for the accuracy measurement, whereas the distance from the vertices of the ground truth meshes to the nearest vertices of the reconstructed meshes is used for the completeness measurement. As depicted in Figs. 6 and 7, the x-axis represents the tolerated distance threshold for the accuracy measurement and the y-axis represents the portion of the reconstructed vertices within that threshold. The plots show that \(90\,\%\) of the vertices on the object teddy deviate from the ground truth surface meshes by less than 0.5 mm, whereas \(90\,\%\) of the vertices on the object totenKopf deviate from the ground truth surface meshes by between 0.95 mm and 1 mm. Intuitively, vertices on the ground truth meshes that have no proper nearest points on the reconstructed meshes are regarded as “not covered”. Although the vertices on the bottom part of the objects that are not captured by the cameras greatly impact the completeness, as shown in Fig. 8, within a 1.5 mm distance deviation the reconstructed meshes cover \(88.8\,\%\) of the surface for teddy and \(87.4\,\%\) for totenKopf.
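The accuracy/completeness protocol of [6] used above can be sketched with a brute-force nearest-neighbour search; this is illustrative only, as real meshes would need a spatial index such as a k-d tree, and the point sets and thresholds below are toy values.

```python
import math

def nearest_dist(p, points):
    """Distance from point p to its nearest neighbour in `points`."""
    return min(math.dist(p, q) for q in points)

def coverage(from_pts, to_pts, threshold):
    """Fraction of from_pts whose nearest neighbour in to_pts is within threshold."""
    hits = sum(1 for p in from_pts if nearest_dist(p, to_pts) <= threshold)
    return hits / len(from_pts)

recon = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (9, 9, 9)]   # one outlier vertex
truth = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]

# Accuracy: reconstructed -> ground truth; the outlier misses the threshold.
print(coverage(recon, truth, 0.5))   # -> 0.75
# Completeness: ground truth -> reconstructed, with a looser threshold.
print(coverage(truth, recon, 1.5))   # -> 1.0
```

Swapping the argument order turns the accuracy measure into the completeness measure, mirroring the symmetric definition in the text.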

5 Conclusion

The sparsity of the oriented point cloud for textureless, homogeneous surfaces severely affects the result of Poisson surface reconstruction. In this work, we have incorporated passive visual cues from multiple camera views to improve the quality of the surface meshes. To address the problem, an energy minimization of the surface meshes is formulated in a combinatorial scheme that fulfills silhouette consistency and photometric consistency simultaneously. The experimental results demonstrate the efficiency of the proposed method. The results can be improved further by decreasing the step size along the normal vector and increasing the number of iterations, at the cost of running time and memory. The independent processing of the vertices offers the opportunity for real-time application.