
1 Introduction

Several systems in augmented reality (AR) use known markers inside the scene to register the view position and render virtual objects on the real scene. These markers can be a special pattern, such as fiducial markers, or textured objects [1]. In order to classify and register these textured objects, conventional methods use feature points [2, 3], generated by the surface texture, together with pose estimation methods such as three-point RANSAC with RGB-D frames [4] or PnP solvers [5].

These textured markers can belong to different classes of objects. Most AR systems use pose estimation of planar [2] and three-dimensional (3D) rigid objects [1]. Another class of objects used as markers is deformable objects. These objects, which change their shape over time, can be pose estimated by methods such as [6]. However, in all of these pose estimation methods, each detected feature point is simplified to a single point. As [7] shows, more information can be extracted from the captured data, such as surface normals in the case of RGB-D frames.

In this work, we introduce a novel 3D pose estimation approach for several classes of markers, based on the local 3D pose of locally planar feature points and RGB-D information. For each matched feature of the RGB-D frame, we estimate its three-dimensional pose and, based on these feature point poses, the global pose of the object is estimated. Unlike previous works, this approach has the advantage of extracting more information from each matched feature point, giving richer information to the global pose estimation process. Our local-pose-based registration and rendering results for rigid objects are shown in Fig. 1.

Our contributions in this paper are the local-pose-based pose estimation approach, the method to locally estimate the pose of each matched feature point, and the method for global pose estimation that combines our approach with the RANSAC method [8] in order to achieve near real-time performance. With the results in Sect. 4, we show that it is possible to achieve camera registration for AR systems with different classes of textured markers, and that, in our evaluation settings, our method is more robust across different RANSAC settings than conventional three-point RANSAC methods. Furthermore, if better precision or different object models are required, our local pose information can be combined with more complex methods, since it only extracts more information from the captured frames and thus does not conflict with other solutions.

Fig. 1.

Camera registration and rendering results, using planar and 3D rigid object markers.

2 Related Works

Pose estimation of known objects, such as markers, based on their surface texture feature points is a well known problem in the literature. Several works estimate the pose of known objects, rigid or deformable, using image and video stream input, such as Lepetit et al. [5], Lowe [2], and Xiang et al. [9]. On the other hand, devices that can capture RGB-D information have become widely available. These devices return depth information in real time, making the capture of 3D poses of different objects more accurate or easier.

Lima et al. [1] and Lee et al. [10] achieve pose estimation based on the 3D positions of extracted object feature points using RGB-D frames. Marcon et al. [7] and Choi et al. [11] also estimate the pose of rigid objects based on RGB-D information and feature points; however, similarly to our work, they also extract normal information from the surface to be used in the pose estimation, using principal component analysis (PCA) [7] or vector products [11] of the surface around the feature point.

Also, Jordt et al. [12] and Pritchard et al. [6] fully estimate the pose of deformable objects based on the 3D points captured by RGB-D data and deformation models. In these works, the pose estimation does not assume that the feature points keep a fixed positional relationship to one another.

Besides object pose estimation, works on camera registration for environment localization and mapping, such as SLAM [13], also take approaches similar to our method, since they need to register a set of known points in space to newly observed ones. Works such as Park et al. [14] achieve camera registration based on ICP methods. On the other hand, Henry et al. [4], Taguchi et al. [15] and Lee et al. [13] perform camera registration based on RGB-D information and point correspondences. Furthermore, these works also employ the RANSAC method in order to achieve near real-time processing, as does our work.

However, none of the works presented in this section explicitly extracts the local pose of each detected feature point. Our work stands out in this aspect, as we explicitly extract more information from each matched feature point in order to estimate the pose of different classes of objects used as AR markers.

3 Method

In this section we describe the method applied to obtain the object 3D pose. We separate the estimation method into three parts: the pre-processing of object feature points, the local pose estimation, and the global pose estimation of rigid objects based on the local information. As explained in Sect. 1, our pose estimation approach is based on the local pose estimation of each feature point, giving us a rough estimation of the global pose.

Our assumptions are that the feature points over the object surface are locally planar and that feature point orientations are generated by the local surface texture gradient, not by the shape of the object. We also consider that the projection matrix \(\mathbf {K}_{c}\) of the RGB-D camera and the image \(I_{tex}\), which contains the surface texture of the object to be estimated, are known. Likewise, the mapping function \(f_{pos}(x_i,y_i)\), which maps 2D coordinates in the image plane to 3D positions over the object surface in global coordinates, and \(f_{vec}(x_i,y_i)\), which returns the normal vector of the surface at the given image coordinates, are also known.

3.1 Pre-process of Object Feature Points

The objective of this pre-processing part is to extract robust features of the object, \(F_{obj}\), and to compute the three-dimensional information of each feature point in order to use it in the local pose estimation described in Sect. 3.2. The extraction of robust features is done by applying an approach similar to ASIFT [16], extracting features that are robust to background changes and affine transformations. Initially, the object is rendered from different camera positions, with transformation \(\mathbf {E}_{cam}\), as in [16]. Since we expect the features to be locally planar in this approach, the object is simplified as an image plane that contains the surface texture \(I_{tex}\) of the object. With this, even features of 3D objects or deformable objects are pre-processed in the same way as planar objects.

Then, for each camera position, we extract its feature points \(F_{view}\). Also, the 3D information of each extracted feature point is calculated. First, the 3D position of point \(\mathbf {p}_{fts}\) is calculated using a plane-line intersection between the rendered object plane and the line that passes through the camera position \(\mathbf {p}_{cam}\) and the feature point 3D position \(\mathbf {p}^{3D}\), detected at 2D position \(p^{img}\) of the rendered image plane. This 3D position related to the feature point is calculated as

$$\begin{aligned} \mathbf {p}^{3D} = \mathbf {E}_{cam}^{-1}\cdot (z_{img} \cdot (\mathbf {K}_{c}^{-1} p^{img})), \end{aligned}$$
(1)

where \(z_{img}\) is the distance from the image plane to the camera position, and \(\mathbf {E}_{cam}\) is the rendered camera transformation matrix.
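For illustration, the back-projection in (1) can be written as a short numpy sketch, shown below. This is not the MATLAB implementation used in this work; the function name and the assumption that \(\mathbf {K}_{c}\) is a \(3 \times 3\) intrinsic matrix and \(\mathbf {E}_{cam}\) a \(4 \times 4\) homogeneous camera transformation are ours.

```python
import numpy as np

def backproject_feature(p_img, z_img, K_c, E_cam):
    """Back-project a 2D feature position to a 3D point in object coordinates,
    following Eq. (1). K_c is assumed 3x3, E_cam a 4x4 homogeneous matrix."""
    p_h = np.array([p_img[0], p_img[1], 1.0])      # homogeneous pixel coordinates
    p_cam = z_img * (np.linalg.inv(K_c) @ p_h)     # 3D point in the camera frame
    p_cam_h = np.append(p_cam, 1.0)                # to homogeneous coordinates
    return (np.linalg.inv(E_cam) @ p_cam_h)[:3]    # apply E_cam^{-1}, Eq. (1)
```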

After calculating the 3D position, a 3D vector \(\mathbf {o}_{fts}\) over the texture image plane, pointing toward the feature's 2D orientation, is calculated. This is done by calculating the three-dimensional position of a point \(\mathbf {p}_o\) in a similar way to \(\mathbf {p}_{fts}\), but instead of using the rendered image position \(p^{img}\) of the feature, we use the 2D position

$$\begin{aligned} p_o^{img} = p^{img} + s \mathbf {v}_{ori}, \end{aligned}$$
(2)

where \(s\) is the size of the feature patch that was used to extract the feature point descriptor, and \(\mathbf {v}_{ori}\) is the two-dimensional orientation vector of the feature. Then, \(\mathbf {o}_{fts}\) is calculated as

$$\begin{aligned} \mathbf {o}_{fts} = (\mathbf {p}_o - \mathbf {p}_{fts})/ \Vert (\mathbf {p}_o - \mathbf {p}_{fts})\Vert \end{aligned}$$
(3)

Finally, the normal of the feature \(\mathbf {n}_{fts}\) is calculated. Since the object is converted into a planar image, the feature point normal is the same as the plane normal, \(\mathbf {n}_{fts} = \mathbf {n}_{plane}\).

At this point, for each camera position \(\mathbf {p}_{cam^i}\), a feature point collection \(F_{view^i} = \{ F_{view^i}^0, \ldots , F_{view^i}^n\}\) is listed with its 3D information. After listing the collections of all the views, differently from [16], each listed feature point collection \(F_{view^i}\) at view position i is compared with the feature point collections \(F_{view^j}\) of the other views j using feature matching methods, and only features that appear in more than \(n_v\) views are kept. With this, feature points \(F_{obj}\) that are robust across views and backgrounds are filtered to be used in the pose estimation. Each feature point, \(F_{obj}^i = \{ \mathbf {p}_{fts}^i, \mathbf {n}_{fts}^i, \mathbf {o}_{fts}^i, dsc_{fts}^i \}\), contains the three-dimensional position \(\mathbf {p}^i_{fts}\), normal \(\mathbf {n}_{fts}^i\), orientation vector \(\mathbf {o}_{fts}^i\) and its descriptor \(dsc_{fts}^i\). Figure 2 illustrates this information.
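The view-consistency filtering can be sketched as follows; this is only an illustration under the assumption of a generic descriptor-matching predicate (here called match), not the implementation used in this work.

```python
def filter_robust_features(views, match, n_v=3):
    """Keep features that are re-found in at least n_v other rendered views.
    `views` is a list of per-view feature lists; `match(f, g)` is any
    descriptor-matching predicate (placeholder)."""
    robust = []
    for i, feats_i in enumerate(views):
        for f in feats_i:
            hits = sum(any(match(f, g) for g in feats_j)
                       for j, feats_j in enumerate(views) if j != i)
            if hits >= n_v:
                robust.append(f)
    return robust
```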

Fig. 2.

Local pose estimation method. \(T_{fts}\) is calculated to match p, \(\mathbf {n}\) and \(\mathbf {o}\) of the pre-processed feature with the observed feature.

As a last task of this pre-processing, all the calculated 3D information is remapped onto the original object using the functions \(f_{pos}(x_i,y_i)\) and \(f_{vec}(x_i,y_i)\). In the case of planar objects, this information does not change, but for 3D rigid objects it is remapped so that it lies on the object surface. In the case of the pose estimation in Fig. 1, the features are mapped onto a cylindrical object with

$$\begin{aligned} \begin{array}{l} f_{pos}(x_i, y_i) = [ r \sin (\pi x_i/W), ~ y_i, ~ r \cos ( \pi x_i/W) ]^\intercal \\ f_{vec}(x_i, y_i) = [ \sin ( \pi x_i/W) , ~ 0, ~ \cos ( \pi x_i/W) ]^\intercal \end{array} \end{aligned}$$
(4)

where \(r\) is the radius of the object and \(W\) is the total width of the texture image plane.
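As a small concrete example, the cylindrical mapping of (4) can be written directly as a pair of numpy functions; this sketch only restates the equation and is not taken from the original implementation.

```python
import numpy as np

def f_pos(x, y, r, W):
    """Position on the cylinder surface for texture coordinate (x, y), Eq. (4)."""
    return np.array([r * np.sin(np.pi * x / W), y, r * np.cos(np.pi * x / W)])

def f_vec(x, y, W):
    """Surface normal of the cylinder at texture coordinate (x, y), Eq. (4)."""
    return np.array([np.sin(np.pi * x / W), 0.0, np.cos(np.pi * x / W)])
```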

3.2 Local Pose Estimation

In order to estimate the local pose of a feature point using RGB-D information, feature points are first extracted from the newly captured RGB frame and compared with \(F_{obj}\), extracted in the pre-processing part, using a conventional feature matching method. For each matched feature point, the three-dimensional position \(\mathbf {p}_{obs}\) on the surface of the object that generated the feature is calculated from its two-dimensional position \(p_f\) and its depth \(d\), obtained from the depth map.

Then, similarly to (2), using the patch size \(s\) and the two-dimensional orientation vector \(\mathbf {v}_{ori}\) returned by the feature extraction, the three-dimensional position \(\mathbf {p}_{o}\) of the object point seen by the RGB frame at 2D position

$$\begin{aligned} p_o^{img} = p_f + s \mathbf {v}_{ori} \end{aligned}$$
(5)

is calculated. From \(\mathbf {p}_o\), obtained from \(p_{o}^{img}\) and its depth value, the vector that points toward the feature orientation over the object surface

$$\begin{aligned} \mathbf {o}_{obs} = (\mathbf {p}_o - \mathbf {p}_{obs})/ \Vert \mathbf {p}_o - \mathbf {p}_{obs} \Vert \end{aligned}$$
(6)

is calculated.

Since the feature is considered to be locally planar, the cross product of any two non-collinear vectors that belong to the feature patch gives the normal of the feature point, \(\mathbf {n}_{obs}\). Thus, we calculate the feature normal as the cross product of \(\mathbf {o}_{obs}\) with another vector over the observed surface. In order to guarantee that the two vectors are not collinear, a 3D point \(\mathbf {p}_{ort}\) over the surface is calculated similarly to (5), from the 2D point of the RGB frame at position \(p_{ort}^{img} = p_f + s \mathbf {v}_{ort}\), where \(\mathbf {v}_{ort}\) is a vector in RGB frame coordinates orthogonal to \(\mathbf {v}_{ori}\). Then, we calculate \(\mathbf {n}_{obs}\) as

$$\begin{aligned} \mathbf {n}_{obs} = \mathbf {o}_{obs} \times \left( (\mathbf {p}_{ort} - \mathbf {p}_{obs}) / \Vert \mathbf {p}_{ort} - \mathbf {p}_{obs} \Vert \right) \end{aligned}$$
(7)

where \(\mathbf {p}_{ort}\) is the point over the object surface calculated by \(p_{ort}^{img}\) and its depth value.
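A minimal numpy sketch of (5)-(7) is given below; it assumes a helper to_3d(p) that looks up the depth at pixel p and back-projects it to a 3D point, and a particular choice of \(\mathbf {v}_{ort}\) as \(\mathbf {v}_{ori}\) rotated by 90 degrees. Neither assumption comes from the original implementation.

```python
import numpy as np

def local_frame_from_depth(p_f, v_ori, s, to_3d):
    """Observed orientation vector o_obs and normal n_obs of a matched feature,
    following Eqs. (5)-(7). `to_3d(p)` back-projects pixel p using its depth."""
    v_ort = np.array([-v_ori[1], v_ori[0]])        # 2D vector orthogonal to v_ori
    p_obs = to_3d(p_f)                             # 3D point of the feature
    p_o   = to_3d(p_f + s * v_ori)                 # Eq. (5)
    p_ort = to_3d(p_f + s * v_ort)
    o_obs = (p_o - p_obs) / np.linalg.norm(p_o - p_obs)          # Eq. (6)
    w     = (p_ort - p_obs) / np.linalg.norm(p_ort - p_obs)
    n_obs = np.cross(o_obs, w)                                   # Eq. (7)
    return p_obs, o_obs, n_obs
```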

With \(\mathbf {p}_{obs}\), \(\mathbf {o}_{obs}\) and \(\mathbf {n}_{obs}\), it is possible to calculate the pose of the feature point, \(\mathbf {T}_f\), similarly to [17] and [11]. Here, we calculate this transformation as a combination of three transformation matrices \(\mathbf {T}_{r1}\), \(\mathbf {T}_{r2}\) and \(\mathbf {T}_{t}\). First, \(\mathbf {T}_{r1}\) rotates the original feature normal \(\mathbf {n}_{fts}\) to match \(\mathbf {n}_{obs}\). This is done by a rotation around axis \(\mathbf {r}_{a1}\) by angle \(a_1\), with

$$\begin{aligned} \begin{array}{l} \mathbf {r}_{a1} = \mathbf {n}_{obs} \times \mathbf {n}_{fts} \\ a_1 = \arccos (\mathbf {n}_{obs} \cdot \mathbf {n}_{fts}) \end{array} \end{aligned}$$
(8)

Then \(\mathbf {T}_{r2}\) rotates the new coordinate system so that \(\mathbf {T}_{r1}\mathbf {o}_{fts}\) matches \(\mathbf {o}_{obs}\). As in (8), \(\mathbf {T}_{r2}\) is calculated as a rotation around axis \(\mathbf {r}_{a2}\) by angle \(a_2\), with \( \mathbf {r}_{a2} = \mathbf {o}_{obs} \times (\mathbf {T}_{r1} \cdot \mathbf {o}_{fts})\) and \(a_2 = \arccos (\mathbf {o}_{obs} \cdot (\mathbf {T}_{r1} \cdot \mathbf {o}_{fts}))\). Finally, \(\mathbf {T}_t\) is a translation of the coordinate system to the observed point \(\mathbf {p}_{obs}\), calculated as \( \mathbf {t} = \mathbf {p}_{obs} - \mathbf {T}_{r2} \cdot \mathbf {T}_{r1} \cdot \mathbf {p}_{fts}\).

The final transformation matrix of the feature pose \(\mathbf {T}_f\) is given by \(\mathbf {T}_f = (\mathbf {T}_{t} \cdot \mathbf {T}_{r2} \cdot \mathbf {T}_{r1})\). Figure 2 illustrates this pose estimation and Fig. 3 shows the poses obtained for each feature of a planar object.
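For concreteness, the composition of \(\mathbf {T}_{r1}\), \(\mathbf {T}_{r2}\) and \(\mathbf {T}_{t}\) can be sketched with Rodrigues' rotation formula as below. This is an illustrative sketch rather than the original implementation; the cross-product order is chosen so that the first rotation maps \(\mathbf {n}_{fts}\) onto \(\mathbf {n}_{obs}\) and the second maps \(\mathbf {T}_{r1}\mathbf {o}_{fts}\) onto \(\mathbf {o}_{obs}\).

```python
import numpy as np

def rotation_about_axis(axis, angle):
    """Rodrigues' formula: rotation matrix for the given axis (any length) and angle."""
    norm = np.linalg.norm(axis)
    if norm < 1e-12:                  # degenerate case: vectors parallel or antiparallel
        return np.eye(3)
    kx, ky, kz = axis / norm
    K = np.array([[0, -kz, ky], [kz, 0, -kx], [-ky, kx, 0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def feature_pose(p_fts, n_fts, o_fts, p_obs, n_obs, o_obs):
    """Local feature pose T_f = T_t * T_r2 * T_r1 as a 4x4 homogeneous matrix."""
    R1 = rotation_about_axis(np.cross(n_fts, n_obs),
                             np.arccos(np.clip(np.dot(n_fts, n_obs), -1.0, 1.0)))
    o1 = R1 @ o_fts                   # pre-processed orientation after T_r1
    R2 = rotation_about_axis(np.cross(o1, o_obs),
                             np.arccos(np.clip(np.dot(o1, o_obs), -1.0, 1.0)))
    T_f = np.eye(4)
    T_f[:3, :3] = R2 @ R1
    T_f[:3, 3] = p_obs - R2 @ R1 @ p_fts    # translation t
    return T_f
```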

Fig. 3.

Estimated local poses of matched features returned by our local pose estimation process. The coordinate system returned by each feature point is drawn over its position.

3.3 Global Pose Estimation

In this approach, the global pose of a 3D rigid object is estimated based on the local poses of the matched features, \(\mathbf {T}_{fc} = \{ \mathbf {T}_f^0, \ldots , \mathbf {T}_f^n \}\). Each pose returned by a feature point provides an estimate of the global pose. In order to achieve processing times close to real time, we also apply a variant of the RANSAC approach [8] in our global pose estimation. As in [8], initially a random subset S1 containing a percentage \( p_{subset} \) of the found feature points is selected. Then, pose transformation matrices \(\mathbf {T}_{f}^r\) are randomly drawn from \(\mathbf {T}_{fc}\).

For each point in S1, the quadratic error of \(\mathbf {T}_{f}^r\), \(\epsilon ^i = \Vert \mathbf {T}_{f}^r \cdot \mathbf {p}_{fts}^i - \mathbf {p}_{obs}^i \Vert ^2 \), is calculated as the squared distance between the observed position of the feature in S1 and its position predicted by \(\mathbf {T}_{f}^r\), where \(\mathbf {p}_{fts}^i\) is the original position of the feature point observed at \(\mathbf {p}_{obs}^i\). Then, the number of points \(n_{\tau }\) whose error is smaller than a threshold \(\tau _{error}\) is counted.

However, unlike [8], we do not stop the search at the first \(\mathbf {T}_{f}\) whose \(n_{\tau }\) is larger than a percentage \( p_{found} \) of S1. Instead, we keep searching until we find \(n_{mean}\) transformations \(\mathbf {T}_{RANSAC} = \{ \mathbf {T}_0, \ldots , \mathbf {T}_{n_{mean}} \}\) that satisfy \((n_{\tau } > p_{found} \cdot p_{subset} \cdot n)\). Then, these transformations are converted to a set of translation vectors \(\mathbf {t}_f^i\) and quaternions \(q_f^i\) [17]. Finally, the global object pose \(\mathbf {T}_{obj}\) is calculated as the transformation matrix obtained from the mean of the translation vectors \(\mathbf {t}_f\) and quaternions \(q_f\).

This change is due to the nature of the error introduced by our approach. Since \(\mathbf {o}_{fts}\) and \(\mathbf {o}_{obs}\) are both calculated from the projection of the object surface texture onto the captured image, small projection errors of the feature point orientation vector can occur, as can be seen in Fig. 3.

In case the RANSAC method cannot find \(n_{mean}\) valid transformations within \(N_{trial}\) trials, our global pose detection method returns the transformation \(\mathbf {T}_{obj}\) that has the minimal error among the transformations sampled in the \(N_{trial}\) trials. This is the transformation that minimizes the sum of the quadratic errors, calculated by

$$\begin{aligned} \mathbf {T}_{obj} = \underset{{ \mathbf {T}_f }}{\arg \! \min } \; \sum _{i = 1}^n \Vert \mathbf {T}_f \cdot \mathbf {p}_{fts}^i - \mathbf {p}_{obs}^i \Vert ^2 \end{aligned}$$
(9)
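A sketch of this global estimation step is given below; it assumes the local poses are 4x4 matrices and uses SciPy only for the quaternion conversions, with a simple sign-aligned quaternion mean standing in for the averaging of [17]. It is an illustration of the procedure described above, not the original MATLAB code.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def global_pose(T_fc, p_fts, p_obs, p_subset=0.7, p_found=0.7,
                tau_error=15.0**2, n_mean=8, n_trial=200, rng=None):
    """RANSAC-like global pose estimation (Sect. 3.3). T_fc: list of 4x4 local
    poses; p_fts, p_obs: (n, 3) arrays of model and observed feature positions."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(p_fts)
    s1 = rng.choice(n, size=max(1, int(p_subset * n)), replace=False)   # subset S1
    accepted, best_T, best_err = [], None, np.inf

    for _ in range(n_trial):
        T = T_fc[rng.integers(len(T_fc))]              # random candidate T_f^r
        pred = (T[:3, :3] @ p_fts.T).T + T[:3, 3]      # model points under T
        err = np.sum((pred - p_obs) ** 2, axis=1)      # quadratic errors
        if np.sum(err[s1] < tau_error) > p_found * len(s1):
            accepted.append(T)
        if err.sum() < best_err:                       # track argmin of Eq. (9)
            best_err, best_T = err.sum(), T
        if len(accepted) >= n_mean:
            break

    if len(accepted) < n_mean:                         # fallback, Eq. (9)
        return best_T

    t_mean = np.mean([T[:3, 3] for T in accepted], axis=0)
    quats = np.array([Rotation.from_matrix(T[:3, :3]).as_quat() for T in accepted])
    quats[quats @ quats[0] < 0] *= -1                  # align quaternion signs
    q_mean = quats.mean(axis=0) / np.linalg.norm(quats.mean(axis=0))
    T_obj = np.eye(4)
    T_obj[:3, :3] = Rotation.from_quat(q_mean).as_matrix()
    T_obj[:3, 3] = t_mean
    return T_obj
```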

4 Experimental Results

In this section we show the results of our local pose estimation for object-based augmented reality, as well as an experimental evaluation of the approach. The entire pose detection method was implemented in MATLAB. Unless explicitly stated in a subsection, the parameters used in these tests are \(n_{v} = 3\), \(n_{mean} = 8\), a RANSAC error threshold \(\tau _{error} = 225 ~ mm^2 = (15mm)^2\) for our global pose method and \(\tau _{error} = 25 ~ mm^2 = (5mm)^2\) for the three-point RANSAC, a RANSAC subset percentage \( p_{subset} = 70\,\%\) and found percentage \( p_{found} = 70\,\%\). Also, all processing time values in this paper include the feature point extraction and matching steps.

Figure 1 shows the rendering results of object registration based on planar and rigid objects using real captured data. In these images, red lines show the estimated pose of the object. Note that the accuracy of our pose estimation for planar objects decreases as the object becomes more orthogonal to the camera view. This is due to a decrease in the number of detected features and in the feature area used to estimate the normal. With less area available to estimate the local normal, the estimated vector \(\mathbf {o}_{obs}\) and its 2D orthogonal vector in Eq. (7) become shorter and are more easily affected by noise.

Also, from Fig. 1, it is possible to observe that our local pose estimation algorithm can be used to estimate the pose of rigid 3D objects, thanks to the locally planar feature point hypothesis, showing that even rounded objects can be estimated if their feature points are locally planar. Moreover, in these 3D object cases, some feature points are not visible due to self-occlusion; still, our local-pose-based approach can estimate the marker pose.

4.1 Synthesized Data Evaluation

In order to evaluate our approach for planar objects, we compared it with a conventional RANSAC approach based on the plane computed from triangles formed by every three observed points, as in [4]. For the test setup, we generated several \(1280 \times 960\) images and their corresponding depth maps with the planar object placed in different poses, using a virtual calibrated camera. The object was placed in front of the camera with random translation and rotation.

Similarly to [5], the rotation error is measured as \(E_{rot}(\%) = \Vert q_{true} - q \Vert / \Vert q\Vert \), and the translation error as \(E_{trans}(\%) = \Vert t_{true} - t\Vert / \Vert t\Vert \), where \(q_{true}\) and \(t_{true}\) are the ground truth quaternion rotation and translation of the object in each frame. We present the translation and rotation errors in Fig. 4 as box plots, across different depth map and 2D position noise levels.
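For reference, these two error measures correspond to the following straightforward computation (an illustration, assuming the ground truth and estimated quaternions share the same sign convention):

```python
import numpy as np

def rotation_error(q_true, q):
    """Relative rotation error, as in [5]: ||q_true - q|| / ||q||."""
    return np.linalg.norm(np.asarray(q_true) - np.asarray(q)) / np.linalg.norm(q)

def translation_error(t_true, t):
    """Relative translation error: ||t_true - t|| / ||t||."""
    return np.linalg.norm(np.asarray(t_true) - np.asarray(t)) / np.linalg.norm(t)
```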

Fig. 4.

Box plot evaluation and processing time of our method and the conventional three-point RANSAC. Upper row: results across different depth map noise levels. Lower row: results across 2D noise in the detected feature point positions.

Here, depth map noise is added as Gaussian noise with zero mean and standard deviation varying from \([0,\ldots ,75]\) millimeters, followed by a low-pass filter, a \(3 \times 3\) mean convolution, to simulate an AR system with noise and its filtering. Similarly, 2D noise is added to the detected feature positions in image space as Gaussian noise with zero mean and standard deviation varying from \([0,\ldots ,30]\) pixels.
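The noise model can be reproduced with a few lines of numpy/SciPy, sketched below; the function names are ours and the sketch only mirrors the description above.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def add_depth_noise(depth_mm, sigma_mm, rng=None):
    """Zero-mean Gaussian noise (mm) on the depth map, then the 3x3 mean filter."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = depth_mm + rng.normal(0.0, sigma_mm, size=depth_mm.shape)
    return uniform_filter(noisy, size=3)               # 3x3 mean convolution

def add_2d_noise(points_px, sigma_px, rng=None):
    """Zero-mean Gaussian noise (pixels) on detected 2D feature positions."""
    rng = np.random.default_rng() if rng is None else rng
    return points_px + rng.normal(0.0, sigma_px, size=points_px.shape)
```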

From these graphs, observe that for small depth map noise our method has lower accuracy and higher variation than the conventional three-point RANSAC method. This is because our method is based on local information, where small depth noises can generate large errors in the global pose, especially in rotation. However, as the noise increases, our method achieves better accuracy and lower variance than the conventional method. For the 2D noise case, our approach is more accurate than the three-point RANSAC for small 2D errors, since the surface around the feature point itself does not change.

Furthermore, from the processing time plots in the last column of Fig. 4, it is possible to observe that our method shows only a slow increase in processing time across both types of noise. In contrast, the processing time of the conventional three-point method rapidly increases with both types of noise. This is due to failures of the RANSAC search.

In this setup, the RANSAC error thresholds are fixed at values empirically found to return good pose estimates across different poses. As the error increases, the number of rejected RANSAC tests also increases, leading to more trials. Since our method is based on the local information of each detected feature point, the total number of candidate trials equals the number of detected points \(n_{f}\), and induced errors affect only individual points. On the other hand, the three-point method uses combinations of three points, giving \( \left( {\begin{array}{c}n_{f}\\ 3\end{array}}\right) \) possible trials, and each noisy point affects a large number of them. If \(m\) points are affected by noise, only \(\left( {\begin{array}{c}n_{f} - m\\ 3\end{array}}\right) \) combinations remain uncontaminated, leading to a large number of failures and thus to a longer processing time.
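As a hypothetical illustration, with numbers that are not taken from the evaluation: with \(n_f = 50\) matched feature points, our method evaluates at most 50 candidate transformations, whereas the three-point method has \(\left( {\begin{array}{c}50\\ 3\end{array}}\right) = 19600\) possible triplets; if \(m = 10\) of those points are noisy, only \(\left( {\begin{array}{c}40\\ 3\end{array}}\right) = 9880\) triplets remain free of noisy points.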

In order to analyze the effect of the RANSAC error threshold, we evaluated the mean rotation error across depth map noise levels varying from \([0,\ldots ,75]\) millimeters (mm) and RANSAC error thresholds varying from \([5^2,\ldots ,85^2]\) \(mm^2\). We present these results and the processing times in Fig. 5.

Fig. 5.

Evaluation of our method and the three-point RANSAC, across RANSAC error threshold and depth map noise. Left two: rotation error. Right two: processing time.

From these results, it is possible to observe that the three-point RANSAC shows a sudden increase in processing time as the RANSAC error threshold decreases. The mean error also increases suddenly as the noise or the RANSAC error threshold grows, showing that the three-point RANSAC is sensitive to its settings. In contrast, our local-pose-based approach does not present this characteristic, showing an almost constant processing time as the RANSAC error threshold decreases and the noise increases, as well as a slow error increase as the noise and the RANSAC error threshold increase.

4.2 Other Results

This section shows other results obtained with our local pose approach, using it in a different way than described in Sect. 3.3. Figures 6 and 7 show these results.

Fig. 6.

Pose estimation based on a single feature point match.

Figure 6 shows our pose estimation based on a single pre-processed feature point match. The detected feature point pose is shown in the image by the rendered coordinate system, drawn with red, blue and green lines over the detected feature point position. This result shows that, even with a single matched feature point, it is possible to retrieve the pose when that feature point is observed. Such a result is impossible to obtain using conventional three-point, or multiple-point, based pose estimation. This pose estimation could be used with markers that have a small surface area, or in cases where it is hard to observe multiple feature points of the marker due to occlusions.

Fig. 7.

Deformable marker rough registration.

Figure 7 shows our registration of a deformable object used as a marker. In order to obtain these results, we do not fully detect the pose of the object as in [18]; instead, we use the local poses to obtain a rough registration. This rough registration is obtained by finding the matched original feature point nearest to the position where the object should be rendered, and rendering the object based on the pose of this feature point. In the case of the result shown in Fig. 7, a 3D model is rendered over each local pose found. A full deformable pose could be achieved by combining our results with more complex deformable models such as [6] or [18], since our method only extracts more information from the captured data. Moreover, this characteristic would also make it relatively easy to combine our approach with other solutions in the literature.

5 Conclusion

In this paper, we presented an approach, and methods that implement it, for object pose estimation based on the local pose of each matched feature point, in order to achieve camera registration for AR systems. Based on this local pose estimation, we showed that several classes of objects, planar, 3D rigid or deformable, can be used as markers for camera registration. The results also show that our approach is more robust across RANSAC settings than the conventional three-point RANSAC.

Furthermore, our local-pose-based approach is generic and does not conflict with other solutions, since it only extracts more information from the captured data. Therefore, it could easily be combined with other solutions, enabling new ways of AR registration, similar to [6, 12], or interaction, such as [19]. Future work lies in this characteristic, combining our approach with more complex tracking methods or deformable object models [18].