1 Introduction

Recently, there have been significant advances in single image 3D object pose estimation thanks to deep learning  [7, 32, 42]. However, the accuracy achieved by today’s feed-forward networks is not sufficient for many applications like augmented reality or robotics  [8, 44]. As shown in Fig. 1, feed-forward networks can robustly estimate the coarse high-level 3D rotation and 3D translation of objects in the wild (left) but fail to predict fine-grained 3D poses (right)  [48].

To improve the accuracy of predicted 3D poses, refinement methods aim at aligning 3D models to objects in RGB images. In this context, many methods train a feed-forward network that directly predicts 3D pose updates given the input image and a 3D model rendering under the current 3D pose estimate  [23, 38, 52]. In contrast, more recent methods use differentiable rendering  [27] to explicitly optimize an objective function conditioned on the input image and renderer inputs like the 3D pose  [17, 34]. These methods yield more accurate 3D pose updates because they exploit prior knowledge about the rendering pipeline.

Fig. 1. Given an initial 3D pose predicted by a feed-forward network (left), we predict deep cross-domain correspondences between real-world RGB images and synthetic 3D model renderings in the form of geometric correspondence fields (middle) that enable us to refine the 3D pose in a differentiable rendering framework (right). (Color figure online)

However, existing approaches based on differentiable rendering have significant shortcomings because they rely on comparisons in the RGB or mask space. First, methods which compare real-world images and synthetic renderings in the RGB space require photo-realistic 3D model renderings  [27]. Generating such renderings is difficult because objects in the real world are subject to complex scene lighting, unknown reflection properties, and cluttered backgrounds. Moreover, many 3D models only provide geometry but no textures or materials which makes photo-realistic rendering impossible  [39]. Second, methods which rely on comparisons in the mask space need to predict accurate masks from real-world RGB images  [17, 34]. Generating such masks is difficult even using state-of-the-art approaches like Mask R-CNN  [11]. Additionally, masks discard valuable object shape information which makes 2D-3D alignment ambiguous. As a consequence, the methods described above are not robust in practice. Finally, computing gradients for the non-differentiable rasterization operation in rendering is still an open research problem and existing approaches rely on hand-crafted approximations for this task  [14, 17, 27].

To overcome these limitations, we compare RGB images and 3D model renderings in a feature space optimized for 3D pose refinement and learn to approximate the rasterization backward pass in differentiable rendering from data. In particular, we introduce a novel network architecture that jointly performs both tasks. Our network maps real-world images and synthetic renderings to a common feature space and predicts deep cross-domain correspondences in the form of geometric correspondence fields (see Fig. 1, middle). Geometric correspondence fields hold 2D displacement vectors between corresponding 2D object points in RGB images and 3D model renderings similar to optical flow  [5]. These predicted 2D displacement vectors serve as pixel-level gradients that enable us to approximate the rasterization backward pass and compute accurate gradients for renderer inputs like the 3D pose, the 3D model, or the camera intrinsics that minimize an ideal geometric reprojection loss.

Our approach has three main advantages: First, we can leverage depth, normal, and object coordinate  [2] renderings which provide 3D pose information more explicitly than RGB and mask renderings  [23]. Second, we avoid task-irrelevant appearance variations in the RGB space and 3D pose ambiguities in the mask space  [9]. Third, we learn to approximate the rasterization backward pass from data instead of relying on a hand-crafted algorithm  [14, 17, 27].

To demonstrate the benefits of our novel 3D pose refinement approach, we evaluate it on the challenging Pix3D  [39] dataset. We present quantitative as well as qualitative results and significantly outperform state-of-the-art refinement methods in multiple metrics by up to 55% relative. Finally, we combine our refinement approach with feed-forward 3D pose estimation  [8] and 3D model retrieval  [9] methods to predict fine-grained 3D poses for objects in the wild given only a single RGB image, without requiring initial 3D poses or ground truth 3D models at runtime. To summarize, our main contributions are:

  • We introduce the first refinement method based on differentiable rendering that does not compare real-world images and synthetic renderings in the RGB or mask space but in a feature space optimized for the task at hand.

  • We present a novel differentiable renderer that learns to approximate the rasterization backward pass instead of relying on a hand-crafted algorithm.

2 Related Work

In the following, we discuss prior work on differentiable rendering, 3D pose estimation, and 3D pose refinement.

2.1 Differentiable Rendering

Differentiable rendering  [27] is a powerful concept that provides inverse graphics capabilities by computing gradients for 3D scene parameters from 2D image observations. This novel technique recently gained popularity for 3D reconstruction  [20, 33], scene lighting estimation  [1, 22], and texture prediction  [16, 49].

However, rendering is a non-differentiable process due to the rasterization operation  [17]. Thus, differentiable rendering approaches either try to mimic rasterization with differentiable operations  [26, 34] or use conventional rasterization and approximate its backward pass  [6, 14, 45].

In this work, we also approximate the rasterization backward pass but, in contrast to existing methods, do not rely on hand-crafted approximations. Instead, we train a network that performs the approximation. This idea is not only applicable to 3D pose estimation but could also benefit other tasks like 3D reconstruction, human pose estimation, or the prediction of camera intrinsics in the future.

2.2 3D Pose Estimation

Modern 3D pose estimation approaches build on deep feed-forward networks and can be divided into two groups: Direct and correspondence-based methods.

Direct methods predict 3D pose parameters as raw network outputs. They use classification  [41, 42], regression  [30, 46], or hybrid variants of both [28, 48] to estimate 3D rotation and 3D translation  [21, 31, 32] in an end-to-end manner. Recent approaches additionally integrate these techniques into detection pipelines to deal with multiple objects in a single image  [18, 20, 44, 47].

In contrast, correspondence-based methods predict keypoint locations and recover 3D poses from 2D-3D correspondences using PnP algorithms  [36, 38] or trained shape models  [35]. In this context, different methods predict sparse object-specific keypoints  [35, 36, 37], sparse virtual control points  [7, 38, 40], or dense unsupervised 2D-3D correspondences  [2, 3, 8, 15, 43].

In this work, we use the correspondence-based feed-forward approach presented in  [8] to predict initial 3D poses for refinement.

2.3 3D Pose Refinement

3D pose refinement methods are based on the assumption that the projection of an object’s 3D model aligns with the object’s appearance in the image given the correct 3D pose. Thus, they compare renderings under the current 3D pose to the input image to get feedback on the prediction.

A simple approach to refine 3D poses is to generate many small perturbations and evaluate their accuracy using a scoring function  [25, 50]. However, this is computationally expensive and the design of the scoring function is unclear. Therefore, other approaches try to predict iterative 3D pose updates with deep networks instead  [23, 29, 38, 52]. In practice, though, the performance of these methods is limited because they cannot generalize to 3D poses or 3D models that have not been seen during training  [38].

Recent approaches based on differentiable rendering overcome these limitations  [17, 27, 34]. Compared to the methods described above, they analytically propagate error signals backward through the rendering pipeline to compute more accurate 3D pose updates. In this way, they exploit knowledge about the 3D scene geometry and the projection pipeline for the optimization.

In contrast to existing differentiable rendering approaches that rely on comparisons in the RGB  [27] or mask  [17, 34] space, we compare RGB images and 3D model renderings in a feature space that is optimized for 3D pose refinement.

3 Learned 3D Pose Refinement

Given a single RGB image, a 3D model, and an initial 3D pose, we compute iterative updates to refine the 3D pose, as shown in Fig. 2. For this purpose, we first introduce the objective function that we optimize at runtime (see Sect. 3.1). We then explain how we compare the input RGB image to renderings under the current 3D pose in a feature space optimized for refinement (see Sect. 3.2), predict pixel-level gradients that minimize an ideal geometric reprojection loss in the form of geometric correspondence fields (see Sect. 3.3), and propagate gradients backward through the rendering pipeline to perform a gradient-based optimization directly on the 3D pose (see Sect. 3.4).

Fig. 2. Overview of our system. In the forward pass, we generate 3D model renderings under the current 3D pose. In the backward pass, we map the RGB image and our renderings to a common feature space and predict a geometric correspondence field that enables us to approximate the rasterization backward pass and compute gradients for the 3D pose that minimize an ideal geometric reprojection loss. (Color figure online)

3.1 Runtime Objective Function

Our approach to refine the 3D pose of an object is based on the numeric optimization of an objective function at runtime. In particular, we seek to minimize an ideal geometric reprojection loss

$$\begin{aligned} e(\mathcal {P}) = \frac{1}{2} \sum _{i} \Vert \text {proj}(\mathbf {M}_i,\mathcal {P}_\text {gt}) - \text {proj}(\mathbf {M}_i,\mathcal {P}) \Vert ^2_2 \end{aligned}$$
(1)

for all provided 3D model vertices \(\mathbf {M}_i\). In this case, \(\text {proj}(\cdot )\) performs the projection from 3D space to the 2D image plane, \(\mathcal {P}\) denotes the 3D pose parameters, and \(\mathcal {P}_\text {gt}\) is the ground truth 3D pose. Hence, it is clear that \(\mathop {\mathrm {arg\,min}}_{\mathcal {P}}\, e(\mathcal {P}) = \mathcal {P}_\text {gt}\).

To efficiently minimize \(e(\mathcal {P})\) using a gradient-based optimization starting from an initial 3D pose, we compute gradients for the 3D pose using the Jacobian of \(e(\mathcal {P})\) with respect to \(\mathcal {P}\). Applying the chain rule yields the expression

$$\begin{aligned} \left[ \frac{\partial e(\mathcal {P})}{\partial \mathcal {P}} \right] ^T \bigg |_{\mathcal {P}_\text {curr}} = - \sum _{i} \left[ \frac{\partial \text {proj}(\mathbf {M}_i,\mathcal {P})}{\partial \mathcal {P}} \bigg |_{\mathcal {P}_\text {curr}} \right] ^T \big [\text {proj}(\mathbf {M}_i, \mathcal {P}_\text {gt}) - \text {proj}(\mathbf {M}_i, \mathcal {P}_\text {curr})\big ] \end{aligned}$$
(2)

where \(\mathcal {P}_\text {curr}\) is the current 3D pose estimate and the point where the Jacobian is evaluated. In this case, the term \(\left[ \frac{\partial \text {proj}(\mathbf {M}_i,\mathcal {P})}{\partial \mathcal {P}}\right] ^T\) can be computed analytically because it is simply a sequence of differentiable operations. In contrast, the term \(\big [\text {proj}(\mathbf {M}_i, \mathcal {P}_\text {gt}) - \text {proj}(\mathbf {M}_i, \mathcal {P}_\text {curr})\big ]\) cannot be computed analytically because the 3D model vertices projected under the ground truth 3D pose, i.e., \(\text {proj}(\mathbf {M}_i, \mathcal {P}_\text {gt})\), are unknown at runtime and can only be observed indirectly via the input image.
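As an illustration (our own sketch, not the original implementation), the pose gradient in Eq. (2) can be evaluated as a vector-Jacobian product with automatic differentiation, assuming a simple rotation-matrix/translation pose parameterization and a hypothetical proj(·) helper. At runtime, the residuals \(\text {proj}(\mathbf {M}_i, \mathcal {P}_\text {gt}) - \text {proj}(\mathbf {M}_i, \mathcal {P}_\text {curr})\) are exactly what our geometric correspondence fields provide for visible vertices.

```python
# Illustrative sketch only (assumed pose parameterization and helper names).
import torch

def project(M, rot, t, K):
    """Project Nx3 model vertices with rotation (3x3), translation (3,), intrinsics K."""
    X = M @ rot.T + t              # transform to camera coordinates
    x = X @ K.T                    # apply camera intrinsics
    return x[:, :2] / x[:, 2:3]    # perspective division -> Nx2 image points

def pose_gradient(M, rot, t, K, residuals):
    """Evaluate Eq. (2): residuals[i] = proj(M_i, P_gt) - proj(M_i, P_curr)."""
    rot = rot.clone().requires_grad_(True)
    t = t.clone().requires_grad_(True)
    m_curr = project(M, rot, t, K)
    # de/dP = -sum_i J_i^T r_i, computed as a vector-Jacobian product
    m_curr.backward(gradient=-residuals)
    return rot.grad, t.grad
```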

However, for visible vertices, this term can be calculated given a geometric correspondence field (see Sect. 3.4). Thus, we introduce a novel network architecture that learns to predict geometric correspondence fields given an RGB image and 3D model renderings under the current 3D pose estimate in the following. Moreover, we embed this network in a differentiable rendering framework to approximate the rasterization backward pass and compute gradients for renderer inputs like the 3D pose of an object in an end-to-end manner (see Fig. 2).

3.2 Refinement Feature Space

The first step in our approach is to render the provided 3D model under the current 3D pose using the forward pass of our differentiable renderer (see Fig. 2). In particular, we generate depth, normal, and object coordinate  [2] renderings. These representations provide 3D pose and 3D shape information more explicitly than RGB or mask renderings which makes them particularly useful for 3D pose refinement  [9]. By concatenating the different renderings along the channel dimension, we leverage complementary information from different representations in the backward pass rather than relying on a single type of rendering  [27].
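As a minimal sketch (assuming a hypothetical renderer interface rather than a specific library), the multi-representation input \(R(\mathcal {P}_\text {curr})\) can be assembled as follows:

```python
# Sketch with an assumed renderer API; channels are 1 (depth) + 3 (normals)
# + 3 (object coordinates).
import torch

def render_multi_rep(renderer, mesh, pose):
    depth = renderer.render_depth(mesh, pose)                # B x 1 x H x W
    normals = renderer.render_normals(mesh, pose)            # B x 3 x H x W
    obj_coords = renderer.render_object_coords(mesh, pose)   # B x 3 x H x W
    # concatenate along the channel dimension -> B x 7 x H x W input R(P_curr)
    return torch.cat([depth, normals, obj_coords], dim=1)
```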

Next, we begin the backward pass of our differentiable renderer by mapping the input RGB image and our multi-representation renderings to a common feature space. For this task, we use two different network branches that bridge the domain gap between the real and rendered images (see Fig. 2). Our mapping branches use a custom architecture based on task-specific design choices:

First, we want to predict local cross-domain correspondences under the assumption that the initial 3D pose is close to the ground truth 3D pose. Thus, we do not require features with global context but features with local discriminability. For this reason, we use small fully convolutional networks which are fast, memory-efficient, and learn low-level features that generalize well across different objects. Because the low-level structures appearing across different objects are similar, we do not require a different network for each object  [38] but address objects of all categories with a single class-agnostic network for each domain.

Second, we want to predict correspondences with maximum spatial accuracy. Thus, we do not use pooling or downsampling but maintain the spatial resolution throughout the network. In this configuration, consecutive convolutions provide sufficient receptive field to learn advanced shape features which are superior to simple edge intensities  [18], while higher layers benefit from full spatial parameter sharing during training which increases generalization. As a consequence, the effective minibatch size during training is higher than the number of images per minibatch because only a subset of all image pixels contributes to each computed feature. In addition, the resulting high spatial resolution feature space provides an optimal foundation for computing spatially accurate correspondences.

For the implementation of our mapping branches, we use fully convolutional networks consisting of an initial \(7\times 7\) Conv-BN-ReLU block, followed by three residual blocks  [12, 13], and a \(1\times 1\) Conv-BN-ReLU block for dimensionality reduction. This architecture enforces local discriminability and high spatial resolution and maps RGB images and multi-representation renderings to \(W\times H\times 64\) feature maps, where W and H are the spatial dimensions of the input image.
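A possible PyTorch instantiation of one mapping branch following this description could look as shown below; the intermediate channel width is an assumption, not a value from the paper.

```python
# Sketch of a mapping branch: 7x7 Conv-BN-ReLU, three residual blocks,
# and a 1x1 Conv-BN-ReLU for dimensionality reduction, without any
# pooling or striding so that the spatial resolution is preserved.
import torch.nn as nn

def conv_bn_relu(c_in, c_out, k):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class ResidualBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(conv_bn_relu(c, c, 3),
                                   nn.Conv2d(c, c, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(c))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.block(x))

class MappingBranch(nn.Module):
    """Maps an input (RGB image or concatenated renderings) to a 64-channel
    feature map at full spatial resolution."""
    def __init__(self, in_channels, width=128):
        super().__init__()
        self.net = nn.Sequential(
            conv_bn_relu(in_channels, width, 7),
            ResidualBlock(width), ResidualBlock(width), ResidualBlock(width),
            conv_bn_relu(width, 64, 1))

    def forward(self, x):          # x: B x C x H x W
        return self.net(x)         # B x 64 x H x W
```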

3.3 Geometric Correspondence Fields

After mapping RGB images and 3D model renderings to a common feature space, we compare their feature maps and predict cross-domain correspondences. For this purpose, we concatenate their feature maps and use another fully convolutional network branch to predict correspondences, as shown in Fig. 2.

In particular, we regress per-pixel correspondence vectors in the form of geometric correspondence fields (see Fig. 1, middle). Geometric correspondence fields hold 2D displacement vectors between corresponding 2D object points in real-world RGB images and synthetic 3D model renderings similar to optical flow  [5]. These displacement vectors represent the projected relative 2D motion of individual 2D object points that is required to minimize the reprojection error and refine the 3D pose. A geometric correspondence field has the same spatial resolution as the respective input RGB image and two channels, i.e., \(W\times H\times 2\).

If an object’s 3D model and 3D pose are known, we can render the ground truth geometric correspondence field for an arbitrary 3D pose. For this task, we first compute the 2D displacement \(\nabla \mathbf {m}_i = \text {proj}(\mathbf {M}_i,\mathcal {P}_\text {gt}) - \text {proj}(\mathbf {M}_i,\mathcal {P}_\text {curr})\) between the projection under the ground truth 3D pose \(\mathcal {P}_\text {gt}\) and the current 3D pose \(\mathcal {P}_\text {curr}\) for each 3D model vertex \(\mathbf {M}_i\). We then generate a ground truth geometric correspondence field \(G(\mathcal {P}_\text {curr},\mathcal {P}_\text {gt})\) by rendering the 3D model using a shader that interpolates the per-vertex 2D displacements \(\nabla \mathbf {m}_i\) across the projected triangle surfaces using barycentric coordinates.
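For illustration, the per-vertex 2D displacements \(\nabla \mathbf {m}_i\) can be computed with the same hypothetical project(·) helper sketched above; interpolating them across the projected triangles is then a standard rendering operation left to the renderer.

```python
# Sketch only: per-vertex displacements for the ground truth correspondence field.
def vertex_displacements(M, pose_gt, pose_curr, K):
    m_gt = project(M, *pose_gt, K)      # N x 2 projections under P_gt
    m_curr = project(M, *pose_curr, K)  # N x 2 projections under P_curr
    return m_gt - m_curr                # N x 2 per-vertex 2D displacements
```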

In our scenario, predicting correspondences using a network has two advantages compared to traditional correspondence matching  [10]. First, predicting correspondences with convolutional kernels is significantly faster than exhaustive feature matching during both training and testing  [51]. This is especially important in the case of dense correspondences. Second, training explicit correspondences can easily result in degenerated feature spaces and requires tedious regularization and hard negative sample mining  [4].

For the implementation of our correspondence branch, we use three consecutive \(7\times 7\) Conv-BN-ReLU blocks followed by a final \(7\times 7\) convolution which reduces the channel dimensionality to two. For this network, a large receptive field is crucial to cover correspondences with high spatial displacement.
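A corresponding PyTorch sketch of the correspondence branch, reusing the conv_bn_relu helper from the mapping branch sketch (the intermediate width is again an assumption):

```python
import torch
import torch.nn as nn

class CorrespondenceBranch(nn.Module):
    """Predicts a 2-channel geometric correspondence field from concatenated features."""
    def __init__(self, width=128):
        super().__init__()
        self.net = nn.Sequential(
            conv_bn_relu(128, width, 7),        # 64 (image) + 64 (rendering) channels
            conv_bn_relu(width, width, 7),
            conv_bn_relu(width, width, 7),
            nn.Conv2d(width, 2, 7, padding=3))  # final 7x7 conv -> 2 channels

    def forward(self, feat_image, feat_rendering):
        return self.net(torch.cat([feat_image, feat_rendering], dim=1))
```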

However, in many cases local correspondence prediction is ambiguous. For example, many objects are untextured and have homogeneous surfaces, e.g., the backrest and the seating surface of the chair in Fig. 1, which cause unreliable correspondence predictions. To address this problem, we additionally employ a geometric attention module which restricts the correspondence prediction to visible object regions with significant geometric discontinuities, as outlined in white underneath the 2D displacement vectors in Fig. 1. We identify these regions by finding local variations in our renderings.

In particular, we detect rendering-specific intensity changes larger than a certain threshold within a local \(5\times 5\) window to construct a geometric attention mask \(w^{att}\). For each pixel of \(w^{att}\), we compute the geometric attention weight

$$\begin{aligned} w^{att}_{x,y} = \underset{u,v \in W}{\max }\bigg ( \delta ^R\Big ( R(\mathcal {P}_\text {curr})_{x,y},R(\mathcal {P}_\text {curr})_{x-u,y-v}\Big )\bigg )> t^R \>. \end{aligned}$$
(3)

In this case, \(R(\mathcal {P}_\text {curr})\) is a concatenation of depth, normal, and object coordinate renderings under the current 3D pose \(\mathcal {P}_\text {curr}\), (x, y) is a pixel location, and (u, v) are pixel offsets within the window W. The comparison function \(\delta ^R(\cdot )\) and the threshold \(t^R\) are different for each type of rendering. For depth renderings, we compute the absolute difference between normalized depth values and use a threshold of 0.1. For normal renderings, we compute the angle between normals and use a threshold of \(15^\circ \). For object coordinate renderings, we compute the Euclidean distance between 3D points and use a threshold of 0.1. If any of these thresholds is exceeded, the corresponding pixel (x, y) in our geometric attention mask \(w^{att}\) becomes active. Because we already generated these renderings before, our geometric attention mechanism requires almost no additional computations and is available during both training and testing.
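A minimal PyTorch sketch of this geometric attention mask, assuming the renderings are given as batched tensors and using the window size and thresholds stated above (border handling via zero padding is kept simple):

```python
import torch
import torch.nn.functional as F

def neighborhood(x, window=5):
    """Gather the window x window neighborhood of every pixel:
    B x C x H x W -> B x C x K x H x W with K = window**2."""
    B, C, H, W = x.shape
    patches = F.unfold(x, window, padding=window // 2)   # B x (C*K) x (H*W)
    return patches.view(B, C, window * window, H, W)

def attention_mask(depth, normals, obj_coords, window=5):
    """depth: B x 1 x H x W (normalized), normals: B x 3 x H x W (unit length),
    obj_coords: B x 3 x H x W (normalized object coordinates)."""
    d_n = neighborhood(depth, window)
    n_n = neighborhood(normals, window)
    c_n = neighborhood(obj_coords, window)

    # largest local depth difference, normal angle, and 3D point distance
    depth_delta = (d_n - depth.unsqueeze(2)).abs().amax(dim=2).squeeze(1)
    cos = (n_n * normals.unsqueeze(2)).sum(dim=1).clamp(-1.0, 1.0)
    angle_delta = torch.acos(cos).amax(dim=1)
    coord_delta = (c_n - obj_coords.unsqueeze(2)).pow(2).sum(dim=1).sqrt().amax(dim=1)

    active = (depth_delta > 0.1) | (angle_delta > torch.deg2rad(torch.tensor(15.0))) \
        | (coord_delta > 0.1)
    return active.float()    # B x H x W geometric attention mask w_att
```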

Training. During training of our system, we optimize the learnable part of our differentiable renderer, i.e., a joint network \(f(\cdot )\) consisting of our two mapping branches and our correspondence branch with parameters \(\theta \) (see Fig. 2). Formally, we minimize the error between predicted \(f(\cdot )\) and ground truth \(G(\cdot )\) geometric correspondence fields as

$$\begin{aligned} \underset{\theta }{\text {min}} \sum _{x,y} w^{att}_{x,y} \Vert f(I, R(\mathcal {P}_\text {curr}); \theta )_{x,y} - G(\mathcal {P}_\text {curr},\mathcal {P}_\text {gt})_{x,y} \Vert ^2_2 \> . \end{aligned}$$
(4)

In this case, \(w^{att}\) is a geometric attention mask, I is an RGB image, \(R(\mathcal {P}_\text {curr})\) is a concatenation of depth, normal, and object coordinate renderings generated under a random 3D pose \(\mathcal {P}_\text {curr}\) produced by perturbing the ground truth 3D pose \(\mathcal {P}_\text {gt}\), \(G(\mathcal {P}_\text {curr},\mathcal {P}_\text {gt})\) is the ground truth geometric correspondence field, and (x, y) is a pixel location. In particular, we first generate a random 3D pose \(\mathcal {P}_\text {curr}\) around the ground truth 3D pose \(\mathcal {P}_\text {gt}\) for each training sample in each iteration. For this purpose, we sample 3D pose perturbations from normal distributions and apply them to \(\mathcal {P}_\text {gt}\) to generate \(\mathcal {P}_\text {curr}\). For 3D rotations, we use absolute perturbations with \(\sigma =5^\circ \). For 3D translations, we use relative perturbations with \(\sigma =0.1\). We then render the ground truth geometric correspondence field \(G(\mathcal {P}_\text {curr},\mathcal {P}_\text {gt})\) between the perturbed 3D pose \(\mathcal {P}_\text {curr}\) and the ground truth 3D pose \(\mathcal {P}_\text {gt}\) as described above, generate concatenated depth, normal, and object coordinate renderings \(R(\mathcal {P}_\text {curr})\) under the perturbed 3D pose \(\mathcal {P}_\text {curr}\), and compute the geometric attention mask \(w^{att}\). Finally, we predict a geometric correspondence field using our network \(f(I, R(\mathcal {P}_\text {curr}); \theta )\) given the RGB image I and the renderings \(R(\mathcal {P}_\text {curr})\), and optimize for the network parameters \(\theta \).
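Put together, one training step could be sketched as follows; perturb_pose, render_multi_rep, split_renderings, and render_correspondence_field are hypothetical helpers standing in for the operations described above.

```python
import torch

def training_step(net, optimizer, image, mesh, pose_gt, renderer):
    # perturb the ground truth 3D pose (sigma = 5 deg rotation, 0.1 relative translation)
    pose_curr = perturb_pose(pose_gt, sigma_rot_deg=5.0, sigma_trans_rel=0.1)
    R_curr = render_multi_rep(renderer, mesh, pose_curr)                 # depth/normals/coords
    G = renderer.render_correspondence_field(mesh, pose_curr, pose_gt)   # B x 2 x H x W target
    w_att = attention_mask(*split_renderings(R_curr))                    # B x H x W mask

    pred = net(image, R_curr)                        # predicted geometric correspondence field
    loss = (w_att.unsqueeze(1) * (pred - G) ** 2).sum()   # attention-weighted L2, Eq. (4)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```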

In this way, we train a network that performs three tasks: First, it maps RGB images and multi-representation 3D model renderings to a common feature space. Second, it compares features in this space. Third, it predicts geometric correspondence fields which serve as pixel-level gradients that enable us to approximate the rasterization backward pass of our differentiable renderer.

3.4 Learned Differentiable Rendering

In the classic rendering pipeline, the only non-differentiable operation is the rasterization  [27] that determines which pixels of a rendering have to be filled, solves the visibility of projected triangles, and fills the pixels using a shading computation. This discrete operation raises one main challenge: Its gradient is zero, which prevents gradient flow  [17]. However, we must flow non-zero gradients from pixels to projected 3D model vertices to perform differentiable rendering.

Fig. 3. To approximate the rasterization backward pass, we predict a geometric correspondence field (left), disperse the predicted 2D displacement of each pixel among the vertices of its corresponding visible triangle (middle), and normalize the contributions of all pixels. In this way, we obtain gradients for projected 3D model vertices (right).

We solve this problem using geometric correspondence fields. Instead of actually differentiating a loss in the image space and relying on hand-crafted comparisons between pixel intensities to approximate the gradient flow from pixels to projected 3D model vertices  [14, 17], we first use a network to predict per-pixel 2D displacement vectors in the form of a geometric correspondence field, as shown in Fig. 2. We then compute gradients for projected 3D model vertices by simply accumulating the predicted 2D displacement vectors using our knowledge of the projected 3D model geometry, as illustrated in Fig. 3.

Formally, we compute the gradient of a projected 3D model vertex \(\mathbf {m}_i\) as

$$\begin{aligned} \nabla \mathbf {m}_i = \frac{\sum _{(x,y)\,:\,\mathbf {m}_i \in \bigtriangleup _{\texttt {IndexMap}_{x,y}}} w^{att}_{x,y}\, w^{bar,i}_{x,y}\, f(I, R(\mathcal {P}_\text {curr}); \theta )_{x,y}}{\sum _{(x,y)\,:\,\mathbf {m}_i \in \bigtriangleup _{\texttt {IndexMap}_{x,y}}} w^{att}_{x,y}\, w^{bar,i}_{x,y}} \> . \end{aligned}$$
(5)

In this case, \(f(I, R(\mathcal {P}_\text {curr}); \theta )\) is a geometric correspondence field predicted by our network \(f(\cdot )\) with frozen parameters \(\theta \) given an RGB image I and concatenated 3D model renderings \(R(\mathcal {P}_\text {curr})\) under the current 3D pose estimate \(\mathcal {P}_\text {curr}\), \(w^{att}_{x,y}\) is a geometric attention weight, \(w^{bar,i}_{x,y}\) is a barycentric weight for \(\mathbf {m}_i\), and (x, y) is a pixel position. We accumulate predicted 2D displacement vectors for all positions (x, y) for which \(\mathbf {m}_i\) is a vertex of the triangle \(\bigtriangleup _{\texttt {IndexMap}_{x,y}}\) visible at (x, y). For this task, we generate an IndexMap which stores the index of the visible triangle for each pixel during the forward pass of our differentiable renderer.
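The following sketch illustrates this accumulation; index_map (per-pixel visible triangle ids, -1 for background), bary_map (per-pixel barycentric weights), and faces (triangle-to-vertex indices) are assumed outputs of the forward rendering pass.

```python
import torch

def vertex_gradients(corr_field, w_att, index_map, bary_map, faces, num_vertices):
    """corr_field: H x W x 2 predicted field, w_att: H x W attention weights,
    index_map: H x W visible triangle ids (-1 for background),
    bary_map: H x W x 3 barycentric weights, faces: F x 3 vertex indices."""
    visible = index_map >= 0
    tri = index_map[visible]                      # per-pixel visible triangle ids
    disp = corr_field[visible]                    # per-pixel 2D displacement vectors
    att = w_att[visible]                          # per-pixel attention weights
    grads = torch.zeros(num_vertices, 2)
    norm = torch.zeros(num_vertices, 1)
    for k in range(3):                            # disperse to each triangle corner
        v_idx = faces[tri, k]                     # vertex index seen at each pixel
        w = (att * bary_map[visible, k]).unsqueeze(1)   # w_att * w_bar weights
        grads.index_add_(0, v_idx, w * disp)
        norm.index_add_(0, v_idx, w)
    return grads / norm.clamp(min=1e-8)           # normalized per-vertex 2D gradients
```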

Inference. Our computed \(\nabla \mathbf {m}_i\) approximate the second term in Eq. (2) that cannot be computed analytically. In this way, our approach combines local per-pixel 2D displacement vectors into per-vertex gradients and further computes accurate global 3D pose gradients considering the 3D model geometry and the rendering pipeline. Our experiments show that this approach generalizes better to unseen data than predicting 3D pose updates with a network  [23, 52].

During inference of our system, we perform iterative updates to refine \(\mathcal {P}_\text {curr}\). In each iteration, we compute a 3D pose gradient by evaluating our refinement loop presented in Fig. 2. For our implementation, we use the Adam optimizer  [19] with a small learning rate and perform multiple updates to account for noisy correspondences and achieve the best accuracy.
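A sketch of the resulting inference loop is given below; the helper names and the pose parameterization are assumptions carried over from the previous sketches, and the network parameters \(\theta \) stay frozen.

```python
import torch

def refine_pose(net, renderer, image, mesh, pose_init, iters=100, lr=0.05):
    rot, t = [p.clone().requires_grad_(True) for p in pose_init]
    optimizer = torch.optim.Adam([rot, t], lr=lr)
    for _ in range(iters):
        with torch.no_grad():                                   # network stays frozen
            R_curr, index_map, bary_map = renderer.render(mesh, (rot, t))
            w_att = attention_mask(*split_renderings(R_curr)).squeeze(0)        # H x W
            corr = net(image, R_curr).squeeze(0).permute(1, 2, 0)               # H x W x 2
            grad_m = vertex_gradients(corr, w_att, index_map, bary_map,
                                      mesh.faces, len(mesh.vertices))
        optimizer.zero_grad()
        m_curr = project(mesh.vertices, rot, t, renderer.K)     # differentiable projection
        m_curr.backward(gradient=-grad_m)                       # pose gradient as in Eq. (2)
        optimizer.step()
    return rot.detach(), t.detach()
```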

4 Experimental Results

To demonstrate the benefits of our 3D pose refinement approach, we evaluate it on the challenging Pix3D  [39] dataset which provides in-the-wild images for objects of different categories. In particular, we quantitatively and qualitatively compare our approach to state-of-the-art refinement methods in Sect. 4.1, perform an ablation study in Sect. 4.2, and combine our refinement approach with feed-forward 3D pose estimation  [8] and 3D model retrieval  [9] methods to predict fine-grained 3D poses without providing initial 3D poses or ground truth 3D models in Sect. 4.3. We follow the evaluation protocol of  [8] and report the median (MedErr) of rotation, translation, pose, and projection distances. Details on evaluation setup, datasets, and metrics as well as extensive results and further experiments are provided in our supplementary material.

4.1 Comparison to the State of the Art

We first quantitatively compare our approach to state-of-the-art refinement methods. For this purpose, we perform 3D pose refinement on top of an initial 3D pose estimation baseline. In particular, we predict initial 3D poses using the feed-forward approach presented in  [8] which is the state of the art for single image 3D pose estimation on the Pix3D dataset (Baseline). We compare our refinement approach to traditional image-based refinement without differentiable rendering  [52] (Image Refinement) and mask-based refinement with differentiable rendering  [17] (Mask Refinement).

RGB-based refinement with differentiable rendering  [27] is not possible in our setup because all available 3D models lack textures and materials. This approach even fails if we compare grey-scale images and renderings because the image intensities at corresponding locations do not match. As a consequence, 2D-3D alignment using a photo-metric loss is impossible.

For Image Refinement, we use grey-scale instead of RGB renderings because all available 3D models lack textures and materials. In addition, we do not perform a single full update  [52] but perform up to 1000 iterative updates with a small learning rate of \(\eta = 0.05\) using the Adam optimizer  [19] for all methods.

For Mask Refinement, we predict instance masks from the input RGB image using Mask R-CNN  [11]. To achieve maximum accuracy, we internally predict masks at four times the original spatial resolution proposed in Mask R-CNN and fine-tune a model pre-trained on COCO  [24] on Pix3D.

Table 1 summarizes our results. In this experiment, we provide the ground truth 3D model of the object in the image for refinement. Compared to the baseline, Image Refinement only achieves a small improvement in the rotation, translation, and pose metrics. There is almost no improvement in the projection metric (\(MedErr_{P}\)), as this method does not minimize the reprojection error. Traditional refinement methods are not aware of the rendering pipeline and the underlying 3D scene geometry and can only provide coarse 3D pose updates  [52]. In our in-the-wild scenario, the number of 3D models, possible 3D pose perturbations, and category-level appearance variations is too large to simulate all permutations during training. As a consequence, this method cannot generalize to examples which are far from the ones seen during training and only achieves small improvements.

Table 1. Quantitative 3D pose refinement results on the Pix3D dataset. In the case of provided ground truth 3D models, our refinement significantly outperforms previous refinement methods across all metrics by up to 55% relative. In the case of automatically retrieved 3D models (+ Retrieval  [9]), we reduce the 3D pose error (\(MedErr_{R,t}\)) compared to the state of the art for single image 3D pose estimation on Pix3D (Baseline) by 55% relative without using additional inputs.
Fig. 4. Evaluation of 3D pose accuracy at different thresholds. We significantly outperform other methods on strict thresholds using both GT and retrieved 3D models.

Additionally, we observe that after the first couple of refinement steps, the predicted updates of this method are not accurate enough to further improve the 3D pose but start to jitter. Moreover, for many objects, the prediction fails and the iterative updates cause the 3D pose to drift off. Empirically, we obtain the best overall results for this method using only 20 iterations. For all other methods based on differentiable rendering, we achieve the best accuracy after the full 1000 iterations. A detailed analysis of this issue is presented in our supplementary material.

Next, Mask Refinement improves upon Image Refinement by a large margin across all metrics. Due to the 2D-3D alignment with differentiable rendering, this method computes more accurate 3D pose updates and additionally reduces the reprojection error (\(MedErr_{P}\)). However, we observe that Mask Refinement fails in two common situations: First, when the object has holes and the mask is not a single blob, the refinement fails (see Fig. 5, e.g., 1st row - right example). In the presence of holes, the hand-crafted approximation for the rasterization backward pass accumulates gradients with alternating signs while traversing the image. This results in unreliable per-vertex motion gradients. Second, simply aligning the silhouette of objects is ambiguous as renderings under different 3D poses can have similar masks. The interior structure of the object is completely ignored. As a consequence, the refinement gets stuck in poor local minima. Finally, the performance of Mask Refinement is limited by the quality of the target mask predicted from the RGB input image  [11].

Fig. 5. Qualitative 3D pose refinement results for objects of different categories. We project the ground truth 3D model on the image using the 3D pose estimated by different methods. Our approach overcomes the limitations of previous methods and predicts fine-grained 3D poses for objects in the wild. The last example shows a failure case where the initial 3D pose is too far from the ground truth 3D pose and no refinement method can converge. More qualitative results are presented in our supplementary material. Best viewed in digital zoom. (Color figure online)

In contrast, our refinement overcomes these limitations and significantly outperforms the baseline as well as competing refinement methods across all metrics by up to 70% and 55% relative, respectively. Using our geometric correspondence fields, we bridge the domain gap between real-world images and synthetic renderings and align both the object outline and the interior structures with high accuracy.

Our approach performs especially well in the fine-grained regime, as shown in Fig. 4a. In this experiment, we plot the 3D pose accuracy \(Acc_{R,t}\) which gives the percentage of samples for which the 3D pose distance \(e_{R,t}\) is smaller than a varying threshold. For strict thresholds close to zero, our approach outperforms other refinement methods by a large margin. For example, at the threshold 0.015, we achieve more than 55% accuracy while the runner-up Mask Refinement achieves only 19% accuracy.

This significant performance improvement is also reflected in our qualitative examples presented in Fig. 5. Our approach precisely aligns 3D models to objects in RGB images and computes 3D poses which are in many cases visually indistinguishable from the ground truth. Even if the initial 3D pose estimate (Baseline) is significantly off, our method can converge towards the correct 3D pose (see Fig. 5, e.g., 1st row - left example). Finally, Fig. 6 illustrates the high quality of our predicted geometric correspondence fields.

Fig. 6. Qualitative examples of our predicted geometric correspondence fields. Our predicted 2D displacement vectors are highly accurate. Best viewed in digital zoom.

4.2 Ablation Study

To understand the importance of individual components in our system, we conduct an ablation study in Table 2. For this purpose, we modify a specific system component, retrain our approach, and evaluate the performance impact.

If we use smaller kernels with a smaller receptive field (\(3\times 3\) vs \(7\times 7\)) or fewer layers (2 vs 4) in our correspondence branch, the performance drops significantly. Also, using shallow mapping branches which only employ a single Conv-BN-ReLU block to simulate simple edge and ridge features results in low accuracy because the computed features are not discriminative enough. If we perform refinement without our geometric attention mechanism, the accuracy degrades due to unreliable correspondence predictions in homogeneous regions.

Next, the choice of the rendered representation is important for the performance of our approach. While using masks only performs poorly, depth, normal, and object coordinate renderings increase the accuracy. Finally, we achieve the best accuracy by exploiting complementary information from multiple different renderings by concatenating depth, normal, and object coordinate renderings.

By inspecting failure cases, we observe that our method does not converge if the initial 3D pose is too far from the ground truth 3D pose (see Fig. 5, last example). In this case, we cannot predict accurate correspondences because our computed features are not robust to large viewpoint changes and the receptive field of our correspondence branch is limited. In addition, occlusions cause our refinement to fail because there are no explicit mechanisms to address them. We plan to resolve this issue in the future by predicting occlusion masks and correspondence confidences. However, other refinement methods also fail in these scenarios (see Fig. 5, last example).

Table 2. Ablation study of our method. Using components which increase the discriminability of learned features is important for the performance of our approach. Also, our geometric attention mechanism and the chosen type of rendering affect the accuracy.

4.3 3D Model Retrieval

So far, we assumed that the ground truth 3D model required for 3D pose refinement is given at runtime. However, we can overcome this limitation by automatically retrieving 3D models from single RGB images. For this purpose, we combine all refinement approaches with the retrieval method presented in  [9], where the 3D model database essentially becomes a part of the trained model. In this way, we perform initial 3D pose estimation, 3D model retrieval, and 3D pose refinement given only a single RGB image. This setting allows us to benchmark refinement methods against feed-forward baselines in a fair comparison.

The corresponding results are presented in Table 1 and Fig. 4b. Because the retrieved 3D models often differ from the ground truth 3D models, the refinement performance decreases compared to given ground truth 3D models. Differentiable rendering methods lose more accuracy than traditional refinement methods because they require 3D models with accurate geometry.

Still, all refinement approaches perform remarkably well with retrieved 3D models. As long as the retrieved 3D model is reasonably close to the ground truth 3D model in terms of geometry, our refinement succeeds. Our method achieves even lower 3D pose error (\(MedErr_{R,t}\)) with retrieved 3D models than Mask Refinement with ground truth 3D models. Finally, using our joint 3D pose estimation-retrieval-refinement pipeline, we reduce the 3D pose error (\(MedErr_{R,t}\)) compared to the state of the art for single image 3D pose estimation on Pix3D (Baseline) by 55% relative without using additional inputs.

5 Conclusion

Aligning 3D models to objects in RGB images is the most accurate way to predict 3D poses. However, there is a domain gap between real-world images and synthetic renderings which makes this alignment challenging in practice. To address this problem, we predict deep cross-domain correspondences in a feature space optimized for 3D pose refinement and combine local 2D displacement vectors into global 3D pose updates using our novel differentiable renderer. Our method outperforms existing refinement approaches by up to 55% relative and can be combined with feed-forward 3D pose estimation and 3D model retrieval to predict fine-grained 3D poses for objects in the wild given only a single RGB image. Finally, our novel learned differentiable rendering framework can be used for other tasks in the future.