1 Introduction

For almost four decades, the estimation of optical flow from image sequences has been one of the most challenging tasks in computer vision. Despite the recent success of learning-based approaches [2, 9, 18, 23, 36], global energy-based methods are still among the most accurate techniques for solving this task [16, 17, 22, 44]. Even when combined with partial learning [1, 33, 41, 42], such methods offer the advantage of transparent modeling, since all assumptions are explicitly stated in the underlying energy functional. However, the complexity of the models has grown significantly within the last few years – recent methods try to estimate segmentation [33, 41, 44], occlusions [17, 44] or illumination changes [8] jointly with the optical flow – so the minimization of the resulting non-convex energies has become an increasingly challenging problem.

In this context, many energy-based approaches [14, 22, 33, 41] rely on a suitable initialization provided by other methods. Among the most popular approaches used for initialization are EpicFlow [30], Coarse-to-fine PatchMatch [15] and DiscreteFlow [25] – approaches that rely on the interpolation or fusion of feature matches. There are two main reasons for this: On the one hand, feature matching approaches are known to provide good results in the presence of large displacements. On the other hand, they are typically based on some kind of filtering or a-posteriori regularization, which renders the initialization sufficiently smooth and outlier-free. As a consequence, the initial flow field already offers a reasonable quality, so that the energy minimization starts with a good solution and is hence less likely to end up in undesired local minima.

While recent methods promote the use of feature-based approaches for initialization, they also show that integrating additional information into the estimation can be highly beneficial w.r.t. both accuracy and robustness [1, 16, 17, 33, 41]. Apart from considering domain-dependent semantic information [1, 5, 16, 33], it has proven useful to integrate structure constraints and symmetry cues. For instance, [41] proposed a method that jointly estimates the rigidity of each pixel together with its optical flow, imposing structure constraints only on rigid parts of the scene. In contrast, [17] suggested an approach that exploits symmetry and consistency cues to jointly estimate forward and backward flows. This, in turn, allows occlusion information to be inferred together with the optical flow.

Given that the two aforementioned approaches, as well as many other recent methods from the literature, rely on a suitable initialization from feature-based methods, it is surprising that such information has hardly entered the initial feature matching step so far. While symmetry and consistency cues are at least considered in terms of simple forward-backward checks to detect occlusions and remove the corresponding outliers [9, 15, 30], structure constraints in terms of a rigid background motion have not found their way into feature matching approaches for computing the optical flow at all. Hence, it would be desirable to develop a feature-based method that exploits structure information while still being able to estimate independently moving objects.

Contributions. In our paper, we develop such a hybrid method. In this context, our contributions are threefold. (i) First, we introduce a coarse-to-fine three-frame PatchMatch approach for estimating structure matches (SfM) that combines a depth-driven parametrization with different temporal selection strategies. While the parametrization robustifies the estimation by reducing the search space, the hierarchical optimization and the temporal selection improve the accuracy. (ii) Second, we propose a consistency-based selection scheme for combining matches from this structure-based PatchMatch approach and an unconstrained PatchMatch approach. Thereby, the backward flow allows us to identify reliable structure matches, while a robust voting scheme decides on the remaining cases. (iii) Finally, we embed the resulting matches into a full estimation pipeline. Using recent approaches for interpolation and refinement, our method provides dense results with sub-pixel accuracy. Experiments on all major benchmarks demonstrate the benefits of our novel SfM-aware PatchMatch approach.

1.1 Related Work

As mentioned, integrating additional information can render the estimation of the optical flow significantly more accurate and robust. We first comment on related work regarding the integration of such information, while afterwards we focus on related PatchMatch approaches for optical flow and scene structure.

Rigid Motion. In order to improve accuracy and robustness in the case of a rigid background, one may enforce geometric assumptions such as the epipolar constraint [29, 38, 43, 44]. However, if this assumption is forced to hold for the entire scene, as proposed by Oisel et al. [29] and Yamaguchi et al. [43, 44], the approach is only applicable to fully rigid scenes, e.g. those of the KITTI 2012 benchmark [11]. Although this problem can be slightly alleviated by soft constraints as proposed by Valgaerts et al. [37, 38], results for non-rigid scenes typically remain poor. Hence, Wedel et al. [40] suggested turning off the epipolar constraint for sequences with independent object motion. This, however, does not allow rigid body priors to be exploited at all in the standard optical flow setting. Consequently, Gerlich and Eriksson [12] presented a more advanced approach that segments the scene into different regions with independent rigid body motions. While this strategy handles automotive scenes with other rigidly moving objects quite well, e.g. sequences similar to the KITTI 2015 benchmark [24], it cannot model any type of non-rigid motion, e.g. as required for the different characters in the MPI Sintel benchmark [7]. In contrast, our SfM-aware PatchMatch approach combines information from general and SfM-based motion estimation. Hence, it is not restricted to fully rigid or object-wise rigid scenes.

Mostly Rigid Motion. Compared to [12], Wulff et al. [41] went a step further. Instead of requiring the scene to be object-wise rigid, they assume the scene to be only mostly rigid. To this end, they suggested a complex iterative model that jointly segments the scene into foreground and background using semantic information as well as motion and structure cues, while estimating the background motion with a dedicated epipolar stereo algorithm. In contrast to this approach, which uses the general optical flow method [25] as initialization and adaptively integrates strong rigidity priors only later on in the estimation, our SfM-aware PatchMatch approach integrates such priors already in the computation of feature matches at the very beginning of the estimation – and this without the use of semantic information. Hence, our results are relevant for all methods relying on a suitable initialization – including the work of Wulff et al. [41] and other recent methods such as [17] or [33].

Parametrized Models. An alternative strategy that has recently become very popular is to refrain from using global or object-wise rigidity priors and to model motions that are pixel- or piecewise rigid. Typically this is done by means of a suitable flow (over-)parametrization; see e.g. [13, 16, 24, 28, 39, 45]. For instance, Hornaček et al. [13] proposed a 9 DoF flow parametrization that models a locally rigid motion of planes. Similarly, Yang et al. [45] and Hur and Roth [16, 17] suggested approaches that use a spatially coherent 8 DoF homography based on superpixels. In contrast to those methods, our SfM-aware PatchMatch approach does not explicitly rely on an over-parametrization. Instead, it gains robustness by restricting the search space to 1D when calculating the SfM matches. Moreover, it estimates the flow pixel-wise instead of segment-wise. Hence, it is more suitable for general scenes with non-rigid motion and fine motion details.

Semantic Information. Another way to improve the accuracy and the robustness of the estimation is to consider semantic information. For instance, Bai et al. [1] proposed to use instance-level segmentation to identify independently moving traffic participants before computing separate rigid motions for both the background and the participants. Similarly, Hur and Roth [16] make use of a CNN to integrate semantic information into a joint approach for estimating the flow and a temporally consistent semantic segmentation. Furthermore, Sevilla-Lara et al. [33] suggested a layered approach that relies on semantic information when switching between different motion models. Finally, there is also the method of Wulff et al. [41] (see mostly rigid motion). While semantic information often improves the results, it has to be specifically adapted to the given domain. As a consequence, the corresponding approaches typically do not generalize well across different applications or benchmarks. Hence, we do not rely on such information.

PatchMatch. In the context of unconstrained matching (optical flow), PatchMatch was originally proposed by Barnes et al. [4]. Recent developments include the work of Bao et al. [3], which introduces an edge-preserving weighting scheme, as well as the approach of Hu et al. [15], which improves accuracy and speed with a hierarchical matching strategy. Moreover, Gadot and Wolf [9] and Bailer et al. [2] have recently shown that feature learning can be beneficial. Despite all this progress, however, none of the aforementioned optical flow methods includes structure information. In contrast, our SfM-aware approach exploits such information by explicitly using feature matches from a specifically tailored three-view stereo/SfM PatchMatch method. Also in the stereo/SfM context, there exists a vast literature on PatchMatch algorithms. There, PatchMatch was first introduced by Bleyer et al. [6], who proposed a plane-fitting variant for the rectified case. Recent developments include the approaches of Shen [34] and Galliani et al. [10], who extended PatchMatch to the non-rectified two-view and multi-view case, respectively; see also [32, 46]. In contrast to all those methods, our SfM-aware PatchMatch approach does not only extract pure stereo information. Instead, it combines information from optical flow and stereo and is hence also applicable to non-rigid scenes with independent object motion. Moreover, it relies on a hierarchical optimization [15], which has not been used in the context of PatchMatch stereo so far. Finally, the SfM part of our algorithm uses a direct depth parametrization, which, in turn, makes the estimation very robust.

2 Method Overview

Let us start by giving a brief overview of the proposed method. Like many recent optical flow techniques, it relies on a multi-stage approach that includes steps for computing and refining an initial flow field; see e.g. [14, 17, 22, 33, 41]. However, in contrast to most of these approaches, which typically aim at improving an already given flow field, our method focuses on the generation of an accurate and robust initial flow field itself. To achieve this goal, our method integrates structure information into the feature matching process, which plays an essential role for the initialization [15, 25, 30]. This integration is motivated by the observation that many sequences contain a significant amount of rigid motion induced by the ego-motion of the camera [41]. Since this motion is constrained by the underlying stereo geometry, structure information can significantly improve the estimation.

Fig. 1. Schematic overview of our SfM-aware PatchMatch approach.

In our multi-stage method, we realize this integration by combining two hierarchical feature matching approaches that complement each other: On the one hand, we use a recent two-frame PatchMatch approach for optical flow estimation [15]. This allows our method to estimate the unconstrained motion in the scene (forward and backward matches). On the other hand, we rely on a specifically tailored three-frame stereo/SfM PatchMatch approach (see Sect. 3) with preceding pose estimation [26]. This, in turn, allows our method to compute the rigid motion of the scene induced by the moving camera (structure matches). In order to discard outliers and combine the remaining matches, we apply a filtering step to all matches, followed by a consistency-based selection (see Sect. 4). Finally, we inpaint and refine the combined matches using recent methods from the literature [14, 22]. An overview of the entire approach is given in Fig. 1.
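To make the data flow concrete, the following minimal Python sketch traces the pipeline of Fig. 1. All function names are hypothetical placeholders for the cited components (CPM [15], our structure matching of Sect. 3, the combination of Sect. 4, RIC [14] and OIR [22]), not their actual APIs:

```python
def sfm_aware_pipeline(I_prev, I_t, I_next, poses):
    """Hypothetical orchestration of the pipeline in Fig. 1."""
    # Unconstrained two-frame PatchMatch [15]: forward and backward matches.
    of_fw = cpm_match(I_t, I_next)
    of_bw = cpm_match(I_t, I_prev)

    # Three-frame depth-parametrized PatchMatch (Sect. 3), using camera
    # poses from a preceding incremental SfM step [26].
    st_fw, st_bw = structure_match(I_prev, I_t, I_next, poses)

    # Outlier filtering and consistency-based selection (Sect. 4).
    matches = combine_matches(of_fw, of_bw, st_fw, st_bw)

    # Dense flow via robust interpolation [14] and variational refinement [22].
    return refine(inpaint(matches, I_t), I_t, I_next)
```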

3 Structure Matching

In this section, we present our structure matching framework, which builds upon the PatchMatch algorithm [4] – a randomized, iterative algorithm for approximate patch matching. In this context, we adopt ideas of the recently proposed Coarse-to-fine PatchMatch (CPM) for optical flow [15] and apply them in the context of stereo/SfM estimation relying on a depth-based parametrization [10, 31]. This not only enables the straightforward integration of multiple frames, but also allows us to consider the concepts of temporal averaging and temporal selection [19], the latter being a strategy for implicit occlusion handling.

Fig. 2. Left: Illustration of the employed depth parametrization. Right: Illustration of corresponding points defined by the image location \(\mathbf {x}_t\) and the associated depth value \(z(\mathbf {x}_t)\). In this case, the 3D point is occluded in one view and can be handled with the idea of temporal selection, i.e. by the view from the other time step.

3.1 Depth-Based Parametrization

Let us start by deriving the employed depth-based parametrization. To this end, we assume that all images are captured by a calibrated perspective camera that possibly moves in space, i.e. the corresponding projection matrices \(P_t = K \, [R_t | \mathbf {t}_t]\) are known. Here \(R_t\) is a \(3~\times ~3\) rotation matrix and \(\mathbf {t}_t\) a translation 3-vector that together describe the pose of the camera at a certain time step t. In addition, the \(3~\times ~3\) matrix K denotes the intrinsic camera calibration matrix given by

$$\begin{aligned} K = \begin{pmatrix} s_x & 0 & c_x \\ 0 & s_y & c_y \\ 0 & 0 & 1 \end{pmatrix}, \end{aligned}$$
(1)

where \((s_x, s_y)\) denotes the scaled focal length and \(\mathbf {c} = (c_x, c_y)^\top \) denotes the principal point offset. Given the projection matrix \(P_t\), a 3D point \(\mathbf {X} \in \mathbb {R}^3\) is projected onto a 2D point \(\mathbf {x} \in \mathbb {R}^2\) on the image plane by \(\mathbf {x} = \pi (P_t \tilde{\mathbf {X}})\), where the tilde denotes homogeneous coordinates, such that

$$\begin{aligned} \tilde{\mathbf {X}} = \begin{pmatrix} \mathbf {X}^\top , & 1 \end{pmatrix}^\top , \end{aligned}$$
(2)

and \(\pi \) maps a homogeneous coordinate \(\tilde{\mathbf {x}}\) to its Euclidean counterpart \(\mathbf {x}\)

$$\begin{aligned} \pi (\tilde{\mathbf {x}}) = \begin{pmatrix} \tilde{x}_1/\tilde{x}_3\\ \tilde{x}_2/\tilde{x}_3 \end{pmatrix} \, , \quad \text {with} \quad \tilde{\mathbf {x}} = \begin{pmatrix} \tilde{x}_1,&\tilde{x}_2,&\tilde{x}_3 \end{pmatrix}^\top . \end{aligned}$$
(3)
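As a concrete illustration, here is a minimal Python/NumPy sketch of the projection in Eqs. (1)-(3); the intrinsic values are made-up examples, not calibrations from any benchmark:

```python
import numpy as np

def project(P, X):
    """Project a 3D point X onto the image plane of a camera with
    3x4 projection matrix P = K [R | t], following Eqs. (1)-(3)."""
    X_h = np.append(X, 1.0)   # homogeneous coordinates, Eq. (2)
    x_h = P @ X_h             # projection
    return x_h[:2] / x_h[2]   # dehomogenization pi, Eq. (3)

# Example with illustrative intrinsics (s_x, s_y, c_x, c_y) and an
# identity reference pose.
K = np.array([[720.0,   0.0, 512.0],
              [  0.0, 720.0, 218.0],
              [  0.0,   0.0,   1.0]])
P_t = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
x = project(P_t, np.array([1.0, 0.5, 10.0]))   # -> pixel coordinates
```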

Now, to define our parametrization, we assume w.l.o.g. that the camera pose of the reference camera, i.e. the camera associated with the image taken at time t, is aligned with the world coordinate system, and invert the previously described projection to specify a 3D point on the surface \(\mathbf {s}\) by an image location \(\mathbf {x}\) and the corresponding depth \(z(\mathbf {x})\) along the optical axis; see Fig. 2. This leads to

$$\begin{aligned} \mathbf {X} = \mathbf {s}(\mathbf {x}, z(\mathbf {x})) = z(\mathbf {x}) K^{-1} \tilde{\mathbf {x}}, \end{aligned}$$
(4)

which allows us to describe correspondences throughout multiple images with a single unknown, the depth \(z(\mathbf {x})\), by projecting onto the respective image planes using the corresponding projection matrices; see Fig. 2. Finally, given three frames as in our case, with projection matrices \(P_{t+1}\), \(P_{t}\), and \(P_{t-1}\), one can directly convert the estimated depth values to the corresponding displacement vectors w.r.t. the forward frame \(t+1\) and the backward frame \(t-1\) (Fig. 3):

$$\begin{aligned} \mathbf {u}_{\mathrm {st, fw}}(\mathbf {x},z(\mathbf {x})) = \pi (P_{t+1} \tilde{\mathbf {s}}(\mathbf {x}, z(\mathbf {x}))) - \pi (P_{t} \tilde{\mathbf {s}}(\mathbf {x}, z(\mathbf {x}))), \end{aligned}$$
(5)
$$\begin{aligned} \mathbf {u}_{\mathrm {st, bw}}(\mathbf {x},z(\mathbf {x})) = \pi (P_{t-1} \tilde{\mathbf {s}}(\mathbf {x}, z(\mathbf {x}))) - \pi (P_{t} \tilde{\mathbf {s}}(\mathbf {x}, z(\mathbf {x}))). \end{aligned}$$
(6)
Fig. 3. Illustration of the conversion from a 3D point to the displacement vectors w.r.t. the forward frame \(t+1\) and the backward frame \(t-1\).
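A minimal sketch of this depth-to-displacement conversion, reusing the project helper from the sketch above and assuming, as in the text, that the reference pose is the identity:

```python
def backproject(K, x, z):
    """Eq. (4): 3D point s(x, z(x)) from pixel x and depth z along the
    optical axis of the reference camera."""
    x_h = np.array([x[0], x[1], 1.0])
    return z * (np.linalg.inv(K) @ x_h)

def depth_to_flow(K, P_t, P_other, x, z):
    """Eqs. (5)/(6): displacement of pixel x induced by depth z and the
    relative camera motion towards frame t+1 or t-1 (choice of P_other)."""
    X = backproject(K, x, z)
    return project(P_other, X) - project(P_t, X)
```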

3.2 Hierarchical Matching

With the depth parametrization at hand, we now turn to the actual matching. Since applying the classical PatchMatch approach [4] directly to the problem typically yields noisy results due to the absence of explicit regularization, we resort to the idea of integrating a hierarchical coarse-to-fine scheme, which has been shown to be less prone to noise in the context of optical flow estimation [15].

As in [15], we do not estimate the unknowns at all pixel locations, but only for multiple collections of seeds \(\mathcal {S}^l = \{ s_m^l \}\) that are defined on each resolution level \(l \in \{0,1,\ldots ,k-1\}\) of the coarse-to-fine pyramid. While the number of seeds remains the same on each resolution level, their spatial locations are given by

$$\begin{aligned} \mathbf {x}(s_m^l) =\lfloor \eta \cdot \mathbf {x}(s_m^{l-1}) \rceil \, \quad \text {for} \quad l \ge 1, \end{aligned}$$
(7)

where \(\lfloor \cdot \rceil \) denotes rounding to the nearest integer and \(\eta = 0.5\) is the downsampling factor between two consecutive pyramid levels. Furthermore, the locations for \(l=0\) (full image resolution) lie on the points of a regular image grid with a spacing of 3 pixels and come with the default neighborhood system defined via spatial adjacency. These neighborhood relations remain fixed throughout the coarse-to-fine pyramid.
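Under these conventions, the seed construction can be sketched as follows (grid spacing and η as given above):

```python
def seed_pyramid(height, width, levels, spacing=3, eta=0.5):
    """Seed locations per pyramid level following Eq. (7): a regular
    grid at full resolution (l = 0) and rounded, scaled-down copies on
    the coarser levels; the number of seeds is the same on every level."""
    ys, xs = np.mgrid[0:height:spacing, 0:width:spacing]
    seeds = [np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)]
    for _ in range(1, levels):
        seeds.append(np.rint(eta * seeds[-1]))  # nearest integer, Eq. (7)
    return seeds
```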

The matching is now performed in the classical coarse-to-fine manner: Starting at the coarsest resolution, each level is processed by iteratively performing a random search and a neighborhood propagation as in [4]. While the coarsest level uses a random initialization of the unknown depth, the subsequent levels are initialized with the depth values of the corresponding seeds of the next coarser level. Furthermore, the search radius for the random sampling is reduced exponentially throughout the coarse-to-fine pyramid, such that the random search is restricted to values near the current best depth estimate.
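The per-level optimization can be sketched as follows; the exact iteration counts and sampling schedule of [15] may differ. Here, cost_fn(i, z) is assumed to evaluate the matching cost of Sect. 3.3 for seed i at depth z, and the caller halves radius from level to level:

```python
def patchmatch_level(depth, cost_fn, neighbors, radius, z_range, iters=6):
    """One pyramid level: alternate neighborhood propagation and random
    search as in [4], restricted to the depth interval z_range."""
    n = len(depth)
    cost = np.array([cost_fn(i, depth[i]) for i in range(n)])
    for _ in range(iters):
        for i in range(n):
            # Propagation: adopt a neighboring seed's depth if cheaper.
            for j in neighbors[i]:
                c = cost_fn(i, depth[j])
                if c < cost[i]:
                    depth[i], cost[i] = depth[j], c
            # Random search with exponentially shrinking offsets around
            # the current best depth, clipped to the valid interval.
            r = radius
            while r > 1e-3 * (z_range[1] - z_range[0]):
                z = np.clip(depth[i] + np.random.uniform(-r, r), *z_range)
                c = cost_fn(i, z)
                if c < cost[i]:
                    depth[i], cost[i] = z, c
                r *= 0.5
    return depth, cost
```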

3.3 Cost Computation and Temporal Averaging/Selection

Since we consider three images, there are several possibilities for computing the matching cost between corresponding patches. One possible choice is to compute all pairwise similarity measures w.r.t. the reference patch and average the costs. While this renders the estimation more robust if the actual 3D point is visible in all views, it may lead to deteriorated results in case of occlusions. In order to deal with such occlusions, one can apply the idea of temporal selection [19] and again compute all pairwise similarity measures w.r.t. the reference patch, but only consider the lowest pairwise cost as the overall cost. This ensures that, as long as the reference patch can be found in at least one view and is occluded in the remaining ones, the correct correspondence retains a small cost. In our experiments we will use both approaches, temporal averaging and temporal selection.

Finally, we utilize SIFT descriptors [15, 20, 21] to compute the similarity between two corresponding locations, which also renders the matching more robust than operating directly on the intensity values. Regarding the cost function, we follow [15] and apply a robust \(L^1\)-loss. The resulting forward and backward structure matching costs \(C_{t+1}\) and \(C_{t-1}\) are then given by

$$\begin{aligned} C_{t+1}(\mathbf {x}, z(\mathbf {x})) = ||\mathbf {f}_{\mathrm {SIFT}} (\pi (P_{t+1} \tilde{\mathbf {s}}(\mathbf {x}, z(\mathbf {x})))) - \mathbf {f}_{\mathrm {SIFT}} (\pi (P_{t} \tilde{\mathbf {s}}(\mathbf {x}, z(\mathbf {x}))))||_1, \end{aligned}$$
(8)
$$\begin{aligned} C_{t-1}(\mathbf {x}, z(\mathbf {x})) = ||\mathbf {f}_{\mathrm {SIFT}} (\pi (P_{t-1} \tilde{\mathbf {s}}(\mathbf {x}, z(\mathbf {x})))) - \mathbf {f}_{\mathrm {SIFT}} (\pi (P_{t} \tilde{\mathbf {s}}(\mathbf {x}, z(\mathbf {x}))))||_1, \end{aligned}$$
(9)

where \(\mathbf {f}_{\mathrm {SIFT}}\) denotes the SIFT-feature and \(||\cdot ||_1\) is the \(L^1\)-norm. The corresponding temporal averaging and temporal selection costs read

$$\begin{aligned} C_{\mathrm {avg}}(\mathbf {x}, z(\mathbf {x})) = \tfrac{1}{2} \left( C_{t+1}(\mathbf {x}, z(\mathbf {x})) + C_{t-1}(\mathbf {x}, z(\mathbf {x})) \right) ,\end{aligned}$$
(10)
$$\begin{aligned} C_{\mathrm {ts}}(\mathbf {x}, z(\mathbf {x})) = \min \left( C_{t+1}(\mathbf {x}, z(\mathbf {x})),\, C_{t-1}(\mathbf {x}, z(\mathbf {x})) \right) . \end{aligned}$$
(11)
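Putting Eqs. (8)-(11) together, the cost evaluation can be sketched as follows, reusing backproject and project from the sketches above; f_sift is a hypothetical lookup that returns the SIFT descriptor at a (sub-pixel) location in frame t-1, t or t+1 (index -1, 0, +1):

```python
def matching_cost(x, z, f_sift, K, P_prev, P_t, P_next, mode="ts"):
    """Temporal averaging ('avg') or temporal selection ('ts') cost."""
    X = backproject(K, x, z)                                      # Eq. (4)
    f_ref = f_sift(0, project(P_t, X))
    c_fw = np.abs(f_sift(+1, project(P_next, X)) - f_ref).sum()  # Eq. (8)
    c_bw = np.abs(f_sift(-1, project(P_prev, X)) - f_ref).sum()  # Eq. (9)
    if mode == "avg":
        return 0.5 * (c_fw + c_bw)     # Eq. (10), temporal averaging
    return min(c_fw, c_bw)             # Eq. (11), temporal selection
```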

3.4 Outlier Handling

Finally, we extend the classical bi-directional consistency check to our three-view setting. To this end, we not only estimate the depth values with frame t as reference view, but also with the other two frames as reference. Then we take the estimated depth value \(z_{t}(\mathbf {x})\) at frame t, project it into the frames \(t + 1\) and \(t - 1\), take the estimated depth values \(z_{t+1}(\mathbf {x})\) and \(z_{t-1}(\mathbf {x})\) there, and project them back to frame t. Only if at least one of the two backprojections maps to the starting point \(\mathbf {x}\) is the depth value \(z_{t}(\mathbf {x})\) considered valid. In this case, the forward/backward structure matches can be computed from \(z_{t}(\mathbf {x})\) via Eqs. (5) and (6).
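The check can be sketched as follows. The z_* arguments are depth lookups for the respective reference views, and the to_*/back_from_* helpers are hypothetical wrappers around the depth-to-displacement conversion of Eqs. (5)/(6) for the respective camera pair:

```python
def depth_consistent(x, z_t, z_next, z_prev, to_next, to_prev,
                     back_from_next, back_from_prev, tol=1.0):
    """Three-view consistency check (a sketch); tol is in pixels."""
    # Round trip t -> t+1 -> t using the depth estimated at t+1.
    x_fw = x + to_next(x, z_t(x))
    ok_fw = np.linalg.norm(x_fw + back_from_next(x_fw, z_next(x_fw)) - x) <= tol
    # Round trip t -> t-1 -> t using the depth estimated at t-1.
    x_bw = x + to_prev(x, z_t(x))
    ok_bw = np.linalg.norm(x_bw + back_from_prev(x_bw, z_prev(x_bw)) - x) <= tol
    return ok_fw or ok_bw   # valid if at least one round trip closes
```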

4 Combining Matches

At this point, we have computed filtered forward and backward structure matches from frame t to frames \(t+1\) and \(t-1\). For the sake of clarity, let us denote these matches by \(\mathbf {\hat{u}}_{\mathrm {st, fw}}\) and \(\mathbf {\hat{u}}_{\mathrm {st, bw}}\). Moreover, as indicated in Fig. 1, we have also computed the corresponding forward and backward optical flow matches between the same frames with a hierarchical PatchMatch approach for unconstrained motion [15]. Since these optical flow matches underwent a classical bi-directional consistency check to remove outliers (which requires additionally computing matches from frames \(t+1\) and \(t-1\) back to frame t), let us denote them by \(\mathbf {\hat{u}}_{\mathrm {of, fw}}\) and \(\mathbf {\hat{u}}_{\mathrm {of, bw}}\).

The goal of the combination step is to fuse these four sets of matches in such a way that rigid parts of the scene benefit from the structure matches. Thereby one has to keep in mind that optical flow matches may explain rigid motion, while structure matches are typically wrong in the presence of independent object motion. To avoid using structure matches at inappropriate locations, we propose a conservative approach: We augment the optical flow matches with the matches obtained from the structure matching. This means that we always keep the match of the forward flow if it has passed the outlier filtering. Otherwise, we consider augmenting the final matches at this location with the match of the structure matching approach. In order to decide whether such a structure match should really be included, we propose three different strategies (see Fig. 4; a code sketch follows their description):

Permissive Approach. The first approach is the most permissive. It includes all structure matches \(\mathbf {\hat{u}}_{\mathrm {st, fw}}\) that have passed the outlier filtering at locations where no forward optical flow match \(\mathbf {\hat{u}}_{\mathrm {of, fw}}\) is available.

Restrictive Approach. The second approach is more restrictive. Instead of including all structure matches, we enforce an additional consistency check, which reduces the probability of blindly including possibly false matches. For this consistency check, we make use of the backward optical flow match \(\mathbf {\hat{u}}_{\mathrm {of, bw}}\): We only consider the forward structure match \(\mathbf {\hat{u}}_{\mathrm {st, fw}}\) if its backward counterpart \(\mathbf {\hat{u}}_{\mathrm {st, bw}}\) is consistent with the backward optical flow match \(\mathbf {\hat{u}}_{\mathrm {of, bw}}\). In case the additional consistency check cannot be performed, because the backward optical flow match did not pass the outlier filtering, we do not consider the structure match.

Voting Approach. Finally, we propose a voting approach that enforces the additional consistency check as in the restrictive approach, but still allows structure matches to be included in case the additional consistency check cannot be performed. The decision whether such non-checkable structure matches should be included is made for each sequence separately. It is based on a voting scheme: All locations that contain a valid forward, backward and structure match are eligible to vote. If the structure match is consistent with both the forward and the backward match, we count this as a vote in favor of including non-checkable matches. If the fraction of positive votes surpasses a certain threshold (\(80 \%\) in our experiments), all non-checkable structure matches are added. This can be seen as a detection scheme that identifies scenes with a large amount of ego-motion.
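The three strategies can be summarized in the following sketch. Each argument maps pixel locations to a match vector, or to None if the match was removed by the outlier filtering; the consistency tolerance tol and the dictionary-based interface are illustrative assumptions:

```python
def combine_matches(of_fw, of_bw, st_fw, st_bw,
                    strategy="voting", tol=1.0, vote_thresh=0.8):
    """Fuse optical flow and structure matches (Sect. 4)."""
    def agree(a, b):
        return (a is not None and b is not None
                and np.linalg.norm(np.asarray(a) - np.asarray(b)) <= tol)

    use_unchecked = False
    if strategy == "voting":
        # Sequence-level vote: locations where all three matches are
        # valid decide whether non-checkable structure matches are added.
        eligible = [x for x in of_fw if of_fw[x] is not None
                    and of_bw[x] is not None and st_fw[x] is not None]
        votes = [agree(st_fw[x], of_fw[x]) and agree(st_bw[x], of_bw[x])
                 for x in eligible]
        use_unchecked = bool(votes) and sum(votes) / len(votes) >= vote_thresh

    result = {}
    for x in of_fw:
        if of_fw[x] is not None:                  # forward flow match wins
            result[x] = of_fw[x]
        elif st_fw[x] is not None:
            if strategy == "permissive":          # include unconditionally
                result[x] = st_fw[x]
            elif agree(st_bw[x], of_bw[x]):       # restrictive consistency check
                result[x] = st_fw[x]
            elif of_bw[x] is None and use_unchecked:  # non-checkable, voted in
                result[x] = st_fw[x]
    return result
```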

5 Evaluation

Evaluation Setup. In order to evaluate our new approach, we used the following components within our pipeline (cf. Fig. 1): The pose estimation uses the OpenMVG [27] implementation of the incremental SfM approach [26]; the forward and backward matching employs the Coarse-to-fine PatchMatch (CPM) approach [15]; the structure matching and the consistency-based combination are performed as described in Sects. 3 and 4, respectively, followed by a robust interpolation of the combined correspondences (RIC) [14]. Finally, the inpainted matches are refined using the order-adaptive illumination-aware refinement method (OIR) [22]. Except for the refinement, where we optimized [35] the three weighting parameters per benchmark using the training data, we used the default parameters.

Fig. 4. Illustration of the different strategies for combining the computed matches. Top: Color-coded input matches; white denotes no match. Bottom: Fusion results. (Color figure online)

Benchmarks. To evaluate the performance of our approach, we consider three different benchmarks: the KITTI 2012 [11], the KITTI 2015 [24], and the MPI Sintel [7] benchmark. These benchmarks exhibit an increasing amount of motion beyond pure ego-motion: While KITTI 2012 consists of pure ego-motion, KITTI 2015 additionally includes the motion of other traffic participants. Finally, MPI Sintel also contains non-rigid motion from animated characters.

Baseline. To measure improvements, we establish a baseline that does not use structure information and only relies on forward optical flow matches (CPM). As Table 1 shows, our baseline outperforms most of the related approaches. Only DF+OIR [22] performs slightly better, due to the advanced DF matches [25].

Structure Matching. Next, we investigate the performance of our novel structure matching approach on its own. To this end, we replace the matching approach (CPM) in our baseline with three variants of our structure matching approach (CPMz): a two-frame variant, a three-frame variant with temporal averaging, and a three-frame variant with temporal selection. As the results in Table 1 show, structure matching significantly outperforms the baseline in pure ego-motion scenes, while it naturally has problems in scenes with independent motion. Moreover, the results show that the use of multiple frames pays off. However, while for the KITTI benchmarks the robustness of temporal averaging is more beneficial than the occlusion handling of temporal selection, the opposite holds for the MPI Sintel benchmark. This, in turn, might be attributed to the fact that MPI Sintel contains a larger amount of occlusions. Since both strategies have their advantages, we consider both variants in our further evaluation.

Fig. 5. Example from the KITTI 2015 benchmark [24] (#186). First row: reference frame, subsequent frame, ground truth. Second row: forward matches, structure matches (depth visualization). Following rows, from left to right: used matches (color coding as in Fig. 4), final result, bad pixel visualization; from top to bottom: baseline, permissive approach, restrictive approach, voting approach. (Color figure online)

Table 1. Results for the training datasets of the KITTI 2012 [11] (all pixels), KITTI 2015 [24] (all pixels) and the MPI Sintel [7] benchmarks (clean render path) in terms of the average endpoint error (AEE) and the percentage of bad pixels (BP, 3px threshold).

Unconstrained Matching. Apart from the baseline, we also evaluated two additional variants solely based on unconstrained matching: a variant using only backward matches and a variant that augments the forward matches with backward matches. To this end, we assume a constant motion model, i.e. \(\mathbf {\hat{u}}_{\mathrm {of, fw}} = -\mathbf {\hat{u}}_{\mathrm {of, bw}}\). The results for the backward flow in Table 1 show that such a simple model does not leverage useful information for predicting the forward flow. Even the augmented variant does not improve upon the baseline.
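A hypothetical sketch of this augmentation, using the same dictionary interface as above:

```python
def augment_with_backward(of_fw, of_bw):
    """Where the forward match was filtered out, substitute the negated
    backward match, assuming constant motion (u_of_fw = -u_of_bw)."""
    return {x: of_fw[x] if of_fw[x] is not None
            else (None if of_bw[x] is None else -np.asarray(of_bw[x]))
            for x in of_fw}
```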

Fig. 6. Example from the MPI Sintel benchmark [7] (ambush5 #44). First row: reference frame, subsequent frame, ground truth. Second row: forward matches, structure matches (forward match visualization). Following rows, from left to right: used matches (color coding as in Fig. 4), final result, bad pixel visualization; from top to bottom: baseline, permissive approach, restrictive approach, voting approach. (Color figure online)

Combined Approach. Let us now turn to the evaluation of our combined approach. In this context, we compare the impact of the different combination strategies. As one can see in Table 1, the permissive approach is not an option: While it works well for dominating ego-motion, it includes too many false structure matches in case of independent object motion. In contrast, the restrictive approach prevents the inclusion of false structure matches, but cannot exploit the full potential of such matches in scenes with dominating ego-motion. Nevertheless, it already outperforms the baseline significantly and gives the best results for MPI Sintel. Finally, the voting approach combines the advantages of both schemes. It yields the best results for KITTI 2012/2015, with improvements of up to 50% compared to the baseline, while still offering an improvement on MPI Sintel. This observation is also confirmed by the examples in Figs. 5 and 6. They show the usefulness of including structure matches in occluded areas and the importance of filtering false structure matches in general.

Comparison to the Literature. Finally, we compare our method to other approaches from the literature. To this end, we consider both the training and the test data sets; see Tables 1 and 2, respectively. Regarding the training data sets, our method generally yields better results than recent learning approaches without fine-tuning (PWC-Net [36], FlowNet2 [18], UnFlow [23]). Moreover, it also outperforms DCFlow [42] and MR-Flow [41] on the KITTI 2015 benchmark. Only MirrorFlow [17] (KITTI 2015) and MR-Flow (MPI Sintel) provide better results. This good performance carries over to the test data sets, for which we evaluated the approaches that had performed best on the training data. Here, on KITTI 2012, our method performs favorably (all pixels) even compared to methods based on pure ego-motion and semantic information. Moreover, it also outperforms recent approaches with an explicit SfM background estimation (MR-Flow) on KITTI 2015. Finally, ranking second and sixth, our method also yields an excellent performance on the clean and final set of MPI Sintel, respectively. This shows that our method not only works well in the context of pure ego-motion but can also handle a significant amount of independent object motion.

Table 2. Top 10 non-anonymous optical flow methods on the test data of the KITTI 2012/2015 [11, 24] and of the MPI Sintel benchmark [7], excluding scene flow methods.

Fixed Parameter Set. Finally, we investigate how the results change when the refinement parameters are not optimized individually for each benchmark. To this end, we considered the voting approach with temporal averaging and conducted an experiment on the training data with all parameters fixed. As Table 3 shows, the results hardly deteriorate when using a single parameter set for all benchmarks.

Table 3. Impact of refinement parameter optimization.

Runtime. The runtime of the pipeline excluding the pose estimation is 32s for one frame of size \(1024~\times ~436\) (MPI Sintel) using three cores on an Intel® Core™ i7-7820X CPU @ 3.6 GHz, which splits into: 5.5s matching (incl. outlier filtering), <0.1s combination, 1.5s inpainting and 25s refinement. The pose estimation is run on the entire image sequence, which takes 83s for a sequence with 50 frames.

6 Conclusion

In this paper, we addressed the problem of integrating structure information into feature matching approaches for computing the optical flow. To this end, we developed a hierarchical, depth-parametrized, three-frame SfM/stereo PatchMatch approach with temporal selection and preceding pose estimation. By adaptively combining the resulting matches with those of a recent PatchMatch approach for general motion estimation, we obtained a novel SfM-aware method that benefits from a global rigidity prior while still being able to estimate independently moving objects. Experiments not only showed excellent results on all major benchmarks (KITTI 2012/2015, MPI Sintel), they also demonstrated consistent improvements over a baseline without structure information. Since our approach is based on inpainting and refining advanced feature matches, it offers another advantage: Other optical flow methods can easily benefit from it by incorporating its matches or the resulting dense flow fields as initialization.