
1 Introduction

Stereo is one of the oldest areas of computer vision research [1]. Interestingly, the arrival of mass-produced active depth sensors [2] seems to have also renewed interest in passive stereo systems. In contrast to active depth sensors, stereo cameras are also applicable in outdoor environments. Due to their more general applicability, stereo cameras are gaining increased adoption, for example in autonomous driving [3]. Remarkably, the availability of stereo image pairs also helps in the estimation of temporal correspondences: On the KITTI optical flow benchmark [4], the best-performing algorithms [5, 6] are indeed scene flow algorithms that jointly estimate depth and 3D motion from stereo videos. Part of their advantage stems from an increased robustness to adverse imaging conditions [6]. One such adverse imaging condition is a shortage of light. In low-light conditions, the exposure time often needs to be increased to obtain a reasonable signal-to-noise ratio. But when either the camera or objects in the scene move during the exposure, the result is motion-blurred images.

Motion blur is not only unpleasant to look at, but can also disturb further image-based processing, e.g. in tasks such as panorama stitching [8] or barcode recognition [9]. In stereo video setups, viewpoint-dependent motion blur hinders post-capture adjustment of the baseline, the acquisition and visualization of 3D point clouds (see Fig. 1 for an example), and the control of tele-operated robots in the presence of rapid robot and/or object motion.

Fig. 1.

Application of stereo video deblurring: Given two consecutive stereo frames (a), our deblurring approach estimates sharp textures from stereo video input with motion blur. Rendering the scene flow geometry with the blurred input image as a colored point cloud from a new point of view produces an unnatural motion blur (b). Our stereo video deblurring algorithm removes the blur (c).

In this paper we address the challenge of deblurring stereo videos. In contrast to the substantial literature on removing camera shake [10–15], we aim to deal with the more general case of camera and object motion. In the case of independent motions, mixed pixels at motion boundaries cause significant complications. Removing such spatially-variant blur is extremely challenging when attempted from single images [16, 17], but video input helps to significantly increase robustness [7, 18]. Unlike previous work, we leverage stereo video to obtain substantially improved and more robust deblurring results. In our approach, we exploit 3D scene flow in various ways and make the following contributions: (i) We show that 3D scene flow can improve video deblurring by providing more accurate motion estimates. In particular, we exploit piecewise rigid scene flow [6], which yields an over-segmentation of the image into planar patches that move with a rigid 3D motion (Figs. 2b and c). (ii) We demonstrate that the resulting piecewise homographies allow us to directly induce blur matrices. Thereby, we take into account that the projection of a rigid 3D motion yields non-linear motion trajectories in 2D (Fig. 3, Table 1). We find that this leads to superior deblurring results compared to inducing the blur matrices from an optical flow field [7] (Figs. 2d to f). (iii) We apply the homography-induced blur matrices in a robust deblurring procedure that attenuates the effects of motion discontinuities using an iterative weighting scheme; the initial motion discontinuities are obtained from 3D scene flow. We demonstrate the superiority of the proposed stereo video deblurring over state-of-the-art monocular video deblurring in experiments on synthetic data as well as on real videos.

Fig. 2.

Stereo video deblurring: For two consecutive frames of a synthetic stereo video (a) we use the scene flow approach of Vogel et al. [6] to compute an over-segmentation into planar patches with constant 3D rigid body motion (b). Projecting the 3D motion onto the image plane yields optical flow (c), which our baseline algorithm uses to deblur a reference frame (d). Exploiting the homographies from the 3D motion and object boundary information from the over-segmentation, our full approach obtains sharp images avoiding ringing and boundary artifacts (e). Our result is also clearly sharper than state-of-the-art video deblurring [7] (f)

Fig. 3.

Descriptiveness of homography-based blur kernels: Using 3D rigid body motion to generate blur kernels, we can faithfully express, e.g., yaw motion (a), while kernels constructed with spatially varying 2D displacement vector fields [7] only yield an approximation (b). Approximation errors (c) are also present close to the rotation axis where motions are small (extremely large yaw angle and all intensities scaled for better visibility)

Table 1. Overview of the different sources of motion information used for video deblurring: When pure 2D correspondence is considered (top two rows), the induced blur kernels are only approximate, as motion trajectories are assumed to be linear. Exploiting homographies from scene flow allows us to capture the fact that rigid 3D object motion leads to non-linear trajectories

2 Related Work

The goal of this work is to obtain sharp images from stereo videos containing 3D camera and object motion. Of course, in principle blind deblurring could be applied to each frame individually. However, blind motion deblurring from a single image is a highly underconstrained problem, as blur parameters and the sharp image have to be estimated from a single measurement. To cope with spatially-variant blur due to the 3D motion of the camera, single image deblurring approaches frequently use homographies [19–21]. In contrast, we apply homographies to describe spatially-variant object motion blur. Single image object motion deblurring approaches keep the number of parameters manageable by choosing the motion of a region from a very restricted set of spatially-invariant box filters [22, 23], by assuming it to have a spatially-invariant, non-parametric kernel of limited size [16], or by assuming it to be representable by a discrete set of basis kernels [24]. Approaches that rely on learning spatially-variant blur are also limited to a discretized set of detectable motions [17, 25]. Kim et al. [26] consider continuously varying box filters for every pixel, but rely heavily on regularization.

Connecting deblurring and depth estimation, Xu and Jia [27] successfully apply stereo correspondence estimation to motion-blurred stereo frames to support blind image deblurring. Lee and Lee [28], Arun et al. [29], and Hu et al. [30] estimate sharp images and depth jointly. However, all these approaches assume the scene to be static and camera motion to be the only source of motion blur.

Cho et al. [31] deblur images of independently moving objects. The multiple input images of their algorithm are unordered, and a piecewise affine registration between the images, as well as the motion underlying the blur, has to be estimated. To restrict the parameter space, the blur kernels are assumed to be piecewise constant and linear.

Video deblurring approaches reduce the number of parameters through the assumption that the inter-frame and intra-frame motion are related by the duty cycle of the camera. He et al. [32] and Deng et al. [33] apply feature tracking of a single moving object to obtain 2D displacement-based blur kernels for deblurring. Wulff and Black [18] refine the latter approach and perform segmentation into two layers, estimation of the affine motion parameters, as well as deblurring of each layer jointly. Relaxing the assumption of two layers and affine motion, Yamaguchi et al. [34] and Kim and Lee [7] employ optical flow to approximate spatially variant blur kernels for deblurring. Yamaguchi et al. [34] propose deblurring based on the flow estimates from the blurry images. Kim and Lee [7] iteratively refine flow estimation and deblurred video frames by minimizing a joint energy. The latter method represents the state-of-the-art in video deblurring and is used for comparison in the experimental section. To the best of our knowledge, exploiting stereo video for deblurring has not been considered in the literature before.

Correspondence estimation on stereo video sequences can be improved by estimating stereo correspondences and optical flow jointly as 3D scene flow [35–37]. In our approach we build on the piecewise rigid scene flow by Vogel et al. [6] for the following reasons. First, it provides us with explicit 3D rotations and translations that we employ for accurate blur kernel construction. Second, through over-segmentation into planar patches, it also delivers occlusion information, which we use as initialization for our boundary-aware object motion deblurring. A general problem in object motion deblurring is that object boundaries with mixed foreground and background pixels can lead to severe ringing artifacts (see Fig. 2). Explicit segmentation and \(\alpha \)-matting [18, 38] can prevent this effect, but require restrictive assumptions on the number of moving objects. To handle general scenes with an arbitrary number of objects, we extend the robust outlier handling of Chen et al. [39] to spatially-variant deblurring based on scene flow, and apply it to the mixed pixels at object boundaries.

In contrast to the aforementioned deblurring approaches, Cho et al. [40] deblur hand-held video under the assumption that patches are sharp in some frames of the video. However, in the case of autonomous robots or objects passing through the field of view at high speed, this assumption does not hold. Joshi et al. [41] attach additional inertial measurement units to the camera, but this does not account for object motion. An additional low-resolution, high frame-rate camera can provide complex motion kernels [38], but does not provide depth estimates in the way a stereo camera can.

3 Blurred Image Formation in Stereo Video

Inducing Blur Matrices from 3D Rigid Object Motions. Due to the finite exposure time \(\tau \) of our stereo video camera, each frame of each camera is blurred. Our goal is to find a sharp image \(I_{t_0}\) for a reference camera at time \(t_0\). We base our approach on the scene flow of Vogel et al. [6], and likewise assume that the scene can be approximated with planar patches that undergo a 3D rigid body motion. If an object in the scene is non-planar, this assumption leads to an over-segmentation of the object into spatially adjacent patches (see Fig. 2b). Considering video frames where the exposure time is naturally limited by the frame rate, we additionally assume that the motion of each patch is constant during the exposure time of two consecutive frames. Note that a constant rigid motion in 3D does not necessarily imply that its 2D projection is constant; the projection may, e.g. in the case of a rotation, be constantly accelerated. However, our assumption excludes rapidly changing motions such as vibrations.

Constant 3D rigid body motion can be expressed as a homogeneous \(4\times 4\) matrix

$$\begin{aligned} M = \begin{pmatrix} R &{} T \\ \varvec{0} &{} 1 \end{pmatrix} \end{aligned}$$
(1)

with a rotation matrix \(R\in \mathbb {R}^{3\times 3}\) and a translation vector \(T \in \mathbb {R}^3\). To enable our highly accurate blur kernel description, we rewrite \(M = \exp \big ( \theta \xi \big )\) as a matrix exponential, where \(\theta \in \mathbb {R}\) describes the rotation angle and \(\xi \) is a \(4\times 4\) matrix that is determined by the rotation axis and the translation, see [42, 43]. With M describing the motion between time instants \(t_0\) and \(t_1\), the constant 3D motion between two arbitrary time instants \(t_a\) and \(t_b\) is given as

$$\begin{aligned} M_{t_b, t_a} = \exp \left( \frac{t_b - t_a}{t_1 - t_0} \theta \xi \right) \!. \end{aligned}$$
(2)
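As an illustration of Eq. (2), the following sketch computes such fractional rigid-body motions numerically via the matrix logarithm and exponential. The function name and the example motion are purely illustrative and assume NumPy/SciPy; they are not part of the original implementation.

```python
import numpy as np
from scipy.linalg import expm, logm

def fractional_motion(M, t_a, t_b, t0=0.0, t1=1.0):
    """Eq. (2): M_{t_b, t_a} = exp(((t_b - t_a) / (t1 - t0)) * theta * xi),
    where M is the 4x4 rigid-body motion between the reference instants t0, t1."""
    twist = logm(M)                              # theta * xi as a 4x4 matrix
    s = (t_b - t_a) / (t1 - t0)                  # fraction of the inter-frame motion
    return np.real(expm(s * twist))

# Example: small yaw rotation combined with a forward translation.
theta = 0.05
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
M = np.eye(4)
M[:3, :3] = R
M[:3, 3] = np.array([0.0, 0.0, 0.1])

M_half = fractional_motion(M, 0.0, 0.5)          # motion over half the frame interval
```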

In a piecewise planar scene approximation, the 3D planes of the patches at time t are defined via their scaled normals \(n_t\). All points P on the plane satisfy the equation \(P^{\text {T}} n_t = 1\), where \(P^{\text {T}}\) is the transpose of P. We can relate a moving 3D point to its corresponding pixel location on the image plane via the camera geometry. Given the calibration matrix K of the reference camera and its location \(T_K\), the projection from a 3D plane to the image plane at time t can be written in homogeneous coordinates as \(Pr_{t} = K - K T_{K} n_t^{\text {T}}\), see e.g. [6].

Under the assumption of color constancy, two sharp images of the reference camera (with hypothetical infinitesimal exposure) at different times are connected via

$$\begin{aligned} I_{t_a}( x ) = I_{t_b}( {}^{t_b}{} H^{t_a} x ) \quad \text {where}\quad {}^{t_b}{}H^{t_a}_{} = Pr_{t_b} M_{t_b, t_a} Pr_{t_a}^{-1}. \end{aligned}$$
(3)

With this notation, a blurry image pixel x in the interior of a patch is formed from the reference image as

$$\begin{aligned} \hat{B}(x) = \int ^{t_0 + \frac{\tau }{2}}_{t_0 -\frac{\tau }{2}} I_{t}(x ) \,\text {d}t = \int ^{t_0 + \frac{\tau }{2}}_{t_0 -\frac{\tau }{2}} I_{t_0}( {}^{t_0}{}H^{t}_{} x ) \,\text {d}t, \end{aligned}$$
(4)

where

$$\begin{aligned} {}^{t_0}{}H^{t}_{} = Pr_{t_0} \exp \big ( -t \theta \xi \big ) Pr_{t}^{-1} \end{aligned}$$
(5)

is a homography that can be computed exactly from the camera geometry, normal, and motion. To put it differently, a 3D point that is projected to x on the image plane describes a certain trajectory on the image plane during the exposure time. If the 3D point follows a rigid body motion, the homography \(\big ( {}^{t_0}{}H^{t}_{} \big )^{-1}\) allows us to describe this 2D trajectory exactly. In contrast, optical flow-based methods [7, 24, 44] employ 2D optical flow vectors to generate \(I_{ t}\) via forward warping. Thus the trajectory of a point on the image plane is approximated by a 2D line that is traversed with constant velocity. As optical flow is spatially variant, the trajectories may change from pixel to pixel and hence induce blur kernels with a curved shape. However, more complex motions such as rotations can only be approximated (Fig. 3). In our approach, the description of trajectories due to 3D rigid body motions is exact. As our experiments show, this also results in more faithfully deblurred images.
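To make the difference concrete, the following per-pixel sketch traces the projected 2D trajectory of the scene point seen at a reference pixel under a constant rigid 3D motion, i.e. the curved blur streak illustrated in Fig. 3. The helper names (plane_projection, pixel_trajectory) are hypothetical, the reference camera is assumed axis-aligned with projection \(K[I\,|\,-T_K]\), and the actual blur matrix in Eq. (4) samples the inverse mapping \({}^{t_0}{}H^{t}\); this sketch is only meant to illustrate the non-linearity of the trajectory.

```python
import numpy as np
from scipy.linalg import expm, logm

def plane_projection(K, T_K, n):
    """Pr = K - K T_K n^T: maps points on the plane {P : n^T P = 1}
    to homogeneous image coordinates of the reference camera."""
    return K - K @ np.outer(T_K, n)

def pixel_trajectory(x, K, T_K, n, M, times, t0=0.0, t1=1.0):
    """2D image positions over 'times' of the scene point imaged at pixel x
    (homogeneous 3-vector) at the reference time, for constant rigid motion M."""
    Pr0 = plane_projection(K, T_K, n)
    P = np.linalg.solve(Pr0, x)                   # back-project (up to scale)
    P = P / (n @ P)                               # enforce n^T P = 1 to fix the scale
    twist = logm(M)                               # theta * xi
    traj = []
    for t in times:
        Mt = np.real(expm(((t - t0) / (t1 - t0)) * twist))
        Pt = Mt[:3, :3] @ P + Mt[:3, 3]           # rigidly move the point to time t
        xt = K @ (Pt - T_K)                       # project with the reference camera
        traj.append(xt[:2] / xt[2])
    return np.array(traj)
```

Sampling `times` densely over the exposure interval \([t_0-\tau /2,\, t_0+\tau /2]\) yields the curved trajectory that the homography-based kernels capture exactly, whereas a single 2D flow vector would replace it by a straight segment traversed with constant velocity.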

By discretizing the integration over time with \(\delta t = \frac{\tau }{N}\) (we fix \(N = 70\)) and using bilinear image interpolation, we can obtain a discretized version of Eq. (4) for vectorized reference images as \(\hat{B}(x) = A_x \varvec{I}_{t_0}\). Here, \(A_x\) denotes a sparse row vector that depends on the homography estimated at pixel x. Stacking the blur vectors \(A_x\) for each pixel, we obtain our homography-based blur matrix A leading to \(\hat{\varvec{B}} = A \varvec{I}_{t_0}\).
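A minimal sketch of this discretization is given below. It assumes a grayscale, vectorized image and a hypothetical helper traj_fn(x, y, t) that returns the sharp-image position \({}^{t_0}{}H^{t} x\) contributing to pixel (x, y) at time t; the actual implementation is vectorized, but the per-pixel logic is the same: each blur row spreads a weight of 1/N via bilinear interpolation at N samples along the trajectory.

```python
import numpy as np
from scipy.sparse import lil_matrix, csr_matrix

def blur_matrix(traj_fn, height, width, N=70, tau=1.0):
    """Build the sparse blur matrix A of B = A I for a vectorized image.

    traj_fn(x, y, t) -> (u, v): sharp-image position contributing to pixel
    (x, y) at time t (the homography-warped position of [x, y, 1]^T).
    """
    A = lil_matrix((height * width, height * width))
    times = np.linspace(-tau / 2.0, tau / 2.0, N)    # relative to the reference time
    for y in range(height):
        for x in range(width):
            row = y * width + x
            for t in times:
                u, v = traj_fn(x, y, t)
                u0, v0 = int(np.floor(u)), int(np.floor(v))
                du, dv = u - u0, v - v0
                # bilinear interpolation weights, each scaled by 1/N
                for uu, vv, w in [(u0,     v0,     (1 - du) * (1 - dv)),
                                  (u0 + 1, v0,     du * (1 - dv)),
                                  (u0,     v0 + 1, (1 - du) * dv),
                                  (u0 + 1, v0 + 1, du * dv)]:
                    if 0 <= uu < width and 0 <= vv < height:
                        A[row, vv * width + uu] += w / N
    return csr_matrix(A)
```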

Motion Boundaries. If only scene points from the same plane contribute to the color B(x) of the measured blurred image at point x, the image formation model of Eq. (4) is exact. If at time t a scene point with a different motion contributes to B(x), we would strictly need to use the corresponding homography as well. However, within an object, the planar patches are adjacent in space and move consistently. Therefore, we approximate the blur with the row vector \(A_x\) induced by the homography of x at \(t_0\). At motion boundaries, the homographies are very different and, as pixels of foreground and background mix, transparency effects occur. While such effects can be modeled, taking them into account requires precise localization of the motion boundaries, which is very challenging. Instead, we exclude motion boundaries from the deblurring process by means of an iterative approach. In each iteration, we downweight pixels with a high difference between the image formation model and the measured image and try to find a sharp image that explains the remaining pixels. Under the assumption of additive Gaussian noise, we use the residual to compute a weight for each pixel as

$$\begin{aligned} w_n (x) = \exp \Big (\!- k_\sigma \Vert B (x) - A_x \varvec{I}^{n-1}_{t_0} \Vert ^2 \Big ), \end{aligned}$$
(6)

where \(\varvec{I}^{n-1}_{t_0} \) denotes the current estimate of the sharp (color) image. For normalized images we use a fixed default value for \(k_\sigma \). In the first iteration we initialize \(w_0 \) with the binary occlusion information from the scene flow. As Fig. 4 shows, the weights converge quickly. Some pixels in the image that were initially suppressed as motion boundaries are included in deblurring at a later iteration. More importantly, other pixels where the image formation model is invalid are suppressed later on, which helps control ringing artifacts. Suppression may also happen due to inaccuracies in the computed scene flow. In the experimental section, we will see how this property actually helps to improve deblurring results.
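A sketch of this weight update, for a single-channel image and with the (unspecified) scale parameter k_sigma left as an argument, might look as follows; the function name and the occlusion-mask handling are illustrative assumptions.

```python
import numpy as np

def update_weights(B, A, I_est, k_sigma, occlusion_mask=None, first_iter=False):
    """Eq. (6): downweight pixels whose residual under the current sharp
    estimate is large.  B: blurred image (vectorized), A: sparse blur matrix,
    I_est: current sharp-image estimate, occlusion_mask: binary mask from
    the scene flow used for initialization."""
    if first_iter and occlusion_mask is not None:
        return occlusion_mask.astype(np.float64).ravel()
    residual = B - A @ I_est                      # per-pixel model error
    return np.exp(-k_sigma * residual ** 2)       # for color images, sum over channels first
```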

Fig. 4.

Downweighting of mixed pixels due to motion boundaries: Foreground and background mix at motion boundaries and violate our image formation model (a). At motion boundaries and at locations of inaccurate flow estimates, the image formation model is downweighted to avoid ringing artifacts. We initialize these weights with the occlusion information provided by the scene flow (b) and refine them iteratively (c), (d)

Deblurring. Theoretically, we could fill in the regions at motion boundaries during deblurring by using adjacent frames or information from the other camera. However, we found experimentally that correspondence estimation in these regions is too unreliable to produce visually pleasing results. Instead, we exploit that natural, sharp images follow a Laplacian distribution of their gradients [22]. In locations where the image formation model is unreliable, e.g., at motion boundaries, we rely on this prior to provide the necessary regularization. Specifically, we obtain an estimate of the sharp reference frame by minimizing the energy

$$\begin{aligned} E( \varvec{I}_{t_0} ) = \sum _{x\in \varOmega } \Big \Vert w_n (x) \big ( B (x) - A_x \varvec{I}_{t_0} \big ) \Big \Vert ^2 + \alpha \rho \big ( \nabla I_{t_0} (x) \big ), \end{aligned}$$
(7)

where \(\varOmega \subset \mathbb {N}^2\) is the image domain and the constant \(\alpha \) is fixed to 0.001. Following prior work [22], we use the robust norm \(\rho \big ( c \big ) = | c | ^{0.8}\) for each color channel and gradient direction.

To solve the optimization problem in Eq. (7), we use iteratively reweighted least squares (IRLS) [45]. In each reweighting iteration, we compute the following weights

$$\begin{aligned} \rho _n(c) = \frac{1}{c} \frac{d \rho \big (c\big ) }{d c } \approx \max \big ( |c |, \epsilon \big )^{0.8-2} \quad \text { with } \quad \epsilon = 0.01 \end{aligned}$$
(8)

for the smoothness term using the preceding image estimate \(\nabla I^{n-1}_{t_0} \). Then we minimize the least squares energy

$$\begin{aligned} E( \varvec{I}_{t_0} , n ) = \sum _{x\in \varOmega } \big \Vert w_n (x) \big ( B (x) - A_x \varvec{I}^n_{t_0} \big ) \big \Vert ^2 + \alpha \Vert \rho _n \nabla I^n_{t_0} (x) \Vert ^2 \end{aligned}$$
(9)

via conjugate gradients. We alternate between updating the occlusion weight \(w_n\) and the smoothness weight \(\rho _n\). In all our experiments the weights converge quickly and only a few (\(\approx \)10) iterations were needed in total.
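The following sketch shows the resulting IRLS alternation for a grayscale image, with simple forward-difference gradient operators and SciPy's conjugate gradient solver. It follows Eqs. (7)–(9) as described above, but is not the authors' implementation; in particular, the boundary handling of the difference operators and the stopping criteria are simplified.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def gradient_ops(h, w):
    """Sparse forward-difference operators D_x, D_y for a vectorized h x w image."""
    Iw, Ih = sp.eye(w, format="csr"), sp.eye(h, format="csr")
    Dw = sp.diags([-1.0, 1.0], [0, 1], shape=(w, w), format="csr")
    Dh = sp.diags([-1.0, 1.0], [0, 1], shape=(h, h), format="csr")
    return sp.kron(Ih, Dw, format="csr"), sp.kron(Dh, Iw, format="csr")

def deblur_irls(B, A, h, w, w_occ, alpha=1e-3, eps=0.01, iters=10, cg_steps=25):
    """IRLS minimization of Eq. (7); B and w_occ are vectorized, A is the blur matrix."""
    Dx, Dy = gradient_ops(h, w)
    I = B.copy()                                       # initialize with the blurry frame
    for _ in range(iters):
        # smoothness reweighting (Eq. 8) from the previous estimate
        rho_x = np.maximum(np.abs(Dx @ I), eps) ** (0.8 - 2.0)
        rho_y = np.maximum(np.abs(Dy @ I), eps) ** (0.8 - 2.0)
        # data-term weights enter squared, since w_n multiplies the residual in Eq. (9)
        Wsq = sp.diags(w_occ ** 2)
        lhs = (A.T @ Wsq @ A
               + alpha * (Dx.T @ sp.diags(rho_x) @ Dx + Dy.T @ sp.diags(rho_y) @ Dy))
        rhs = A.T @ (Wsq @ B)
        I, _ = cg(lhs, rhs, x0=I, maxiter=cg_steps)    # Eq. (9) via conjugate gradients
        # in the full method, w_occ is re-estimated here from the residual (Eq. 6)
    return I
```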

To compute the 3D scene flow needed for our stereo video deblurring approach, we rely on the method of Vogel et al. [6]. The algorithm is originally designed for sharp images. However, its data term uses the census transform for comparing the warped images, which makes it quite robust to image blur. Of course, scene flow estimation will reach its limits for very strong motion blur. Experimentally, we find that by aggregating evidence in piecewise planar patches, the method yields a scene flow accuracy that works well for deblurring stereo videos of casual motion. As the following experiments will show, it is crucial, however, not only to rely on the robust correspondence information, but also to exploit the homographies to directly induce the blur kernels.

4 Experiments

To demonstrate the efficacy of the proposed stereo video deblurring, we perform experiments on synthetic images with known ground truth, as well as on real images. We capture the real video footage with a Point Grey Bumblebee2 stereo color camera, which can acquire \(640 \times 480\) images at a frame rate of 20 Hz. We use the internal calibration and supplied software to obtain rectified and demosaiced images. The exposure time of each image can be obtained from the camera software.

In all experiments, we compute scene flow using the publicly available implementation of [6]. We take the default parameters and scale them uniformly to account for the baseline difference between our stereo camera and the KITTI dataset [4] for which they were tuned. For the \(640\times 480\) image in Fig. 2 our approach requires 73 s to form the discretized blur matrix A. Using MATLAB to optimize Eq. (7) in 25 conjugate gradient steps and 10 IRLS iterations requires 69 s on an 8-core 4 GHz CPU.

Fig. 5.

Deblurring planar textures: For a planar texture blurred with 3D rigid body motion (a), deblurring with 2D spatially-variant ground truth displacements (b) yields ringing errors (c) that can be reduced by deblurring with our homography-based image formation model (d), (e)

4.1 Comparing Flow-Based Deblurring to Homography-Based Deblurring

We begin by applying the proposed stereo video deblurring to scenes without object discontinuities. In this way we can analyze the benefit of the homography-induced motion blur model in isolation. We create synthetic sequences by simulating various 3D motions (upward and forward translation, and a combination of forward translation and yaw) of a planar, roughly fronto-parallel texture, see Fig. 5a. A second test set consists of rigidly moving 3D objects rendered with a raytracer at very small time steps and averaged to give motion-blurred images (see Figs. 4a, 6a and 7a for the first image of the left view). We take the central frame of each motion-blurred image as a sharp reference frame. For the rendered scenes, motion discontinuities are known. In the first experiment, we disable the data term around any motion discontinuities by fixing the weights \(w_n\) in these areas to zero, see Fig. 6b for an example. As the image prior stays active, the boundaries are filled in smoothly, as illustrated in Fig. 6d.

Fig. 6.

Deblurring with masked discontinuities: Our raytraced stereo video frames contain independent object motion of non-planar objects (a). Through the estimated disparity we can assess the shape of the objects (c). Excluding the given discontinuities (b) from the computation of the data term, invalid areas are filled in smoothly (d). The masked difference image (e) to the real sharp image (f) shows that homography-based deblurring has about the same error on planar as on curved surfaces, demonstrating the effectiveness of the over-segmentation from the scene flow

We compare our homography-induced deblurring approach against deblurring with blur matrices generated from different 2D displacement fields. We use forward and backward 2D motion as described by Kim and Lee [7] and apply them in our IRLS deblurring framework. In particular, we use the known ground-truth 2D displacement, the initial 2D optical flow with which the scene flow is initialized [46] (baseline deblurring), and the 2D projection of the scene flow to induce blur kernels. Table 1 summarizes these settings. Table 2 shows the peak signal-to-noise ratio (PSNR) of the deblurred images from the different methods. We observe that the PSNR of our homography-based stereo video deblurring outperforms the results of deblurring with ground-truth 2D displacement in all cases of non-fronto-parallel motion. In these cases, linear motion trajectories of constant velocity are only an approximation. Blur matrices induced by homographies are more expressive and improve the results. Already, deblurring with the 2D projection of the scene flow achieves a consistently higher PSNR than deblurring with the initial flow. Indeed, in the case of forward motion, even deblurring with the 2D projection of the scene flow outperforms deblurring with the ground-truth displacement. The estimated 2D displacement appears to be a better approximation to the linear, but accelerated trajectory of the 3D forward motion than the 2D ground-truth displacement. Figures 5b and d show examples of deblurred images using the ground-truth 2D displacement and our homography-based approach. From the difference images between the results and the original sharp texture, Figs. 5c and e, we observe that the increase in PSNR is due to the mitigation of ringing effects throughout the image.
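For reference, the PSNR values reported here follow the standard definition; a minimal version (assuming images normalized to a peak value of 1) is:

```python
import numpy as np

def psnr(estimate, reference, peak=1.0):
    """Peak signal-to-noise ratio in dB between a deblurred result and the sharp reference."""
    mse = np.mean((np.asarray(estimate) - np.asarray(reference)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```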

Fig. 7.

Raytraced scenes for evaluating object motion deblurring: The input images exhibit blur due to 3D object motion (a). Also when 3D homographies are used to induce blur kernels, mixed pixels at object boundaries cause some ringing artifacts (b). Iteratively downweighting the boundaries from the data term, our full stereo video deblurring (c) suppresses ringing and obtains considerably sharper images than state-of-the-art video deblurring (d). Please zoom in for detail

Table 2. Deblurring without considering motion-discontinuity regions: For different motions of a planar texture (top) and moving 3D objects with masked object boundaries (bottom), we report the peak signal-to-noise ratio (PSNR) of the deblurred reference frame, the average endpoint error of the estimated motion (AEP), and the average disparity error (ADE) of the estimation. For all scenes the use of scene flow increases deblurring accuracy compared to using optical flow. For scenes with non-fronto-parallel motion (all except ‘upward’ and ‘apples’) homography-based object motion deblurring provides the best results (bold)

For the raytraced scenes the geometry of the moving objects is non-planar and the planarity assumption in our image formation model becomes an approximation. Figure 6 shows the estimated disparity of an object and the deblurred image obtained by masking out discontinuities. Looking at the difference image, Fig. 6e, we observe that the deblurring error for slightly curved surfaces is comparable to the performance on planar regions of the background, showing that the over-segmentation aids in coping with curved surfaces.

For all rendered scenes where the disparity does not exhibit gross errors, we observe in Table 2 that 3D homography-based deblurring clearly improves the PSNR over any form of 2D deblurring. In the scene ‘apples’, Fig. 7a, 1st row, depth estimation fails with a mean disparity error of 4.95 pixels. In this situation the deblurring quality of homography-based deblurring drops below that of its 2D projection. Still, both outperform the results obtained with the initial optical flow. More importantly, as we will see below, the iterative weighting scheme for treating motion discontinuities can address such disparity estimation errors as well and leads to much improved results.

4.2 Full Algorithm with Motion Discontinuities

We now evaluate the performance of stereo video deblurring in the presence of object motion boundaries. We use the raytraced scenes from the previous experiment, but this time without providing ground-truth information on the motion discontinuities, Fig. 7a. Additionally, we use real images captured with a stereo camera attached to a motorized rail, Fig. 8a. The camera moves forward very slowly on the rail while we capture frames with maximal exposure time and frame rate. By averaging the frames, we obtain motion-blurred images. Comparison to the central frame of the averaged frame series allows for numerical evaluation. Finally, we capture scenes with arbitrarily moving objects for which only a visual evaluation is possible, Fig. 9a. As before we compare against 2D versions of our algorithm. Additionally, we compare against the state-of-the-art video deblurring algorithm of Kim and Lee [7] that uses 3 consecutive monocular frames. We tuned their regularization parameter to obtain the most accurate results.
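The construction of the blurred evaluation data and its sharp reference can be summarized by the following small sketch (illustrative names, assuming the captured or rendered frames cover exactly one exposure interval):

```python
import numpy as np

def average_blur(frames):
    """Average consecutive sharp frames into a motion-blurred image and return
    the central frame as the sharp reference for numerical evaluation."""
    stack = np.stack([np.asarray(f, dtype=np.float64) for f in frames])
    blurred = stack.mean(axis=0)
    reference = stack[len(frames) // 2]
    return blurred, reference
```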

Table 3. Deblurring with motion discontinuities: PSNR of deblurred synthetic scenes with motion discontinuities (top) and real scenes with the camera moving on a motorized rail (bottom). Our homography-based stereo video deblurring with motion boundary weighting (full) clearly outperforms monocular video deblurring with optical-flow induced blur kernels in all cases

In Figs. 7b and c we first contrast homography-induced deblurring without and with handling of motion boundaries. Not taking motion boundaries into account explicitly, i.e. \(w_n \equiv 1\), Fig. 7b, results in considerable ringing artifacts, which are successfully suppressed by our proposed iterative weighting scheme, Fig. 7c. This also becomes evident in the numerical evaluation when comparing the \(3^{\mathrm{rd}}\) and \(4^{\mathrm{th}}\) columns of Table 3 (top). For the real sequences in Fig. 8, boundary artifacts are generally less pronounced, as all objects in the scene are static and the camera moves toward the scene. However, as shown in Fig. 4, the discontinuity weight can still compensate for errors in the scene flow computation. One such example is the erroneous depth estimation in the ‘apples’ scene, which is suppressed by the discontinuity weight. Similarly, also in the scenes with the motorized rail, our full object motion deblurring approach improves the PSNR compared to the basic homography approach, Table 3 (bottom).

When comparing to the state-of-the-art video deblurring method of Kim and Lee [7], we find that our stereo video deblurring approach yields significantly fewer ringing artifacts and considerably sharper results. This can be seen visually, comparing (d) to (c) of Figs. 7, 8 and 9, as well as quantitatively in Table 3. Interestingly, we find in Table 3 that IRLS deblurring with the 2D projection of the scene flow is already on par with the video deblurring of Kim and Lee. 3D homography-based deblurring without boundary handling already improves on these results numerically, highlighting the importance of our homography-induced blur kernels. Yet, our full homography-based object deblurring with motion boundary handling gives further numerical gains and a large visual improvement. Recall that the motion boundaries are initially obtained from the 3D scene flow and are thus unique to our setting.

Fig. 8.

Controlled camera motion for evaluating object motion deblurring: Our 3D deblurring (c) shows fewer ringing artifacts than baseline deblurring with optical flow (b), and sharper results than video deblurring (d), in particular at the periphery of the images where motion is large

For the real scenes with independent object motion, Fig. 9, we observe that the optical flow-based approaches introduce ringing artifacts, particularly where strong gradients of the background coincide with the object boundary. Our stereo video deblurring algorithm can cope with this situation even when non-planar, non-rigidly moving objects, such as the trousers (\(2^\mathrm{nd}\) row), are present.

Fig. 9.

For real scenes with independent object motion (a), our novel stereo video deblurring approach (c) generates fewer ringing artifacts due to object boundaries than baseline deblurring with optical flow (b) and sharper images than video deblurring (d)

5 Conclusions and Future Work

We have proposed the first stereo video deblurring approach, which is based on an image formation model that exploits 3D scene flow computed from stereo video. For scenes with an arbitrary number of moving objects, we use an over-segmentation of the scene into planar patches to establish spatially-variant blur matrices based on local homographies. Our experiments on synthetic scenes and real videos show that deblurring with these homographies is more accurate than baseline methods based on 2D linear motion approximations, as well as the current state of the art in video deblurring. Combined with our robust treatment of motion boundaries through an iterative weighting scheme, our approach obtains superior results also on real stereo videos with independently moving objects. In future work we would like to improve the performance of scene flow computation at motion boundaries, so that the other view can supply information near the boundaries.