
1 Introduction

Global motion compensation (GMC) removes the impact of both intentional and unwanted camera motion from a video, transforming it to have a static background so that the only motion comes from foreground objects. As a related problem, video stabilization removes unwanted camera motion, such as vibration, and generates a video with smooth camera motion. The term “global motion compensation” is also used in the video coding literature, where background motion is estimated roughly to enhance video compression performance [1, 2].

GMC is an essential module for processing videos from non-stationary cameras, which are abundant due to emerging mobile sensors, e.g., wearable cameras, smartphones, and camera drones. First, the resultant motion panorama [3], as if virtually generated by a static camera, is itself appealing for visual perception. More importantly, many vision tasks may benefit from GMC. For instance, dense trajectories [4] are shown to be superior when camera motion is compensated [5]; otherwise, camera motion interferes with human motion, rendering the analysis problem very challenging. GMC allows reconstruction of a “stitched” background [6], and subsequently segmentation of the foreground [7, 8]. This helps multi-object tracking by reducing the unconstrained problem of tracking multiple in-the-wild objects to that of tracking objects against a static background [9].

Fig. 1. Schematic diagrams of proposed TRGMC and existing sequential GMC algorithms, and resultant motion panorama for a video shot by panning the camera up and down. Background continuity breaks easily in the case of the sequential GMC [10].

In existing GMC works [10–12], frames are transformed to a global motion-compensated coordinate (GMCC) by sequentially processing input frames. For a pair of consecutive frames, the mapping transformation is estimated, and by accumulating these transformations, a composite global transformation of each frame to the GMCC is obtained. However, the sequential processing scheme causes frequent GMC failures for multiple reasons: (1) Sequential GMC is only as strong as the weakest pair of consecutive frames; a single frame with heavy blur or dominant foreground motion can cause the rest of the video to fail. (2) Generally, multiple planes exist in the scene, and the common assumption of a single homography accumulates residual errors into substantial ones. (3) Even if the error between consecutive frames is on a sub-pixel scale, the multiplication of many homography matrices can make the error significant over time [6]. These problems are especially severe when processing long videos and/or when the camera motion becomes complicated, e.g., when the camera pans left and right repeatedly or severe camera vibration exists, the GMC error manifests as obvious discontinuity in the background (see Fig. 1 for an example).

To address the issues of sequential GMC, we propose a temporally robust global motion compensation (TRGMC) algorithm which, by jointly aligning the input frames, estimates accurate and temporally consistent transformations to the GMCC. The result can be rendered as a motion panorama that maintains perceptual realism despite complicated camera motion (Fig. 1). TRGMC densely connects pairs of frames by matching local keypoints via keypoint descriptors. Joint alignment (a.k.a. congealing) of these frames is formulated as an optimization problem where the transformation of each frame is updated iteratively, such that for each link interconnecting a keypoint pair, the spatial coordinates of the two end points become identical. This novel keypoint-based congealing, built upon succinct keypoint coordinates instead of high-dimensional appearance features, is the core of TRGMC. Joint alignment not only leads to the temporal consistency of GMC, but also improves GMC stability by exploiting the redundancy of information. This improved stability is crucial for GMC, especially in the presence of considerable foreground motion, motion blur, non-rigid motion such as water, or a low-texture background. The joint alignment scheme also provides capabilities that cannot be naturally integrated into sequential GMC, such as coarse-to-fine alignment, i.e., alignment of the keyframes followed by the non-keyframes, and appropriate weighting of keypoint matches. Our quantitative experiments reveal that TRGMC pushes the alignment error close to human performance.

2 Prior Work

TRGMC is related to many techniques in different aspects. We first review them and then compare our work with existing GMC algorithms.

Firstly, homography estimation from keypoint matches is crucial to many vision tasks, e.g., image stitching, registration, and GMC. Its main challenge is false matches due to appearance ambiguities. Existing methods either seek robustness to outliers, such as RANSAC and its variants [13–16], reject false matches [17, 18], or probabilistically combine appearance similarities and keypoint matches [10, 19]. All these methods estimate a homography for a single frame pair. In contrast, we jointly estimate the homographies of all frames to a global coordinate, which leverages the redundant background matches over time to better handle outliers.

Image stitching (IS) and panoramic image mosaicing share similarities with GMC. IS aims to minimize the distortions and ghosting artifacts in the overlap region. Recent works focus on different challenges, e.g., multi-plane scenes [20–25], the parallax issue [26–28], and motion blur [29]. In these works, input images have much less overlap than in GMC. On the other hand, video mosaicing takes in a video that raster-scans a wide-angle static scene and produces a single static panoramic image [30–32]. When the camera path forms a 2D scan [30] or a \(360^{\circ }\) rotation [32], global refinement is performed via bundle adjustment (BA) [33], which ensures an artifact-free panoramic image. Although a byproduct of TRGMC is a similar static reconstruction of the scene, TRGMC focuses on efficiently generating an appealing video for a highly dynamic scene. While one may use BA to estimate camera pose and then the transformation between frames, our experiments reveal that BA is not reliable for videos with foreground motion and is less efficient than TRGMC. Hence, image/video mosaicing and GMC have different application scenarios and challenges.

Another related topic is the panoramic video [34–38]. For instance, Perazzi et al. [35] create a panoramic video from an array of stationary cameras by generalizing parallax-tolerant image stitching to video stitching. While these works focus on stitching multiple synchronized videos, GMC creates a motion panorama from a single non-stationary camera. Unlike GMC, video panoramas do not require the resultant video to have a stationary background.

Video stabilization (VS) is a closely related but different problem. TRGMC can be re-purposed for VS, but not vice versa, due to the accuracy requirement. Given the accurate mapping to a global coordinate using TRGMC, VS would mainly amount to cropping out a smooth sequence of frames and handling rendering issues such as parallax. Among the different categories of VS, 2D VS methods calculate consecutive warpings between frames and share similarities with sequential GMC, but in VS any estimation error does not cause severe degradation as long as it is smoothed. While TRGMC targets a long-term static background, VS mainly cares about smoothing camera motion, not removing it. In other words, TRGMC imposes a stronger constraint on the result. This strict requirement also differentiates TRGMC from Re-Cinematography [39].

Congealing aims to jointly align a stack of images from one object class, e.g., faces and letters [40–43]. Congealing iteratively updates the transformations of all images such that the entropy [40] or Sum of Squared Differences (SSD) [44] of the images is minimized. However, despite many extensions of congealing [45–49], almost all prior works define the energy based on the appearance features of two images. Our experiments on GMC show that appearance-based congealing is inefficient and sensitive to initialization and foreground motion. Therefore, we propose a novel keypoint-based congealing algorithm minimizing the SSD of corresponding keypoint coordinates. Further, most prior works apply to a spatially cropped object such as faces, while we deal with complex video frames with dynamic foreground and moving background, at a higher spatial-temporal resolution. Note that [46] uses a heuristic local-feature-based algorithm to rigidly align object-class images. In contrast, we formulate the joint alignment of keypoints as an optimization problem and solve it in a principled way.

There are a few existing sequential GMC works, where the main problem is to accurately estimate a homography transformation between consecutive frames, given challenges such as appearance ambiguities, multi-plane scenes, and a dominant foreground [3, 10, 12]. Bartoli et al. [11] first estimate an approximate 4-degree-of-freedom homography, and then refine it. Sakamoto et al. [32] generate a \(360^\circ \) panorama from an image sequence; assuming a 5-degree-of-freedom homography, all the homographies are optimized jointly to prevent error accumulation. In contrast, TRGMC employs an 8-degree-of-freedom homography. Although using a homography in the presence of considerable camera translation and large depth variation results in parallax artifacts, using a homography with more degrees of freedom than prior works allows TRGMC to better handle camera panning, zooming, and translation. Safdarnejad et al. [10] incorporate edge matching into a probabilistic framework that scores candidate homographies. Although [10, 12] improve the robustness to the foreground, error accumulation and failure in a single frame pair still deteriorate the overall performance. Thus, TRGMC targets robustness of GMC in terms of both the presence of foreground and long-term consistency, by joint alignment of frames.

Fig. 2. Flowchart of the TRGMC algorithm.

3 Proposed TRGMC Algorithm

The core of TRGMC is the novel keypoint-based congealing algorithm. Our method relies on densely interconnecting the input frames, regardless of their temporal offset, by matching the SURF keypoints detected in each frame using SURF descriptors [50]. We refer to these connections, shown in Fig. 2, as links. Frames are initialized to their approximate spatial locations by 2D translation only (Sect. 3.4). We rectify the keypoint matches such that the majority of the links have end points in the background region. Then congealing applies an appropriate transformation to each frame and the links connected to it, such that the spatial coordinates of the end points of each link are as similar as possible. In Fig. 2, this translates to making the links as parallel to the \(t-\)axis as possible.

For efficiency and robustness, TRGMC processes an input video in two stages. Stage one selects and jointly aligns a set of keyframes. The keyframes are frozen, and then stage two aligns each remaining frame to its two encompassing keyframes. The remainder of this section presents the details of the algorithm.

3.1 Formulation of Keypoint-Based Congealing

Given a stack of N frames \(\{\mathbf {I}^{(i)}\}\), with indices \(i\in \mathbb {K}=\{k_1, ..., k_N\}\), the keypoint-based congealing is formulated as an optimization problem,

$$\begin{aligned} \mathop {\text {min}}\limits _{\{ \mathbf {p}_i \}}{\epsilon = \sum _{i \in \mathbb {K}}{[\mathbf {e}_i(\mathbf {p}_i)]^{\intercal }\varOmega ^{(i)} [\mathbf {e}_i(\mathbf {p}_i)]}}, \end{aligned}$$
(1)

where \(\mathbf {p}_i\) is the transformation parameter from frame i to GMCC, \(\mathbf {e}_i(\mathbf {p}_i)\) collects the pair-wise alignment errors of frame i relative to all the other frames in the stack, and \(\varOmega ^{(i)}\) is a weight matrix.

We define the alignment error of frame i as the SSD between the spatial coordinates of the end points of all links connecting frame i to the other frames, instead of the SSD of appearance [44]. Specifically, as shown in Fig. 3, we denote the coordinates of the start and end points of each link k connecting frame i to frame \(d_k^{(i)} \in \mathbb {K}\backslash \{i\}\) as \((x_k^{(i)}, y_k^{(i)})\) and \((u_k^{(i)}, v_k^{(i)})\), respectively. For simplicity, we omit the frame index i in \(\mathbf {p}_i\). Thus, the error \(\mathbf {e}_i(\mathbf {p})\) is defined as,

$$\begin{aligned} \mathbf {e}_i(\mathbf {p}) = [\mathbf {\varDelta x}_i(\mathbf {p})^{\intercal }, \mathbf {\varDelta y}_i(\mathbf {p})^{\intercal }]^{\intercal }, \end{aligned}$$
(2)

where \(\mathbf {\varDelta x}_i(\mathbf {p}) = \mathcal {\mathbf {w}}^{(i)}_x- \mathbf {u}^{(i)}\) and \(\mathbf {\varDelta y}_i(\mathbf {p}) = \mathcal {\mathbf {w}}^{(i)}_y- \mathbf {v}^{(i)}\) are the errors along the \(x-\) and \(y-\)axes. The vectors \(\mathcal {\mathbf {w}}^{(i)}_x=[\mathcal {W}_x(x_k^{(i)}, y_k^{(i)}; \mathbf {p})]\) and \(\mathcal {\mathbf {w}}^{(i)}_y=[\mathcal {W}_y(x_k^{(i)}, y_k^{(i)}; \mathbf {p})]\) denote the \(x-\) and \(y-\)coordinates of \((x_k^{(i)}, y_k^{(i)})\) warped by the parameter \(\mathbf {p}\), respectively. The vectors \(\mathbf {u}^{(i)}=[u_k^{(i)}]\) and \(\mathbf {v}^{(i)}=[v_k^{(i)}]\) denote the coordinates of the end points, and \(\mathbf {x}^{(i)}=[x_k^{(i)}]\) and \(\mathbf {y}^{(i)}=[y_k^{(i)}]\) the coordinates of the start points. If \(N_i\) links emanate from frame i, \(\mathbf {e}_i\) is a \(2N_i-\)dim vector. \(\varOmega ^{(i)}\) is a diagonal matrix of size \(2N_i \times 2N_i\) which assigns a weight to each element of \(\mathbf {e}_i\). The parameter \(\mathbf {p}\) has 2, 6, or 8 elements for the cases of 2D translation, affine transformation, or homography, respectively. In this paper, we focus on the homography transformation, a projective warp model parameterized as,

$$\begin{aligned} \begin{bmatrix} \mathcal {W}_x(x_k^{(i)}, y_k^{(i)}; \mathbf {p}) \\ \mathcal {W}_y(x_k^{(i)}, y_k^{(i)}; \mathbf {p}) \\ 1 \end{bmatrix} = \overbrace{ \begin{bmatrix} p_1&p_2&p_3 \\ p_4&p_5&p_6 \\ p_7&p_8&1 \end{bmatrix} }^{\mathbf {p}} \begin{bmatrix} x_k^{(i)} \\ y_k^{(i)} \\ 1 \end{bmatrix}. \end{aligned}$$
(3)
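To make the notation concrete, here is a minimal NumPy sketch of the warp in Eq. (3) and the stacked link error in Eq. (2). The paper's implementation is in Matlab, so the function and variable names (`warp_points`, `xy`, `uv`) are our own; the explicit division is the projective normalization that makes the third coordinate equal to 1, as Eq. (3) assumes.

```python
import numpy as np

def warp_points(p, xy):
    """Warp the start points (x_k, y_k) of one frame by the homography p
    (Eq. 3). p is the 3x3 matrix with p[2, 2] = 1; xy is an (N, 2) array.
    Returns the warped coordinates (w_x, w_y) as two (N,) arrays."""
    homog = np.hstack([xy, np.ones((len(xy), 1))]) @ p.T   # (N, 3)
    return homog[:, 0] / homog[:, 2], homog[:, 1] / homog[:, 2]

def alignment_error(p, xy, uv):
    """Stack the x- and y-errors of all N links of one frame (Eq. 2);
    uv is the (N, 2) array of link end points, giving a 2N-dim vector."""
    wx, wy = warp_points(p, xy)
    return np.concatenate([wx - uv[:, 0], wy - uv[:, 1]])
```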
Fig. 3. The notation used in TRGMC.

Although the homography model assumes a planar scene and this assumption may be violated in the real world [27], we identify the problem of temporal robustness as more fundamental for GMC than the inaccuracies due to a single homography. Also, videos for GMC are generally swept through the scene with high overlap, so the discontinuity resulting from this assumption is minor.

3.2 Optimization Solution

Equation 1 is a non-linear optimization problem that is difficult to minimize directly. Following [44], we linearize it by taking the first-order Taylor expansion around \(\mathbf {p}\). Starting from an initial \(\mathbf {p}\), the goal is to estimate \(\varDelta \mathbf {p}\) by,

$$\begin{aligned} \mathop {\text {argmin}}\limits _{\varDelta \mathbf {p}}{[\mathbf {e}_i(\mathbf {p})+\frac{{\partial \mathbf {e}_i(\mathbf {p})}}{\partial \mathbf {p}} \varDelta \mathbf {p}]^{\intercal }\varOmega ^{(i)} [\mathbf {e}_i(\mathbf {p})+\frac{{\partial \mathbf {e}_i(\mathbf {p})}}{\partial \mathbf {p}} \varDelta \mathbf {p}]} + \gamma {\varDelta \mathbf {p}}^{\intercal }\mathbf {\mathcal {I}} \varDelta \mathbf {p}, \end{aligned}$$
(4)

where \({\varDelta \mathbf {p}}^{\intercal }\mathbf {\mathcal {I}} \varDelta \mathbf {p}\) is a regularization term, with a positive constant \(\gamma \) setting the trade-off. We observe that without this regularization, the parameter estimation may distort the frames. The indicator matrix \(\mathbf {\mathcal {I}}\) is a diagonal matrix specifying which elements of \(\varDelta \mathbf {p}\) need a constraint. We use \(\mathbf {\mathcal {I}} = diag([1, 1, 0, 1, 1, 0, 1, 1])\) to specify that there is no constraint on the translation parameters of the homography, but the rest of the parameters should remain small.

By setting the first-order derivative of Eq. 4 to zero, the solution for \(\varDelta \mathbf {p}\) is,

$$\begin{aligned} \varDelta \mathbf {p}= -\mathbf {H}^{-1}_R { \frac{{\partial \mathbf {e}_i(\mathbf {p})}^{\intercal }}{\partial \mathbf {p}} \varOmega ^{(i)} \mathbf {e}_i(\mathbf {p})}, \end{aligned}$$
(5)
$$\begin{aligned} \mathbf {H}_R = { {\frac{\partial \mathbf {e}_i(\mathbf {p})}{\partial \mathbf {p}}}^{\intercal }\varOmega ^{(i)} \frac{\partial \mathbf {e}_i(\mathbf {p})}{\partial \mathbf {p}} }+ \gamma \mathbf {\mathcal {I}}. \end{aligned}$$
(6)

Using the chain rule, we have \(\frac{{\partial \mathbf {e}_i(\mathbf {p})}}{\partial \mathbf {p}}=\frac{{\partial \mathbf {e}_i(\mathbf {p})}}{\partial \mathcal {W}} \frac{{\partial \mathcal {W}}}{\partial \mathbf {p}}\). Since the mapping has two components, \(\mathcal {W}=(\mathcal {W}_x, \mathcal {W}_y)\), and the first half of \(\mathbf {e}_i\) contains only x components and the rest only y components, we have,

$$\begin{aligned} \frac{{\partial \mathbf {e}_i(\mathbf {p})}}{\partial \mathcal {W}} = \left[ \begin{array}{ll} \mathbf {1}_{N_i }&{} \mathbf {0}_{N_i }\\ \mathbf {0}_{N_i }&{} \mathbf {1}_{N_i }\end{array} \right] , \end{aligned}$$
(7)

where \(\mathbf {1}_{N_i }\) (or \(\mathbf {0}_{N_i }\)) is an \(N_i-\)dim vector with all elements being 1 (or 0). For the homography transformation, \(\frac{{\partial \mathcal {W}}}{\partial \mathbf {p}}=\frac{\partial (\mathcal {W}_x, \mathcal {W}_y)}{\partial (p_1, p_2, p_3, p_4, p_5, p_6, p_7, p_8)}\) is given by,

$$\begin{aligned} \frac{{\partial \mathcal {W}}}{\partial \mathbf {p}} = \left[ \begin{array}{llllllll} \mathcal {\mathbf {w}}^{(i)}_x&{}\mathcal {\mathbf {w}}^{(i)}_y&{}\mathbf {1}_{N_i }&{}\mathbf {0}_{N_i }&{}\mathbf {0}_{N_i }&{}\mathbf {0}_{N_i }&{}-\mathbf {u}^{(i)}\mathcal {\mathbf {w}}^{(i)}_x&{}-\mathbf {u}^{(i)}\mathcal {\mathbf {w}}^{(i)}_y\\ \mathbf {0}_{N_i }&{}\mathbf {0}_{N_i }&{}\mathbf {0}_{N_i }&{}\mathcal {\mathbf {w}}^{(i)}_x&{}\mathcal {\mathbf {w}}^{(i)}_y&{}\mathbf {1}_{N_i }&{}-\mathbf {v}^{(i)}\mathcal {\mathbf {w}}^{(i)}_x&{}-\mathbf {v}^{(i)}\mathcal {\mathbf {w}}^{(i)}_y\end{array} \right] . \end{aligned}$$
(8)

At each iteration, and for each frame i, \(\varDelta \mathbf {p}\) is calculated and the start points of all the links emanating from frame i are updated accordingly. Similarly, for all links with end points on frame i, the end point coordinates are updated.
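For illustration, the following self-contained NumPy sketch performs this per-frame update, combining Eqs. (5)–(8); the argument names are our own convention, and the minus sign in the last line follows from setting the derivative of Eq. (4) to zero.

```python
import numpy as np

def update_step(p, xy, uv, omega, gamma):
    """One regularized Gauss-Newton update for frame i (Eqs. 4-8).
    xy, uv : (N, 2) start/end points of the N links of this frame.
    omega  : (2N,) link weights, the diagonal of Omega^(i).
    Returns the 8-dim update [dp_1, ..., dp_8]."""
    # Warp the start points by p (Eq. 3), with projective normalization.
    homog = np.hstack([xy, np.ones((len(xy), 1))]) @ p.T
    wx, wy = homog[:, 0] / homog[:, 2], homog[:, 1] / homog[:, 2]
    e = np.concatenate([wx - uv[:, 0], wy - uv[:, 1]])      # Eq. 2
    one, zero = np.ones(len(xy)), np.zeros(len(xy))
    # Jacobian of e w.r.t. (p_1, ..., p_8), stacked as in Eqs. 7-8.
    J = np.vstack([
        np.column_stack([wx, wy, one, zero, zero, zero,
                         -uv[:, 0] * wx, -uv[:, 0] * wy]),
        np.column_stack([zero, zero, zero, wx, wy, one,
                         -uv[:, 1] * wx, -uv[:, 1] * wy]),
    ])
    I_reg = np.diag([1., 1., 0., 1., 1., 0., 1., 1.])       # indicator matrix
    H = J.T @ (omega[:, None] * J) + gamma * I_reg          # Eq. 6
    return np.linalg.solve(H, -J.T @ (omega * e))           # Eq. 5
```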

We use the SURF [50] algorithm for keypoint detection with a low detection threshold, \(\tau _s = 200\), to ensure that sufficient keypoints are detected even for low-texture backgrounds. We use the nearest-neighbor ratio method [51] to match the keypoint descriptors and form links between each pair of keyframes.
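For reference, a sketch of this detection and matching step with OpenCV; it assumes `opencv-contrib-python` with the non-free SURF module enabled (`cv2.SIFT_create()` is a drop-in substitute otherwise), and the 0.7 ratio is a typical choice for the nearest-neighbor ratio test, not a value from the paper.

```python
import cv2

surf = cv2.xfeatures2d.SURF_create(hessianThreshold=200)    # low threshold tau_s

def make_links(img_a, img_b, ratio=0.7):
    """Form links between two frames: SURF keypoints matched with the
    nearest-neighbor ratio test [51]. Each link carries its start point,
    end point, and the minimum keypoint scale (used later as s_k)."""
    kp_a, des_a = surf.detectAndCompute(img_a, None)
    kp_b, des_b = surf.detectAndCompute(img_b, None)
    links = []
    for m, n in cv2.BFMatcher(cv2.NORM_L2).knnMatch(des_a, des_b, k=2):
        if m.distance < ratio * n.distance:                 # unambiguous match
            links.append((kp_a[m.queryIdx].pt,              # start point
                          kp_b[m.trainIdx].pt,              # end point
                          min(kp_a[m.queryIdx].size, kp_b[m.trainIdx].size)))
    return links
```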

Keyframe selection. We select keyframes at a constant step of \(\varDelta f\), i.e., from every \(\varDelta f\) frames, only one is selected. Based on the experimental results, as a trade-off between accuracy and efficiency, we use \(\varDelta f=10\) in TRGMC.

3.3 Weight Assignment

We have defined all parameters in the problem formulation except the weights of the links, \(\varOmega ^{(i)}\). We consider two factors in setting \(\varOmega ^{(i)}\). Firstly, keypoints detected at larger scales are more likely to come from background matches, since they cover coarser information and larger image patches. Thus, to be robust to the foreground, the early iterations should emphasize links from larger-scale keypoints, which yields a coarse-to-fine alignment. We normalize the scales of all keypoints such that the maximum is 1, and denote the minimum of the normalized scales of the two keypoints comprising link k as \(s_k\). Then, \(\varOmega ^{(i)}_{k,k}\) is set proportional to \(s_k\).

Secondly, for each frame i, links may be made either to all previous frames, denoted as the backward scheme, or to both previous and upcoming frames, denoted as the backward-forward scheme. The former suits real-time applications, whereas the latter suits offline video processing. These schemes are implemented by assigning different weights to the backward and forward links,

$$\begin{aligned} \varOmega ^{(i)}_{k,k}={\left\{ \begin{array}{ll} (\beta \cdot s_k)^{r^q}, &{} \text{ if } \ d_k^{(i)}<i \ \ \text{(backward } \text{ links) } \\ (\alpha \cdot s_k)^{r^q}, &{} \text{ if } \ d_k^{(i)}>i \ \ \text{(forward } \text{ links) }\\ \end{array}\right. } \end{aligned}$$
(9)

where \(0 \le \alpha ,\beta \le 1\), q is the iteration index, and \(0<r<1\) is the decay rate of the weights. Note that the alignment errors along the x and \(y-\)axes have the same weights, i.e., \(\varOmega ^{(i)}_{k+N_i,k+N_i}= \varOmega ^{(i)}_{k,k}\). Since \(r^q \rightarrow 0\) as q grows, after a few iterations the weights of all the links are restored to 1. In the backward scheme, we set \(\alpha = 0\).
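A minimal sketch of this weighting rule (Eq. 9); the array layout and the `is_forward` mask are our own conventions.

```python
import numpy as np

def link_weights(scales, is_forward, q, alpha=1.0, beta=1.0, r=0.7):
    """Diagonal of Omega^(i) at iteration q (Eq. 9). `scales` holds the
    normalized minimum keypoint scale s_k of each link (max = 1), and
    `is_forward` marks links with d_k^(i) > i. With alpha = 0 this gives
    the backward scheme; as q grows, r**q -> 0 and every positive
    weight approaches 1."""
    base = np.where(is_forward, alpha * scales, beta * scales)
    w = base ** (r ** q)
    return np.concatenate([w, w])   # same weight for x- and y-errors
```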

3.4 Initialization

Initialization speeds up the alignment and decreases false keypoint matches. The objective is to roughly place each frame at the appropriate coordinates in the GMCC. For initialization, we align the frames based only on a rough estimate of translation, without considering rotation, skew, or scale. We use the average of the motion vectors from matching two consecutive frames as the translation. With this simple initialization, even if the camera has in-plane rotation, the estimated 2D translations are zero, which is indeed correct and does not cause any problem for TRGMC. Given the estimated translations, the approximate overlap area of each pair of frames is calculated, and only the keypoints inside the overlap area are matched, reducing the number of false matches due to appearance ambiguities.
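A sketch of this translation-only initialization, assuming `pairwise_matches[t]` holds the (start, end) coordinate pairs matched between consecutive frames t and t+1 (our convention); frames are placed by accumulating the per-pair average motion vectors.

```python
import numpy as np

def init_translations(pairwise_matches):
    """Translation-only initialization (Sect. 3.4): frame t is offset by
    the cumulative sum of the average motion vectors between frames
    0..t. Frame 0 is the reference at the origin."""
    steps = []
    for matches in pairwise_matches:
        starts = np.array([s for s, e in matches], dtype=float)
        ends = np.array([e for s, e in matches], dtype=float)
        steps.append((ends - starts).mean(axis=0))   # average motion vector
    return np.vstack([np.zeros(2), np.cumsum(steps, axis=0)])   # (T, 2)
```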

3.5 Outlier Handling

Links may be outliers for two reasons: (i) the keypoints reside on foreground objects whose motion is not consistent with the camera motion; (ii) false links between different physical locations are caused by the low detection threshold and similar appearances.

To prune the outliers, we assume that the motion vectors of background matches, i.e., background links, have consistent and smooth patterns caused by camera motion such as pan, zoom, and tilt, whereas outlier links exhibit arbitrary patterns inconsistent with the background pattern. Specifically, we use the method of Ma et al. [17] to prune outlier links by imposing a smoothness constraint on the motion vector field. This method outperforms RANSAC when the set of keypoint matches contains a large proportion of outliers. Since keyframes have a larger relative time difference than consecutive frames, the foreground motion is accentuated and more distinguishable from the camera motion, which helps prune the foreground links. We perform this pruning whenever the keypoints of a pair of frames are matched to form links.
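The method of [17] fits a smooth vector field to the matches; as a clearly simplified stand-in, the sketch below only enforces global consistency of the motion vectors via a median/MAD test. It behaves similarly for pure panning but is weaker under zoom or rotation, so it should be read as an illustration of the idea, not the actual pruning of [17].

```python
import numpy as np

def prune_links(starts, ends, tol=3.0):
    """Simplified outlier pruning: keep links whose motion vector stays
    within `tol` median absolute deviations of the median motion.
    starts, ends are (N, 2); returns a boolean inlier mask."""
    mv = ends - starts                                # (N, 2) motion vectors
    med = np.median(mv, axis=0)
    mad = np.median(np.abs(mv - med), axis=0) + 1e-9  # avoid divide-by-zero
    return np.all(np.abs(mv - med) <= tol * mad, axis=1)
```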

Congealing of an image stack also increases the proportion of background matches over outliers, which is another way of suppressing them. Keypoints on the background are more likely to form long-range matches than foreground ones, due to the non-rigid foreground motion. Hence, when \(\left( {\begin{array}{c}N\\ 2\end{array}}\right) \) combinatorial pairs of frames are interconnected, there are many more background matches (Fig. 4).

Fig. 4. Comparison of the ratios of matches for (a) sequential GMC and (b) TRGMC.

Fig. 5. (a) The input frame, (b) the reliability map, with the red color showing higher reliability. (Color figure online)

3.6 Alignment of Non-keyframes

The keyframe alignment provides a set of temporally consistent motion-compensated frames, which are the basis for aligning the non-keyframes. We refer to keyframes and non-keyframes with superscripts i and j, respectively. For a non-keyframe j between keyframes \(k_{i}\) and \(k_{i+1}\), its alignment is a special case of Eq. 1, with indices \(\mathbb {K}=\{j\}\) and link destinations \(d_k^{(j)} \in \{ k_i, k_{i+1} \}\); i.e., only \(\mathbf {p}_j\) of frame j is updated while the keyframes remain fixed. Each non-keyframe between keyframes \(k_{i}\) and \(k_{i+1}\) is aligned independently.

However, given the small time offset between j and \(d_k^{(j)}\), the observed foreground motion may be hard to discern. Also, frame j is linked to only two keyframes, so there is no redundancy of background information to improve robustness to foreground motion. Instead, we handle outliers by assigning higher weights to links that are more likely to be connected to the background.

For each keyframe i, we quantify how well the links emanating from frame i are aligned with the other keyframes. If the alignment error is small, i.e., \(\epsilon _k^{(i)} = \big | \mathcal {W}_x(x_k^{(i)}, y_k^{(i)}; \mathbf {p}) - u_k^{(i)} \big | + \big | \mathcal {W}_y(x_k^{(i)}, y_k^{(i)}; \mathbf {p}) - v_k^{(i)} \big | < \tau \), link k is more likely on the background of frame i and thus more reliable for aligning non-keyframes. We create a reliability map for each keyframe i, denoted \(\mathbf {R}^{(i)}\) (Fig. 5). For each link k with \(\epsilon _k^{(i)} < \tau \), a Gaussian function with \(\mu _k = (x_k^{(i)}, y_k^{(i)})\) and \(\sigma _k=c s_k\) is superposed on \(\mathbf {R}^{(i)}\), where the constant c is set to 20. We define,

$$\begin{aligned} \mathbf {R}^{(i)}_{m,n}= \Bigg \lceil \bigg \lfloor \sum _{k \in \mathbb {B}_i}e^ {- \frac{\big ( m-x_k^{(i)} \big )^2 + \big ( n-y_k^{(i)} \big )^2}{2\sigma _k ^ 2}} \bigg \rfloor _1 \Bigg \rceil _\eta , \end{aligned}$$
(10)

where \(\mathbb {B}_i = \{k \, | \, \epsilon _k^{(i)} < \tau \}\), \(\eta > 0\) is a small constant (set to 0.1), \(\lceil x \rceil _{\eta }=\text{ max }(x,\eta )\), and \(\lfloor x \rfloor _1=\text{ min }(x,1)\). We then assign the weight of link k, connecting frame j to keyframe \(d_k^{(j)}\) at coordinate \((u_k^{(j)} , v_k^{(j)})\), as the reliability of that keyframe at the end point, \(\varOmega ^{(j)}_{k,k} = \big (\mathbf {R}^{(a)}_{u_k^{(j)}, v_k^{(j)}} \big )^{r^q}\), where \(a=d_k^{(j)}\).
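For concreteness, a NumPy sketch of Eq. (10); it renders every Gaussian densely over the full frame grid, which is simple but not the most efficient implementation, and all argument names are our own.

```python
import numpy as np

def reliability_map(h, w, pts, scales, errs, tau=1.0, c=20.0, eta=0.1):
    """Reliability map R^(i) of a keyframe (Eq. 10): a superposition of
    Gaussians at the start points of well-aligned links (err < tau),
    with sigma_k = c * s_k, clipped to the range [eta, 1]."""
    R = np.zeros((h, w))
    yy, xx = np.mgrid[0:h, 0:w]
    for (x, y), s, err in zip(pts, scales, errs):
        if err < tau:                                 # link k is in B_i
            sigma = c * s
            R += np.exp(-((xx - x)**2 + (yy - y)**2) / (2 * sigma**2))
    return np.clip(R, eta, 1.0)   # floor eta, ceiling 1, as in Eq. 10
```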

We summarize the TRGMC algorithm in Algorithm 1.


4 Experimental Results and Applications

We now present qualitative and quantitative results of the TRGMC algorithm and discuss how different computer vision applications will benefit from TRGMC.

4.1 Experiments and Results

Baselines and details. We choose three sequential GMC algorithms as baselines for comparison: MLESAC [15] and HEASK [19], both based on our own implementations, and RGMC [10], based on the authors' Matlab code available online. TRGMC is implemented in Matlab and is available for download. For video frames of \(w\times h\) pixels, we set the parameters as \(\gamma = 0.1 w h\), \(T_1=300\), \(\tau _1=5 \times 10^{-4}\), \(T_2=50\), \(\tau _2=10^{-4}\), \(r=0.7\), \(\tau = 1\), \(\varDelta f=10\), and \(\beta =1\). For the backward-forward scheme we set \(\alpha =1\), and for the backward scheme \(\alpha =0\).

Datasets and metric. We form a dataset composed of 40 challenging videos from SVW [52] and 15 videos from UCF101 [53], termed the “quantitative dataset”. SVW is an extremely unconstrained dataset of amateurs practicing sports, captured by amateurs via smartphone. In addition, we form a “qualitative dataset” of 200 unlabeled videos from SVW, in the challenging categories of boxing, diving, and hockey.

To compare GMC over different temporal distances between frames, for each video of length M frames in the quantitative dataset, we manually align all 10 possible pairs from the 5-frame set \(\mathbb {F}=\{ 1, 0.25M, 0.5M, 0.75M, M\}\), as long as they overlap, and specify the background regions. For this, a GUI is developed that lets a labeler match 4 points on each frame pair and fine-tune them to half-pixel accuracy until the background difference is minimized. Then the labeler selects the foreground regions, which subsequently identify the background region. Similar to [10], we quantify the consistency of two warped frames \(\mathbf {I}^{(i)}(\mathbf {p}_i)\) and \(\mathbf {I}^{(j)}(\mathbf {p}_j)\) (grayscale pixels in [0, 1]) via the background region error (BRE),

$$\begin{aligned} \text{ BRE }(i,j) = \frac{1}{\Vert \mathbf {M_B}\Vert _1} \big \Vert \, \big | \mathbf {I}^{(i)}(\mathbf {p}_i)-\mathbf {I}^{(j)}(\mathbf {p}_j) \big | \odot \mathbf {M_B} \, \big \Vert _1, \end{aligned}$$
(11)

where \(\odot \) is element-wise multiplication and \(\mathbf {M_B}\) is the background mask for the intersection of two warped frames.
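The BRE of Eq. (11) is straightforward to compute; a sketch, assuming both frames have already been warped onto a common canvas and `mask_bg` marks the background of their overlap:

```python
import numpy as np

def bre(warped_i, warped_j, mask_bg):
    """Background region error (Eq. 11). Both frames are grayscale in
    [0, 1]; mask_bg is 1 on the background pixels of the intersection
    of the two warped frames and 0 elsewhere."""
    return (np.abs(warped_i - warped_j) * mask_bg).sum() / mask_bg.sum()
```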

Table 1. Comparison of GMC algorithms on the quantitative dataset (*GT: ground truth, BF: backward-forward, B: backward).
Fig. 6. Average BRE of frame pairs versus the time difference between the two frames.

Quantitative evaluation. The average BRE over all temporal frame pairs is shown in Table 1. TRGMC outperforms all baseline methods by a considerable margin. The backward-forward (BF) scheme has slightly better accuracy than the backward (B) scheme, and is also more stable based on our visual observation. Thus, we use BF as the default scheme for TRGMC.

To illustrate how the accumulation of errors over time affects the final error, Fig. 6 summarizes the average error versus the time difference between the frames in \(\mathbb {F}\). The TRGMC error is almost constant over a wide range of temporal distances; thus, even if a frame is not aligned accurately, the error does not propagate to all subsequent frames. In sequential GMC, however, the error increases as the time difference increases.

Qualitative evaluation. While the quantitative results are comprehensive, the number of videos is limited by the labeling cost. Thus, we further compare TRGMC and the best-performing baseline, RGMC, on the larger qualitative dataset. The resultant motion panoramas were visually inspected and categorized into three cases: good, shaking, and failed (i.e., considerable background discontinuity). The comparison in Table 2 again shows the superiority of TRGMC.

Fig. 7. Top view of the frames and links (a) before and (b) after TRGMC. The parallel links in (b) show successful spatial alignment of keypoints. For better visibility, we show up to 15 links emanating per frame. Average of frames (c) before and (d) after TRGMC.

Fig. 8. Composite image formed by overlaying frame n on frame 1 for several videos after TRGMC. Left to right, n is equal to 144, 489, 912, 93, respectively. In the overlap region the difference between the frames is shown.

Figure 7 shows the links of a sample video processed by TRGMC, and the average frames before and after processing. The initialization module is disabled when generating this figure, to better illustrate how well the spatial coordinates of the keypoints are aligned, resulting in links parallel to the \(t-\)axis. Figure 8 shows a composite image formed by overlaying the last frame (or a far-apart frame with enough overlap) on frame 1 for several videos after TRGMC. In the overlap region, the difference between the two frames is shown, demonstrating how well the background regions match for frames with large temporal distances.

Computational efficiency. Table 1 also presents the average time for processing one frame for each method, on a PC with an Intel i5-3470@3.2 GHz CPU and 8 GB RAM. While obtaining considerably better accuracy than HEASK or RGMC, TRGMC is on average 15 times faster than HEASK and 7 times faster than RGMC. MLESAC is \(\sim 3\) times faster than TRGMC, but with twice the error. For TRGMC, the backward scheme is \(50\,\%\) faster than backward-forward, since it has approximately half the links of BF.

Accuracy vs. efficiency trade-off. Figure 9 presents the error and efficiency results for a set of 5 videos versus the keyframe selection step, \(\varDelta f\). For this set, the ground-truth error is 0.049. As a sweet spot in the error and efficiency trade-off, we use \(\varDelta f=10\) for TRGMC. This figure also justifies the two-stage processing scheme of TRGMC, as processing frames at a low selection step \(\varDelta f\) is costly in terms of efficiency but only improves the accuracy slightly.

Fig. 9. Error and efficiency vs. the keyframe selection step, \(\varDelta f\).

Table 2. Comparison of GMC algorithms on the qualitative dataset.

4.2 TRGMC Applications

Motion panorama. By sequentially reading input frames, applying the transformations found by TRGMC, and overlaying the warped frames on a sufficiently large canvas, a motion panorama is generated. Furthermore, it is possible to first reconstruct the background from the warped frames (as discussed below) and overlay the frames on it, to create a more appealing panorama. Figure 10 shows a few exemplar panoramas and the camera motion trajectories.

Fig. 10. Temporal overlay of frames from different videos processed by TRGMC. The trajectory of the center of the image plane over time is overlaid on each plot to show the camera motion pattern, where color changes from blue to red with the progression of time. (Color figure online)

Background reconstruction. Background reconstruction is important for removing occlusions or detecting the foreground [6]. To reconstruct the background, a weighted-average scheme weights each frame by its reliability map, \(\mathbf {R}^{(i)}\), which assigns higher weights to the background. Since the minimum value of \(\mathbf {R}^{(i)}\) is a positive constant \(\eta \), if no reliable keyframe exists at a coordinate, all frames receive equal weights there. Specifically, the background is reconstructed by \(\mathbf {B} = \frac{\sum _{i \in \mathbb {K}} \mathbf {R}^{(i)}(\mathbf {p}_i) \mathbf {I}^{(i)}(\mathbf {p}_i)}{\sum _{i \in \mathbb {K}} \mathbf {R}^{(i)}(\mathbf {p}_i)}\), where \(\mathbf {R}^{(i)}(\mathbf {p}_i)\) and \(\mathbf {I}^{(i)}(\mathbf {p}_i)\) are the reliability map and the input frame warped using the transformation \(\mathbf {p}_i\). Using this scheme, the reconstructed background in Fig. 11 is sharper and less impacted by the foreground.
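A sketch of this weighted average, assuming each reliability map and frame has already been warped onto the common canvas and that a warped map is zero outside its frame's footprint (the epsilon guard is ours, covering canvas pixels no frame reaches):

```python
import numpy as np

def reconstruct_background(warped_frames, warped_maps):
    """Weighted-average background B: each warped frame I^(i)(p_i) is
    weighted by its warped reliability map R^(i)(p_i)."""
    num = sum(R * I for R, I in zip(warped_maps, warped_frames))
    den = sum(warped_maps)
    return num / np.maximum(den, 1e-12)   # avoid division by zero
```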

Foreground segmentation. The reliable background reconstruction \(\mathbf {B}\), along with the GMC result of frame \(\mathbf {I}^{(i)}\), i.e., \(\mathbf {I}^{(i)}(\mathbf {p}_i)\), can easily be used to segment the foreground by thresholding the difference \(| \mathbf {B} - \mathbf {I}^{(i)}(\mathbf {p}_i) |\) (Fig. 12).
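Correspondingly, a one-line sketch; the threshold of 0.1 (for intensities in [0, 1]) is our assumption, not a value from the paper:

```python
import numpy as np

def segment_foreground(background, warped_frame, thresh=0.1):
    # Foreground where the warped frame deviates from the reconstructed
    # background; `thresh` is a hypothetical value to tune per video.
    return np.abs(background - warped_frame) > thresh
```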

Fig. 11. Background reconstruction results. Compare the left image with Fig. 10, middle image with Fig. 7, and right image with Fig. 8.

Human action recognition. State-of-the-art human action recognition relies heavily on the analysis of human motion. GMC helps suppress camera motion and magnify human motion, making motion analysis more feasible, as clearly shown by the dense trajectories [4] in Fig. 13.

Fig. 12. Segmented foreground overlaid on the input.

Fig. 13. Dense trajectories of the (a) original video, and (b) TRGMC-processed video.

Multi-object tracking (MOT). When appearance cues for tracking are ambiguous, e.g., tracking players in team sports like football, motion cues gain extra significance [54, 55]. MOT comprises two tasks, data association (assigning each detection a label) and trajectory estimation, both highly affected by camera motion. TRGMC can be applied to remove camera motion and thus revive the power of tracking algorithms that rely on motion cues. To verify the impact of TRGMC, we manually label the locations of all players in 566 frames of a football video and use these ground-truth detections to study how MOT using [56] benefits from TRGMC. Figure 14 compares the trajectories of players over time with and without TRGMC. Comparing the numbers of label switches qualitatively demonstrates the improvement of a challenging MOT scenario using TRGMC. Also, the Multi-Object Tracking Accuracy [57] is \(63.79\,\%\) for the original video and \(84.23\,\%\) for the video processed by TRGMC.

Fig. 14. Multi-player tracking using [56] for a football video with the camera panning to the right, before (left) and after processing by TRGMC (right).

5 Conclusions and Discussions

We proposed a temporally robust global motion compensation (TRGMC) algorithm based on joint alignment (congealing) of frames, in contrast to the common sequential scheme. Despite complicated camera motions, TRGMC can remove intentional camera motion, such as pan, as well as unwanted motion due to vibration of handheld cameras. Experiments demonstrate that TRGMC outperforms existing GMC methods and showcase several applications of TRGMC.

The enabling assumption of TRGMC is that the camera motion along the optical axis is negligible. For instance, TRGMC will not work properly on a video from the wearable camera of a pedestrian, since in the global coordinate the upcoming frames grow in size and cause computational and rendering problems. The best results are achieved if the optical center of the camera has negligible movement, making a homography-based approximation of camera motion appropriate. However, if the optical center moves in a direction perpendicular to the optical axis (e.g., a camera following a swimmer), TRGMC still works well, but the results are visually degraded by the parallax effect.