
1 Introduction

Ranging from high-altitude Unmanned Aerial Vehicles (UAVs) capable of flying at 65,000 ft to low-altitude miniature drones, and from long-endurance variants to micro air vehicles weighing just a few grams, the UAV industry has gone through a meteoric rise. Owing to their ever-increasing availability in civilian and military sectors alike, UAV variants have been disruptive over the last decade and have consequently found use in several applications, such as disaster relief, precision agriculture, cinematography, cargo delivery, industrial inspection, mapping, military surveillance and air support [1].

Following this industrial attention, the academic community has also contributed to the transformation of UAVs in various aspects, such as aerodynamics, avionics and the processing of the various sensory data acquired by these platforms. Slightly different from the remote sensing domain, drone-mounted imagery has paved the way for new research in computer vision (CV). A large number of studies have been reported on object detection [2,3,4,5,6], action detection [7], visual object tracking [8,9,10], object counting [11] and road extraction [12]. In recent years, new datasets [7, 13,14,15,16,17], challenges and dedicated workshops [18, 19] have surfaced to bridge the gap between drone-specific vision problems and their generic versions.

From a practical perspective, low-altitude drones introduce several new problems for CV algorithms. Proneness to sudden platform movements and exposure to environmental conditions arguably affect low-altitude drones more than their high-altitude counterparts. Moreover, fast-changing operating altitudes and camera viewpoints result in highly diverse data, which inherently increases the complexity of virtually any vision problem. Their small size also imposes severe limits on the computational resources available on-board, which calls for non-trivial engineering solutions [20, 21].

Moving object detection (MOD), primarily used for surveillance purposes, is a long-standing problem in CV and has been the subject of many studies [22,23,24]. The presence of platform motion in drone vision makes it a notoriously hard problem, as platform motion can easily be confused with moving regions/objects. Several solutions addressing the platform motion issue have been reported [25, 26]. Moreover, low-altitude drone footage also suffers from severe motion parallax, which causes objects closer to the camera to appear to move faster than objects further away. Existing solutions to the motion parallax issue are considered computationally expensive [17, 27,28,29], which makes the problem even harder when on-board processing with (near) real-time performance is a hard constraint.

In this paper, we propose a new approach for moving object detection, primarily optimized for embedded resources and on-board operation. We make two main contributions: first, we show that performing a large portion of our pipeline at lower resolutions significantly improves the runtime performance while keeping accuracy high. Second, we design the matching part of the parallax handling scheme using a simple sparse-flow-based technique, which avoids bottlenecks such as failing to extract features from candidate objects or inferior feature matching. Its sparse nature also contributes to further speed-ups, pushing closer to real-time performance on embedded platforms.

The paper is organized as follows. In Sect. 2, related work in the literature is reviewed. The proposed approach is explained thoroughly in Sect. 3. Experimental results and their analysis are reported in Sect. 4. We conclude our work by drawing insights and making future recommendations in Sect. 5.

2 Related Work

The research community has contributed considerably to the moving object detection literature over the last few decades. Earlier studies aimed to solve the problem for static cameras, where background subtraction [22] and temporal differencing [30] based solutions slowly transformed into more sophisticated approaches such as background learning via Mixtures of Gaussians, eigen-backgrounds and motion layers [31, 32]. As mobile platforms started to emerge, a new layer of complexity was introduced: ego-motion. The presence of ego-motion renders the approaches devised for static cameras obsolete, as the platform motion is likely to produce quite a few false positives. This problem becomes even more pronounced when the platform motion is sudden.

A simple way to tackle platform-motion-induced false positives is to perform image alignment as a preprocessing step. By finding the affine/perspective transformation between two consecutive images, one can warp one image onto the other and then perform temporal differencing. Commonly named "feature-based" methods, such approaches depend on accurate image alignment, for which accurate feature keypoint/descriptor computation is imperative [33]. Another family of approaches, referred to as "motion-based", exploits motion layers [32] and optical flow [26]. In cases where the planar surface assumption (if any) does not hold, perspective-transformation-based warping fails to handle motion-parallax-induced false positives. Unlike high-altitude scenarios, motion parallax becomes a severe problem in imagery taken from the ground as well as in low-altitude UAV imagery. Several studies in the literature use various geometric constraints and flow-based solutions that claim to mitigate the effects of motion parallax [27, 34].

Building on the simple solutions outlined above, several high-impact studies have been reported in recent years. Based on their previous study [34], the authors of [35] propose a new method built on the projective structure between consecutive image planes, used in conjunction with the epipolar constraint. This new constraint is useful for detecting moving objects that move in the same direction as the camera, a configuration the epipolar constraint alone fails to detect. Assessed on airborne videos, the authors state that abrupt motion or medium-level parallax might be detrimental to the efficacy of their algorithm. The authors of [36] tackle moving object detection for ground robots, using the epipolar constraint along with a motion estimation mechanism in a Bayesian framework to handle degenerate cases (camera and object moving in the same direction). The work reported in [27] handles moving object detection by using epipolar and flow-vector-bound constraints, which facilitate parallax handling as well as degenerate cases; the authors estimate the camera pose using the Parallel Tracking and Mapping technique. Similar methods have been reported in [37] and [17]; both algorithms target low-altitude imagery, but the latter handles parallax in an optimized manner.

In addition to the feature-based methods mentioned above, motion-based methods have also emerged. In [28], the authors fuse sensory data with imagery to facilitate moving object detection in the presence of ego-motion and motion parallax. Using optical flow in conjunction with the epipolar constraint, they show that parallax effects can be eliminated in videos taken from ground vehicles. In the work reported in [38], the authors use a dense-flow-based method in which optical flow and artificial flow are compared in orientation and magnitude to find moving objects in aerial imagery. Another flow-based study is [39], where the authors combine optical flow with a reduced Singular Value Decomposition and image inpainting stages to handle parallax and ego-motion; they present results on sequences taken from aerial and ground vehicles. In [40], the authors use artificial flow and background subtraction together, formulating two scores: an anomaly score that facilitates good precision and a motion score that helps achieve improved recall.

3 Our Approach

In this work, we propose a hybrid moving object detection pipeline which fuses feature-based and optical-flow-based approaches in an efficient manner for near real-time performance. In addition, we propose several minor improvements throughout the pipeline to increase processing speed as well as detection accuracy. Our proposed pipeline is shown in Fig. 1. It is based on well-studied ego-motion compensation and plane-parallax decomposition approaches [17, 28, 34, 35, 41] and is divided into separate processing stages for ease of understanding.

Fig. 1.

Our proposed moving object detection pipeline. Red boxes represent the steps we build upon from other baselines. Green boxes represent the steps that can be applied where IMU, barometric sensor and camera calibration parameters are available. \(F_o\) represents the frame at original resolution and \(H_u\) represents the upscaled homography. (Color figure online)

Fig. 2.

Dynamic frame buffer. \(\varDelta \) changes depending on the required sensitivity.

3.1 Preprocessing and Ego-Motion Compensation

One of the most challenging aspects of moving object detection from a drone is detecting objects of varying sizes from varying altitudes. In a background subtraction and ego-motion compensation based system such as ours, the easiest way to cope with this variation is to vary the time difference between the frames that are compared. Thus, as the very first stage of our pipeline, we implement a dynamic frame buffer that changes its size according to the altitude measurements read (when available) from the barometric sensor and the speed measurements read from the IMU (Inertial Measurement Unit), as well as the user's desired detection sensitivity. The size of the buffer, and thus the time \(\varDelta \) between the frames to be processed, increases as the required sensitivity to smaller objects (and/or smaller movements) increases. If the camera is known and calibration is possible, we also correct the lens distortion (radial and tangential) before pushing the frames into the buffer.
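The buffer can be sketched as a simple ring buffer that exposes the triplet \((t, t-\varDelta , t-2\varDelta )\). The following Python sketch is illustrative only; the class name, parameters and the heuristic inside compute_delta are our own assumptions, not values from the paper.

```python
from collections import deque


class DynamicFrameBuffer:
    """Ring buffer exposing the frames t, t-delta and t-2*delta."""

    def __init__(self, max_delta=30):
        self.max_delta = max_delta
        self.frames = deque(maxlen=2 * max_delta + 1)

    def compute_delta(self, altitude_m, speed_mps, sensitivity):
        # Illustrative heuristic: higher altitude and higher sensitivity call
        # for a larger gap so that small/slow objects accumulate displacement.
        delta = int(sensitivity * (1.0 + altitude_m / 50.0) / max(speed_mps, 1.0))
        return max(1, min(delta, self.max_delta))

    def push(self, frame):
        self.frames.append(frame)

    def triplet(self, delta):
        # Returns (current, center, oldest) = (t, t-delta, t-2*delta), if available.
        if len(self.frames) <= 2 * delta:
            return None
        return self.frames[-1], self.frames[-1 - delta], self.frames[-1 - 2 * delta]
```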

As in the majority of computer vision systems, feature extraction and matching take up a significant portion of our pipeline's runtime and form its bottleneck. Additionally, we argue that calculating the homography between frames at high resolution is not worth the loss in runtime. Therefore, we downscale the input images for feature extraction and matching (using SURF [42]), and then calculate the homographies between frames t and \(t-\varDelta \), and between frames \(t-\varDelta \) and \(t-2\varDelta \). However, to detect smaller objects, the rest of the pipeline runs at the original resolution. To achieve this, the homographies calculated at the lower resolution, \(H_d\), are used to estimate the original-resolution homographies \(H_u\) using Eq. 1.

$$\begin{aligned} H_u = H_d * P_{do} \end{aligned}$$
(1)

where \(P_{do}\) is the perspective transformation between the downscaled image and original image.
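The sketch below illustrates this step. It is not the paper's implementation: ORB is used as a freely available stand-in for SURF (which requires opencv-contrib), and the lift to the original resolution is written as a conjugation with the scale matrix S, which is one concrete way to realize the \(P_{do}\) mapping of Eq. 1 when the downscaling is a pure isotropic rescale.

```python
import cv2
import numpy as np


def lowres_homography(frame_a, frame_b, scale=0.5):
    """Estimate H between two frames on downscaled copies, then lift it
    back to the original resolution (sketch of Eq. 1)."""
    small_a = cv2.resize(frame_a, None, fx=scale, fy=scale)
    small_b = cv2.resize(frame_b, None, fx=scale, fy=scale)

    orb = cv2.ORB_create(nfeatures=1000)          # stand-in for SURF
    kp_a, des_a = orb.detectAndCompute(small_a, None)
    kp_b, des_b = orb.detectAndCompute(small_b, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_a, des_b)

    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches])
    H_d, _ = cv2.findHomography(pts_a, pts_b, cv2.RANSAC, 3.0)

    # Lift: map original coords down, apply H_d, map back up.
    S = np.diag([scale, scale, 1.0])
    H_u = np.linalg.inv(S) @ H_d @ S
    return H_u
```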

3.2 Moving Object Detection

The calculated upscaled homographies (\(H_u\)) are used for perspective warping (of the original-resolution image \(F_o\)) and three-frame differencing. As can be seen in Fig. 2, the current and previous frames are warped onto the center frame separately, and two separate two-frame differences are calculated. Each two-frame difference is then thresholded with an empirical value, which produces a binary image. Morphological operations are used to suppress noise and to associate pixels belonging to the same object. The two binary difference images (after thresholding and morphological operations) are combined with a logical AND operation to realize three-frame differencing. The resulting three-frame difference is then subjected to connected component analysis to produce the object bounding boxes.
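A compact sketch of this step is given below, assuming grayscale frames; the threshold value and the morphology kernel size are illustrative, not values reported here, and H_cur/H_prev are the upscaled homographies that warp the current and previous frames onto the center frame.

```python
import cv2


def three_frame_difference(cur, center, prev, H_cur, H_prev, thresh=25):
    """Three-frame differencing after warping onto the center frame.
    All frames are assumed to be single-channel (grayscale) uint8 images."""
    h, w = center.shape[:2]
    warped_cur = cv2.warpPerspective(cur, H_cur, (w, h))
    warped_prev = cv2.warpPerspective(prev, H_prev, (w, h))

    _, bin_a = cv2.threshold(cv2.absdiff(warped_cur, center), thresh, 255,
                             cv2.THRESH_BINARY)
    _, bin_b = cv2.threshold(cv2.absdiff(warped_prev, center), thresh, 255,
                             cv2.THRESH_BINARY)

    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    bin_a = cv2.morphologyEx(bin_a, cv2.MORPH_OPEN, kernel)
    bin_b = cv2.morphologyEx(bin_b, cv2.MORPH_OPEN, kernel)

    motion = cv2.bitwise_and(bin_a, bin_b)        # three-frame difference
    n, _, stats, _ = cv2.connectedComponentsWithStats(motion)
    # stats rows are [x, y, w, h, area]; label 0 is the background.
    return [tuple(stats[i][:4]) for i in range(1, n)]
```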

3.3 Parallax Filtering

Especially for mini UAVs, which typically operate below 150 m, parallax can be a significant problem. Without a dedicated algorithm, there can be many false positives due to trees, buildings, etc. In the literature, geometric constraints have proven to be an effective solution for eliminating parallax regions [17, 27, 28, 35]. In these studies, either features extracted on candidate moving objects are tracked/matched [17, 27] or each candidate pixel is densely tracked/matched [28, 35] so that geometric constraints can be applied. Instead, we propose a fast and efficient hybrid method that tracks only the center locations of the candidate objects using sparse optical flow (via [43]). As can be seen from Table 1, this method provides a significant performance improvement over feature-tracking-based methods. After tracking only the center locations of the candidate objects, we apply the epipolar constraint on the tracked locations. As can be seen in Figs. 3 and 4, the benefits of tracking only object centers are twofold: epipolar constraint calculations are significantly reduced, and the requirement of having keypoints/features on a candidate object is removed.
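Tracking only the box centers is straightforward with pyramidal Lucas-Kanade flow, as in the sketch below (illustrative only; the window size and pyramid depth are assumptions, and boxes are the (x, y, w, h) candidates produced by three-frame differencing in the frame prev_gray).

```python
import cv2
import numpy as np


def track_object_centers(prev_gray, cur_gray, boxes):
    """Track only the center of each candidate box with sparse LK flow.
    Returns matched (previous, current) center pairs and the kept boxes."""
    if not boxes:
        return np.empty((0, 2), np.float32), np.empty((0, 2), np.float32), []

    centers = np.float32([[x + w / 2.0, y + h / 2.0]
                          for x, y, w, h in boxes]).reshape(-1, 1, 2)
    tracked, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, cur_gray, centers, None, winSize=(21, 21), maxLevel=3)

    ok = status.ravel() == 1                      # keep successfully tracked centers
    kept_boxes = [b for b, k in zip(boxes, ok) if k]
    return centers[ok].reshape(-1, 2), tracked[ok].reshape(-1, 2), kept_boxes
```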

In order to understand the epipolar constraint [44], assume that \(I_{t-\varDelta }\) and \(I_{t}\) denote two images of a scene (taken by the same camera at different positions in space) at times \(t-\varDelta \) and t, and P denote a 3D point in the scene. In addition, let \(p_{t-\varDelta }\) be the projection of P on \(I_{t-\varDelta }\), and \(p_{t}\) be the projection of P on \(I_{t}\).

In light of these, a unique fundamental matrix, represented by \(F_{t}^{t-\varDelta }\), that relates images \(I_t\) to \(I_{t-\varDelta }\) can be found, which satisfies

$$\begin{aligned} {p_{t}^{i}}^T F_{t}^{t-\varDelta } p_{t-\varDelta }^{i} = 0, \end{aligned}$$
(2)

for all corresponding points \(p_{t-\varDelta }^{i}\) and \(p_{t}^{i}\), where i indexes the corresponding image points. The epipolar lines associated with these points are given by

$$\begin{aligned} el_{t}&= F_{t}^{t-\varDelta } p_{t-\varDelta }^{i}, \end{aligned}$$
(3)
$$\begin{aligned} el_{t-\varDelta }&= F_{t-\varDelta }^{t} p_{t}^{i} \end{aligned}$$
(4)

where \(el_{t-\varDelta }\) and \(el_{t}\) are the epipolar lines corresponding to \(p_{t}\) and \(p_{t-\varDelta }\), respectively. If P is a static 3D point, \(p_t\) should lie on the epiline \(el_t\) (see Fig. 5a). Otherwise, P does not satisfy the epipolar constraint (see Fig. 5b). One exceptional case can occasionally arise, where the point of interest moves along the epilines themselves. This occurs when the camera and the point of interest move in the same direction (i.e. the degenerate case).
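In practice, the constraint is evaluated as a point-to-epiline distance for every tracked object center, as in the sketch below (illustrative only: F is assumed to be estimated elsewhere, e.g. from the background feature matches or from Eq. 5, and the pixel distance threshold is an assumption).

```python
import numpy as np


def filter_parallax(F, prev_pts, cur_pts, boxes, dist_thresh=1.5):
    """Keep only candidates whose tracked center violates the epipolar constraint."""
    moving = []
    for (x0, y0), (x1, y1), box in zip(prev_pts, cur_pts, boxes):
        # Epipolar line in the current image induced by the previous location.
        a, b, c = (F @ np.array([x0, y0, 1.0])).ravel()
        dist = abs(a * x1 + b * y1 + c) / np.hypot(a, b)
        if dist > dist_thresh:
            moving.append(box)        # off its epiline -> genuinely moving
        # Centers (near) their epiline behave like static 3D structure
        # (parallax) and are filtered out.
    return moving
```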

If the camera information required for calibration is available, the essential matrix can be used instead of the fundamental matrix for more accurate results, as follows:

$$\begin{aligned} F \equiv K^{-T}\widehat{T}RK^{-1} = K^{-T}EK^{-1} \end{aligned}$$
(5)

where K denotes the camera calibration matrix, \(\widehat{T}\) denotes the skew-symmetric matrix of the translation vector and R denotes the rotation matrix between the corresponding frames.
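Eq. 5 translates directly into a few lines of linear algebra; the sketch below assumes R and the translation vector T describe the relative pose between the two frames (e.g. from the IMU) and K is the calibration matrix.

```python
import numpy as np


def fundamental_from_pose(K, R, T):
    """Build F from calibration and relative pose, F = K^-T [T]_x R K^-1 (Eq. 5)."""
    T_hat = np.array([[0.0, -T[2], T[1]],
                      [T[2], 0.0, -T[0]],
                      [-T[1], T[0], 0.0]])     # skew-symmetric matrix of T
    E = T_hat @ R                              # essential matrix
    K_inv = np.linalg.inv(K)
    return K_inv.T @ E @ K_inv
```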

Fig. 3.

Visual comparison of feature tracking and object center tracking with sparse optical flow on EgTest05. Note that there are multiple matches on some of the objects, which results in multiple epipolar constraint calculations.

Fig. 4.

Visual comparison of feature tracking and object center tracking with sparse optical flow on our in-house captured video. Note that some objects may not have features associated with them; therefore feature tracking (and hence parallax handling) may fail. This problem is mitigated by using optical flow.

Fig. 5.

Epipolar constraint. Image courtesy of [17].

4 Experiments

4.1 Datasets

We evaluate our technique rigorously using two different configurations. In the first, we use the well-known VIVID [45] dataset. VIVID consists of nine sequences, three of which are thermal IR and the rest RGB. Since VIVID was developed for benchmarking tracking algorithms, annotations are available only for every tenth frame and only a single object is annotated, even when multiple moving objects exist. It is nevertheless the most commonly used dataset for evaluating moving object detection algorithms, so we use a select number of VIVID sequences (egtest01-02-04-05) solely to compare our results with other algorithms.

Our second evaluation is performed using the publicly available LAMOD dataset [17]. LAMOD consists of various sequences taken from two publicly available datasets, VIVID and UAV123 [16]. These sequences are hand-annotated from scratch for every moving object present in the scene. Annotations are available for every frame, and the dataset exhibits a large set of adverse effects, such as motion parallax, occlusion, out-of-focus blur and altitude/viewpoint variation [17].

4.2 Results

Execution Time. The run-time improvements introduced by our approach are primarily twofold: calculation of the features and homography at the downscaled resolution, and sparse optical flow based parallax filtering. We perform our execution time analysis on NVIDIA Jetson TX1 and TX2 modules.

As expected, feature extraction on downscaled frames introduces significant speed-ups. Going from \(1280 \times 720\) to \(640 \times 360\) resolution, downscaled processing improves feature extraction times from 148 to 42 ms on TX1 and from 113 to 30 ms on TX2. As downscaled processing effectively reduces the number of extracted features, this is also reflected in the speed of feature matching: with brute-force matching, the cost scales with the product of the feature counts in the two frames, so it improves roughly with the square of the reduction in feature count. Comparing the \(1280 \times 720\) and \(640 \times 360\) versions, matching times drop from 146 to 8 ms on TX1 and from 106 to 6 ms on TX2 (an improvement of approximately 1700%). Sparse optical flow based parallax handling also introduces considerable execution time gains over feature-based parallax handling, as shown in Table 1. TX1 results show an improvement of 20% to 25%, whereas TX2 results show improvements of 18% to 20%.

Table 1. Execution time of our proposed approach for different input resolutions. Feat. indicates the version where features are extracted from candidate objects for parallax filtering. O.F. indicates the version where object centres are tracked with sparse optical flow for parallax filtering.

Table 2 shows a detailed comparison of a recent technique [17] and our approach. A significant improvement of up to 40% is observed for low-resolution inputs, both with and without parallax filtering. For larger input resolutions, improvements are between 200% and 400%.

Table 2. Execution time of our proposed approach for different input resolutions. NF represents no parallax filtering, PF represents parallax filtering and Ours refers to our proposed approach.
Table 3. Precision and recall values for 4 sequences of the VIVID dataset with the original single-object tracking ground truth. We extrapolate the results of the baselines as they do not provide numerical results directly. NF and PF represent results without and with parallax filtering. The entries in each row are precision and recall (in percent), respectively.
Table 4. Precision and recall values for 4 sequences of the VIVID dataset with the multi-object moving object detection ground truth provided in the LAMOD dataset. NF and PF represent results without and with parallax filtering. The entries in each row are precision and recall (in percent), respectively. Results indicated with \(*\) calculate precision/recall for each frame and then average over the entire sequence. Results indicated with \(\dagger \) represent our technique operating on original-resolution images (no downscaling).
Fig. 6.

Detection results on 4 sequences of the VIVID dataset. Green boxes are detection results, blue boxes are ground truth taken from the LAMOD dataset, and grey boxes are candidate objects that are filtered out by our parallax filtering algorithm. (Color figure online)

To support our claim that downscaled processing does not lead to significant degradation in accuracy, we also assess our pipeline operating entirely at the original high resolution. We present results for original-resolution and downscaled operation against the LAMOD ground truths in Table 4. The results show only a slight decrease in accuracy for downscaled operation compared to high-resolution operation. Apart from a maximum 6% decrease in recall for egtest02, we do not see any other significant decrease in accuracy. In fact, precision and recall values do not change at all in many cases, such as the egtest04 precision and recall values.

Accuracy. We first evaluate our proposed approach using the single-object ground truths of the VIVID dataset to compare our performance with other baseline algorithms. We use precision/recall as our metric and consider a minimum of 50% overlap to be a correct detection. As all the baseline algorithms report their results in terms of correct detection ratio and miss detection ratio, we convert these results to precision and recall for a better comparison (the miss detection ratio is effectively \(1-precision\), whereas the correct detection ratio is equal to precision). We do not report parallax handling results for the sequences EgTest01 and EgTest02 as they do not exhibit parallax effects. Results are shown in Table 3.

Our proposed algorithm performs comparably to the other baselines, even surpassing them on several sequences; our EgTest01 and EgTest02 results outperform all others in precision, whereas our precision or recall values are the second best on the other sequences. Our method also stands out in that its precision and recall values are close to each other. When we perform parallax handling, the expected reduction in recall is compensated by an increase in precision, practically evening out or improving the final F-score. It must be noted that nearly all baselines are effectively object trackers, which means our algorithm performs quite accurately given that we do not support our detections with a sophisticated tracker.

We then assess our pipeline for multiple moving objects using the LAMOD dataset. We use precision/recall and per-frame precision/recall (i.e. precision and recall calculated for every frame and then averaged) as our evaluation metrics, where a 50% overlap is considered a detection. As in the previous part of our evaluation, we do not report parallax filtering results for EgTest01 and EgTest02. Exemplary results are visualized in Fig. 6. Results are shown in Table 4.
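For reference, the per-frame metric can be sketched as below; the greedy one-to-one matching at an IoU of 0.5 is our own assumption of a reasonable matching rule, not necessarily the exact one used here.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax1, ay1 = a[0] + a[2], a[1] + a[3]
    bx1, by1 = b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax1, bx1) - max(a[0], b[0]))
    ih = max(0, min(ay1, by1) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def per_frame_pr(detections, ground_truths, min_iou=0.5):
    """Average per-frame precision/recall; inputs are per-frame lists of boxes."""
    precisions, recalls = [], []
    for dets, gts in zip(detections, ground_truths):
        matched, tp = set(), 0
        for d in dets:
            for j, g in enumerate(gts):
                if j not in matched and iou(d, g) >= min_iou:
                    matched.add(j)
                    tp += 1
                    break
        precisions.append(tp / len(dets) if dets else 1.0)
        recalls.append(tp / len(gts) if gts else 1.0)
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)
```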

The results indicate that our proposed algorithm significantly outperforms an existing baseline [17] on all sequences except EgTest05. Parallax filtering introduces considerable gains in precision and modest reductions in recall, as reported before. The recall reduction is expected, as EgTest04 and EgTest05 contain degenerate cases (i.e. objects and the platform moving in the same direction) and our approach currently does not handle such cases. This leads to the elimination of true positives by parallax filtering, and thus the reduction in recall.

Fig. 7.

The effect of lens distortion correction. Note that although the effect of lens correction on the input images may be almost imperceptible, the distortion gives rise to many pixel-level errors.

Fig. 8.

The effect of dynamic frame buffering. Note that a buffer size adjusted for 50 m altitude works accurately at 50 m but fails at 100 m altitude. Adaptively changing the buffer size for 100 m significantly improves our detection performance.

4.3 Multi-modal Extension

In the previous section, as we use public datasets where no IMU, height measurement or camera information is available, we cannot fully utilise the adaptive algorithm shown in Fig. 1. This means we cannot use lens distortion correction at all, and we can only use a fixed set of parameters (e.g. a fixed buffer size) for all sequences. In order to show how our pipeline works when utilising external sensory data, we present qualitative results on our in-house captured videos, for which we were able to acquire the relevant IMU and camera parameter information.

Lens Distortion Correction. Lens distortion displaces certain pixels to other locations, radially or tangentially in our case, which directly affects our results (see Fig. 7(b)). Because pixels are displaced, they are erroneously detected as moving objects during image registration. By using the radial and tangential distortion coefficients specific to the camera lens, this effect can be corrected. Such correction leads to visible improvements in our performance (see Fig. 7(d)).
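With OpenCV this correction is a one-liner once the calibration is known; the sketch below assumes K and the distortion coefficients (k1, k2, p1, p2, k3) come from an offline calibration (e.g. cv2.calibrateCamera with a checkerboard).

```python
import cv2


def undistort_frame(frame, K, dist_coeffs):
    """Correct radial and tangential lens distortion before buffering a frame."""
    h, w = frame.shape[:2]
    # alpha=0 crops away the black border introduced by the remapping.
    new_K, _ = cv2.getOptimalNewCameraMatrix(K, dist_coeffs, (w, h), alpha=0)
    return cv2.undistort(frame, K, dist_coeffs, None, new_K)
```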

Dynamic Frame Buffer. Slowly moving objects can be hard to detect at high altitudes, as their relative displacement in the image is small. This can be alleviated by using the height measurements provided by the barometric sensor and the vehicle speed measurements provided by the IMU: we dynamically change the size of the buffer (namely the gap between the frames to be differenced) linearly with the altitude and speed information. By doing so, we effectively amplify the perceived movement of slowly moving objects, making them much easier to detect. Exemplary results shown in Fig. 8(c) and (d) verify this and show a visible improvement in recall.
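As a purely illustrative instance of such a linear mapping (the exact rule and constants are not specified here), the frame gap could be set as

$$\begin{aligned} \varDelta = \min \left( \varDelta _{max},\ \max \left( \varDelta _{min},\ \varDelta _0 + \alpha h + \beta v \right) \right) \end{aligned}$$

where \(h\) is the barometric altitude, \(v\) is the platform speed from the IMU, and \(\varDelta _0\), \(\alpha \) and \(\beta \) are empirically tuned constants whose signs and magnitudes depend on the sensor units and the desired detection sensitivity.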

5 Conclusions

In this paper, we propose a new approach to the moving object detection problem for imagery taken from low-altitude aerial platforms. Capable of handling the motion of the platform as well as the detrimental effects of motion parallax, our approach performs parallax handling by sparse optical flow based tracking combined with the epipolar constraint, and performs a large portion of the pipeline at lower resolutions. These two changes introduce significant runtime improvements, reaching up to 16 FPS on embedded resources. Moreover, we analyze our approach on two different datasets for single and multiple moving object detection tasks. We observe that it performs either comparably to or better than existing state-of-the-art algorithms. We also outline an advanced pipeline capable of exploiting multi-modal data that might alleviate the need for laborious parameter tuning. As future work, we aim to integrate a lightweight scheme to alleviate the effect of degenerate motion cases. Should a dataset with IMU, height measurements and camera information become publicly available, we aim to assess our approach in a multi-modal setting.