1 Introduction

Object detection is a fundamental problem in image understanding. Deep convolutional neural networks have been successfully applied to this task [2, 18, 19, 20, 21, 22, 29]. Although they have achieved great success in object detection from static images, video object detection remains a challenging problem. Frames in videos are often deteriorated by motion blur or video defocus, which are extremely difficult for single-frame detectors.

To tackle the challenges posed by deteriorated frames, one straightforward solution is to exploit the spatial and temporal coherence in videos and leverage information from nearby frames. Following this idea, [5, 8, 14, 15] explore hand-crafted bounding box association rules to refine the final detection results. As post-processing methods, these rules are not jointly optimized. In contrast, FGFA [30] leverages temporal coherence at the feature level by aggregating features of nearby frames along motion paths. It uses flow estimation to predict per-pixel motion, hereinafter referred to as pixel-level feature calibration. However, such pixel-level feature calibration can be inaccurate when the appearance of objects changes dramatically, especially when objects are occluded. With inaccurate flow estimation, the flow-guided warping may mislead the feature calibration and fail to produce ideal results. Thus, the robustness of feature calibration is of great importance.

Fig. 1. Examples of occlusion in video object detection. When the bus is occluded by a passing car, the single-frame detector fails to produce an accurate box. Pixel-level calibration helps improve the result but is still affected by the occlusion. Instance-level calibration performs best among these results.

In this paper, our philosophy is that accurate and robust feature calibration across frames plays an important role in video object detection. In addition to existing pixel-level methods, we propose an instance-level feature calibration method that estimates the motion of each object over time in order to accurately aggregate features. Specifically, for each proposal in the reference frame, the corresponding motion features are extracted to predict the relative movements between nearby frames and the current frame. According to the predicted relative movements, the features of the same object in nearby frames are RoI-pooled and aggregated for a better representation. Compared to pixel-level calibration, instance-level calibration is more robust to large temporal appearance variations such as occlusions. As shown in Fig. 1, when the bus in the reference frame is occluded, flow estimation fails to predict such detailed motion. The warped features of nearby frames can be used to improve the current result, but they are still affected by the occluded pixels. In contrast, instance-level calibration considers an object as a whole and estimates the motion of the entire object. We argue that such high-level motion is more reliable, especially when the object is occluded.

Moreover, taking a closer look at the above two calibrations, we find that pixel-level and instance-level calibration can work collaboratively depending on the motion pattern. The former is more flexible for modeling non-rigid motion, particularly for small animals, while the latter, based on high-level motion estimation, better describes regular motion trajectories (e.g., cars). Based on this observation, we develop a motion pattern reasoning module. If the motion pattern is likely to be non-rigid and no occlusion occurs, the final result relies more on the pixel-level calibration; otherwise, it depends more on the instance-level calibration. All the above modules are integrated into a unified framework that can be trained end-to-end.

Relative to the baseline model R-FCN, the proposed instance-level calibration and the full MANet improve mAP by 3.5% and 4.5%, respectively, on the ImageNet VID dataset.

In summary, the contributions of this paper include:

  • We propose an instance-level feature calibration method by learning instance movements through time. The instance-level calibration is more robust to occlusions and outperforms pixel-level feature calibration.

  • By visualizing typical samples and conducting statistical experiments, we develop a motion pattern reasoning module to dynamically combine pixel-level and instance-level calibration according to the motion. We show how to jointly train them in an end-to-end manner.

  • We demonstrate the MANet on the large-scale ImageNet VID dataset [23] with state-of-the-art performance. Our code is available at: https://github.com/wangshy31/MANet_for_Video_Object_Detection.git.

2 Related Work

2.1 Object Detection from Still Images

Existing state-of-the-art methods for general object detection are mainly based on deep CNNs [1, 10, 11, 16, 25, 26, 27]. Building on such powerful networks, many works [2, 3, 6, 7, 18, 22, 24] have further improved detection performance. [7] is a typical proposal-based CNN detector that uses Selective Search [28] to extract proposals. Different from this multi-stage pipeline, [6] develops an end-to-end training method by applying spatial pyramid pooling [9]. Faster R-CNN [22] further incorporates the proposal generation procedure into CNNs with most parameters shared, leading to much higher proposal quality as well as computation speed. R-FCN [2] is another fully convolutional detector. To address the lack of position sensitivity, it introduces position-sensitive score maps and a position-sensitive RoI pooling layer. We use R-FCN as our baseline and extend it to video object detection.

2.2 Object Detection in Videos

Unlike object detection in still images, detectors for videos should take temporal information into account. One mainstream approach explores bounding box association rules and applies heuristic post-processing. The other stream of work leverages temporal coherence at the feature level and seeks to improve detection quality in a principled way.

For post-processing, the main idea is to use high-scoring objects from nearby frames to boost the scores of weaker detections within the same video. The major difference among these methods is the strategy for linking still-image detections into cross-frame box sequences. [8] links cross-frame bounding boxes if their IoU exceeds a certain threshold, generates potential linkages across the entire clip, and then proposes a heuristic re-ranking method called "Seq-NMS". [14, 15] focus on tubelet rescoring, where tubelets are the bounding boxes of an object over time. They apply an offline tracker to revisit the detection results and then associate still-image object detections around the tubelets. [15] presents a re-scoring method that improves the temporal consistency of tubelets. Moreover, [14] proposes multi-context suppression (MCS) to suppress false positive detections and motion-guided propagation (MGP) to recover false negatives. D&T [5] is the first work to jointly learn an RoI tracker along with the detector; the cross-frame tracker is used to boost the scores of positive boxes. All the above approaches focus on post-processing and can be combined with feature-level methods. We demonstrate this by combining Seq-NMS [8] with our model, where the two reinforce each other and further improve performance.

For feature-level learning, [13, 30, 31] propose end-to-end frameworks to enhance the features of individual frames in videos. [30] presents flow-guided feature aggregation to leverage temporal coherence at the feature level. To spatially calibrate features across frames, it applies an optical flow network [4] to estimate the per-pixel motion between nearby frames and the reference frame. The feature maps of nearby frames are then warped to the reference frame to enhance the current representation. Similarly, [31] also utilizes an optical flow network to model correspondences in raw pixels, but uses it to achieve significant speedup. However, low-level motion prediction lacks robustness, especially in the presence of occlusion [12], and such per-pixel prediction without context may lack local consistency [17]. Different from still-image proposals, [13] provides a tubelet proposal network to efficiently generate spatiotemporal proposals. The tubelet starts from static proposals and extracts multi-frame features to predict object motion patterns relative to the spatial anchor, extending 2-D proposals to spatiotemporal tubelet proposals. All these methods serve as strong baselines for our comparison.

Table 1. Notations.

3 Fully Motion-Aware Network

3.1 Overview

We first briefly overview the entire pipeline. Table 1 summarizes the main notations used in this paper. The proposed model is built on a standard still-image detector consisting of the feature extractor \(\mathcal {N}_{feat}\), the region proposal network \(\mathcal {N}_{rpn}\) [22] and the region-based detector \(\mathcal {N}_{rfcn}\) [2]. The key idea of the proposed model is to aggregate neighboring frames through feature calibration.

First, \(\mathcal {N}_{feat}\) simultaneously receives three frames \(\varvec{I}_{t-\tau }\), \(\varvec{I}_t\) and \(\varvec{I}_{t+\tau }\) as input, and produces the intermediate features \(\varvec{f}_{t-\tau }\), \(\varvec{f}_t\) and \(\varvec{f}_{t+\tau }\). As shown in Fig. 2, the horizontal line running through the middle of the diagram produces the reference features \(\varvec{f}_t\), while the top and bottom lines produce the nearby features \(\varvec{f}_{t-\tau }\) and \(\varvec{f}_{t+\tau }\). These single-frame features are spatially calibrated in the following two steps.

Second, the pixel-level calibration is applied to calibrate \(\varvec{f}_{t-\tau }\) and \(\varvec{f}_{t+\tau }\), generating \(\varvec{f}_{t-\tau \rightarrow t}\) and \(\varvec{f}_{t+\tau \rightarrow t}\). These features are then aggregated into \(\varvec{f}_{pixel}\); the detailed formulations are given in Sect. 3.2. \(\varvec{f}_{pixel}\) is subsequently fed to \(\mathcal {N}_{rpn}\) to produce proposals, as well as to \(\mathcal {N}_{rfcn}\), where it is later combined with the instance-level calibrated features.

Third, the instance-level calibration is conducted on the position-sensitive score maps in \(\mathcal {N}_{rfcn}\). Specialized convolutional layers are applied on \(\varvec{f}_{t-\tau }\), \(\varvec{f}_t\) and \(\varvec{f}_{t+\tau }\) to produce a bank of \(k^2\) position-sensitive score maps \(\varvec{s}_{t-\tau }, \varvec{s}_t\) and \(\varvec{s}_{t+\tau }\). For the i-th proposal \((x_t^i, y_t^i, w_t^i, h_t^i)\) of \(\varvec{s}_t\), we introduce a procedure to regress the corresponding proposal locations \((x_{t-\tau }^i, y_{t-\tau }^i, w_{t-\tau }^i, h_{t-\tau }^i)\) for \(\varvec{s}_{t-\tau }\) and \((x_{t+\tau }^i, y_{t+\tau }^i, w_{t+\tau }^i, h_{t+\tau }^i)\) for \(\varvec{s}_{t+\tau }\). As formulated in Sect. 3.3, with these predicted proposals, features in nearby frames are RoI-pooled and aggregated into \(\varvec{s}^i_{insta}\).

Finally, motion pattern reasoning is carried out to decide how to combine the differently calibrated features. Since \(\varvec{f}_{pixel}\) is also fed into \(\mathcal {N}_{rfcn}\), it produces \(\varvec{s}_{pixel}^i\) for the i-th proposal. This module combines \(\varvec{s}^i_{insta}\) and \(\varvec{s}_{pixel}^i\) according to the motion pattern, as described in Sect. 3.4.

In our method, all the modules, including feature extractor \(\mathcal {N}_{feat}\), \(\mathcal {N}_{rpn}\), \(\mathcal {N}_{rfcn}\), pixel-level calibration, instance-level calibration and motion pattern reasoning are trained end-to-end.
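To make the data flow concrete, the following Python sketch traces the four steps, with each network represented by a placeholder callable. The names `n_feat`, `flow_net`, `pixel_calibrate`, `instance_calibrate` and `combine` are illustrative, not the authors' API; the actual modules are formulated in Sects. 3.2-3.4.

```python
def manet_forward(frames, n_feat, flow_net, n_rpn, n_rfcn,
                  pixel_calibrate, instance_calibrate, combine):
    """High-level data flow of the MANet (Sect. 3.1); every argument after
    `frames` is a callable standing in for the corresponding module."""
    i_prev, i_ref, i_next = frames                       # I_{t-tau}, I_t, I_{t+tau}
    f_prev, f_ref, f_next = n_feat(i_prev), n_feat(i_ref), n_feat(i_next)
    flow_prev = flow_net(i_prev, i_ref)                  # F(I_{t-tau}, I_t)
    flow_next = flow_net(i_next, i_ref)                  # F(I_{t+tau}, I_t)
    # (b) pixel-level calibration and aggregation (Sect. 3.2); the reference
    # frame contributes with a zero flow field, i.e. no warping.
    f_pixel = pixel_calibrate([f_prev, f_ref, f_next],
                              [flow_prev, flow_prev * 0, flow_next])
    proposals = n_rpn(f_pixel)
    s_pixel = n_rfcn(f_pixel, proposals)                 # s^i_pixel per proposal
    # (c) instance-level calibration on position-sensitive score maps (Sect. 3.3).
    s_insta = instance_calibrate([f_prev, f_ref, f_next],
                                 [flow_prev, flow_next], proposals)
    # (d) motion-pattern-based combination (Sect. 3.4).
    return combine(s_insta, s_pixel, proposals)
```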

Fig. 2. (Better viewed in color) The overall framework of the proposed fully motion-aware network (MANet). It comprises the following four steps: (a) single-frame feature extraction and flow estimation, whose results are fed to the next two steps; (b) pixel-level calibration by per-pixel warping; (c) instance-level calibration through predicting instance movements; (d) motion-pattern-based feature combination.

3.2 Pixel-Level Calibration

Motivated by [30, 31], given a reference frame \(\varvec{I}_t\) and a neighboring frame \(\varvec{I}_{t-\tau }\) (or \(\varvec{I}_{t+\tau }\)), we model the pixel-level calibration through optical flow estimation. Let \(\mathcal {F}\) be a flow estimation algorithm, such as FlowNet [4], and let \( \mathcal {F}(\varvec{I}_{t-\tau }, \varvec{I}_{t})\) denote the flow field estimated by this network from frame \(\varvec{I}_t\) to \(\varvec{I}_{t-\tau }\). We can then warp the feature maps from the neighboring frame to the current frame as follows:

$$\begin{aligned} \begin{aligned} \varvec{f}_{t-\tau }&= \mathcal {N}_{feat}(\varvec{I}_{t-\tau })\\ \varvec{f}_{t-\tau \rightarrow t}&= \mathcal {W}(\varvec{f}_{t-\tau }, \mathcal {F}(\varvec{I}_{t-\tau }, \varvec{I}_{t}))\\ \end{aligned} \end{aligned}$$
(1)

where \(\varvec{f}_{t-\tau }\) denotes the feature maps extracted by \(\mathcal {N}_{feat}\) and \(\varvec{f}_{t-\tau \rightarrow t}\) denotes the features warped from time \(t-\tau \) to time t. The warping operation \(\mathcal {W}\) is implemented by a bilinear function applied on each location of all feature maps. It projects a location \(\varvec{p}+\varDelta \varvec{p}\) in the nearby frame \(t-\tau \) to the location \(\varvec{p}\) in the current frame. We formulate it as:

$$\begin{aligned} \begin{aligned} \varDelta \varvec{p}&= \mathcal {F}(\varvec{I}_{t-\tau }, \varvec{I}_{t})(\varvec{p})\\ \varvec{f}_{t-\tau \rightarrow t}(\varvec{p})&= \sum _{\varvec{q}}{G(\varvec{q}, \varvec{p} + \varDelta \varvec{p})\varvec{f}_{t-\tau }(\varvec{q})} \end{aligned} \end{aligned}$$
(2)

where \(\varDelta \varvec{p}\) is the output of flow estimation at location \(\varvec{p}\), \(\varvec{q}\) enumerates all spatial locations in the feature maps \(\varvec{f}_{t-\tau }\), and \(G(\cdot )\) denotes the bilinear interpolation kernel, defined as:

$$\begin{aligned} \begin{aligned} G(\varvec{q}, \varvec{p} + \varDelta \varvec{p}) = max(0, 1 - ||\varvec{q} - (\varvec{p} + \varDelta \varvec{p})||)\\ \end{aligned} \end{aligned}$$
(3)

After obtaining calibrated features of nearby frames, we average these features as the low-level aggregation for the updated reference features:

$$\begin{aligned} \begin{aligned} \varvec{f}_{pixel} = \frac{\sum ^{t+\tau }_{j = t-\tau }{\varvec{f}_{j \rightarrow t}}}{2\tau +1}\\ \end{aligned} \end{aligned}$$
(4)

where \(\varvec{f}_{pixel}\) is generated from the nearby frames from time \(t-\tau \) to time \(t+\tau \). [30] proposes an adaptive weight to combine these nearby features, but we find that averaging the motion-guided features achieves similar performance with less computation cost. We therefore adopt the average operation in our model.
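As a concrete reference, the following PyTorch sketch implements Eqs. 1-4, with `torch.nn.functional.grid_sample` realizing the bilinear kernel of Eq. 3. It is a minimal sketch, not the authors' released implementation, and assumes the flow field is expressed in feature-map pixel offsets.

```python
import torch
import torch.nn.functional as F

def flow_warp(feat, flow):
    """Warp nearby-frame features to the reference frame with a flow field (Eqs. 2-3).

    feat: (N, C, H, W) features f_{t-tau}; flow: (N, 2, H, W) per-pixel offsets
    (x, y) from each reference location p to its counterpart p + delta_p.
    """
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(feat.device)      # grid of p
    coords = base.unsqueeze(0) + flow                                # p + delta_p
    # Normalize sampling coordinates to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)                 # (N, H, W, 2)
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)

def pixel_level_aggregate(feats, flows):
    """Average the flow-warped features of nearby frames (Eq. 4).

    feats[j], flows[j]: features and flow fields for each frame j in the window;
    the reference frame uses a zero flow field (identity warp).
    """
    warped = [flow_warp(f, d) for f, d in zip(feats, flows)]
    return torch.stack(warped, dim=0).mean(dim=0)
```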

Through the pixel-level calibration, the features of nearby frames are spatially and temporally aligned so as to provide diverse information for the reference frame. This alleviates several challenges in videos such as motion blur and video defocus.

3.3 Instance-Level Calibration

Pixel-level feature calibration is flexible for modeling non-rigid motion, but it requires precise per-pixel correspondence and may therefore be inaccurate when an object is occluded. In this subsection, we extend it to instance-level motion modeling, which is far more tolerant of occlusions.

The instance-level calibration is conducted on the score maps of R-FCN, which uses specialized convolutional layers to produce position-sensitive score maps \(\varvec{s}_t\). In order to aggregate scores for the i-th proposal \(\varvec{s}_t^i\), we need \(\varvec{s}_{t-\tau }\), \(\varvec{s}_{t+\tau }\) and the proposal movements. \(\varvec{s}_{t-\tau }\) and \(\varvec{s}_{t+\tau }\) can be easily obtained by feeding \(\varvec{f}_{t-\tau }\) and \(\varvec{f}_{t+\tau }\) to the R-FCN. The remaining problem is how to learn the relative movements of the i-th proposal, which is the prerequisite for calibrating instance-level features.

We take the flow estimation and the proposals of the reference frame as input, and produce the movements of each proposal between the neighboring frame and the current frame. Predicting relative movements requires motion information. Although per-pixel motion prediction by FlowNet may be inaccurate under occlusion, it still captures the motion tendency. We use this motion tendency as input and output the movement of the entire object. As in Sect. 3.2, we only formulate the relationship between \(\varvec{I}_{t-\tau }\) and \(\varvec{I}_t\); \(\varvec{I}_{t+\tau }\) is handled in the same way.

First, we utilize the RoI pooling operation to generate the pooled features \(\varvec{m}_{t-\tau }^{i}\) of the i-th proposal at location \((x_t^i, y_t^i, h_t^i, w_t^i)\):

$$\begin{aligned} \begin{aligned} \varvec{m}_{t-\tau }^{i} = \phi (\mathcal {F}(\varvec{I}_{t-\tau }, \varvec{I}_{t}), (x_t^i, y_t^i, h_t^i, w_t^i)) \end{aligned} \end{aligned}$$
(5)

where \(\phi (\cdot )\) indicates RoI pooling [6] and \(\mathcal {F}(\varvec{I}_{t-\tau }, \varvec{I}_{t})\) is the flow estimation produced by the shared FlowNet of Sect. 3.2. RoI pooling uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent.

Then a regression network \(R(\cdot )\) is used to estimate the movement of the i-th proposal between frames \({t-\tau }\) and t from \(\varvec{m}_{t-\tau }^{i}\):

$$\begin{aligned} \begin{aligned} (\varDelta _{x_{t-\tau }}^i, \varDelta _{y_{t-\tau }}^i,\varDelta _{w_{t-\tau }}^i,\varDelta _{h_{t-\tau }}^i) = R(\varvec{m}_{t-\tau }^{i})\\ \end{aligned} \end{aligned}$$
(6)

where \((\varDelta _{x_{t-\tau }}^i, \varDelta _{y_{t-\tau }}^i,\varDelta _{w_{t-\tau }}^i,\varDelta _{h_{t-\tau }}^i)\) are the relative movements and \(R(\cdot )\) is implemented by a fully connected layer. The remaining problem is how to design proper supervision for learning the relative movements. Since we have the track-id of each object within a video, we can generate the relative movements from the ground-truth bounding boxes. We assume the proposals move consistently with the ground-truth objects, so the regression target is assigned the ground-truth box movement if the proposal overlaps a ground-truth box by at least 0.5 in intersection-over-union (IoU). In other words, only the positive proposals learn to regress the movements between consecutive frames. We use the normalized relative movements as regression targets.
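A minimal PyTorch sketch of Eqs. 5-6 is given below, using `torchvision.ops.roi_pool` for \(\phi(\cdot)\). The single fully connected layer follows the text, while the pooled size, the RoI format (corner coordinates with a batch index) and the feature stride are assumptions of this sketch.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class MovementRegressor(nn.Module):
    """Regress per-proposal movements from RoI-pooled flow features (Eqs. 5-6).

    Layer sizes are hypothetical; the paper only states that R(.) is one fully
    connected layer on top of RoI-pooled flow features.
    """
    def __init__(self, pooled_size=7, flow_channels=2):
        super().__init__()
        self.pooled_size = pooled_size
        self.fc = nn.Linear(flow_channels * pooled_size * pooled_size, 4)

    def forward(self, flow, rois, spatial_scale=1.0 / 16):
        # rois: (R, 5) with (batch_idx, x1, y1, x2, y2) in image coordinates.
        m = roi_pool(flow, rois, output_size=self.pooled_size,
                     spatial_scale=spatial_scale)            # Eq. 5: phi(F, roi)
        deltas = self.fc(m.flatten(start_dim=1))             # Eq. 6: (dx, dy, dw, dh)
        return deltas
```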

Once we obtain the relative movements, we are able to calibrate the features across time and aggregate them to enhance the feature of the current frame. The proposal of frame \(\varvec{I}_{t-\tau }\) can be inferred as:

$$\begin{aligned} \begin{aligned} x_{t-\tau }^i&= \varDelta _{x_{t-\tau }}^i \times w_t^i + x_t^i\qquad \ y_{t-\tau }^i = \varDelta _{y_{t-\tau }}^i \times h_t^i + y_t^i\\ w_{t-\tau }^i&= exp(\varDelta _{w_{t-\tau }}^i)\times w_t^i \qquad h_{t-\tau }^i = exp(\varDelta _{h_{t-\tau }}^i)\times h_t^i\\ \end{aligned} \end{aligned}$$
(7)

Based on the estimated proposal locations for nearby frames, the aggregated feature of the i-th proposal can be calculated as:

$$\begin{aligned} \begin{aligned} \varvec{s}_{insta}^{i} = \frac{\sum ^{t+\tau }_{j = t-\tau }{\psi (\varvec{s}_{j}, (x_{j}^i, y_{j}^i, h_{j}^i, w_{j}^i))}}{2\tau +1}\\ \end{aligned} \end{aligned}$$
(8)

where \(\varvec{s}_{j}\) denotes the score maps of the neighboring frames, \(\psi \) indicates the position-sensitive RoI pooling layer introduced by [2], and \(\varvec{s}_{insta}^{i}\) is the instance-level calibrated feature of the i-th proposal.
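Continuing the sketch, Eqs. 7-8 can be written as follows, with `torchvision.ops.ps_roi_pool` playing the role of \(\psi\). The center-coordinate box format and the feature stride of 1/16 are assumptions of this illustration, not taken from the paper.

```python
import torch
from torchvision.ops import ps_roi_pool  # position-sensitive RoI pooling [2]

def shift_proposals(boxes_t, deltas):
    """Infer proposal locations in a nearby frame from relative movements (Eq. 7).

    boxes_t: (R, 4) reference proposals as (x_center, y_center, w, h);
    deltas:  (R, 4) predicted (dx, dy, dw, dh).
    """
    x, y, w, h = boxes_t.unbind(dim=1)
    dx, dy, dw, dh = deltas.unbind(dim=1)
    return torch.stack((dx * w + x, dy * h + y,
                        torch.exp(dw) * w, torch.exp(dh) * h), dim=1)

def instance_level_aggregate(score_maps, boxes_per_frame, k=7, spatial_scale=1.0 / 16):
    """Average position-sensitive pooled features over the window (Eq. 8).

    score_maps[j]: (1, C*k*k, H, W) position-sensitive maps of frame j;
    boxes_per_frame[j]: (R, 4) center-format proposals calibrated to frame j.
    """
    pooled = []
    for s, boxes in zip(score_maps, boxes_per_frame):
        x, y, w, h = boxes.unbind(dim=1)
        corners = torch.stack((x - w / 2, y - h / 2, x + w / 2, y + h / 2), dim=1)
        rois = torch.cat((torch.zeros(len(boxes), 1, device=boxes.device), corners),
                         dim=1)                              # batch index 0 per box
        pooled.append(ps_roi_pool(s, rois, output_size=k, spatial_scale=spatial_scale))
    return torch.stack(pooled, dim=0).mean(dim=0)            # s^i_insta per proposal
```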

Discussion on the regression of relative movements. [13] faces a similar movement regression problem when generating tubelets: it pools multi-frame visual features from the same spatial location of a proposal to regress the object's movements. However, such features, taken from the same location across time and lacking explicit motion information, make the regression difficult to train. In our instance-level movement learning, we use the flow estimation as input to predict movements, which allows regressing the movements of all proposals simultaneously without any extra initialization tricks. [5] proposes a correlation-based regression. Compared to that additional correlation operation, we adopt a shared FlowNet to model both kinds of motion (pixel-level and instance-level) simultaneously. This brings two advantages: (1) feature sharing saves computation cost (shown in Sect. 4.6); (2) the supervision for instance-level movement regression provides additional motion information and improves the flow estimation as well.

3.4 Motion Pattern Reasoning and Overall Learning Objective

Sections 3.2 and 3.3 describe two motion estimation methods. Since they have respective advantages for different motion, the key issue in combining them is to measure the non-rigidity of the motion pattern. Intuitively, when the bounding box's aspect ratio \(\frac{x^i_{t}}{y^i_{t}}\) changes rapidly across time, the motion pattern is more likely to be non-rigid. Thus, we use the central difference \(\delta (\frac{x^i_{t}}{y^i_{t}})\) to express the change rate of the aspect ratio at the current time. To obtain a more stable estimate, we average over a short snippet to produce the final motion pattern descriptor:

$$\begin{aligned} \begin{aligned} \delta (\frac{x^i_{t}}{y^i_{t}}) =(\frac{x^i_{t+1}}{y^i_{t+1}} - \frac{x^i_{t-1}}{y^i_{t-1}})/2\\ p^i_{nonri} = \frac{\sum ^{t+\tau -1}_{j = t-\tau +1}{\delta (\frac{x^i_{j}}{y^i_{j}})}}{2\tau -1} \end{aligned} \end{aligned}$$
(9)

where \(p^i_{nonri}\) is the motion pattern descriptor for the i-th proposal. The corresponding proposals in the nearby frames can be obtained from Sect. 3.3.
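Interpreted literally, Eq. 9 amounts to the short sketch below; the per-frame box layout with the ratio's numerator and denominator in the first two columns is an assumption of this illustration.

```python
import torch

def nonrigidity_descriptor(boxes_over_time):
    """Motion pattern descriptor p_nonri of Eq. 9.

    boxes_over_time: (2*tau+1, R, 4) tensor holding the same R proposals tracked
    over the snippet (ordered t-tau, ..., t, ..., t+tau), where the first two
    columns are the coordinates used in the ratio of Eq. 9.
    """
    ratio = boxes_over_time[..., 0] / boxes_over_time[..., 1]   # x / y per frame
    central_diff = (ratio[2:] - ratio[:-2]) / 2.0               # delta at interior frames
    return central_diff.mean(dim=0)                             # (R,) mean over 2*tau-1 frames
```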

Additionally, occlusion is another important factor when combining the two calibrations. We exploit the visual feature within the proposal to predict the probability that the object is occluded:

$$\begin{aligned} \begin{aligned} p^i_{occlu} = R(\phi (\varvec{f}_t, (x_t^i, y_t^i, h_t^i, w_t^i))) \end{aligned} \end{aligned}$$
(10)

where \(R(\cdot )\) is also implemented by a fully connected layer and \(p^i_{occlu}\) is the occlusion probability for the i-th proposal. Notice that Eq. 10 is similar to Eq. 6, but Eq. 6 uses motion features from FlowNet to regress movements while Eq. 10 adopts visual features to predict occlusion. This is mainly because occlusion is more related to appearance.

Considering these two factors, we use learnable soft weights to combine the two calibrated features:

$$\begin{aligned} \begin{aligned} \varvec{s}^i_{com} = \varvec{s}^i_{insta} \times \alpha ( \frac{p^i_{occlu}}{p^i_{nonri}}) + \varvec{s}^i_{pixel} \times (1 - \alpha (\frac{p^i_{occlu}}{p^i_{nonri}})) \end{aligned} \end{aligned}$$
(11)

where \(\alpha (\cdot ): \mathbb {R}\rightarrow [0,1]\) is the mapping function that controls the adjustment range for the weight.
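The occlusion head of Eq. 10 mirrors the movement regressor of Eq. 6 (RoI-pooled visual features followed by a fully connected layer), and Eq. 11 can then be sketched as below. The sigmoid form of \(\alpha(\cdot)\), the absolute value and the small epsilon guarding the ratio are assumptions of this sketch, since the paper only specifies \(\alpha: \mathbb{R}\rightarrow[0,1]\).

```python
import torch

def combine_calibrations(s_insta, s_pixel, p_occlu, p_nonri, eps=1e-6):
    """Soft combination of the two calibrated features (Eq. 11).

    s_insta, s_pixel: per-proposal calibrated features of identical shape;
    p_occlu, p_nonri: (R,) occlusion probabilities and non-rigidity descriptors.
    """
    ratio = p_occlu / (p_nonri.abs() + eps)      # higher -> trust instance level more
    alpha = torch.sigmoid(ratio)                 # assumed form of alpha(.)
    alpha = alpha.view(-1, *([1] * (s_insta.dim() - 1)))   # broadcast over feature dims
    return alpha * s_insta + (1.0 - alpha) * s_pixel
```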

The overall learning objective function is given as:

$$\begin{aligned} \begin{aligned} \mathcal {L}(I) =&\frac{1}{N}\sum _{i=1}^{N}{\mathcal {L}_{cls}(p^i, c^i_{gt})} +\\&\frac{1}{N_{fg}}\sum _{i=1}^{N}{\varvec{1}\{c^i_{gt}>0\}(\mathcal {L}_{reg}(b^i, b^i_{gt})}+\mathcal {L}_{cls}(p^i_{occlu}, c^i_{o\_gt})) +\\&\lambda \frac{1}{N_{tr}}\sum _{i=1}^{N_{tr}}{\mathcal {L}_{tr}(\varDelta ^i, \varDelta ^i_{gt})} \end{aligned} \end{aligned}$$
(12)

where \(c^i_{gt}\) is the ground-truth class label, and \(p^i\) and \(b^i\) stand for the predicted category-wise softmax score and bounding box regression based on \(\varvec{s}^i_{com}\). \(p^i_{occlu}\) and \(\varDelta ^i\) are the occlusion probability and relative movement. \(\varvec{1}\{c^i_{gt}>0\}\) denotes that we only regress the foreground proposals, and \(N_{tr}\) indicates that only positive proposals learn to regress the movement targets. \(\mathcal {L}_{cls}\) is the cross-entropy loss, while \(\mathcal {L}_{reg}\) and \(\mathcal {L}_{tr}\) are the smooth L1 loss. The FlowNet is supervised by both the movement targets and the final detection targets.
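A sketch of Eq. 12, treating the occlusion term as binary cross-entropy and assuming the usual R-FCN target encodings (both assumptions of this sketch), could look like this:

```python
import torch
import torch.nn.functional as F

def overall_loss(cls_logits, cls_targets,
                 box_preds, box_targets,
                 occlu_logits, occlu_targets,
                 move_preds, move_targets, lam=1.0):
    """Overall objective of Eq. 12; cls_targets > 0 marks foreground proposals,
    and the movement terms are computed only over the N_tr positive proposals
    passed in."""
    n = cls_logits.shape[0]
    fg = cls_targets > 0
    n_fg = fg.sum().clamp(min=1).float()

    l_cls = F.cross_entropy(cls_logits, cls_targets, reduction="sum") / n
    l_reg = F.smooth_l1_loss(box_preds[fg], box_targets[fg], reduction="sum") / n_fg
    l_occ = F.cross_entropy(occlu_logits[fg], occlu_targets[fg], reduction="sum") / n_fg
    n_tr = max(move_preds.shape[0], 1)
    l_mov = F.smooth_l1_loss(move_preds, move_targets, reduction="sum") / n_tr
    return l_cls + l_reg + l_occ + lam * l_mov
```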

Given the overall objective function, the whole architecture, including pixel-level calibration, instance-level calibration, motion pattern reasoning, bounding box classification and regression, is learned in an end-to-end way.

4 Experiments

4.1 Dataset Sampling and Evaluation Metrics

We evaluate the proposed framework on the ImageNet [23] object detection from video (VID) dataset that contains 30 classes. It is split into 3862 training videos and 555 validation videos. The 30 categories are labeled with ground-truth bounding boxes and track IDs on all the video frames. We report all results on the validation set and use the mean average precision (mAP) as the evaluation metric by following the protocols in [13, 30, 31].

The 30 object categories in ImageNet VID are a subset of the 200 categories in the ImageNet DET dataset. Although there are more than 112,000 frames in the VID training set, the redundancy among video frames makes the training procedure less efficient. Moreover, the quality of video frames is much poorer than that of the still images in the DET dataset. Thus, following previous approaches, we train our model on the intersection of the ImageNet VID and DET sets (the 30 VID categories). In total, we sample 10 frames from each video in the VID dataset and at most 2K images per class from the DET dataset as our training samples.

4.2 Training and Evaluation

Our model is trained by SGD with a momentum of 0.9. During training, we use a batch size of 4 on 4 GPUs, where each GPU holds one mini-batch. Training is performed in two phases. In the first phase, the model is trained on the mixture of DET and VID for 120K iterations, with learning rates of \(2.5\times 10^{-4}\) and \(2.5 \times 10^{-5}\) in the first 80K and the last 40K iterations, respectively. In the second phase, the movement regression is learned along with the R-FCN for another 30K iterations on the VID dataset in order to better adapt to the VID domain. The feature extractor, a ResNet-101 model, is pre-trained on ImageNet classification as default. FlowNet (the "Simple" version) is pre-trained on the synthetic Flying Chairs dataset [4] to provide motion information. They are jointly learned during the above procedure. In both training and testing, we use single-scale images with a shorter dimension of 600 pixels. For testing, we aggregate a total of 12 nearby frames to enhance the feature of the current frame using Eqs. 4 and 9. Non-maximum suppression (NMS) is applied with an intersection-over-union (IoU) threshold of 0.7 in the RPN and 0.4 on the scored and regressed proposals.
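For quick reference, the hyperparameters of this section are collected below in a single dictionary; the key names are illustrative and not taken from the released configuration files.

```python
# Training and inference settings from Sect. 4.2 (key names are illustrative).
MANET_CFG = {
    "optimizer": {"type": "SGD", "momentum": 0.9},
    "batch_size": 4,                          # one mini-batch per GPU, 4 GPUs
    "phase1": {"data": "DET+VID", "iters": 120_000,
               "lr_schedule": [(0, 2.5e-4), (80_000, 2.5e-5)]},
    "phase2": {"data": "VID", "iters": 30_000},
    "backbone": "ResNet-101 (ImageNet pre-trained)",
    "flow_net": "FlowNetS (Flying Chairs pre-trained)",
    "image_shorter_side": 600,
    "test_aggregated_frames": 12,
    "nms_iou": {"rpn": 0.7, "final": 0.4},
}
```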

Table 2. Accuracy of different methods on ImageNet VID validation, using ResNet-101 feature extraction networks.

4.3 Ablation Study

In this section, we conduct an ablation study to validate the effectiveness of the proposed network. For better analysis, we follow the evaluation protocol in [30], where the ground-truth objects are divided into three groups according to their motion speed, measured by the object's averaged intersection-over-union (IoU) with its corresponding instances in the nearby frames. A lower motion IoU (\(<0.7\)) means the object moves fast, whereas a larger motion IoU (\(>0.9\)) means the object moves slowly; the rest corresponds to medium speed.
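The grouping rule of this protocol can be written as a small helper (the function name and group labels are illustrative):

```python
def motion_speed_group(motion_iou):
    """Assign a ground-truth object to a speed group by its averaged IoU with
    its own instances in nearby frames, following the protocol of [30]."""
    if motion_iou < 0.7:
        return "fast"      # low motion IoU: the object moves quickly
    if motion_iou > 0.9:
        return "slow"      # high motion IoU: the object moves slowly
    return "medium"
```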

Method (a) is the single-frame baseline. It achieves 73.6% mAP using ResNet-101. All other experiments keep the same settings as this baseline. Note that we use a single model and do not add bells and whistles.

Method (b) is conducted by averaging multi-frame features. Even though we use the same feature extractor in an end-to-end training manner, this model is worse than the baseline, which indicates the importance of motion guidance.

Method (c) incorporates the pixel-level feature calibration. The pixel-wise motion information effectively enhances the information from nearby frames in feature aggregation.

Method (d) is the proposed instance-level calibration. It aligns proposal features by predicting movements among consecutive frames, and then aggregates them across time. It improves the overall performance by 3.5%, even better than the pixel-wise motion-guided features in Method (c).

Method (e) demonstrates that the pixel-wise motion-guided features (Method (c)) and the instance-wise motion-guided features (Method (d)) are complementary and can collaboratively improve the model. We utilize the motion pattern reasoning introduced in Sect. 3.4 to adaptively combine the two kinds of calibrated features, which further improves the performance from 77.1% to 78.1%.

To sum up, aggregating multi-frame features by explicitly modeling motion is necessary, and the combination of the two calibration modes further improves the final feature representations. Through the above modules, the overall mAP is improved from 73.6% to 78.1%.

Fig. 3. (Better viewed in color) Visualization of two typical examples: occluded and non-rigid objects. They show the respective strengths of the two calibration methods.

Table 3. Statistical analysis on different validation sets. The instance-level calibration is better when objects are occluded or move more regularly while the pixel-level calibration performs well on non-rigid motion. Combination of these two module can achieve best performance.

4.4 Case Study and Motion Pattern Analysis

We now take a deeper look at the detection results. To show that the two calibrated features have respective strengths, we split the validation dataset into subsets containing different typical samples. The first row in Table 3 shows the performance on occluded samples: we select 87,195 images from the validation set in which more than half of the bounding boxes are occluded. The instance-level calibration achieves better performance (74.1%) than the pixel-level calibration (73.0%). In terms of motion pattern, we use \(p_{nonri}\) to divide the dataset: objects in a snippet whose \(p_{nonri}\) exceeds a predefined threshold (set to 0.02 in our experiments) are considered non-rigid motion, otherwise rigid motion. From the second and third rows of Table 3, the instance-level calibration is better for modeling rigid motion while the pixel-level calibration has advantages in modeling non-rigid patterns. In particular, the adaptive combination distills their advantages and obtains the best performance.

We visualize the learned feature maps to better understand the two calibration methods. Figure 3(a) shows an occluded airplane at the bottom of the current frame. With a single-frame detector, the confidence of the category "airplane" is 0.17. Applying pixel-level calibrated features improves it to 0.48 (the third column); however, due to the occluded part, the quality of the warped features is undesirably reduced. The last column shows the instance-level calibration: since it uses the original feature maps of nearby frames, the confidence of "airplane" reaches 0.66. For the non-rigid objects in Fig. 3(b), both the direction and the trajectory change over time, and the parts of the dogs may have different motion tendencies, so it is difficult for the instance-level module to produce correct movements of the whole dog. The corresponding locations in the nearby frames are inaccurate, leading to an unsatisfactory score of 0.59. By contrast, the pixel-level calibration is flexible in modeling the dog's motion and appearance, achieving a higher confidence of 0.71.

4.5 Comparison with State-of-the-art Systems

We compare our model to existing state-of-the-art methods, which can be divided into two groups: end-to-end learned feature methods [2, 13, 30, 31] and post-processing based methods [5, 14, 15]. In terms of feature-level comparison, the proposed MANet achieves the best performance among these methods. [13] has a regression target similar to our instance movement learning, but is much inferior to our calibrated features. [30, 31] perform pixel-level feature aggregation; our model is better than these methods mainly due to the robustness of its motion prediction, as analyzed in Sect. 4.4.

Table 4. Performance comparison with state-of-the-art systems on the ImageNet VID validation set. The average precision (in %) for each class and the mean average precision over all classes are provided.

Since the MANet aims to improve feature quality in video frames, it can further incorporate bounding-box post-processing techniques to improve recognition accuracy. When combined with the post-processing method [8], the MANet achieves better performance (from 78.1% to 80.3%), which still outperforms the other strong baselines [5, 14, 15].

To sum up, the comparison among feature-based methods is most relevant to our motivation. Our model focuses on end-to-end feature learning and has clear advantages among these methods. In addition, we demonstrate that the MANet can be further improved by post-processing and achieves state-of-the-art performance.

4.6 Performance and Runtime Evaluation

Let \({O(\cdot )}\) denote the time spent on each component: the main model \(\mathcal {N}\) \((\mathcal {N}_{feat}+\mathcal {N}_{rpn}+\mathcal {N}_{rfcn})\), the flow estimation \(\mathcal {F}\), the pixel-level feature warping \(\mathcal {W}\), the instance-level regression Ins and the occlusion prediction Ocu. When aggregating one adjacent frame, we have:

$$\begin{aligned} \begin{aligned}&O(\mathcal {N})=82.8\,\mathrm{ms}\gg O(\mathcal {F})=6.8\,\mathrm{ms}> \\&O({Ocu})=2\,\mathrm{ms}>O({Ins})=1.5\,\mathrm{ms} > O(\mathcal {W})=0.8\,\mathrm{ms} \end{aligned} \end{aligned}$$
(13)

i.e., the aggregation modules add negligible overhead compared to \(\mathcal {N}\).

For testing, we aggregate k nearby frames to enhance the reference frame. The performance and runtime for varying k are listed in Table 5. Notice that when aggregating 4 nearby frames, our model achieves 77.58% mAP, which exceeds the performance of [30], where 20 nearby frames are aggregated.

Table 5. Results obtained by using different k in inference. The runtime contains data processing which is measured on an NVIDIA Titan X Pascal GPU.

5 Conclusions

We propose an end-to-end learning framework for video object detection that aggregates multi-frame features in a principled way. We model the motion among consecutive frames in two different ways and combine them to further improve the performance of the model. We conduct an extensive ablation study to demonstrate the effectiveness of each module, and give an in-depth analysis of their respective strengths in modeling different motion patterns. The proposed model achieves 80.3% mAP on the large-scale ImageNet VID dataset with a ResNet-101 backbone, outperforming existing state-of-the-art results.