1 Introduction

Video object segmentation is a fundamental task in computer vision that combines pixel-level segmentation with the spirit of object tracking. Compared with object detection or tracking, it aims to delineate the exact object region at the pixel level. Video object segmentation is therefore a complicated but practical task, with growing prospects in public security and surveillance technology, and it has attracted increasing attention. Recent methods [5, 18] are mostly based on deep convolutional neural networks (DCNNs) and achieve encouraging scores on data sets such as DAVIS [17] and SegTrack v2 [15]. However, many difficulties remain to be overcome.

When the environment is complex, the probability that objects and background share similar appearances increases dramatically. This makes segmentation more difficult and error-prone, and the confusion between background and foreground inevitably lowers segmentation accuracy. Videos, however, exhibit relevance and continuity between frames, a characteristic that many existing works [5, 18, 23] do not take into consideration. If the difference between two consecutive frames is large or the foreground changes abruptly, some background regions may easily be blended into the foreground object area. To address this problem, we introduce a tracking-based method as an additional supplement. The tracking result provides a rough proposal bounding box in which the target is located, which helps filter out many background pixels. Even when some regions are similar in appearance, considering motion similarity and the relationship among frames can eliminate many misclassifications. In turn, the segmentation mask provides strong evidence for the tracker when fast motion or occlusion occurs.

In this paper, we propose a task-complementary algorithm that integrates a single-object tracker to assist the segmentation network at the bounding-box level. As shown in Fig. 2, the tracker generates a bounding box (left) that can be exploited as a candidate region for the segmentation task. The segmentation network is first applied to generate the output mask (right) of the input frame; at the same time, a correlation filter that shares the same CNN features with the segmentation network tracks the object. The segmentation result is modified by the tracking output and is also used to adjust the parameters of the tracker.

Fig. 1. The overall scheme of the proposed algorithm

Fig. 2. An example of the tracking result and the segmentation result for the same frame

2 Related Work

2.1 Video Object Segmentation

Semi-supervised video object segmentation aims to segment the foreground object given the annotation of the first frame. With the success of the DAVIS challenge [17], many CNN-based segmentation methods have been proposed. These methods can be roughly divided into two groups. The first treats a video as a collection of images and processes each frame independently. The most representative method is OSVOS [5], which uses a well-known image semantic segmentation network [6]. To focus on a specific object in each video sequence, OSVOS fine-tunes the network with the first frame and its ground-truth. The second group exploits the temporal continuity between adjacent frames and tries to formulate it. MaskTrack [18] adds the mask from the previous frame as an input channel alongside RGB and learns to extract features from static images. SegFlow [7] designs a network that jointly trains on object segmentation and optical flow. VPN [13] uses bilateral filtering to build long-range temporal correlation. Another technique for exploring the relationship between adjacent frames is the RNN: Tokmakov et al. [20] propose to build a "visual memory" of the video with a convolutional recurrent unit. Although these works try to exploit inter-frame information, it remains difficult to segment a specific object when the scene is complicated. With the assistance of object tracking, our method handles such cases effectively.

2.2 Object Tracking

Object tracking predicts the location and scale of a target that is specified by a bounding box in the first frame of a video sequence, and it has developed greatly in the last decade. The correlation filter, thanks to its high efficiency, has been widely used in object tracking. Bolme et al. developed the MOSSE filter [4], which can operate at 669 FPS. Although MOSSE only estimates the location of the target, it can be extended to estimate changes in scale and rotation by filtering the log-polar transform of the input patch. However, MOSSE is trained directly on the input image rather than on features extracted from it, and it is single-channel, which is unsuitable for multi-channel features (e.g., deep features). A few years later, the correlation filter was extended to multi-channel feature representations [3, 11, 12]. Danelljan et al. proposed the DSST tracker [9], whose main contribution is a method to estimate the target scale by computing correlation scores over a scale-pyramid representation. Danelljan et al. also conducted further research on correlation filters in [10] and [8]: C-COT [10] learns a continuous correlation filter by introducing an interpolation function, while ECO [8], an advanced version of C-COT, learns and updates the correlation filter more efficiently and effectively, greatly improving tracking performance.

In the remainder of the paper, we introduce the overall architecture in detail in Sect. 3, present experimental results and analysis in Sect. 4, and conclude in Sect. 5.

3 Proposed Algorithm

Our proposed network can be divided into three parts, as shown in Fig. 1. The first is a standard segmentation network consisting of a CNN encoder and a decoder network (denoted SegNet) that generates output masks. The second is a correlation filter that shares the CNN features extracted by the FeatureNet with the SegNet and works concurrently with it. The third fuses the outputs of the SegNet and the tracker to produce the final result, which is fed back into the correlation filter to improve both tasks.

3.1 Feature Extractor and Segmentation Network

We use OSVOS [5], which implements a VGG-based [19] fully convolutional architecture, as our base model for video object segmentation. The first five convolutional blocks form the feature network (FeatureNet), and the remaining parts constitute the segmentation network (SegNet). Training follows a general-to-specific scheme. The entire network is pre-trained on the ImageNet data set as a base model and then further trained on DAVIS, yielding the parent model. At this point the network can already separate foreground from an image, but it is still weak at segmenting a specific object. Finally, after fine-tuning with the ground-truth mask of the first frame, the network learns to outline the given object against the background.

Another notable problem is that background pixels occupy most of the image. This imbalance between foreground object and background may cause more and more pixels in the frame to be classified as background. To avoid this, we apply a class-balancing cross-entropy loss function.

$$\begin{aligned} \begin{aligned} L=-\beta \sum _{j\in Y_{+}}&\log P(y_{j}=1|X) \\&-(1-\beta )\sum _{j\in Y_{-}}\log P(y_{j}=0|X) \end{aligned} \end{aligned}$$
(1)

where X is the input image and \(y_{j}\in \{0,1\}\), \(j=1, \ldots , |X|\), are the per-pixel binary labels. \(Y_{+}\) and \(Y_{-}\) denote the sets of positively and negatively labeled pixels respectively, and \(\beta = \frac{|Y_{-}|}{|Y|}\). The probability \(P(\cdot )\) is obtained from the sigmoid activation of the output layer.
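For concreteness, the following is a minimal numpy sketch of Eq. (1); the function name, the `eps` clamp, and the array layout are our own illustrative choices, and in practice the loss is implemented inside the training framework:

```python
import numpy as np

def balanced_bce_loss(prob, label, eps=1e-7):
    """Class-balancing cross-entropy loss of Eq. (1).

    prob:  (H, W) sigmoid outputs, i.e. P(y_j = 1 | X).
    label: (H, W) binary ground-truth mask (1 = foreground).
    """
    pos = label == 1                       # Y+
    neg = label == 0                       # Y-
    beta = neg.sum() / label.size          # beta = |Y-| / |Y|
    loss_pos = -beta * np.log(prob[pos] + eps).sum()
    loss_neg = -(1.0 - beta) * np.log(1.0 - prob[neg] + eps).sum()
    return loss_pos + loss_neg
```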

For post-processing, many algorithms use a dense CRF [6, 24] to refine the segmentation results, but these approaches are costly in time and computation. In the proposed method, we instead apply an edge detection method, HED [22], to detect contours over the whole image, and then use the Ultrametric Contour Map [2] to produce a superpixel representation [1] of the image. A superpixel is selected as foreground by majority voting, i.e., when most of its pixels (over \(50\%\), set experimentally) are labeled foreground. With this procedure, we achieve results similar to CRF models while greatly accelerating processing.
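The voting step can be sketched as follows, assuming a superpixel label map is already available (e.g., from thresholding the Ultrametric Contour Map); the function name and the 50% default mirror the description above:

```python
import numpy as np

def superpixel_vote(raw_mask, superpixels, thresh=0.5):
    """Refine a binary mask by majority voting inside each superpixel.

    raw_mask:    (H, W) binary mask from the SegNet.
    superpixels: (H, W) integer superpixel labels (e.g. from a UCM).
    A superpixel is kept as foreground when more than `thresh`
    of its pixels are foreground in the raw mask.
    """
    refined = np.zeros_like(raw_mask)
    for sp in np.unique(superpixels):
        region = superpixels == sp
        if raw_mask[region].mean() > thresh:
            refined[region] = 1
    return refined
```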

3.2 Tracker

We choose a correlation filter as our tracker for its high efficiency and robustness. Instead of using handcrafted features [9], we apply hierarchical convolutional features [16] as the input of the correlation filter, since convolutional features provide better performance and can be reused by both the segmentation network and the tracker.

We briefly review the key principles of the correlation filter. By exploiting the properties of circulant matrices, the correlation filter can learn from a relatively large number of training samples effectively and perform fast tracking in the Fourier domain. Let \(x^{l}\) be the feature map of the l-th layer with size \(M\times N\times D\), where M, N and D represent the width, height and number of channels, respectively. Taking advantage of the circulant-matrix property and appropriate padding, all the circular shifts \(x_{m,n}\), \((m,n)\in \{0, 1, \ldots , M-1 \}\times \{0, 1, \ldots , N-1 \}\), are considered as training samples. Each training sample \(x_{m,n}\) is assigned a soft label \(y_{m,n}\) generated by a Gaussian function, which takes the value 1 at the centred target and decays smoothly to 0 for the other shifts. The goal of training is to find a function \(f(z)=w^{T}z\) that minimizes the following cost:

$$\begin{aligned} \begin{aligned} w = \arg \min _{w} \sum _{m,n}|w \cdot x_{m,n}-y_{m,n}|^{2} + \lambda \, {\parallel } w {\parallel }^{2} \end{aligned} \end{aligned}$$
(2)

where \(w \cdot x_{m,n} = \sum _{d=1}^{D}w_{m,n,d}^{T}x_{m,n,d}\), and \(\lambda \) is a regularization parameter. In the Fourier domain, the learned filter for the d-th channel \((d \in \{1, \ldots , D\})\) is given by:

$$\begin{aligned} \begin{aligned} w^{d}= \mathcal {F}^{-1}(\frac{\mathcal {F}(y)\odot \mathcal {F}(\bar{x}^{d})}{\sum ^{D}_{i=1}\mathcal {F}(x^{i}) \odot \mathcal {F}(\bar{x}^{i})+\lambda }) \end{aligned} \end{aligned}$$
(3)

where \(\mathcal {F}\) and \(\mathcal {F}^{-1}\) denote the Fourier transform and its inverse, the operator \(\odot \) is the element-wise product, and the bar denotes complex conjugation. Given an image patch in the new frame, the feature map of the l-th layer is denoted z, with size \(M\times N\times D\). The score map \(\hat{y}_{l}\) for the l-th correlation filter is calculated as

$$\begin{aligned} \begin{aligned} \hat{y}_{l}=\mathcal {F}^{-1}(\sum ^{D}_{d=1}\mathcal {F}(w^{d}) \odot \mathcal {F}(\bar{z}^{d})) \end{aligned} \end{aligned}$$
(4)

The optimal position for the l-th correlation filter is obtained by locating the maximum of the score map \(\hat{y}_{l}\).
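A compact numpy sketch of the closed-form training (Eq. 3) and detection (Eq. 4) steps is given below, for one of the per-layer filters. The regularization value is illustrative, and practical details such as cosine windowing and feature interpolation are omitted as our own simplification:

```python
import numpy as np

def train_filter(x, y, lam=1e-4):
    """Learn a multi-channel correlation filter in the Fourier domain (Eq. 3).

    x: (M, N, D) feature map of the training patch.
    y: (M, N) Gaussian-shaped soft label centred on the target.
    Returns the filter in the frequency domain, one plane per channel.
    """
    X = np.fft.fft2(x, axes=(0, 1))              # per-channel 2-D FFT
    Y = np.fft.fft2(y)
    denom = (X * np.conj(X)).sum(axis=2) + lam   # shared over all channels
    return Y[..., None] * np.conj(X) / denom[..., None]

def detect(w_hat, z):
    """Score map for a search patch z (Eq. 4); returns the peak location."""
    Z = np.fft.fft2(z, axes=(0, 1))
    response = np.fft.ifft2((w_hat * Z).sum(axis=2)).real
    return np.unravel_index(response.argmax(), response.shape)
```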

The tracking system works in parallel with the SegNet. We apply hierarchical convolutional features to represent the object, integrating low-level features from the lower layers of the CNN with high-level features from the upper layers. High-level semantic features help handle appearance distortion, while low-level features yield accurate localization. The tracking process proceeds as follows (a code sketch is given after the list):

  • Extracting features from layers \(conv3\_3\), \(conv4\_3\), and \(conv5\_3\) around the object, according to the ground-truth of the first frame, to train three correlation filters respectively.

  • At time \(t\) \((t>1)\), interpolating the features from these convolutional layers around the location predicted at time \(t-1\), and feeding them into the three correlation filters to obtain their score maps.

  • Updating the predicted location from upper filters (e.g., \(conv5\_3\)) to lower filters (e.g., \(conv3\_3\)), taking the output of the upper filter as a basis or constraint for the lower one. Let \(\arg \max _{m,n}f_{l}(m,n)\) denote the output location of the l-th correlation filter; the output of the \((l-1)\)-th correlation filter can then be written as:

    $$\begin{aligned} \begin{aligned}&\arg \max _{m,n}f_{l-1}(m,n)+\gamma f_{l}(m,n) \\&s.t.\, |m-\hat{m}|+|n-\hat{n}|\le r. \end{aligned} \end{aligned}$$
    (5)

    where the constraint ensures that the algorithm searches only within a radius r of the location \((\hat{m}, \hat{n})\) predicted by the upper layer, and \(\gamma \) is a regularization weight on the upper-layer response. The predicted location is obtained by optimizing Eq. (5).

  • Updating three correlation filters with the tracking results.
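The following sketch illustrates the coarse-to-fine update of Eq. (5) over the three per-layer score maps; the default \(\gamma \) values and search radius here are illustrative only (our actual settings are given in Sect. 4.1):

```python
import numpy as np

def hierarchical_locate(responses, gamma=(0.5, 1.0), r=2):
    """Coarse-to-fine location estimate over per-layer score maps (Eq. 5).

    responses: score maps ordered [conv3_3, conv4_3, conv5_3]; each (M, N).
    gamma:     weight of the upper-layer response at each refinement step.
    r:         L1 search radius around the upper layer's prediction.
    """
    # Start from the coarsest, most semantic layer.
    m_hat, n_hat = np.unravel_index(responses[-1].argmax(),
                                    responses[-1].shape)
    for l in range(len(responses) - 2, -1, -1):
        f_low, f_up = responses[l], responses[l + 1]
        M, N = f_low.shape
        best, best_score = (m_hat, n_hat), -np.inf
        for m in range(max(0, m_hat - r), min(M, m_hat + r + 1)):
            for n in range(max(0, n_hat - r), min(N, n_hat + r + 1)):
                if abs(m - m_hat) + abs(n - n_hat) > r:
                    continue                 # the constraint in Eq. (5)
                score = f_low[m, n] + gamma[l] * f_up[m, n]
                if score > best_score:
                    best, best_score = (m, n), score
        m_hat, n_hat = best
    return m_hat, n_hat
```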

3.3 Outputs Fusion

Given the tracking bounding box, most of the irrelevant background pixels that lie outside the box can be ignored naturally, even when some of them are close to the target region in appearance. The concrete implementation steps are as follows (a code sketch is given after the list):

  • Firstly, since the rough segmentation result from the SegNet consists of several connected foreground components, we compute the bounding rectangle of each connected component. If two bounding rectangles overlap, we cluster them into the same group. Going through the entire image, we obtain one bounding rectangle for every group.

  • Secondly, we enlarge the box predicted by the tracker by a factor of k and compute its overlap with the group bounding rectangles generated in the first step. If a group bounding rectangle overlaps the tracker bounding box, we incorporate the connected components it contains into the foreground region; group bounding rectangles that lie apart are excluded directly.
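A sketch of this fusion procedure is given below, using scipy's connected-component labeling; the greedy one-pass grouping is a simplification of the group construction described above:

```python
import numpy as np
from scipy import ndimage

def intersects(a, b):
    """Axis-aligned boxes [x0, y0, x1, y1]; True if the boxes overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def fuse(mask, track_box, k=1.0):
    """Keep only connected components whose group rectangle touches the
    tracking box enlarged k times; a sketch of the fusion in Sect. 3.3."""
    labels, _ = ndimage.label(mask)
    slices = ndimage.find_objects(labels)
    boxes = [[s[1].start, s[0].start, s[1].stop, s[0].stop] for s in slices]

    # Greedily merge components whose rectangles overlap into groups.
    groups = []                      # each entry: [component ids, rectangle]
    for i, b in enumerate(boxes):
        for ids, rect in groups:
            if intersects(b, rect):
                ids.append(i)
                rect[0], rect[1] = min(rect[0], b[0]), min(rect[1], b[1])
                rect[2], rect[3] = max(rect[2], b[2]), max(rect[3], b[3])
                break
        else:
            groups.append([[i], list(b)])

    # Enlarge the tracking box by k around its centre.
    cx, cy = (track_box[0] + track_box[2]) / 2, (track_box[1] + track_box[3]) / 2
    w, h = (track_box[2] - track_box[0]) * k / 2, (track_box[3] - track_box[1]) * k / 2
    big = [cx - w, cy - h, cx + w, cy + h]

    out = np.zeros_like(mask)
    for ids, rect in groups:
        if intersects(rect, big):    # group survives only if it meets the box
            for i in ids:
                out[labels == i + 1] = 1
    return out
```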

With this mechanism, the segmentation network and the tracking correlation filter complement each other as the video is processed. Ideally, the group bounding rectangle completely covers the tracking bounding box, but in practice the tracking result may deviate considerably from its ideal location. To measure the reliability of the tracker, we therefore keep a counter (cnt) of how many times the tracking bounding box drifts far from the segmentation group bounding rectangle, i.e., the IoU falls below a threshold (\(T_{min}\)). If cnt records more than 4 such 'escapes' of the tracking box from the segmentation bounding rectangle, we consider the tracking result unreliable and reset the tracker: if the tracking result deteriorates, we re-initialize tracking from the bounding rectangle of the segmentation result at that intermediate frame of the video sequence.
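The reliability check can be sketched as a small stateful monitor; exactly when `cnt` is cleared during normal tracking is our own assumption, as the text above only specifies the escape condition:

```python
class TrackerMonitor:
    """Counts 'escapes' of the tracking box from the segmentation
    group rectangle and signals when the tracker should be reset."""

    def __init__(self, t_min=0.6, max_escapes=4):
        self.t_min = t_min              # IoU threshold T_min
        self.max_escapes = max_escapes  # tolerated escape count
        self.cnt = 0

    def update(self, iou):
        """iou: overlap between the tracking box and the group rectangle.
        Returns True when the tracker should be re-initialized from the
        segmentation bounding rectangle of the current frame."""
        if iou < self.t_min:
            self.cnt += 1               # the box 'escaped' this frame
        if self.cnt > self.max_escapes:
            self.cnt = 0                # assumption: clear after a reset
            return True
        return False
```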

4 Experiments and Results

The proposed algorithm shows a distinct improvement over common segmentation-only methods in complicated scenes. The experimental results in the following sections were obtained with Caffe and Matlab R2017b on two Nvidia Titan XP 12 GB GPUs.

4.1 Data Set and Parameter Settings

Our experiments are conducted mainly on the DAVIS 2016 data set [17], which includes 50 videos from different scenes. The training set contains 30 of the videos and the rest form the test set. One object in each video is selected as the target, and only the very first frame is labeled with a binary ground-truth mask. The videos are provided as image sequences at 480p and 1080p resolution; we use 480p in our experiments. DAVIS covers the main challenges of video segmentation, including background interference, similar objects, distortion, motion blur, fast-moving objects, low resolution, occlusion, scale variation, and objects that leave the frame.

For the SegNet, we set the parameters according to OSVOS [5]. For the tracker, the regularization coefficient \(\gamma \) is set to 1, 0.5 and 0.02 for layers \(conv4\_3\), \(conv3\_3\) and \(conv5\_3\), respectively. Experiments show that the threshold r has little influence on the results, so it is also acceptable to decide the final location by weighted voting. For the output fusion, the scale factor is \(k=1.0\) and the threshold is \(T_{min}=0.6\), both set experimentally.

4.2 Evaluation Metrics

i. Region Similarity, defined as the overlap between the predicted region and the ground-truth, denoted by J:

$$\begin{aligned} \begin{aligned} J=\frac{|M\,\bigcap \,G|}{|M\,\bigcup \,G|} \end{aligned} \end{aligned}$$
(6)

where G represents the ground-truth foreground mask and M the predicted foreground mask (a code sketch of both J and F follows the metric list).

ii. Contour Accuracy, which measures the accuracy of the output at the contour level and is denoted by F:

$$\begin{aligned} \begin{aligned} F=\frac{2P_{c}R_{c}}{P_{c}+R_{c}} \end{aligned} \end{aligned}$$
(7)

where \(P_{c}\) and \(R_{c}\) are the precision and recall of the predicted contour, computed against the contour of the ground-truth mask.

iii. Temporal Stability, which measures the stability of the predicted foreground region and is computed as a matching cost between the contours of temporally contiguous frames. Further details of these evaluation metrics are given in [17].
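For reference, J (Eq. 6) and F (Eq. 7) can be computed for binary masks roughly as follows; the dilation-based contour matching is a common approximation rather than the exact protocol of [17]:

```python
import numpy as np
from scipy import ndimage

def region_similarity(pred, gt):
    """Jaccard index J of Eq. (6) between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def boundary(mask):
    """One-pixel-wide boundary of a binary mask (morphological gradient)."""
    return mask & ~ndimage.binary_erosion(mask)

def contour_accuracy(pred, gt, tol=2):
    """F-measure of Eq. (7). Contour points are matched within `tol`
    pixels via dilation, approximating the exact bipartite matching."""
    pb, gb = boundary(pred.astype(bool)), boundary(gt.astype(bool))
    struct = ndimage.generate_binary_structure(2, 1)
    gb_dil = ndimage.binary_dilation(gb, struct, iterations=tol)
    pb_dil = ndimage.binary_dilation(pb, struct, iterations=tol)
    p_c = (pb & gb_dil).sum() / max(pb.sum(), 1)   # contour precision
    r_c = (gb & pb_dil).sum() / max(gb.sum(), 1)   # contour recall
    return 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
```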

4.3 Results and Statistics

Table 1 compares the proposed algorithm with other prevalent methods, where M, O, and D in the header denote mean, recall, and decay, respectively. As the table shows, the proposed algorithm outperforms the other approaches; we attain a \(1.38\%\) gain over our base model OSVOS [5] in region similarity (J(M)), the most significant accuracy metric. On the other evaluation metrics, our algorithm also reaches the best or state-of-the-art performance. We further discuss the behaviour of our method on each video sequence of the DAVIS data set below.

In Table 2, we list the region similarity of the state-of-the-art methods and our proposed method for every video sequence in the DAVIS data set. Our algorithm achieves results similar to our baseline [5] on simpler sequences such as Blackswan, Camel and Dog, where the objects remain discriminative against the background throughout the scene. However, when the scene is complicated and heavily influenced by other moving objects, as in Car-shadow, Horsejump-high and Libby, the tracker assistance distinctly enhances the segmentation output. The tracker thus proves helpful in improving robustness against similar objects and complex backgrounds.

We propose to use a correlation filter to complement the segmentation network, which helps exclude non-target regions. Figure 3 gives qualitative results on some challenging frames that include complex backgrounds, occlusions and similar objects. Many regions that lie far from the target are misclassified by the baseline model, but when guided by the tracking bounding boxes, the outlying green regions are easily excluded.

Table 1. Comparison with other state-of-the-art methods
Table 2. Comparison of averaged region similarity with other state-of-the-art methods on each video in DAVIS.

On the other hand, the segmentation network can assist when the tracker fails on fast-moving objects. As shown in Fig. 4, the tracker often fails when the object moves abruptly or at high speed, such as a drifting car or an excited dancer (right column). The segmentation network is less affected in such circumstances, so we can use the segmentation result to modify the bounding box generated by the tracker (left column). The modified bounding box on the left is much more accurate than the raw one on the right.

Fig. 3. Segmentation results with tracking (best viewed in color). Green masks are segmentation results from the baseline model and red boxes are the tracking results. By integrating the two methods, we obtain a finer result (colored in green).

Fig. 4. Tracking modification by segmentation. Green regions are the segmentation results and red boxes are the tracking results. (Color figure online)

5 Conclusion

In this paper, we propose to use an object tracking mechanism as a complement to video segmentation, working alternately with the segmentation network to refine the final results. The experimental results show that the proposed architecture is superior to other state-of-the-art algorithms and is considerably more robust in complicated scenes.