
1 Introduction

Visual tracking is one of the fundamental problems in computer vision and has a variety of applications, such as human-computer interaction, smart surveillance systems and autonomous driving. Given the initial state of the target object in the first frame of a video sequence, visual tracking aims to describe the movement of the target object by motion modeling and to keep estimating its trajectory under the hard computational constraints of a real-time vision system, despite challenging factors such as deformation, illumination variations, motion blur and occlusion.

Discriminative Correlation Filters (DCF) form an efficient framework that learns to discriminate the target object from the surrounding background by solving a ridge regression problem extremely efficiently [4, 15]. Recently, DCF based trackers [10, 20] have shown great performance improvements on several popular tracking benchmarks [18, 31, 32] in terms of accuracy, robustness and speed. Compared to other tracking algorithms, DCF owes its performance to a Fourier domain formulation as a cheap element-wise operation and to dense response scores over all search locations. In contrast, the sparse sampling used by other algorithms only generates sparse response scores and may lose the target object in complex environments.

Instead of using multi-dimensional features [12], robust scale estimation [10] or non-linear kernels [15], recent DCF based trackers try to learn the DCF directly over a deep convolutional neural network (CNN) [7, 11, 19] and benefit from end-to-end training [26]. This has shown that, with discriminative CNN feature maps pretrained offline on the ImageNet dataset, DCF gains substantial improvements in performance. However, the aforementioned algorithms simply integrate a pre-trained deep model with DCF, without considering the limitations of CNN features.

In this paper, we propose a non-local neural network learning scheme for better visual tracking. We first modify a given CNN model into a fully-convolutional network so that we can provide a search image of arbitrary size and obtain more accurate localization. Then we reformulate DCF as a convolutional layer without bias on top of the CNN feature map, which generates the local response map of the input image. Meanwhile, the CNN feature map is fed into a non-local layer to capture the global response map, which is responsible for modeling background objects. Similar to DCF, the final response map, which combines the local and global response maps, provides dense response scores over all search locations in an input image. For scale estimation, rather than feeding several search patches extracted at different scales around the center location of the target, we crop feature maps at different scales from the original feature map produced by the CNN. This helps to speed up our model, requiring only one forward pass. Moreover, the fully-convolutional DCF layer and the non-local layer are fully differentiable, which allows us to train and update the model through the back-propagation algorithm (Fig. 1).

Fig. 1. A comparison of our proposed model NNVT with C-COT [11] and CREST [26]. Our tracker outperforms these DCF based trackers and is more robust in scale estimation.

In summary, our contributions are as follows:

  • We propose a non-local network architecture for visual tracking that combines a fully-convolutional DCF layer and a non-local layer, which capture the local and global response maps for better discriminating the target object from background objects.

  • We replace image pyramids with feature map pyramids to save more than half the time in the scale estimation process.

  • Extensive experiments on three popular tracking benchmarks, OTB-2013 [31], OTB-2015 [32] and VOT-2016 [18], show that our tracker achieves state-of-the-art performance.

2 Related Work

Visual tracking has been widely studied in the literature [25, 33]. In this section we mainly introduce the two aspects most related to our work.

Correlation Filter: Since the seminal work of Bolme et al. [4] employed correlation filters for visual tracking, several algorithms [2, 9, 10, 15] have been built upon it and have devoted notable efforts to its improvement, such as using multi-dimensional features [12], robust scale estimation [10], reducing boundary effects [13] and non-linear kernels [15]. Recently, CNN based DCF trackers have been introduced [2, 6, 26]; they reformulate DCF as a single convolutional layer and apply it on top of pretrained CNN models to obtain the response map. Since DCF is interpreted as a differentiable convolutional layer, the loss can easily be back-propagated through the DCF layer to the CNN model.

Non-local Neural Network: Non-local operations [30] are an extension of non-local means, a classical filtering algorithm that computes a weighted mean of all pixels in an image. They are an efficient and simple approach for capturing long-distance dependencies in CNNs by enlarging receptive fields. Unlike convolutional operations, which process a local neighborhood in space and need to be stacked many times to capture long-range dependencies, non-local operations capture long-range dependencies directly by computing interactions between any two positions, regardless of their positional distance. Thus, we can apply non-local operations to capture a global feature map that models the background objects with only a few layers (Fig. 2).

Fig. 2. Overview of our proposed tracking architecture. A search image is passed through the pretrained CNN for feature extraction; the feature map is then fed into the fully-convolutional DCF layer and the non-local layer to capture the local and global response maps respectively, and the two are summed to produce the final response map.

3 Learning Non-local Representation

In this section, we present a convolutional framework for learning the combination of local and global response maps. The local response map is captured through a fully-convolutional layer formulated as a standard DCF, while the global response map, obtained by non-local operations, computes the response at a position as a weighted sum of the features at all positions. We then sum the local and global response maps to obtain the final response map.

3.1 Fully-Convolutional DCF Reformulation

We adopt the VGG16 network [24] as the base model and modify it so that it can be interpreted as a feature extractor for the DCF layer.

A fully-convolutional neural network is a transformation that commutes with translation [3]. It can benefit from a much larger input image with more background information, rather than a search image of the target size. To be specific, given the position of the target object in the first frame of a video sequence, we can obtain an input image centered on the target object whose size is 5 times that of the target, without worrying about a decrease in accuracy. For a translation operator \(L_{\tau }\) and an input image X, we have \((L_{\tau }X)[\mu ]=X[\mu - \tau ]\); a neural network \(f_{\rho }\) with learnable parameters \(\rho \) that maps the input image X to the feature map \(f_{\rho }(X)\) is fully-convolutional with integer stride n if

$$\begin{aligned} f_{\rho }(L_{n\tau }X)=L_{\tau }f_{\rho }(X) \end{aligned}$$
(1)

for any translation \(\tau \).
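To make this property concrete, the following is a small PyTorch sketch (our illustration, not part of the original TensorFlow implementation) that checks Eq. 1 for a toy stride-2 convolution standing in for \(f_{\rho }\), comparing only interior columns to avoid boundary effects:

```python
import torch
import torch.nn as nn

# Toy check of Eq. 1, assuming a single stride-2 convolution stands in for f_rho.
torch.manual_seed(0)
f_rho = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1, bias=False)  # stride n = 2

x = torch.randn(1, 1, 64, 64)
tau = 2
x_shifted = torch.roll(x, shifts=2 * tau, dims=3)   # L_{n*tau} X: shift the input by n*tau pixels

lhs = f_rho(x_shifted)                               # f_rho(L_{n*tau} X)
rhs = torch.roll(f_rho(x), shifts=tau, dims=3)       # L_tau f_rho(X)
# Equality holds away from the boundary, where padding and the circular roll differ.
print(torch.allclose(lhs[..., 4:-4], rhs[..., 4:-4], atol=1e-6))   # True
```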

On the basis of the fully-convolutional neural network, we follow the DCF reformulation in [26] and turn it into a fully-convolutional DCF layer. We crop a search patch (also denoted X) from the given input image, which represents the object of interest and is typically larger than the target object. The DCF layer learns a discriminative classifier with correlation filter \(\mathcal {W}\) by minimizing the L2 loss between the response map and the corresponding Gaussian label Y:

$$\begin{aligned} \mathcal {W}^{*} = \mathop {\arg \min }_{\mathcal {W}} \Vert \mathcal {W} \star f_{\rho }(X) - Y \Vert ^{2} + \lambda \Vert \mathcal {W} \Vert ^{2} \end{aligned}$$
(2)

where \(\lambda \) is the regularization parameter. Equation 2 amounts to predicting the target translation through an exhaustive search of the maximum value in the response map.

We take \(\mathcal {W} \star f_{\rho }(X)\) as a convolution operation on the feature map \(f_{\rho }(X)\), which can be realized by a single convolutional layer without a bias term. Thus we can reformulate Eq. 2 within a convolutional neural network, which learns a convolutional layer by minimizing the following loss function:

$$\begin{aligned} L(\mathcal {W}) = \Vert \mathcal {F}_{\mathcal {W}}(X) - Y \Vert ^{2} + \lambda r(\mathcal {W}) \end{aligned}$$
(3)

where \(\mathcal {F}_{\mathcal {W}}(X)=\mathcal {W} \star f_{\rho }(X)\), and \(\lambda r(\mathcal {W})\) is the weight decay. We simply use the first frame of the video sequence as the training image and set the batch size N of each iteration to 1. Following the DCF formulation, we take the L2 norm as \(r(\mathcal {W})\). In order to cover the target object, we set the size k of the filter \(\mathcal {W}\) equal to the size of the target object after several down-samplings.
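As an illustration of this reformulation, the sketch below (in PyTorch rather than our TensorFlow code; the feature map, kernel size and hyper-parameters are placeholder assumptions) trains a single bias-free convolution against a Gaussian label with an L2 loss and weight decay:

```python
import torch
import torch.nn as nn

# Placeholder sizes: C feature channels, an H x W feature map, a k x k filter.
C, H, W, k = 64, 31, 31, 7
feat = torch.randn(1, C, H, W)                       # f_rho(X): features of the search patch

# Y: 2-D Gaussian label with peak value 1.0 centered on the target.
yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
label = torch.exp(-(((yy - H // 2) ** 2 + (xx - W // 2) ** 2) / (2 * 2.0 ** 2)))
label = label.view(1, 1, H, W).float()

dcf = nn.Conv2d(C, 1, kernel_size=k, padding=k // 2, bias=False)           # the DCF layer W
optimizer = torch.optim.SGD(dcf.parameters(), lr=1e-6, weight_decay=1e-4)  # lambda * r(W)

for _ in range(100):                                  # first frame only, batch size N = 1
    response = dcf(feat)                              # F_W(X) = W * f_rho(X)
    loss = ((response - label) ** 2).sum()            # L2 loss of Eq. 3
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```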

In general, a target object with a large size provides more representation in the feature map and results in more accurate and robust tracking, so we tend to enlarge a small target object to an appropriate size in the search patch through bilinear interpolation. However, as mentioned above, the size k of the filter \(\mathcal {W}\) must be set equal to that of the target object; this constraint greatly increases the computation of the DCF layer and slows down tracking.

Considering the above weakness, we introduce a large separable convolution layer [27] to act as the DCF layer; its structure is illustrated in Fig. 3. A large separable convolution layer is composed of two separate convolution sequences, one consisting of a \(k \times 1\) convolution layer followed by a \(1 \times k\) convolution layer, and the other of a \(1 \times k\) convolution layer followed by a \(k \times 1\) convolution layer.

The standard DCF layer is parameterized by a convolution kernel \(\mathcal {W}\) of size \(k \times k \times C \times 1\), where C is the number of channels of the feature map, and it has a computational cost of:

$$\begin{aligned} k \cdot k \cdot C \cdot 1 \cdot H \cdot W \end{aligned}$$
(4)

where the computational cost depends on the kernel size \(k \times k\), the number of feature map channels C and the feature map size \(H \times W\).

In comparison, the large separable convolution layer costs:

$$\begin{aligned} (k \cdot 1 \cdot C \cdot 1 \cdot H \cdot W + k \cdot 1 \cdot 1 \cdot 1 \cdot H \cdot W) \cdot 2 \end{aligned}$$
(5)

which is the sum of two separate convolution sequences.

By replacing the standard DCF layer with the two separate convolution sequences, we obtain a reduction in computation of:

$$\begin{aligned} \begin{aligned}&\frac{(k \cdot 1 \cdot C \cdot 1 \cdot H \cdot W + k \cdot 1 \cdot 1 \cdot 1 \cdot H \cdot W) \cdot 2}{k \cdot k \cdot C \cdot 1 \cdot H \cdot W} \\&= \frac{2(C + 1)}{k \cdot C} \end{aligned} \end{aligned}$$
(6)

In our architecture, we set k equal to the size of the target object and \(C = 64\); the large separable convolution layer requires only about 2/k of the computation of the standard DCF layer, with only a small reduction in performance.
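A possible realization of this layer is sketched below (PyTorch, with an assumed kernel size and channel count; this is our illustration of Fig. 3, not the released implementation):

```python
import torch
import torch.nn as nn

class LargeSeparableDCF(nn.Module):
    """Two bias-free convolution sequences (k x 1 then 1 x k, and 1 x k then k x 1)
    whose outputs are summed, replacing a single k x k DCF convolution."""
    def __init__(self, in_channels=64, k=7):
        super().__init__()
        p = k // 2
        self.seq1 = nn.Sequential(
            nn.Conv2d(in_channels, 1, (k, 1), padding=(p, 0), bias=False),
            nn.Conv2d(1, 1, (1, k), padding=(0, p), bias=False))
        self.seq2 = nn.Sequential(
            nn.Conv2d(in_channels, 1, (1, k), padding=(0, p), bias=False),
            nn.Conv2d(1, 1, (k, 1), padding=(p, 0), bias=False))

    def forward(self, x):
        return self.seq1(x) + self.seq2(x)

# Cost ratio of Eq. 6 for an example k = 7 and C = 64: close to 2/k.
k, C = 7, 64
print(2 * (C + 1) / (k * C))   # ~0.29
```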

Fig. 3. A large separable convolution layer. It replaces the standard \(k \times k\) DCF layer with two separate convolution sequences, reducing the computational cost to nearly 2/k of that of the standard layer.

Thus we obtain a DCF layer in the form of two separate convolution sequences with the L2 loss as the objective function; we refer to it as the fully-convolutional DCF layer.

3.2 Non-local Neural Network

The DCF layer captures the local response map using two separate convolution sequences with a fixed kernel size. As a result of the convolution operation, the DCF layer is only sensitive to the neighborhood of the target object; background objects far away from the center location of the target are not considered sufficiently, so the response is unlikely to agree with the ground-truth soft label. Some tracking algorithms use more discriminative deep features (e.g., DeepSRDCF [7] and C-COT [11]) or learn complex deep trackers (e.g., MDNet [21]). Instead of using more complex deep features, which may dramatically increase network computation and result in overfitting, we employ non-local operations as a residual layer to capture the global response map of background objects.

The non-local block [30] is a lightweight stack of several convolution layers and a softmax layer. We apply a non-local block in our neural network to compute the response at a position as a weighted sum of the feature map at all positions:

$$\begin{aligned} y_{i} = \frac{1}{J(x_{i})} \sum _{\forall j} g(x_{i}, x_{j})h(x_{j}) \end{aligned}$$
(7)

where y is the computed response map, i is the position of the input feature map x whose value is to be computed, and j is the index that enumerates all possible positions. g is a pairwise function that computes responses based on relationships between different locations, and the unary function h computes a representation of the input feature map at position j. We set \(J(x_{i}) = \sum _{\forall j} g(x_{i}, x_{j})\) to normalize the response (Fig. 4).

Fig. 4. A spacetime embedded Gaussian non-local layer. The input feature map X is of shape \(N \times H \times W \times C\); Y denotes the output response map of shape \(N \times H \times W \times 1\).

Here we apply the embedded Gaussian to compute similarity in an embedding space: \(g(x_{i}, x_{j}) = e^{\theta (x_{i})^{T}\phi (x_{j})}\), \(h(x_{j}) = W_{g}x_{j}\), where \(\theta (x_{i}) = W_{\theta }x_{i}\) and \(\phi (x_{j}) = W_{\phi }x_{j}\) are two embeddings. These operations can be realized with a softmax layer and several convolutional layers without bias terms. For an arbitrary position i in the input feature map, the non-local layer computes the similarity between position i and all positions j to model the global residual response map.
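The following PyTorch sketch (our illustration; the embedding width is an assumption) implements this embedded Gaussian non-local layer, producing a single-channel global response map as in Fig. 4:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalResponse(nn.Module):
    """Sketch of the embedded Gaussian non-local layer of Eq. 7 / Fig. 4."""
    def __init__(self, in_channels=64, embed_channels=32):
        super().__init__()
        self.theta = nn.Conv2d(in_channels, embed_channels, 1, bias=False)  # theta(x_i)
        self.phi = nn.Conv2d(in_channels, embed_channels, 1, bias=False)    # phi(x_j)
        self.h = nn.Conv2d(in_channels, 1, 1, bias=False)                   # h(x_j) = W_g x_j

    def forward(self, x):
        n, c, hgt, wid = x.shape
        t = self.theta(x).flatten(2).transpose(1, 2)    # N x HW x E
        p = self.phi(x).flatten(2)                      # N x E x HW
        g = self.h(x).flatten(2).transpose(1, 2)        # N x HW x 1
        attn = F.softmax(t @ p, dim=-1)                 # softmax normalizes by J(x_i)
        y = attn @ g                                    # weighted sum over all positions j
        return y.transpose(1, 2).view(n, 1, hgt, wid)   # global response map, N x 1 x H x W
```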

Compared to the DCF layer, which sums the weighted feature map over a local neighborhood (e.g., \(i - s \le j \le i + s\) with the kernel size equal to the object size \(2s+1\)), the non-local layer can gather weighted features over the entire feature map, which incorporates the influence of background objects on the target object.

3.3 Scale Estimation

Instead of feeding several search images extracted at different scales around the center location of the target object, we crop feature maps at different scales from the original feature map produced by the VGG16 network. These feature maps are then resized to a fixed size before being fed into the DCF layer and the non-local layer to generate the final response map. This allows the feature extraction to be shared and saves much computation without degrading performance.
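A minimal sketch of this feature-map pyramid is given below (our illustration in PyTorch; the scale factors are assumptions). Each scaled crop is resized back to the shared spatial size before the DCF and non-local layers:

```python
import torch
import torch.nn.functional as F

def feature_map_pyramid(feature_map, scales=(0.96, 1.0, 1.04)):
    """Crop (or pad) the shared feature map around its center at several relative
    scales and resize each result back to the original spatial size."""
    _, _, H, W = feature_map.shape
    outputs = []
    for s in scales:
        ch, cw = int(round(H / s)), int(round(W / s))
        if ch <= H and cw <= W:                       # smaller crop around the center
            top, left = (H - ch) // 2, (W - cw) // 2
            crop = feature_map[:, :, top:top + ch, left:left + cw]
        else:                                         # pad outward for the larger scale
            ph, pw = (ch - H) // 2, (cw - W) // 2
            crop = F.pad(feature_map, (pw, cw - W - pw, ph, ch - H - ph))
        outputs.append(F.interpolate(crop, size=(H, W), mode="bilinear",
                                     align_corners=False))
    return torch.cat(outputs, dim=0)                  # one batch entry per scale
```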

We use a smooth function to update the size \(s_{t}\) of the predicted target object:

$$\begin{aligned} s_{t} = \beta s_{t}^{*} + (1 - \beta ) s_{t-1} \end{aligned}$$
(8)

where \(s = (w, h)\), \(s_{t}^{*}\) is the predicted size of the target object in frame t with the maximum response value, and \(\beta \) is the weight that smooths the update of the predicted size.
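For example, Eq. 8 translates directly into the update below (the value of \(\beta \) is a placeholder, not the value used in our experiments):

```python
def update_size(prev_size, predicted_size, beta=0.6):
    """Eq. 8: smoothed (w, h) update, s_t = beta * s_t^* + (1 - beta) * s_{t-1}."""
    w = beta * predicted_size[0] + (1 - beta) * prev_size[0]
    h = beta * predicted_size[1] + (1 - beta) * prev_size[1]
    return (w, h)
```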

4 Experiments

In this section, we investigate the effect of incorporating the local response map produced by the DCF layer and the global response map produced by the non-local layer. We first introduce the implementation details, and then compare our tracker with state-of-the-art trackers on the benchmark datasets for performance evaluation.

4.1 Implementation Details

Our experiment is implemented in Python with TensorFlow [1] and runs at around 5 FPS on a PC with an i7 3.7 GHz CPU, accelerated by an NVIDIA GeForce Titan X GPU. For each new video sequence, we obtain the training image from the first frame, whose size is 5 times the width and height of the target object, through padding or cropping. We then feed it into the pretrained VGG16 network [24] for feature extraction. After extracting the feature map, we apply the DCF layer and the non-local layer on it to obtain the local and global response maps respectively, and sum them for the final response map. The regression target map for computing the loss is generated by a two-dimensional Gaussian function with a peak value of 1.0. We compute the difference between the response map and the regression target map using the L2 loss and apply the Adam optimizer with a learning rate of 5e-8 for back-propagation. Training stops when the loss falls below a given threshold, and the model is updated, with a learning rate of 5e-9, when the final response map has a variance above another given threshold.
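The online procedure above can be summarized by the following sketch (PyTorch; the thresholds, step counts and function names are our assumptions mirroring the text, not the released code):

```python
import torch

def train_on_first_frame(model, feat, label, lr=5e-8, loss_threshold=0.01, max_steps=500):
    """Train the DCF and non-local layers on the first frame until the L2 loss
    falls below a given threshold."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(max_steps):
        loss = ((model(feat) - label) ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
        if loss.item() < loss_threshold:
            break

def update_online(model, feat, label, response, lr=5e-9, var_threshold=0.1, steps=2):
    """Update the model only when the current response map is confident,
    i.e. its variance is above a given threshold."""
    if response.var().item() > var_threshold:
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(steps):
            loss = ((model(feat) - label) ** 2).sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
```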

4.2 Comparison with State-of-the-Art

We extensively evaluate our tracker on three popular tracking benchmarks, OTB-2013 [31], OTB-2015 [32] and VOT-2016 [18], comparing it with several state-of-the-art trackers.

Fig. 5. Precision and success plots on the OTB-2013 dataset using one-pass evaluation. The legend shows the distance precision score at 20 pixels for the precision plots and the area-under-curve score for the success plots.

OTB-2013 Dataset: The OTB-2013 benchmark [31] contains 50 fully annotated video sequences with various challenges in object tracking, such as deformation, fast motion and occlusion. We compare our method with recent state-of-the-art trackers including KCF [15], ECO [6], Staple [2], SRDCF [8], LCT [20], SINT [28], SiamFC [3], SRDCFdecon [9], DeepSRDCF [7], Struck [14], MDNet [21], CF2 [19], C-COT [11], SCT [5], MEEM [34], TCNN [22], HDT [23], STCT [29], MUSTer [17] and CREST [26]. We adopt the one-pass evaluation (OPE) with the precision and success plot metrics defined in [31] to evaluate the robustness of the trackers. Figure 5 shows the comparison of our tracker with the other trackers. The figure legend indicates the distance precision score at 20 pixels for the precision plots and the area-under-curve score for the success plots. Our tracker achieves state-of-the-art performance among all the trackers. The higher precision score at low location error thresholds (5-10 pixels) means that our tracker hardly misses the target even at a strict threshold. It is worth noting that our tracker outperforms CREST [26], which also uses one convolutional layer to formulate the DCF, in both measures.

Fig. 6. Precision and success plots on the OTB-2015 dataset using one-pass evaluation. The legend indicates the distance precision score at 20 pixels for the precision plots and the area-under-curve score for the success plots.

OTB-2015 Dataset: The OTB-2015 benchmark [32] contains 100 fully annotated video sequences. We again compare our tracker with several trackers, including MDNet [21], C-COT [11], SRDCFdecon [9], HDT [23], Staple [2], SRDCF [8], DeepSRDCF [7], CNN-SVM [16], CF2 [19], LCT [20], DSST [10], MEEM [34], KCF [15] and CREST [26]. We adopt the one-pass evaluation (OPE) with the precision and success plot metrics defined in [31] to evaluate the robustness of the trackers. Figure 6 shows the comparison of our tracker with the other trackers. Our tracker again achieves state-of-the-art performance among all the trackers. Unlike on the OTB-2013 benchmark [31], our tracker slightly underperforms C-COT [11] here, which indicates that it is not as robust as C-COT [11] in more challenging conditions.

Table 1. Evaluation of our tracker against state-of-the-art trackers in terms of expected average overlap (EAO), accuracy values (Av) and robustness values (Rv) on the VOT-2016 benchmark.

VOT-2016 Dataset: The VOT-2016 benchmark [18] contains 60 challenging videos selected from a set of more than 300 videos. We compare our model with the top-ranked trackers, including TCNN [22], C-COT [11], ECO [6], Staple [2], EBT [35], MDNet [21], SiamFC [3], SSAT [18] and CREST [26], in terms of expected average overlap (EAO), accuracy values (Av) and robustness values (Rv). Table 1 shows the comparison of our tracker with the state-of-the-art trackers. Among all the trackers, ECO [6] achieves the best performance. The top five trackers are all based on deep CNNs, and our tracker achieves performance comparable to the other trackers.

5 Conclusion

In this paper, we present a novel network architecture for accurate and robust visual tracking by fusing the local and global response maps. Instead of applying a DCF layer on top of a pretrained CNN and merely regressing the local response map, we introduce a convolutional DCF layer for capturing the local response map and a non-local layer for the global response map, which helps the model to distinguish the target object from the background. We also replace image pyramids with feature map pyramids in scale estimation, which saves half the time of one forward pass. Experiments on three popular tracking benchmarks show that our model achieves state-of-the-art performance.