1 Introduction

Visual tracking is a fundamental problem in computer vision with many applications, such as automatic surveillance, human-computer interaction and vehicle tracking. Given the true position of the target in the first frame, the purpose of visual tracking is to locate the target in the subsequent frames. It remains challenging because of occlusion, in-plane and out-of-plane rotation, motion blur, illumination variation, etc.

The main factors that influence tracking performance include the feature extractor, the observation model, the model update and the motion model. Recently, many state-of-the-art algorithms have been proposed, such as LOT [1], Struck [2], SCM [3] and ASLA [4]; these methods focus on exploiting hand-crafted features. Although such algorithms have performed well, hand-crafted features are not suitable for all generic objects.

Convolutional neural networks (CNNs) can extract useful features from raw data, so they have found extensive applications such as image classification [5], object recognition [6] and segmentation [7]. However, traditional CNNs require too much time and too many samples for offline training to be practical for visual tracking. To alleviate these problems, Fan et al. [8] trained a CNN with auxiliary images; Zhou et al. [9] proposed a tracking system using an ensemble of CNNs; Li et al. [10, 11] used multiple image cues to design a lightweight CNN; Zhang et al. [12] exploited a CNN with a lightweight structure, which is fully feed-forward and achieves fast tracking even on a CPU.

All of the above-mentioned CNN trackers extract features from raw pixels and ignore inherent cues such as the oriented gradient. Gradients mainly exist on the contour of the target and in regions whose pixels change sharply, and the oriented gradient indicates the direction of the largest rate of change of the image intensity, which reflects the saliency of the target. Convolving a function with a unit impulse effectively copies the function to the position of the impulse, so when filters learned from oriented-gradient patches are convolved with the input image, only the regions with similar oriented gradients produce large responses. Based on this observation, in this paper we propose an oriented gradient convolutional network (COG) formulated within a particle filtering framework for object tracking. To build it, we first warp the image to a fixed size and obtain patches with a sliding window; each patch encodes the oriented gradient of the object. Then, an overcomplete dictionary is learned from the image patches and its atoms serve as filters. Motivated by the work in [12], the proposed tracker uses a fully feed-forward convolutional network with only two layers. The first is the simple layer, constructed by convolving the filters with the image patches, and the second is the complex layer, whose complex cell feature map is a tensor stacking the simple cell feature maps.

2 Particle Filter

The particle filter is an algorithm that estimates the posterior distribution of the state based on the Bayesian sequential importance sampling technique and Monte Carlo simulation. Because it estimates the state from a set of samples, it is applicable to non-linear systems and has been widely used in visual tracking [3, 4]. The particle filter consists of two steps: prediction and update. The prediction distribution \(p(x_{t}|z_{1:t-1})\), given all observations \(z_{1:t-1}=\{z_{1},z_{2},\ldots ,z_{t-1}\}\) up to time \(\mathrm {t}-1\), can be computed as

$$\begin{aligned} p(x_{t}|z_{1:t-1})=\int p(x_{t}|x_{t-1})p(x_{t-1}|z_{1:t-1})dx_{t-1} \end{aligned}$$
(1)

where \(x_{t}\) denotes the state variable, which describes the affine motion parameters at time t. When the observation \(z_{t}\) is available at time t, the state can be updated as

$$\begin{aligned} p(x_{t}|z_{1:t})=\frac{p(z_{t}|x_{t} )p(x_{t} |z_{1:t-1})}{p(z_{t} |z_{1:t-1})} \end{aligned}$$
(2)

where \(p(z_{t}|x_{t})\) denotes the observation likelihood, and \(p(z_{t}|z_{1:t-1})\) is a normalizing constant, which can be computed as

$$\begin{aligned} p(z_{t}|z_{1:t-1})=\int p(z_{t}|x_{t})p(x_{t}|z_{1:t-1})dx_{t} \end{aligned}$$
(3)

Considering that the above integrals are difficult to compute, Monte Carlo simulation and importance sampling are utilized to approximate the posterior \(p(x_{t}|z_{1:t})\) by N samples, denoted \(\{x_{t}^i\}_{i=1}^N\). The weights of the samples, denoted \(w_{t}^i\), are updated as

$$\begin{aligned} w_{t}^i=w_{t-1}^i\frac{p(z_{t}|x_{t}^i)p(x_{t}^i|x_{t-1}^i)}{q(x_{t}^i|x_{1:t-1}^i,z_{1:t})} \end{aligned}$$
(4)

where \(q(x_{t}|x_{1:t-1},z_{1:t})\) denotes the importance distribution. To alleviate sample degeneracy, the particles are resampled so that those with high weights are duplicated and those with low weights are discarded.

In this paper, the importance distribution is set to \(q(x_{t}|x_{1:t-1},z_{1:t})= p(x_{t}|x_{t-1})\), so the weights become the observation likelihood \(p(z_{t}|x_{t})\). We also warp the input image to a fixed size of \(m\times m\) pixels. The state variable is then set to \(x_{t}=(t_{x},t_{y},\alpha _{1},\alpha _{2},\alpha _{3},\alpha _{4})\), where \(\{t_{x},t_{y}\}\) are the translation parameters and \(\{\alpha _{1},\alpha _{2},\alpha _{3},\alpha _{4} \}\) are the deformation parameters.
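For concreteness, the following Python/NumPy sketch illustrates one predict-update-resample step under these choices. It is only an illustration of the particle filter described above, not the actual implementation (the tracker itself is written in MATLAB); the helper `observation_likelihood`, which stands in for the likelihood of Eq. (16), and the Gaussian noise scale `sigma` are assumed placeholders.

```python
import numpy as np

def particle_filter_step(particles, weights, sigma, observation_likelihood):
    """One predict-update-resample step (minimal sketch).

    particles: (N, 6) array of affine states (tx, ty, a1, a2, a3, a4).
    sigma:     (6,) standard deviations of the Gaussian transition noise.
    observation_likelihood: callable mapping a state to p(z_t | x_t).
    """
    N = particles.shape[0]

    # Predict: sample from p(x_t | x_{t-1}) = N(x_{t-1}, diag(sigma^2)).
    particles = particles + np.random.randn(N, 6) * sigma

    # Update: with q = p(x_t | x_{t-1}) the weights reduce to the likelihood.
    weights = np.array([observation_likelihood(x) for x in particles])
    weights = weights / (weights.sum() + 1e-12)

    # Resample: duplicate high-weight particles, discard low-weight ones.
    idx = np.random.choice(N, size=N, p=weights)
    particles = particles[idx]
    weights = np.full(N, 1.0 / N)

    # The candidate with the highest likelihood is typically taken as the result.
    return particles, weights
```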

3 Oriented Gradient Convolutional Network

3.1 Preprocessing

After the input image is warped to a fixed size, the gradient of each pixel can be computed as

$$\begin{aligned} {{\varvec{G}}}_{x}(x,y)={{\varvec{I}}}^{m\times m}\bigotimes {{\varvec{I}}}_{o},\quad {{\varvec{G}}}_{y}(x,y)={{\varvec{I}}}^{m\times m}\bigotimes {{\varvec{I}}}_{o}^{T} \end{aligned}$$
(5)

where \({{\varvec{I}}}^{m\times m}\) is the warped image, preprocessed by gamma correction together with local brightness and contrast normalization, and \({{\varvec{I}}}_{o}=[-1,0,1]\). The oriented gradient can then be formulated as

$$\begin{aligned} \alpha (x,y)=\tan ^{-1}\frac{{{\varvec{G}}}_{y}(x,y)}{{{\varvec{G}}}_{x}(x,y)} \end{aligned}$$
(6)

Therefore, the warped input image is represented by its oriented gradients, denoted \(\hat{{{\varvec{I}}}}^{m\times m}\). We sample \((m-n+1)\times (m-n+1)\) patches centered at each pixel location inside \(\hat{{{\varvec{I}}}}^{m\times m}\) by sliding an \(n\times n\) window. Finally, in each patch we divide the oriented gradients into R sections and obtain an R-dimensional feature vector by counting the number of pixels falling into each section; the features of all patches are collected into \({{\varvec{Y}}}_{K\times K}^R\), where \(K=m-n+1\).
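A minimal NumPy sketch of this preprocessing step is given below. It assumes the input has already been warped, gamma-corrected and normalized, and it folds orientations into \([0,\pi )\) before binning, which is one possible choice not fixed by the text above.

```python
import numpy as np

def oriented_gradient_patches(img, n=6, R=9):
    """Oriented-gradient histograms over sliding windows (sketch of Sect. 3.1).

    img: warped m x m grayscale image (gamma-corrected and normalized).
    n:   sliding-window size; R: number of orientation sections.
    Returns a (K, K, R) array with K = m - n + 1, one R-bin histogram per patch.
    """
    img = img.astype(float)
    m = img.shape[0]
    # Gradients with the kernel I_o = [-1, 0, 1] (Eq. (5)), up to boundaries.
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    # Orientation of each pixel (Eq. (6)), folded into [0, pi).
    alpha = np.mod(np.arctan2(gy, gx), np.pi)
    bins = np.minimum((alpha / np.pi * R).astype(int), R - 1)

    K = m - n + 1
    Y = np.zeros((K, K, R))
    for i in range(K):
        for j in range(K):
            patch_bins = bins[i:i + n, j:j + n].ravel()
            # Count how many pixels fall into each orientation section.
            Y[i, j] = np.bincount(patch_bins, minlength=R)
    return Y
```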

3.2 Design Filter

This step is inspired by sparse representation [13]: an over-complete sparse representation can represent an image most faithfully and captures higher-resolution information than traditional non-adaptive methods. We therefore learn an overcomplete dictionary, whose atoms serve as filters, to represent the target region adaptively. The patch features, the filters and the sparse coefficients are related by

$$\begin{aligned} {{\varvec{Y}}}_{K\times K}^R \approx {{\varvec{F}}}_{R\times W}{{\varvec{X}}} \end{aligned}$$
(7)

where \({{\varvec{F}}}=[{{\varvec{f}}}_{1},\ldots ,{{\varvec{f}}}_{W}]\) contains the W filters as its columns and \({{\varvec{X}}}\) is the coefficient matrix, which should be as sparse as possible. Eq. (7) can be rewritten as

$$\begin{aligned} \min _{{{\varvec{F}}},{{\varvec{X}}}} \{\parallel {{\varvec{X}}}_{i}\parallel _{0} \} \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, s.t.\parallel {{\varvec{Y}}}-{{\varvec{FX}}}\parallel _{F}^2 \le \varepsilon \end{aligned}$$
(8)

To solve the above equation, the dictionary F is first assumed to be fixed, and Orthogonal Matching Pursuit (OMP) is used for the sparse coding step

$$\begin{aligned} \min _{{{\varvec{x}}}_{i}} \{\parallel {{\varvec{Y}}}_{i}-{{\varvec{F}}}{{\varvec{x}}}_{i}\parallel _{2}^2 \} \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, s.t.\parallel {{\varvec{x}}}_{i}\parallel _{0} \le T_{0} \end{aligned}$$
(9)

where \({{\varvec{x}}}_{i}\) is the sparse coefficient vector of the i-th patch and \(T_{0}\) is the sparsity level. Then, \({{\varvec{FX}}}\) can be computed as

$$\begin{aligned} {{\varvec{FX}}}=\sum _{i=1}^W {{\varvec{f}}}_i {{\varvec{x}}}_{i}^T \end{aligned}$$
(10)

where \({{\varvec{f}}}_{i}\) is the i-th column of \({{\varvec{F}}}\) and \({{\varvec{x}}}_{i}^T\) is the i-th row of \({{\varvec{X}}}\). The dictionary is then updated over several iterations. When the k-th column is updated, the others are kept fixed, so the objective can be rewritten as

$$\begin{aligned} \parallel {{\varvec{Y}}}-{{\varvec{FX}}}\parallel _{F}^{2}=\parallel \left( {{\varvec{Y}}}-\sum _{i\ne k}{{\varvec{f}}}_{i}{{\varvec{x}}}_{i}^{T}\right) -{{\varvec{f}}}_{k}{{\varvec{x}}}_{k}^{T}\parallel _{F}^{2}=\parallel {{\varvec{E}}}_{k}-{{\varvec{f}}}_{k}{{\varvec{x}}}_{k}^{T}\parallel _{F}^{2} \end{aligned}$$
(11)

where \({{\varvec{E}}}_{k}\) is a fixed error matrix. Two vectors should then be found from \({{\varvec{E}}}_{k}\) to update \({{\varvec{f}}}_{k}\) and \({{\varvec{x}}}_{k}^{T}\). In [14], the authors used the Singular Value Decomposition (SVD) to solve this problem,

$$\begin{aligned} {{\varvec{E}}}_{k}={{\varvec{U}}}\varLambda {{\varvec{V}}}^{T} \end{aligned}$$
(12)

where \({{\varvec{U}}}\) and \({{\varvec{V}}}\) are orthogonal matrices, and \(\varLambda \) is a diagonal matrix. The column of \({{\varvec{U}}}\) corresponding to the largest value in \(\varLambda \) is taken as \({{\varvec{f}}}_{k}\). However, the sparsity of \({{\varvec{X}}}\) may be corrupted if the corresponding column of \({{\varvec{V}}}\) is used to update \({{\varvec{x}}}_{k}^{T}\). The solution is to construct a selection matrix \(\varOmega _{K\times L}\) that keeps only the positions where \({{\varvec{x}}}_{k}^{T}\) is nonzero

$$\begin{aligned} \hat{{{\varvec{E}}}}_{k}={{\varvec{E}}}_{k}\varOmega _{K\times L},\quad \hat{{{\varvec{x}}}}_{k}={{\varvec{x}}}_{k}^{T}\varOmega _{K\times L} \end{aligned}$$
(13)

A sparse \({{\varvec{x}}}_{k}^{T}\) can then be obtained by applying the SVD to \(\hat{{{\varvec{E}}}}_{k}\).
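Since Eqs. (8)-(13) describe a K-SVD-style procedure, the sketch below shows how the filters might be learned in practice. It is a simplified illustration that assumes the patch features are arranged as the columns of Y; the parameters `W`, `sparsity` and `n_iter` are illustrative, and details such as atom normalization and the handling of unused atoms follow common K-SVD practice rather than the paper.

```python
import numpy as np

def omp(F, y, sparsity):
    """Orthogonal Matching Pursuit: greedy sparse coding of y over dictionary F."""
    residual, support, x = y.copy(), [], np.zeros(F.shape[1])
    for _ in range(sparsity):
        support.append(int(np.argmax(np.abs(F.T @ residual))))
        coef, *_ = np.linalg.lstsq(F[:, support], y, rcond=None)
        residual = y - F[:, support] @ coef
    x[support] = coef
    return x

def ksvd(Y, W, sparsity=3, n_iter=10):
    """Minimal K-SVD sketch (Sect. 3.2): learn a dictionary F with W atoms
    so that Y ~ F X with column-sparse X (Y holds the R-dimensional
    oriented-gradient features as columns)."""
    R, n_samples = Y.shape
    F = Y[:, np.random.choice(n_samples, W, replace=False)].astype(float)
    F /= np.linalg.norm(F, axis=0, keepdims=True) + 1e-12
    for _ in range(n_iter):
        # Sparse coding step (Eq. (9)) with OMP, one column at a time.
        X = np.stack([omp(F, Y[:, i], sparsity) for i in range(n_samples)], axis=1)
        # Dictionary update step (Eqs. (11)-(13)): one atom at a time.
        for k in range(W):
            used = np.nonzero(X[k])[0]          # columns where atom k is active
            if used.size == 0:
                continue
            E_k = Y[:, used] - F @ X[:, used] + np.outer(F[:, k], X[k, used])
            U, s, Vt = np.linalg.svd(E_k, full_matrices=False)
            F[:, k] = U[:, 0]                   # atom: dominant left singular vector
            X[k, used] = s[0] * Vt[0]           # sparse row keeps its support
    return F, X
```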

3.3 Object Tracking

After obtaining the filter F, the region of each particle is preprocessed and denoted as P. The simple layer can then be defined as

$$\begin{aligned} {{\varvec{S}}}={{\varvec{F}}}\bigotimes {{\varvec{P}}} \end{aligned}$$
(14)
Fig. 1. The simple cell feature map can preserve the local structure of the target.

The simple layer S enhances the outline and structure of the object region, which are preserved under illumination variation and motion blur. The position of the object can therefore be discriminated according to the geometric layout information of the simple layer. To further strengthen the representation, a 3D tensor is used to construct a complex feature map as in [12],

$$\begin{aligned} {{\varvec{C}}}\in {{\varvec{R}}}^{K\times K\times N} \end{aligned}$$
(15)

which enhances the strength of the local structure. The complex cell feature map C is therefore robust to occlusion and is deemed the candidate template. In this paper, the state transition distribution is modeled by a Gaussian distribution, and the observations are assumed independent, so the observation likelihood can be computed as

$$\begin{aligned} p(z_{t}|x_{t})= e^{-\parallel {{\varvec{c}}}_{t}-{{\varvec{c}}}_{t}^{i}\parallel _{2}} \end{aligned}$$
(16)

where \({{\varvec{c}}}_{t}\) and \({{\varvec{c}}}_{t}^{i}\) are the target template and the i-th candidate template at frame t, respectively.
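The sketch below illustrates one plausible way to compute the simple and complex layers and the likelihood of Eq. (16). It assumes, as one reading of Eq. (14), that each filter responds to the R-bin oriented-gradient histogram at every location of the preprocessed candidate region; the exact convolution arrangement in the paper may differ.

```python
import numpy as np

def complex_feature_map(Y, F):
    """Simple layer (Eq. (14)) and complex cell map (Eq. (15)) -- a sketch.

    Y: (K, K, R) oriented-gradient features of a warped candidate region
       (output of the preprocessing in Sect. 3.1).
    F: (R, W) dictionary whose W columns act as filters.
    Returns C of shape (K, K, W): the W simple cell maps stacked into a tensor.
    """
    # For per-location R-dimensional histograms, filtering reduces to an
    # inner product between each local histogram and each dictionary atom.
    S = np.tensordot(Y, F, axes=([2], [0]))   # (K, K, W) simple cell maps
    return S                                  # stacked along the last axis -> C

def observation_likelihood(C_target, C_candidate):
    """Eq. (16): p(z_t | x_t) = exp(-||c_t - c_t^i||_2)."""
    return float(np.exp(-np.linalg.norm(C_target.ravel() - C_candidate.ravel())))
```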

3.4 Model Update

The filter and the target template c should be updated incrementally to accommodate appearance changes. The target template is updated as

$$\begin{aligned} {{\varvec{c}}}_{t}=(1-\rho ){{\varvec{c}}}_{t-1}+\rho \hat{{{\varvec{c}}}}_{t-1} \end{aligned}$$
(17)

where \(\rho \) is a fixed parameter, set to 0.05 in our experiments, \({{\varvec{c}}}_{t}\) represents the target template at frame t, and \(\hat{{{\varvec{c}}}}_{t-1}\) is the sparse representation of the target template at frame \(\mathrm {t}-1\), which can easily be obtained with a soft shrinkage function

$$\begin{aligned} \hat{{{\varvec{c}}}}=\text {sign}(vec({{\varvec{C}}}))\max (0,\text {abs}(vec({{\varvec{C}}}))-\text {median}(vec({{\varvec{C}}}))) \end{aligned}$$
(18)

The filter is updated as

$$\begin{aligned} {{\varvec{F}}}_{t+1}=\lambda {{\varvec{F}}}_{o}+\lambda {{\varvec{F}}}_{t}+(1-2\lambda ){{\varvec{F}}}_{t-1} \end{aligned}$$
(19)

where \({{\varvec{F}}}_{o}\) is the filter learned in the first frame, \(\lambda \) is a fixed parameter, set to 0.15 in our experiments, \({{\varvec{F}}}_{t}\) is the filter at frame t, and \({{\varvec{F}}}_{t-1}\) is the filter at frame \(\mathrm {t}-1\). To improve the speed, we update the filter every 5 frames.
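A compact sketch of the update rules in Eqs. (17)-(19) is given below. The soft shrinkage follows Eq. (18) literally, and the function names and signatures (e.g. `update_model`) are illustrative rather than part of the paper.

```python
import numpy as np

def soft_shrinkage(C):
    """Eq. (18): sparse representation of the complex feature map."""
    v = C.ravel()
    return np.sign(v) * np.maximum(0.0, np.abs(v) - np.median(v))

def update_model(c_prev, C_prev, F0, F_t, F_tm1, rho=0.05, lam=0.15):
    """Incremental template and filter update (sketch of Sect. 3.4).

    c_prev: target template at frame t-1 (flattened complex map).
    C_prev: complex feature map of the tracking result at frame t-1.
    F0, F_t, F_tm1: filters at the first frame, frame t and frame t-1.
    """
    # Eq. (17): blend the old template with its sparse representation.
    c_new = (1.0 - rho) * c_prev + rho * soft_shrinkage(C_prev)
    # Eq. (19): blend the initial filter with the two most recent ones
    # (in the paper this update is carried out once every 5 frames).
    F_new = lam * F0 + lam * F_t + (1.0 - 2.0 * lam) * F_tm1
    return c_new, F_new
```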

4 Experiments and Discussion

4.1 Experimental Setup

Our tracker was implemented in MATLAB 2016 on a PC with an Intel Core i7-7700 CPU @ 3.5 GHz, and runs at approximately 1 frame per second. To compare the robustness of our algorithm, we use the Visual Tracking Benchmark dataset [15] and its accompanying code library, including IVT [16], L1APG [17], ASLA [4], LOT [1], MTT [18] and SCM [3]. Furthermore, we also add the CNT algorithm, which can be downloaded from the authors' homepage. Nine challenging sequences from the visual tracking benchmark are used for comparison: blurcar1, boy, david3, fleetface, jumping, jogging-1, singer2, suv and trellis.

Several parameters are used in the COG tracker. The variance of the particle filter is set to \(\lambda _{t}=4+(v_{t-1}+v_{t-2}+v_{t-3})/3\), where \(v_{t-1}\), \(v_{t-2}\) and \(v_{t-3}\) are the target's movement speeds at frames \(\mathrm {t}-1\), \(\mathrm {t}-2\) and \(\mathrm {t}-3\), respectively. The deformation parameters \(\{ \alpha _{1},\alpha _{2},\alpha _{3},\alpha _{4}\}\) are set to \(\{0.02,0.005,0.005,0\}\). The size of the warped image is set to m = 32, and the size of the sliding window is set to n = 6.
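For reference, the adaptive particle variance above can be computed as in the following small sketch; the speeds are measured in pixels per frame, and the fallback for the first few frames is an assumption, not specified in the paper.

```python
def particle_variance(speeds):
    """Adaptive translation variance of Sect. 4.1 (sketch):
    lambda_t = 4 + (v_{t-1} + v_{t-2} + v_{t-3}) / 3."""
    recent = speeds[-3:] if len(speeds) >= 3 else speeds
    return 4.0 + sum(recent) / max(len(recent), 1)
```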

Fig. 2. Tracking results in terms of center position error (in pixels). The COG is compared with 9 state-of-the-art algorithms on 9 challenging image sequences.

Table 1. Average overlap rate of the 10 algorithms. The best and second-best results are shown in bold font and italic font.
Table 2. Success rate based on Fig. 3. The best and second-best results are shown in bold font and italic font.

4.2 Qualitative Comparisons

Motion Blur: The target region becomes blurred due to the motion of the target or the camera. In the blurcar1 sequence (Fig. 4(a)), the camera moves so swiftly that the target is blurred; only COG and Struck perform well over the entire sequence. As shown in Fig. 4(e), motion blur causes drastic appearance changes in the jumping sequence, and only COG, Struck and LOT do not fail after frame 16. The main reason the proposed COG successfully keeps track of the target under motion blur is that the simple feature map is based on the oriented gradient, which is robust to motion blur (see Fig. 1).

Plane Rotation: The target may deform while rotating in or out of the image plane, and fast rotation can also cause motion blur. For the boy sequence (Fig. 4(b)), the boy rotates both in and out of the plane; only COG and Struck perform well over the whole sequence. In the fleetface sequence (Fig. 4(d)), there are both in-plane and out-of-plane rotations after frame 375; except for the COG and Struck algorithms, all the others drift. COG deals with plane rotation well because its online update scheme is suited to appearance variation.

Fig. 3. Tracking results in terms of overlap rate. The COG is compared with 9 state-of-the-art algorithms on 9 challenging image sequences.

Fig. 4. Representative tracking results of the COG and 9 state-of-the-art algorithms on 9 challenging image sequences (Color figure online).

Illumination Variation: Illumination variation arises from different colours and varying levels of light. In the singer2 sequence (Fig. 4(g)), the colour of the background illumination changes so drastically that only the COG algorithm can track the target stably over the entire sequence. In the trellis sequence (Fig. 4(i)), only the COG and ASLA algorithms perform well while the strength of the illumination varies. The proposed COG algorithm handles illumination variation well because the features are extracted from normalized local patches with gamma correction and local brightness normalization.

Occlusion: The target may be partially or fully occluded by other objects. In the david3 sequence (Fig. 4(c)), only the COG and LOT algorithms perform well when the target is partially occluded by the tree (e.g., frames 84 and 190). In the jogging-1 sequence (Fig. 4(f)), the target is completely occluded by a lamppost (e.g., frames 74 and 78), and only COG is able to re-detect the person when the target reappears in the scene (e.g., frame 80). Our COG tracker achieves stable performance under occlusion because it employs the local features in the complex feature maps.

4.3 Quantitative Comparisons

For quantitative comparison, the center position error plot and overlap rate plot are employed. The center position error is defined as

$$\begin{aligned} e=\sqrt{(x_{t}-x_{s})^{2}+(y_{t}-y_{s})^{2}} \end{aligned}$$
(20)

where \((x_{t},y_{t})\) is the center position of the tracking result and \((x_{s},y_{s})\) is that of the ground truth. The overlap rate is defined as

$$\begin{aligned} S=\frac{Area(R_{t}\bigcap R_{s})}{Area(R_{t}\bigcup R_{s})} \end{aligned}$$
(21)

where \(R_{t}\) represents the tracking bounding box and \(R_{s}\) represents the ground-truth bounding box. Tracking in a frame is regarded as successful if the overlap rate in that frame is greater than 0.5, and the success rate is defined as the number of successful frames divided by the total number of frames.
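These two metrics and the success rate can be computed as in the following sketch; boxes are assumed to be given as (x, y, w, h), a convention not fixed by the paper.

```python
import numpy as np

def center_error(box_t, box_s):
    """Eq. (20): Euclidean distance between the centers of two (x, y, w, h) boxes."""
    cx_t, cy_t = box_t[0] + box_t[2] / 2.0, box_t[1] + box_t[3] / 2.0
    cx_s, cy_s = box_s[0] + box_s[2] / 2.0, box_s[1] + box_s[3] / 2.0
    return float(np.hypot(cx_t - cx_s, cy_t - cy_s))

def overlap_rate(box_t, box_s):
    """Eq. (21): intersection-over-union of the tracked and ground-truth boxes."""
    x1 = max(box_t[0], box_s[0]); y1 = max(box_t[1], box_s[1])
    x2 = min(box_t[0] + box_t[2], box_s[0] + box_s[2])
    y2 = min(box_t[1] + box_t[3], box_s[1] + box_s[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_t[2] * box_t[3] + box_s[2] * box_s[3] - inter
    return inter / union if union > 0 else 0.0

def success_rate(overlaps, threshold=0.5):
    """Fraction of frames whose overlap rate exceeds the threshold."""
    return float(np.mean([o > threshold for o in overlaps]))
```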

Figure 2 illustrates the center position error of each frame for the 10 algorithms, while Fig. 3 shows the corresponding overlap rate plot. The results indicate that the center position errors of COG remain at a relatively low level and the corresponding overlap rates stay at a high level across all sequences. Table 1 shows the average overlap rate of each algorithm. Overall, the proposed COG achieves a mean overlap rate of 0.75, which outperforms LOT by 26%. Meanwhile, in terms of the success rate shown in Table 2, its score is 0.97, significantly outperforming Struck by 37%.

5 Conclusion

In this paper, we have proposed and demonstrated an effective and robust tracking method based on an oriented gradient convolutional network. The tracker is built on a simple two-layer convolutional network and formulated within a particle filtering framework. First, we warp the input image to a fixed size and extract a set of normalized oriented-gradient patches with a sliding window, which helps handle illumination variation. To obtain a sparse representation of the target, we learn an overcomplete dictionary from the normalized patches and use its atoms as filters. Then, the first layer is constructed by convolving the filters with the input image, which is robust to motion blur. Finally, we stack all the simple feature maps into the complex feature map as the representation of the target, which helps overcome drift and occlusion. Furthermore, both the latest observations and the original filter are considered in our update scheme, which deals well with appearance change and the drift problem. Quantitative and qualitative comparisons with nine other state-of-the-art algorithms on 9 challenging image sequences demonstrate the robustness of the proposed tracking algorithm.