1 Introduction

Object tracking is a popular topic in computer vision and has been widely used in video surveillance and robotics [1]. In recent years, although many tracking methods have been proposed and much success has been demonstrated, robust tracking is still a challenging task due to factors such as occlusion, fast motion, motion blur and pose change [2].

To deal with these factors, building an effective adaptive appearance model is particularly important. In general, tracking algorithms can be categorized into two classes: generative and discriminative algorithms [3]. Generative tracking algorithms model the target and locate it by searching for the image patch most similar to the target model. Kumar et al. [4] combine the Kalman filter with a geometric shape template matching method, which can handle the segmentation and merging of multiple targets. Zhan et al. [5] propose a tracking algorithm combining mean shift with the Kalman filter, which avoids model update errors. Wang et al. [6] use partial least squares (PLS) to learn a low-dimensional discriminative subspace, and the tracking drift problem is alleviated by online updating of the appearance model. Hu et al. [7] introduce a sparse weight constraint to dynamically select relevant templates from the global template set and use multi-feature joint sparse representation for multi-target tracking under occlusion. However, these generative algorithms ignore background information: when another object has a texture similar to the target, or the target is occluded, the tracker is easily distracted or fails.

Discriminative tracking algorithms treat target tracking as a binary classification problem whose purpose is to find a boundary that separates the target from the background [8]. Babenko et al. [9] propose a multiple instance learning (MIL) approach, but the computational complexity of its feature selection leads to poor real-time performance. Kaur and Sahambi [10] propose an improved steady-state-gain Kalman filter; by introducing a fractional feedback loop into the Kalman filter, their algorithm handles abrupt motion. Zhang et al. [11] make full use of hybrid SVMs for the appearance model to resolve ambiguity at the foreground-background boundary and effectively avoid drift. However, these discriminative methods involve high computational cost, which hinders their real-time application.

To combine the advantages of both kinds of methods, this paper proposes an adaptive learning compressive tracking algorithm based on the Kalman filter (ALCT-KF) [12, 13], which addresses heavy occlusion, fast motion, similar objects and illumination change. The adaptive learning compressive tracking algorithm uses CT to track the target and computes the Peak-to-Sidelobe Ratio (PSR) from the confidence map to update the Bayesian classifier adaptively. When the PSR falls below a certain threshold, the object is considered heavily occluded, and the Kalman filter is used to predict its location.

The rest of this paper is organized as follows. Section 2 gives a brief review of the original CT. The proposed algorithm is detailed in Sect. 3. Section 4 reports the experimental results of the proposed algorithm, and we conclude in Sect. 5.

2 Compressive Tracking

The CT algorithms [12, 13] are based on compressive sensing theory: a very sparse random matrix that satisfies the restricted isometry property (RIP) is adopted to project the high-dimensional Haar-like feature vector onto a low-dimensional measurement vector

$$ V = Rx , $$
(1)

where \( R \in R^{n \times m} \) \( (n \ll m) \) is a sparse random matrix, \( x \in R^{m \times 1} \) is the feature vector, and \( V \in R^{n \times 1} \) is the compressive feature vector. \( R \) is defined as

$$ R(i,j) = r_{i,j} = \sqrt{s} \times \begin{cases} 1 & \text{with probability } \frac{1}{2s} \\ 0 & \text{with probability } 1 - \frac{1}{s} \\ -1 & \text{with probability } \frac{1}{2s} \end{cases} $$
(2)

where \( s = m/(a\log_{10} (m)) \), with \( m = 10^{6} \sim 10^{10} \) and \( a = 0.4 \). \( R \) is therefore very sparse, with at most four non-zero elements per row, which further reduces the computational complexity.
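The construction of \( R \) in Eqs. (1)-(2) can be sketched as follows. This is a minimal illustrative Python implementation, not the paper's MATLAB code; the function names are ours, and a dense matrix is built for clarity even though practical CT implementations store only the few non-zero entries per row.

```python
import numpy as np

def sparse_measurement_matrix(n, m, a=0.4, seed=0):
    """Sparse random matrix R of Eq. (2): entries are sqrt(s) * {+1, 0, -1}
    with probabilities 1/(2s), 1 - 1/s, 1/(2s), where s = m / (a * log10(m))."""
    rng = np.random.default_rng(seed)
    s = m / (a * np.log10(m))
    p = 1.0 / (2.0 * s)          # probability of each non-zero sign
    u = rng.random((n, m))
    R = np.zeros((n, m))
    R[u < p] = np.sqrt(s)        # +sqrt(s) with probability 1/(2s)
    R[u > 1.0 - p] = -np.sqrt(s) # -sqrt(s) with probability 1/(2s)
    return R

def compress(R, x):
    """Eq. (1): project the high-dimensional feature vector x to v = R x."""
    return R @ x
```

A small \( m \) is enough to see the sparsity; in the paper \( m \) is on the order of \( 10^{6} \) or more.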

The compressed feature vector v obtained from (1) and (2) is input to a naive Bayesian classifier, and the position of the target is determined by the classifier response. Assuming that the elements of v are independently distributed, the naive Bayesian classifier is constructed as:

$$ H(v) = \log \left( \frac{\prod\nolimits_{i = 1}^{n} p(v_{i} \mid y = 1)\, p(y = 1)}{\prod\nolimits_{i = 1}^{n} p(v_{i} \mid y = 0)\, p(y = 0)} \right) = \sum\limits_{i = 1}^{n} \log \left( \frac{p(v_{i} \mid y = 1)}{p(v_{i} \mid y = 0)} \right) $$
(3)

where \( p(y = 1) = p(y = 0) = 0.5 \), and \( y \in \{ 0, 1\} \) is a binary variable representing the sample label. The conditional distributions \( p(v_{i} \mid y = 1) \) and \( p(v_{i} \mid y = 0) \) in \( H(v) \) are assumed to be Gaussian with four parameters \( (\mu_{i}^{1} ,\sigma_{i}^{1} ,\mu_{i}^{0} ,\sigma_{i}^{0} ) \)

$$ p(v_{i} \mid y = 1) \sim N(\mu_{i}^{1} ,\sigma_{i}^{1} ), \quad p(v_{i} \mid y = 0) \sim N(\mu_{i}^{0} ,\sigma_{i}^{0} ) $$
(4)

where \( \mu_{i}^{1} (\mu_{i}^{0} ) \) and \( \sigma_{i}^{1} (\sigma_{i}^{0} ) \) are the mean and standard deviation of the positive (negative) class. These parameters can be updated by

$$ \begin{aligned} & \mu_{i}^{1} \leftarrow \lambda \mu^{1} + (1 - \lambda )\mu_{i}^{1} \\ & \sigma_{i}^{1} \leftarrow \sqrt {\lambda (\sigma^{1} )^{2} + (1 - \lambda )(\sigma_{i}^{1} )^{2} + \lambda (1 - \lambda )(\mu_{i}^{1} - \mu^{1} )^{2} } \\ \end{aligned} $$
(5)

where \( \lambda \) is the learning parameter, and

$$ \begin{aligned} & \sigma^{1} = \sqrt {\frac{1}{n}\sum\nolimits_{k = 0|y = 1}^{n - 1} {(v_{i} (k) - \mu^{1} )^{2} } } \\ & \mu^{1} = \frac{1}{n}\sum\nolimits_{k = 0|y = 1}^{n - 1} {v_{i} (k)} \\ \end{aligned} $$
(6)

Negative-sample parameters \( \mu_{i}^{0} \) and \( \sigma_{i}^{0} \) are updated with similar rules.
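The classifier of Eqs. (3)-(6) can be sketched as below. This is a minimal illustration under our own naming (class name, defaults and the per-dimension Gaussian evaluation are ours); note that, as written in Eq. (5), \( \lambda \) weights the new-frame statistics.

```python
import numpy as np

class NaiveBayesCT:
    """Sketch of the naive Bayes classifier of Eqs. (3)-(6): per-dimension
    Gaussian class-conditionals, blended with new statistics at rate lam."""

    def __init__(self, n_dims, lam=0.15):
        self.lam = lam
        self.mu1 = np.zeros(n_dims); self.sig1 = np.ones(n_dims)
        self.mu0 = np.zeros(n_dims); self.sig0 = np.ones(n_dims)

    def response(self, v):
        """H(v) of Eq. (3): sum of per-dimension log-likelihood ratios."""
        def log_gauss(v, mu, sig):
            sig = np.maximum(sig, 1e-6)  # guard against degenerate sigma
            return -0.5 * ((v - mu) / sig) ** 2 - np.log(sig)
        return float(np.sum(log_gauss(v, self.mu1, self.sig1)
                            - log_gauss(v, self.mu0, self.sig0)))

    def update(self, samples, positive=True):
        """Eq. (5): blend sample statistics (Eq. (6)) into the model."""
        mu_new = samples.mean(axis=0)
        sig_new = samples.std(axis=0)
        lam = self.lam
        if positive:
            # sigma is updated first because Eq. (5) uses the old mu_i^1
            self.sig1 = np.sqrt(lam * sig_new**2 + (1 - lam) * self.sig1**2
                                + lam * (1 - lam) * (self.mu1 - mu_new)**2)
            self.mu1 = lam * mu_new + (1 - lam) * self.mu1
        else:
            self.sig0 = np.sqrt(lam * sig_new**2 + (1 - lam) * self.sig0**2
                                + lam * (1 - lam) * (self.mu0 - mu_new)**2)
            self.mu0 = lam * mu_new + (1 - lam) * self.mu0
```

After a few updates on positive and negative sample sets, the response is high at the target appearance and low on the background.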

Compressive tracking is simple and efficient, but a problem still exists: the classifier is updated by Eq. (5) with a fixed learning rate \( \lambda \). When occlusion or other disturbances occur, this may cause the classifier to be updated incorrectly.

3 Proposed Algorithm

3.1 Adaptive Learning Compressive Tracking (ALCT)

The compressive tracking algorithm finds it difficult to re-acquire the object after drift or failure. One of the main reasons is that \( p(v_{i} \mid y = 0) \) and \( p(v_{i} \mid y = 1) \) are determined by the four parameters \( \mu_{i}^{0} \), \( \mu_{i}^{1} \), \( \sigma_{i}^{0} \), \( \sigma_{i}^{1} \), while a fixed learning parameter \( \lambda \) is used in Eq. (5). When occlusion or other disturbances occur, this fixed \( \lambda \) may cause the classifier to be updated incorrectly.

According to Eq. (3), we define a non-linear function of the naive Bayes classifier response \( H(v) \) as the target confidence

$$ c(x) = p(y = 1\,|\,x) = \sigma (H(v)) $$
(7)

where \( \sigma ( \cdot ) \) is the sigmoid function, \( \sigma (x) = 1/(1 + e^{ - x} ) \).

The Peak-to-Sidelobe Ratio (PSR) [14], which measures the strength of a correlation peak, can be used to detect occlusion or tracking failure.

$$ PSR(t) = \frac{{\hbox{max} (c_{t} (x)) - \mu_{t} }}{{\sigma_{t} }} $$
(8)

where \( c_{t} (x) \) denotes the classifier response values for all candidate positions in the t-th frame, split into the peak, i.e. the maximum value \( \hbox{max} (c_{t} (x)) \), and the sidelobe, i.e. the remaining search positions excluding an \( 11 \times 11 \) window around the peak. \( \mu_{t} \) and \( \sigma_{t} \) are the mean and standard deviation of the sidelobe. Taking the Cliff bar sequence as an example, the PSR distribution is shown in Fig. 1.
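Eq. (8) can be sketched as follows for a 2-D map of confidences \( c_{t}(x) \). This is an illustrative Python version (the small epsilon guarding the division is ours); the excluded window follows the \( 11 \times 11 \) choice in the text.

```python
import numpy as np

def psr(response_map, exclude=11):
    """Peak-to-Sidelobe Ratio of Eq. (8) for a 2-D confidence map.

    The sidelobe is every position outside an exclude x exclude window
    centred on the peak; returns (peak - mean) / std of the sidelobe."""
    peak = response_map.max()
    r, c = np.unravel_index(response_map.argmax(), response_map.shape)
    mask = np.ones_like(response_map, dtype=bool)
    h = exclude // 2
    mask[max(0, r - h):r + h + 1, max(0, c - h):c + h + 1] = False
    side = response_map[mask]
    return (peak - side.mean()) / (side.std() + 1e-12)
```

A sharp, isolated peak yields a high PSR; a flat or noisy map (as under occlusion or blur) yields a low one.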

Fig. 1. Analysis of PSR in the Cliff bar sequence

Figure 1 shows that the PSR localizes the most challenging parts of the video. In the first 75 frames, the object has few interfering factors and the PSR is stable at about 1.6. When the object moves fast and the target region becomes blurred, the PSR drops to point A during frames 75-90. Once the motion blur ends, the PSR gradually returns to its normal level. Similarly, when the object undergoes occlusion, fast motion, scale change or rotation, the PSR drops to valley points, corresponding to B, C, D, E and F in Fig. 1. The PSR thus reflects the influence of these factors: the higher the PSR, the higher the confidence in the target location. Therefore, when the PSR is below a certain threshold, the classifier should be updated with a smaller learning rate, which improves the anti-interference ability of the model.

Experiments (see Fig. 1) show that when the PSR is above 1.6, the tracking results are fully credible. If the PSR is below 1.6, the object may be undergoing occlusion, pose change or illumination change. We can therefore determine the update weight of the classifier from the PSR of each frame. The new update formula is given in Eq. (9):

$$ \begin{aligned} & w_{t} = \begin{cases} 0 & PSR_{t} < PSR_{0} \\ \exp [ - (PSR_{t} - PSR_{1} )^{2} ] & PSR_{0} < PSR_{t} < PSR_{1} \\ 1 & \text{otherwise} \end{cases} \\ & \mu_{i}^{1} \leftarrow (1 - \lambda w_{t} )\mu_{i}^{1} + \lambda w_{t} \mu^{1} \\ & \sigma_{i}^{1} \leftarrow \sqrt {(1 - \lambda w_{t} )(\sigma_{i}^{1} )^{2} + \lambda w_{t} (\sigma^{1} )^{2} + \lambda w_{t} (1 - \lambda w_{t} )(\mu_{i}^{1} - \mu^{1} )^{2} } \end{aligned} $$
(9)

where \( PSR_{t} \) is the PSR at the t-th frame, and \( PSR_{0} \) and \( PSR_{1} \) are two thresholds. When \( PSR_{0} < PSR_{t} < PSR_{1} \), the object is considered to be undergoing partial occlusion, fast motion or pose change. When \( PSR_{t} < PSR_{0} \), the object is considered completely occluded; the classifier is not updated, and the Kalman filter is used to predict the position of the object.
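The weight \( w_{t} \) of Eq. (9) can be sketched as below; the concrete threshold values are illustrative choices inside the ranges reported in Sect. 4 (\( PSR_{0} \in [1.2, 1.4] \), \( PSR_{1} \in [1.6, 1.8] \)).

```python
import math

def update_weight(psr_t, psr0=1.3, psr1=1.7):
    """Adaptive update weight w_t of Eq. (9); the classifier's effective
    learning rate for this frame becomes lambda * w_t."""
    if psr_t < psr0:
        return 0.0                                 # heavy occlusion: freeze the model
    if psr_t < psr1:
        return math.exp(-(psr_t - psr1) ** 2)      # partial occlusion / fast motion
    return 1.0                                     # confident: full update
```

The weight rises smoothly from the frozen regime toward 1 as the PSR approaches \( PSR_{1} \), so partially occluded frames still contribute, just less.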

3.2 Heavy Occlusion

In object tracking, occlusion, illumination change, fast motion and similar targets cannot be avoided, and when they occur the accuracy of many algorithms decreases markedly. The adaptive learning compressive tracking algorithm proposed above copes with partial occlusion and gradual illumination change, but it must be extended to handle heavy occlusion.

The Kalman filter [15] is mainly used to estimate the target's location and state. It takes the position and velocity of the object as the state vector to describe changes of the object's state, and it can also effectively reduce the influence of noise during tracking. The state equation and the observation equation of the Kalman filter are as follows:

$$ x_{t + 1} = \phi x_{t} + w_{t} $$
(10)
$$ Z_{t} = Hx_{t} + v_{t} $$
(11)

where \( x_{t} \) (\( x_{t + 1} \)) is the state vector at time \( t \) (\( t + 1 \)), \( Z_{t} \) is the observation vector at time \( t \), \( \phi \) is the state transition matrix, \( H \) is the observation matrix, \( w_{t} \) is the state noise vector of the system disturbance, and \( v_{t} \) is the observation noise vector.

As shown in Sect. 3.1, \( PSR_{t} < PSR_{0} \) indicates that the target is heavily occluded. Because the Kalman filter can predict the position of the target in the next frame and effectively reduce the influence of noise, we use it to handle this case.

The Kalman filter operates in two phases: prediction and updating. The prediction phase estimates the state of the next frame from the state of the current frame. The updating phase refines this estimate using the observation of the next frame, yielding a more accurate prediction. Assuming the target is heavily occluded at the (t + 1)-th frame, the Kalman filter is used to re-estimate the object position.

(1) Prediction phase

State prediction equation:

$$ x_{t + 1}^{ - } = \phi x_{t}^{ + } $$
(12)

where \( x_{t}^{ + } \) is the tracking result of the ALCT algorithm at the t-th frame.

Error covariance prediction equation:

$$ P_{t + 1}^{ - } = \phi P_{t}^{ + } \phi^{T} + Q $$
(13)

where \( P_{t}^{ + } \) is the covariance matrix at the t-th frame, and \( Q \) is the state noise covariance matrix, taken to be constant.

(2) Updating phase

Gain equation:

$$ K_{t + 1} = P_{t + 1}^{ - } H^{T} (HP_{t + 1}^{ - } H^{T} + R)^{ - 1} $$
(14)

where \( K_{t + 1} \) is the Kalman gain matrix, and \( R \) is the measurement noise covariance matrix, taken to be constant.

Error covariance modification equation:

$$ P_{t + 1}^{ + } = (I - K_{t + 1} H)P_{t + 1}^{ - } $$
(15)

State modification equation:

$$ x_{t + 1}^{ + } = x_{t + 1}^{ - } + K_{t + 1} (Z_{t + 1} - Hx_{t + 1}^{ - } ) $$
(16)

where \( Z_{t + 1} \) is the object position given by the ALCT algorithm at the (t + 1)-th frame, and \( x_{t + 1}^{ + } \) is the estimated object position at the (t + 1)-th frame.
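The predict/update cycle of Eqs. (12)-(16) can be sketched with the constant-velocity parameter values listed in Sect. 4 (state \( x = [p_x, p_y, v_x, v_y]^T \)); this is an illustrative Python version, not the paper's MATLAB code.

```python
import numpy as np

# Parameter values from Sect. 4: transition, observation and noise matrices.
PHI = np.array([[1., 0., 1., 0.],
                [0., 1., 0., 1.],
                [0., 0., 1., 0.],
                [0., 0., 0., 1.]])
H = np.array([[1., 0., 0., 0.],
              [0., 1., 0., 0.]])
Q = np.eye(4)                              # state noise covariance
R = np.array([[0.2845, 0.0045],
              [0.0045, 0.0045]])           # measurement noise covariance

def kalman_predict(x, P):
    """Eqs. (12)-(13): project the state and error covariance one frame ahead."""
    x_pred = PHI @ x
    P_pred = PHI @ P @ PHI.T + Q
    return x_pred, P_pred

def kalman_update(x_pred, P_pred, z):
    """Eqs. (14)-(16): correct the prediction with the observation z."""
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)    # gain, Eq. (14)
    x_new = x_pred + K @ (z - H @ x_pred)  # state modification, Eq. (16)
    P_new = (np.eye(4) - K @ H) @ P_pred   # covariance modification, Eq. (15)
    return x_new, P_new
```

With the large initial covariance \( P = 400I \) of Sect. 4, the gain is close to 1, so early corrections follow the observation closely and the covariance shrinks as frames accumulate.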

The ALCT algorithm then tracks the object in the next frame. If \( PSR_{t + 2} < PSR_{0} \), the target position in that frame is re-estimated by Eqs. (12)-(16); otherwise, ALCT continues tracking in the next frame. The flow chart of the adaptive learning compressive tracking algorithm based on the Kalman filter (ALCT-KF) is shown in Fig. 2.

Fig. 2. Flow of the ALCT-KF algorithm

First, the position of the target in the first frame is manually calibrated and the object is tracked by the ALCT algorithm. Then the PSR is computed from the target confidence map and used to update the Bayesian classifier. If the PSR falls below a certain threshold, the target is considered seriously occluded; the Kalman filter then predicts the position of the target in the current frame, and the predicted position is passed to the ALCT algorithm for tracking in the next frame.

4 Experiment

To validate the proposed algorithm, 6 challenging video sequences are adopted, covering occlusion, illumination change, pose change, fast motion and similar objects. We compare the proposed ALCT-KF algorithm with state-of-the-art methods: compressive tracking (CT) [13], online discriminative feature selection (ODFS) [16], spatio-temporal context learning (STC) [17] and tracking-learning-detection (TLD) [18]. All algorithms are implemented in MATLAB 2013a on a Core(TM) i5-4570 CPU with 4 GB RAM. As shown in Fig. 1, the thresholds of Eq. (9) are set to \( PSR_{0} \in [1.2, 1.4] \) and \( PSR_{1} \in [1.6, 1.8] \). The Kalman filter parameters are set to:

$$ \begin{aligned} & \phi = \left[ {\begin{array}{*{20}c} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ \end{array} } \right], \quad R = \left[ {\begin{array}{*{20}c} {0.2845} & {0.0045} \\ {0.0045} & {0.0045} \\ \end{array} } \right], \quad Q = \left[ {\begin{array}{*{20}c} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ \end{array} } \right], \quad H = \left[ {\begin{array}{*{20}c} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ \end{array} } \right] \\ & P = \left[ {\begin{array}{*{20}c} {400} & 0 & 0 & 0 \\ 0 & {400} & 0 & 0 \\ 0 & 0 & {400} & 0 \\ 0 & 0 & 0 & {400} \\ \end{array} } \right] \end{aligned} $$

Two metrics are used to evaluate the experimental results. The first is the success rate, defined as

$$ score = \frac{{area(ROI_{T} \cap ROI_{G} )}}{{area(ROI_{T} \cup ROI_{G} )}} $$
(17)

where \( ROI_{G} \) is the ground-truth bounding box and \( ROI_{T} \) is the tracking bounding box. If the score in a frame is larger than 0.5, the tracking result is considered a success.

Table 1 compares the success rates on the test videos. The proposed algorithm achieves the best or second-best performance. Compared with the CT algorithm, the average tracking success rate of ALCT-KF is improved by 15.7%. The last row of Table 1 gives the average frames per second: ALCT-KF performs well in speed (only slightly slower than CT) and is faster than the ODFS and TLD methods.

Table 1. Success rate (SR) (%) and average frames per second (FPS). (Top two results are shown in bold and italic.)

The second metric is the center location error (CLE), defined as the Euclidean distance between the center of the tracked object and that of the manually labeled ground truth.

$$ CLE = \sqrt {(x_{T} - x_{G} )^{2} + (y_{T} - y_{G} )^{2} } $$
(18)

The tracking error of the proposed algorithm is smaller than that of the other algorithms and stays within 15 pixels (see Fig. 3).
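The two evaluation metrics of Eqs. (17)-(18) can be sketched as follows; the (x, y, w, h) box convention and function names here are our own illustrative choices.

```python
import math

def overlap_score(box_t, box_g):
    """Eq. (17): intersection-over-union of the tracked box and the
    ground-truth box, each given as (x, y, w, h). A frame counts as a
    success when the score exceeds 0.5."""
    ax, ay, aw, ah = box_t
    bx, by, bw, bh = box_g
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))  # intersection width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))  # intersection height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def center_location_error(center_t, center_g):
    """Eq. (18): Euclidean distance between tracked and ground-truth centers."""
    return math.hypot(center_t[0] - center_g[0], center_t[1] - center_g[1])
```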

Fig. 3. Error plots in terms of center location error for the 6 test sequences. (Color figure online)

The object in the Dudek and FaceOcc2 sequences (see Fig. 4) undergoes partial and heavy occlusion. In the Cliff bar and Motocross sequences (see Fig. 5), abrupt motion and rotation change the object's appearance significantly and cause motion blur. The David and Pedestrian sequences (see Fig. 6) are challenging due to illumination variation and similar objects. These experiments show that the proposed algorithm effectively avoids tracking failure under occlusion, abrupt motion, motion blur, similar targets and other conditions.

Fig. 4. Tracking results on the occlusion sequences. (Color figure online)

Fig. 5. Sample tracking results on the abrupt motion and rotation sequences. (Color figure online)

Fig. 6. Tracking results under illumination variation and with similar objects. (Color figure online)

5 Conclusion

In this paper, an adaptive learning compressive tracking algorithm based on the Kalman filter is proposed to deal with occlusion, fast motion, rotation and similar objects. The tracking drift problem is alleviated by using the tracking confidence map to adaptively update the classifier model, and the Kalman filter is used to predict the object location and reduce the impact of noise. Experiments show that the proposed algorithm has good tracking accuracy and robustness, is easy to implement, and achieves real-time performance.