1 Introduction

Target tracking in videos, also known as visual tracking, plays a key role in many computer vision applications such as intelligent surveillance, intelligent transportation, and activity recognition. Although many algorithms have been proposed, several challenges remain that researchers are still interested in solving. Most visual tracking algorithms consist of three components:

  • Motion model: predicts the state of the target in the current frame.

  • Appearance model: represents the appearance of the target according to its visual characteristics.

  • Search method: combines the appearance and motion models to select the most likely target state.

The main challenge in designing a robust tracking algorithm is handling changes in the appearance of the target caused by blurring, non-uniform illumination, scale changes, and partial occlusion. Therefore, the appearance model is one of the key components of robust tracking and has received considerable attention in recent years [1].

In most previous tracking methods, the appearance model was based on templates [1,2,3] or subspaces [4]. However, these methods are not suitable when occlusion or drastic changes in the target appearance occur. Recently, several appearance modeling techniques based on sparse representation have been presented, which perform better in dealing with appearance corruption and especially occlusion [5,6,7,8,9]. Sparse representation has many attractive applications in computer vision, such as compressive sensing, dimensionality reduction, source separation, and super-resolution [10], as well as in other areas of signal processing such as classification [11] and cognitive radio [12]. The first sparse-representation-based tracking method was proposed in [5] by Mei and Ling; it has several unsolved problems, such as high computational cost, a small number of templates in the dictionary, and occlusion artifacts in the updated dictionary. Further efforts were made to solve these problems; e.g., [6, 7] address the shortcomings of [5]. However, as the experimental results show, both methods still have low accuracy in some scenarios. In this paper, inspired by [6, 7], an effective tracking method is proposed based on both block-based and pixel-based sparse representations; the results show improved accuracy and tracking stability.

The rest of this paper is organized as follows. Sparse tracking is described in Sect. 2 and the proposed method is presented in Sect. 3. Section 4 presents the experimental results, and the paper is concluded in Sect. 5.

2 Tracking Based on Sparse Representation

In this section, the basics of tracking based on sparse representation are introduced. In addition, two relevant and well-cited recent methods, the Sparse Prototypes Tracker (SPT) [6] and the Structured Sparse Representation Tracker (SSRT) [7], are discussed.

2.1 Original Sparse Representation Model

Mei and Ling proposed the first sparse-representation-based tracking method [5]. In this model, the target appearance is represented by a sparse linear combination of target and trivial templates, as shown in Fig. 1. They propose an algorithm (the \(l_1\) tracker) that casts tracking as finding the most likely patch under a sparse representation and handles partial occlusion with trivial templates. The trivial templates form an identity matrix and are exploited to model occlusion and noise in real-world observation data. More precisely, the observation vector \(\varvec{y}\in R^d\) can be written as:

$$\begin{aligned} \varvec{y} \cong \varvec{T}\varvec{a}+\varvec{e}=\left[ \begin{array}{cc} \varvec{T} &{} \varvec{I}\\ \end{array} \right] \left[ \begin{array}{cc} \varvec{a} \\ \varvec{e} \end{array} \right] =\varvec{D} \varvec{c} \end{aligned}$$
(1)

where \(T=\left[ \varvec{t}_1,\varvec{t}_2,...,\varvec{t}_m \right] \in R^{d \times m}\) (\(d\gg m\)) is the set of training templates and \(I \in R^{d \times d}\) is the matrix of trivial templates, so that \(\left[ \begin{array}{cc} \varvec{T} &{} \varvec{I}\\ \end{array} \right] \) can be regarded as the representation dictionary. The vector \(\varvec{a} \in R^m\) holds the representation coefficients and \(\varvec{e} \in R^d\) is the error vector, which indicates partial occlusion. Occlusion covers only a portion of the target appearance, so the error vector \(\varvec{e}\), and consequently \(\varvec{c} \in R^{d+m}\), can be assumed sparse [5]. To find the sparse vector \(\varvec{c}\), the following minimization problem is solved,

$$\begin{aligned} \mathop {\min }\limits _{\varvec{c}} \frac{1}{2} \left\| \varvec{D} \varvec{c} - \varvec{y} \right\| _2^2 + \lambda \left\| \varvec{c} \right\| _1 \end{aligned}$$
(2)

where \(\left\| \cdot \right\| _2\) and \(\left\| \cdot \right\| _1\) denote the \(l_2\) and \(l_1\) norms, respectively.
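
The minimization (2) can be solved with any standard \(l_1\) solver. As an illustration only (not necessarily the solver used in [5]), the following minimal Python/NumPy sketch uses ISTA (iterative soft-thresholding), a standard method for \(l_1\)-regularized least squares. Here D stacks the target and trivial templates as in (1); the function names and iteration count are illustrative.

```python
import numpy as np

def soft_threshold(x, t):
    """Element-wise soft-thresholding, the proximal operator of t*||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def l1_min_ista(D, y, lam, n_iter=500):
    """Solve min_c 0.5*||D c - y||_2^2 + lam*||c||_1 with ISTA."""
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the gradient
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ c - y)         # gradient of the quadratic term
        c = soft_threshold(c - grad / L, lam / L)
    return c
```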

Fig. 1. Original sparse representation model for target tracking [5].

2.2 Sparse Prototypes Tracker (SPT)

In [6], Wang et al. proposed an extension called the Sparse Prototypes Tracker (SPT) for target representation. It exploits the strengths of both subspace learning and sparse representation for modeling object appearance: the target appearance is modeled with PCA basis vectors \(\varvec{U}\), and occlusion is accounted for with trivial templates \(\varvec{I}\) by

$$\begin{aligned} \varvec{y} \cong \varvec{U}\varvec{z}+\varvec{e}=\left[ \begin{array}{cc} \varvec{U} &{} \varvec{I}\\ \end{array} \right] \left[ \begin{array}{cc} \varvec{z} \\ \varvec{e} \end{array} \right] \end{aligned}$$
(3)

where \(\varvec{z}\) denotes the coefficients of the basis vectors. In this formulation, the prototypes consist of the PCA basis vectors together with the trivial templates. Since only a small number of PCA basis vectors is used, \(\varvec{z}\) is dense and the appearance problem is modified as in (4) below. Figure 2 shows the difference between the representations of [5] and SPT [6], where the target templates are replaced by the PCA basis.

Fig. 2. Sparse representation models for target tracking. (a) Original [5], (b) Sparse Prototypes Tracker (SPT) [6].

$$\begin{aligned} \mathop {\min }\limits _{\varvec{z}, \varvec{e}} \frac{1}{2} \left\| \varvec{y} - \varvec{U} \varvec{z} - \varvec{e} \right\| _2^2 + \lambda \left\| \varvec{e} \right\| _1 \end{aligned}$$
(4)

Clearly, the number of basis vectors used in the matrix \(\varvec{U}\) affects the accuracy.
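
To make the formulation concrete, below is a minimal alternating-minimization sketch for (4), assuming the PCA basis \(\varvec{U}\) has orthonormal columns; it illustrates the structure of the problem rather than the exact procedure of [6]. With \(\varvec{e}\) fixed, \(\varvec{z}\) has a closed-form least-squares solution; with \(\varvec{z}\) fixed, \(\varvec{e}\) is obtained by soft-thresholding the residual.

```python
import numpy as np

def spt_solve(U, y, lam, n_iter=50):
    """Alternating minimization sketch for Eq. (4); assumes U.T @ U = I."""
    e = np.zeros_like(y)
    for _ in range(n_iter):
        z = U.T @ (y - e)                     # closed-form least squares in z
        r = y - U @ z                         # residual to be explained by e
        e = np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)  # soft-thresholding
    return z, e
```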

2.3 Structured Sparse Representation Tracker (SSRT)

In SPT [6], the authors only use information from individual pixels and do not exploit any predetermined assumptions about the structure of the sparse coefficients. However, group (structured) sparsity performs better than plain sparsity [9]; in other words, having prior knowledge of the signal's structure and exploiting it can yield better results. The Structured Sparse Representation Tracker (SSRT) was proposed in [7] by Bai and Li under the assumptions of contiguous occlusion and prior knowledge of the dictionary structure. As shown in Fig. 3, the observed sample and each training template are first partitioned into R local parts, so that contiguous occlusion (highlighted in red) can be stacked (grouped) into a block-sparse vector with clustered nonzero entries. The partitioned regions are then stacked into 1-D vectors \(\varvec{y}\). Corresponding structuring must also be applied to the PCA or subspace templates. More details can be found in [7].

Fig. 3. A simple illustration of structured sparse representation. (a) Observed holistic sample or template, (b) partition of the sample into local areas, (c) conversion of the local areas into vectors stacked in one observation vector, (d) block-structured basis. (Color figure online)
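
The partition-and-stack operation illustrated in Fig. 3 can be sketched as follows; the function assumes the patch dimensions are divisible by the block grid, and all names are illustrative.

```python
import numpy as np

def partition_and_stack(patch, br, bc):
    """Partition a 2-D patch into a br x bc grid of local blocks and stack
    them into one 1-D observation vector with block-contiguous entries."""
    h, w = patch.shape
    bh, bw = h // br, w // bc              # block size (assumes divisibility)
    blocks = []
    for i in range(br):
        for j in range(bc):
            block = patch[i*bh:(i+1)*bh, j*bw:(j+1)*bw]
            blocks.append(block.ravel(order="F"))  # column-major, Matlab-like
    return np.concatenate(blocks)
```

Entries belonging to the same block stay contiguous in the stacked vector, so a contiguous occlusion maps to clustered nonzero entries of the error vector.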

3 Proposed Tracking Algorithm

In this section, our method, which builds on SPT and SSRT, is explained. The proposed appearance model is defined first, then the particle filter tracking framework is adapted to this model, and finally the procedure for updating the appearance model is discussed.

3.1 Proposed Appearance Model

In tracking with the structured sparse representation model, the occlusion geometry is unknown, so the sample is partitioned into predefined blocks regardless of that geometry. When occlusion does not completely fill a block (e.g., the 7th block in Fig. 3), the block must be declared either clean (occlusion-free) or fully occluded, and this all-or-nothing decision criterion weakens the tracking accuracy.

In order to solve this problem, we propose to represent the appearance model of the target by using the original (pixel-wise) sparse representation and the group sparse representation simultaneously. In this model, the \(l_{2,1}\) and \(l_1\) norms represent group and individual-pixel sparsity, respectively. The proposed sparse tracking model is:

$$\begin{aligned} \mathop {\min }\limits _{\varvec{z}, \varvec{e}} \frac{1}{2} \left\| \bar{\varvec{y}} - \varvec{U} \varvec{z} - \varvec{e} \right\| _2^2 + \lambda _1 \left\| \varvec{e} \right\| _1 + \lambda _2 \left\| \varvec{e} \right\| _{2,1} \end{aligned}$$
(5)

where \(\varvec{U}\) is the PCA subspace extracted from the target templates. Also, \(\varvec{e}=\left[ \varvec{e}^{1^T}, \varvec{e}^{2^T}, \cdots , \varvec{e}^{J^T} \right] ^T\) is the error vector, where J is the total number of blocks and \(\varvec{e}^j\) is the error vector of the jth block. In this fashion, a smaller amount of data can model more states of the object. The vector \(\bar{\varvec{y}}\) is the centered observation vector, i.e. \(\bar{\varvec{y}} = \varvec{y}- \varvec{\mu }\), where \(\varvec{\mu }\) is the mean vector of the training space. The subspace coefficients \(\varvec{z}\) and the sparse error vector \(\varvec{e}\) are to be found, while \(\varvec{e}\) is subject to both pixel-based and group-based sparsity. Pixel-based sparsity of the error vector is enforced by \(\left\| \varvec{e} \right\| _1\), and group sparsity is computed by

$$\begin{aligned} \left\| {\varvec{e}} \right\| _{2,1}= \sum _{j=1}^J {\left\| \varvec{e}^j \right\| _2}. \end{aligned}$$
(6)

The coefficients \(\lambda _1\) and \(\lambda _2\) control the degree of pixel-based and group-based sparsity, respectively.
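
As stated in Sect. 4, (5) is solved with CVX in Matlab; a rough sketch of the same program in cvxpy (the Python analogue of CVX), assuming equal-sized blocks, could look like the following. The block size and function name are illustrative, not the paper's actual implementation.

```python
import numpy as np
import cvxpy as cp

def solve_model(U, y_bar, lam1, lam2, J):
    """Sketch of Eq. (5): least squares plus l1 and block-wise l2,1 penalties."""
    d, m = U.shape
    bs = d // J                                # block length (equal blocks)
    z, e = cp.Variable(m), cp.Variable(d)
    group = sum(cp.norm(e[j*bs:(j+1)*bs], 2) for j in range(J))  # ||e||_{2,1}
    obj = 0.5 * cp.sum_squares(y_bar - U @ z - e) \
          + lam1 * cp.norm(e, 1) + lam2 * group
    cp.Problem(cp.Minimize(obj)).solve()
    return z.value, e.value
```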

3.2 Particle Filter

For robust tracking, we exploit the proposed appearance model within the particle filter tracking framework to estimate the target's state [13]. The motion model in the particle filter is a Gaussian distribution around the target's state in the previous frame. This means

$$\begin{aligned} p(\varvec{x}_t | \varvec{x}_{t-1})=\mathcal {N}(\varvec{x}_t;\varvec{x}_{t-1},\varPsi ) \end{aligned}$$
(7)

where \(\varvec{x}_t\) is the target's state vector at the tth frame and \(\varPsi \) is the covariance matrix of the state variables. The state vector \(\varvec{x}_t=(x_t, y_t, \theta _t, s_t, \alpha _t, \phi _t)\) contains six parameters, where \(x_t, y_t, \theta _t, s_t, \alpha _t, \phi _t\) denote the x and y translations, rotation angle, scale, aspect ratio, and skew, respectively. The observation likelihood function is calculated as:

$$\begin{aligned} p(\varvec{y}_t | \varvec{x}_{t})=\exp (-\left\| \varvec{y}_t-\hat{\varvec{y}}_{t} \right\| _2^2) \end{aligned}$$
(8)

where \(\hat{\varvec{y}}_{t}\) is the prediction of the observed sample in the tth frame based on state \(\varvec{x}_{t}\). The formula \(\hat{\varvec{y}}_{t}=\varvec{T} \varvec{a}\) is commonly used in the tracking literature for particle filtering. However, we propose a modified observation model inspired by [6] as follows.

$$\begin{aligned} p(\varvec{y}_t | \varvec{x}_{t})=\exp \left( - \left[ \left\| \varvec{y}_t- \varvec{U} \varvec{z} - \varvec{e} \right\| _2^2 + \lambda _2 \, NOEB \right] \right) =\exp \left( - \left[ \mathrm {term1} + \mathrm {term2} \right] \right) \end{aligned}$$
(9)

As mentioned before, a similar criterion is proposed in the SPT method [6], but its reconstruction error (term1) is calculated only over the pixels without occlusion. In Eq. (9), we additionally consider the number of occluded blocks via term2 as NOEB, the sum of the Error Numbers of all blocks in an observed sample. Figure 3 illustrates the concept of an observed sample containing several blocks. Supposing the number of blocks in each observed sample is J, then \(NOEB=\sum _{j=1}^J \gamma _j\), where \(\gamma _j\) is the Error Number of the jth block, computed as follows.

$$\begin{aligned} \gamma _j=\frac{number \ of \ occluded \ pixels\ in\ the\ jth \ block}{number\ of\ pixels\ in\ the\ jth \ block} \end{aligned}$$
(10)

In addition, two thresholds \(tr_L\) and \(tr_H\) are used to define three cases for the Error Number, as follows (a small code sketch is given after the list).

  • If \(\gamma _j \le tr_L \), the block is considered error-free and the Error Number is set to \(\gamma _j=0\).

  • If \(\gamma _j \ge tr_H \), the block is considered completely erroneous and the Error Number is set to \(\gamma _j=1\).

  • If \(tr_L< \gamma _j < tr_H \), some of the pixels in the block contain errors and the Error Number keeps its computed value \(\gamma _j\).
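
The following minimal sketch computes the per-block Error Numbers and NOEB from the error vector of (5). The occlusion test (|e_i| > eps) and the equal-block assumption are illustrative choices; the paper does not specify how occluded pixels are detected from \(\varvec{e}\).

```python
import numpy as np

def error_numbers(e, J, tr_l=0.25, tr_h=0.75, eps=1e-3):
    """Per-block Error Numbers gamma_j (Eq. (10) plus thresholding) and NOEB."""
    gammas = []
    for block in np.array_split(e, J):
        gamma = float(np.mean(np.abs(block) > eps))  # occluded-pixel fraction
        if gamma <= tr_l:
            gamma = 0.0                              # error-free block
        elif gamma >= tr_h:
            gamma = 1.0                              # completely occluded block
        gammas.append(gamma)
    return gammas, sum(gammas)                       # NOEB = sum of gamma_j
```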

3.3 Updating Appearance Model

Because the appearance of the target changes during a tracking sequence, it is not reasonable to use a fixed subspace to represent it; updating the appearance model dynamically can therefore improve the tracking performance. It is important to update with correct templates, i.e. templates free of errors such as occlusion and background clutter, so the first step of the updating procedure is to select the correct templates. We propose to use a local analysis along with a global analysis. Suppose that the observed sample \(\varvec{y}\) is selected by the particle filter and the corresponding error vector \(\varvec{e}\) is computed by (5).

In the global analysis, if the number of occluded blocks in the selected error vector \(\varvec{e}\) is greater than a certain threshold, the sample \(\varvec{y}\) is rejected and not used for updating. Otherwise, the sample is used to update the subspace after the following local analysis. For each block j (a code sketch follows the list):

  • If \(\gamma _j \le tr_L \), all pixels of the block remain unchanged.

  • If \(\gamma _j \ge tr_H \), all pixels of the block are replaced with the corresponding part of the subspace mean vector \(\varvec{\mu }\).

  • If \(tr_L< \gamma _j < tr_H \), only the occluded pixels are replaced by the corresponding values of the mean vector \(\varvec{\mu }\); the other pixels are left unchanged.
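
A minimal sketch of this local analysis, under the same illustrative assumptions as the NOEB sketch above (equal-sized blocks, occlusion mask from a threshold on \(\varvec{e}\)):

```python
import numpy as np

def correct_template(y, e, mu, J, tr_l=0.25, tr_h=0.75, eps=1e-3):
    """Repair an accepted sample block by block before the subspace update."""
    y = y.copy()
    bs = len(y) // J                           # block length (equal blocks)
    for j in range(J):
        sl = slice(j * bs, (j + 1) * bs)
        occ = np.abs(e[sl]) > eps              # occluded-pixel mask of block j
        gamma = occ.mean()
        if gamma >= tr_h:
            y[sl] = mu[sl]                     # replace the whole block
        elif gamma > tr_l:
            y[sl][occ] = mu[sl][occ]           # replace only occluded pixels
    return y                                   # gamma <= tr_l: left unchanged
```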

After determining the correct templates by the above procedure and collecting a batch of them (e.g., 5 corrected templates), they are used to update the subspace \(\varvec{U}\) and the mean vector \(\varvec{\mu }\) with the incremental learning algorithm presented in [4].

4 Experimental Results

The proposed tracking algorithm was implemented in Matlab, using CVX to solve (5) [14, 15]. In order to evaluate its performance, four sequences with different tracking challenges are selected, as shown in Table 1.

Table 1. Dataset characteristics: length and challenges

The results of the proposed algorithm are compared with two other sparse tracking algorithms, SPT (2013) [6] and SSRT (2012) [7]. While the SPT simulation code was written by its authors and is available for use, the simulation code for SSRT was implemented by ourselves.

Each observed sample is resized to \(32 \times 32\) pixels for SPT and our method, and to \(15 \times 12\) for SSRT. Each observed sample is partitioned into 64 blocks for our method and 6 blocks for SSRT. The particle filter uses 600 particles. In all experiments, \(\lambda _1 = 0.02\), \(\lambda _2=0.27\), \(tr_L=0.25\) and \(tr_H=0.75\), and the PCA subspace has 10 basis vectors.

In order to evaluate and compare the proposed method with the other algorithms quantitatively, overlap diagrams are drawn. The Overlap Rate (OLR) [16, 17], the overlap between the detected target area and the area specified by the ground truth, is defined as

$$\begin{aligned} OLR=\frac{area\left( ROI_{TR} \cap ROI_{GT}\right) }{area\left( ROI_{TR} \cup ROI_{GT}\right) } \end{aligned}$$
(11)

where \(ROI_{TR}\) is the target ROI produced by the tracking algorithm and \(ROI_{GT}\) is the corresponding correct area in the ground truth. In addition, the Center Location Error (CLE), the Euclidean distance between the centers of the detected target and the ground truth, is computed.
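
For axis-aligned boxes in (x, y, width, height) form, the OLR of (11) reduces to the standard intersection-over-union; a minimal sketch follows (the trackers' actual ROIs may be affine-warped regions, for which the areas must be computed geometrically).

```python
def overlap_rate(box_tr, box_gt):
    """OLR of Eq. (11) for axis-aligned boxes given as (x, y, w, h)."""
    x1 = max(box_tr[0], box_gt[0])
    y1 = max(box_tr[1], box_gt[1])
    x2 = min(box_tr[0] + box_tr[2], box_gt[0] + box_gt[2])
    y2 = min(box_tr[1] + box_tr[3], box_gt[1] + box_gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)    # intersection area
    union = box_tr[2] * box_tr[3] + box_gt[2] * box_gt[3] - inter
    return inter / union
```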

Figure 4 shows the results on the “David” sequence, which contains changes in both illumination and target state. The Overlap Rate diagrams in Fig. 4(b) show that the proposed algorithm outperforms the other algorithms.

Figure 5 shows the results on the “faceocc2” sequence, in which partial occlusions occur along with rotation of the target. The performance of the proposed algorithm is similar to that of the other algorithms until the 700th frame, but after a large occlusion (around the 700th frame) the proposed algorithm performs much better than the other two.

Fig. 4. Tracking results of the proposed tracker, the SPT tracker and the SSRT tracker on the “David” sequence. (a) Qualitative evaluation, (b) overlap rate diagram.

Fig. 5. Tracking results of the proposed tracker, the SPT tracker and the SSRT tracker on the “Faceocc2” sequence. (a) Qualitative evaluation, (b) overlap rate diagram.

Figure 6 concerns the “car6” sequence, where the vehicle moves and a large occlusion occurs at the 280th frame. While SPT and SSRT fail to track, the proposed algorithm tracks the target successfully in all frames of this sequence.

Figure 7 shows the results on the “jumping” sequence. The target moves fast and is blurred in most frames. The Overlap Rate diagrams in Fig. 7(b) show the performance of the different trackers. It can be inferred from Fig. 7 that the proposed algorithm performs similarly to SPT and much better than SSRT.

Fig. 6. Tracking results of the proposed tracker, the SPT tracker and the SSRT tracker on the “Car6” sequence. (a) Qualitative evaluation, (b) overlap rate diagram.

Fig. 7. Tracking results of the proposed tracker, the SPT tracker and the SSRT tracker on the “jumping” sequence. (a) Qualitative evaluation, (b) overlap rate diagram.

Table 2. Average center location error (CLE) of different methods for considered videos.
Table 3. Average overlap rate (OLR) of different methods for considered videos.

Furthermore, quantitative measurements are reported for each sequence and algorithm. The averages of CLE and OLR over all frames are reported in Tables 2 and 3 for each sequence; the lowest average CLE and the highest average OLR are shown in bold. The results show that the proposed algorithm and criteria improve the performance of sparsity-based tracking.

Finally, new criteria were proposed in (5) and (9) in this paper to represent and track objects more precisely and robustly. For practical use, however, it is necessary to handle real-time sequences; we therefore intend to extend this work with a new and fast solution algorithm in the future. Moreover, using a single adaptive (PCA-based) subspace to represent objects is not comprehensive; training additional or nonlinear spaces could improve the representation capability. Extracting background information from the image and taking it into account could also help to model and find new representation spaces.

5 Conclusion

This paper proposed a robust tracking algorithm using sparse representation along with particle filtering. In order to represent the target's appearance, simultaneous pixel-based and block-based sparse representations are considered. Based on the blocking and grouping concepts used to develop the appearance model, a new observation model for the particle filter is suggested as well. Finally, a simple additional criterion is proposed to select and correct the templates used for updating the PCA subspace. Experiments show the robustness of the proposed tracker against major challenges such as occlusion, illumination changes, scaling and rotation, and demonstrate better performance in comparison with the recent SPT and SSRT algorithms.