1 Introduction

Object tracking algorithms can be broadly categorized into generative, discriminative and hybrid approaches. A generative approach models the target appearance and localizes the target by optimizing a (dis)similarity measure between the target and the candidate(s) [1,2,3,4]. A discriminative approach learns classifiers from target and background appearance features in a supervised framework [5,6,7]. A hybrid approach combines both by modeling the target appearance in a generative framework while discriminating it from the background [8].

Among generative approaches, the first notable work that makes use of a sparse representation based target model was proposed by Mei et al. [4]. Tracking was formulated as an L1 minimization problem in which the candidate model is represented as a sparse linear combination of object and trivial templates. Thereafter, a number of variants of this basic formulation have been successfully applied to object tracking [9,10,11,12,13]. Successful use of classifiers in tracking has been demonstrated in [6, 7].

We propose a hybrid approach that employs sparse representation based target modeling and uses a classifier to better discriminate the target model from that of the background. We extract patches from both the background and the foreground and learn dictionaries on these patches using K-SVD [14] and spherical clustering. We then learn an SVM classifier to discriminate the sparse codes of the object from those of the background. Sparse codes extracted from the (multiple, overlapping) patches of non-overlapping cells are weighted by their classification scores to compute cell histograms. These cell histograms together form the object model. The average of the cell histogram similarities between the target and candidate models is maximized in a particle filter framework [15]. The main contributions of this work are as follows.

  • Proposal of a target model in a hybrid generative-discriminative approach.

  • Object representation by histogram of sparse codes (HSC) obtained from foreground-background dictionaries (generative model).

  • Enhancing the discrimination of the object model by weighting the HSC with SVM patch classification scores (discriminative model).

The rest of the paper is organized as follows. The proposed approach is elaborated in Sect. 2. Experimental results are presented in Sect. 3. Finally, Sect. 4 summarizes the present work and sketches future extensions.

2 Proposed Work

The object rectangle is divided into non-overlapping regions called cells. The proposed target model is learned in two stages. The first stage involves generative modeling: object and background patches are used to learn two different dictionaries using K-SVD [16], which learns a dictionary by solving the following sparsity-constrained optimization problem

$$\begin{aligned} \underset{ \varvec{\varGamma }, \mathbf {D} }{\min } ||\mathbf {X} -\mathbf {D}\varvec{\varGamma } ||_F^2 \ \text {subject to } \forall i \ ||\varvec{\gamma }_i||_0 \le T \end{aligned}$$
(1)

where \(\varvec{\gamma }_i\) is the sparse code vector of the \(i^{th}\) signal, \(\mathbf {D}\) is the dictionary, \(\mathbf {X}\) is the set of signals, \(\varvec{\varGamma }\) contains the sparse codes of all the signals and T is the sparsity threshold. The two learned dictionaries are then used to form a combined foreground-background dictionary. In the second stage, the sparse codes of background and foreground patches are used to train an SVM classifier in a discriminative framework. The patch classification scores obtained from this SVM provide further discrimination in the object model. The sparse codes and the classification scores of the patches are used to compute the weighted sparse code histograms of the cells, which together form the object model. The candidate models are constructed from object state (position, rotation and scaling) proposals obtained through particle filtering. Next, we discuss dictionary learning in detail.
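For concreteness, a minimal K-SVD sketch for Eq. (1) is given below (Python, using NumPy and scikit-learn's OMP solver). It is an illustrative re-implementation of the standard algorithm, not the authors' code; the random initialization and the fixed iteration count are assumptions.

```python
# Minimal K-SVD sketch for Eq. (1): alternate OMP sparse coding and rank-1
# (SVD-based) atom updates. Names (X, D, Gamma, T) mirror the text.
import numpy as np
from sklearn.linear_model import orthogonal_mp

def ksvd(X, n_atoms, T, n_iter=10, seed=0):
    """X: (n, N) column-wise signals. Returns dictionary D (n, n_atoms) and codes Gamma."""
    rng = np.random.default_rng(seed)
    # Initialize D with randomly chosen (re-normalized) signal columns.
    D = X[:, rng.choice(X.shape[1], n_atoms, replace=False)].copy()
    D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
    Gamma = np.zeros((n_atoms, X.shape[1]))
    for _ in range(n_iter):
        # Sparse coding step: at most T non-zero coefficients per signal.
        Gamma = orthogonal_mp(D, X, n_nonzero_coefs=T)
        # Dictionary update step: refit each atom (and its coefficients) to the
        # residual of the signals that actually use it.
        for k in range(n_atoms):
            users = np.flatnonzero(Gamma[k, :])
            if users.size == 0:
                continue
            E = X[:, users] - D @ Gamma[:, users] + np.outer(D[:, k], Gamma[k, users])
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]
            Gamma[k, users] = s[0] * Vt[0, :]
    return D, Gamma
```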

2.1 Learning Foreground-Background Dictionaries

A single dictionary learned from only object patches might provide good reconstruction but poor recognition against background clutter. Rectangular patches are extracted from the minimum bounding box of the object (\(\mathbf {bb}^{obj}\)) and from a background region (\(\mathbf {bb}^{bg}\)) around \(\mathbf {bb}^{obj}\). The extracted patches are first vectorized and magnitude normalized (using the l2-norm). Next, they are arranged to construct the input data matrices \(\mathbf {X}^{obj} \in \mathcal {R}^{n \times np}\) and \(\mathbf {X}^{bg} \in \mathcal {R}^{n \times nn}\) of the object (positive) and background (negative) classes respectively. Here, np and nn are the total numbers of object and background patches respectively and n is the dimension of the patch vector. Magnitude normalization helps to make the object model robust to illumination changes. Spherical k-means clustering is performed separately on the object patch vectors \(\mathbf {X}^{obj}\) and the background patch vectors \(\mathbf {X}^{bg}\). The dominant clusters are selected and the K-SVD algorithm is performed on each of them to obtain m representative atoms per cluster. These atoms are stacked together to form the foreground-background dictionary. Patches common to the foreground and background may lead to drift in tracking. In order to reduce the effect of such patches, we introduce discriminability through a binary classifier. The dictionary learning procedure is shown in Fig. 1.
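A sketch of this dictionary construction is given below, reusing the ksvd() helper from the previous listing. The patch extraction stride, the approximation of spherical k-means by k-means on l2-normalized vectors, and the dominant-cluster threshold (min_members) are illustrative assumptions.

```python
# Foreground-background dictionary: vectorize and normalize patches, cluster them,
# then run K-SVD on each dominant cluster and stack the resulting atoms.
import numpy as np
from sklearn.cluster import KMeans

def extract_patch_vectors(region, patch_size=5, stride=2):
    """Vectorize and l2-normalize overlapping patches from a grayscale region."""
    H, W = region.shape
    patches = [region[r:r + patch_size, c:c + patch_size].ravel()
               for r in range(0, H - patch_size + 1, stride)
               for c in range(0, W - patch_size + 1, stride)]
    X = np.asarray(patches, dtype=float).T                   # (n, num_patches)
    return X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-12)

def cluster_dictionary(X, K, m, T, min_members=10):
    """k-means on l2-normalized vectors (spherical clustering), K-SVD per dominant cluster."""
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(X.T)
    atom_sets = []
    for k in range(K):
        members = X[:, labels == k]
        if members.shape[1] < min_members:                   # skip weak clusters
            continue
        Dk, _ = ksvd(members, n_atoms=min(m, members.shape[1]), T=T)
        atom_sets.append(Dk)
    return np.hstack(atom_sets)

# Combined dictionary D: object-cluster atoms followed by background-cluster atoms,
# with K = 100, m = 3 and T = 3 as reported in Sect. 3.
# D = np.hstack([cluster_dictionary(X_obj, 100, 3, 3), cluster_dictionary(X_bg, 100, 3, 3)])
```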

Fig. 1.

Patches extracted from the foreground and background regions are vectorized and magnitude normalized. These are further grouped using spherical clustering. K-SVD performed on the cluster members provides different atom sets. The atom sets obtained from the background and object clusters are combined to form a single dictionary.

2.2 Classifier Learning

A binary classifier is learned on the sparse codes generated by the K-SVD algorithm with the foreground-background dictionary for the object and background patches. The learned classifier provides confidence scores for object patches; these scores are used in constructing the proposed object model. Let \(\varvec{\varGamma }^{obj}\) and \(\varvec{\varGamma }^{bg}\) be the sparse codes corresponding to the patch vectors of the positive set \(\mathbf {X}^{obj}\) and the negative set \(\mathbf {X}^{bg}\) respectively. An SVM classifier is learned on these sparse code representations of the patch vectors. The sparse vector corresponding to an ambiguous patch will lie closer to the classification boundary and will have a lower classification score than those of clear object patches. This improves the discriminative power of the object model against background and ambiguous patches. The proposed object model is explained next.
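A minimal sketch of this stage is shown below: a linear SVM trained on the OMP sparse codes of object (positive) and background (negative) patches, plus a helper that turns SVM decision values into normalized patch weights. The choice of LinearSVC and the min-max score normalization are assumptions; the paper only specifies that an SVM is trained on the sparse codes.

```python
# Train an SVM on sparse codes and expose normalized confidence scores omega_i.
import numpy as np
from sklearn.linear_model import orthogonal_mp
from sklearn.svm import LinearSVC

def train_patch_classifier(D, X_obj, X_bg, T):
    """D: (n, l) foreground-background dictionary; X_obj/X_bg: (n, np)/(n, nn) patch vectors."""
    G_obj = orthogonal_mp(D, X_obj, n_nonzero_coefs=T)    # Gamma^obj, shape (l, np)
    G_bg = orthogonal_mp(D, X_bg, n_nonzero_coefs=T)      # Gamma^bg, shape (l, nn)
    codes = np.hstack([G_obj, G_bg]).T
    labels = np.hstack([np.ones(G_obj.shape[1]), -np.ones(G_bg.shape[1])])
    return LinearSVC(C=1.0).fit(codes, labels)

def patch_scores(svm, Gamma):
    """Signed distances to the SVM boundary, rescaled to [0, 1] to act as patch weights."""
    s = svm.decision_function(Gamma.T)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)
```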

2.3 Object Model

The object bounding box is first divided into non-overlapping cells \(\mathbf {C} = \{ \mathbf {c}_i: \mathbf {c}_i \in \mathcal {R}^{cw \times ch} \}\), where cw and ch are the width and height of a cell respectively. The sparse codes of the rectangular patches within a cell, computed using Orthogonal Matching Pursuit (OMP), are used to compute its histogram of sparse codes. The set of all cell histograms defines the object model, i.e. \(\mathbf {H} =\{\mathbf {h}^i\}, \ i=1,\ldots ,|C |\). The sparse codes required for computing the object model are obtained as

$$\begin{aligned} \underset{\varGamma }{\text {min}} \ ||\mathbf {X}^{obj} -\mathbf {D} \varvec{\varGamma } ||_F^{2} \ \text { s.t. } ||\varvec{\varGamma } ||_{0} \le \gamma _{s} \end{aligned}$$
(2)

where \(\varvec{\varGamma }\) is the sparse code matrix, \(\mathbf {D}\) is the foreground-background dictionary and \(\gamma _{s} \le m\) is the sparsity constraint. The sparse code histogram of a cell is created from the sparse codes \(\varvec{\varGamma }\) of its component patches as

$$\begin{aligned} \mathbf {h}^c(j) = L\sum _{i=1}^{nc} |\varvec{\gamma }_{ij} |\ \omega _i , \ \mathbf {x}_{i} \in \mathbf {X}^{obj} \end{aligned}$$
(3)

where \(\mathbf {x}_i\) is the \(i^{th}\) object patch belonging to cell c, \(\omega _i\) is the normalized classification score of the \(i^{th}\) patch as given by the classifier, nc is the total number of patches in the cell and L is the normalization constant of the histogram. The cell-wise histograms are stacked together to form the object model \(\mathbf {H}\), a collection of classifier-weighted histograms of sparse codes given by \( \mathbf {H} = [\ \mathbf {h}^{c_1} \ \mathbf {h}^{c_2}\ \dots \ \mathbf {h}^{c_P} \ ] \), where P is the total number of cells. The entire object model creation is depicted in Fig. 2. The particle filter framework for target tracking is explained next.
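A short sketch of Eqs. (2) and (3) is given below, reusing patch_scores() from the previous listing: the patches of one cell are sparse coded with OMP against the foreground-background dictionary, weighted by their normalized SVM scores, and accumulated into a histogram over the dictionary atoms. The l1 normalization (i.e. the particular choice of L) is an assumption.

```python
# Classifier-weighted histogram of sparse codes for a single cell, Eqs. (2)-(3).
import numpy as np
from sklearn.linear_model import orthogonal_mp

def cell_histogram(D, X_cell, svm, gamma_s):
    """X_cell: (n, nc) patch vectors of one cell; returns a histogram of length l (#atoms)."""
    Gamma = orthogonal_mp(D, X_cell, n_nonzero_coefs=gamma_s)   # Eq. (2), shape (l, nc)
    omega = patch_scores(svm, Gamma)                            # normalized SVM weights
    h = np.abs(Gamma) @ omega                                   # Eq. (3) before normalization
    return h / (h.sum() + 1e-12)                                # L chosen so that h sums to one

# Object model H: one weighted histogram per cell,
# H = np.stack([cell_histogram(D, X_c, svm, gamma_s) for X_c in cell_patch_sets])
```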

Fig. 2.

Object model as a set of weighted sparse code histograms computed from the non-overlapping cells of the object bounding box \(\mathbf {bb}^{obj}\). Sparse codes of the patches are computed using OMP (Orthogonal Matching Pursuit) with the learned dictionary. The sparse code vectors of the object and background patches are used for training the classifier. The classification-score-weighted sparse codes are then used to compute the histogram of each cell.

2.4 Particle Filter

The particle filter, also known as sequential Monte Carlo sampling, is used for object localization in tracking. It estimates the posterior distribution of the state of a dynamic system. The particle with the maximum a posteriori probability is selected as the best particle and is taken as the state of the object in the current frame. Here, the object state \(\mathbf {s} \in \mathcal {R}^5\) is defined as \( \mathbf {s} = [ x^c \ y^c \ w \ h \ \theta ]^T \), where \((x^c, y^c)\) are the image plane co-ordinates of the object bounding box centroid and \(w,h,\theta \) are the width, height and orientation of the object respectively. The motion model defines the temporal evolution of the state; we consider a simple random walk as our motion model. The current state is assumed to be sampled from a Gaussian distribution centered at the previous state, \( \mathbf {s}_{t} \sim \mathcal {N}(\mathbf {s}_{t-1}, \varvec{\varSigma } ) \), where \(\varvec{\varSigma }\) is the diagonal covariance matrix of the state variables given by \(diag( \sigma _{x}^2, \sigma _{y}^2, \sigma _{w}^2, \sigma _{h}^2,\sigma _{\theta }^2)\). The observation probability is defined as the similarity between the target model and the candidate model of a particle. We use the average of the Bhattacharyya coefficients (\(\rho \)) of the cell histograms as the observation probability, \(p(\mathbf {y}\vert \mathbf {s} ) = \frac{1}{|C |} \sum _{i=1}^{|C|} \rho ^{c_i}\), with \(\rho ^{c_i} = \sum _{j=1}^{ k }\sqrt{ \mathbf {h}_q^{c_i}(j)\ \mathbf {h}_p^{c_i}(j)}\), where \(\mathbf {h}_q^{c_i}\) and \(\mathbf {h}_p^{c_i}\) are the \(c_i^{th}\) cell histograms of the target and the candidate respectively, and k is the dimension of the sparse code vector. The state with the highest average Bhattacharyya coefficient is selected as the state of the object in the \(t^{th}\) frame. Experimental verification of our proposal and its performance analysis are presented next.
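The following is a minimal sketch of the localization step under the above model, assuming a build_candidate_model() helper (not shown) that extracts the candidate region defined by a state from the frame and returns its cell histograms as a (P, k) array; the random-walk standard deviations are illustrative values.

```python
# Random-walk particle proposals scored by the average Bhattacharyya coefficient.
import numpy as np

def track_frame(frame, s_prev, H_target, n_particles=75,
                sigmas=(4.0, 4.0, 1.0, 1.0, 0.02)):
    """s_prev: previous state [xc, yc, w, h, theta]; H_target: (P, k) target cell histograms."""
    rng = np.random.default_rng()
    # s_t ~ N(s_{t-1}, Sigma), Sigma = diag(sigmas^2), one row per particle.
    particles = s_prev + rng.normal(scale=sigmas, size=(n_particles, 5))
    best_state, best_score = s_prev, -np.inf
    for s in particles:
        H_cand = build_candidate_model(frame, s)          # assumed helper, (P, k)
        # p(y | s): average Bhattacharyya coefficient over the P cells.
        rho = np.sqrt(H_target * H_cand).sum(axis=1).mean()
        if rho > best_score:
            best_state, best_score = s, rho
    return best_state, best_score
```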

3 Experimental Results

The performance of the algorithm is evaluated on the VOT2014 dataset [17] and compared with other trackers from the literature, namely the Mean-Shift Tracker (MST) [1], Tracking-Learning-Detection (TLD) [6] and the CMT tracker [3]. These trackers were executed with their default parameter settings. The experimental results show that our proposal compares well with these state-of-the-art trackers (Table 1). Results of the proposed tracker on different challenging sequences from VOT2014 are shown in Fig. 3.

3.1 Quantitative Evaluation

The performance of the proposed tracker is evaluated using the one pass evaluation (OPE) [18] scheme, where the tracker is initialized with the ground-truth state in the first frame and allowed to track over the entire sequence. The results obtained on different sequences are reported in Table 1. The performance measures used are average overlap (AO) and success rate (SR). The per-frame overlap is given by \( \phi _t(\varLambda _G,\varLambda _P) = \frac{ \varLambda ^{G}_t \cap \varLambda ^{P}_t}{\varLambda ^{G}_t \cup \varLambda ^{P}_t}\), where \(\varLambda ^{G}_t\) is the bounding box region described by the ground truth and \(\varLambda ^{P}_t\) is the bounding box region predicted by the tracker. The average overlap is given by \( \varPhi _{avg} = \frac{1}{N_{s}}\sum _{t=1}^{N_{s}} \phi _{t} \), where \(N_{s}\) is the total number of successfully tracked frames in the sequence. Tracking is considered successful in a frame if \(\phi _{t}\) exceeds the threshold \(\phi _{th} = 0.33\). The other parameters of the proposed algorithm are the number of clusters (\(K = 100\)), the number of atoms per cluster (\(m= 3\)), the sparsity constraint (\(T = 3\)) and the number of particles (\(p = 75\)). Patches of dimension \(5\times 5\) (i.e. patch vector size \(n = 25\)) were extracted from cells of size \(10\times 10\).
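A short sketch of these measures is given below for axis-aligned boxes in [x, y, w, h] form; VOT2014 annotations are rotated rectangles, so treating them as axis-aligned here is a simplifying assumption.

```python
# Per-frame overlap (intersection over union), average overlap (AO) over
# successfully tracked frames and success rate (SR) at phi_th = 0.33.
import numpy as np

def overlap(gt, pred):
    x1, y1 = max(gt[0], pred[0]), max(gt[1], pred[1])
    x2 = min(gt[0] + gt[2], pred[0] + pred[2])
    y2 = min(gt[1] + gt[3], pred[1] + pred[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = gt[2] * gt[3] + pred[2] * pred[3] - inter
    return inter / union if union > 0 else 0.0

def ao_sr(gt_boxes, pred_boxes, phi_th=0.33):
    phi = np.array([overlap(g, p) for g, p in zip(gt_boxes, pred_boxes)])
    success = phi > phi_th
    ao = phi[success].mean() if success.any() else 0.0   # average over successful frames
    sr = success.mean()                                  # fraction of successful frames
    return ao, sr
```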

The computational complexity depends on the number of particles (p), the dictionary size (\(n \times l\)), the number of OMP iterations (T), the number of candidate region patches (u) and the computations (\(t_d\)) required to evaluate the orthogonal projection in OMP. The total computational time per frame is \(p\times u \times t_{OMP}\), where \(t_{OMP}\) is the per-patch computational load of the OMP algorithm [14], given by \(t_{OMP} = t_dT + 2nT + 2T(l + n) + T^3\).
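To make this estimate concrete, the following back-of-the-envelope computation plugs in the parameters from above under illustrative assumptions: l = 600 atoms (at most 2 classes x 100 clusters x 3 atoms), t_d = 2nl operations for the projection, and u = 100 candidate patches per particle.

```python
# Rough per-patch and per-frame operation counts for the OMP-based model.
n, l, T, p, u = 25, 600, 3, 75, 100       # u and l are assumed values
t_d = 2 * n * l                           # assumed cost of the projection D^T r per iteration
t_omp = t_d * T + 2 * n * T + 2 * T * (l + n) + T ** 3
print(t_omp)                              # ~9.4e4 operations per patch
print(p * u * t_omp)                      # ~7.0e8 operations per frame
```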

Table 1. Performance comparison of the proposed tracker with the trackers MST [1], TLD [6], CMT [3]

3.2 Qualitative Evaluation

The results of the proposed tracker on different sequences are shown in Fig. 3. There are continuous changes of appearance and orientation in the "ball" and "polar bear" sequences. The target undergoes partial occlusion (frames \(156 - 177\)) as well as scale changes in the "car" sequence. The cell histogram based object model (constructed from patches) and the particle filter based localization help in handling these challenges. Illumination change is significant in the "tunnel" sequence as the target moves through differently illuminated regions. Here, patch normalization and sparse coding help in achieving illumination-invariant tracking.

Fig. 3.

Results of single object tracking for the proposed tracker on (a)-(d) "ball" (frames: 7, 217, 440, 586); (e)-(h) "car" (frames: 35, 141, 168, 235); (i)-(l) "tunnel" (frames: 6, 252, 483, 694) and (m)-(p) "polar bear" (frames: 77, 171, 251, 326) sequences from the VOT2014 dataset, covering different challenges such as illumination change (il), scale change (sc), in-plane rotation (ro) and partial occlusion (po)

4 Conclusion

We have proposed a novel target model in a hybrid generative-discriminative framework. The object patches are represented using foreground and background dictionaries (generative model). These representations are further weighted by SVM based classification scores (discriminative model). The object is localized in a particle filter framework. The proposed tracker is able to handle different challenging scenarios such as background clutter, partial occlusions, in-plane rotations, and scale and illumination changes. The performance of the proposed tracker is benchmarked against state-of-the-art trackers on sequences from the VOT2014 dataset.

The present work does not incorporate continuous dictionary and classifier update schemes in the object model. This extension would enable the tracker to trail targets for longer durations, under severe appearance changes and occlusions. Also, the present approach is somewhat slow due to the repeated application of OMP at the particle filtering stage. We propose to extend the present formulation through discriminative dictionary learning and fast OMP solvers.