Abstract
Object tracking involves target localization in dynamic scenes using generative models, discriminative classifiers or a combination of both. We propose a combined approach consisting of generative models (learned in a sparse representation framework) and discriminative classifiers (SVM). Sparse codes are initially computed from two different dictionaries constructed from foreground and background patches using K-SVD. An SVM learned on these sparse codes provides classifier scores for patches. The scores for the sparse codes of patches drawn from a region are used to form a weighted histogram. These weighted histograms of sparse codes form the object and candidate models. The learned dictionaries provide distinct representations for object and background patches. This discrimination is further enhanced by the classifier scores. The object is localized by maximizing the Bhattacharyya coefficient between target and candidate models in a particle filter framework. The performance of the proposed tracker is benchmarked on videos from the VOT2014 dataset against existing generative and discriminative approaches. Our proposal was able to handle different challenging situations involving background clutter, in-plane rotations, scale and illumination changes.
1 Introduction
Object tracking algorithms can be broadly categorized into generative, discriminative and hybrid approaches. A generative approach models the target appearance and localizes it by optimizing a (dis)similarity measure between target and candidate(s) [1,2,3,4]. A discriminative approach learns classifiers from target and background appearance features in a supervised framework [5,6,7]. A hybrid methodology combines both by modeling target appearance in a generative framework while discriminating the same against background [8].
Among generative approaches, the first notable work that makes use of a sparse representation based target model was proposed by Mei et al. [4]. Tracking was formulated as an L1 minimization problem where the candidate model is represented as a sparse linear combination of object and trivial templates. Thereafter, a number of variants of this basic formulation have been successfully applied in object tracking [9,10,11,12,13]. Successful use of classifiers in tracking has been demonstrated in [6, 7].
We propose a hybrid approach that employs sparse representation based target modeling and uses classifiers for better discrimination of the target model from that of the background. We propose to extract patches from both background and foreground and learn dictionaries on these patches using K-SVD [14] and spherical clustering. We learn an SVM classifier to discriminate the sparse codes of the object from those of the background. Sparse codes extracted from (multiple overlapping) patches of non-overlapping cells are weighted by their classification scores to compute the cell histograms. These cell histograms together form the object model. The average of the cell histogram similarities between target and candidate models is maximized in a particle filter framework [15]. The main contributions of this work are as follows.
- Proposal of a target model in a hybrid generative-discriminative approach.
- Object representation by histogram of sparse codes (HSC) obtained from foreground-background dictionaries (generative model).
- Enhancing discrimination of object models by weighing HSC with patch classification (by SVM) scores (discriminative approach).
The rest of the paper is organized as follows. The proposed approach is elaborated in Sect. 2. Experimental results are presented in Sect. 3. Finally, Sect. 4 summarizes the present work and outlines future extensions.
2 Proposed Work
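The object rectangle is divided into non-overlapping regions called cells. The proposed target model is learned in two stages. The first stage involves generative modeling. Here, object and background patches are used to learn two different dictionaries using K-SVD [16], which learns a dictionary by solving the following sparsity-constrained optimization problem
\[ \min _{\mathbf {D}, \varvec{\varGamma }} \Vert \mathbf {X} - \mathbf {D}\varvec{\varGamma } \Vert _F^2 \quad \text {subject to} \quad \Vert \varvec{\gamma }_i \Vert _0 \le T \ \ \forall i \]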
where \(\varvec{\gamma }\) is a sparse code vector, \(\mathbf {D}\) is the dictionary, \(\mathbf {X}\) is the set of signals, \(\varvec{\varGamma }\) holds the sparse codes of all the signals and T is the sparsity threshold. The two learned dictionaries are then used to form a combined foreground-background dictionary. In the second stage, sparse codes of background and foreground patches are used to train an SVM classifier in a discriminative framework. The patch classification obtained from this SVM provides further discrimination in the object model. Sparse codes and the classification scores of patches are used to compute the weighted sparse code histograms of the cells, which form the object model. The candidate models are constructed from object state (position, rotation and scaling) proposals obtained through particle filtering. Next, we discuss dictionary learning in detail.
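To illustrate the sparse coding step of the sparsity-constrained problem described above, the following is a minimal sketch of orthogonal matching pursuit (OMP) for a fixed dictionary, written in plain Python. It is not the batch OMP implementation of [14]; the helper names (`dot`, `solve_linear`, `omp`) and the toy dictionary are assumptions made for illustration, and the atoms are assumed to have unit norm.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def solve_linear(A, b):
    """Solve the small system A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = s / M[i][i]
    return z

def omp(atoms, x, T, tol=1e-10):
    """Return a code with at most T non-zero entries such that D * code approximates x.

    atoms: list of dictionary columns (assumed unit norm), x: signal vector.
    """
    residual = list(x)
    support, coeffs = [], []
    for _ in range(min(T, len(atoms))):
        # Correlation of every unused atom with the current residual.
        corr = {j: abs(dot(atoms[j], residual))
                for j in range(len(atoms)) if j not in support}
        best = max(corr, key=corr.get)
        if corr[best] < tol:
            break  # nothing left to explain
        support.append(best)
        # Least-squares fit of x on the selected atoms (normal equations).
        gram = [[dot(atoms[i], atoms[j]) for j in support] for i in support]
        rhs = [dot(atoms[i], x) for i in support]
        coeffs = solve_linear(gram, rhs)
        residual = [x[d] - sum(c * atoms[j][d] for c, j in zip(coeffs, support))
                    for d in range(len(x))]
    gamma = [0.0] * len(atoms)
    for c, j in zip(coeffs, support):
        gamma[j] = c
    return gamma

# Toy dictionary with four atoms in R^3 (hypothetical values).
r = 1.0 / math.sqrt(2.0)
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [r, r, 0.0]]
x = [2.0, 0.0, 0.5]
gamma = omp(atoms, x, T=2)
print(gamma)
```

In this toy case the signal is an exact combination of the first and third atoms, so two iterations suffice; K-SVD alternates a coding step of this kind with atom-wise dictionary updates.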
2.1 Learning Foreground-Background Dictionaries
A single dictionary learned from only object patches might provide good reconstruction but poor recognition against background clutter. Rectangular patches are extracted from the minimum bounding box of the object (\(\mathbf {bb}^{obj}\)) and a background region (\(\mathbf {bb}^{bg}\)) around \(\mathbf {bb}^{obj}\). The extracted patches are first vectorized and magnitude normalized (using the l2-norm). Next, they are arranged to construct the input data matrices \(\mathbf {X}^{obj} \in \mathcal {R}^{n \times np}\) and \(\mathbf {X}^{bg} \in \mathcal {R}^{n \times nn}\) of the object (positive) and background (negative) classes respectively. Here, np and nn are the total numbers of object and background patches respectively and n is the dimension of a patch vector. Magnitude normalization helps to make the object model robust to illumination changes. Spherical k-means clustering is performed on the object patch vectors \(\mathbf {X}^{obj}\) and background patch vectors \(\mathbf {X}^{bg}\) separately. The dominant clusters are selected and the K-SVD algorithm is applied to each of them to obtain m representative atoms from each cluster. These are stacked together to form the foreground-background dictionary. Patches common to foreground and background may lead to drift in tracking. In order to reduce the effect of such patches, we introduce discriminability through a binary classifier. The dictionary learning procedure is shown in Fig. 1.
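The patch preparation described above can be sketched as follows. This is a minimal illustration, assuming a grayscale region stored as a nested list and a hypothetical stride of one pixel (the stride is not stated in the text); `extract_patches` and `l2_normalize` are illustrative names. The final check shows why magnitude normalization gives robustness to a global brightness gain.

```python
import math

def extract_patches(region, size, stride):
    """Slide a size x size window over a 2-D list and vectorize each patch."""
    rows, cols = len(region), len(region[0])
    patches = []
    for r in range(0, rows - size + 1, stride):
        for c in range(0, cols - size + 1, stride):
            patches.append([region[r + i][c + j]
                            for i in range(size) for j in range(size)])
    return patches

def l2_normalize(vec, eps=1e-12):
    """Scale a patch vector to unit l2-norm (all-zero patches are left as zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > eps else [0.0] * len(vec)

# A synthetic 10 x 10 cell with strictly positive intensities (hypothetical data).
cell = [[(3 * r + 5 * c) % 7 + 1 for c in range(10)] for r in range(10)]
brighter = [[1.7 * v for v in row] for row in cell]  # same cell under a gain change

patches = [l2_normalize(p) for p in extract_patches(cell, size=5, stride=1)]
patches_bright = [l2_normalize(p) for p in extract_patches(brighter, size=5, stride=1)]

# The normalized patch vectors agree despite the change in brightness gain.
max_diff = max(abs(a - b)
               for p, q in zip(patches, patches_bright) for a, b in zip(p, q))
print(len(patches), len(patches[0]), max_diff)
```

Each column of \(\mathbf {X}^{obj}\) or \(\mathbf {X}^{bg}\) would correspond to one such normalized vector of dimension \(n = 25\).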
2.2 Classifier Learning
A binary classifier is learned on the sparse codes of the object and background patches, computed with respect to the foreground-background dictionary. The learned classifier provides confidence scores for object patches. These scores are used in constructing the proposed object model. Let \(\varvec{\varGamma }^{obj}\) and \(\varvec{\varGamma }^{bg}\) be the sparse codes corresponding to patch vectors from the positive set \(\mathbf {X}^{obj}\) and the negative set \(\mathbf {X}^{bg}\) respectively. An SVM classifier is learned on these sparse code representations of the patch vectors. The sparse vector corresponding to an ambiguous patch will lie closer to the classification boundary and will have a lower classification score compared to the object patches. This improves the discriminative power of the object model against background and ambiguous patches. The proposed object model is explained next.
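The effect of the classifier on ambiguous patches can be illustrated with a small sketch. It assumes a linear SVM trained by subgradient descent on the regularized hinge loss, which is a stand-in for whatever SVM solver and kernel were actually used, and it assumes that scores are normalized to weights with a logistic function (the normalization is not specified in the text). The sparse codes and all function names are hypothetical.

```python
import math

def decision_score(w, b, x):
    """Signed distance-like score of a sparse code x under a linear classifier."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_linear_svm(samples, labels, epochs=200, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularized hinge loss (labels are +1 / -1)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            if y * decision_score(w, b, x) < 1.0:
                # Margin violated: move towards classifying x correctly.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Margin satisfied: only the regularizer acts.
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def score_to_weight(score):
    """Assumed normalization: squash the score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Toy sparse codes: object patches use the first atoms, background the last ones.
object_codes = [[1.0, 0.0, 0.0, 0.0], [0.8, 0.2, 0.0, 0.0]]
background_codes = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.7, 0.3]]
w, b = train_linear_svm(object_codes + background_codes, [1, 1, -1, -1])

ambiguous_code = [0.5, 0.0, 0.5, 0.0]  # shares atoms with both classes
w_obj = score_to_weight(decision_score(w, b, object_codes[0]))
w_amb = score_to_weight(decision_score(w, b, ambiguous_code))
w_bg = score_to_weight(decision_score(w, b, background_codes[0]))
print(w_obj, w_amb, w_bg)
```

The ambiguous code receives a weight between those of the object and background codes, so its contribution to the object model is reduced relative to a clearly object-like patch.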
2.3 Object Model
The object bounding box is first divided into non-overlapping cells \(\mathbf {C} = \{ \mathbf {c}_i: \mathbf {c}_i \in \mathcal {R}^{cw \times ch} \}\), where cw and ch are the width and height of a cell respectively. The sparse codes of the rectangular patches from a cell, computed using OMP, are used to compute the histogram of sparse codes of that cell. The set of all cell histograms defines the object model, i.e. \(\mathbf {H} =\{\mathbf {h}^i\}, \ i=1,\ldots ,|C |\). The sparse codes required for computing the object model are computed as
where \(\varvec{\varGamma }\) is the sparse code matrix, \(\mathbf {D}\) is the foreground-background dictionary and \(\gamma _{s} \le m\) is the sparsity constraint. The sparse code histogram of the cell is created from the sparse codes \(\varvec{\varGamma }\) of its component patches as
where \(\mathbf {x}_i\) is the \(i^{th}\) object patch belonging to cell c, \(\omega _i\) is the normalized classification score of the \(i^{th}\) patch as given by the classifier, nc is the total number of patches in the cell and L is the normalization constant for the histogram. The cell-wise histograms are stacked together to form the object model \(\mathbf {H}\), a collection of classifier-weighted histograms of sparse codes given by \( \mathbf {H} = [\ \mathbf {h}^{c_1} \ \mathbf {h}^{c_2}\ \dots \ \mathbf {h}^{c_P} \ ] \), where P is the total number of cells. The entire object model creation is depicted in Fig. 2. The particle filter framework for target tracking is explained next.
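One plausible reading of the classifier-weighted cell histogram is sketched below. The exact accumulation rule is not reproduced in this text, so the sketch assumes that each patch adds the absolute values of its sparse coefficients, scaled by its weight \(\omega _i\), and that L is chosen so that the histogram sums to one. The function name and the numbers are hypothetical.

```python
def weighted_cell_histogram(codes, weights):
    """Accumulate |gamma_i| scaled by omega_i over the patches of one cell.

    codes: list of sparse code vectors (one per patch in the cell)
    weights: normalized classification scores omega_i, one per patch
    """
    k = len(codes[0])
    hist = [0.0] * k
    for gamma, omega in zip(codes, weights):
        for j, g in enumerate(gamma):
            hist[j] += omega * abs(g)
    L = sum(hist)  # assumed normalization constant
    return [h / L for h in hist] if L > 0 else hist

# Two patches of one cell: a confident object patch and a low-scoring patch.
codes = [[2.0, 0.0, -0.5, 0.0],
         [0.0, 1.0, 0.0, 0.0]]
weights = [0.9, 0.1]
hist = weighted_cell_histogram(codes, weights)
print(hist)
```

Under this assumption the atoms used by confidently classified patches dominate the cell histogram, while the second patch contributes only a small mass to its atom.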
2.4 Particle Filter
The particle filter, also known as sequential Monte Carlo sampling, is used for object localization in tracking. It estimates the posterior distribution of the state of a dynamic system. The particle with the maximum a posteriori probability is selected as the best particle and is taken as the state of the object in the current frame. Here, the object state \(\mathbf {s} \in \mathcal {R}^5\) is given by \( \mathbf {s} = [ x^c \ y^c \ w \ h \ \theta ]^T \), where \((x^c, y^c)\) are the image plane coordinates of the object bounding box centroid and \(w\), \(h\) and \(\theta \) are the width, height and orientation of the object respectively. The motion model defines the temporal evolution of the state. We consider a simple random walk as our motion model. The current state is assumed to be sampled from a Gaussian distribution centered at the previous state, \( \mathbf {s}_{t} \sim \mathcal {N}(\mathbf {s}_{t-1}, \varvec{\varSigma } ) \), where \(\varvec{\varSigma }\) is a diagonal covariance matrix of the state variables given by \(diag( \sigma _{x}^2, \sigma _{y}^2, \sigma _{w}^2, \sigma _{h}^2,\sigma _{\theta }^2)\). The observation probability is defined as the similarity between the target model and the candidate model of the particle. The average of the Bhattacharyya coefficients (\(\rho \)) of the cell histograms is used as the observation probability, given by \(p(\mathbf {y}\vert \mathbf {s} ) = \frac{1}{|C |} \sum _{i=1}^{|C|} \rho ^{c_i} = \frac{1}{|C |} \sum _{i=1}^{|C|} \sum _{j=1}^{ k }\sqrt{ \mathbf {h}_q^{c_i}(j)\times \mathbf {h}_p^{c_i}(j)}\). Here, \(\mathbf {h}_q^{c_i}\) and \(\mathbf {h}_p^{c_i}\) are the histograms of cell \(c_i\) in the target and candidate models respectively, and k is the dimension of the sparse code vector. The state with the highest average Bhattacharyya coefficient is selected as the state of the object in the \(t^{th}\) frame. Experimental verification of our proposal and its performance analysis are presented next.
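The localization step can be sketched as one iteration of the random walk proposal followed by scoring with the average Bhattacharyya coefficient. In this sketch the candidate model is produced by a hypothetical stand-in (`toy_candidate`) that blends the target histograms with a uniform histogram according to the distance from an assumed true position; in the tracker it would be built from the image patches under each particle. The standard deviations and all names are illustrative assumptions.

```python
import math
import random

def bhattacharyya(hq, hp):
    """Bhattacharyya coefficient between two normalized histograms."""
    return sum(math.sqrt(q * p) for q, p in zip(hq, hp))

def observation_likelihood(target, candidate):
    """Average Bhattacharyya coefficient over corresponding cell histograms."""
    return sum(bhattacharyya(hq, hp) for hq, hp in zip(target, candidate)) / len(target)

def propagate(state, sigmas, rng):
    """Random walk: sample each state variable around its previous value."""
    return [s + rng.gauss(0.0, sd) for s, sd in zip(state, sigmas)]

def track_step(prev_state, sigmas, target, build_candidate, n_particles, rng):
    """Sample particles, score each one and return the best state with its score."""
    particles = [propagate(prev_state, sigmas, rng) for _ in range(n_particles)]
    scores = [observation_likelihood(target, build_candidate(p)) for p in particles]
    best = max(range(n_particles), key=scores.__getitem__)
    return particles[best], scores[best]

# Target model with two cells and three histogram bins (hypothetical values).
target = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
true_center = (52.0, 40.0)

def toy_candidate(state):
    """Stand-in for building a candidate model from the image at `state`."""
    d = math.hypot(state[0] - true_center[0], state[1] - true_center[1])
    a = math.exp(-d / 10.0)  # equals 1 exactly at the true position
    return [[a * h + (1.0 - a) / len(hist) for h in hist] for hist in target]

rng = random.Random(0)
prev_state = [50.0, 38.0, 20.0, 30.0, 0.0]   # [x, y, w, h, theta]
sigmas = [4.0, 4.0, 1.0, 1.0, 0.05]          # assumed standard deviations
best_state, best_score = track_step(prev_state, sigmas, target,
                                    toy_candidate, n_particles=75, rng=rng)
print(best_state, best_score)
```

Because the coefficient equals one only for identical histograms, the particle whose candidate model is closest to the target model receives the highest score.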
3 Experimental Results
The performance of the algorithm is evaluated on the VOT2014 dataset [17] and compared with that of other trackers from the literature, namely the Mean-Shift tracker (MST) [1], Tracking-Learning-Detection (TLD) [6] and the CMT tracker [3]. The trackers were executed with their default parameter settings. The experimental results show that our proposal performs competitively with these state-of-the-art trackers (Table 1). The results of the proposed tracker on different challenging sequences from VOT2014 are shown in Fig. 3.
3.1 Quantitative Evaluation
The performance of the proposed tracker is evaluated using the one pass evaluation (OPE) [18] scheme, where the tracker is initialized with the ground truth in the first frame and allowed to track over the entire sequence. The results obtained on different sequences are reported in Table 1. The performance measures used are average overlap (AO) and success rate (SR). The overlap measure for frame t is given by \( \phi _t(\varLambda ^{G}_t,\varLambda ^{P}_t) = \frac{ |\varLambda ^{G}_t \cap \varLambda ^{P}_t|}{|\varLambda ^{G}_t \cup \varLambda ^{P}_t|}\) where \(\varLambda ^{G}_t\) is the bounding box region given by the ground truth, \(\varLambda ^{P}_t\) is the bounding box region predicted by the tracker and \(|\cdot |\) denotes area. The average overlap is given by \( \varPhi _{avg} = \frac{1}{N_{s}}\sum _{t=1}^{N_{s}} \phi _{t} \) where \(N_{s}\) is the total number of successfully tracked frames in the sequence. Tracking is assumed to be successful if \(\phi _{t}\) exceeds the threshold value \(\phi _{th} = 0.33\). The other parameters of the proposed algorithm are the number of clusters (\(K = 100\)), the number of atoms per cluster (\(m= 3\)), the sparsity constraint (\(T = 3\)) and the number of particles (\(p = 75\)). Patches of dimension \(5\times 5\) (i.e. patch vector size \(n = 25\)) were extracted from cells of size \(10\times 10\).
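The overlap measures can be computed as in the following sketch. It assumes axis-aligned boxes given as (x, y, width, height), which ignores the orientation component of the state, and it assumes that the success rate is the fraction of frames whose overlap exceeds \(\phi _{th}\), since SR is not defined explicitly in the text. The boxes are made-up values.

```python
def overlap(gt, pred):
    """Intersection over union of two axis-aligned boxes (x, y, width, height)."""
    ix = max(0.0, min(gt[0] + gt[2], pred[0] + pred[2]) - max(gt[0], pred[0]))
    iy = max(0.0, min(gt[1] + gt[3], pred[1] + pred[3]) - max(gt[1], pred[1]))
    inter = ix * iy
    union = gt[2] * gt[3] + pred[2] * pred[3] - inter
    return inter / union if union > 0 else 0.0

def evaluate(gt_boxes, pred_boxes, threshold=0.33):
    """Return (average overlap over successful frames, success rate)."""
    overlaps = [overlap(g, p) for g, p in zip(gt_boxes, pred_boxes)]
    successes = [o for o in overlaps if o > threshold]
    avg_overlap = sum(successes) / len(successes) if successes else 0.0
    success_rate = len(successes) / len(overlaps)  # assumed definition of SR
    return avg_overlap, success_rate

# Three frames: perfect match, half-shifted box, complete miss.
gt_boxes = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
pred_boxes = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
avg_overlap, success_rate = evaluate(gt_boxes, pred_boxes)
print(avg_overlap, success_rate)
```

Note that, following the definition above, the average is taken over successfully tracked frames only, so a tracker that fails early is penalized through SR rather than AO.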
The computational complexity depends on the number of particles (p), the dictionary size (\(n \times l\)), the number of OMP iterations (T), the number of candidate region patches (u) and the computations (\(t_d\)) required for evaluating the orthogonal projection in OMP. The total computational load per frame can be computed as \(p\times u \times t_{OMP}\), where \(t_{OMP}\) is the computational load of the OMP algorithm [14], given by \(t_{OMP} = t_dT + 2nT + 2T(l + n) + T^3\).
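As a worked instance of the cost expression, the sketch below evaluates \(t_{OMP}\) and the per-frame load. The values \(n = 25\), \(T = 3\) and \(p = 75\) are taken from the parameter settings above, while the number of atoms l, the number of patches u and the projection cost \(t_d\) are hypothetical values chosen only to make the arithmetic concrete.

```python
def omp_cost(n, l, T, t_d):
    """t_OMP = t_d*T + 2nT + 2T(l + n) + T^3, in abstract operation counts."""
    return t_d * T + 2 * n * T + 2 * T * (l + n) + T ** 3

def per_frame_cost(p, u, t_omp):
    """Total load per frame: one OMP run per patch of every particle."""
    return p * u * t_omp

t_omp = omp_cost(n=25, l=120, T=3, t_d=50)   # l and t_d are assumed values
frame_cost = per_frame_cost(p=75, u=64, t_omp=t_omp)  # u is an assumed value
print(t_omp, frame_cost)  # 1197 5745600
```

The load grows linearly in both the number of particles and the number of patches per candidate, which is why the repeated OMP calls dominate the running time.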
3.2 Qualitative Evaluation
The results of the proposed tracker on different sequences are shown in Fig. 3. There are continuous changes of appearance and orientation in the “ball” and “polar bear” sequences. The target undergoes partial occlusions (frames: \(156 - 177\)) as well as scale changes in the “car” sequence. The cell histogram based object model (constructed using patches) and the particle filter based localization help in handling these challenges. Illumination change is significant in the “tunnel” sequence as the target moves through differently illuminated regions. Here, patch normalization and sparse coding help in achieving illumination invariant tracking.
4 Conclusion
We have proposed a novel target model in a hybrid generative-discriminative framework. The object patches are represented using foreground and background dictionaries (generative model). These representations are further weighted by SVM based classification scores (discriminability). The object is localized in a particle filter framework. The proposed tracker was able to handle different challenging scenarios like background clutter, partial occlusions, in-plane rotations, scale and illumination changes. The performance of the proposed tracker is benchmarked against state-of-the-art trackers on sequences from the VOT2014 dataset.
The present work did not incorporate continuous dictionary and classifier update schemes in the object model. Such an extension would enable the tracker to follow targets for longer durations, under severe appearance changes and occlusions. Also, the present approach is somewhat slow due to the repeated application of OMP at the particle filtering stage. We propose to extend the present formulation through discriminative dictionary learning and fast OMP solvers.
References
Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Trans. Pattern Anal. Mach. Intell. 25, 564–577 (2003)
Isard, M., Blake, A.: Condensation - conditional density propagation for visual tracking. Int. J. Comput. Vision 29, 5–28 (1998)
Nebehay, G., Pflugfelder, R.: Clustering of static-adaptive correspondences for deformable object tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2784–2791 (2015)
Mei, X., Ling, H.: Robust visual tracking using \(\ell _1\) minimization. In: 2009 IEEE 12th International Conference on Computer Vision, pp. 1436–1443. IEEE (2009)
Babenko, B., Yang, M.H., Belongie, S.: Visual tracking with online multiple instance learning. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, pp. 983–990. IEEE (2009)
Kalal, Z., Mikolajczyk, K., Matas, J.: Tracking-learning-detection. IEEE Trans. Pattern Anal. Mach. Intell. 34, 1409–1422 (2012)
Avidan, S.: Ensemble tracking. IEEE Trans. Pattern Anal. Mach. Intell. 29, 261–271 (2007)
Lei, Y., Ding, X., Wang, S.: Visual tracker using sequential bayesian learning: discriminative, generative, and hybrid. IEEE Trans. Syst. Man Cybern. B Cybern. 38, 1578–1591 (2008)
Wang, D., Lu, H., Yang, M.H.: Online object tracking with sparse prototypes. IEEE Trans. Image Process. 22, 314–325 (2013)
Bai, T., Li, Y.F.: Robust visual tracking with structured sparse representation appearance model. Pattern Recogn. 45, 2390–2404 (2012)
Zhang, S., Yao, H., Sun, X., Lu, X.: Sparse coding based visual tracking: Review and experimental comparison. Pattern Recogn. 46, 1772–1788 (2013)
Yang, X., Wang, M., Zhang, L., Sun, F., Hong, R., Qi, M.: An efficient tracking system by orthogonalized templates. IEEE Trans. Industr. Electron. 63, 3187–3197 (2016)
Liu, B., Huang, J., Kulikowski, C., Yang, L.: Robust visual tracking using local sparse appearance model and k-selection. IEEE Trans. Pattern Anal. Mach. Intell. 35, 2968–2981 (2013)
Rubinstein, R., Zibulevsky, M., Elad, M.: Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit. CS Technion 40, 1–15 (2008)
Arulampalam, M.S., Maskell, S., Gordon, N., Clapp, T.: A tutorial on particle filters for online nonlinear/non-gaussian bayesian tracking. IEEE Trans. Signal Process. 50, 174–188 (2002)
Aharon, M., Elad, M., Bruckstein, A.: K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process. 54, 4311–4322 (2006)
Kristan, M., Matas, J., Leonardis, A., Vojíř, T., Pflugfelder, R., Fernandez, G., Nebehay, G., Porikli, F., Čehovin, L.: A novel performance evaluation methodology for single-target trackers. IEEE Trans. Pattern Anal. Mach. Intell. 38, 2137–2155 (2016)
Wu, Y., Lim, J., Yang, M.H.: Online object tracking: a benchmark. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2411–2418 (2013)
© 2017 Springer International Publishing AG
Cite this paper
Francis, M., Guha, P. (2017). Object Tracking with Classification Score Weighted Histogram of Sparse Codes. In: Shankar, B., Ghosh, K., Mandal, D., Ray, S., Zhang, D., Pal, S. (eds) Pattern Recognition and Machine Intelligence. PReMI 2017. Lecture Notes in Computer Science(), vol 10597. Springer, Cham. https://doi.org/10.1007/978-3-319-69900-4_21
DOI: https://doi.org/10.1007/978-3-319-69900-4_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69899-1
Online ISBN: 978-3-319-69900-4
eBook Packages: Computer Science, Computer Science (R0)