1 Introduction

Object tracking algorithms can be broadly categorized into generative, discriminative and hybrid approaches. A generative approach models the target appearance and localizes the target by optimizing a (dis)similarity measure between the target and the candidate(s) [1,2,3,4]. A discriminative approach learns classifiers from target and background appearance features in a supervised framework [5,6,7]. A hybrid approach combines both by modeling the target appearance in a generative framework while discriminating it from the background [8].

Among generative approaches, the first notable work that makes use of a sparse representation based target model was proposed by Mei et al. [4]. Tracking was formulated as an L1 minimization problem in which the candidate model is represented as a sparse linear combination of object and trivial templates. Thereafter, a number of variants of this basic formulation have been successfully applied to object tracking [9,10,11,12,13]. Successful use of classifiers in tracking has been demonstrated in [6, 7].

We propose a hybrid approach that employs sparse representation based target modeling and uses a classifier to better discriminate the target model from that of the background. We extract patches from both the background and the foreground and learn dictionaries on these patches using K-SVD [14] and spherical clustering. We then learn an SVM classifier to discriminate the sparse codes of the object from those of the background. Sparse codes extracted from the (multiple, overlapping) patches of non-overlapping cells are weighted by their classification scores to compute cell histograms. These cell histograms together form the object model. The average of the cell histogram similarities between the target and candidate models is maximized in a particle filter framework [15]. The main contributions of this work are as follows.

  • Proposal of a target model in a hybrid generative-discriminative approach.

  • Object representation by histogram of sparse codes (HSC) obtained from foreground-background dictionaries (generative model).

  • Enhancing the discrimination of the object model by weighting the HSC with SVM patch classification scores (discriminative model).

The rest of the paper is organized as follows. The proposed approach is elaborated in Sect. 2. Experimental results are presented in Sect. 3. Finally, Sect. 4 summarizes the present work and sketches future extensions.

2 Proposed Work

The object rectangle is divided into non-overlapping regions called cells. The proposed target model is learned in two stages. The first stage involves generative modeling: object and background patches are used to learn two different dictionaries using K-SVD [16], which learns a dictionary by solving the following sparsity-constrained optimization problem

$$\begin{aligned} \underset{ \varvec{\varGamma }, \mathbf {D} }{\min } ||\mathbf {X} -\mathbf {D}\varvec{\varGamma } ||_F^2 \ \text {subject to } \forall i \ ||\varvec{\gamma }_i||_0 \le T \end{aligned}$$
(1)

where \(\varvec{\gamma }_i\) is the sparse code vector of the \(i^{th}\) signal, \(\mathbf {D}\) is the dictionary, \(\mathbf {X}\) is the set of signals, \(\varvec{\varGamma }\) contains the sparse codes of all the signals and T is the sparsity threshold. The two learned dictionaries are then used to form a combined foreground-background dictionary. In the second stage, the sparse codes of background and foreground patches are used to train an SVM classifier in a discriminative framework. The patch classification scores obtained from this SVM provide further discrimination in the object model. The sparse codes and the classification scores of the patches are used to compute the weighted sparse code histograms of the cells, which together form the object model. The candidate models are constructed from object state (position, rotation and scaling) proposals obtained through particle filtering. Next, we discuss dictionary learning in detail.
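For concreteness, a minimal K-SVD sketch for Eq. (1) is given below (Python, using NumPy and scikit-learn's OMP solver). It is an illustrative re-implementation of the standard algorithm, not the authors' code; the random initialization and the fixed iteration count are assumptions.

```python
# Minimal K-SVD sketch for Eq. (1): alternate OMP sparse coding and rank-1
# (SVD-based) atom updates. Names (X, D, Gamma, T) mirror the text.
import numpy as np
from sklearn.linear_model import orthogonal_mp

def ksvd(X, n_atoms, T, n_iter=10, seed=0):
    """X: (n, N) column-wise signals. Returns dictionary D (n, n_atoms) and codes Gamma."""
    rng = np.random.default_rng(seed)
    # Initialize D with randomly chosen (re-normalized) signal columns.
    D = X[:, rng.choice(X.shape[1], n_atoms, replace=False)].copy()
    D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
    Gamma = np.zeros((n_atoms, X.shape[1]))
    for _ in range(n_iter):
        # Sparse coding step: at most T non-zero coefficients per signal.
        Gamma = orthogonal_mp(D, X, n_nonzero_coefs=T)
        # Dictionary update step: refit each atom (and its coefficients) to the
        # residual of the signals that actually use it.
        for k in range(n_atoms):
            users = np.flatnonzero(Gamma[k, :])
            if users.size == 0:
                continue
            E = X[:, users] - D @ Gamma[:, users] + np.outer(D[:, k], Gamma[k, users])
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]
            Gamma[k, users] = s[0] * Vt[0, :]
    return D, Gamma
```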

2.1 Learning Foreground-Background Dictionaries

A single dictionary learned from only object patches might provide good reconstruction but poor recognition against background clutter. Rectangular patches are extracted from the minimum bounding box of the object (\(\mathbf {bb}^{obj}\)) and from a background region (\(\mathbf {bb}^{bg}\)) around \(\mathbf {bb}^{obj}\). The extracted patches are first vectorized and magnitude normalized (using the l2-norm). Next, they are arranged to construct the input data matrices \(\mathbf {X}^{obj} \in \mathcal {R}^{n \times np}\) and \(\mathbf {X}^{bg} \in \mathcal {R}^{n \times nn}\) of the object (positive) and background (negative) classes respectively. Here, np and nn are the total numbers of object and background patches respectively and n is the dimension of the patch vector. Magnitude normalization helps to make the object model robust to illumination changes. Spherical k-means clustering is performed separately on the object patch vectors \(\mathbf {X}^{obj}\) and the background patch vectors \(\mathbf {X}^{bg}\). The dominant clusters are selected and the K-SVD algorithm is performed on each of them to obtain m representative atoms per cluster. These atoms are stacked together to form the foreground-background dictionary. Patches common to the foreground and background may lead to drift in tracking. In order to reduce the effect of such patches, we introduce discriminability through a binary classifier. The dictionary learning procedure is shown in Fig. 1.
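A sketch of this dictionary construction is given below, reusing the ksvd() helper from the previous listing. The patch extraction stride, the approximation of spherical k-means by k-means on l2-normalized vectors, and the dominant-cluster threshold (min_members) are illustrative assumptions.

```python
# Foreground-background dictionary: vectorize and normalize patches, cluster them,
# then run K-SVD on each dominant cluster and stack the resulting atoms.
import numpy as np
from sklearn.cluster import KMeans

def extract_patch_vectors(region, patch_size=5, stride=2):
    """Vectorize and l2-normalize overlapping patches from a grayscale region."""
    H, W = region.shape
    patches = [region[r:r + patch_size, c:c + patch_size].ravel()
               for r in range(0, H - patch_size + 1, stride)
               for c in range(0, W - patch_size + 1, stride)]
    X = np.asarray(patches, dtype=float).T                   # (n, num_patches)
    return X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-12)

def cluster_dictionary(X, K, m, T, min_members=10):
    """k-means on l2-normalized vectors (spherical clustering), K-SVD per dominant cluster."""
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(X.T)
    atom_sets = []
    for k in range(K):
        members = X[:, labels == k]
        if members.shape[1] < min_members:                   # skip weak clusters
            continue
        Dk, _ = ksvd(members, n_atoms=min(m, members.shape[1]), T=T)
        atom_sets.append(Dk)
    return np.hstack(atom_sets)

# Combined dictionary D: object-cluster atoms followed by background-cluster atoms,
# with K = 100, m = 3 and T = 3 as reported in Sect. 3.
# D = np.hstack([cluster_dictionary(X_obj, 100, 3, 3), cluster_dictionary(X_bg, 100, 3, 3)])
```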

Fig. 1.

Patches extracted from the foreground and background regions are vectorized and magnitude normalized. These are further grouped using spherical clustering. K-SVD performed on the cluster members provides different atom sets. The atom sets obtained from the background and object clusters are combined to form a single dictionary.

2.2 Classifier Learning

A binary classifier is learned on the sparse codes generated by the K-SVD algorithm with the foreground-background dictionary for the object and background patches. The learned classifier provides confidence scores for object patches; these scores are used in constructing the proposed object model. Let \(\varvec{\varGamma }^{obj}\) and \(\varvec{\varGamma }^{bg}\) be the sparse codes corresponding to the patch vectors of the positive set \(\mathbf {X}^{obj}\) and the negative set \(\mathbf {X}^{bg}\) respectively. An SVM classifier is learned on these sparse code representations of the patch vectors. The sparse vector corresponding to an ambiguous patch will lie closer to the classification boundary and will have a lower classification score than those of clear object patches. This improves the discriminative power of the object model against background and ambiguous patches. The proposed object model is explained next.
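A minimal sketch of this stage is shown below: a linear SVM trained on the OMP sparse codes of object (positive) and background (negative) patches, plus a helper that turns SVM decision values into normalized patch weights. The choice of LinearSVC and the min-max score normalization are assumptions; the paper only specifies that an SVM is trained on the sparse codes.

```python
# Train an SVM on sparse codes and expose normalized confidence scores omega_i.
import numpy as np
from sklearn.linear_model import orthogonal_mp
from sklearn.svm import LinearSVC

def train_patch_classifier(D, X_obj, X_bg, T):
    """D: (n, l) foreground-background dictionary; X_obj/X_bg: (n, np)/(n, nn) patch vectors."""
    G_obj = orthogonal_mp(D, X_obj, n_nonzero_coefs=T)    # Gamma^obj, shape (l, np)
    G_bg = orthogonal_mp(D, X_bg, n_nonzero_coefs=T)      # Gamma^bg, shape (l, nn)
    codes = np.hstack([G_obj, G_bg]).T
    labels = np.hstack([np.ones(G_obj.shape[1]), -np.ones(G_bg.shape[1])])
    return LinearSVC(C=1.0).fit(codes, labels)

def patch_scores(svm, Gamma):
    """Signed distances to the SVM boundary, rescaled to [0, 1] to act as patch weights."""
    s = svm.decision_function(Gamma.T)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)
```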

2.3 Object Model

The object bounding box is first divided into non-overlapping cells \(\mathbf {C} = \{ \mathbf {c}_i: \mathbf {c}_i \in \mathcal {R}^{cw \times ch} \}\), where cw and ch are the width and height of a cell respectively. The sparse codes of the rectangular patches within a cell, computed using Orthogonal Matching Pursuit (OMP), are used to compute its histogram of sparse codes. The set of all cell histograms defines the object model, i.e. \(\mathbf {H} =\{\mathbf {h}^i\}, \ i=1,\ldots ,|C |\). The sparse codes required for computing the object model are obtained as

$$\begin{aligned} \underset{\varGamma }{\text {min}} \ ||\mathbf {X}^{obj} -\mathbf {D} \varvec{\varGamma } ||_F^{2} \ \text { s.t. } ||\varvec{\varGamma } ||_{0} \le \gamma _{s} \end{aligned}$$
(2)

where \(\varvec{\varGamma }\) is the sparse code matrix, \(\mathbf {D}\) is the foreground-background dictionary and \(\gamma _{s} \le m\) is the sparsity constraint. The sparse code histogram of a cell is created from the sparse codes \(\varvec{\varGamma }\) of its component patches as

$$\begin{aligned} \mathbf {h}^c(j) = L\sum _{i=1}^{nc} |\varvec{\gamma }_{ij} |\ \omega _i , \ \mathbf {x}_{i} \in \mathbf {X}^{obj} \end{aligned}$$
(3)

where \(\mathbf {x}_i\) is the \(i^{th}\) object patch belonging to cell c, \(\omega _i\) is the normalized classification score of the \(i^{th}\) patch as given by the classifier, nc is the total number of patches in the cell and L is the normalization constant of the histogram. The cell-wise histograms are stacked together to form the object model \(\mathbf {H}\), a collection of classifier-weighted histograms of sparse codes given by \( \mathbf {H} = [\ \mathbf {h}^{c_1} \ \mathbf {h}^{c_2}\ \dots \ \mathbf {h}^{c_P} \ ] \), where P is the total number of cells. The entire object model creation is depicted in Fig. 2. The particle filter framework for target tracking is explained next.
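A short sketch of Eqs. (2) and (3) is given below, reusing patch_scores() from the previous listing: the patches of one cell are sparse coded with OMP against the foreground-background dictionary, weighted by their normalized SVM scores, and accumulated into a histogram over the dictionary atoms. The l1 normalization (i.e. the particular choice of L) is an assumption.

```python
# Classifier-weighted histogram of sparse codes for a single cell, Eqs. (2)-(3).
import numpy as np
from sklearn.linear_model import orthogonal_mp

def cell_histogram(D, X_cell, svm, gamma_s):
    """X_cell: (n, nc) patch vectors of one cell; returns a histogram of length l (#atoms)."""
    Gamma = orthogonal_mp(D, X_cell, n_nonzero_coefs=gamma_s)   # Eq. (2), shape (l, nc)
    omega = patch_scores(svm, Gamma)                            # normalized SVM weights
    h = np.abs(Gamma) @ omega                                   # Eq. (3) before normalization
    return h / (h.sum() + 1e-12)                                # L chosen so that h sums to one

# Object model H: one weighted histogram per cell,
# H = np.stack([cell_histogram(D, X_c, svm, gamma_s) for X_c in cell_patch_sets])
```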

Fig. 2.

Object model as a set of weighted sparse code histograms computed from the non-overlapping cells of the object bounding box \(\mathbf {bb}^{obj}\). Sparse codes of the patches are computed using OMP (Orthogonal Matching Pursuit) with the learned dictionary. The sparse code vectors of the object and background patches are used for training the classifier. The classification-score-weighted sparse codes are then used to compute the histogram of each cell.

2.4 Particle Filter

The particle filter, also known as sequential Monte Carlo sampling, is used for object localization in tracking. It estimates the posterior distribution of the state of a dynamic system. The particle with the maximum a posteriori probability is selected as the best particle and is taken as the state of the object in the current frame. Here, the object state \(\mathbf {s} \in \mathcal {R}^5\) is defined as \( \mathbf {s} = [ x^c \ y^c \ w \ h \ \theta ]^T \), where \((x^c, y^c)\) are the image plane co-ordinates of the object bounding box centroid and \(w,h,\theta \) are the width, height and orientation of the object respectively. The motion model defines the temporal evolution of the state; we consider a simple random walk as our motion model. The current state is assumed to be sampled from a Gaussian distribution centered at the previous state, \( \mathbf {s}_{t} \sim \mathcal {N}(\mathbf {s}_{t-1}, \varvec{\varSigma } ) \), where \(\varvec{\varSigma }\) is the diagonal covariance matrix of the state variables given by \(diag( \sigma _{x}^2, \sigma _{y}^2, \sigma _{w}^2, \sigma _{h}^2,\sigma _{\theta }^2)\). The observation probability is defined as the similarity between the target model and the candidate model of a particle. We use the average of the Bhattacharyya coefficients (\(\rho \)) of the cell histograms as the observation probability, \(p(\mathbf {y}\vert \mathbf {s} ) = \frac{1}{|C |} \sum _{i=1}^{|C|} \rho ^{c_i}\), with \(\rho ^{c_i} = \sum _{j=1}^{ k }\sqrt{ \mathbf {h}_q^{c_i}(j)\ \mathbf {h}_p^{c_i}(j)}\), where \(\mathbf {h}_q^{c_i}\) and \(\mathbf {h}_p^{c_i}\) are the \(c_i^{th}\) cell histograms of the target and the candidate respectively, and k is the dimension of the sparse code vector. The state with the highest average Bhattacharyya coefficient is selected as the state of the object in the \(t^{th}\) frame. Experimental verification of our proposal and its performance analysis are presented next.
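The following is a minimal sketch of the localization step under the above model, assuming a build_candidate_model() helper (not shown) that extracts the candidate region defined by a state from the frame and returns its cell histograms as a (P, k) array; the random-walk standard deviations are illustrative values.

```python
# Random-walk particle proposals scored by the average Bhattacharyya coefficient.
import numpy as np

def track_frame(frame, s_prev, H_target, n_particles=75,
                sigmas=(4.0, 4.0, 1.0, 1.0, 0.02)):
    """s_prev: previous state [xc, yc, w, h, theta]; H_target: (P, k) target cell histograms."""
    rng = np.random.default_rng()
    # s_t ~ N(s_{t-1}, Sigma), Sigma = diag(sigmas^2), one row per particle.
    particles = s_prev + rng.normal(scale=sigmas, size=(n_particles, 5))
    best_state, best_score = s_prev, -np.inf
    for s in particles:
        H_cand = build_candidate_model(frame, s)          # assumed helper, (P, k)
        # p(y | s): average Bhattacharyya coefficient over the P cells.
        rho = np.sqrt(H_target * H_cand).sum(axis=1).mean()
        if rho > best_score:
            best_state, best_score = s, rho
    return best_state, best_score
```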

3 Experimental Results

The performance of the algorithm is evaluated on the VOT2014 dataset [17] and compared with other trackers from the literature, namely the Mean-Shift Tracker (MST) [1], Tracking-Learning-Detection (TLD) [6] and the CMT tracker [3]. These trackers were executed with their default parameter settings. The experimental results show that our proposal compares well with these state-of-the-art trackers (Table 1). Results of the proposed tracker on different challenging sequences from VOT2014 are shown in Fig. 3.

3.1 Quantitative Evaluation

The performance of the proposed tracker is evaluated using the one pass evaluation (OPE) [18] scheme, where the tracker is initialized with the ground-truth state in the first frame and allowed to track over the entire sequence. The results obtained on different sequences are reported in Table 1. The performance measures used are average overlap (AO) and success rate (SR). The per-frame overlap is given by \( \phi _t(\varLambda _G,\varLambda _P) = \frac{ \varLambda ^{G}_t \cap \varLambda ^{P}_t}{\varLambda ^{G}_t \cup \varLambda ^{P}_t}\), where \(\varLambda ^{G}_t\) is the bounding box region described by the ground truth and \(\varLambda ^{P}_t\) is the bounding box region predicted by the tracker. The average overlap is given by \( \varPhi _{avg} = \frac{1}{N_{s}}\sum _{t=1}^{N_{s}} \phi _{t} \), where \(N_{s}\) is the total number of successfully tracked frames in the sequence. Tracking is considered successful in a frame if \(\phi _{t}\) exceeds the threshold \(\phi _{th} = 0.33\). The other parameters of the proposed algorithm are the number of clusters (\(K = 100\)), the number of atoms per cluster (\(m= 3\)), the sparsity constraint (\(T = 3\)) and the number of particles (\(p = 75\)). Patches of dimension \(5\times 5\) (i.e. patch vector size \(n = 25\)) were extracted from cells of size \(10\times 10\).
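A short sketch of these measures is given below for axis-aligned boxes in [x, y, w, h] form; VOT2014 annotations are rotated rectangles, so treating them as axis-aligned here is a simplifying assumption.

```python
# Per-frame overlap (intersection over union), average overlap (AO) over
# successfully tracked frames and success rate (SR) at phi_th = 0.33.
import numpy as np

def overlap(gt, pred):
    x1, y1 = max(gt[0], pred[0]), max(gt[1], pred[1])
    x2 = min(gt[0] + gt[2], pred[0] + pred[2])
    y2 = min(gt[1] + gt[3], pred[1] + pred[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = gt[2] * gt[3] + pred[2] * pred[3] - inter
    return inter / union if union > 0 else 0.0

def ao_sr(gt_boxes, pred_boxes, phi_th=0.33):
    phi = np.array([overlap(g, p) for g, p in zip(gt_boxes, pred_boxes)])
    success = phi > phi_th
    ao = phi[success].mean() if success.any() else 0.0   # average over successful frames
    sr = success.mean()                                  # fraction of successful frames
    return ao, sr
```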

The computational complexity depends on the number of particles (p), the dictionary size (\(n \times l\)), the number of OMP iterations (T), the number of candidate region patches (u) and the computations (\(t_d\)) required to evaluate the orthogonal projection in OMP. The total computational time per frame is \(p\times u \times t_{OMP}\), where \(t_{OMP}\) is the per-patch computational load of the OMP algorithm [14], given by \(t_{OMP} = t_dT + 2nT + 2T(l + n) + T^3\).
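To make this estimate concrete, the following back-of-the-envelope computation plugs in the parameters from above under illustrative assumptions: l = 600 atoms (at most 2 classes x 100 clusters x 3 atoms), t_d = 2nl operations for the projection, and u = 100 candidate patches per particle.

```python
# Rough per-patch and per-frame operation counts for the OMP-based model.
n, l, T, p, u = 25, 600, 3, 75, 100       # u and l are assumed values
t_d = 2 * n * l                           # assumed cost of the projection D^T r per iteration
t_omp = t_d * T + 2 * n * T + 2 * T * (l + n) + T ** 3
print(t_omp)                              # ~9.4e4 operations per patch
print(p * u * t_omp)                      # ~7.0e8 operations per frame
```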

Table 1. Performance comparison of the proposed tracker with the trackers MST [1], TLD [6], CMT [3]

3.2 Qualitative Evaluation

The results of the proposed tracker on different sequences are shown in Fig. 3. There are continuous changes of appearance and orientation in the "ball" and "polar bear" sequences. The target undergoes partial occlusion (frames \(156 - 177\)) as well as scale changes in the "car" sequence. The cell histogram based object model (constructed from patches) and the particle filter based localization help in handling these challenges. Illumination change is significant in the "tunnel" sequence as the target moves through differently illuminated regions. Here, patch normalization and sparse coding help in achieving illumination-invariant tracking.

Fig. 3.

Results of single object tracking for the proposed tracker on (a)-(d) "ball" (frames: 7, 217, 440, 586); (e)-(h) "car" (frames: 35, 141, 168, 235); (i)-(l) "tunnel" (frames: 6, 252, 483, 694) and (m)-(p) "polar bear" (frames: 77, 171, 251, 326) sequences from the VOT2014 dataset, covering different challenges such as illumination change (il), scale change (sc), in-plane rotation (ro) and partial occlusion (po)

4 Conclusion

We have proposed a novel target model in a hybrid generative-discriminative framework. The object patches are represented using foreground and background dictionaries (generative model). These representations are further weighted by SVM based classification scores (discriminative model). The object is localized in a particle filter framework. The proposed tracker is able to handle different challenging scenarios such as background clutter, partial occlusions, in-plane rotations, and scale and illumination changes. The performance of the proposed tracker is benchmarked against state-of-the-art trackers on sequences from the VOT2014 dataset.

The present work does not incorporate continuous dictionary and classifier update schemes in the object model. This extension would enable the tracker to trail targets for longer durations, under severe appearance changes and occlusions. Also, the present approach is somewhat slow due to the repeated application of OMP at the particle filtering stage. We propose to extend the present formulation through discriminative dictionary learning and fast OMP solvers.