Pattern Recognition

Volume 46, Issue 7, July 2013, Pages 1772-1788

Sparse coding based visual tracking: Review and experimental comparison

https://doi.org/10.1016/j.patcog.2012.10.006

Abstract

Recently, sparse coding has been successfully applied in visual tracking. The goal of this paper is to review the state-of-the-art tracking methods based on sparse coding. We first analyze the benefits of using sparse coding in visual tracking and then categorize these methods into appearance modeling based on sparse coding (AMSC), target searching based on sparse representation (TSSR), and their combination. For each category, we introduce the basic framework and subsequent improvements, with emphasis on their advantages and disadvantages. Finally, we conduct extensive experiments to compare the representative methods on a total of 20 test sequences. The experimental results indicate that: (1) AMSC methods significantly outperform TSSR methods. (2) For AMSC methods, both a discriminative dictionary and spatial-order-preserving pooling operators are important for achieving high tracking accuracy. (3) For TSSR methods, the widely used identity pixel basis degrades performance when the target or candidate images are not well aligned or severe occlusion occurs. (4) For TSSR methods, ℓ1-norm minimization is not necessary; ℓ2-norm minimization obtains comparable performance at lower computational cost. Open questions and future research topics are also discussed.

Highlights

  • A comprehensive review of visual tracking methods based on sparse coding.
  • Extensive experimental comparison of 15 state-of-the-art tracking methods on a total of 20 challenging sequences.
  • Analysis of the benefits of using sparse coding in visual tracking.
  • Discussion of future research topics.

Introduction

Visual tracking is the process of continuously inferring the state of a target from an image sequence. Usually, it is formulated as a search problem that aims at finding the candidate that best matches the target template and taking it as the tracking result. A typical tracking process contains several stages, as shown in Fig. 1. A target template is maintained over time and may be updated online once the tracking result is available. Before tracking starts at the current time step, a set of candidates is sampled around the target state at the previous time step. Both the target template and the candidates are represented using an appearance model. Then, a target searching strategy is used to find the candidate whose appearance best matches the template, which is taken as the tracking result.
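To make this pipeline concrete, the following minimal Python sketch implements the generic template-matching loop described above: candidates are sampled around the previous state, scored against the template by an appearance model, and the best match is kept and used to update the template. The function names, the Gaussian motion model, and all parameter values are illustrative assumptions, not the specific design of any tracker reviewed in this paper.

    import numpy as np

    def track_sequence(frames, init_state, extract_features, similarity,
                       num_candidates=300, search_radius=10, update_rate=0.05):
        """Generic template-matching tracking loop (illustrative sketch only).

        frames           : sequence of grayscale images (2-D numpy arrays)
        init_state       : (x, y, w, h) of the target in the first frame
        extract_features : maps an image patch to a feature vector (appearance model)
        similarity       : scores a candidate feature vector against the template
        """
        state = np.asarray(init_state, dtype=float)
        template = extract_features(crop(frames[0], state))
        results = [state.copy()]

        for frame in frames[1:]:
            # Sample candidate states around the previous state (Gaussian motion model).
            offsets = np.random.randn(num_candidates, 2) * search_radius
            candidates = [state + np.r_[dx, dy, 0.0, 0.0] for dx, dy in offsets]

            # Score each candidate with the appearance model and keep the best match.
            scores = [similarity(template, extract_features(crop(frame, c)))
                      for c in candidates]
            state = candidates[int(np.argmax(scores))]
            results.append(state.copy())

            # Conservative online template update to adapt to appearance changes.
            template = ((1 - update_rate) * template
                        + update_rate * extract_features(crop(frame, state)))
        return results

    def crop(image, state):
        """Extract the rectangular patch described by state = (x, y, w, h)."""
        x, y, w, h = (int(round(v)) for v in state)
        return image[max(y, 0):y + h, max(x, 0):x + w]

The different tracking families reviewed below essentially differ in how extract_features (appearance modeling) and similarity/search (target searching) are realized.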

Although visual tracking has been studied for more than 30 years, it remains a very challenging research topic due to unsolved issues arising from both appearance modeling and target searching. From the point of view of appearance modeling, discriminating the target from the background is a basic ability and plays a key role in complex scenes where the contrast between the target and the background is low. To achieve reliable tracking performance, it is also very important to handle target appearance variations during tracking, which include both intrinsic variations, such as pose changes and shape deformation, and extrinsic variations, such as illumination changes and occlusion. To handle these variations, a good appearance model should meet two requirements: adaptivity, so that it adapts to intrinsic appearance variations, and robustness, so that it is invariant to extrinsic appearance variations. From the point of view of target searching, computational complexity is a very important issue since real-time tracking speed is a practical requirement of most subsequent high-level applications such as action recognition and retrieval. In addition, it is also possible to handle appearance variations in the target searching stage, which is ignored by most existing methods.

In the literature, a number of tracking algorithms have been proposed (for example [1], [2], [3], [4], [5], [6]; see [7], [8] for detailed reviews). The methods most related to those discussed in this paper are learning based methods. Jepson et al. [9] proposed a framework to learn an adaptive appearance model that adapts to the changing appearance over time. In Collins et al. [10], an online feature selection method was proposed to select features that are able to discriminate the target from the background. Since feature selection is performed online as new observations become available, the selected features adapt well to environment changes. In Ross et al. [4], a tracking method was proposed to incrementally learn a low-dimensional subspace representation, which efficiently adapts to target appearance variations. Discriminative methods that formulate tracking as a classification problem have also attracted much attention. For example, Grabner and Bischof [11] proposed an online boosting method to update discriminative features that distinguish the target from the background. However, the update process may introduce errors due to inaccurate tracking results, which can eventually lead to tracking failure (drifting). To overcome the drifting problem, Grabner and Leistner [12] further proposed a semi-supervised boosting method to combine the decisions of a given prior and an online classifier. In Babenko et al. [13], multiple instance learning (MIL) was used to group ambiguous positive and negative samples into bags and learn a discriminative classifier, which further alleviates the drifting problem. Kuo et al. [14] proposed an AdaBoost based algorithm to learn a discriminative appearance model for multi-target tracking, which allows the model to adapt to target appearance variations over time. The common property of these complex appearance models is that they try to achieve both discriminative ability and robustness in a single appearance representation. However, it is difficult to find a good tradeoff between the two: usually, good robustness is achieved while discriminative ability is lost to some extent.

Recently, motivated by the popularity of compressive sensing in signal processing [15], [16], an elegant and effective model named sparse coding [17] has been attracting much attention in computer vision. Inspired by the success of sparse representation in face recognition [18], some researchers also applied sparse representation to visual tracking and reported state-of-the-art performance [19]. On the other hand, motivated by the biologically inspired object representation model [20], the work in [21] proposed to use the responses of sparse coding to model target appearance for visual tracking. Over the past several years, increasing attention has been paid to these directions and further improvements have been proposed to enhance tracking performance.

Although a variety of tracking methods based on sparse representation or sparse coding have been proposed, no existing work reviews these methods and answers several important questions: (1) What are the connections and differences between these methods? In this work, we classify these methods according to the stage (appearance modeling or target searching) in which sparse coding is used. In particular, we emphasize the difference between sparse representation and sparse coding. Sparse representation, which is in fact a sub-process of sparse coding, can be used to perform target searching (TSSR), which is the motivation of the pioneering work [19]. On the other hand, sparse coding learns local representations of image patches, which can be used to model target appearance (AMSC). Classifying the different tracking methods into TSSR and AMSC as well as their combination facilitates the understanding of the connections and differences among these methods. (2) Why would sparse coding be useful for visual tracking? Although a large number of tracking methods based on sparse coding have been proposed, no existing work analyzes the rationales behind these methods. In this work, we try to answer this question by analyzing the roles of sparse representation from the point of view of signal processing and the roles of sparse coding from the point of view of the biologically inspired representation mechanism of simple cells in the visual cortex. (3) Does sparse coding really benefit visual tracking? Although previous publications reported state-of-the-art tracking performance, the limited number of test sequences and comparison methods as well as the different implementation frameworks prevent a fair comparison that would reveal the benefits of using sparse coding in visual tracking. In this work, we collected a total of 11 trackers based on sparse coding and four widely used baseline trackers to perform a comprehensive experimental comparison on a total of 20 test sequences. The comparison results indicate that: (1) AMSC methods significantly outperform TSSR methods. (2) For AMSC methods, both a discriminative dictionary and spatial-order-preserving pooling operators are important for achieving high tracking accuracy. (3) For TSSR methods, the widely used identity pixel basis degrades performance when the target or candidate images are not well aligned or severe occlusion occurs. (4) For TSSR methods, ℓ1-norm minimization is not necessary; ℓ2-norm minimization obtains comparable performance at lower computational cost.
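Finding (4) can be illustrated with a small sketch of a TSSR-style representation: a candidate image vector is coded over a dictionary of target templates, the ℓ2-regularized coefficients have a closed-form ridge solution (a single K×K solve), while the ℓ1-regularized coefficients require an iterative solver, here a plain ISTA loop. The synthetic dictionary, candidate, and regularization weight below are assumptions for illustration only.

    import numpy as np

    rng = np.random.default_rng(0)
    D, K = 1024, 10                      # pixel dimension, number of target templates
    T = rng.standard_normal((D, K))      # dictionary of synthetic target templates
    y = T @ rng.standard_normal(K) + 0.01 * rng.standard_normal(D)  # candidate vector

    # l2-regularized representation: closed-form ridge solution, one small linear solve.
    lam = 0.1
    u_l2 = np.linalg.solve(T.T @ T + lam * np.eye(K), T.T @ y)

    # l1-regularized representation: iterative shrinkage-thresholding (ISTA).
    u_l1 = np.zeros(K)
    step = 1.0 / np.linalg.norm(T, 2) ** 2   # 1 / Lipschitz constant of the gradient
    for _ in range(200):
        grad = T.T @ (T @ u_l1 - y)
        z = u_l1 - step * grad
        u_l1 = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding

    print("l2 reconstruction error:", np.linalg.norm(y - T @ u_l2))
    print("l1 reconstruction error:", np.linalg.norm(y - T @ u_l1))

On well-conditioned dictionaries both representations reconstruct the candidate comparably well, but the ℓ2 solution is obtained in one step, which is the essence of the computational argument above.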

The rest of the paper is organized as follows. In Section 2, we give a brief introduction to sparse coding and its roles in visual tracking. In Section 3, we review the tracking methods based on sparse coding in the literature. We present the experimental comparison and analysis in Section 4. Finally, conclusions and future work are summarized in Section 5.

Section snippets

Overview of sparse coding

Let $x \in \mathbb{R}^D$ be a vector obtained by stacking all pixel intensities of an image into a column vector. Sparse coding represents $x$ as a linear combination of a set of basis functions $V = [v_1, \ldots, v_K] \in \mathbb{R}^{D \times K}$:
$$x = \sum_{k=1}^{K} u_k v_k + n, \qquad (1)$$
where $u_k$ is the coefficient of the $k$th basis function and $n \in \mathbb{R}^D$ is the noise. The basis function set $V$ is also called a dictionary and each basis function is called an atom. Let $u = [u_1, \ldots, u_K]^T \in \mathbb{R}^K$ be the coefficient vector. In general, there are many solutions of $u$ that satisfy Eq. (1) when the
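As a concrete illustration of Eq. (1), the sketch below learns an overcomplete dictionary from image patches and computes their sparse codes with scikit-learn. The sample image, patch size, dictionary size, and regularization weight are arbitrary choices for illustration and not the settings used by any method reviewed here.

    import numpy as np
    from sklearn.datasets import load_sample_image
    from sklearn.decomposition import MiniBatchDictionaryLearning
    from sklearn.feature_extraction.image import extract_patches_2d

    # Collect 8x8 grayscale patches; each patch flattened is a vector x in R^64.
    image = load_sample_image("china.jpg").mean(axis=2) / 255.0
    patches = extract_patches_2d(image, (8, 8), max_patches=2000, random_state=0)
    X = patches.reshape(len(patches), -1)
    X -= X.mean(axis=1, keepdims=True)          # remove the DC component of each patch

    # Learn an overcomplete dictionary (K = 100 atoms for D = 64 dimensions).
    dico = MiniBatchDictionaryLearning(n_components=100, alpha=1.0, random_state=0)
    dico.fit(X)

    # Sparse coding: each row of U holds the coefficients u for one patch, so that
    # X is approximately U @ dico.components_ with only a few non-zero entries per row.
    U = dico.transform(X)
    print("average non-zero coefficients per patch:",
          np.mean(np.count_nonzero(U, axis=1)))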

Visual tracking based on sparse coding

According to the motivations of using sparse coding in visual tracking and the stages of a general tracking system as shown in Fig. 1, visual tracking methods based on sparse coding can be roughly classified into three classes: (1) appearance modeling based on sparse coding (AMSC), (2) target searching based on sparse representation (TSSR) and (3) combination of both AMSC and TSSR. In this section, we first introduce the basic tracking framework in each class and then review some improvement

Experimental comparison

In this section, we conduct both quantitative and qualitative experiments on a total of twenty test sequences to evaluate the benefits of using sparse coding in visual tracking. In the following subsections, we first introduce the experimental setup, including the comparison methods, parameters, test sequences and evaluation criteria, and then present the performance comparison in detail. All the MATLAB source codes and datasets are available at http://www.shengping.us/CSTracking.html.
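The exact evaluation criteria are specified in the paper's experimental setup; as a generic reference, the two criteria most widely used in tracking evaluations, center location error and PASCAL-style bounding-box overlap, can be computed as in the sketch below. The (x, y, w, h) box format and the 0.5 success threshold are illustrative assumptions.

    def center_error(box_a, box_b):
        """Euclidean distance between the centers of two (x, y, w, h) boxes."""
        ax, ay = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
        bx, by = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
        return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

    def overlap_ratio(box_a, box_b):
        """PASCAL-style intersection-over-union of two (x, y, w, h) boxes."""
        x1 = max(box_a[0], box_b[0])
        y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
        y2 = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
        return inter / union if union > 0 else 0.0

    # A frame is typically counted as successful when the overlap exceeds 0.5.
    print(overlap_ratio((10, 10, 40, 60), (15, 12, 40, 60)))   # about 0.73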

Conclusion and future work

In this work, we reviewed the recently proposed tracking methods based on sparse coding and conducted extensive experiments to analyze the benefits of using sparse coding for visual tracking. The main contributions of this work are threefold:

  • The first contribution of this work is to explain the motivations of using sparse coding in visual tracking. In particular, we emphasized the difference between sparse representation and sparse coding and analyzed the benefits of using them in different

Acknowledgments

The authors would like to thank the editor and reviewers for their valuable comments. The work was supported by the National Natural Science Foundation of China (No. 61071180) and Key Program (No. 61133003). Shengping Zhang was also supported by the Short-Term Overseas Visiting Scholar Program of Harbin Institute of Technology when he was a visiting student researcher at the Redwood Center for Theoretical Neuroscience, University of California, Berkeley, United States.

References (69)

  • D. Ross et al.

    Incremental learning for robust visual tracking

    International Journal of Computer Vision

    (2007)
  • X. Sun, H. Yao, S. Zhang, A novel supervised level set method for non-rigid object tracking, in: Proceedings of the...
  • A. Jepson et al.

    Robust online appearance models for visual tracking

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2003)
  • R. Collins et al.

    On-line selection of discriminative tracking features

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2005)
  • H. Grabner, H. Bischof, On-line boosting and vision, in: Proceedings of the IEEE Conference on Computer Vision and...
  • H. Grabner, C. Leistner, Semi-supervised on-line boosting for robust tracking, in: Proceedings of the 10th European...
  • B. Babenko et al.

    Robust object tracking with online multiple instance learning

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2011)
  • C. Kuo, C. Huang, R. Nevatia, Multi-target tracking by on-line learned discriminative appearance models, in:...
  • D. Donoho

    Compressed sensing

    IEEE Transactions on Information Theory

    (2006)
  • E. Candès et al.

    Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information

    IEEE Transactions on Information Theory

    (2006)
  • B. Olshausen et al.

    Emergence of simple-cell receptive field properties by learning a sparse code for natural images

    Nature

    (1996)
  • J. Wright et al.

    Robust face recognition via sparse representation

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2009)
  • X. Mei, H. Ling, Robust visual tracking using ℓ1 minimization, in: Proceedings of the 12th International Conference on...
  • M. Riesenhuber et al.

    Hierarchical models of object recognition in cortex

    Nature Neuroscience

    (1999)
  • S. Zhang, H. Yao, S. Liu, Robust visual tracking using feature-based visual attention, in: Proceedings of the IEEE...
  • D. Field

    What is the goal of sensory coding?

    Neural Computation

    (1994)
  • D. Donoho

    For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution

    Communications on Pure and Applied Mathematics

    (2006)
  • M. Figueiredo et al.

    Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems

    IEEE Journal of Selected Topics in Signal Processing

    (2007)
  • S. Kim et al.

    An interior-point method for large-scale ℓ1-regularized least squares

    IEEE Journal of Selected Topics in Signal Processing

    (2007)
  • K. Engan et al.

    Frame based signal compression using method of optimal directions

    Proceedings of the IEEE International Symposium on Circuits and Systems

    (1999)
  • M. Aharon et al.

    K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation

    IEEE Transactions on Signal Processing

    (2006)
  • K. Fukushima

    Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position

    Biological Cybernetics

    (1980)
  • T. Serre, L. Wolf, T. Poggio, Object recognition with features inspired by visual cortex, in: Proceedings of the IEEE...
  • J. Mutch et al.

    Object class recognition and localization using sparse features with limited receptive fields

    International Journal of Computer Vision

    (2008)

    Shengping Zhang received his MS and PhD degrees in computer science from Harbin Institute of Technology, Harbin, China. Currently, he is a lecturer at Harbin Institute of Technology, Weihai, China. He was also a visiting student researcher at the Redwood Center for Theoretical Neuroscience at the University of California, Berkeley. His research interests focus on computer vision and pattern recognition, especially on moving object detection, tracking and action recognition.

    Hongxun Yao received the BS and MS degrees in computer science from the Harbin Shipbuilding Engineering Institute, Harbin, China, in 1987 and in 1990, respectively, and the PhD degree in computer science from Harbin Institute of Technology in 2003. Currently, she is a professor with the School of Computer Science and Technology, Harbin Institute of Technology. Her research interests include pattern recognition, multimedia technology, and human–computer interaction technology. She has published three books and over 100 scientific papers.

    Xin Sun received the BS and MS degrees in computer science from the Harbin Institute of Technology, Harbin, China, in 2008 and 2010, respectively. Currently, she is pursuing the PhD degree. Her research interests focus on computer vision and pattern recognition, especially on moving object detection and tracking.

    Xiusheng Lu received the BS degree in computer science from the Harbin Institute of Technology, Harbin, China, in 2010. He is currently a master's student in the same school.
