Sparse coding based visual tracking: Review and experimental comparison
Highlights
- A comprehensive review of visual tracking based on sparse coding.
- An extensive experimental comparison of 15 state-of-the-art tracking methods on a total of 20 challenging sequences.
- An analysis of the benefits of using sparse coding in visual tracking.
- A discussion of future research topics.
Introduction
Visual tracking is the process of continuously inferring the state of a target from an image sequence. Usually, it is formulated as a search problem that aims at finding the candidate that best matches the target template and taking it as the tracking result. A typical tracking process contains several stages, as shown in Fig. 1. A target template is maintained over time and may be updated online once the tracking result is available. Before tracking starts at the current time step, a set of candidates is sampled around the state of the target at the previous time step. Both the target template and the candidates are represented using an appearance model. Then, a target searching strategy is used to find the candidate whose appearance best matches the template, and this candidate is taken as the tracking result.
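The stages above can be sketched in code. The following is a toy illustration, not from the paper: candidates are sampled around the previous state with a Gaussian motion model, each candidate is represented by a flattened intensity patch, and the best match to the template becomes the tracking result; `extract_patch` and all parameter values are hypothetical choices for illustration only.

```python
import numpy as np

def extract_patch(frame, center, size=8):
    """Toy appearance model: crop a size x size window centered at
    (row, col), clipped to the frame, and flatten it into a vector."""
    h, w = frame.shape
    r = int(np.round(center[0]))
    c = int(np.round(center[1]))
    r = min(max(r, size // 2), h - size // 2)
    c = min(max(c, size // 2), w - size // 2)
    patch = frame[r - size // 2:r + size // 2, c - size // 2:c + size // 2]
    return patch.astype(float).ravel()

def track(frames, init_state, n_candidates=200, sigma=3.0, seed=0):
    """Sketch of the generic pipeline: sample candidates around the
    previous state, score each against the template, keep the best,
    then update the template online with the new result."""
    rng = np.random.default_rng(seed)
    state = np.asarray(init_state, dtype=float)   # (row, col) target center
    template = extract_patch(frames[0], state)
    trajectory = [state.copy()]
    for frame in frames[1:]:
        # Gaussian motion model around the previous state.
        candidates = state + rng.normal(0.0, sigma, size=(n_candidates, 2))
        feats = np.stack([extract_patch(frame, c) for c in candidates])
        # Search: the candidate with the smallest appearance distance wins.
        scores = -np.linalg.norm(feats - template, axis=1)
        state = candidates[np.argmax(scores)]
        trajectory.append(state.copy())
        # Conservative online template update.
        template = 0.95 * template + 0.05 * extract_patch(frame, state)
    return np.array(trajectory)
```

Real trackers replace the sum-of-squared-differences score and raw pixel patch with the appearance models and search strategies reviewed in the following sections.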
Although visual tracking has been studied for more than 30 years, it remains a very challenging research topic due to unsolved issues arising from both appearance modeling and target searching. From the point of view of appearance modeling, discriminating the target from the background is a basic ability and plays a key role in complex scenes where the contrast between the target and the background is low. To achieve reliable tracking performance, it is also very important to handle target appearance variations during tracking, which include both intrinsic variations, such as pose changes and shape deformation, and extrinsic variations, such as illumination changes and occlusion. To handle these variations, a good appearance model should meet two requirements: adaptivity, so that it adapts to the intrinsic appearance variations, and robustness, so that it is invariant to the extrinsic appearance variations. From the point of view of target searching, computational complexity is a very important issue since real-time tracking speed is a practical requirement of most subsequent high-level applications such as action recognition and retrieval. In addition, it is also possible to handle appearance variations in the target searching stage, which is ignored by most existing methods.
In the literature, a number of tracking algorithms have been proposed (for example [1], [2], [3], [4], [5], [6]; see [7], [8] for detailed reviews). The methods most related to those discussed in this paper are learning based methods. Jepson et al. [9] proposed a framework to learn an adaptive appearance model that adapts to the changing appearance over time. Collins et al. [10] proposed an online feature selection method to select features that are able to discriminate the target from the background. Since the feature selection is performed online as new observations become available, the selected features adapt well to environment changes. Ross et al. [4] proposed a tracking method that incrementally learns a low-dimensional subspace representation, which efficiently adapts to target appearance variations. Discriminative methods that formulate tracking as a classification problem have also attracted much attention. For example, Grabner and Bischof [11] proposed an online boosting method to update discriminative features that distinguish the target from the background. However, the update process may introduce errors due to inaccurate tracking results, which can eventually lead to tracking failure (drifting). To overcome the drifting problem, Grabner and Leistner [12] further proposed a semi-supervised boosting method that combines the decisions of a given prior and an online classifier. Babenko et al. [13] used multiple instance learning (MIL) to group ambiguous positive and negative samples into bags and learn a discriminative classifier, which can further alleviate the drifting problem. Kuo et al. [14] proposed an AdaBoost based algorithm to learn a discriminative appearance model for multi-target tracking, which allows the model to adapt to target appearance variations over time. The common property of these complex appearance models is that they try to achieve both discriminative ability and robustness in a single appearance representation.
However, it is difficult to find a good tradeoff between discriminative ability and robustness. Usually, good robustness is achieved while discriminative ability is lost to some extent.
Recently, motivated by the popularity of compressive sensing in signal processing [15], [16], an elegant and effective model named sparse coding [17] has been attracting much attention in computer vision. Very recently, inspired by the success of sparse representation in face recognition [18], some researchers also tried to use sparse representation in visual tracking and reported state-of-the-art performance [19]. On the other hand, motivated by the biologically inspired object representation model [20], the work in [21] proposed to use the responses of sparse coding to model target appearance for visual tracking. Over the past several years, increasing attention has been paid to these directions and several improvements have been proposed to further enhance the tracking performance.
Although a variety of tracking methods based on sparse representation or sparse coding have been proposed, there is no work reviewing these methods and answering several important questions: (1) What are the connections and differences between these methods? In this work, we classify these methods according to the stage (appearance modeling or target searching) in which sparse coding is used. In particular, we emphasize the difference between sparse representation and sparse coding. Sparse representation, in fact a sub-process of sparse coding, can be used to perform target searching (TSSR), which is the motivation of the pioneering work [19]. On the other hand, sparse coding learns local representations of image patches, which can be used to model target appearance (AMSC). Classifying different tracking methods into TSSR, AMSC and their combination facilitates the understanding of the connections and differences among these methods. (2) Why would sparse coding be useful for visual tracking? Although a large number of tracking methods based on sparse coding have been proposed, no work has tried to analyze the rationales behind these methods. In this work, we try to answer this question by analyzing the roles of sparse representation from the point of view of signal processing and the roles of sparse coding from the point of view of the biologically inspired representation mechanism of simple cells in the visual cortex. (3) Does sparse coding really benefit visual tracking? Although previous publications reported state-of-the-art tracking performance, the limited number of test sequences and comparison methods, as well as the different implementation frameworks, prevent a fair comparison that would reveal the benefits of using sparse coding in visual tracking. In this work, we collected a total of 11 trackers based on sparse coding and four widely used baseline trackers to perform a comprehensive experimental comparison on a total of 20 test sequences.
The comparison results indicate: (1) AMSC methods significantly outperform TSSR methods. (2) For AMSC methods, both discriminative dictionaries and pooling operators that preserve spatial order are important for achieving high tracking accuracy. (3) For TSSR methods, the widely used identity pixel basis degrades the performance when the target or candidate images are not well aligned or severe occlusion occurs. (4) For TSSR methods, ℓ1-norm minimization is not necessary; ℓ2-norm minimization obtains comparable performance at lower computational cost.
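Finding (4) reflects the fact that ℓ2-regularized coding has a closed-form solution, while ℓ1 minimization requires an iterative solver. Below is a minimal numpy sketch of the ℓ2 alternative; the random dictionary stands in for a set of target templates, and all names and sizes are illustrative, not taken from any reviewed tracker.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 10))                  # hypothetical template dictionary
coef = rng.normal(size=10)
y = D @ coef + 0.01 * rng.normal(size=64)      # a candidate patch to score

# l2-regularized coding has a closed-form solution: one small linear
# system per candidate, with no iterative sparse solver required.
lam = 0.1
u_l2 = np.linalg.solve(D.T @ D + lam * np.eye(10), D.T @ y)

# The reconstruction error is then used to score the candidate.
score = np.linalg.norm(y - D @ u_l2)
```

Since `D.T @ D + lam * np.eye(10)` is shared across candidates, it can even be factorized once per frame, which is where the cost advantage over per-candidate ℓ1 solves comes from.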
The rest of the paper is organized as follows. In Section 2, we give a brief introduction to sparse coding and its roles in visual tracking. In Section 3, we review the tracking methods based on sparse coding in the literature. We conduct an experimental comparison and analysis in Section 4. Finally, conclusions and future work are summarized in Section 5.
Section snippets
Overview of sparse coding
Let $\mathbf{y} \in \mathbb{R}^N$ be a vector obtained by stacking all pixel intensities of an image into a column vector. Sparse coding represents $\mathbf{y}$ as a linear combination of a set of basis functions
$$\mathbf{y} = \sum_{k=1}^{K} u_k \boldsymbol{\phi}_k + \boldsymbol{\epsilon}, \qquad (1)$$
where $u_k$ is the coefficient of the $k$th basis function $\boldsymbol{\phi}_k$ and $\boldsymbol{\epsilon}$ is the noise. The basis function set $\{\boldsymbol{\phi}_k\}_{k=1}^{K}$ is also called a dictionary and each basis function is called an atom. Let $\mathbf{u} = [u_1, u_2, \ldots, u_K]^\top$ be the coefficient vector. In general, there are many solutions of $\mathbf{u}$ that satisfy Eq. (1) when the
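In practice the coefficient vector is typically computed by relaxing the problem to an ℓ1-regularized least squares (lasso) objective and solving it iteratively. The sketch below uses the iterative shrinkage-thresholding algorithm (ISTA) as a generic solver; it is one standard choice, not the specific solver used by any particular tracker reviewed here.

```python
import numpy as np

def ista(y, D, lam=0.02, n_iter=500):
    """ISTA for the lasso relaxation of Eq. (1):
    minimize 0.5 * ||y - D u||_2^2 + lam * ||u||_1,
    where the columns of D are the dictionary atoms."""
    L = np.linalg.norm(D, ord=2) ** 2          # Lipschitz constant of the gradient
    u = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = u - D.T @ (D @ u - y) / L          # gradient step on the data term
        u = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-thresholding
    return u
```

The soft-thresholding step drives most coefficients exactly to zero, which is what makes the resulting code sparse.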
Visual tracking based on sparse coding
According to the motivations of using sparse coding in visual tracking and the stages of a general tracking system as shown in Fig. 1, visual tracking methods based on sparse coding can be roughly classified into three classes: (1) appearance modeling based on sparse coding (AMSC), (2) target searching based on sparse representation (TSSR) and (3) combination of both AMSC and TSSR. In this section, we first introduce the basic tracking framework in each class and then review some improvement
Experimental comparison
In this section, we conducted both quantitative and qualitative experiments on a total of twenty test sequences to evaluate the benefits of using sparse coding in visual tracking. In the following subsections, we first introduce the experimental setup, including comparison methods, parameters, test sequences and evaluation criteria, and then present the performance comparison in detail. All the MATLAB source code and datasets are available at http://www.shengping.us/CSTracking.html.
Conclusion and future work
In this work, we reviewed the recently proposed tracking methods based on sparse coding and conducted extensive experiments to analyze the benefits of using sparse coding for visual tracking. The main contributions of this work are threefold:
- The first contribution of this work is to explain the motivations of using sparse coding in visual tracking. In particular, we emphasized the difference between sparse representation and sparse coding and analyzed the benefits of using them in different
Acknowledgments
The authors would like to thank the editor and reviewers for their valuable comments. The work was supported by the National Natural Science Foundation of China (No. 61071180) and Key Program (No. 61133003). Shengping Zhang was also supported by the Short-Term Overseas Visiting Scholar Program of Harbin Institute of Technology when he was a visiting student researcher at the Redwood Center for Theoretical Neuroscience, University of California, Berkeley, United States.
Shengping Zhang received his MS and PhD degrees in computer science from Harbin Institute of Technology, Harbin, China. Currently, he is a lecturer at Harbin Institute of Technology, Weihai, China. He was also a visiting student researcher at the Redwood Center for Theoretical Neuroscience at the University of California, Berkeley. His research interests focus on computer vision and pattern recognition, especially on moving object detection, tracking and action recognition.
References (69)
- et al., Object tracking using SIFT features and mean shift, Computer Vision and Image Understanding (2009)
- et al., Human motion tracking for rehabilitation—a survey, Biomedical Signal Processing and Control (2008)
- et al., Recent advances and trends in visual tracking: a review, Neurocomputing (2011)
- et al., Neurophysiology of shape processing, Image and Vision Computing (1993)
- et al., Sparse coding with an overcomplete basis set: a strategy employed by V1?, Vision Research (1997)
- et al., The independent components of natural scenes are edge filters, Vision Research (1997)
- et al., Robust visual tracking based on online learning sparse representation, Neurocomputing (2013)
- P. Pérez, C. Hue, J. Vermaak, M. Gangnet, Color-based probabilistic tracking, in: Proceedings of the European...
- et al., Kernel-based object tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence (2003)
- et al., Tracking non-rigid objects in video sequences, International Journal of Information Acquisition (2006)
- Incremental learning for robust visual tracking, International Journal of Computer Vision
- Robust online appearance models for visual tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence
- On-line selection of discriminative tracking features, IEEE Transactions on Pattern Analysis and Machine Intelligence
- Robust object tracking with online multiple instance learning, IEEE Transactions on Pattern Analysis and Machine Intelligence
- Compressed sensing, IEEE Transactions on Information Theory
- Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information, IEEE Transactions on Information Theory
- Emergence of simple-cell receptive field properties by learning a sparse code for natural images, Nature
- Robust face recognition via sparse representation, IEEE Transactions on Pattern Analysis and Machine Intelligence
- Hierarchical models of object recognition in cortex, Nature Neuroscience
- What is the goal of sensory coding?, Neural Computation
- For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution, Communications on Pure and Applied Mathematics
- Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems, IEEE Journal of Selected Topics in Signal Processing
- An interior-point method for large-scale ℓ1-regularized least squares, IEEE Journal of Selected Topics in Signal Processing
- Frame based signal compression using method of optimal directions, Proceedings of the IEEE International Symposium on Circuits and Systems
- K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation, IEEE Transactions on Signal Processing
- Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biological Cybernetics
- Object class recognition and localization using sparse features with limited receptive fields, International Journal of Computer Vision
Hongxun Yao received the BS and MS degrees in computer science from the Harbin Shipbuilding Engineering Institute, Harbin, China, in 1987 and in 1990, respectively, and the PhD degree in computer science from Harbin Institute of Technology in 2003. Currently, she is a professor with the School of Computer Science and Technology, Harbin Institute of Technology. Her research interests include pattern recognition, multimedia technology, and human–computer interaction technology. She has published three books and over 100 scientific papers.
Xin Sun received the BS and MS degrees in computer science from the Harbin Institute of Technology, Harbin, China, in 2008 and 2010, respectively. Currently, she is pursuing the PhD degree. Her research interests focus on computer vision and pattern recognition, especially on moving objects detection and tracking.
Xiusheng Lu received the BS degree in computer science from the Harbin Institute of Technology, Harbin, China, in 2010. He is currently a master's student at the same school.