Elsevier

Neurocomputing

Volume 261, 25 October 2017, Pages 266-275
Effective visual tracking by pairwise metric learning

https://doi.org/10.1016/j.neucom.2016.05.115

Abstract

For robust visual tracking, the appearance model should separate the object from its background well while accurately adapting to its appearance variations. However, most existing tracking methods focus on only one of these two aspects, or combine them through two separate modules at the price of doubling the computational cost. In this paper, we present a novel appearance model for robust visual tracking based on pairwise metric learning. Specifically, visual tracking is viewed as a pairwise regression problem, and the extreme learning machine (ELM) is used to construct the pairwise regression framework. In ELM-based pairwise training, two constraints are enforced: target observations must have regression outputs different from those of background observations, while the various target observations collected during tracking should have approximately equal regression outputs. Thus, both the discriminative and generative capabilities are considered in a single object tracking model. Moreover, online sequential ELM (OS-ELM) is used to update the resulting appearance model, leading to a more robust tracking process. Extensive experimental evaluations on challenging video sequences demonstrate the effectiveness and efficiency of the proposed tracker.

Introduction

Visual tracking is a significant topic for human vision cognitive systems [1], and it has various applications, including video surveillance, traffic monitoring, and behavior analysis. Although there has been much progress [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17] in the past decades, designing a robust and efficient visual tracker remains challenging due to large appearance variations, illumination changes, background clutter, and partial occlusion.

A typical tracking framework consists of three main parts: feature representation, appearance modeling, and dynamic forecasting [3]. Among them, appearance modeling builds a mathematical model for object identification and plays a key role in visual tracking [18]. From the appearance modeling point of view, distinguishing the target from its background is a basic ability. This trait is extensively studied in many discriminative trackers, e.g., support vector machine (SVM) based methods [6], [19] and boosting-based ones [3], [20]. On the other hand, to achieve stable tracking, the appearance model should be able to handle visual variations of the target itself. This property is emphasized in generative trackers, e.g., subspace models [2], [4] and template-based ones [21], [22]. Conceptually, discriminative trackers aim to maximize the separability between the object and non-object regions, while generative ones concentrate on finding the object under different visual variations.

For robust tracking, a natural approach is to consider both aspects and build a hybrid tracker. Zhong et al. [11] proposed a tracking method combining a discriminative classifier and a sparse generative model. In addition, similar works [23], [24] have shown that hybrid trackers can yield tracking performance superior to that of purely discriminative or purely generative ones. However, this advantage relies on two independent modules. Thus, the corresponding computational cost would double for the same amount of training data [25].

In this paper, rather than combining the discriminative and generative information of the training data using two separate modules, we exploit these two kinds of information in a single object model by pairwise metric learning (PML). PML studies the similarity or dissimilarity of a data pair, and sample pairs are used as training instances. In contrast to classical regression/classification, applying PML has two advantages: (1) it is easier to obtain a large number of pairwise training instances, which addresses the problem that labeled data are insufficient in visual tracking [26]; (2) PML can exploit the mutual relationship within a training pair, and thus tends to achieve better learning performance [27].
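To make advantage (1) concrete, the following minimal Python sketch shows one possible way to enumerate pairwise training instances from a handful of samples. The function name `build_pairs`, the labels (0 for a target-target pair, 1 for a target-background pair), and the toy sample names are assumptions for illustration only; the paper's actual pair encoding is not given in this excerpt.

```python
from itertools import product


def build_pairs(targets, backgrounds):
    """Return a list of ((sample_a, sample_b), label) tuples.

    Label 0 marks a target-target pair (should look alike),
    label 1 marks a target-background pair (should look different).
    Both label values are illustrative assumptions.
    """
    pairs = []
    # Target-target pairs: every unordered pair of target observations.
    for i in range(len(targets)):
        for j in range(i + 1, len(targets)):
            pairs.append(((targets[i], targets[j]), 0))
    # Target-background pairs: every target combined with every background.
    for t, b in product(targets, backgrounds):
        pairs.append(((t, b), 1))
    return pairs


targets = ["t1", "t2", "t3"]      # hypothetical target samples
backgrounds = ["b1", "b2"]        # hypothetical background samples
pairs = build_pairs(targets, backgrounds)
print(len(pairs))  # 3 target-target + 6 target-background = 9
```

With n target and p background samples, this enumeration yields n(n-1)/2 + np pairs, so the number of training instances grows much faster than the number of labeled samples.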

PML has been widely used in different fields, such as document retrieval [26], object classification [28], and recommendation tasks [29]. Inspired by the success of PML in these applications, we apply it to visual tracking. Technically, unlike non-pairwise trackers (e.g., [15], [19], [30]) that only use target and/or background samples as training instances, the proposed tracker uses target-background pairs and target-target pairs. A novel and efficient learning technique, the Extreme Learning Machine (ELM) [31], is used to build the pairwise appearance model. Theoretically, in ELM training, samples from different subsets (target or background) will have different ELM output responses, and samples from the same subset will have similar ones. With this rationale, the trained ELM network aims to reflect the difference within a target-background pair, while producing almost the same responses for a target-target pair. Thus, the discriminative and generative information of the training data is exploited in a single ELM network. Furthermore, to adapt to visual changes during tracking, online sequential ELM (OS-ELM) [32] is used to update the obtained pairwise appearance model, which results in a more robust tracking process.
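The online update mentioned above can be illustrated with the standard OS-ELM recursion, which is a recursive least-squares update of the output weights as new chunks of data arrive. The sketch below is a generic version under stated assumptions, not the paper's tracker: the function names, the tanh hidden layer, the small ridge term added for numerical stability, and the random toy data are all illustrative choices.

```python
import numpy as np


def hidden(X, W, b):
    # Fixed random hidden layer (tanh activation is an assumption).
    return np.tanh(X @ W + b)


def oselm_init(H0, T0, ridge=1e-2):
    # Initial phase: P0 = (H0^T H0 + ridge*I)^-1, beta0 = P0 H0^T T0.
    # The ridge term is an added assumption to keep the inverse stable.
    P = np.linalg.inv(H0.T @ H0 + ridge * np.eye(H0.shape[1]))
    beta = P @ H0.T @ T0
    return P, beta


def oselm_update(P, beta, H, T):
    # Sequential phase for a new chunk with hidden outputs H and targets T.
    K = np.linalg.inv(np.eye(H.shape[0]) + H @ P @ H.T)
    P = P - P @ H.T @ K @ H @ P
    beta = beta + P @ H.T @ (T - H @ beta)
    return P, beta


rng = np.random.default_rng(0)
W, b = rng.standard_normal((4, 10)), rng.standard_normal(10)
X0, T0 = rng.standard_normal((30, 4)), rng.standard_normal((30, 1))  # initial chunk
X1, T1 = rng.standard_normal((8, 4)), rng.standard_normal((8, 1))    # new chunk

P, beta = oselm_init(hidden(X0, W, b), T0)
P, beta = oselm_update(P, beta, hidden(X1, W, b), T1)
```

After the update, `beta` matches the ridge-regularized least-squares solution computed on both chunks at once, so the model can be refreshed without storing or retraining on earlier data.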

Recently, several trackers [5], [6] have involved the concept of PML. They differ from the proposed method in the following aspects: (1) they concentrate only on the discriminative analysis of target-background pairs and ignore the generative information of target-target pairs, whereas the proposed tracker uses both the target-background discriminative information and the target-target generative information in a single object model; (2) unlike the existing pairwise methods, the proposed tracker can be performed efficiently without heavy quadratic programming (QP) [6] or matrix factorization (MF) [5], due to the fast and effective learning capabilities of ELM [31].

Preliminary knowledge

To facilitate the understanding of the proposed tracker, this section briefly reviews the related background on ELM. For a more detailed discussion and analysis, we refer readers to [31], [33]. Note that the differences and relationships between ELM and other earlier works have been analyzed in depth in [34].

ELM, proposed by Huang [34], was originally used for training generalized single hidden layer feed-forward neural networks (SLFNs), and was recently extended to the multi-layer case [35].
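As a rough illustration of how a single-hidden-layer ELM is commonly trained (hidden-layer weights drawn at random and kept fixed, output weights solved in closed form by least squares), a minimal NumPy sketch follows. The function names, the sigmoid activation, the hidden-layer size, and the toy regression data are assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def elm_train(X, T, n_hidden=50, seed=0):
    """Fit a basic single-hidden-layer ELM (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Random input weights and biases; they are never trained.
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    # Hidden-layer output matrix with a sigmoid activation.
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    # Output weights: least-squares solution via the pseudoinverse of H.
    beta = np.linalg.pinv(H) @ T
    return W, b, beta


def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta


# Toy regression problem (hypothetical data): fit sin(x) on [-3, 3].
X = np.linspace(-3.0, 3.0, 40).reshape(-1, 1)
T = np.sin(X)
W, b, beta = elm_train(X, T, n_hidden=30)
mse = float(np.mean((elm_predict(X, W, b, beta) - T) ** 2))
print(mse)
```

Because only the output weights are learned, and in closed form, training involves no iterative back-propagation, which is the source of the fast learning speed the text attributes to ELM.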

Pairwise training

The pipeline of the proposed tracker is shown in Algorithm 1 and Fig. 1. Let $A=[a_1,a_2,\ldots,a_n]$ denote the target samples dynamically collected before the current $t$th frame, which represent the different target observations from the 1st frame to the $(t-1)$th frame. The current training data is composed of two parts: $X=[x_1,x_2,\ldots,x_m]$ represents the target samples collected at the $t$th frame, and $B=[b_1,b_2,\ldots,b_p]$ stands for the background samples far away from the current estimated object center. And f

Discussions

We note that the contributions of the proposed tracker are twofold: (1) a novel appearance model is presented based on the PML method; (2) the ELM technique is used to improve the pairwise learning performance. The detailed novelties are expounded as follows.

Performance evaluation and analysis

In this section, we conduct comprehensive comparisons to evaluate the performance of the proposed approach, named the PMLT tracker. We compare the tracking results of our method with those of seven other algorithms, including the ranking SVM tracker (RSVT) [6], the sparsity collaborative tracker (SCM) [11], the multiple instance learning tracker (MIL) [3], the fragments tracker (Frag) [22], the compressive tracker (CT) [30], the visual tracking decomposition tracker (VTD) [46] and the

Conclusion

In this paper, based on PML, we have proposed a novel and effective online tracking method using the Extreme Learning Machine (ELM). Unlike existing trackers, the proposed method considers both the discriminative and generative aspects of appearance modeling in a single object model. The fast learning speed of ELM makes the pairwise training efficient. Moreover, we have designed the online sequential updating of the appearance model, which results in a more robust tracking

Acknowledgment

This work was supported by the National Natural Science Foundation of China under Grant 61301090, the Beijing Excellent Talent Fund under Grant 2013D009011000001, the National High Technology Research and Development Program of China under Grant 2014AA8012013L, and in part by the Excellent Young Scholars Research Fund of Beijing Institute of Technology under Grant 2013YR0508.

References (49)

  • Y. Bai et al. Robust visual tracking via ranking SVM. Proceedings of the IEEE International Conference on Image Processing (2011)
  • Y. Bai et al. Robust tracking via weakly supervised ranking SVM. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2012)
  • D. Wang et al. Object tracking via 2DPCA and l1-regularization. IEEE Signal Process. Lett. (2012)
  • Y. Bai et al. Object tracking via robust multitask sparse representation. IEEE Signal Process. Lett. (2014)
  • D. Wang et al. Online visual tracking via two view sparse representation. IEEE Signal Process. Lett. (2014)
  • W. Zhong et al. Robust object tracking via sparsity-based collaborative model. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2012)
  • M. Tang et al. Robust tracking with discriminative ranking lists. IEEE Trans. Image Process. (2012)
  • T. Liu et al. Visual tracking via temporally smooth sparse coding. IEEE Signal Process. Lett. (2015)
  • H. Zhang et al. Visual tracking via constrained incremental non-negative matrix factorization. IEEE Signal Process. Lett. (2015)
  • H. Liu et al. Multitask extreme learning machine for visual tracking. Cognit. Comput. (2014)
  • S. Avidan. Support vector tracking. IEEE Trans. Patt. Anal. Mach. Intell. (2004)
  • H. Grabner et al. Real-time tracking via on-line boosting. Proceedings of the British Machine Vision Conference (2006)
  • X. Mei et al. Robust visual tracking and vehicle classification via sparse representation. IEEE Trans. Patt. Anal. Mach. Intell. (2011)
  • A. Adam et al. Robust fragments-based tracking using the integral histogram. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2006)
    Chenwei Deng received the Ph.D. degree in signal and information processing from Beijing Institute of Technology, Beijing, China, in 2009. He is currently a full professor at the School of Information and Electronics, Beijing Institute of Technology, China. He has authored or co-authored over 50 technical papers in refereed international journals and conferences, and co-edited one book. His current research interests include image/video coding, quality assessment, perceptual modeling, and feature representation.

    Baoxian Wang received the B.Eng. degree from Northeastern University, China, in 2010. He is currently pursuing the Ph.D. degree in the School of Information and Electronics, Beijing Institute of Technology, Beijing, China. His current research interests include image processing, computer vision, machine learning, and pattern recognition.

    Weisi Lin received the Ph.D. degree from King’s College London, London, U.K. He is currently an Associate Professor and Associate Chair (Graduate Studies) with the School of Computer Engineering, Nanyang Technological University, Singapore. His current research interests include visual quality evaluation and perception-inspired signal modeling. He has published over 270 refereed papers at international journals and conferences. More details are available at http://www.ntu.edu.sg/home/wslin/.

    Guang-Bin Huang received the Ph.D. degree in electrical engineering from Nanyang Technological University, Singapore, in 1999. Since May 2001, he has been working as an Assistant Professor and Associate Professor in the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. His current research interests include machine learning, computational intelligence, and extreme learning machines. He serves as an Associate Editor of Neurocomputing and IEEE Transactions on Cybernetics.

    Baojun Zhao received the Ph.D. degree in electromagnetic measurement technology and equipment from Harbin Institute of Technology (HIT), Harbin, China, in 1996. From 1996 to 1998, he was a postdoctoral fellow at Beijing Institute of Technology (BIT), Beijing, China. Since 1998, he has been engaged in teaching and research work at Radar Research Laboratory, BIT. He has authored or co-authored over 100 publications. His main research interests include image/video coding, image recognition.