Elsevier

Neurocomputing

Volume 315, 13 November 2018, Pages 33-47
Discriminative tracking via supervised tensor learning

https://doi.org/10.1016/j.neucom.2018.05.108

Highlights

  • A tensor-based discriminative tracking framework is presented in this paper.

  • A multi-linear classifier with structured output is proposed for tensor input.

  • Parameter tensor reconstruction in online updating provides robustness against noise.

  • Tensor block coordinate descent optimization is introduced in online learning.

  • The proposed tracker shows superior performance on benchmark videos.

Abstract

Discriminative tracking algorithms have made continued progress in distinguishing the target from the background in unconstrained environments. The learning and detection tasks in existing visual tracking methods often convert a multidimensional data array into a vector-based observation. Because vectorization alters the 2-D spatial structure of the image, transformation variants and global noise weaken the discriminative ability of the target representation and often degrade performance. In contrast to vector representations, this paper presents a tensor-based large-margin discriminative framework for visual tracking that uses supervised tensor learning. In our method, an online structured support tensor classifier is designed that produces a multi-linear decision function, incorporating the nonlinearity of tensor-based features over the target. To provide better spatial cues for target representation against noise and to facilitate online tracking, we further introduce truncated Tucker decomposition into structured multi-linear learning. The proposed algorithm performs an effective parameter tensor reconstruction in the classifier updating procedure and shows robust discriminative ability against several video background variants. Furthermore, a tensor block coordinate descent optimization is presented to achieve a closed-form solution specific to the proposed truncated structured Tucker machine (TSTM). Experimental results on a recent comprehensive tracking benchmark demonstrate promising performance of the proposed method, both subjectively and objectively, compared with several state-of-the-art algorithms.

Introduction

Visual tracking spans many real-world applications, including video surveillance, action analysis, and augmented reality. Significant progress has been made in recent years in visual tracking, which generally aims to continuously locate an annotated target given its initial position. Tracking in unconstrained environments remains challenging because of numerous appearance variations (such as out-of-view, background clutter, and illumination variation) [1], [14]. Recently, several promising algorithms have been proposed to address this problem. Widely discussed models in visual tracking fall into three categories: tracking-by-generative models [2], [4], [6], [7], tracking-by-classification models [3], [17], [18], [19], [20], [37], [44], and tracking-by-correlation-filter models [21], [26], [27], [43].

Tracking-by-classification methods have shown favorable performance; they cast tracking as a classification task and update the discriminative information about the target online. In supervised-learning-based tracking, generic discriminative trackers have been developed that incorporate conventional classifiers, such as P-N learning [38], multiple instance learning [31], and support vector machines (SVM) [17], [18], [19], [20], [36]. Hare et al. [17] argued that the binary classifier is a primary cause of inaccurate sample labeling and that structured output classification handles this issue better. They proposed tracking based on the structured output of an SVM, and their experimental results demonstrated improved tracking performance, since the method can deal with complex structured outputs (e.g., trees, sequences, or sets) rather than class labels [37].

In the tracking process, object representation plays a pivotal role in analyzing appearance variants. A potential limitation of conventional linear discriminative classifiers is their inability to handle the tensor-based feature structure directly. These methods do not preserve the intrinsic local spatial geometry and do not fully use the discriminative structure of the original features in the learning process. Apart from discounting spatial cues, a major disadvantage of the vector formulation is the curse of dimensionality, which is caused by concatenating high-dimensional features. The vector formulation therefore makes traditional classification methods prone to overfitting, especially for small sample sizes [10]. Because visual tracking runs online, tracking-by-classification methods usually have few training samples available and are therefore sensitive to the global noise and transformation variants that arise in both training and prediction. The linear models used in visual tracking with vector-encoded formulations are not an optimal choice for challenging scenarios. This can be seen clearly in Fig. 1, where conventional trackers such as Struck [17], MEEM [19], and DLSVM [18] are badly affected by rigid deformation and severe drift.

Recently, tensor (multi-way) representations of features have been employed in many multimedia systems [9], [11], [28], [29]. A tensor formulation collects features along multiple separate modes and thus overcomes the disadvantage of the vector representation, in which all modes of the features are concatenated and mixed. Several multi-linear extensions of vector-based models have been developed and have provided better performance. Supervised tensor learning (STL) frameworks [31] usually operate on high-order data directly to facilitate the learning process, using CANDECOMP/PARAFAC (CP) decomposition and Tucker decomposition [39]. As multi-linear generalizations of matrix decomposition, these decompositions can capture the variations along every mode independently and can be used to solve classification problems [11], [28], [29]. The parameter tensor is decomposed into one factor vector along each mode, which can help preserve discriminative information. Nevertheless, the existing STL framework lacks structured output, which is often desirable in online visual tracking under challenging video scenarios.
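To make the rank-one parameterization above concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a hypothetical 8 × 6 × 3 feature tensor and one randomly drawn weight vector per mode; the function name `rank_one_decision` and all sizes are illustrative assumptions. It shows that contracting the sample with one vector per mode gives the same score as an inner product with the full rank-one weight tensor.

```python
import numpy as np

def rank_one_decision(X, factors, bias=0.0):
    """Multi-linear decision value for a tensor sample X, one weight vector per mode."""
    out = X
    # Each step contracts the current leading mode; after all modes a scalar remains.
    for w in factors:
        out = np.tensordot(w, out, axes=([0], [0]))
    return float(out) + bias

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 6, 3))                    # hypothetical feature tensor
factors = [rng.standard_normal(n) for n in X.shape]   # hypothetical per-mode weights

# Equivalent dense form: inner product with the rank-one weight tensor.
W = np.einsum("i,j,k->ijk", *factors)
score_multilinear = rank_one_decision(X, factors)
score_dense = float(np.sum(W * X))

n_params_factored = sum(len(w) for w in factors)      # 8 + 6 + 3 = 17
n_params_dense = W.size                               # 8 * 6 * 3 = 144
```

In this toy setting the factored classifier stores 17 weights instead of 144, which illustrates why a multi-linear parameterization can be less prone to overfitting when training samples are scarce.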

To exploit accurate target representation and reliable discrimination for visual tracking, we develop an end-to-end tensor tracker with structured output learning that performs both tensor representation and tensor discrimination. More specifically, an online discriminative model for tensor learning has not previously been addressed in visual tracking. We propose truncated Tucker decomposition with a structured output SVM, which together construct an online supervised tensor learning procedure for visual tracking. Our method not only takes into account the spatial cues of the target but also reduces the influence of noise and enhances multi-linear separability. In the tracking process, supervised tensor learning deals with tensor features and the parameters learned by a classifier. A 3-order parameter tensor is decomposed into a core tensor and projection matrices. The whole tracking procedure is conducted online. At each round of updating the tracking prediction, the parameter tensor reconstruction yields a noise-resistant layout for the iterative updates [40]. By means of the proposed online tensor learning, we extract compact spatial information from the features via a low-rank classifier parameter tensor. In addition, we can achieve dimensionality reduction by adjusting the dimensions of the core tensor, which mitigates the high-dimensionality and overfitting problems. In our tracking algorithm, we also propose a tensor block coordinate descent optimization method for online classifier design, which facilitates tracking even with limited training samples.
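The decomposition of a 3-order tensor into a core tensor and projection matrices can be sketched with a truncated higher-order SVD (HOSVD). This is a generic illustration under stated assumptions rather than the TSTM update itself: the tensor size (10 × 8 × 4), the core size (3 × 3 × 2), the noise level, and the helper names `unfold`, `mode_product`, `truncated_hosvd`, and `reconstruct` are all hypothetical choices made for the example.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: move `mode` to the front and flatten the remaining modes."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, M, mode):
    """Mode-n product T x_n M, where M has shape (new_dim, T.shape[mode])."""
    out = np.tensordot(M, T, axes=([1], [mode]))   # new axis lands at position 0
    return np.moveaxis(out, 0, mode)

def truncated_hosvd(T, ranks):
    """Truncated Tucker decomposition via HOSVD; returns (core, projection matrices)."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])                   # keep the r leading left singular vectors
    core = T
    for mode, U in enumerate(factors):
        core = mode_product(core, U.T, mode)       # project every mode onto its subspace
    return core, factors

def reconstruct(core, factors):
    """Rebuild the full tensor from the core and the projection matrices."""
    out = core
    for mode, U in enumerate(factors):
        out = mode_product(out, U, mode)
    return out

rng = np.random.default_rng(0)
shape, ranks = (10, 8, 4), (3, 3, 2)               # hypothetical sizes
clean = reconstruct(rng.standard_normal(ranks),
                    [rng.standard_normal((n, r)) for n, r in zip(shape, ranks)])
noisy = clean + 0.1 * rng.standard_normal(shape)   # stand-in for a noisy parameter tensor

core, factors = truncated_hosvd(noisy, ranks)
approx = reconstruct(core, factors)

n_dense = noisy.size                                      # 320 entries
n_tucker = core.size + sum(U.size for U in factors)       # 18 + 62 = 80 entries
err_noisy = np.linalg.norm(noisy - clean)
err_approx = np.linalg.norm(approx - clean)
print(n_dense, n_tucker)
print(err_noisy, err_approx)
```

Shrinking the core dimensions reduces the number of stored parameters (here from 320 to 80), which is the dimensionality-reduction effect described above. Comparing `err_approx` with `err_noisy` gives a rough indication of how much noise the low-rank reconstruction discards in this synthetic setting.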

The rest of the paper is organized as follows. Section 2 describes related work and context. Section 3.1 presents the details of the tracking algorithm, and Section 3.2 introduces the tensor algebra and tensor decomposition. The tensor representation and tensor discrimination are presented in Sections 3.3 and 3.4, followed by the efficiency analysis in Section 3.5. In Section 4, we present the experimental results and related analysis. We conclude the paper in Section 5.

Section snippets

Related work and context

A number of algorithms have been proposed in the literature, which can be categorized as generative models and discriminative models, to perform visual tracking. The following section discusses the recent progress in visual tracking using vector-based, matrix-based, and tensor-based models, and limitations of these methods.

Proposed tracking algorithm

In this section, we present the proposed tracking algorithm in detail. Our algorithm is implemented under the structured SVM framework with a tensor-based compact multi-linear classifier. We begin with an outline of the proposed tracking framework and then introduce the theoretical foundations of the proposed multi-linear classifier, along with the closed-form tensor block coordinate descent optimization used in online updating.
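The idea of block coordinate descent with a closed-form solution per block can be illustrated on a much simpler problem than the one treated in the paper. The sketch below swaps in a squared loss with ridge regularization on a bilinear model f(X) = uᵀXv for matrix-valued samples; it is not the structured large-margin objective of TSTM, and the data, the regularization weight `lam`, and the function names are assumptions made for illustration. With one block fixed, the other block reduces to ordinary ridge regression, so each update has a closed form and the objective cannot increase.

```python
import numpy as np

def objective(Xs, y, u, v, lam):
    """Squared loss of the bilinear model u^T X v plus ridge penalties on u and v."""
    pred = np.einsum("i,nij,j->n", u, Xs, v)
    return float(np.sum((y - pred) ** 2) + lam * (u @ u + v @ v))

def ridge_block(A, y, lam):
    """Closed-form minimizer of ||y - A w||^2 + lam * ||w||^2."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)

def block_coordinate_descent(Xs, y, lam=0.1, n_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    _, p, q = Xs.shape
    u = rng.standard_normal(p)
    v = rng.standard_normal(q)
    history = [objective(Xs, y, u, v, lam)]
    for _ in range(n_iters):
        # Fix v: the features are X_n v, and u is a ridge regression solution.
        u = ridge_block(np.einsum("nij,j->ni", Xs, v), y, lam)
        # Fix u: the features are X_n^T u, and v is a ridge regression solution.
        v = ridge_block(np.einsum("i,nij->nj", u, Xs), y, lam)
        history.append(objective(Xs, y, u, v, lam))
    return u, v, history

# Hypothetical synthetic data: 40 matrix samples of size 5 x 4.
rng = np.random.default_rng(1)
Xs = rng.standard_normal((40, 5, 4))
u_true, v_true = rng.standard_normal(5), rng.standard_normal(4)
y = np.einsum("i,nij,j->n", u_true, Xs, v_true) + 0.05 * rng.standard_normal(40)

u, v, history = block_coordinate_descent(Xs, y)
```

Because every block update exactly minimizes the objective over that block, the values stored in `history` are non-increasing, which is the property that makes this kind of alternating scheme attractive for online updates with few samples.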

Experiments

We have evaluated our algorithm on widely used tracking benchmark data [1]. We select the 12 best existing trackers available for comparison on the OTB-50 and OTB-100 benchmark sequences, which are fully labeled and contain more than 26,000 frames. We conducted our experiments with DLSVM [18], MEEM [19], Struck [17], KCF [27], SCM [43], CXT [42], TLD [38], ASLA [2], IVT [6], VTD [4], Frag [49], and CSK [32]. Furthermore, the proposed algorithm is compared with the SVM-based tracker, Correlation filter

Conclusions

In this paper, we propose a novel coherent global tensor discriminative framework for visual tracking. We introduce the invariant structure of tensorial features for a tensor-structured compact classifier, i.e., the truncated structured Tucker machine. The method improves the discriminative capacity by making use of an online tensor updating strategy and a truncated structured tensor classifier. In general, the multi-linear tensor model is designed to decompose the parameter tensor into its low-rank

Acknowledgments

This work was supported by Hong Kong Research Grants Council (Project C1007-15G), National Natural Science Foundation (61501259), China Postdoctoral Science Foundation (2016M591891), and the Natural Science Foundation of Jiangsu Province (BK20140874, BK20150864).

Guoxia Xu received the B.S. degree from the Department of Mathematics, Yancheng Teachers University, Yancheng, Jiangsu, China, in 2015. He is currently pursuing the M.S. degree in the College of Computer and Information at Hohai University, Nanjing, China. He is currently a research assistant in Electronic Engineering at City University of Hong Kong. His research interests include computer vision and visual tracking.

References (49)

  • L. He et al.

    Dusk: A dual structure-preserving kernel for supervised tensor learning with applications to neuroimages

  • J. Davis et al.

    Structured metric learning for high dimensional problems

  • J. Zhang et al.

    Tucker decomposition-based tensor learning for human action recognition

    Multimed. Syst.

    (2016)
  • W. Hu et al.

    Semi-supervised tensor-based graph embedding learning and its application to visual discriminant tracking

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2016)
  • J. Gao et al.

    Discriminant tracking using tensor representation with semi-supervised improvement

  • A.W.M. Smeulders et al.

    Visual tracking: an experimental survey

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2014)
  • T. Zhang et al.

    Structural sparse tracking

  • T. Zhang et al.

    Low-rank sparse learning for robust visual tracking

  • S. Hare et al.

    Struck: structured output tracking with kernels

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2016)
  • J. Ning et al.

    Object tracking via dual linear structured SVM and explicit feature map

  • J. Zhang et al.

    MEEM: robust tracking via multiple experts using entropy minimization

  • S. Avidan

    Support vector tracking

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2004)
  • L. Wen et al.

    Robust online learned spatio-temporal context model for visual tracking

    IEEE Trans. Image Process.

    (2014)
  • I. Kotsia et al.

    Support Tucker machines

    Sheheryar Khan received the B.S. degree with Honours from UET Peshawar, Pakistan, in 2008 and the M.Sc. degree with distinction from Lancaster University, UK, in 2010. Before joining City University of Hong Kong as a Ph.D. candidate in 2015, he was a Lecturer at COMSATS Institute of Information Technology, Pakistan. His research interests include image processing and computer vision.

    Hu Zhu received his B.S. degree in mathematics and applied mathematics from Huaibei Coal Industry Teachers College, Huaibei, China, in 2007, and his M.S. and Ph.D. degrees in computational mathematics and in pattern recognition and intelligent systems from Huazhong University of Science and Technology, Wuhan, China, in 2009 and 2013, respectively. In 2013, he joined the Nanjing University of Posts and Telecommunications, Nanjing. His research interests are pattern recognition, image processing, and computer vision.

    Lixin Han received the Ph.D. degree in computer science from Nanjing University, Nanjing, China. He has been a Post-Doctoral Fellow with the Department of Mathematics, Nanjing University, and a Research Fellow with the Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong. He is currently a Professor with the Institute of Intelligence Science and Technology, Hohai University, Nanjing, China. He has published over 60 research papers. Prof. Han is an Invited Reviewer for several renowned journals and has been a Program Committee Member of many international conferences. He is listed in Marquis’ Who's Who in the World and Marquis’ Who's Who in Science and Engineering.

    Michael K. Ng received his B.Sc. degree in 1990 and M.Phil. degree in 1992 from the University of Hong Kong, and his Ph.D. degree in 1995 from the Chinese University of Hong Kong. He is a Chair Professor of the Department of Mathematics and Chair Professor (Affiliate) of the Department of Computer Science at Hong Kong Baptist University. He was elected a Fellow of the Society for Industrial and Applied Mathematics in 2017. His research interests are scientific computing, data sciences, and image sciences.

    Hong Yan received the B.S. degree from Nanjing University of Posts and Telecommunications in 1982, the M.S. degree from the University of Michigan, Ann Arbor in 1984, and the Ph.D. degree from Yale University in 1989, all in electrical engineering. From 1986 to 1989, he was a Research Scientist with General Network Corporation, New Haven, where he worked on design and optimization of computer and telecommunications networks. He joined the University of Sydney in 1989 and became professor of imaging science in 1997. He is currently chair professor of computer engineering at City University of Hong Kong. His research interests include image processing, pattern recognition and bioinformatics. He has authored or co-authored over 300 journal and conference papers in these areas. He was elected an IAPR fellow for contributions to document image analysis and an IEEE fellow for contributions to image recognition techniques and applications. He received the 2016 Norbert Wiener Award from IEEE SMC Society for contributions to image and biomolecular pattern recognition techniques.