
1 Introduction

The goal of RGB-T tracking is to estimate the states of the target object in videos by fusing RGB and thermal information (corresponding to the visible and thermal infrared spectra, respectively), given the initial ground-truth bounding box. Recently, researchers have paid increasing attention to RGB-T tracking [1,2,3,4,5], partly for the following reasons. (i) The imaging quality of the visible spectrum is limited under bad environmental conditions (e.g., low illumination, rain, haze and smog). (ii) Thermal information provides complementary benefits to the visible spectrum, especially under adverse illumination. (iii) Thermal sensors have many advantages over other sensors, such as long-range imaging, insensitivity to lighting conditions and a strong ability to penetrate haze and smog. Figure 1 shows some examples.

Fig. 1.

Typical complementary benefits of RGB and thermal data [5]. (a) Benefits of thermal sources over RGB ones, where the visible spectrum is disturbed by low illumination, high illumination and fog. (b) Benefits of RGB sources over thermal ones, where the thermal spectrum is disturbed by glass and thermal crossover.

Most RGB-T tracking methods focus on sparse representation because of its capability to suppress noise and errors [2,3,4]. These approaches, however, only adopt pixel intensities as the feature representation, and thus have difficulty handling complex scenarios. Li et al. [5] extend the spatially ordered and weighted patch descriptor [6] to an RGB-T one, but this approach may be affected by inaccurate initialization of their model. Deep learning based trackers [7,8,9] adopt powerful deep features or networks to improve tracking performance, but extending them to multi-modal ones raises the following issues: (i) Regarding thermal as one channel of RGB, or directly concatenating their features, might not make the best use of the complementary benefits of the modalities [4]. For example, if one modality malfunctions, fusing it amounts to adding noise, which can disturb tracking performance [4]. (ii) Designing multi-modal networks usually leads to time-consuming network training and testing, especially with multiple input videos.

In this paper, we propose a novel cross-modal ranking algorithm for robust RGB-T tracking. Given a bounding box of the target object, we first partition it into non-overlapping patches, which are characterized by RGB and thermal features (such as color and gradient histograms). The bounding box can thus be represented by a graph with image patches as nodes. Motivated by [5, 6], we assign each patch a weight to suppress background information, and propose a cross-modal ranking algorithm to compute the patch weights. The patch weights are then incorporated into the RGB-T patch features, and the object location is finally predicted by applying a structured SVM [10]. Figure 2 shows the pipeline of our approach. In particular, our cross-modal ranking algorithm advances existing ones in the following aspects.

First, we propose a general scheme for effective multimodal fusion. The RGB and thermal modalities are heterogeneous with different properties, so enforcing hard consistency [4, 11] between them may hinder effective fusion. Therefore, we propose a soft cross-modality consistency that encourages ranking consistency between modalities while allowing sparse inconsistency.

Second, we propose a novel method to mitigate the effects of ranking noise. In conventional manifold ranking models, query quality is very important for ranking accuracy, so good queries usually need to be designed manually [12,13,14]. In visual tracking, the setting of initial patch weights (i.e., queries) is not always reasonable due to noise in tracking results and irregular object shapes [6]. To handle this problem, we introduce an intermediate variable to represent the optimal labels of initial patches, and optimize it in a semi-supervised way based on the observation that visually similar patches tend to have the same labels or weights. We formulate it as an \(l_1\)-optimization based sparse learning problem to promote sparsity of the inconsistency between the inferred queries and the initial ones (because most of the initial queries should be correct and the remaining ones are noise). We call this process optimal query learning in this paper.

Finally, we present an efficient solver for the objective. Instead of considering each problem individually, we propose a single unified optimization framework that learns the patch weights and the optimal queries at the same time, which boosts their respective performance. In particular, an efficient ADMM (alternating direction method of multipliers) [15] is adopted, and a linearized operation [16] is employed to avoid matrix inversion for efficiency. In this way, our algorithm has a stable convergence behavior, and each iteration has small computational complexity.

In summary, we make the following contributions to RGB-T tracking and related applications. (i) We integrate a soft consistency into the cross-modal ranking process to model the interdependency between the two modalities while allowing sparse inconsistency to account for their heterogeneous properties. The proposed cross-modality consistency is general, and can be applied to other multimodal fusion problems. (ii) To mitigate the noise effects of initial patches, we introduce an intermediate variable to represent the optimal labels of the initial patches, and formulate it as an \(l_1\)-optimization based sparse learning problem. It is also general and applicable to other semi-supervised tasks, such as saliency detection and interactive object segmentation. (iii) We present a unified ADMM-based optimization framework that solves the objective with stable and efficient convergence behavior, which makes our tracker very efficient. (iv) To demonstrate the efficiency and superior performance of the proposed approach over state-of-the-art methods, we conduct extensive experiments on two large-scale benchmark datasets, i.e., GTOT [4] and RGBT210 [5].

2 Related Work

The literature on visual tracking is vast; we only discuss the methods most related to ours.

RGB-T tracking has drawn much attention in the computer vision community with the popularity and affordability of thermal infrared sensors [17]. Works on RGB-T tracking mainly focus on sparse representation because of its capability to suppress noise and errors [2,3,4, 18]. Wu et al. [2] concatenate the intensity features of image patches from RGB and thermal sources into a one-dimensional vector, which is sparsely represented in the target template space; RGB-T tracking is then performed in a Bayesian filtering framework by defining the reconstruction residues as the likelihood. Liu et al. [3] perform joint sparse representation on both RGB and thermal modalities, and fuse the resultant tracking results using a min operation on the sparse representation coefficients. A Laplacian sparse representation is proposed to learn multi-modal features using reconstruction coefficients that encode both spatial local information and occlusion handling [18]. Li et al. [4] propose a collaborative sparse representation based tracker that adaptively fuses RGB and thermal modalities by assigning each modality a reliability weight. These approaches, however, only adopt pixel intensities as the feature representation, and thus have difficulty handling complex scenarios. Kim et al. [6] propose a Spatially Ordered and Weighted Patch (SOWP) descriptor for the target object based on the random walk algorithm, and achieve excellent tracking performance. Li et al. [19] extend SOWP by optimizing a dynamic graph, and another extension has been proposed to integrate multimodal information adaptively for RGB-T tracking [5].

Different from these works, we propose a novel cross-modal ranking algorithm for RGB-T tracking from a new perspective. In particular, our approach has the following advantages. (i) Generality. The proposed model and schemes, including soft cross-modality consistency and optimal query learning, are general and can be easily extended to other vision problems. (ii) Effectiveness. Our approach performs well against state-of-the-art RGB and RGB-T trackers on two large-scale benchmark datasets. (iii) Efficiency. The proposed optimization algorithm has fast and stable convergence behavior, which makes our tracker very efficient.

3 Cross-Modal Ranking Algorithm

Our cross-modal ranking algorithm aims to compute patch weights to suppress background effects in the bounding box description of the target object. This section introduces the details of our cross-modal ranking model and the associated optimization algorithm. The weighted patch feature construction and object tracking will be described in detail in the next section. For clarity, we present the pipeline of our tracking approach in Fig. 2.

Fig. 2.

Pipeline of our approach. (a) Cropped regions, where the red bounding box represents the region of initial patches. (b) Patch initialization indicated by red color. (c) Optimized results from initial patches. (d) Ranking results with the soft cross-modality consistency. (e) RGB-T feature representation. (f) Structured SVM. (g) Tracking results. (Color figure online)

3.1 Model Formulation

The graph-based manifold ranking problem is described as follows: given a graph and a node in this graph as query, the remaining nodes are ranked based on their affinities to the given query. The goal is to learn a ranking function that defines the relevance between unlabelled nodes and queries [12]. We employ the graph-based manifold ranking model to solve our problem.
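For concreteness, the following minimal numpy sketch (our illustration, not the authors' code) implements the classical closed-form solution of manifold ranking from [12], \(\mathbf{f}^* = (\mathbf{I} - \alpha \mathbf{S})^{-1}\mathbf{y}\) with the normalized affinity \(\mathbf{S} = \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}\):

```python
import numpy as np

def manifold_ranking(W, y, alpha=0.99):
    """Classical graph-based manifold ranking [12].

    W : (n, n) symmetric affinity matrix of the graph.
    y : (n,) query indicator vector (1 for queries, 0 otherwise).
    Returns f* = (I - alpha * S)^{-1} y, where S is the
    symmetrically normalized affinity D^{-1/2} W D^{-1/2}.
    """
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    S = (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    return np.linalg.solve(np.eye(W.shape[0]) - alpha * S, y)
```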

Given the target bounding box, we first partition it into a set of non-overlapping patches, which are described with RGB and thermal features (e.g., color, thermal and gradient histograms). To mitigate the effects of background information, we assign each patch a weight that describes its importance to the target, and compute these weights via the cross-modal ranking algorithm. Given a patch feature set \(\mathbf{X}^m = \{\mathbf{x}^m_1, ..., \mathbf{x}^m_n \}\), some patches are labelled as queries and the rest need to be ranked according to their affinities to the queries. Here, \(m\in \{1,2,...,M\}\) indicates the m-th modality, and M denotes the number of modalities. Note that RGB-T data is the special case with \(M=2\); we discuss the general form for broader applicability. Let \(\mathbf{s}^m: \mathbf{X}^m \rightarrow \mathbb {R}^n\) denote a ranking function which assigns a ranking value \(\mathbf{s}^m_i\) to each patch \(\mathbf{x}^m_i\) in the m-th modality; \(\mathbf{s}^m\) can be viewed as a vector \(\mathbf{s}^m = [\mathbf{s}^m_1, ..., \mathbf{s}^m_n]^T\). In this work, we regard the initial patch weights as query labels, and \(\mathbf{s}^m\) is thus a patch weight vector.

Let \(\mathbf{q}^m=[\mathbf{q}^m_{1},...,\mathbf{q}^m_{n}]^T\) denote an indication vector, in which \(\mathbf{q}^m_{i} = 1\) if \(\mathbf{x}^m_i\) is a target object patch, and \(\mathbf{q}^m_{i} = 0\) if \(\mathbf{x}^m_i\) is a background patch. \(\mathbf{q}^m\) is computed from the initial ground truth (for the first frame) or the tracking results (for subsequent frames) as follows. For the i-th patch, if it belongs to the shrunk region of the bounding box then \(\mathbf{q}^m_{i}=1\), and if it belongs to the expanded region of the bounding box then \(\mathbf{q}^m_{i}=0\), as shown in Fig. 3(a). The remaining patches are undetermined, and their values will be diffused from other patches. In general, ranking is performed in a two-stage way to account for background and objects separately [13], but we aim to integrate them in a unified model. To this end, we define an indication vector \(\mathbf{\Gamma }\), where \(\mathbf{\Gamma }_i = 1\) indicates that the i-th patch is a foreground or background patch, and \(\mathbf{\Gamma }_i = 0\) indicates that the i-th patch is undetermined. Given the graph \(G^m\) of the m-th modality, extending the traditional manifold ranking model [12], the optimal ranking of queries is computed by solving the following optimization problem:

$$\begin{aligned} \begin{aligned}&\min _{\{\mathbf{s}^m\}}\frac{1}{2}\sum _{m=1}^M(\sum _{i,j=1}^n\mathbf{W}_{ij}^m||\frac{\mathbf{s}_i^m}{\sqrt{\mathbf{D}^m_{ii}}}-\frac{\mathbf{s}_j^m}{\sqrt{\mathbf{D}^m_{jj}}}||^2+\lambda ||\mathbf{\Gamma }\circ (\mathbf{s}^m - \mathbf{q}^m)||^2_F+\frac{\lambda _2}{2}\Vert \mathbf{s}^m\Vert _F^2) , \end{aligned} \end{aligned}$$
(1)

where \(\lambda \) is a parameter to balance the smoothness term and the fitting term, and \(\lambda _2\) is a regularization parameter. \(\circ \) indicates the element-wise product. \(\mathbf{D}^m\) is the degree matrix of the graph affinity matrix \(\mathbf{W}^m\), which is computed as follows. In the m-th modality, if graph nodes \(v_i\) and \(v_j\) are adjacent in the 8-neighborhood, they are connected by an edge \(e_{ij}\), which is assigned a weight \(\mathbf{W}^m_{ij}=\exp (-\gamma \Vert \mathbf{x}_i^m-\mathbf{x}_j^m\Vert )\), where \(\gamma \) is a scaling parameter, set to 5 in this paper.
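As an illustration, the graph of one modality can be built as in the following numpy sketch; the row-major patch grid layout and the function name are our assumptions, while the 8-neighbor connectivity and the edge weight follow the text:

```python
import numpy as np

def build_patch_graph(X, grid_h=8, grid_w=8, gamma=5.0):
    """Affinity and degree matrices of the 8-connected patch graph.

    X : (n, d) patch features with n = grid_h * grid_w patches laid
        out in row-major order inside the bounding box.
    """
    n = grid_h * grid_w
    W = np.zeros((n, n))
    for r in range(grid_h):
        for c in range(grid_w):
            i = r * grid_w + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (dr, dc) == (0, 0):
                        continue
                    if 0 <= rr < grid_h and 0 <= cc < grid_w:
                        j = rr * grid_w + cc
                        # Gaussian edge weight between 8-neighbors
                        W[i, j] = np.exp(-gamma * np.linalg.norm(X[i] - X[j]))
    D = np.diag(W.sum(axis=1))
    return W, D
```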

Formulation (1) inherently assumes that the available modalities are independent, which may significantly limit performance under occasional perturbation or malfunction of an individual source. In addition, the settings of initial patch weights (i.e., queries) are not always reasonable due to noise in tracking results and irregular object shapes, as shown in Fig. 3(a). In this paper, we integrate the soft cross-modality consistency and the optimal query learning into (1) to handle these problems, respectively.

Soft Cross-Modality Consistency. To take advantage of the complementary benefits of RGB and thermal data, we need to impose modality consistency on the ranking process. Wang et al. [11] propose a multi-graph regularized manifold ranking method to integrate different protein domains using hard constraints, i.e., employing multiple graphs to regularize the same ranking score. This is not suitable for our problem, as RGB and thermal sources are heterogeneous with different properties. Therefore, we introduce a soft cross-modality consistency that encourages ranking consistency between modalities while allowing sparse inconsistency to account for their heterogeneous properties. To this end, we formulate the soft cross-modality consistency as an \(l_1\)-optimization based sparse learning problem as follows:

$$\begin{aligned} \begin{aligned}&\min _{\{\mathbf{s}^m\}}\lambda _1\sum _{m=2}^{M} ||\mathbf{s}^m - \mathbf{s}^{m-1}||_1=\min _{\mathbf{S}}\lambda _1||\mathbf{CS}||_1, \end{aligned} \end{aligned}$$
(2)

where \(\lambda _1\) is a regularization parameter, and \(\mathbf{S}=[\mathbf{s}^1;\mathbf{s}^2;...;\mathbf{s}^M]\). \(\mathbf{C}\) is the cross-modal consistency matrix, a block-bidiagonal matrix of size \((M-1)n \times Mn\) defined as:

$$\begin{aligned} \mathbf{C} = \begin{bmatrix} -\mathbf{I} & \mathbf{I} & & \\ & -\mathbf{I} & \mathbf{I} & \\ & & \ddots & \ddots \\ & & -\mathbf{I} & \mathbf{I} \end{bmatrix}, \end{aligned}$$

where \(\mathbf{I}\) is the \(n \times n\) identity matrix, so that \(\mathbf{CS}\) stacks the differences \(\mathbf{s}^m - \mathbf{s}^{m-1}\) for \(m = 2, ..., M\).
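For reference, \(\mathbf{C}\) can be materialized as in the following sketch; for RGB-T data (\(M=2\)) it reduces to \(\mathbf{C}=[-\mathbf{I}\;\;\mathbf{I}]\):

```python
import numpy as np

def consistency_matrix(M, n):
    """Cross-modal consistency matrix C of Eq. (2): C @ S stacks the
    differences s^m - s^(m-1) for m = 2..M, so that
    ||C S||_1 = sum_m ||s^m - s^(m-1)||_1."""
    C = np.zeros(((M - 1) * n, M * n))
    I = np.eye(n)
    for m in range(M - 1):
        C[m * n:(m + 1) * n, m * n:(m + 1) * n] = -I
        C[m * n:(m + 1) * n, (m + 1) * n:(m + 2) * n] = I
    return C
```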

Optimal Query Learning. To mitigate noise effects of initial patch weights, we introduce an intermediate variable to represent the optimal ones, and optimize it in a semi-supervised way. The details are presented below.

Denoting the intermediate variable by \(\hat{\mathbf{q}}^m = [\hat{\mathbf{q}}^m_{1}, ...,\hat{\mathbf{q}}^m_{n}]^T\), we first introduce two constraints for inferring \(\hat{\mathbf{q}}^m\): a visual similarity constraint and an inconsistency sparsity constraint. The first constraint assumes that visually similar patches should have the same labels and weights, and vice versa. Therefore, we add a smoothness term \(\sum _{i,j=1}^n\mathbf{W}^m_{ij}(\hat{\mathbf{q}}^m_{i} - \hat{\mathbf{q}}^m_{j})^2\) that encodes visual similarity as a graph smoothness constraint. The second constraint, which compels sparsity in \(\hat{\mathbf{q}}^m - \mathbf{q}^m\), is inspired by the common use of the \(l_1\)-norm as a sparsity regularizer for data noise, which has been proven effective even when the noise is not sparse [20, 21]. Therefore, we formulate it as \(||\hat{\mathbf{q}}^m - \mathbf{q}^m||_1\), where the \(l_1\)-norm promotes sparsity of the inconsistency between the inferred labels and the initial ones (because most of the initial labels should be correct and the remaining ones are noise). Figure 3 shows the superiority of the \(l_1\)-norm over the \(l_2\)-norm. Combining these two constraints, the proposed \(l_1\)-optimization problem is formulated as follows:

$$\begin{aligned} \begin{aligned}&\min _{\{\hat{\mathbf{q}}^m\}}\alpha \sum _{i,j=1}^n\mathbf{W}_{ij}^m(\hat{\mathbf{q}}^m_{i} - \hat{\mathbf{q}}^m_{j})^2 + \beta ||\hat{\mathbf{q}}^m - \mathbf{q}^m||_1, \end{aligned} \end{aligned}$$
(3)

where \(\alpha \) and \(\beta \) are the balance parameters. Integrating the soft cross-modality consistency (2) and the optimal query learning (3) into (1), the final cross-modal ranking model is written as:

$$\begin{aligned} \begin{aligned}&\min _{\{\mathbf{s}^m\}, \{\hat{\mathbf{q}}^m\}}\frac{1}{2}\sum _{m=1}^M(\sum _{i,j=1}^n\mathbf{W}_{ij}^m||\frac{\mathbf{s}_i^m}{\sqrt{\mathbf{D}^m_{ii}}}-\frac{\mathbf{s}_j^m}{\sqrt{\mathbf{D}^m_{jj}}}||^2+\lambda ||\mathbf{\Gamma }\circ (\mathbf{s}^m - \hat{\mathbf{q}}^m)||^2_F\\&+\frac{\lambda _2}{2}\Vert \mathbf{s}^m\Vert _F^2+\alpha \sum _{i,j=1}^n\mathbf{W}^m_{ij}(\hat{\mathbf{q}}^m_{i} - \hat{\mathbf{q}}^m_{j})^2+\beta ||\hat{\mathbf{q}}^m - \mathbf{q}^m||_1)+\lambda _1||\mathbf{CS}||_1. \end{aligned} \end{aligned}$$
(4)

Although (4) seems complex, as demonstrated in the experiments, the tracking performance is insensitive to parameter variations.

Fig. 3.

Comparison of the \(l_1\)-norm and the \(l_2\)-norm in learning the optimal queries. (a) Target bounding box (red), shrunk bounding box (white) and expanded bounding box (green). (b) Heatmap optimized with the \(l_1\)-norm. (c) Heatmap optimized with the \(l_2\)-norm. (d) Heatmap without optimal query learning. Herein, the heatmaps represent the ranking results. (Color figure online)

3.2 Optimization Algorithm

Although (4) is not jointly convex in its variables, the subproblem in each variable with the others fixed is convex and has a closed-form solution. The ADMM (alternating direction method of multipliers) algorithm [15] is an efficient and effective solver for problems like (4). To apply ADMM, we introduce two auxiliary variables \(\mathbf{P}=\mathbf{CS}\) and \(\mathbf{f}^m=\hat{\mathbf{q}}^m\) to make (4) separable. With some algebra, we have

$$\begin{aligned} \begin{aligned}&\min _{\{\mathbf{s}^m\}, \{\hat{\mathbf{q}}^m\}, \mathbf{P}, \{\mathbf{f}^m\}}\sum _{m=1}^M((\mathbf{s}^{m})^{T}\mathbf{L}^m\mathbf{s}^m+\lambda ||\mathbf{\Gamma }\circ (\mathbf{s}^m - \hat{\mathbf{q}}^m)||^2_F+\frac{\lambda _2}{2}\Vert \mathbf{s}^m\Vert _F^2\\&+\,2\alpha (\mathbf{f}^{m})^{T}(\mathbf{D}^m-\mathbf{W}^m)\mathbf{f}^m +\beta \Vert \hat{\mathbf{q}}^m-\mathbf{q}^m\Vert _1)+\lambda _1\Vert \mathbf{P}\Vert _1,\\&\mathrm {s.t.}\quad \mathbf{P}=\mathbf{CS}, \quad \mathbf{f}^m=\hat{\mathbf{q}}^m, \end{aligned} \end{aligned}$$
(5)

where \(\mathbf{L}^m=\mathbf{I}-(\mathbf{D}^m)^{-\frac{1}{2}}\mathbf{W}^m(\mathbf{D}^m)^{-\frac{1}{2}}\) is the normalized Laplacian matrix of the m-th modality. The augmented Lagrange function of (5) is:

$$\begin{aligned} \begin{aligned}&\mathbb {L}(\{\mathbf{s}^m\}, \{\hat{\mathbf{q}}^m\}, \mathbf{P}, \{\mathbf{f}^m\},\mathbf{Y}_1,\mathbf{Y}_2)\\&=\sum _{m=1}^M((\mathbf{s}^m)^T\mathbf{L}^m\mathbf{s}^m+\lambda ||\mathbf{\Gamma }\circ (\mathbf{s}^m - \hat{\mathbf{q}}^m)||^2_F+\frac{\lambda _2}{2}\Vert \mathbf{s}^m\Vert _F^2\\&+\,2\alpha (\mathbf{f}^{m})^{T}(\mathbf{D}^m-\mathbf{W}^m)\mathbf{f}^m +\beta \Vert \hat{\mathbf{q}}^m-\mathbf{q}^m\Vert _1)+\lambda _1\Vert \mathbf{P}\Vert _1\\&+\frac{\mu }{2}(\Vert \mathbf{P}-\mathbf{CS}+\frac{\mathbf{Y}_1}{\mu }\Vert _F^2+\sum _{m=1}^M\Vert \hat{\mathbf{q}}^m-\mathbf{f}^m+\frac{\mathbf{y}^m_2}{\mu }\Vert _F^2)\\&-\frac{1}{2\mu }(\Vert \mathbf{Y}_1\Vert _F^2+\Vert \mathbf{Y}_2\Vert _F^2), \end{aligned} \end{aligned}$$
(6)

where \(\mathbf{Y}_1\) and \(\mathbf{Y}_2=[\mathbf{y}_2^1,\mathbf{y}_2^2,...,\mathbf{y}_2^M]\) are the Lagrangian multipliers, and \(\mu \) is the Lagrangian parameter. Due to space limitations, we present the detailed derivations in the supplementary file. ADMM alternately updates one variable by minimizing (6) with the other variables fixed. Besides the Lagrangian multipliers, there are four variables to solve: \(\mathbf{S}\), \(\hat{\mathbf{q}}^m\), \(\mathbf{P}\) and \(\mathbf{f}^m\). Note that the \(\mathbf{S}\)-subproblem involves inverting a matrix of size \({Mn \times Mn}\), which is time consuming. To handle this problem, we adopt a linearized operation [16] that avoids the matrix inversion. Due to space limitations, we only present the solutions of these subproblems as follows:

$$\begin{aligned} \begin{aligned}&\mathbf{f}^m=(4\alpha (\mathbf{D}^m-\mathbf{W}^m)+\mu \mathbf{I})^{-1}(\mu \hat{\mathbf{q}}^m+\mathbf{y}_2^m)\\&\hat{\mathbf{q}}^m=soft\_thr_1(\mathbf{s}^m, \mathbf{f}^m-\frac{\mathbf{y}_2^m}{\mu }, \mathbf{q}^m, \lambda \circ \mathbf{\Gamma } \circ \mathbf{\Gamma }, \frac{\mu }{2},\beta )\\&\mathbf{P}=soft\_thr(\mathbf{C}\mathbf{S}-\frac{\mathbf{Y}_1}{\mu },\frac{\lambda _1}{\mu })\\&\mathbf{S}_{k+1}=\mathbf{S}_k-\frac{1}{\eta \mu }\nabla _{\mathbf{S}_k}J_k, \end{aligned} \end{aligned}$$
(7)

where \(soft\_thr\) is a soft-thresholding operator and \(soft\_thr_1\) is a soft-thresholding operator with different inputs; see the supplementary file for the detailed definitions. k indicates the k-th iteration, and \(J_k\) abbreviates \(J(\mathbf{S}_k, \hat{\mathbf{Q}}_k, \mathbf{P}_k,\mathbf{Y}_{1,k}, \mu _k)=\mathbf{S}_k^T\mathbf{L}\mathbf{S}_k+\lambda \Vert \mathbf{\Gamma }\circ (\mathbf{S}_k-\hat{\mathbf{Q}}_k)\Vert _F^2+\frac{\mu _k}{2}\Vert \mathbf{P}_k-\mathbf{C}\mathbf{S}_k+\frac{\mathbf{Y}_{1,k}}{\mu _k}\Vert _F^2+\frac{\lambda _2}{2}\Vert \mathbf{S}_k\Vert _F^2\), where \(\hat{\mathbf{Q}}=[\hat{\mathbf{q}}^1;\hat{\mathbf{q}}^2;...;\hat{\mathbf{q}}^M]\). \(\nabla _\mathbf{S}J\) is the partial derivative of J with respect to \(\mathbf{S}\), and \(\eta =\frac{1}{M}\sum _{m=1}^M\Vert \mathbf{X}^m\Vert _F^2\). Please refer to the supplementary file for the detailed derivations.
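To make the update schedule concrete, the following numpy sketch performs one ADMM iteration following (7). The exact \(soft\_thr_1\) operator is defined in the supplementary file, so the \(\hat{\mathbf{q}}^m\)-update below is our elementwise derivation of that proximal step, and the function names are illustrative:

```python
import numpy as np

def soft(x, tau):
    """Element-wise soft-thresholding operator soft_thr."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def admm_step(S, Qh, P, F, Y1, Y2, W, D, L, C, Gamma, q,
              lam, lam1, lam2, alpha, beta, mu, eta):
    """One iteration for Eq. (5). S, Qh, F, q, Y2 are lists of (n,)
    vectors over the M modalities; P and Y1 are ((M-1)*n,) vectors."""
    M, n = len(S), S[0].shape[0]
    G2 = Gamma ** 2
    for m in range(M):
        # f-update: (4*alpha*(D - W) + mu*I)^{-1} (mu*qhat + y2)
        A = 4.0 * alpha * (D[m] - W[m]) + mu * np.eye(n)
        F[m] = np.linalg.solve(A, mu * Qh[m] + Y2[m])
        # qhat-update (our reading of soft_thr_1): elementwise prox of
        # lam*G2*(x - s)^2 + (mu/2)*(x - g)^2 + beta*|x - q|
        g = F[m] - Y2[m] / mu
        w = 2.0 * lam * G2 + mu
        x0 = (2.0 * lam * G2 * S[m] + mu * g) / w
        Qh[m] = q[m] + soft(x0 - q[m], beta / w)
    Svec = np.concatenate(S)
    # P-update: shrink the cross-modal residual
    P = soft(C @ Svec - Y1 / mu, lam1 / mu)
    # linearized S-update: one gradient step on J, avoiding the
    # (Mn x Mn) matrix inversion
    grad = np.concatenate([2.0 * L[m] @ S[m]
                           + 2.0 * lam * G2 * (S[m] - Qh[m])
                           + lam2 * S[m] for m in range(M)])
    grad -= mu * (C.T @ (P - C @ Svec + Y1 / mu))
    Svec = Svec - grad / (eta * mu)
    S = [Svec[m * n:(m + 1) * n].copy() for m in range(M)]
    # multiplier updates
    Y1 = Y1 + mu * (P - C @ Svec)
    Y2 = [Y2[m] + mu * (Qh[m] - F[m]) for m in range(M)]
    return S, Qh, P, F, Y1, Y2
```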

4 RGB-T Object Tracking

This section first imposes the optimized patch weights on the extracted multi-spectral features for a more robust feature representation, and then presents the tracker's details.

4.1 Feature Representation

We perform cross-modal ranking to obtain the patch weights, i.e., \(\mathbf{s}^1, \mathbf{s}^{2}, ..., \mathbf{s}^M\). Let \(\mathbf{x}_i=[\mathbf{x}^1_i;...;\mathbf{x}^M_i]\in \mathbb {R}^{dM\times 1}\) be the RGB-T feature vector of the i-th patch. We then construct the final collaborative feature representation by incorporating the patch weights. Specifically, for the i-th patch, we compute its final weight \(\hat{\mathbf{s}}_i\) by combining the weights of all modalities as follows:

$$\begin{aligned} \begin{aligned}&\hat{\mathbf{s}}_i=\frac{1}{1+\exp (-\sigma \frac{\sum _{m=1}^M\mathbf{s}_i^m}{M})}, \end{aligned} \end{aligned}$$
(8)

where \(\sigma \) is a scaling parameter fixed to 35 in this work. The collaborative feature representation is thus obtained by \(\hat{\mathbf{x}}=[\hat{\mathbf{s}}_1\mathbf{x}_1;...;\hat{\mathbf{s}}_n\mathbf{x}_n]\in \mathbb {R}^{dMn\times 1}\).
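A minimal sketch of this construction (function and variable names are ours) is:

```python
import numpy as np

def weighted_features(X, S, sigma=35.0):
    """Collaborative RGB-T representation of Sec. 4.1.

    X : (n, d*M) stacked per-patch RGB-T features.
    S : (M, n) per-modality patch weights from the ranking.
    The averaged weight is squashed by a sigmoid (Eq. (8)) and
    multiplied into each patch feature; the result is flattened
    into the (d*M*n,) descriptor fed to the structured SVM.
    """
    s_hat = 1.0 / (1.0 + np.exp(-sigma * S.mean(axis=0)))  # (n,)
    return (s_hat[:, None] * X).reshape(-1)
```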

4.2 Tracking

We adopt the structured SVM (S-SVM) [10] to perform object tracking in this paper; other tracking algorithms, such as correlation filters [22], could also be utilized.

Instead of using binary-labeled samples, S-SVM employs structured samples, each consisting of a target bounding box and nearby boxes in the same frame, to prevent labelling ambiguity when training the classifier. Specifically, it constrains the confidence score of a target bounding box \(y_t\) to be larger than that of a nearby box y by a margin determined by the intersection-over-union overlap ratio (denoted \(IoU(y_t,y)\)) between the two boxes:

$$\begin{aligned} \mathbf{h}^*=\arg \min _\mathbf{h}~\xi ||\mathbf{h}||^2+\sum _\mathbf{y}\max \{0,\triangle (y_t,y)-\mathbf{h}^T\epsilon (y_t,y)\}, \end{aligned}$$
(9)

where \(\triangle (y_t,y)=1-IoU(y_t,y)\), \(\epsilon (y_t,y)=\varPsi (y_t)-\varPsi (y)\), and \(\xi =0.0001\) is a regularization parameter. \(\varPsi (y_t)\) denotes the object descriptor of a bounding box \(y_t\) at the t-th frame, and \(\mathbf{h}\) is the normal vector of the decision plane. In this paper, we employ the stochastic variance reduced gradient (SVRG) technique [23] to optimize (9). In this way, S-SVM reduces the adverse effects of false labelling.
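For clarity, the following sketch (our illustration, with hypothetical names) evaluates the objective of (9) for one frame:

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def structured_hinge(h, psi_t, box_t, psis, boxes, xi=1e-4):
    """Objective of Eq. (9): the target box must beat every nearby
    box y by the margin 1 - IoU(y_t, y)."""
    obj = xi * (h @ h)
    for psi_y, box_y in zip(psis, boxes):
        margin = 1.0 - iou(box_t, box_y)
        obj += max(0.0, margin - h @ (psi_t - psi_y))
    return obj
```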

Given the bounding box of the target object in the previous frame \((t-1)\), we first set a search window in the current frame t, and sample a set of candidates within the search window. S-SVM selects the optimal target bounding box \(y^*_t\) in the t-th frame by maximizing the classification score:

$$\begin{aligned} y^*_t=\arg \max _{y_t}~(\omega {\mathbf{h}_{t-1}^T\varPsi (y_t)}+(1-\omega )\mathbf{h}_{0}^T\varPsi (y_t)), \end{aligned}$$
(10)

where \(\omega \) is a balancing parameter, and \(\mathbf{h}_{t-1}\) is the normal vector of the decision plane at frame \((t-1)\). \(\mathbf{h}_{0}\) is learnt in the initial frame, which prevents the model from learning drastic appearance changes. To mitigate the effects of unreliable tracking results, we update the classifier only when the confidence score of the tracking result is larger than a threshold \(\theta \), where the confidence score at frame t is defined as the average similarity between the weighted descriptor of the tracked bounding box and the positive support vectors: \({\frac{1}{|\mathbb V_{t}|}}\sum _{\mathbf{v}\in {\mathbb V}_{t}}\mathbf{v}^T\varPsi (y_t^*)\), where \(\mathbb {V}_{t}\) is the set of positive support vectors at time t. In addition, we update object scales every three frames using the method from [24].
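The per-frame inference and the gated update can be sketched as follows (illustrative names; `psi` maps a box to its weighted RGB-T descriptor, assumed to be a numpy vector):

```python
def select_and_gate(h_prev, h0, candidates, psi, V_pos,
                    omega=0.598, theta=0.3):
    """Candidate selection via Eq. (10) plus the confidence-gated
    classifier update.

    candidates : boxes sampled in the search window of frame t.
    V_pos      : current positive support vectors of the S-SVM.
    """
    def score(y):
        return omega * (h_prev @ psi(y)) + (1.0 - omega) * (h0 @ psi(y))
    y_star = max(candidates, key=score)
    # average similarity to positive support vectors gates the update
    conf = sum(v @ psi(y_star) for v in V_pos) / len(V_pos)
    return y_star, conf > theta
```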

5 Performance Evaluation

5.1 Evaluation Settings

Data. There are only two large RGB-T tracking datasets, i.e., GTOT [4] and RGBT210 [5]. They are large and challenging enough for comprehensive validation, and we evaluate our approach on both. GTOT includes 50 RGB-T video clips with ground-truth object locations under different scenarios and conditions. RGBT210 is a larger dataset for RGB-T tracking evaluation. It is highly aligned, and contains 210 video clips with both RGB and thermal data. This dataset covers many challenges, such as camera movement, different occlusion levels, large scale variations and environmental challenges. The precision rate (PR) and success rate (SR) are employed to measure the quantitative performance of the trackers.
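For reference, the two metrics can be computed as in the following sketch; the exact thresholds follow each benchmark's protocol, so the defaults below are illustrative:

```python
import numpy as np

def precision_rate(pred_centers, gt_centers, thr=20.0):
    """PR: fraction of frames whose center location error is within
    thr pixels."""
    err = np.linalg.norm(pred_centers - gt_centers, axis=1)
    return float(np.mean(err <= thr))

def success_rate(ious, num_thresholds=21):
    """Representative SR: area under the success curve, i.e. the mean
    fraction of frames whose overlap (IoU) exceeds t for t in [0, 1]."""
    ts = np.linspace(0.0, 1.0, num_thresholds)
    return float(np.mean([(ious > t).mean() for t in ts]))
```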

Fig. 4.

Success Rate (SR) on the public GTOT benchmark dataset.

Fig. 5.

The evaluation results on the public RGBT210 benchmark dataset. The representative score of PR/SR is presented in the legend.

Table 1. Success Rate (SR) of the proposed method with different parameters on the GTOT dataset.
Table 2. Attribute-based Precision Rate and Success Rate (PR/SR %) on the RGBT210 dataset with 9 trackers, including CSR [4], DSST [32], MEEM [33], CNN [22], SOWP [6], KCF [22], SGT [5], CFnet [27] and ECO [28]. The best and second-best results are highlighted in color.

Parameters. We fix all parameters and other settings in our experiments. We partition each bounding box into 64 non-overlapping patches to balance the accuracy-efficiency trade-off [6], and extract RGB-T features for each patch, including color, thermal and gradient histograms, where the dimensions of the gradient and each color channel are set to 8. To improve efficiency, each frame is scaled so that the minimum side length of the bounding box is 32 pixels, and the side length of the search window is fixed to \(2\sqrt{WH}\), where W and H are the width and height of the scaled bounding box, respectively. We shrink and expand the tracked bounding box (lx, ly, W, H) to \((lx+0.1W, ly+0.1H, 0.8W, 0.8H)\) and \((lx-W', ly-H', W+2W', H+2H')\), respectively, where (lx, ly) denotes the top-left coordinate of the tracked bounding box, and \(W'\) and \(H'\) indicate the patch width and height, respectively.
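The shrunk and expanded regions that initialize the queries (cf. Sec. 3.1) can be computed as follows; the labelling in the comment is our reading of Fig. 3(a):

```python
def query_regions(lx, ly, W, H, Wp, Hp):
    """Shrunk and expanded boxes (x, y, w, h) used to initialize the
    queries: patches inside the shrunk box get q_i = 1, patches in
    the expanded ring outside the tracked box get q_i = 0, and the
    rest stay undetermined (Gamma_i = 0). (Wp, Hp) is the patch size."""
    shrunk = (lx + 0.1 * W, ly + 0.1 * H, 0.8 * W, 0.8 * H)
    expanded = (lx - Wp, ly - Hp, W + 2.0 * Wp, H + 2.0 * Hp)
    return shrunk, expanded
```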

The proposed model involves several parameters in (6), including \(\alpha \), \(\beta \), \(\lambda \), \(\lambda _1\) and \(\lambda _2\); the tracking sensitivity to different parameter settings is shown in Table 1. The results show the robustness of the proposed model to parameter variations, and we set \(\alpha \), \(\beta \), \(\lambda \), \(\lambda _1\) and \(\lambda _2\) to 0.65, 0.002, 0.56, 0.3 and 0.4, respectively. In S-SVM, we empirically set \(\{\omega , \theta \} = \{0.598, 0.3\}\), and employ a linear kernel.

Baselines. For comprehensive evaluation, we compare our method with 23 popular trackers, some of which are from the GTOT and RGBT210 benchmarks. Since there are few RGB-T trackers [2,3,4,5, 18], we extend some RGB tracking methods to RGB-T ones by concatenating RGB and thermal features into a single vector or regarding thermal as an extra channel, such as KCF [22], Struck [25], SCM [26] and CFnet [27]. In addition, we also select recently proposed state-of-the-art trackers for comparison, such as C-COT [9], ECO [28], ACFnet [29], SiameseFC [30] and Staple-CA [31]; see Figs. 4 and 5 for details.

5.2 Comparison Results

GTOT Evaluation. We present the evaluation results on the GTOT dataset in Fig. 4. Overall, the proposed algorithm performs favorably against the state-of-the-art methods. In particular, our approach outperforms the state-of-the-art methods using deep features by a clear margin, e.g., 5.0%/1.2% over ECO [28] and 11.5%/7.6% over C-COT [9] in PR/SR score. This is attributable to the effective fusion of visible and thermal information in our method. Note that the methods based on deep features, including ECO and C-COT, perform weakly on GTOT. This may be partly due to the weakness of deep features in representing target objects with low resolution (many targets in GTOT are small). Our approach can handle this challenging factor. Figure 4 shows that our tracker performs well against the state-of-the-art RGB-T methods, which suggests that the proposed fusion approach is effective. SGT [5] is better than our tracker in PR, mainly due to its adaptive fusion of different modalities via modality weights, but performs worse than ours in SR.

RGBT210 Evaluation. We further evaluate our method on the RGBT210 dataset in Fig. 5 and Table 2. The comparison curves show that our tracker also performs well against the state-of-the-art methods on RGBT210. In particular, our approach outperforms the state-of-the-art RGB-T tracking methods, e.g., by 1.9%/3.3% over SGT [5] and 20.3%/13.3% over CSR [4] in PR/SR score. This justifies the effectiveness of the proposed method in fusing multimodal information for visual tracking. Among the state-of-the-art methods using deep features, the proposed tracker outperforms the SiameseFC [30] and CFnet [27] methods in all aspects. The proposed tracker performs on par with the C-COT [9] and ECO [28] schemes in terms of PR and slightly worse in terms of SR. Furthermore, the proposed algorithm has the following advantages over the C-COT and ECO methods.

  • It does not require laborious pre-training or a large training set, and also does not need to save a large pre-trained deep model. We initialize the proposed model using the ground truth bounding box in the first frame, and update it in subsequent frames.

  • It is easy to implement as each subproblem of the proposed model has a closed-form solution.

  • It performs favorably against the state-of-the-art deep tracking methods in terms of efficiency on a cheaper hardware setup (Ours: 8 FPS on 4.0 GHz CPU, ECO: 8 FPS on 3.4 GHz CPU and NVIDIA Tesla K40m GPU, C-COT: 1 FPS).

  • It performs more robustly than the ECO and C-COT methods in some situations. In particular, it outperforms the ECO method on sequences with partial occlusion, low illumination, object deformation and background clutters in terms of PR and SR, which suggests the effectiveness of our approach in fusing the multimodal information and suppressing the background effects during tracking.

In addition, the example visual results on RGBT210 and GTOT are presented in the supplementary file, which further qualitatively verify the effectiveness of our method.

Table 3. PR/SR (%) of the proposed method with the different versions on the GTOT dataset.

5.3 Ablation Study

To justify the significance of the main components, we implement 3 variants of our approach for empirical analysis on GTOT: (1) Ours-noC, which computes the patch weights without the cross-modal consistency constraint; (2) Ours-no\(\hat{q}\), which removes the optimal query learning from the ranking model; (3) Ours-noS, which removes the patch weights from the feature representation.

From the evaluation results reported in Table 3, we can draw the following conclusions. (1) The patch weights in the collaborative object representation play a critical role in RGB-T tracking, as Ours outperforms Ours-noS. (2) The improvement of Ours over Ours-no\(\hat{q}\) demonstrates the effectiveness of the introduced optimal query learning. (3) The soft consistency is important for cross-modal ranking, as Ours-noC scores much lower than Ours.

5.4 Runtime Performance

The experiments are carried out on a PC with an Intel i7 4.0 GHz CPU and 32 GB RAM, and implemented in C++. The proposed tracker runs at about 8 frames per second. In particular, our ranking algorithm converges within 30 iterations, and costs about 20 ms per frame (tested on all datasets). Note that our code does not include any optimization or parallel operations, and the feature extraction and the structured SVM take most of the time per frame (above 80%).

6 Conclusion

In this paper, we propose a graph-based cross-modal ranking algorithm to learn robust RGB-T object features for visual tracking. In the ranking process, we introduce a soft cross-modality consistency between modalities and an optimal query learning scheme to improve robustness. The fast solver of the proposed model makes the tracker efficient. Extensive experiments on two large-scale benchmark datasets demonstrate the effectiveness and efficiency of the proposed approach against state-of-the-art trackers.

However, our approach has two major limitations. First, the tracking performance is affected by the imaging limitations of individual sources, as shown in Table 2 (TC). Second, the runtime does not meet the demands of real-time applications. In future work, we will introduce modality weights [4, 5] into our model to address the first limitation, and implement our approach with parallel computation to improve efficiency, such as multi-thread multimodal feature extraction and a GPU-based structured SVM [34].