1 Introduction

Target tracking is a fundamental task in visual surveillance, human-machine interfaces, vehicle navigation, video scene analysis, etc. Numerous methods have been reported [1], and they can be classified into two categories: single-modality methods [2,3,4,5] and multi-modality methods [6,7,8,9,10]. By leveraging the combined benefit of different modalities while compensating for failures of any individual modality, a multi-modality system offers a considerable advantage over a single-modality system.

The fusion of different spectral images is significant in multi-modality tracking, because each spectral image provides disparate yet complementary information about a scene. It offers improved operational robustness, because distinct physical sensing principles compensate for particular perception shortcomings. Besides the fusion of multi-spectral images, the combination of multiple features from the same spectral image has also been widely researched. By exploiting the complementarity of different features, multi-feature fusion can represent the target better than a single feature in complex dynamic circumstances. However, these two types of multi-modality tracking methods each act in their own way, and few methods can integrate them in a natural manner. In this paper, we propose a unified framework to cope with this problem by employing spatial histogram representation and mean shift search. Being fast and robust, the Mean Shift Tracker (MST) [5] has been widely used, and many extensions and variants have been proposed [11,12,13,14]. Instead of completely ignoring the spatial structure of the object features as in the original MST, Birchfield et al. [11] introduced a spatial histogram (abbreviated to spatiogram), formed by weighting each bin of the histogram with the mean and covariance of the locations of the pixels that contribute to that bin. The spatiogram-based method not only improves the accuracy of the target appearance model but also preserves the speed of mean shift tracking, so we choose it to build our framework. The main contributions of our work are as follows:

  • Establish a unified framework of multi-modal target tracking based on joint spatiogram representation, and the framework is flexible enough to handle any number of modalities;

  • Design a fast multi-input multi-output fuzzy system to adaptively adjust the weight of each modality for evaluating the target state per frame;

  • Develop a particle filter based model update scheme to keep track of the most representative reference spatiogram throughout the tracking procedure.

The rest of the paper is organized as follows. In Sect. 2, we first introduce the related work. Next, spatiograms and their similarity measure are sketched in Sect. 3. After that, we detail the proposed tracking algorithm and the fuzzy-logic-based weight adjustment method in Sect. 4. Our particle-filter-based model update scheme is described in Sect. 5. Experimental results are reported in Sect. 6. We conclude this paper in Sect. 7.

2 Related Work

The cooperation of visible and infrared imagery is the most active topic in multi-spectral combination. An infrared sensor detects relative differences in the amount of thermal energy emitted from objects and is independent of illumination, making it more effective than a visible camera under poor lighting conditions, but infrared imagery cannot provide color and texture features. A visible sensor is oblivious to temperature differences, but is more effective than an infrared sensor when objects are at thermal crossover, provided that the scene is well illuminated and the objects have color signatures different from the background. Therefore, by combining their data, a tracker can perform better than one that uses only a single sensor. There are many methods for tracking color-infrared targets. Reference [6] proposed a framework that fuses the outputs of multiple spatiogram trackers. Reference [7] presented a multi-cue mean-shift tracking approach based on fuzzified region dynamic image fusion. Reference [8] proposed a tracking-before-fusion approach for visible-infrared targets, in which the visible and infrared targets are tracked individually and their results are then fused; however, the final result may be very poor when either of the individual tracking results is poor. Reference [9] proposed a compressive spatial-temporal Kalman fusion tracking algorithm that allows independent features to be integrated, with a fusion model formulated with both spatial and temporal fusion coefficients; however, its Bayesian classification requires a large number of samples to attain sufficient accuracy, which leads to a large computational burden. Reference [10] proposed an infrared-visible target tracking method using joint sparse representation, in which only one target template is required for the infrared or visible image. The method needs to solve only one \(\ell _1\) norm minimization problem, so its time cost is dramatically decreased; however, the single target template leads to a loss of diversity of target templates, thus reducing the robustness of the tracker.

In recent years, many multi-feature fusion tracking algorithms have been reported. Reference [15] proposed a fusion tracking method that employs a local steering kernel descriptor and a color histogram to represent the target object. Reference [16] proposed a new joint sparse representation model for robust feature-level fusion tracking, which dynamically prevents unreliable features from being fused by taking advantage of the sparse representation. Reference [17] developed a particle filter tracking algorithm based on the fusion of a color histogram and an edge orientation histogram. Reference [18] located the target through independent multi-feature fusion and a region-based temporal difference model under a particle filter framework. Reference [19] proposed a multi-feature fusion tracking framework that employs hashing to fuse different features into a compact binary feature. Reference [20] presented a novel online object tracking algorithm using multi-feature channels with adaptive weights.

3 The Spatiograms and Their Similarity

A spatiogram [11] is a generalization of a histogram that includes potentially higher-order moments. A conventional histogram is a zeroth-order spatiogram, while a second-order spatiogram contains a spatial mean and covariance for each histogram bin.

Denote by \(\varvec{\hat{h}} = \{\hat{p}_u, \varvec{\hat{\mu }}_u, \widehat{\mathbf {\Sigma }}_u\}_{u=1}^{m}\) the m-bin reference second-order spatiogram of the target. In each frame of the image sequence, we seek the center point \(\varvec{z}\) of the image region whose spatiogram \(\varvec{h}(\varvec{z}) = \{p_u(\varvec{z}), \varvec{\mu }_u(\varvec{z}), \mathbf {\Sigma }_u(\varvec{z})\}_{u=1}^{m}\) is most similar to \(\varvec{\hat{h}}\). Let \(\{\varvec{x}_{i}\}_{i=1}^{n}\) denote the (normalized) coordinates of the pixels in a candidate region centered at \(\varvec{z}\); then the feature probability of the uth bin is

$$\begin{aligned} {p}_{u}(\varvec{z})= \sum \limits _{i=1}^{n}{k(\Vert \varvec{x}_{i}-\varvec{z}\Vert ^2)\delta {[b(\varvec{x}_i)-u]}}, \end{aligned}$$
(1)

where \(b(\varvec{x})\) is the bin number \((1,\cdots ,m)\) associated with the feature at location \(\varvec{x}\), \(\delta \) is the Kronecker delta function, and k(x) is a kernel profile that assigns smaller weights to pixels farther from the region center. Note that the kernel is normalized such that its profile satisfies \(\sum \nolimits _{i=1}^{n}k(\left\| \varvec{x}_{i}-\varvec{z}\right\| ^2)=1\). The coordinate mean \(\varvec{\mu }_u(\varvec{z})\) and covariance \(\mathbf {\Sigma }_u(\varvec{z})\) of the pixels belonging to bin u are

$$\begin{aligned} {\varvec{\mu }}_{u}(\varvec{z})=\frac{1}{\sum _{k=1}^{n}{\delta {[b(\varvec{x}_k)-u]}}} \sum _{i=1}^{n}{(\varvec{x}_{i}-\varvec{z})\delta {[b(\varvec{x}_i)-u]}}, \end{aligned}$$
(2)
$$\begin{aligned} {\mathbf {\Sigma }}_{u}(\varvec{z}) =\frac{1}{\sum _{k=1}^{n}{\delta {[b(\varvec{x}_k)-u]}-1}} \sum \limits _{i=1}^{n} {(\varvec{x}_{i} - {\varvec{\mu }}_{u}(\varvec{z}))(\varvec{x}_{i} - {\varvec{\mu }}_{u}(\varvec{z}))^\mathrm {T}\delta {[b(\varvec{x}_i)-u]}}, \end{aligned}$$
(3)

Similarly, the calculation of the reference spatiogram \(\varvec{\hat{h}}\) can be regarded as the special case of \(\varvec{h}(\varvec{z})\) with \(\varvec{z}=0\).
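As an illustration, the following NumPy sketch computes a second-order spatiogram of a gray-level patch according to (1)–(3). The Epanechnikov profile, the 16-bin quantization of 8-bit intensities, the normalized coordinate convention, and the small floor added to keep each covariance invertible are our own illustrative choices, not details of the paper's MATLAB+MEX implementation.

```python
import numpy as np

def epanechnikov_profile(r2):
    """k(r^2) before normalization: positive inside the unit circle, zero outside."""
    return np.maximum(1.0 - r2, 0.0)

def spatiogram(patch, m=16):
    """Second-order spatiogram {p_u, mu_u, Sigma_u} of a gray patch (Eqs. 1-3).

    Coordinates are normalized to [-1, 1] so that z = 0 is the patch center,
    i.e., this is the reference-spatiogram case.
    """
    H, W = patch.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # normalized pixel coordinates x_i
    x = np.stack([(xs - (W - 1) / 2) / (W / 2),
                  (ys - (H - 1) / 2) / (H / 2)], axis=-1).reshape(-1, 2)
    r2 = np.sum(x ** 2, axis=1)
    k = epanechnikov_profile(r2)
    k /= k.sum()                                  # sum_i k(||x_i||^2) = 1
    # bin index b(x_i): 8-bit intensities quantized into m bins (illustrative)
    bins = np.minimum((patch.reshape(-1).astype(np.float64) / 256.0 * m).astype(int),
                      m - 1)

    p = np.zeros(m)
    mu = np.zeros((m, 2))
    Sigma = np.tile(np.eye(2) * 1e-6, (m, 1, 1))  # small floor keeps Sigma invertible
    for u in range(m):
        mask = bins == u
        p[u] = k[mask].sum()                      # Eq. (1)
        if mask.sum() > 0:
            mu[u] = x[mask].mean(axis=0)          # Eq. (2)
        if mask.sum() > 1:
            d = x[mask] - mu[u]
            Sigma[u] = d.T @ d / (mask.sum() - 1) + 1e-6 * np.eye(2)  # Eq. (3)
    return p, mu, Sigma
```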

The similarity between the spatiograms is measured by

$$\begin{aligned} \rho (\varvec{z})\triangleq \rho [\varvec{h}(\varvec{z}), \varvec{\hat{h}}]=\sum \limits _{u=1}^{m}{\psi _u(\varvec{z})\sqrt{{{p}_{u}}(\varvec{z}){\hat{p}_{u}}}}, \end{aligned}$$
(4)

where \(\sqrt{p_u(\varvec{z})\hat{p}_u}\) measures the similarity between the features of the candidate region and the target image, while

$$\begin{aligned} \psi _u(\varvec{z}) = \frac{4|\mathbf {\Sigma }_u(\varvec{z})\widehat{\mathbf {\Sigma }}_u|^{\frac{1}{4}}}{|\widetilde{\mathbf {\Sigma }}_u|^{\frac{1}{2}}}\exp \left\{ -\frac{1}{2}(\varvec{\mu }_u(\varvec{z})-\varvec{\hat{\mu }}_u)^\mathrm {T}(\widetilde{\mathbf {\Sigma }}_u(\varvec{z}))^{-1}(\varvec{\mu }_u(\varvec{z})-\varvec{\hat{\mu }}_u)\right\} \end{aligned}$$
(5)

is used to measure the similarity in spatial arrangement between these features, and \(\widetilde{\mathbf {\Sigma }}_u(\varvec{z}) = 2(\mathbf {\Sigma }_u(\varvec{z})+\widehat{\mathbf {\Sigma }}_u)\).
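A minimal sketch of the similarity (4)–(5), written against the spatiogram tuples produced by the sketch above, is given below; the linear solves replace explicit matrix inverses but are otherwise a direct transcription.

```python
import numpy as np

def spatiogram_similarity(p, mu, Sigma, p_ref, mu_ref, Sigma_ref):
    """Similarity rho of Eq. (4) with the spatial weight psi_u of Eq. (5)."""
    rho = 0.0
    for u in range(len(p)):
        S_tilde = 2.0 * (Sigma[u] + Sigma_ref[u])      # Sigma-tilde_u
        d = mu[u] - mu_ref[u]
        psi = (4.0 * np.linalg.det(Sigma[u] @ Sigma_ref[u]) ** 0.25
               / np.sqrt(np.linalg.det(S_tilde))
               * np.exp(-0.5 * d @ np.linalg.solve(S_tilde, d)))
        rho += psi * np.sqrt(p[u] * p_ref[u])           # Eq. (4)
    return rho
```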

4 Proposed Multi-modality Tracker

Assume there are N different modalities (and hence N reference spatiograms); then each candidate state corresponds to N image patches of different modalities. That is to say, at any time instant t, all N well-registered image patches of different modalities share the same dynamic state (including location, size and shape), because they merely use different modalities to describe the same real object.

4.1 Joint Spatiogram Representation

Denote by \(\varvec{\hat{h}}_j = \{\hat{p}_u^j, \varvec{\hat{\mu }}_u^j, \widehat{\mathbf {\Sigma }}_u^j\}_{u=1}^{m}\) the jth modality reference spatiogram. Given a candidate state centered at \(\varvec{z}\), then there are N candidate spatiograms with different modalities, and we denote by \(\varvec{{h}}_j = \{p_u^j(\varvec{z}), \varvec{{\mu }}_u^j(\varvec{z}), {\mathbf {\Sigma }}_u^j(\varvec{z})\}_{u=1}^{m}\) the jth modality candidate spatiogram. The similarity between the jth modality candidate spatiogram and its reference spatiogram is measured by

$$\begin{aligned} \rho _j(\varvec{z})=\sum \limits _{u=1}^{m}{\psi _u^j(\varvec{z})\sqrt{{{p}_{u}^j}(\varvec{z}){\hat{p}_{u}^j}}} \end{aligned}$$
(6)

where

$$\begin{aligned} \psi _u^j(\varvec{z}) = \frac{4|\mathbf {\Sigma }_u^j(\varvec{z})\widehat{\mathbf {\Sigma }}_u^j|^{\frac{1}{4}}}{|\widetilde{\mathbf {\Sigma }}_u^j|^{\frac{1}{2}}}\exp \left\{ -\frac{1}{2}(\varvec{\mu }_u^j(\varvec{z})-\varvec{\hat{\mu }}_u^j)^\mathrm {T}(\widetilde{\mathbf {\Sigma }}_u^j(\varvec{z}))^{-1}(\varvec{\mu }_u^j(\varvec{z})-\varvec{\hat{\mu }}_u^j)\right\} , \end{aligned}$$
(7)

and \(\widetilde{\mathbf {\Sigma }}_u^j(\varvec{z})=2(\mathbf {\Sigma }_u^j(\varvec{z})+\widehat{\mathbf {\Sigma }}_u^j)\).

For multi-modality fusion tracking, whether a candidate state should be accepted is decided by the joint similarity of all modality candidates with their corresponding targets, so we define the joint similarity (or objective function) as

$$\begin{aligned} \rho (\varvec{z})\triangleq \sum \limits _{j=1}^{N}{\alpha _j\rho _j(\varvec{z})}, \end{aligned}$$
(8)

where \(0\le \alpha _j\le 1\) is the weight that reflects the reliability of the jth modality for evaluating the target state in the current frame, and \(\sum \nolimits _{j=1}^{N}{\alpha _j}=1\).

4.2 Target Localization

The goal of target localization is to estimate the target translation \(\varvec{\hat{z}}\) that maximizes the joint similarity in (8). Denote by \(\varvec{\hat{z}}_0\) the estimated target location in the previous frame. Approximating the joint similarity (8) in the current frame by its first-order Taylor expansion around the values \({p}_{u}^j(\varvec{\hat{z}}_{0})\) and \({\varvec{\mu }}_{u}^j(\varvec{\hat{z}}_{0})\) results in

$$\begin{aligned} \begin{aligned} \rho (\varvec{z}) \approx&\sum _{j=1}^{N}\sum _{u=1}^{m}\alpha _j\psi _u^j(\varvec{\hat{z}}_0) \sqrt{p_u^j(\varvec{\hat{z}}_0)\hat{p}_u^j}\left( (\widetilde{\mathbf {\Sigma }}_u^j(\varvec{\hat{z}}_0))^{-1}(\varvec{\hat{\mu }}_u^j - {\varvec{\mu }}_u^j(\varvec{\hat{z}}_0))\right) ^{\mathrm {T}}\varvec{\mu }_u^j(\varvec{z})\,+ \\&\sum _{j=1}^{N}\sum _{u=1}^{m}\frac{\alpha _j}{2}\psi _u^j(\varvec{\hat{z}}_0) \sqrt{{\hat{p}_u^j}/{p_u^j(\varvec{\hat{z}}_0)}}p_u^j(\varvec{z})+C, \end{aligned} \end{aligned}$$
(9)

where

$$\begin{aligned} p_u^j(\varvec{z})= \sum \limits _{i=1}^{n}{k(\Vert \varvec{x}_{i}^j-\varvec{z}\Vert ^2)\delta {[b(\varvec{x}_i^j)-u]}}, \end{aligned}$$
(10)
$$\begin{aligned} \varvec{\mu }_u^j(\varvec{z})=\frac{1}{\sum _{k=1}^{n}{\delta {[b(\varvec{x}_k^j)-u]}}} \sum _{i=1}^{n}{(\varvec{x}_{i}^j-\varvec{z})\delta {[b(\varvec{x}_i^j)-u]}}, \end{aligned}$$
(11)

and C is independent of \(\varvec{z}\). Taking the derivative of (9) with respect to \(\varvec{z}\) yields

$$\begin{aligned} \frac{\partial \rho (\varvec{z})}{\partial \varvec{z}} = \sum _{j=1}^{N}\sum \limits _{i=1}^{n}w_i^j{k'} (\Vert \varvec{z}-\varvec{x}_{i}^j\Vert ^2)(\varvec{z}-\varvec{x}_i^j) - \sum _{j=1}^{N}\sum \limits _{u=1}^{m}\varvec{w}_u^j, \end{aligned}$$
(12)

where

$$\begin{aligned} \left. \begin{aligned}&w_i^j =\sum \limits _{u=1}^{m} \frac{\alpha _j}{2}\psi _u^j(\varvec{\hat{z}}_0) \sqrt{{\hat{p}_u^j}/{p_u^j(\varvec{\hat{z}}_0)}}\delta {[b(\varvec{x}_i^j)-u]}\\&\varvec{w}_u^j=\alpha _j\psi _u^j(\varvec{\hat{z}}_0) \sqrt{p_u^j(\varvec{\hat{z}}_0)\hat{p}_u^j}\,(\widetilde{\mathbf {\Sigma }}_u^j(\varvec{\hat{z}}_0))^{-1}(\varvec{\hat{\mu }}_u^j - {\varvec{\mu }}_u^j(\varvec{\hat{z}}_0)) \end{aligned} \right\} \end{aligned}$$
(13)

Set \(\frac{\partial \rho (\varvec{z})}{\partial \varvec{z}}=0\) and solve for \(\varvec{z}\):

$$\begin{aligned} \varvec{\hat{z}} = \frac{\sum \limits _{j=1}^{N}\sum \limits _{i=1}^{n}w_i^j g(\Vert \varvec{\hat{z}}_0-\varvec{x}_{i}^j\Vert ^2)\varvec{x}_i^j- \sum \limits _{j=1}^{N}\sum \limits _{u=1}^{m}\varvec{w}_u^j}{\sum \limits _{j=1}^{N}\sum \limits _{i=1}^{n}w_i^j g(\Vert \varvec{\hat{z}}_0-\varvec{x}_{i}^j\Vert ^2)}, \end{aligned}$$
(14)

where \(g(x) = -k'(x)\). If we use the Epanechnikov profile [5], then the derivative of the kernel is constant, so the iteration (14) reduces to

$$\begin{aligned} \varvec{\hat{z}} = \left( \sum \limits _{j=1}^{N}\sum \limits _{i=1}^{n}w_i^j\varvec{x}_i^j-\sum \limits _{j=1}^{N}\sum \limits _{u=1}^{m}\varvec{w}_u^j\right) \bigg /{\sum \limits _{j=1}^{N}\sum \limits _{i=1}^{n}w_i^j}, \end{aligned}$$
(15)

To make the physical meaning of (15) evident, we rearrange the numerator to obtain

$$\begin{aligned} \varvec{\hat{z}} = \sum \limits _{j=1}^{N}\sum \limits _{i=1}^{n}(w_i^j\varvec{x}_i^j-\varvec{v}_i^j) \bigg /{\sum \limits _{j=1}^{N}\sum \limits _{i=1}^{n}w_i^j}, \end{aligned}$$
(16)

where \(\varvec{v}_i^j = \sum \nolimits _{u=1}^{m}(\varvec{w}_u^j\delta {[b(\varvec{x}_i^j)-u]}\big /{\sum _{k=1}^{n}{\delta {[b(\varvec{x}_k^j)-u]}}})\). It is easy to see from (16) that each pixel in each modality candidate patch casts a vote of magnitude proportional to \(\Vert w_i^j\varvec{x}_i^j - \varvec{v}_i^j\Vert \) in the direction of \(\varvec{x}_i^j-\varvec{v}_i^j/w_i^j\) for the mean-shift offset toward or away from \(\varvec{\hat{z}}\), which also shows that the optimal location of the target is determined by all modalities.
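To make the localization step concrete, the sketch below performs one Epanechnikov mean-shift iteration of (15), with the per-pixel and per-bin weights of (13) computed from the candidate and reference spatiograms. The dictionary layout of the inputs is an assumed convention for this illustration, not the paper's data structure.

```python
import numpy as np

def mjst_shift(candidates, references, alphas):
    """One Epanechnikov mean-shift step of Eq. (15).

    candidates: list over modalities of dicts with keys
        'x'    : (n, 2) normalized pixel coordinates x_i^j around z_hat_0
        'bins' : (n,)   bin indices b(x_i^j)
        'p', 'mu', 'Sigma' : candidate spatiogram evaluated at z_hat_0
    references: list of dicts with 'p', 'mu', 'Sigma' (reference spatiograms)
    alphas: modality weights alpha_j (summing to 1)
    Returns the new location estimate z_hat (same normalized frame as 'x').
    """
    num = np.zeros(2)
    den = 0.0
    eps = 1e-12
    for cand, ref, a in zip(candidates, references, alphas):
        m = len(ref['p'])
        w_vec = np.zeros((m, 2))                      # w_u^j of Eq. (13)
        w_bin = np.zeros(m)                           # per-bin part of w_i^j
        for u in range(m):
            S_tilde = 2.0 * (cand['Sigma'][u] + ref['Sigma'][u])
            d = cand['mu'][u] - ref['mu'][u]
            psi = (4.0 * np.linalg.det(cand['Sigma'][u] @ ref['Sigma'][u]) ** 0.25
                   / np.sqrt(np.linalg.det(S_tilde))
                   * np.exp(-0.5 * d @ np.linalg.solve(S_tilde, d)))   # Eq. (7)
            w_vec[u] = (a * psi * np.sqrt(cand['p'][u] * ref['p'][u])
                        * np.linalg.solve(S_tilde, ref['mu'][u] - cand['mu'][u]))
            w_bin[u] = 0.5 * a * psi * np.sqrt(ref['p'][u] / (cand['p'][u] + eps))
        w_i = w_bin[cand['bins']]                     # w_i^j of Eq. (13)
        num += w_i @ cand['x'] - w_vec.sum(axis=0)    # numerator of Eq. (15)
        den += w_i.sum()
    return num / den
```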

4.3 Adjusting Weights Using Fuzzy Logic

The weights in most algorithms are assumed to remain unchanged during tracking, but in fact the importance (reliability) of each modality changes over time. Usually, the target does not change drastically between consecutive frames, yet it always changes a little, so we can only fuzzily, rather than deterministically, estimate the change. In this paper, we use the similarity between the tracking result and the reference model to measure the change, and employ a fuzzy logic method to calculate the weights. Specifically, the inputs of the fuzzy system are the reliabilities

$$\begin{aligned} e_j(\varvec{\hat{z}})=\frac{1}{\sqrt{2\pi }\delta }\exp \left( -\frac{1-\rho _j(\varvec{\hat{z}})}{2\delta ^2}\right) \end{aligned}$$
(17)

for j = 1,...,N, where \(\rho _j(\varvec{\hat{z}})\) is the similarity between the tracking result and the reference model in the current frame. Notice that the smaller the similarity, the lower the reliability. The outputs of the fuzzy system are the weights \(\alpha _j\), j = 1,...,N, to be used in the next frame.

In this study, we apply singleton fuzzification, product inference, and centroid defuzzification to build the fuzzy system [21]. Each input variable \(e_j\) is fuzzified with five linguistic variables, labeled SR, S, M, B, BR, partitioned on the interval [0, 1], and each output variable \(\alpha _j\) is fuzzified with nine linguistic variables, labeled ST, VS, SR, S, M, B, BR, VB, BT, partitioned on the interval [0, 1], where ST stands for smallest, VS for very small, SR for smaller, S for small, M for middle, B for big, BR for bigger, VB for very big, and BT for biggest. The membership functions of \(e_j\) and \(\alpha _j\) are all Gaussian functions, as shown in Fig. 1(a) and (b).

Fig. 1. (a) Membership function for \(e_j\), (b) membership function for \(\alpha _j\), (c) the control rule base for a dual-input and single-output fuzzy system.

Our fuzzy system is a typical multi-input multi-output system with N inputs and N outputs; directly using IF-THEN rules in which N inputs infer one output would result in \(N \times 5^N\) rules, which is time-consuming. Therefore, we convert the fuzzy system into several dual-input and single-output (DISO) fuzzy systems, and define the fuzzy rule of each DISO system as

$$\begin{aligned} \text {IF}\ e_{j}\ \text {is}\ A_{1}^{k}\ \text {and}\ {e}_{i}\ \text {is}\ A_{2}^{k}\ \text {THEN}\ \alpha ^i_j\ \text {is}\ {B}^{k},k=1,\ldots ,K \end{aligned}$$
(18)

where \(A_{1}^{k}\) and \(A_{2}^{k}\) belong to {SR, S, M, B, BR}, \(B^{k}\) belongs to {ST, VS, SR, S, M, B, BR, VB, BT}, \(\alpha ^i_j\) (\(i\ne j\)) is the weight of modality j relative to modality i, and K is the total number of rules. The control rule base originating from (18) is shown in Fig. 1(c); apparently, K = 25 and \(\alpha ^j_i = 1 -\alpha ^i_j\). After performing the fuzzy inference procedure to acquire all \(\alpha ^i_j\), the weight \(\alpha _j\) is calculated by

$$\begin{aligned} \alpha _j=(\alpha ^1_j+\cdots +\alpha ^{(j-1)}_j+\alpha ^{(j+1)}_j+\cdots +\alpha ^N_j)\big /(N-1) \end{aligned}$$
(19)

Finally, all \(\alpha _j\) are normalized such that \(\sum _{j}\alpha _j = 1\). Since the derivation of each \(\alpha _j\) needs \(N-1\) noninteractive DISO fuzzy systems and each pair of modalities shares one system, the total number of DISO systems is \(N(N-1)/2\), so the total number of fuzzy rules is reduced to \(25\times N(N-1)/2\).
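The following sketch assembles the weight adjustment of this subsection: the reliabilities of (17), one DISO fuzzy system per modality pair with singleton fuzzification, product inference and center-average defuzzification, and the aggregation of (19). The concrete rule table (consequent index driven by the difference between the two input labels), the Gaussian widths, and the rescaling of the reliabilities into [0, 1] are illustrative assumptions, since Fig. 1(c) is not reproduced numerically here.

```python
import numpy as np

IN_CENTERS  = np.linspace(0.0, 1.0, 5)   # SR, S, M, B, BR
OUT_CENTERS = np.linspace(0.0, 1.0, 9)   # ST, VS, SR, S, M, B, BR, VB, BT
SIGMA = 0.12                             # width of the Gaussian memberships (assumed)

def diso_weight(e_j, e_i):
    """Weight alpha_j^i of modality j relative to modality i (rule (18)).

    Illustrative rule table: the consequent grows when e_j exceeds e_i,
    which keeps alpha_i^j = 1 - alpha_j^i for these symmetric memberships.
    """
    mu_j = np.exp(-(e_j - IN_CENTERS) ** 2 / (2 * SIGMA ** 2))
    mu_i = np.exp(-(e_i - IN_CENTERS) ** 2 / (2 * SIGMA ** 2))
    fire = np.outer(mu_j, mu_i)                       # product inference, 25 rules
    out_idx = 4 + (np.arange(5)[:, None] - np.arange(5)[None, :])
    return float((fire * OUT_CENTERS[out_idx]).sum() / fire.sum())

def modality_weights(rhos, sigma=0.2):
    """Reliabilities of Eq. (17), pairwise DISO weights, aggregation of Eq. (19)."""
    rhos = np.asarray(rhos, dtype=float)
    e = np.exp(-(1.0 - rhos) / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    e = e / e.max()                                    # illustrative scaling into [0, 1]
    N = len(e)
    alpha = np.zeros(N)
    for j in range(N):
        parts = [diso_weight(e[j], e[i]) for i in range(N) if i != j]
        alpha[j] = np.mean(parts)                      # Eq. (19)
    return alpha / alpha.sum()                         # final normalization
```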

Algorithm 1 summarizes the proposed Multi-modality Joint Spatiogram Tracking (MJST) procedure, which is implemented in MATLAB+MEX. The model update will be detailed in the next section.

Algorithm 1. The proposed MJST procedure.
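Since Algorithm 1 itself is not reproduced here, the schematic per-frame loop below indicates how the pieces fit together. It reuses the sketches given earlier, while build_candidate and update_model are hypothetical placeholders for the candidate-spatiogram extraction of Sect. 4.1 and the model update of Sect. 5; the paper's own implementation is MATLAB+MEX, so this is only an illustrative outline.

```python
import numpy as np

def mjst_track_frame(frame_patches, references, alphas, z0, M_max=20, tol=1e-3):
    """Schematic per-frame MJST loop (cf. Algorithm 1). Illustrative only."""
    z = z0
    for _ in range(M_max):                                   # mean-shift iterations
        candidates = [build_candidate(p, z) for p in frame_patches]   # Sect. 4.1
        z_new = mjst_shift(candidates, references, alphas)             # Eq. (15)
        if np.linalg.norm(z_new - z) < tol:
            z = z_new
            break
        z = z_new
    # similarities rho_j(z_hat) at the converged location, Eq. (6)
    candidates = [build_candidate(p, z) for p in frame_patches]
    rhos = [spatiogram_similarity(c['p'], c['mu'], c['Sigma'],
                                  r['p'], r['mu'], r['Sigma'])
            for c, r in zip(candidates, references)]
    alphas = modality_weights(rhos)                          # fuzzy weights, Sect. 4.3
    references = [update_model(r, c) for r, c in zip(references, candidates)]  # Sect. 5
    return z, alphas, references
```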

5 Model Update via Particle Filter

In practical tracking scenarios, the target model should be updated so that it can capture appearance variations due to illumination or pose changes. Some self-learning methods take the weighted sum of the current tracking result and the current model as the new model, which is easy to implement but prone to drifting from the target because of the accumulation of errors. Semi-supervised learning methods have been suggested to avoid the drift problem, but classifier-based tracking is not applicable to our case. Motivated by Ref. [12], we propose a model update mechanism based on a particle filter, whose details are introduced as follows.

5.1 Spatiogram Filtering

In most tracking methods, the particle filter keeps track of changes in object position and velocity, not changes in object appearance. We instead use the particle filter to filter the object spatiogram so as to obtain the optimal estimate of the target model.

In frame t, let \(I(\varvec{x}_i, t)\) be the feature value (e.g., intensity, gradient, texture, etc.) at the normalized coordinate \(\varvec{x}_i\) in the object image, and \(\varvec{h}(\{I(\varvec{x}_i, t)\}_{i=1}^n)\) be the corresponding spatiogram, obtained by applying (1) to (3) to \(\{I(\varvec{x}_i, t)\}_{i=1}^n\). The object appearance variations are in essence the value changes of all \(I(\varvec{x}_i, t)\). Since the change of each \(I(\varvec{x}_i, t)\) is independent, we can filter each of them independently. Thus, the state prediction equation of each \(I(\varvec{x}_i, t)\) is given by

$$\begin{aligned} I(\varvec{x}_i, t)=I(\varvec{x}_i, t-1)+\omega _{i}(t - 1), \quad \quad i=1,2,\cdots ,n, \end{aligned}$$
(20)

where \(I(\varvec{x}_i, t)\) is called the state of pixel \(\varvec{x}_i\) in the frame t, and \(I(\varvec{x}_i, t - 1)\) is its counterpart in the frame \(t - 1\). \(\omega _{i}(t - 1)\) specifies the state noise owing to the object appearance variation, which is assumed to be Gaussian, and furthermore, to have the same variance \(\sigma _\omega ^2\) for all \(\varvec{x}_i\).

To obtain measurements for the filters, the previous reference spatiogram \(\varvec{h}(\{I(\varvec{x}_i, t - 1)\}_{i=1}^n)\) is matched with the current frame, yielding a new spatiogram \(\varvec{h}(\varvec{z}_t)\) at the convergence position \(\varvec{z}_t\) provided by the mean-shift tracking algorithm. Then, \(\varvec{h}(\varvec{z}_t)\) is used as the measurement for \(\varvec{h}(\{I(\varvec{x}_i, t)\}_{i=1}^n)\), and the observation equation is

$$\begin{aligned} \varvec{h}(\varvec{z}_t)=\varvec{h}(\{I(\varvec{x}_i, t)\}_{i=1}^n)+\varvec{\upsilon }(t), \end{aligned}$$
(21)

where \(\varvec{\upsilon }(t)\) models the noise in the image signal. Since each bin in the spatiogram is independent, (21) can be rewritten as

$$\begin{aligned} h_u(\varvec{z}_t)=h_u(t)+\varvec{\upsilon }_u(t), \quad \quad u=1,2,\cdots ,m, \end{aligned}$$
(22)

where \(h_u(\varvec{z}_t) = [p_u(\varvec{z}_t), \varvec{\mu }_u(\varvec{z}_t), \mathbf {\Sigma }_u(\varvec{z}_t)]^{\mathrm {T}}\), \(\varvec{h}(\varvec{z}_t) = \{h_u(\varvec{z}_t)\}_{u=1}^m\), \(h_u(t) = [\hat{p}_u(t), \varvec{\hat{\mu }}_u (t), \mathbf {\widehat{\Sigma }}_u(t)]^{\mathrm {T}}\), \(\varvec{h}(\{I(\varvec{x}_i, t)\}_{i=1}^n) = \{h_u(t)\}_{u=1}^m\), and \(\varvec{\upsilon }_u(t)\) is the observation noise, which is assumed to be Gaussian with the same variance \(\sigma _\upsilon ^2\) for all u. Because \(\mathbf {\Sigma }_u\) is a 2-D diagonal matrix, we only update its two diagonal entries, so \(h_u\) is a 5-D vector.

We call \(\varvec{h}(\{I(\varvec{x}_i, t - 1)\}_{i=1}^n)\) the current model and \(\varvec{h}(\varvec{z}_t)\) the observation model. The particle filter then yields a tradeoff between the two models and provides an optimal estimate of the object model, called the candidate model and denoted by \(\varvec{h}(\{\hat{I}(\varvec{x}_i, t)\}_{i=1}^n)\), where \(\{\hat{I}(\varvec{x}_i, t)\}_{i=1}^n\) is the optimal particle. In the implementation, \(\{I(\varvec{x}_i, t)\}_{i=1}^n\) is treated as a particle whose likelihood weight is calculated by \(\frac{1}{\sqrt{2\pi }\sigma _\upsilon }\exp \{-\frac{1-\rho [\varvec{h}(\varvec{z}_t), \varvec{h}(\{I(\varvec{x}_i, t)\}_{i=1}^n)]}{\sigma _\upsilon ^2}\}\) using (4), and \(\{\hat{I}(\varvec{x}_i, t)\}_{i=1}^n\) is determined by taking the weighted mean of all particles.
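A compact sketch of this filtering step is given below, reusing the spatiogram and similarity functions sketched in Sect. 3; the noise levels, the clipping to [0, 255], and the spatiograms-as-tuples convention are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def filter_model(I_prev, h_obs, spatiogram, similarity,
                 n_particles=50, sigma_w=5.0, sigma_v=0.2, rng=None):
    """Particle-filter estimate of the object appearance (Sect. 5.1).

    I_prev : (H, W) pixel values I(x_i, t-1) of the current model
    h_obs  : observation spatiogram h(z_t) as a (p, mu, Sigma) tuple
    spatiogram, similarity : the functions sketched in Sect. 3
    Returns the optimal particle I_hat and its spatiogram (the candidate model).
    """
    rng = rng or np.random.default_rng()
    # state prediction, Eq. (20): each particle perturbs every pixel independently
    particles = I_prev[None] + rng.normal(0.0, sigma_w, (n_particles,) + I_prev.shape)
    particles = np.clip(particles, 0, 255)
    # likelihood weights from the similarity to the observation model
    weights = np.empty(n_particles)
    for k in range(n_particles):
        p, mu, S = spatiogram(particles[k])
        rho = similarity(p, mu, S, *h_obs)
        weights[k] = np.exp(-(1.0 - rho) / sigma_v ** 2) / (np.sqrt(2 * np.pi) * sigma_v)
    weights /= weights.sum()
    # the optimal particle is the weighted mean of all particles
    I_hat = np.tensordot(weights, particles, axes=1)
    return I_hat, spatiogram(I_hat)
```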

5.2 Update Criterion

It is not suitable to accept the candidate model all the time. We should find a robust criterion to decide whether the candidate model \(\varvec{h}(\{\hat{I}(\varvec{x}_i, t)\}_{i=1}^n)\) should be accepted, because over-updating could make the tracker sensitive to outliers such as occlusions or dramatic appearance changes.

We take the similarity \(\rho [\varvec{h}(\varvec{z}_t), \varvec{h}(\{I(\varvec{x}_i, t - 1)\}_{i=1}^n)]\) between the observation model and the current model as the criterion. If the similarity is smaller than the threshold \(\gamma \), which implies that the object undergoes dramatic appearance changes, then we reject \(\varvec{h}(\{\hat{I}(\varvec{x}_i, t)\}_{i=1}^n)\) and keep using the current model; otherwise we accept it. The final model update formula is as follows

$$\begin{aligned} \varvec{\hat{h}}(t) = {\left\{ \begin{array}{ll} \varvec{h}(\{I(\varvec{x}_i, t - 1)\}_{i=1}^n)&{}\rho [\varvec{h}(\varvec{z}_t), \varvec{h}(\{I(\varvec{x}_i, t - 1)\}_{i=1}^n)]<\gamma \\ \varvec{h}(\{\hat{I}(\varvec{x}_i, t)\}_{i=1}^n)&{}\rho [\varvec{h}(\varvec{z}_t), \varvec{h}(\{I(\varvec{x}_i, t - 1)\}_{i=1}^n)]\ge \gamma \\ \end{array}\right. } \end{aligned}$$
(23)

Once \(\varvec{h}(\{\hat{I}(\varvec{x}_i, t)\}_{i=1}^n)\) is accepted, the \(\{\hat{I}(\varvec{x}_i, t)\}_{i=1}^n\) is also saved.
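The acceptance test of (23) then amounts to a one-line decision, sketched here with spatiograms passed as (p, mu, Sigma) tuples and the similarity function from Sect. 3; the tuple convention is an assumption for the sketch.

```python
def update_reference(h_cur, h_cand, h_obs, similarity, gamma=0.4):
    """Update criterion of Eq. (23): keep the current model under drastic change."""
    if similarity(*h_obs, *h_cur) < gamma:     # likely occlusion / abrupt change
        return h_cur                           # reject the filtered candidate model
    return h_cand                              # accept the filtered candidate model
```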

6 Experiments

To verify the flexibility, accuracy and efficiency of the proposed tracker, we tested it on six registered infrared-visible sequences that involve common difficulties such as night, shade, clutter, crossover and occlusion. We also compared the proposed tracker with state-of-the-art methods such as the \(\ell _1\) tracker (L1T) [4], the joint sparse representation tracker (JSRT) [10], and the fuzzified region dynamic fusion tracker (FRD) [7]. All tracking results were obtained by running these trackers on an Intel Dual-Core 2.6 GHz CPU with 8 GB RAM, using the same initial positions for fair comparison. The L1T can handle only one modality at a time.

We set \(M_{max} = 20\), \(\gamma = 0.4\), and the number of particles to 50 for model update. We use the combination of infrared gray, infrared HOG (histogram of oriented gradients), visible gray, and visible LBP (local binary pattern), so \(N=4\).

6.1 Combination of Infrared Gray-HOG and Visible Gray-LBP

The LBP [22] is very effective at describing image texture features and has advantages such as fast computation and rotation invariance, which have led to its wide usage in texture analysis, image retrieval, object tracking, etc. In LBP, each pixel is assigned a texture value, which can be naturally combined with the gray value of the pixel to represent targets. The LBP operator labels each pixel of an image by thresholding its neighborhood with the center value and interpreting the result as a binary number.
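A basic 8-neighbor LBP operator can be sketched as follows; the rotation-invariant variant mentioned above additionally maps each code to the minimum over its bit rotations, which is omitted here for brevity.

```python
import numpy as np

def lbp_8neighbor(gray):
    """Basic 8-neighbor LBP: threshold the neighborhood against the center pixel
    and read the comparison results as an 8-bit binary number."""
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]                       # interior pixels serve as centers
    # clockwise neighbor offsets starting from the top-left pixel
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code |= ((neighbor >= c).astype(np.int32) << bit)
    return code                             # one texture value per interior pixel, in [0, 255]
```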

The HOG technique [23] counts occurrences of gradient orientations in localized portions of an image and is very effective at describing the object shape feature. HOG is invariant to geometric and photometric transformations, so it is widely used in object detection; see [23] for the extraction steps of HOG. To enable HOG to combine naturally with the gray image, we employ the visualization method suggested in [24].
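For illustration, scikit-image's built-in HOG rendering can serve as a stand-in for such a visualization, turning the infrared HOG descriptor into a gray image that can be binned like an ordinary intensity channel; this substitution is an assumption for the sketch and not necessarily the method of [24].

```python
import numpy as np
from skimage.feature import hog

def hog_gray_image(infrared_gray):
    """HOG descriptor rendered back as a gray image (illustrative stand-in
    for the visualization of [24], using scikit-image's HOG glyph rendering)."""
    _, hog_image = hog(infrared_gray, orientations=9,
                       pixels_per_cell=(8, 8), cells_per_block=(2, 2),
                       visualize=True)
    # rescale the rendered glyph image to [0, 255] for spatiogram binning
    hog_image = 255 * (hog_image - hog_image.min()) / (np.ptp(hog_image) + 1e-12)
    return hog_image.astype(np.uint8)
```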

Generally, the visible image contains more texture features, whereas the shape feature is more obvious in the infrared image, so we extract the LBP feature from the visible image and the HOG feature from the infrared image. Figure 2 shows the extraction of LBP and HOG features from visible and infrared images, respectively.

Fig. 2. Extracting LBP and HOG features from visible and infrared images respectively: (a) visible image, (b) visible gray, (c) visible LBP, (d) infrared image, (e) infrared gray, (f) infrared HOG.

Figure 3 shows the screenshots of some sampled tracking results on video1-video6.

Fig. 3. Screenshots of some sampled tracking results, where L1TIR denotes using L1T to track the infrared target, and L1TVS denotes using L1T to track the visible target.

Night and Shade: At night without lighting, the target in the visible image is usually obscure (see Fig. 3(a)), so a tracker that uses only the visible camera often fails to hold the target. However, by using or cooperating with the infrared camera, our tracker holds the target well, because the infrared sensor is independent of illumination. In Fig. 3(b), the man in black is walking into a shaded area, and the contrast between him and his background is very low in the visible image but not in the infrared image, which is similar to walking at night, so our tracker holds the target well. It is notable that our tracker successfully overcomes the occlusion by the lamppost (see frames 245 and 288 in Fig. 3(b)), which is mainly attributed to the use of online model update.

Crossover and Clutter: A tracker is easily attracted by a distracter when the target is at thermal crossover or in a cluttered background. For example, FRD is attracted by the right man when the two men are at thermal crossover (see frame 78 in Fig. 3(c)), and FRD and L1TVS are attracted by the cluttered background (see frames 118 and 291 in Fig. 3(d)). The distracter presents a similar appearance to the target, so it can match the target well, which makes such a distracter difficult to discriminate using only a single source or matching score. However, our tracker sticks with the target all the time because it uses the joint spatiogram representation of the infrared and visible modalities, and thus can always capture the target well.

Partial and Full Occlusion: Occlusion is the most common yet crucial problem in object tracking, and it can be classified as partial or full. For partial occlusion, our tracker successfully overcomes it (see frame 245 in Fig. 3(b), frame 112 in Fig. 3(e), and frame 100 in Fig. 3(f)), because it uses model update to handle occlusion. For full occlusion, as shown in Figs. 3(f) and 4(f), almost all trackers lose the target, because the target has completely disappeared from view.

6.2 Quantitative Comparison

We use five criteria, i.e., the center offset error \(\varrho = \sqrt{(x_G - x_T)^{2} + (y_G - y_T)^{2}}\), the average center offset error \(\bar{\varrho } = \frac{1}{q}\sum \nolimits _{i}\varrho _i\), the overlap ratio \(\epsilon = \tfrac{area({R}_{G}\bigcap {R}_{T})}{area({R}_{G}\bigcup {R}_{T})}\), the average overlap ratio \(\bar{\epsilon } = \tfrac{1}{q} \sum \nolimits _{i}\epsilon _i\), and the success rate \(sr = \tfrac{1}{q} \sum \nolimits _{i}f(\epsilon _i - 0.5)\), where \((x_G,y_G,{R}_{G})\) are the manually annotated center and region of the target, \((x_T,y_T,{R}_{T})\) are the results given by the tracker, and q is the number of frames. Here f(x) is the step function: if \(x \ge 0\) then \(f(x) = 1\), else \(f(x) = 0\). The overlap ratio evaluates the accuracy of the tracking result per frame; its value is 1 when the estimated region overlaps fully with the ground truth, i.e., the best possible tracking result. An ideal tracker should have a small center offset error and a high overlap ratio and success rate.
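A minimal sketch of these criteria for axis-aligned bounding boxes is given below; the (x, y, w, h) box format is an assumption for illustration.

```python
import numpy as np

def overlap_ratio(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def evaluate(centers_gt, centers_tr, boxes_gt, boxes_tr):
    """Average center offset error, average overlap ratio, and success rate."""
    centers_gt, centers_tr = np.asarray(centers_gt), np.asarray(centers_tr)
    offsets = np.linalg.norm(centers_gt - centers_tr, axis=1)          # rho per frame
    overlaps = np.array([overlap_ratio(g, t) for g, t in zip(boxes_gt, boxes_tr)])
    return offsets.mean(), overlaps.mean(), (overlaps >= 0.5).mean()   # rho-bar, eps-bar, sr
```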

Table 1. Quantitative comparison of MJST, JSRT, FRD and L1T.

The L1T can track only one of the visible and infrared modalities at a time, so for ease of comparison with the other fusion trackers, we integrate the tracking results of L1TVS and L1TIR as follows: (1) the time cost of L1T is the sum of those of L1TVS and L1TIR, and (2) the target state of L1T is the weighted average of the target states of L1TVS and L1TIR, where the weights are induced from their observation likelihoods in a mean-square-error manner. The quantitative results are summarized in Table 1, where the highlighted fonts indicate the best and the second-best performance, respectively. The overlap ratio per frame is shown in detail in Fig. 4. It can be observed that, compared with the other trackers, our tracker obtains the highest tracking accuracy and success rate.

Fig. 4. The overlap ratio between the tracked region and the manually marked ground truth.

7 Conclusions

In this paper we have proposed a unified multi-modality tracking framework that incorporates fuzzy logic. In the framework, different types of modalities can be arbitrarily added, removed and naturally integrated through the joint spatiogram representation. In addition, the target model update strategy enables the tracker to respond promptly to appearance variations. Experimental results on six datasets demonstrate that our approach performs best. Our current tracker is not yet suitable for tracking fully occluded targets.