1 Introduction

Tracking is a fundamental task for any practical video application, requiring a level of understanding of the objects of interest. The subject has received increasing attention in recent years, with state-of-the-art trackers mainly relying on the visible spectrum and deep neural networks. Despite the gains in accuracy and robustness, some limitations persist. In order to overcome the constraints of a single spectral range during tracking, a multi-spectral approach can be utilized. For example, an additional infrared sensor can provide information complementary to an image obtained in the visible range. On the one hand, visible images offer rich content (i.e. colors, texture) and should be preferred when the thermal properties of an object are close to those of the surrounding environment; on the other hand, infrared images are better suited to changing lighting conditions or gloomy environments [3, 7]. This approach could greatly benefit applications such as surveillance [1], traffic monitoring [2] and medical imaging [6].

Among the challenges arising in both domains, the potential added value comes with an overhead, such as the increased amount of raw data that needs to be processed. An ideal fusion method should therefore preserve the positive characteristics of the individual channels while reducing the amount of data to be processed. Accordingly, all subsequent processing stages, such as visual tracking, should benefit from the fused input stream. However, since the fused image is inevitably non-optimal, it is not clear whether the general expectation of increased performance is met. To this end, we evaluate the behavior of current state-of-the-art visual trackers on fused data streams produced by current state-of-the-art fusion strategies.

The evaluation is done on typical visual tracking sequences from the Camel [9] data set. Since image fusion relies on synchronized and well-registered cameras, the data set is carefully edited further to ensure the best possible fusion results while still covering different visual attributes, as displayed in Fig. 1. Thereby, we can identify approaches and visual attributes that benefit from a fused input stream. This study differs from the 2015 VOT challenge [15] in that, by using synchronized input streams, only the appearance change from one spectrum to the other is evaluated.

Fig. 1. Synchronized frames of a video stream from Camel [9] in (from left to right) the visual, infrared, addition and l1-norm subsets.

In the following, Sect. 2 gives a short introduction to the selected trackers and fusion strategies. Afterwards, Sect. 3 describes the evaluation process and examines the quantitative results, and Sect. 4 concludes the paper.

2 Visual Trackers and Fusion

2.1 Reference Trackers

Due to the potentially wide range of industrial tracking-based applications, visual tracking is a very popular research area and several public benchmarks exist. A selection of current single-object visual trackers achieving top ranks on the most widely used visual tracking benchmarks [14, 24] is considered for the experiments presented in this paper. Furthermore, the selected trackers have to rely on an FCNN for feature extraction and operate model-free. For a more detailed description and categorization, we refer to the original papers and to the corresponding benchmark papers. Following those criteria, we chose to work with:

  • Re\(^3\) [10] (Real-Time Recurrent Regression Network) performs feature extraction using the CaffeNet architecture (a re-implementation of AlexNet [16] in Caffe) without the fully connected part. A regression layer outputs the location and size of the object in the frame. The tracker uses two LSTM layers to remember appearance changes and motion information of the target.

  • MBMD [27] is the winner of the 2018 VOT long-term challenge [14] and is composed of a bounding box regression network that identifies potential locations of the target and a verification network that verifies the target at these potential locations. Feature extraction for the regression is performed by an SSD-MobileNet [13] network, and feature extraction for the verification stage is performed by a pre-trained VGG-16 [22].

  • DaSiamRPN [28] is the winner of the 2018 VOT real-time challenge [14], the runner-up in the long-term part of the challenge, and an extension of SiamRPN [18]. Feature extraction is performed using the FCNN of Bertinetto et al. [5]. In addition, it uses a local-to-global search region strategy for target re-detection and a distractor-aware component to capture target appearance variations and to discriminate the target from its surroundings.

  • MemTrack [26] is based on a dynamic memory network [17] architecture. Feature extraction is performed by the FCNN used in [5]. A soft attention mechanism [25] locates potential regions of the target in the search area. Based on the output of the attention mechanism, an LSTM selects an appropriate template from the stored ones, which is combined with a reference template to create a residual template. Afterwards, the residual template is used to find the location of the target in the search area.

  • SiamMask [23] is a recently introduced state-of-the-art tracker that produces, in addition to a rotated bounding box, a binary mask classifying pixel-wise whether a pixel belongs to the target. Feature extraction is performed by a variation of ResNet-50 [12]. A depth-wise cross-correlation [4] between the feature maps (sketched below) and a region proposal network [21] yield potential target locations, from which the actual target is found and the binary mask is created.
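To make the matching step of the Siamese trackers above more concrete, the following is a minimal PyTorch sketch of depth-wise cross-correlation as commonly implemented in SiamRPN-style trackers; the function name and tensor shapes are illustrative and not taken from any of the cited code bases.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(search_feat: torch.Tensor, template_feat: torch.Tensor) -> torch.Tensor:
    """Correlate each channel of the template with the matching channel of the
    search-region features, yielding one response map per channel.

    search_feat:   (B, C, Hs, Ws) features of the search region
    template_feat: (B, C, Ht, Wt) features of the exemplar/template
    returns:       (B, C, Hs-Ht+1, Ws-Wt+1) per-channel response maps
    """
    b, c, h, w = search_feat.shape
    # Fold the batch into the channel dimension so a single grouped convolution
    # correlates every (image, channel) pair independently.
    x = search_feat.reshape(1, b * c, h, w)
    kernel = template_feat.reshape(b * c, 1, *template_feat.shape[2:])
    response = F.conv2d(x, kernel, groups=b * c)
    return response.reshape(b, c, response.shape[2], response.shape[3])
```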

2.2 Fusion Strategies

Image fusion strives to preserve the positive characteristics of the individual channels while reducing the amount of data that needs to be processed. Due to the inevitable errors induced by image fusion and depending on the specific conditions, it is unclear how beneficial this approach is for tracking applications. Before analyzing these effects, we present the main concepts of image fusion and the strategies selected for this study.

In multi-modal fusion, different sensors, i.e. visible and infrared, are used in the image acquisition process. The fusion process can be applied at the pixel, object or decision level. In this paper, however, we examine only pixel-level fusion, which can traditionally be divided into transform domain methods and spatial domain methods [7, 11]. These techniques range from simple transformations of pixel intensities, e.g. averaging, addition or subtraction, to more complex transformations, e.g. Laplacian or wavelet pyramids. Departing from these traditional strategies, a shift towards deep learning is occurring, and such methods are now able to achieve state-of-the-art performance. Therefore, for the experiments of this study we select the deep neural network (DNN) DenseFuse [19].
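Before turning to DenseFuse, the simplest spatial-domain operation mentioned above, a weighted average of pixel intensities, can be sketched as follows for two registered, equally sized grayscale images; the function and parameter names are illustrative only.

```python
import numpy as np

def average_fusion(visible_gray: np.ndarray, infrared: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Naive spatial-domain pixel fusion: a weighted average of two registered,
    same-sized 8-bit grayscale images."""
    vis = visible_gray.astype(np.float32)
    ir = infrared.astype(np.float32)
    fused = alpha * vis + (1.0 - alpha) * ir
    return np.clip(fused, 0, 255).astype(np.uint8)
```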

The authors of DenseFuse propose a deep learning architecture using convolutional layers and dense blocks. The DNN is composed of three major components. The first component is an encoder made of one convolutional layer that extracts rough features, followed by three densely connected convolutional layers, enabling the network to better preserve mid- and deep-level features, improve information flow and diminish overfitting during training. The second component is a fusion layer incorporating two fusion strategies, i.e. the addition strategy originally presented in DeepFuse [20] and an l1-norm strategy. This layer integrates the pertinent features extracted by the encoder from the source images, i.e. the visible and infrared images, into one feature map. The third part of the DNN is a decoder which reconstructs the fused visible-infrared image using convolutional layers [19].
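The two strategies of the fusion layer can be illustrated on encoder feature maps as in the following simplified NumPy sketch: the addition strategy sums the feature maps element-wise, while the l1-norm strategy derives per-pixel activity maps from the channel-wise l1-norm, smooths them by block averaging and uses them as soft weights. This follows our reading of [19, 20] and is not the reference implementation.

```python
import numpy as np

def fuse_addition(feat_vis: np.ndarray, feat_ir: np.ndarray) -> np.ndarray:
    """Addition strategy: element-wise sum of the two (C, H, W) encoder feature maps."""
    return feat_vis + feat_ir

def fuse_l1_norm(feat_vis: np.ndarray, feat_ir: np.ndarray, r: int = 1) -> np.ndarray:
    """l1-norm strategy (sketch): activity = channel-wise l1-norm, block-averaged,
    then used as soft weights for a weighted sum of the feature maps."""
    activities = []
    for feat in (feat_vis, feat_ir):
        act = np.abs(feat).sum(axis=0)                    # (H, W) l1-norm activity map
        padded = np.pad(act, r, mode="edge")
        smoothed = np.zeros_like(act)
        for dy in range(2 * r + 1):                       # (2r+1) x (2r+1) block average
            for dx in range(2 * r + 1):
                smoothed += padded[dy:dy + act.shape[0], dx:dx + act.shape[1]]
        activities.append(smoothed / (2 * r + 1) ** 2)
    total = activities[0] + activities[1] + 1e-12         # avoid division by zero
    return (activities[0] / total) * feat_vis + (activities[1] / total) * feat_ir
```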

Originally, DenseFuse fuses a grayscale image with an infrared image, but it also handles color visible images by splitting the image into separate channels, which are then passed through the DNN and fused individually with the infrared image. The final result is a combination of the three newly created fused images. Figure 1 displays the fused images produced by both DenseFuse strategies together with the original visible and infrared images.
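A possible reading of this per-channel handling is sketched below; `fuse_single_channel` stands for any single-channel fusion routine (e.g. a DenseFuse forward pass), and recombining the channels by stacking is our assumption rather than a detail taken from [19].

```python
import numpy as np

def fuse_color_with_infrared(visible_rgb: np.ndarray, infrared: np.ndarray, fuse_single_channel) -> np.ndarray:
    """Fuse an (H, W, 3) visible image with an (H, W) infrared image by fusing
    each color channel separately and stacking the three fused results."""
    fused_channels = [fuse_single_channel(visible_rgb[..., c], infrared) for c in range(3)]
    return np.stack(fused_channels, axis=-1)
```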

3 Evaluation

The goal of this section is to provide empirical results and to discuss the benefits and drawbacks of a fused approach, which can lead to a non-optimal output.

3.1 Data Set to Subsets

For this study, we employ the Camel [9] data set, which consists of 26 annotated video streams paired in the visible and infrared domains. The video streams were captured in an urban environment during day and night time. Visual attributes similar to those of popular data sets for visual object tracking challenges [8, 14] are present, i.e. in-plane rotation, illumination change, scale variation, occlusion and camera motion. The data set contains 765 annotated objects in the visible domain and 787 in the infrared domain, covering four classes, i.e. bicycle, person, vehicle and dog. In order to reduce registration errors, only a reduced subset of sequences is considered for the evaluation. The criteria are set as follows:

  • We keep sequences having an Intersection over Union (IoU) above 0.7 between the ground-truth bounding boxes of the visible and infrared domains and lasting at least 30 consecutive frames.

  • Furthermore, in order to ensure an adequate sequence length, a short drop of the IoU below 0.7 is accepted if it lasts less than 10 frames and the IoU stays above 0.5, whereas an IoU below 0.5 stops the recording until the condition from the first stage holds again (the selection rules are sketched in the snippet after this list).
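The following is a minimal sketch of these selection rules applied to a per-frame series of visible/infrared ground-truth IoUs; it reflects our reading of the two criteria, not the original selection script, and the thresholds are exposed as parameters.

```python
def split_into_valid_sequences(ious, min_len=30, high=0.7, low=0.5, max_drop=10):
    """Split a per-frame list of visible/infrared ground-truth IoUs into sequences
    that satisfy the selection criteria above (sketch)."""
    sequences, start, drop = [], None, 0
    for i, iou in enumerate(ious):
        if iou >= high:
            if start is None:
                start = i                     # start (or continue) a valid sequence
            drop = 0
        elif iou >= low and start is not None and drop + 1 < max_drop:
            drop += 1                         # tolerated short drop below 0.7
        else:
            if start is not None and i - start >= min_len:
                sequences.append((start, i))  # keep sequences of sufficient length
            start, drop = None, 0
    if start is not None and len(ious) - start >= min_len:
        sequences.append((start, len(ious)))
    return sequences
```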

Based on the newly created domain subsets, we create the two fused subsets with the fusion strategies available in DenseFuse (i.e. the addition and l1-norm subsets). The ground-truth annotation for the fused subsets is obtained by averaging the ground-truth bounding boxes of both domains, which keeps a minimal valid bounding box around the target.
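A minimal sketch of this averaging, assuming the boxes are given as (x, y, width, height) tuples (the exact annotation format is not restated here):

```python
def average_boxes(box_vis, box_ir):
    """Ground truth for the fused subsets: element-wise average of the visible
    and infrared boxes, each given as (x, y, width, height)."""
    return tuple((v + r) / 2.0 for v, r in zip(box_vis, box_ir))
```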

The resulting four subsets used for the evaluation, visible (VIS), infrared (IR), addition (Add) and l1-norm (l1), contain 438 sequences, with a median sequence length of 107 frames, a median target width ratio of 0.1 and a median height ratio of 0.17. Example images from the four evaluation subsets are displayed in the figures of Subsect. 3.5.

Although we use a state-of-the-art DNN for image fusion, we cannot prevent errors generated by the DNN during the fusion process, e.g. noise or registration differences between the visible and infrared images. For example, the images in Fig. 2 depict a non-optimal registration between the source images.

Fig. 2. Example images from Camel [9], showing a registration problem between the synchronized frames in (from left to right) the visual and infrared domain subsets, and the final results in the addition and l1-norm subsets.

In contrast to the most closely related investigations on the VOT-IR data set, selecting the Camel data set ensures synchronized frames with the same attributes at the same locations, so that effectively only the impact of a spectral change on the FCNN of the trackers is studied.

3.2 Evaluation Metrics

Methodologies differ from one challenge to another in the evaluation process as well as in the performance measures. In this paper, two measures are mainly used to rank performance: firstly, accuracy, which measures the overlap during successful tracking periods, and secondly, robustness, which is the number of times the tracker loses the target. We rank the trackers according to their average IoU and average robustness over the whole subsets. For the evaluation, a target is considered lost when the IoU between the ground-truth and the predicted bounding box is below 0.5 for 10 consecutive frames. If the target is lost, the tracker is re-initialized on the next frame. For an easier comparison between the trackers, we combine accuracy and robustness into one score.
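A per-sequence sketch of this protocol is given below; `tracker.init`, `tracker.track` and `iou` are placeholder interfaces, and the way accuracy and robustness are combined into one score is not reproduced here.

```python
def evaluate_sequence(tracker, frames, gt_boxes, iou, fail_thresh=0.5, fail_frames=10):
    """Accuracy/robustness sketch: average IoU over successful frames plus the
    number of target losses (>= fail_frames consecutive frames below fail_thresh),
    re-initializing the tracker on the frame after each loss."""
    overlaps, losses, streak = [], 0, 0
    tracker.init(frames[0], gt_boxes[0])
    frame_idx = 1
    while frame_idx < len(frames):
        overlap = iou(tracker.track(frames[frame_idx]), gt_boxes[frame_idx])
        if overlap < fail_thresh:
            streak += 1
            if streak == fail_frames:                     # target considered lost
                losses += 1
                streak = 0
                frame_idx += 1                            # re-initialize on the next frame
                if frame_idx < len(frames):
                    tracker.init(frames[frame_idx], gt_boxes[frame_idx])
        else:
            overlaps.append(overlap)                      # successful tracking period
            streak = 0
        frame_idx += 1
    accuracy = sum(overlaps) / max(len(overlaps), 1)
    return accuracy, losses
```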

3.3 Evaluation on Subsets

Since we use video streams differing only in domain, together with the original implementations of the trackers, the change in performance can mainly be attributed to the extracted features. The performances on the subsets are displayed in Fig. 3, and Table 1 summarizes the results in one score.

Based on these results, the Re3 and MBMD trackers show better scores in accuracy and robustness on the visible subset than on the infrared subset. Both also perform better on the addition subset, gaining in accuracy and robustness, whereas the l1-norm subset leads to a performance drop.

Interestingly, the MemTrack and SiamMask trackers achieve a better score on the infrared subset than on the visible subset, but regardless of the fusion strategy employed, both achieve better results on the fused subsets, with MemTrack achieving a slightly better score on the addition subset and SiamMask on the l1-norm subset.

Table 1. Average score (combination of average accuracy and robustness) for each tracker on every subset.
Table 2. Relative performance variation between a domain subset (DS) and a fused subset (FS) depending on the tracker.

In contrast to the previous trackers, the DaSiamRPN tracker does not react as positively as expected. Indeed, its best performance is achieved on the visible subset, even though the score on the addition subset is close. The worst score for this tracker is achieved on the l1-norm subset.

Aside from the DaSiamRPN tracker, all trackers benefit from a fusion of the visible and infrared spectra at the input stage, as shown in Table 2. We also notice that, even though the FCNN parts of the trackers differ from each other, some react similarly, such as Re3 and MBMD or MemTrack and SiamMask. For instance, MemTrack and SiamMask respond very well to both fusion strategies, and Re3 displays a score variation similar to that of MBMD on VIS-Add.

3.4 Fused Subsets Versus Domain Subsets

Based on Table 3, we notice that the lowest score distribution occurs on the infrared subset, regardless of the tracker. This indicates that the infrared subset is the most difficult one in which to extract an appearance model that separates the target from its surroundings, which is coherent since the trackers were originally trained on visible images.

Fig. 3. Average accuracy and robustness of the reference trackers on the subsets. Trackers reaching the top-right corner of the graphic display better performance.

DaSiamRPN responds strongly on 48% of the video streams of the visible subset and poorly on 22%, with a standard score deviation of 0.229, whereas 13% of the video streams of the addition subset elicit a high response from the tracker and only 4% a low response, with a standard score deviation of 0.196. Although DaSiamRPN responds strongly on a higher number of video streams from the visible subset, it also responds poorly on a higher number in comparison to the addition subset. Thus, using a fused subset enables the tracker to track more robustly on a wider diversity of sequences, even if the fused subset does not outperform the score of the visible subset.

Re3 and MBMD score better on 35% and 39% of the video streams of the addition subset, respectively, although Re3 also responds strongly on the same number of video streams in the visible subset. We note that both respond poorly on a low number of video streams from the addition subset and achieve a low standard score deviation on the fused subsets, indicating that a fused input stream enhances their performance on a larger number of streams and allows a more robust approach.

Oddly enough, MemTrack and SiamMask show a strong response on 43% of the video streams of the infrared subset and a poor response on 39% of them, although they were originally trained on visible imagery. Yet, both yield a low number of strong and poor scores on the video streams from the fused subsets. The standard score deviation of both trackers is also lower on the fused subsets than on the domain subsets, even though MemTrack has its lowest deviation on the infrared subset. The results for SiamMask indicate that using video streams from the fused subsets reduces the standard score deviation, making the tracker more stable and also improving the results on a variety of sequences. MemTrack, despite having a lower standard score deviation on the infrared domain, still performs better on the fused subsets, suggesting that the fused subsets did not necessarily increase stability, but increased the overall scores of the tracker on the sequences that previously gave poor scores on the domain subsets.

Table 3. Highest and lowest score distribution of the trackers on the video streams from the subsets, and standard score deviation of the trackers on the subsets.

In most cases, the usage of fused subsets proved to be beneficial for stability, as the scores were more balanced and enhanced overall as a whole, rather than on individual sequences.

3.5 Special Case Analysis

In this section, we look at video streams that exhibit high score variations from a domain subset to a fused subset, regardless of whether the variation is positive or negative. Although the trackers react very differently to the video streams, there are special cases where a general tendency can be observed.

Synchronized frames of video stream number 3 from Camel [9] are shown in Fig. 4. The video stream shows a score increase on the fused subsets compared to the infrared subset, but a lower score compared to the visible subset. In this video stream, the potential targets belong to the same class and are clustered together, making the tracking process more challenging in the infrared domain because they all look alike in the infrared spectrum. In the visible domain, on the other hand, color and texture can be used to discriminate the target from its surroundings, easing the tracking process. Depending on the fusion strategy, however, color and texture are removed to some extent and whitened, while useful features from the visible domain are kept to a degree, allowing a performance increase compared to the infrared domain.

Fig. 4. Synchronized frames of video stream 03 from Camel [9] in (from left to right) the visual, infrared, addition and l1-norm subsets, which favours the visible domain due to the presence of color and texture in a crowded scene. (Color figure online)

Matching frames from video stream 9 from Camel [9] are displayed in Fig. 5. Contrary to the previous example, this video stream shows a performance gain on the fused subsets compared to the visible subset and a drop compared to the infrared subset. Most of the potential targets (i.e. cars) undergo a sudden change in luminosity when passing under the shadows, which is a difficult attribute for visual trackers to handle. On the infrared subset, on the other hand, the potential targets remain very clear, and since they are also clearly separated from each other, no clutter attribute is present to increase the difficulty of the tracking process in the infrared domain. By fusing both streams, the whitened target benefits from a constant white color that does not fade away under the shadows, contrary to colors and textures.

Fig. 5. Synchronized frames of video stream 09 from Camel [9] in (from left to right) the visual, infrared, addition and l1-norm subsets, which favours the infrared domain due to the sudden illumination change. (Color figure online)

Presented in Fig. 6 are four synchronized frames from video stream number 15 of Camel [9], which was recorded during night time. The trackers show better performance on the fused subset versions of this stream than on the domain subsets. Because of the gloomy environment, tracking in the visible domain is very difficult, and tracking in the infrared domain is naturally better suited to these conditions. However, a fused version of both domains offers a more robust alternative to the visible subset and an increase in accuracy compared to the infrared subset.

Fig. 6. Synchronized frames of video stream 15 from Camel [9] in (from left to right) the visual, infrared, addition and l1-norm subsets, favouring the fused subsets.

Synchronized frames of video stream number 7 from Camel [9] are shown in Fig. 7. The trackers show better results on the domain subsets than on the fused versions. Due to the environment, a fused approach makes tracking more difficult, since the whitened targets blend in with the white background and the snowy weather. Tracking in the visible domain is thus easier under these conditions, since the tracking process can rely on color and texture features. Also, when using the infrared domain, the target is even easier to discriminate from the surrounding background, as shown in Fig. 7.

Fig. 7. Synchronized frames of video stream 07 from Camel [9] in (from left to right) the visual, infrared, addition and l1-norm subsets, where neither of the fused subsets outperforms a domain subset. (Color figure online)

4 Conclusion

We presented an evaluation of state-of-the-art visual trackers applied to visible, infrared, and fused input streams. In contrast to using a single domain for tracking, where domain-specific attributes, i.e. illumination change in the visible domain or clutter in the infrared domain, can be difficult to deal with, a fused approach at the input stage can be effective at handling those attributes. Indeed, using early-fused input streams shows a tendency to enhance performance, enabling more robust and accurate tracking, and additionally allows handling diverse types of sequences under various conditions and attributes more robustly. Depending on the fusion strategy, performance can improve or diminish. However, an ideal fusion strategy would enable the tracker to perform on the fused subset as well as it would on an adequate domain subset.