
1 Introduction

Person re-identification (ReID) is of primary importance for tasks such as video surveillance and photo-tagging. It has been the focus of intense research in recent years. While modern methods provide excellent results in well-lit environments, their performance is not robust without a suitable light source.

Fig. 1.

Overview of color-thermal ReID using our ThermalGAN framework. We transform a single color probe image A into a multimodal thermal probe set \(\{B_1, \ldots , B_i\}\). We use thermal signatures I to perform matching with a real thermal gallery set. (Color figure online)

Infrared cameras perceive thermal emission that is invariant to lighting conditions. However, the challenges of cross-modality color-infrared matching reduce the benefits of night-mode infrared cameras. Cross-modality color-to-thermal matching has recently received considerable attention [35, 37, 51, 52]. Multiple datasets with infrared images [33, 35, 37, 47] have been developed for cross-modality infrared-to-color person ReID. Thermal cameras operate in the longwave infrared range (LWIR, 8–14 \(\mu \)m) and provide real temperatures of a person's body, which are more stable to viewpoint changes than near-infrared images [45, 54, 58].

This paper is focused on the development of a ThermalGAN framework for color-thermal cross-modality person ReID. We use the assumptions of Bhuiyan et al. [4] and Zhu et al. [62] as a starting point to develop a color-to-thermal transfer framework for cross-modality person ReID. We perform matching using calibrated thermal images to benefit from the stability of surface temperatures to changes in light intensity and viewpoint. Matching is performed in two steps. Firstly, we model a person's appearance in a thermal image conditioned by a color image: we generate a multimodal thermal probe set from a single color probe image using a generative adversarial network (GAN). Secondly, we perform matching in the thermal domain using the synthesized thermal probe set and a real thermal gallery set (Fig. 1).

We collected a new ThermalWorld dataset to train our GAN framework and to test ReID performance. The dataset contains two parts: ReID and Visual Objects in Context (VOC). The ReID split includes 15118 pairs of color and thermal images and 516 person IDs. The VOC part is designed for training color-to-thermal translation using a GAN framework. It consists of 5098 pairs of color and thermal images with pixel-level annotations for ten classes: person, car, truck, van, bus, building, cat, dog, tram, boat.

We evaluate baselines and our framework on the ThermalWorld dataset. The results are encouraging and show that our ThermalGAN framework achieves and surpasses the state of the art in cross-modality color-thermal ReID. The ThermalGAN framework can perform matching of a color probe image with a thermal gallery set in video surveillance applications.

Section 2 reviews related work. Section 3 presents the structure of the developed ThermalWorld dataset. In Sect. 4 we describe the ThermalGAN framework and thermal signature-based matching. In Sect. 5 we give an evaluation of ReID baselines and the ThermalGAN framework on the ThermalWorld dataset.

1.1 Contributions

We present three main contributions: (1) the ThermalGAN framework for color-to-thermal image translation and ReID, (2) a large-scale multispectral ThermalWorld dataset with two splits: ReID, with 15118 color-thermal image pairs and 516 person IDs, and VOC, with 5098 color-thermal image pairs with ground-truth pixel-level annotations of ten object classes, (3) an evaluation of modern cross-modality ReID methods on the ThermalWorld ReID dataset.

2 Related Work

2.1 Person Re-identification

Person re-identification has been intensively studied by the computer vision community in recent years [3, 4, 9, 12, 40, 47]. While new methods steadily improve matching performance, video surveillance applications still pose challenges for ReID systems. Recent person ReID methods can be divided into three kinds of approaches [4]: direct methods, metric learning methods, and transform learning methods.

In [4] an overview of modern ReID methods was given, and a new transform learning-based method was proposed. The method models the appearance of a person in a new camera using a cumulative weight brightness transfer function (CWBTF). It leverages a robust segmentation technique to segment the human image into meaningful parts. Matching features extracted only from the body area provides increased ReID performance. The method also exploits multiple pedestrian detections to improve the matching rate.

While the method provides state-of-the-art performance on color images, night-time applications require additional modalities to perform robust matching in low-light conditions. Cross-modality color-infrared matching is gaining increasing attention. Multiple multispectral datasets have been collected in recent years [33, 35, 37, 47, 51]. The SYSU-MM01 dataset [47] includes unpaired color and near-infrared images. The RegDB dataset [51] presents color and infrared images for the evaluation of cross-modality ReID methods. Evaluation of modern methods on these datasets has shown that color-infrared matching is challenging. Nevertheless, it increases ReID robustness during the night-time.

Thermal cameras have received a lot of attention in the field of video surveillance [8, 53]. While they provide a significant boost in pedestrian detection [42, 50] and in ReID with paired color and thermal images [33], cross-modality person ReID remains challenging [33,34,35,36,37] due to severe changes in a person's appearance between color and thermal images.

Recently, generative adversarial networks (GAN) [13] have demonstrated encouraging results in arbitrary image-to-image translation applications. We hypothesize that color-to-thermal image translation using a dedicated GAN framework can increase color-thermal ReID performance.

2.2 Color-to-Thermal Translation

Transformation of the spectral range of an image has been intensively studied in such fields as transfer learning [39, 46], domain adaptation [23,24,25, 32] and cross-domain recognition [1, 17, 19, 21, 47, 49, 54, 55]. In [30] a deep convolutional neural network (CNN) was proposed for translating a near-infrared image to a color image. The approach is similar to the image colorization [15, 56] and style transfer [27, 44] problems that have been actively studied in recent years. The proposed architecture succeeded in translating near-infrared images to color images. Transformation of LWIR images is more challenging due to the low correlation between the red channel of a color image and a thermal image.

2.3 Generative Adversarial Networks

GANs have significantly increased the quality of image-to-image translation [20, 54, 58] using an antagonistic game approach [13]. Isola et al. [20] proposed the pix2pix GAN framework for arbitrary image transformations using geometrically aligned image pairs sampled from source and target domains. The framework succeeded in performing arbitrary image-to-image translations such as season change and object transfiguration. Zhang et al. [54, 58] trained the pix2pix network to transform a thermal image of a human face to a color image. The translation improved face recognition performance in a cross-modality thermal-to-visible setting. While the human face has a relatively stable temperature, color-thermal image translation for the whole human body with an arbitrary background is more ambiguous and is conditioned by the sequence of events that have occurred to a person.

We hypothesize that multimodal image generation methods can model multiple possible thermal outputs for a single color probe image, and that such modeling can improve ReID performance. Zhu et al. proposed the BicycleGAN framework [63] for modeling a distribution of possible outputs in a conditional generative modeling setting. To resolve the ambiguity of the mapping, Zhu et al. used a randomly sampled low-dimensional latent vector. The latent vector is recovered by an encoder network from the generated image and compared to the original latent vector to provide an additional consistency loss. We propose a conditional color-to-thermal translation framework for modeling a set of possible person appearances in a thermal image conditioned by a single color image.

3 ThermalWorld Dataset

We collected a new ThermalWorld dataset to train and evaluate our cross-modality ReID framework. The dataset was collected with multiple FLIR ONE PRO cameras and is divided into two splits: ReID and Visual Objects in Context (VOC). The ReID split includes 15118 aligned color and thermal image pairs of 516 IDs. The VOC split was created for effective training of color-to-thermal translation GANs. It contains 5098 color and thermal image pairs with pixel-level annotations of ten object classes: person, car, truck, van, bus, building, cat, dog, tram, boat.

Initially, we tried to train a color-to-thermal translation GAN model using only the ReID split. However, the trained network had poor generalization ability due to the limited number of object classes and backgrounds. This motivated us to collect a large-scale dataset with aligned pairs of color and thermal images. The rest of this section presents a brief dataset overview. For more details on the dataset, please refer to the supplementary material.

3.1 ThermalWorld ReID Split

The ReID split includes pairs of color and thermal images captured by sixteen FLIR ONE PRO cameras. Sample images from the dataset are presented in Fig. 2. All cameras were located in a shopping mall area. Cameras #2, 9, 13 are located in underground passages with low-light conditions. Cameras #1, 3, 7, 8, 10, 12, 14 are located at the entrances and provide both day-time and night-time images. Cameras #15, 16 are placed in the garden. The rest of the cameras are located inside the mall.

Fig. 2.

Examples of person images from ThermalWorld ReID dataset.

Table 1. Comparison to previous ReID datasets. #/# represents the number of color/infrared images or cameras.

We developed a dedicated smartphone application to record sequences of thermal images using the FLIR ONE PRO. A comparison to previous ReID datasets is presented in Table 1.

3.2 ThermalWorld VOC Split

The VOC split of the dataset was collected using two FLIR ONE PRO cameras. We build on insights from the developers of previous multispectral datasets [2, 8, 11, 19, 43, 55] and provide new object classes with pixel-level annotations. The images were collected in different cities (Paris, Strasbourg, Riva del Garda, Venice) during all seasons and in different weather conditions (sunny, rain, snow). Captured object temperatures range from −20 \(^{\circ }\)C to +40 \(^{\circ }\)C.

We hypothesized that training a GAN to predict the relative temperature contrasts of an object (e.g., clothes/skin) instead of its absolute temperature can improve translation quality. Inspired by previous work on the explicit encoding of multiple modes in the output [63], we assumed that a thermal segmentation that provides the average temperatures of the emitting objects in the scene could resolve the ambiguity of the generated thermal images. A sketch of how such a thermal segmentation can be derived from the annotations is given below. Examples from the ThermalWorld VOC dataset are presented in Fig. 3.
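As an illustration, the following minimal sketch shows how a thermal segmentation target could be built from a calibrated thermal image and the pixel-level annotations; the function and variable names are ours and are not part of any released dataset tools.

import numpy as np

def thermal_segmentation(thermal, labels):
    """Replace every annotated region with its mean temperature.

    thermal -- HxW array of calibrated temperatures in degrees Celsius
    labels  -- HxW integer array of pixel-level object ids (0 = background)
    """
    seg = np.zeros_like(thermal)
    for obj_id in np.unique(labels):
        mask = labels == obj_id
        seg[mask] = thermal[mask].mean()  # average temperature of this object
    return seg

# The relative contrasts that remain to be predicted are then simply
# relative = thermal - thermal_segmentation(thermal, labels)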

Fig. 3.

Examples of annotated images in ThermalWorld VOC dataset.

We manually annotated the dataset to automatically extract object temperatures from the thermal images. A comparison to previous multispectral datasets and examples of all classes are presented in the supplementary material.

4 Method

Color-to-thermal image translation is challenging because there are multiple possible thermal outputs for a single color input. For example, a person on a cold autumn day and on a hot summer afternoon may have a similar appearance in the visible range, but the skin temperature will be different.

We experimented with multiple state-of-the-art GAN frameworks [5, 20, 26, 63] for multimodal image translation on the color-to-thermal task. We found that GANs can predict object temperature with an accuracy of approximately 5 \(^{\circ }\)C.

However, thermal images must have an accuracy of 1 \(^{\circ }\)C to make local body temperature contrasts (e.g., skin/cloth) distinguishable. To improve the translation quality we developed a two-step approach inspired by [48]. We observed that relative thermal contrasts (e.g., eyes/brow) are nearly invariant to changes in the average body temperature caused by different weather conditions.

We hypothesize that a prediction of relative thermal contrasts can be conditioned by an input color image and average temperatures for each object. Thus, we predict absolute object temperatures in two steps (Fig. 4). Firstly, we predict average object temperatures from an input color image. We term the resulting image a “thermal segmentation” image \(\hat{S}\).

Secondly, we predict the relative local temperature contrasts \(\hat{R}\), conditioned by a color image A and a thermal segmentation \(\hat{S}\). The sum of a thermal segmentation and temperature contrasts provides the thermal image: \(\hat{B} = \hat{S} + \hat{R}\).

Fig. 4.

Overview of our ThermalGAN framework.

The sequential thermal image synthesis has two advantages. Firstly, the problem remains multimodal only in the first step (generation of the thermal segmentation). Secondly, the quality of thermal contrast prediction is increased due to the lower standard deviation and reduced range of temperatures.

To address the multimodality of color-to-thermal translation, we use a modified BicycleGAN framework [63] to synthesize multiple thermal segmentation images for a single color image. Instead of a random noise sample, we use a temperature vector \(T_i\), which contains the desired background and object temperatures.

The rest of this section presents a brief introduction to conditional adversarial networks, details on the developed ThermalGAN framework and features used for thermal signature matching.

4.1 Conditional Adversarial Networks

Generative adversarial networks produce an image \(\hat{B}\) for a given random noise vector z, \(G : z \rightarrow \hat{B}\) [13, 20]. A conditional GAN receives extra information A in addition to the vector z, \(G : \{A, z\} \rightarrow \hat{B}\). Usually, A is an image that is transformed by the generator network G. The discriminator network D is trained to distinguish “real” images from the target domain B from the “fakes” \(\hat{B}\) produced by the generator. Both networks are trained simultaneously. The discriminator provides the adversarial loss that forces the generator to produce “fakes” \(\hat{B}\) that cannot be distinguished from “real” images B.
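For reference, the sketch below shows one adversarial update in a pix2pix-style conditional GAN [20]; it is a generic illustration under the assumptions of a logit-output discriminator and an added \(L^1\) term, not the exact ThermalGAN training code.

import torch
import torch.nn.functional as F

def cgan_step(G, D, opt_G, opt_D, A, B, lam_l1=100.0):
    """One conditional GAN update; A is the conditioning image, B the real target."""
    # Discriminator: distinguish real pairs (A, B) from fake pairs (A, G(A)).
    B_fake = G(A).detach()
    d_real = D(torch.cat([A, B], dim=1))
    d_fake = D(torch.cat([A, B_fake], dim=1))
    loss_D = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Generator: fool the discriminator and stay close to the target (L1 term).
    B_fake = G(A)
    d_fake = D(torch.cat([A, B_fake], dim=1))
    loss_G = (F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
              + lam_l1 * F.l1_loss(B_fake, B))
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()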

We train two GAN models. The first generator \(G_1 : \{T_i, A\} \rightarrow \hat{S}_i\) synthesizes multiple thermal segmentation images \(\hat{S}_i \in \mathbb {R}^{W \times H}\) conditioned by a temperature vector \(T_i\) and a color image \(A \in \mathbb {R}^{W \times H \times 3}\). The second generator \(G_2 : \{\hat{S}_i, A\} \rightarrow \hat{R}_i\) synthesizes thermal contrasts \(\hat{R}_i \in \mathbb {R}^{W \times H}\) conditioned by a thermal segmentation \(\hat{S}_i\) and the input color image A. We can produce multiple realistic thermal outputs for a single color image by changing only the temperature vector \(T_i\).
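At inference time the two generators are chained as sketched below; the exact encoding of the temperature vector \(T_i\) as additional spatial input channels is our assumption.

import torch

def synthesize_thermal(G1, G2, A, temperature_vectors):
    """Generate one thermal probe image per temperature vector for a color probe A.

    A                   -- color image tensor of shape (1, 3, H, W)
    temperature_vectors -- iterable of tensors T_i (assumed here to be spatial maps
                           of shape (1, C_T, H, W) encoding object/background temperatures)
    """
    probes = []
    for T_i in temperature_vectors:
        S_hat = G1(torch.cat([A, T_i], dim=1))    # thermal segmentation, (1, 1, H, W)
        R_hat = G2(torch.cat([A, S_hat], dim=1))  # relative thermal contrasts
        probes.append(S_hat + R_hat)              # absolute temperatures, B = S + R
    return probes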

4.2 Thermal Segmentation Generator

We use the modified BicycleGAN framework for the thermal segmentation generator \(G_1\). Our contribution to the original U-Net generator [41] is the addition of one convolutional layer and one deconvolutional layer to increase the output resolution. We use average background temperatures instead of the random noise sample to be able to control the appearance of the thermal segmentation. The loss function for the generator \(G_1\) is given by [63]:

$$\begin{aligned}&G_1^{*}(G_1,D_1)= \text {arg } {\mathop {\hbox {min}}\limits _{G_1}} \, {\mathop {\hbox {max}}\limits _{D_1}} \, \mathcal {L}_{GAN}^{VAE}(G_1,D_1,E)+\lambda \mathcal {L}_1(G_1,E) \nonumber \\&\qquad \qquad \quad + \mathcal {L}_{GAN}(G_1,D_1) + \lambda _{\text {thermal}}\mathcal {L}_1^{\text {thermal}}(G_1,E) + \lambda _{KL}\mathcal {L}_{KL}(E), \end{aligned}$$
(1)

where \(\mathcal {L}_{GAN}^{VAE}\) is the variational-autoencoder-based loss [63] that encourages the output to be multimodal, \(\mathcal {L}_1\) is an \(L^1\) loss, \(\mathcal {L}_{GAN}\) is the loss provided by the discriminator \(D_1\), \(\mathcal {L}_1^{\text {thermal}}\) is a loss in the latent temperature domain, \(\mathcal {L}_{KL}\) is the Kullback–Leibler divergence loss, E is the encoder network, and the \(\lambda \) terms are weight parameters. We train both generators independently.
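In code, the objective of Eq. (1) reduces to a weighted sum of precomputed loss terms; the weight values below are placeholders, not the ones used in our experiments.

def g1_objective(l_gan_vae, l_l1, l_gan, l_l1_thermal, l_kl,
                 lam=10.0, lam_thermal=10.0, lam_kl=0.01):
    """Weighted combination of the Eq. (1) terms for the generator G1."""
    return (l_gan_vae + lam * l_l1 + l_gan
            + lam_thermal * l_l1_thermal + lam_kl * l_kl)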

4.3 Relative Thermal Contrast Generator

We hypothesize that the distribution of relative thermal contrasts conditioned by a thermal segmentation and a color image is unimodal (compare images B and R for various background temperatures in Fig. 5). Hence, we use the unimodal pix2pix framework [20] as a starting point for our relative contrast generator \(G_2\). Our contribution to the original framework is two-fold. Firstly, we make the same modifications to the generator \(G_2\) as for the generator \(G_1\). Secondly, we use a four-channel input tensor: the first three channels are the RGB channels of the image A, and the fourth channel is the thermal segmentation \(\hat{S}_i\) produced by the generator \(G_1\). We sum the outputs of the two generators to obtain an absolute temperature image.

Fig. 5.

Comparison of relative contrast R and absolute temperature B images for various camera locations. The relative contrast image R is equal to the difference between the absolute temperature image B and the thermal segmentation S. Note that the person's appearance is invariant to background temperature in relative contrast images.

4.4 Thermal Signature Matching

We use an approach similar to [4] to extract discriminative features from thermal images. The original feature signature is extracted in three steps [4]: (1) a person's appearance is transferred from camera \(C_k\) to camera \(C_l\), (2) the body region is separated from the background using stel component analysis (SCA) [22], (3) a feature signature is extracted using color histograms [9] and maximally stable color regions (MSCR) [10].

However, thermal images contain only a single channel, so we adapt the color features to the absolute temperature domain. We use the monochrome ancestor of MSCR, maximally stable extremal regions (MSER) [31], and we use temperature histograms instead of color histograms. The resulting matching method includes four steps. Firstly, we transform the person's appearance from a single color probe image A to multiple thermal images \(\hat{B}_i\) using the ThermalGAN framework. The various images model possible variations of temperature from camera to camera. Unlike the original approach [4], we do not train the method to transfer a person's appearance from camera to camera. Secondly, we extract body regions using SCA from the real thermal images \(B_j\) in the gallery set and the synthesized images \(\hat{B}_i\). After that, we extract thermal signatures I from the body regions

$$\begin{aligned} I = f(B) = \bigg [H_t(B), f_{\text {MSER}}(B)\bigg ], \end{aligned}$$
(2)

where \(H_t\) is a histogram of body temperatures, \(f_{\text {MSER}}\) is MSER blobs for an image B.

Finally, we compute the distance between two signatures as a weighted combination of the Bhattacharyya distance between temperature histograms and the MSER distance [6, 31]:

$$\begin{aligned} d(I_i, I_j) = \beta _{H} \, d_{H}\big (H_t(B_i), H_t(B_j)\big ) + d_{\text {MSER}}\big (f_{\text {MSER}}(B_i), f_{\text {MSER}}(B_j)\big ), \end{aligned}$$
(3)

where \(d_{H}\) is the Bhattacharyya distance, \(d_{\text {MSER}}\) is the MSER distance [31], and \(\beta _{H}\) is a calibration weight parameter.
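A simplified sketch of the signature extraction of Eq. (2) and the histogram term of Eq. (3) is given below; the temperature range, bin count, and the omission of the MSER distance term are our simplifications, and OpenCV's MSER detector is applied to an 8-bit rendering of the temperature image.

import numpy as np
import cv2

def thermal_signature(B, body_mask, t_range=(20.0, 40.0), bins=64):
    """Eq. (2): temperature histogram and MSER blobs for the body region of B."""
    hist, _ = np.histogram(B[body_mask], bins=bins, range=t_range, density=True)
    gray = cv2.normalize(B, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    regions, _ = cv2.MSER_create().detectRegions(gray)
    return hist, regions

def bhattacharyya(h1, h2, eps=1e-12):
    """Bhattacharyya distance d_H between two temperature histograms."""
    h1 = h1 / (h1.sum() + eps)
    h2 = h2 / (h2.sum() + eps)
    return -np.log(np.sum(np.sqrt(h1 * h2)) + eps)

# The full distance of Eq. (3) additionally includes the MSER-based region distance
# weighted against beta_H * d_H; that term is omitted in this sketch.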

5 Evaluation

5.1 Network Training

The ThermalGAN framework was trained on the VOC split of the ThermalWorld dataset using the PyTorch library [38]. The VOC split includes indoor and outdoor scenes to avoid domain shift. Training was performed on an NVIDIA 1080 Ti GPU and took 76 h for \(G_1\), \(D_1\) and 68 h for \(G_2\), \(D_2\). For network optimization, we use minibatch SGD with an Adam solver. We set the learning rate to 0.0002 with momentum parameters \(\beta _1 = 0.5\), \(\beta _2 = 0.999\), similar to [20].
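The corresponding optimizer configuration, written as a PyTorch sketch (the generator and discriminator modules G and D are assumed to be defined elsewhere), is:

import torch

def make_optimizers(G, D, lr=2e-4, betas=(0.5, 0.999)):
    """Adam solvers with the learning rate and momentum parameters reported above."""
    opt_G = torch.optim.Adam(G.parameters(), lr=lr, betas=betas)
    opt_D = torch.optim.Adam(D.parameters(), lr=lr, betas=betas)
    return opt_G, opt_D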

5.2 Color-to-Thermal Translation

Qualitative Comparison. For a qualitative comparison of the ThermalGAN model on color-to-thermal translation, we generate multiple thermal images from the independent ReID split of the ThermalWorld dataset. Our goal is to keep the resulting images both realistic and diverse in terms of a person's relative thermal contrasts. We compare our framework with five baselines: pix2pix+noise [20], cLR-GAN [5, 63], cVAE-GAN [26, 63], cVAE-GAN++ [26, 63], and BicycleGAN [63]. All baselines were trained to convert a color image to a grayscale image representing perceptual thermal contrasts (8-bit, grayscale). Our ThermalGAN framework was trained to produce thermal images in degrees Celsius; for comparison, they were converted to perceptual thermal intensities. Results of multimodal thermal image generation are presented in Fig. 6.

The results of pix2pix+noise are unrealistic and do not provide changes of thermal contrast. cLR-GAN and cVAE-GAN produce a slight variation of thermal contrasts but do not translate features meaningful for ReID. cVAE-GAN++ and BicycleGAN produce a diverse output, which fails to model the thermal features present in real images. Our ThermalGAN framework combines the power of the BicycleGAN method with two-step sequential translation to produce output that is both realistic and diverse.

Fig. 6.

Qualitative method comparison. We compare the performance of various multimodal image translation frameworks on the ThermalWorld ReID dataset. For each model, we present three random outputs. The output of the ThermalGAN framework is realistic, diverse, and shows the small temperature contrasts that are important for robust ReID. Note that only the ThermalGAN framework produces output as calibrated temperatures that can be used for thermal signature matching.

Quantitative Evaluation. We use the generated images to perform a quantitative analysis of our ThermalGAN framework and the baselines. We measure two characteristics: diversity and perceptual realism. To measure multimodal reconstruction diversity, we use the averaged Learned Perceptual Image Patch Similarity (LPIPS [57]) distance as proposed in [57, 63]. For each baseline and our method, we calculate the average distance between 1600 pairs of randomly sampled output thermal images, conditioned on 100 input color images. We measure the perceptual realism of the synthesized thermal images with human experts, using an approach similar to [56]. Real and synthesized thermal images are presented to human experts in random order for one second; each expert must indicate whether the image is real or not. We performed the test on Amazon Mechanical Turk (AMT). We summarize the results of the quantitative evaluation in Fig. 7 and Table 2. The results of cLR-GAN, BicycleGAN, and our ThermalGAN framework were the most realistic. Our ThermalGAN model outperforms the baselines in terms of both diversity and perceptual realism.
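The diversity measure can be reproduced with the publicly available lpips package as sketched below; the pair-sampling details and the replication of single-channel thermal images to three channels are our assumptions.

import itertools
import lpips

lpips_fn = lpips.LPIPS(net='alex')  # learned perceptual metric

def diversity(samples):
    """Average LPIPS distance over pairs of generated outputs for one input image.

    samples -- list of image tensors of shape (1, 3, H, W) scaled to [-1, 1];
               single-channel thermal outputs are assumed to be repeated to 3 channels.
    """
    dists = [lpips_fn(a, b).item() for a, b in itertools.combinations(samples, 2)]
    return sum(dists) / len(dists)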

Fig. 7.

Realism vs diversity for synthesized thermal images.

Table 2. Comparison with state-of-the-art multimodal image-to-image translation methods.

5.3 ReID Evaluation Protocol

We use 516 IDs from the ReID split for testing ReID performance. Note that we use the independent VOC split for training the color-to-thermal translation. We use cumulative matching characteristic (CMC) curves and the normalized area under the curve (nAUC) as performance measures. The CMC curve presents recognition performance versus the re-identification ranking score. nAUC is an integral score of the ReID performance of a given method. To keep our evaluation protocol consistent with related work [4], we use 5 pedestrians in the validation set. We also keep the gallery set and the probe set independent, according to common practice.

We use images from color cameras for the probe set and images from thermal cameras for the gallery set. We exclude images from cameras #2, 9, 13 from the probe set because they do not provide meaningful data in the visible range. In the single-shot setting we use a single color image; the ThermalGAN ReID framework uses this single color image to generate 16 different thermal images. Baseline methods use the single input color image according to common practice. For the multi-shot setting, we use ten color images for the probe set; the ThermalGAN framework therefore generates 16 thermal images for each color probe image, i.e., 160 thermal images for the probe set.
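For completeness, a minimal sketch of how CMC curves and nAUC can be computed from a probe-gallery distance matrix is shown below; taking nAUC as the mean of the CMC curve is a simple approximation, not necessarily the exact evaluation code used here.

import numpy as np

def cmc_curve(dist, probe_ids, gallery_ids):
    """CMC from a (num_probe x num_gallery) distance matrix and identity labels."""
    cmc = np.zeros(dist.shape[1])
    for i, pid in enumerate(probe_ids):
        order = np.argsort(dist[i])                                  # closest gallery entries first
        first_hit = np.where(np.asarray(gallery_ids)[order] == pid)[0][0]
        cmc[first_hit:] += 1                                         # matched at this rank or better
    return cmc / len(probe_ids)

def nauc(cmc):
    """Normalized area under the CMC curve (1.0 corresponds to perfect ranking)."""
    return float(cmc.mean())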

5.4 Cross-Modality ReID Baselines

We compare our framework with six baseline models, including the hand-crafted HOG features [7] and modern deep-learning-based cross-modality matching methods: One Stream Network (OSN) [47], Two Stream Network (TSN) [47], Deep Zero-Padding (DZP) [47], Two-stream CNN network (TONE_1) [52], and Modified two-stream CNN network (TONE_2) [51].

5.5 Comparison and Analysis

We show results of a comparison of our framework and the baselines on the ThermalWorld ReID dataset in Table 3 for the single-shot setting and in Table 4 for the multi-shot setting. Results are given in terms of the top-ranked matching rate and nAUC. We present the results as CMC curves in Fig. 8.

Table 3. Experiments on ThermalWorld ReID dataset in single-shot setting.

We make the following observations from the single-shot evaluation. Firstly, the two-stream network [47] performs the worst among the baselines. We assume that the reason is that fine-tuning the network from near-infrared data to the thermal range is not sufficient for effective matching. Secondly, the hand-crafted HOG descriptor [7] provides discriminative features that are present in both color and thermal images and can compete with some of the modern methods. Finally, our ThermalGAN ReID framework succeeds in the realistic translation of meaningful features from color to thermal images and provides discriminative features for effective color-to-thermal matching.

Table 4. Experiments on ThermalWorld ReID dataset in multi-shot setting.

Results in the multi-shot setting are encouraging and show that multiple person detections improve the matching rate in a cross-modality setup. We draw the following observations from the results presented in Table 4 and Fig. 8. Firstly, the performance of the deep-learning-based baselines improves on average by 5%. Secondly, the multi-shot setting improves rank-5 and rank-10 recognition rates. Finally, our ThermalGAN method benefits from the multi-shot setting and can be used effectively with multiple person images provided by robust pedestrian detectors for thermal images [50].

Fig. 8.

CMC plot and nAUC for evaluation of baselines and ThermalGAN method in single-shot setting (left) and multi-shot setting (right).

6 Conclusion

We showed that conditional generative adversarial networks are effective for cross-modality prediction of a person's appearance in a thermal image conditioned by a color probe image. Furthermore, discriminative features can be extracted from real and synthesized thermal images for effective matching of thermal signatures. Our main observation is that thermal cameras coupled with a GAN ReID framework can significantly improve ReID performance in low-light conditions.

We developed the ThermalGAN framework for cross-modality person ReID between visible-range and LWIR images. We collected a large-scale multispectral ThermalWorld dataset to train our framework and compare it to baselines, and we made the dataset publicly available. The evaluation of modern cross-modality ReID methods and our framework shows that our ThermalGAN method achieves state-of-the-art performance in cross-modality color-thermal ReID.