Illumination-aware window transformer for RGBT modality fusion

https://doi.org/10.1016/j.jvcir.2022.103725

Abstract

The combination of RGB and thermal sensors has proven useful for many vision applications. However, how to effectively fuse the information of the two modalities remains a challenging problem. In this paper, we propose an Illumination-Aware Window Transformer (IAWT) fusion module to handle RGB and thermal multi-modality fusion. Specifically, the IAWT fusion module adopts a window-based multi-modality attention combined with additional estimated illumination information. The window-based multi-modality attention infers dependencies across modalities within a local window, thus implicitly alleviating the problem caused by the weak spatial misalignment of RGB and thermal image pairs within a specific dataset. The introduction of the estimated illumination feature enables the fusion module to adaptively merge the two modalities according to the illumination conditions, so as to make full use of the complementary characteristics of RGB and thermal images under different environments. In addition, the proposed fusion module is task-agnostic and data-specific, which means it can be used for different tasks with RGBT inputs. To evaluate the effectiveness of the proposed fusion method, we embed the IAWT fusion module into different networks and conduct experiments on various RGBT tasks, including pedestrian detection, semantic segmentation and crowd counting. Extensive results demonstrate the superior performance of our method.

Introduction

Research progress in computer vision has been rapid in recent years. Current research in computer vision is mainly based on RGB modality inputs. A fundamental limitation of the RGB modality is that it may provide limited visual cues in challenging environments, such as complex backgrounds, dim light or total darkness. However, many vision applications, such as autonomous driving and surveillance systems, are required to operate under all types of weather and lighting conditions. To alleviate these problems, some works have introduced thermal imaging sensors [1], [2], [3], [4], [5] into specific real-world vision applications.

Different from RGB cameras, thermal cameras capture the thermal radiation emitted by objects, which means they do not depend on an external light source and can be used during both daytime and nighttime. Due to the imaging characteristics of RGB and thermal cameras, RGB and thermal images can provide complementary information about the objects of interest in the scene. A variety of works have demonstrated the benefit of combining RGB and thermal modalities for different vision tasks, such as pedestrian detection [1], [6], [7], [8], [9], [10], [11], crowd counting [4], [12], semantic segmentation [3], [13], [14], [15], [16], [17], [18] and salient object detection [19], [20].

Although great progress has been made in the combination of RGBT modalities in recent years, how to effectively fuse the two modalities remains a problem. There are two major challenges in modeling the fusion process of these two modalities. First, the RGBT image pairs within a specific dataset are not strictly aligned [6] due to unsatisfactory data acquisition conditions. As shown in Fig. 1, there is a position shift between the RGB and thermal image pairs of the KAIST [1] multispectral pedestrian detection dataset, so that a single object appears at different positions in the images of the two modalities. Zhang et al. [6] performed a statistical analysis of the KAIST [1] dataset and found that more than half of the pedestrian bounding boxes of the two modalities have a position shift, and the shift distance mostly ranges from 0 to 10 pixels. However, most existing RGBT fusion methods combine the two modalities under a spatial alignment assumption, which may confuse the model training process, especially for tasks that are sensitive to spatial locations (e.g., detection and segmentation). Second, the characteristics of RGBT data are not fully taken into account when combining these two modalities. Most previous works follow an adaptive weighting fusion scheme, which is a general fusion method applicable to most types of modalities. However, the RGB and thermal modalities have their own characteristics, and the fusion process for RGBT modalities should exploit them. One of the major characteristics is that the two modalities play different roles under different illumination conditions. As shown in Fig. 2, when the environment illumination is strong (e.g., during the daytime), RGB images can provide color and texture details about the objects, while thermal images may suffer from indistinct object edges and low contrast. On the contrary, when the environment illumination is weak (e.g., during the nighttime), RGB images lose visual cues, while objects with higher temperatures than their surroundings remain distinguishable in thermal images. A convincing RGBT fusion method is expected to take such differences into consideration. Several works have already addressed the above two issues. For the spatial misalignment problem, Zhang et al. [11] proposed an RFA module to shift and align the features of two weakly-aligned modalities. However, this method requires annotations for both modalities, which are difficult to acquire in real-world application environments. To leverage the characteristics of RGBT data, [8] weighted the RGB and thermal decisions via a gated function defined over the estimated illumination value. However, illumination information is an important factor in the RGBT modality fusion process, not a decisive one.

In this paper, we propose an illumination-aware window transformer (IAWT) fusion module to handle the RGB and thermal modality fusion problem. Specifically, the IAWT module adopts a window-based multi-modality attention combined with estimated illumination information. For the modality spatial misalignment problem, we do not explicitly align the two modalities like previous work [11], which requires annotations from both modalities. Instead, we learn representations from weakly-aligned RGBT modalities directly and adopt a window-wise fusion strategy to implicitly alleviate the problem caused by modality spatial misalignment. It is worth noting that although there is a spatial misalignment problem, the offset between RGBT image pairs is not large. As shown in Fig. 1, the car in the two images has a position shift; however, it still falls within the same local window. Therefore, we partition RGBT image pairs into local windows, within which a transformer block enables cross-modality interactions at the scale of local windows, making the fusion model robust to weak spatial misalignment. In order to make full use of the characteristics of the RGBT modalities, following [8], we introduce illumination information into our window-based transformer block to form the final illumination-aware window transformer (IAWT) block. First, we design a classification network to estimate the shooting time (day or night) of each image pair and regard the deep features of the classification network as an illumination token. Then, this illumination token participates in the calculation of Multi-head Self-Attention (MSA) within each window. In addition, different from previous works, which independently design a fusion model for each task, our fusion module is data-specific and task-agnostic, which means it can be used for various tasks with RGBT inputs.
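As a rough illustration of this mechanism, the sketch below shows how an illumination token can join window-based multi-head attention over RGB and thermal tokens. It is not the authors' released implementation: the tensor shapes, the single shared attention layer, and the way the token is appended per window are assumptions made for clarity.

```python
import torch
import torch.nn as nn


class IlluminationAwareWindowAttention(nn.Module):
    """Sketch of illumination-aware attention inside one local window.

    RGB and thermal tokens of the same window are stacked, an illumination
    token is appended, and standard multi-head self-attention lets every
    token attend across modalities. Shapes and design details are
    illustrative assumptions, not the paper's exact configuration.
    """

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_win, thermal_win, illum_token):
        # rgb_win, thermal_win: (num_windows, window_area, dim)
        # illum_token:          (1, dim) global illumination feature (e.g., from an IEN)
        tokens = torch.cat([rgb_win, thermal_win], dim=1)   # stack both modalities
        illum = illum_token.expand(tokens.size(0), 1, -1)   # broadcast to every window
        tokens = torch.cat([tokens, illum], dim=1)          # append illumination token
        fused, _ = self.attn(tokens, tokens, tokens)        # cross-modality window attention
        fused = self.norm(fused + tokens)                   # residual connection + norm
        return fused[:, :-1, :]                             # drop the illumination token


# Toy usage: four 8x8 windows with 96-dim features.
if __name__ == "__main__":
    block = IlluminationAwareWindowAttention(dim=96)
    rgb = torch.randn(4, 64, 96)
    thermal = torch.randn(4, 64, 96)
    illum = torch.randn(1, 96)
    out = block(rgb, thermal, illum)   # (4, 128, 96) fused RGB+thermal tokens
```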

The main contributions of this paper can be summarized as follows:

  • We propose a task-agnostic Illumination-Aware Window Transformer (IAWT) fusion module to handle the RGB and thermal modality fusion problem. The IAWT fusion module combines the RGB and thermal modalities with an illumination-aware attention mechanism within partitioned 3D windows, which considers the characteristics of RGBT data under different illumination environments and is robust to the weak spatial misalignment of the two modalities in a specific dataset.

  • We propose an illumination-aware window-based multi-head self-attention (IAW-MSA) block to fuse the RGB and thermal features. The attention scores of IAW-MSA come from two sources. One is the self-attention within local 3D windows, which can infer local dependencies across modalities and thus implicitly alleviates the problem caused by the modality spatial misalignment of specific RGBT datasets. The other is the embedding of illumination features estimated by an illumination estimation network (IEN), which guides the fusion process with illumination information; a sketch of such an IEN is given after this list.

  • We conduct extensive experiments on various RGBT tasks, including pedestrian detection, semantic segmentation and crowd counting. We achieve state-of-the-art performance on two RGBT crowd counting datasets (DroneRGBT and RGBT-CC) and obtain performance improvements on the KAIST RGBT pedestrian detection dataset and an urban scene RGBT semantic segmentation dataset.
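For the illumination estimation network referenced in the second contribution, a minimal sketch is given below, assuming a small day/night classifier whose pooled deep feature doubles as the illumination token; the actual backbone, feature dimensions, and training details are not specified in this excerpt and are assumptions.

```python
import torch
import torch.nn as nn


class IlluminationEstimationNetwork(nn.Module):
    """Sketch of an IEN: a tiny CNN classifies day vs. night from the RGB
    image, and its pooled deep feature is reused as the illumination token.
    Layer sizes are illustrative assumptions."""

    def __init__(self, feat_dim=96):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(32, feat_dim)  # illumination feature / token
        self.cls = nn.Linear(feat_dim, 2)    # day / night logits

    def forward(self, rgb_image):
        # rgb_image: (batch, 3, H, W)
        feat = self.backbone(rgb_image).flatten(1)  # (batch, 32)
        illum_token = self.proj(feat)               # (batch, feat_dim)
        day_night_logits = self.cls(illum_token)    # supervised with day/night labels
        return illum_token, day_night_logits
```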

Section snippets

RGBT modality fusion

Techniques for RGBT modality fusion, which cover a variety of application domains such as pedestrian detection, semantic segmentation and crowd counting, have been widely investigated. We first review existing RGBT fusion methods for these three tasks separately.

Overall architecture

The overall architecture of the proposed Illumination-Aware Window Transformer (IAWT) fusion network is depicted in Fig. 3(a). It consists of five major components: a two-stream feature extractor to extract feature representations of each modality separately, a window partition layer to divide the feature maps into non-overlapping windows, an Illumination Estimation Network (IEN) to extract the global illumination feature of the input image, an Illumination-Aware Window Transformer…
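To make the data flow between these components concrete, the following single-image (batch size 1) sketch composes the IlluminationEstimationNetwork and IlluminationAwareWindowAttention sketches shown earlier: two lightweight stems stand in for the two-stream extractor, features are partitioned into non-overlapping windows, fused under the illumination token, and merged back into a feature map for a task-specific head. The window size, the stems, and the summation-based merge are assumptions; the snippet above is truncated, so the remaining components are not reproduced here.

```python
import torch
import torch.nn as nn


def window_partition(x, win=8):
    # (B, C, H, W) -> (B * num_windows, win * win, C); assumes H and W divisible by win
    B, C, H, W = x.shape
    x = x.view(B, C, H // win, win, W // win, win)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(-1, win * win, C)


def window_merge(x, B, C, H, W, win=8):
    # Inverse of window_partition: (B * num_windows, win * win, C) -> (B, C, H, W)
    x = x.view(B, H // win, W // win, win, win, C)
    return x.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)


class IAWTFusionSketch(nn.Module):
    """Schematic pipeline: two-stream stems -> window partition -> IEN token
    -> per-window illumination-aware fusion -> window merge. The stems and
    the summation of the fused RGB/thermal halves are stand-in assumptions."""

    def __init__(self, dim=96, win=8):
        super().__init__()
        self.win = win
        self.rgb_stem = nn.Conv2d(3, dim, 3, padding=1)      # stand-in for the RGB stream
        self.thermal_stem = nn.Conv2d(1, dim, 3, padding=1)  # stand-in for the thermal stream
        self.ien = IlluminationEstimationNetwork(dim)        # sketch defined earlier
        self.fusion = IlluminationAwareWindowAttention(dim)  # sketch defined earlier

    def forward(self, rgb, thermal):
        # rgb: (1, 3, H, W), thermal: (1, 1, H, W), with H and W divisible by the window size
        _, _, H, W = rgb.shape
        f_rgb = self.rgb_stem(rgb)                           # (1, dim, H, W)
        f_thr = self.thermal_stem(thermal)
        illum_token, _ = self.ien(rgb)                       # (1, dim) global illumination feature
        rgb_win = window_partition(f_rgb, self.win)          # (num_windows, win*win, dim)
        thr_win = window_partition(f_thr, self.win)
        fused = self.fusion(rgb_win, thr_win, illum_token)   # (num_windows, 2*win*win, dim)
        area = self.win * self.win
        merged = fused[:, :area] + fused[:, area:]           # combine the two modality halves
        return window_merge(merged, 1, f_rgb.size(1), H, W, self.win)  # (1, dim, H, W)
```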

Experiments

In this section, we evaluate the performance of the proposed IAWT fusion network on various RGBT-related tasks, including RGBT pedestrian detection, RGBT semantic segmentation and RGBT crowd counting. We show that our IAWT fusion network achieves state-of-the-art performance on two RGBT crowd counting datasets (DroneRGBT and RGBT-CC) and makes performance improvements on the KAIST RGBT pedestrian detection dataset and an urban scene RGBT semantic segmentation dataset.

Conclusion

In this paper, we proposed an Illumination-Aware Window Transformer (IAWT) fusion network to handle the RGB and thermal modality fusion problem. Specifically, the IAWT module uses a window-based multi-modality attention combined with additional estimated illumination information. We proposed a window-wise fusion strategy to implicitly alleviate the RGBT modality spatial misalignment problem within a specific dataset. We also introduced a global illumination context to help the fusion…

Declaration of Competing Interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Zhenzhong Chen reports financial support was provided by National Natural Science Foundation of China.

References (56)

  • Zhou K. et al., Improving multispectral pedestrian detection by addressing modality imbalance problems
  • Liu L. et al., Cross-modal collaborative representation learning and a large-scale RGBT benchmark for crowd counting
  • Zhang L. et al., Weakly aligned feature fusion for multimodal object detection, IEEE Trans. Neural Netw. Learn. Syst. (2021)
  • Liu J. et al., Multispectral deep neural networks for pedestrian detection
  • Zhang L. et al., Weakly aligned cross-modal learning for multispectral pedestrian detection
  • Peng T. et al., RGB-T crowd counting from drone: A benchmark and MMCCN network
  • Ha Q. et al., MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes
  • Sun Y. et al., RTFNet: RGB-Thermal fusion network for semantic segmentation of urban scenes, IEEE Robot. Autom. Lett. (2019)
  • Deng F. et al., FEANet: Feature-enhanced attention network for RGB-thermal real-time semantic segmentation
  • Zhou W. et al., MFFENet: Multiscale feature fusion and enhancement network for RGB-Thermal urban road scene parsing, IEEE Trans. Multimed. (2022)
  • Zhou W. et al., GMNet: Graded-feature multilabel-learning network for RGB-Thermal urban scene semantic segmentation, IEEE Trans. Image Process. (2021)
  • Zhou W. et al., ECFFNet: Effective and consistent feature fusion network for RGB-T salient object detection, IEEE Trans. Circuits Syst. Video Technol. (2022)
  • Zhou W. et al., APNet: Adversarial learning assistance and perceived importance fusion network for all-day RGB-T salient object detection, IEEE Trans. Emerg. Top. Comput. Intell. (2022)
  • He Y. et al., STD2P: RGBD semantic segmentation using spatio-temporal data-driven pooling
  • Zhou W. et al., FRNet: Feature reconstruction network for RGB-D indoor scene parsing, IEEE J. Sel. Top. Sign. Proces. (2022)
  • Zhou W. et al., RLLNet: A lightweight remaking learning network for saliency redetection on RGB-D images, Sci. China Inf. Sci. (2022)
  • Vaswani A. et al., Attention is all you need
  • Devlin J. et al., BERT: Pre-training of deep bidirectional transformers for language understanding (2019)

This work was supported in part by the National Natural Science Foundation of China under Grant 62036005 and the Special Fund of Hubei Luojia Laboratory.

    This paper has been recommended for acceptance by Zicheng Liu.
