Abstract:
Visible-infrared person re-identification (VI-ReID) is a challenging task in computer vision that aims to match people across images from the visible and infrared modalities. The widely used VI-ReID framework consists of a convolutional neural network backbone that extracts visual features and a feature embedding network that projects heterogeneous features into a common feature space. However, many studies based on existing pre-trained models neglect potential correlations between different locations and channels within a single sample during feature extraction. Inspired by the success of the Transformer in computer vision, we extend it to enhance feature representation for VI-ReID. In this paper, we propose a discriminative feature learning network based on a visual Transformer (DFLN-ViT) for VI-ReID. First, to capture long-range dependencies between different locations, we propose a spatial feature awareness module (SAM), which uses a single-layer Transformer with a novel patch-embedding strategy to encode location information. Second, to refine the representation at each channel, we design a channel feature enhancement module (CEM). The CEM treats the features of each channel as a sequence of Transformer inputs, taking advantage of the Transformer's ability to model long-range dependencies. Finally, we propose a Triplet-aided Hetero-Center (THC) loss to learn more discriminative feature representations by balancing the cross-modality and intra-modality distances of the centers. Experimental results on two datasets show that our method significantly improves VI-ReID performance, outperforming most state-of-the-art methods.
Published in: IEEE Transactions on Multimedia ( Volume: 25)
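
The difference between the spatial view (SAM) and the channel view (CEM) described in the abstract comes down to how a backbone feature map is turned into a token sequence before attention. The following is a minimal NumPy sketch of that idea only, not the authors' implementation: the single-head attention, the random projection weights, the toy tensor sizes, and all function names are assumptions made for illustration, and the paper's patch-embedding strategy, residual connections, normalization, and feed-forward layers are omitted because the abstract does not specify them.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_tokens(feat):
    """(C, H, W) -> (H*W, C): one token per spatial location."""
    c, h, w = feat.shape
    return feat.reshape(c, h * w).T


def channel_tokens(feat):
    """(C, H, W) -> (C, H*W): one token per channel."""
    c, h, w = feat.shape
    return feat.reshape(c, h * w)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)  # each row sums to 1
    return attn @ v, attn


def random_projections(rng, d_in, d_out):
    """Hypothetical stand-in for learned query/key/value weights."""
    return tuple(rng.standard_normal((d_in, d_out)) * 0.1 for _ in range(3))


rng = np.random.default_rng(0)
C, H, W, D = 8, 4, 3, 16               # toy sizes, chosen arbitrarily
feat = rng.standard_normal((C, H, W))  # stand-in for a backbone feature map

# Spatial view: attention relates every location to every other location.
sp_out, sp_attn = self_attention(spatial_tokens(feat),
                                 *random_projections(rng, C, D))

# Channel view: attention relates every channel to every other channel.
ch_out, ch_attn = self_attention(channel_tokens(feat),
                                 *random_projections(rng, H * W, D))

print(sp_attn.shape, ch_attn.shape)
```

In this sketch the spatial attention map has one row and column per location (H*W by H*W) while the channel attention map has one per channel (C by C), which is the sense in which the two views capture dependencies across locations and across channels respectively.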