Abstract:
Long-term (also called Clothing-Change) person re-identification (CC-reID) aims to confirm the identity of pedestrians captured at diverse locations and/or times. Current CC-reID methods rely heavily on ID features learned by CNN architectures. However, with limited receptive fields, CNNs struggle to effectively mine unique but discriminative ID features (e.g., hair style, tattoos, and accessories) from small body regions. Compared with CNNs, Transformers have certain merits in exploring more diverse ID-unique features and retaining more details, owing to the multi-head self-attention design and the removal of down-sampling operations. In this paper, a two-stream hybrid Convolution-Transformer Network (CT-Net) is proposed for CC-reID, combining a CNN and a Transformer in parallel in an end-to-end learning scheme. Specifically, CT-Net contains a CNN-based stream (C-Stream) and a Transformer-based stream (T-Stream). Compared with using C-Stream alone, T-Stream encourages C-Stream to explore more detailed ID-unique features when clothing information is not reliable in CC-reID. To this end, a Feature Supplement Module (FSM) is proposed to transfer features learned by T-Stream to C-Stream, from low level to high level, for mining more ID-unique features. To further enhance the discriminability and complementarity of the ID features learned by our CT-Net, we also introduce a hierarchical supervision with bilinear pooling (HSBP). Experimental results demonstrate that CT-Net performs favorably against state-of-the-art methods on three CC-reID benchmarks. Meanwhile, CT-Net also demonstrates good generalization ability by achieving comparable performance on traditional person re-ID datasets such as Market-1501 and DukeMTMC-reID.
Published in: IEEE Transactions on Multimedia (Volume: 26)