Cross-modality person re-identification via multi-task learning
Introduction
Person Re-IDentification (PReID) aims at matching a probe pedestrian image with those in a large-scale gallery collected by non-overlapping cameras at diverse locations. It is one of the core steps in many computer vision applications, including video surveillance [1] and people tracking [2].
Most existing PReID models [3], [4] concentrate on solving the visible image-based PReID task (VV-PReID), which assumes that visible (RGB) images are captured under good illumination conditions. However, this assumption makes VV-PReID models impractical for real-world applications because visible cameras in such scenarios are incapable of capturing discriminative information under poor lighting conditions. Therefore, most existing surveillance systems opt for dual-mode cameras, where visible cameras provide detailed visual characteristics in the daytime and infrared (IR) cameras capture discriminative information in dark environments. Consequently, recent research focus has shifted to the visible-infrared PReID (VI-PReID) problem, overcoming the limitations of existing PReID techniques and further facilitating their real-life applications.
It turns out that extracting discriminative modality-shared person information from the input images is essential to solving two grand challenges in VI-PReID, i.e., cross-modality variations caused by the large modality discrepancy between RGB and thermal images, and intra-modality variations caused by different viewpoints, changing human poses, etc. [6], [7], [8], [9]. Based on the person retrieval dataset RAP [5], Fig. 1 summarizes most of the pedestrian attributes that may appear in RGB images and IR images, respectively. It can be seen that most modality-shared person features are closely related to person body information. This indicates that person body information is vital to learning discriminative modality-shared features for VI-PReID. However, most existing VI-PReID models advocate the use of image labels to supervise training, which tends to learn features from the entire image and fails to mine precise modality-shared person body information. Therefore, the overall performance of such systems, especially the feature extraction part, is unsatisfactory. This motivates us to investigate better ways to exploit more modality-shared person body information for VI-PReID.
While many single-modality VV-PReID models have paid attention to person body information, the study of extracting such information for cross-modality VI-PReID is still in its infancy. The major reason is that the transition from VV-PReID to VI-PReID is not straightforward. Concretely, as shown in Fig. 2(a), most existing VV-PReID models adopt a two-step approach, where the first step is to use pre-trained models (e.g., pose estimation, key point detection and human parsing) to extract person body information. In the subsequent step, this person body information serves as a feature selection mask to help align body parts for the PReID task. This simple feature selection operation (multiplication) may work well for single-modality person ReID because it regulates feature maps to focus on the person body region, such as the texture or appearance information of the person body.
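The mask-multiplication step just described can be sketched in a few lines. This is a minimal, framework-free illustration of the general idea, not the implementation of any particular VV-PReID model; the shapes and values are toy assumptions.

```python
# Hypothetical sketch of mask-based feature selection in VV-PReID:
# a pre-computed binary person-body mask is multiplied element-wise
# with the backbone feature map, suppressing background responses.
def select_body_features(feature_map, body_mask):
    """Multiply each channel of `feature_map` (C x H x W, nested lists)
    element-wise by a binary `body_mask` (H x W)."""
    return [
        [
            [f * m for f, m in zip(f_row, m_row)]
            for f_row, m_row in zip(channel, body_mask)
        ]
        for channel in feature_map
    ]

# Toy 1-channel, 2x2 feature map; the mask keeps only the left column.
features = [[[0.8, 0.3], [0.6, 0.1]]]
mask = [[1, 0], [1, 0]]
selected = select_body_features(features, mask)
# Masked-out (background) positions become zero.
```

Note that whatever response the mask zeroes out is irrecoverably discarded, which is exactly the behavior the next paragraph argues is problematic for the cross-modality setting.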
However, such primitive feature selection strategies suffer when applied to cross-modality VI-PReID for the following reasons. First, the amount of original modality-invariant person information extracted by VI-PReID models is far less than the single-modality person information extracted by VV-PReID models. After feature selection, some discriminative person information for VI-PReID may be mistakenly discarded. As a result, the simple feature selection strategy will not boost, and may even degrade, the performance of a cross-modality VI-PReID model. Second, taking only the extracted person body information as selection masks cannot help the feature extractor of the VI-PReID model capture more modality-invariant features related to person bodies. Accordingly, as shown in Fig. 2(c), these VI-PReID models still cannot well exploit features containing precise person body information. The above analysis urges us to devise a better strategy for extracting modality-invariant person information for VI-PReID.
Actually, there are strong inner relations between person mask prediction and VI-PReID. On one hand, VI-PReID can provide person-related semantic information, e.g., view-invariant person information, to aid person mask prediction in segmenting persons with different poses, views, etc. On the other hand, person mask prediction can impel VI-PReID to extract more discriminative modality-shared features, e.g., abundant person body information, which is the key to reducing cross-modality variations as well as intra-modality variations. To this end, as shown in Fig. 2(b), we propose a novel multi-task learning framework, which simultaneously learns the two vision tasks from different supervisory signals under a uniform loss function, building an improved feature extractor for cross-modality person re-identification. In this framework, modality-invariant person body information (e.g., precise shapes, contours and appearances) is captured by exploiting the interactions between the two tasks. Accordingly, as shown in Fig. 2(c) and (d), our proposed VI-PReID model can exploit more person body information in the feature learning stage via multi-task learning than can be obtained through feature selection. This greatly boosts the performance of VI-PReID.
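A uniform multi-task objective of the kind described above is typically a weighted sum of the per-task losses, so that both branches back-propagate through the shared feature extractor. The sketch below is a generic illustration under that assumption; the symbol names (`reid_loss`, `mask_loss`, `lam`) and the weighting scheme are ours, not necessarily the paper's exact formulation.

```python
# Generic multi-task objective: L_total = L_reid + lam * L_mask.
# Minimizing the sum jointly trains the VI-PReID branch and the
# person mask prediction branch over shared features.
def multi_task_loss(reid_loss, mask_loss, lam=1.0):
    """Combine the two task losses with a balancing weight `lam`."""
    return reid_loss + lam * mask_loss

# Toy values only; in training these would be tensors with gradients.
total = multi_task_loss(reid_loss=0.9, mask_loss=0.4, lam=0.5)  # approx. 1.1
```

The balancing weight controls how strongly the auxiliary segmentation signal shapes the shared representation, which is a common tuning knob in multi-task setups.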
Meanwhile, although the two tasks, i.e., person mask prediction and VI-PReID, are highly related, they are still quite different. Specifically, person mask prediction aims to segment all the persons in the given images from the background without distinguishing individual identities, while VI-PReID aims to identify every person in the given images. Therefore, directly introducing the features extracted by the person mask prediction branch into the VI-PReID branch may yield suboptimal results. To address this, a task translation sub-network is specially designed to transfer the discriminative information extracted by the person mask prediction task into the VI-PReID task.
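The task-translation idea can be caricatured as follows: mask-branch features pass through a small learned mapping before entering the VI-PReID branch, rather than being injected directly. In this toy sketch the "mapping" is a per-channel affine transform standing in for the paper's learned sub-network; all names and values are illustrative assumptions.

```python
# Hypothetical stand-in for the task translation sub-network: map
# mask-branch channel descriptors into the ReID feature space with
# per-channel scale and shift (placeholders for learned weights).
def task_translate(mask_features, scales, shifts):
    """Return scales[i] * mask_features[i] + shifts[i] per channel."""
    return [s * f + b for f, s, b in zip(mask_features, scales, shifts)]

translated = task_translate([0.5, 1.0], scales=[2.0, 0.5], shifts=[0.1, 0.0])
# The ReID branch would then fuse `translated` with its own features,
# reducing the representation gap between the two tasks.
```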
In summary, the main contributions of this paper are as follows:
- (1)
We take the initiative to investigate the importance and strategy of exploiting person body information for cross-modality VI-PReID. Our strategy enhances the discriminability of extracted modality-shared features and further reduces cross-modality variations as well as intra-modality variations.
- (2)
We formulate a novel multi-task learning model by building each individual branch for a different recognition task relevant to person Re-ID. In doing so, our model can concurrently capture discriminative modality-invariant person body information via person mask prediction and VI-PReID. Unlike existing models, which only employ person body information for feature selection, our model exploits the inner relations between person mask prediction and VI-PReID, thus enabling the interactions of the person body information across the two tasks.
- (3)
A task translation sub-network is put in place to transfer discriminative person body information, extracted by the person mask prediction sub-branch, into the VI-PReID sub-branch. This reduces the feature discrepancy between the two different tasks, thus leading to better exploitation of person body information in the VI-PReID sub-branch.
The rest of this paper is organized as follows. We briefly describe previous works related to visible image-based person Re-ID and cross-modality person Re-ID in Section 2, followed by the details of our newly proposed method in Section 3. Several experiments are conducted to validate the proposed model in Section 4. Finally, in Section 5, a brief conclusion is drawn.
VV-PReID
In recent years, deep learning methods for VV-PReID have shown significant superiority in retrieval accuracy. Summarizing the vast amount of existing research on VV-PReID is beyond the scope of this paper, and we refer interested readers to Wu et al. [10], [11] for recent surveys on these topics. Many VV-PReID models also try to exploit discriminative person body related information to reduce intra-modality variations. Generally, these models exploit person body related information via
Proposed model
In this section, we will elaborate on how we devise our proposed multi-task learning based VI-PReID model to exploit the inner relations between person mask prediction and VI-PReID. As shown in Fig. 3, our proposed multi-task learning based model mainly contains five components, i.e., modality-specific feature extraction, modality-shared feature translation, person mask prediction branch, task translation module and VI-PReID branch. It should be noticed that, in our model, the modality-specific
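To make the five named components concrete, the following is a high-level, hypothetical sketch of the forward pass they imply. Every function name here is a placeholder of our own; the actual modules are defined in the paper, not in this snippet.

```python
# Hypothetical wiring of the five components named above. Each entry of
# `modules` is a callable standing in for the corresponding sub-network.
def forward(rgb_image, ir_image, modules):
    # 1. Modality-specific feature extraction per input modality.
    f_rgb = modules["rgb_extractor"](rgb_image)
    f_ir = modules["ir_extractor"](ir_image)
    # 2. Modality-shared feature translation into a common space.
    shared = modules["shared_translator"](f_rgb, f_ir)
    # 3. Person mask prediction branch (auxiliary task).
    masks = modules["mask_branch"](shared)
    # 4. Task translation module bridges mask features to ReID features.
    bridged = modules["task_translation"](masks)
    # 5. VI-PReID branch produces identity predictions.
    ids = modules["reid_branch"](shared, bridged)
    return masks, ids
```

The key design point visible even in this caricature is that the mask branch never feeds the ReID branch directly; its output always passes through the task translation module first.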
Datasets and evaluation metrics
Datasets Two publicly available benchmarks (i.e., SYSU-MM01 [21] and RegDB [22]) are employed to evaluate the performance of our proposed network.
SYSU-MM01 is a large-scale and widely used VI-PReID dataset [21]. It employs four visible cameras and two infrared cameras to collect RGB images and IR images, respectively, in indoor and outdoor environments. It selects 395 person identities as the training set and 96 person identities as the testing set. It contains two evaluation modes in
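VI-PReID benchmarks such as these are commonly scored with the rank-k matching rate of the CMC curve: for each IR probe, gallery RGB identities are ranked by similarity, and a query counts as a hit if its true identity appears among the top k. The sketch below illustrates that standard metric in general terms; it is not the paper's evaluation code, and the toy identities are made up.

```python
# Illustrative rank-k matching rate (the per-rank entries of a CMC curve).
def rank_k_accuracy(ranked_ids, true_ids, k=1):
    """ranked_ids: per-query lists of gallery identities, best match
    first. Returns the fraction of queries whose true identity appears
    within the first k ranked identities."""
    hits = sum(1 for ranking, t in zip(ranked_ids, true_ids) if t in ranking[:k])
    return hits / len(true_ids)

# Two toy queries: the first is matched at rank 1, the second at rank 2.
rankings = [[7, 3, 5], [3, 7, 5]]
truth = [7, 7]
r1 = rank_k_accuracy(rankings, truth, k=1)  # 0.5
r2 = rank_k_accuracy(rankings, truth, k=2)  # 1.0
```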
Conclusion
In this paper, a novel VI-PReID model has been presented to explicitly exploit person body information, enhancing the discriminability of modality-shared features and further reducing cross-modality variations as well as intra-modality variations in a multi-task learning framework. Concretely, via the proposed multi-task learning framework, the inner relations between person mask prediction and VI-PReID are effectively exploited. This further facilitates our proposed VI-PReID model in extracting
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (46)
- Intelligent multi-camera video surveillance: a review, Pattern Recognit. Lett. (2013)
- Part-guided graph convolution networks for person re-identification, Pattern Recognit. (2021)
- Attributes-aided part detection and refinement for person re-identification, Pattern Recognit. (2020)
- An efficient framework for visible-infrared cross modality person re-identification, Signal Process. Image Commun. (2020)
- Modality adversarial neural network for visible-thermal person re-identification, Pattern Recognit. (2020)
- Enhancing the discriminative feature learning for visible-thermal cross-modality person re-identification, Neurocomputing (2020)
- Deep learning-based methods for person re-identification: a comprehensive review, Neurocomputing (2019)
- Deep features for person re-identification on metric learning, Pattern Recognit. (2021)
- Hetero-center loss for cross-modality person re-identification, Neurocomputing (2020)
- Video person re-identification for wide area tracking based on recurrent neural networks, IEEE Trans. Circuits Syst. Video Technol. (2017)
- A richly annotated pedestrian dataset for person retrieval in real surveillance scenarios, IEEE Trans. Image Process.
- Bi-directional center-constrained top-ranking for visible thermal person re-identification, IEEE Trans. Inf. Forensics Secur.
- Person re-identification by multi-scale feature representation learning with random batch feature mask, IEEE Trans. Cogn. Dev. Syst.
- Pose-guided visible part matching for occluded person ReID, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
- Look into person: joint body parsing & pose estimation network and a new benchmark, IEEE Trans. Pattern Anal. Mach. Intell.
- Human semantic parsing for person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Visible thermal person re-identification via dual-constrained top-ranking, in: Proceedings of the International Joint Conference on Artificial Intelligence
- Cross-modality paired-images generation and augmentation for RGB-infrared person re-identification, Neural Netw.
- RGB-infrared cross-modality person re-identification via joint pixel and feature alignment, in: Proceedings of the IEEE International Conference on Computer Vision
- Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Alleviating modality bias training for infrared-visible person re-identification, IEEE Trans. Multimed.
- RGB-infrared cross-modality person re-identification, in: Proceedings of the IEEE International Conference on Computer Vision
- Person recognition system based on a combination of body images from visible light and thermal cameras, Sensors
Nianchang Huang is currently pursuing the Ph.D. degree in School of Mechano-Electronic Engineering, Xidian University, China. His research interests include deep learning and multimodal image processing in computer vision.
Kunlong Liu received the B.S. degree from Hefei University of Technology, Hefei, China, in 2019. He is currently pursuing the M.S. degree in control engineering with Xidian University, Xi’an, China. His current research interests include deep learning and computer vision.
Yang Liu received the B.Eng., M.Sc., and Ph.D. degrees in signal and information processing from Xidian University, Xi'an, China, in 2013, 2015 and 2018, respectively. He is currently a Post-Doctoral Researcher at Xidian University, Xi'an, China. He has authored nearly 30 technical articles in refereed journals and proceedings, including IEEE Trans. Image Process., IEEE Trans. Cybernetics, PR, CVPR, AAAI, and IJCAI. His research interests include dimensionality reduction, pattern recognition, and deep learning.
Qiang Zhang received the B.S. degree in automatic control, the M.S. degree in pattern recognition and intelligent systems, and the Ph.D. degree in circuit and system from Xidian University, China, in 2001, 2004, and 2008, respectively. He was a Visiting Scholar with the Center for Intelligent Machines, McGill University, Canada. He is currently a professor with the Automatic Control Department, Xidian University, China. His current research interests include image processing and pattern recognition.
Jungong Han is currently a Full Professor and Chair in Computer Science at Aberystwyth University, UK. His research interests span the fields of video analysis, computer vision and applied machine learning. He has published over 180 papers, including 40+ IEEE Trans and 40+ A* conference papers.