Cross-modality person re-identification via multi-task learning
Introduction
Person Re-IDentification (PReID) aims at matching a probe pedestrian image with those in a large-scale gallery collected by non-overlapping cameras at diverse locations. It is one of the core steps in many computer vision applications, including video surveillance [1] and people tracking [2].
Most existing PReID models [3], [4] concentrate on solving the visible image-based PReID task (VV-PReID), which assumes that visible (RGB) images are captured under good illumination conditions. However, this assumption makes VV-PReID models impractical for real-world applications because visible cameras in such scenarios are incapable of capturing discriminative information under poor lighting conditions. Therefore, most existing surveillance systems opt for dual-mode cameras, where visible cameras provide detailed visual characteristics in the daytime and infrared (IR) cameras capture discriminative information in dark environments. Consequently, recent research focus has shifted to the visible-infrared PReID (VI-PReID) problem, overcoming the limitations of existing PReID techniques and further facilitating their real-life applications.
It turns out that extracting discriminative modality-shared person information from the input images is essential to solving two grand challenges in VI-PReID, i.e., cross-modality variations caused by the large modality discrepancy between RGB and thermal images, and intra-modality variations caused by different viewpoints, changing human poses, etc. [6], [7], [8], [9]. Based on the person retrieval dataset RAP [5], Fig. 1 summarizes most of the pedestrian attributes that may appear in RGB images and IR images, respectively. It can be seen that most modality-shared person features are closely related to person body information. This indicates that person body information is vital to learning discriminative modality-shared features for VI-PReID. However, most existing VI-PReID models advocate the use of image labels to supervise training, which tends to learn features from the entire image and fails to mine precise modality-shared person body information. Therefore, the overall performance of such systems, especially the feature extraction part, is unsatisfactory. This motivates us to investigate better ways to exploit more modality-shared person body information for VI-PReID.
While many single-modality VV-PReID models have paid attention to person body information, the study of extracting such information for cross-modality VI-PReID is still in its infancy. The major reason is that the transition from VV-PReID to VI-PReID is not straightforward. Concretely, as shown in Fig. 2(a), most existing VV-PReID models adopt a two-step approach, where the first step is to use pre-trained models (e.g., pose estimation, key point detection and human parsing) to extract person body information. In the subsequent step, this person body information serves as a feature selection mask to help align body parts for the PReID task. This simple feature selection operation (multiplication) may work well for single-modality person ReID because it regulates feature maps to focus on the person body region, such as the texture or appearance information of the person body.
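The mask-multiplication step just described can be sketched in a few lines. This is a minimal, framework-free illustration of the general idea, not the implementation of any particular VV-PReID model; the shapes and values are toy assumptions.

```python
# Hypothetical sketch of mask-based feature selection in VV-PReID:
# a pre-computed binary person-body mask is multiplied element-wise
# with the backbone feature map, suppressing background responses.
def select_body_features(feature_map, body_mask):
    """Multiply each channel of `feature_map` (C x H x W, nested lists)
    element-wise by a binary `body_mask` (H x W)."""
    return [
        [
            [f * m for f, m in zip(f_row, m_row)]
            for f_row, m_row in zip(channel, body_mask)
        ]
        for channel in feature_map
    ]

# Toy 1-channel, 2x2 feature map; the mask keeps only the left column.
features = [[[0.8, 0.3], [0.6, 0.1]]]
mask = [[1, 0], [1, 0]]
selected = select_body_features(features, mask)
# Masked-out (background) positions become zero.
```

Note that whatever response the mask zeroes out is irrecoverably discarded, which is exactly the behavior the next paragraph argues is problematic for the cross-modality setting.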
However, such primitive feature selection strategies suffer when applied to cross-modality VI-PReID for the following reasons. First, the amount of original modality-invariant person information extracted by VI-PReID models is far less than the single-modality person information extracted by VV-PReID models. After feature selection, some discriminative person information for VI-PReID may be mistakenly discarded. As a result, the simple feature selection strategy will not boost, and may even degrade, the performance of a cross-modality VI-PReID model. Second, taking only the extracted person body information as selection masks cannot help the feature extractor of the VI-PReID model capture more modality-invariant features related to person bodies. Accordingly, as shown in Fig. 2(c), these VI-PReID models still cannot well exploit features containing precise person body information. The above analysis urges us to devise a better strategy for extracting modality-invariant person information for VI-PReID.
Actually, there are strong inner relations between person mask prediction and VI-PReID. On one hand, VI-PReID can provide person-related semantic information, e.g., view-invariant person information, to aid person mask prediction in segmenting persons with different poses, views, etc. On the other hand, person mask prediction can impel VI-PReID to extract more discriminative modality-shared features, e.g., abundant person body information, which is the key to reducing cross-modality variations as well as intra-modality variations. To this end, as shown in Fig. 2(b), we propose a novel multi-task learning framework, which simultaneously learns the two vision tasks from different supervisory signals under a uniform loss function, building an improved feature extractor for cross-modality person re-identification. In this framework, modality-invariant person body information (e.g., precise shapes, contours and appearances) is captured by exploiting the interactions between the two tasks. Accordingly, as shown in Fig. 2(c) and (d), our proposed VI-PReID model can exploit more person body information in the feature learning stage via multi-task learning than can be obtained through feature selection. This greatly boosts the performance of VI-PReID.
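A uniform multi-task objective of the kind described above is typically a weighted sum of the per-task losses, so that both branches back-propagate through the shared feature extractor. The sketch below is a generic illustration under that assumption; the symbol names (`reid_loss`, `mask_loss`, `lam`) and the weighting scheme are ours, not necessarily the paper's exact formulation.

```python
# Generic multi-task objective: L_total = L_reid + lam * L_mask.
# Minimizing the sum jointly trains the VI-PReID branch and the
# person mask prediction branch over shared features.
def multi_task_loss(reid_loss, mask_loss, lam=1.0):
    """Combine the two task losses with a balancing weight `lam`."""
    return reid_loss + lam * mask_loss

# Toy values only; in training these would be tensors with gradients.
total = multi_task_loss(reid_loss=0.9, mask_loss=0.4, lam=0.5)  # approx. 1.1
```

The balancing weight controls how strongly the auxiliary segmentation signal shapes the shared representation, which is a common tuning knob in multi-task setups.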
Meanwhile, although the two tasks, i.e., person mask prediction and VI-PReID, are highly related, they are still quite different. Specifically, person mask prediction aims to segment all the persons in the given images from the background without distinguishing individual identities, while VI-PReID aims to identify every person in the given images. Therefore, directly introducing the features extracted by the person mask prediction branch into the VI-PReID branch may yield suboptimal results. To address this, a task translation sub-network is specially designed to transfer the discriminative information extracted by the person mask prediction task into the VI-PReID task.
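The task-translation idea can be caricatured as follows: mask-branch features pass through a small learned mapping before entering the VI-PReID branch, rather than being injected directly. In this toy sketch the "mapping" is a per-channel affine transform standing in for the paper's learned sub-network; all names and values are illustrative assumptions.

```python
# Hypothetical stand-in for the task translation sub-network: map
# mask-branch channel descriptors into the ReID feature space with
# per-channel scale and shift (placeholders for learned weights).
def task_translate(mask_features, scales, shifts):
    """Return scales[i] * mask_features[i] + shifts[i] per channel."""
    return [s * f + b for f, s, b in zip(mask_features, scales, shifts)]

translated = task_translate([0.5, 1.0], scales=[2.0, 0.5], shifts=[0.1, 0.0])
# The ReID branch would then fuse `translated` with its own features,
# reducing the representation gap between the two tasks.
```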
In summary, the main contributions of this paper are as follows:
- (1)
We take the initiative to investigate the importance and strategy of exploiting person body information for cross-modality VI-PReID. Our strategy enhances the discriminability of extracted modality-shared features and further reduces cross-modality variations as well as intra-modality variations.
- (2)
We formulate a novel multi-task learning model by building each individual branch for a different recognition task relevant to person Re-ID. In doing so, our model can concurrently capture discriminative modality-invariant person body information via person mask prediction and VI-PReID. Unlike existing models, which only employ person body information for feature selection, our model exploits the inner relations between person mask prediction and VI-PReID, thus enabling the interactions of the person body information across the two tasks.
- (3)
A task translation sub-network is put in place to transfer discriminative person body information, extracted by the person mask prediction sub-branch, into the VI-PReID sub-branch. This reduces the feature discrepancy between the two different tasks, thus leading to better exploitation of person body information in the VI-PReID sub-branch.
The rest of this paper is organized as follows. We briefly describe previous works related to visible image-based person Re-ID and cross-modality person Re-ID in Section 2, followed by the details of our newly proposed method in Section 3. Several experiments are conducted to validate the proposed model in Section 4. Finally, in Section 5, a brief conclusion is drawn.
VV-PReID
In recent years, deep learning methods for VV-PReID have shown significant superiority in retrieval accuracy. Summarizing the vast amount of existing research on VV-PReID is beyond the scope of this paper, and we refer interested readers to Wu et al. [10], [11] for recent surveys on these topics. Many VV-PReID models also try to exploit discriminative person body related information to reduce intra-modality variations. Generally, these models exploit person body related information via
Proposed model
In this section, we will elaborate on how we devise our proposed multi-task learning based VI-PReID model to exploit the inner relations between person mask prediction and VI-PReID. As shown in Fig. 3, our proposed multi-task learning based model mainly contains five components, i.e., modality-specific feature extraction, modality-shared feature translation, person mask prediction branch, task translation module and VI-PReID branch. It should be noticed that, in our model, the modality-specific
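To make the five named components concrete, the following is a high-level, hypothetical sketch of the forward pass they imply. Every function name here is a placeholder of our own; the actual modules are defined in the paper, not in this snippet.

```python
# Hypothetical wiring of the five components named above. Each entry of
# `modules` is a callable standing in for the corresponding sub-network.
def forward(rgb_image, ir_image, modules):
    # 1. Modality-specific feature extraction per input modality.
    f_rgb = modules["rgb_extractor"](rgb_image)
    f_ir = modules["ir_extractor"](ir_image)
    # 2. Modality-shared feature translation into a common space.
    shared = modules["shared_translator"](f_rgb, f_ir)
    # 3. Person mask prediction branch (auxiliary task).
    masks = modules["mask_branch"](shared)
    # 4. Task translation module bridges mask features to ReID features.
    bridged = modules["task_translation"](masks)
    # 5. VI-PReID branch produces identity predictions.
    ids = modules["reid_branch"](shared, bridged)
    return masks, ids
```

The key design point visible even in this caricature is that the mask branch never feeds the ReID branch directly; its output always passes through the task translation module first.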
Datasets and evaluation metrics
Datasets Two publicly available benchmarks (i.e., SYSU-MM01 [21] and RegDB [22]) are employed to evaluate the performance of our proposed network.
SYSU-MM01 is a large-scale and widely used VI-PReID dataset [21]. It employs four visible cameras and two infrared cameras to collect RGB images and IR images, respectively, in indoor and outdoor environments. It selects 395 person identities as the training set and 96 person identities as the testing set. It contains two evaluation modes in
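VI-PReID benchmarks such as these are commonly scored with the rank-k matching rate of the CMC curve: for each IR probe, gallery RGB identities are ranked by similarity, and a query counts as a hit if its true identity appears among the top k. The sketch below illustrates that standard metric in general terms; it is not the paper's evaluation code, and the toy identities are made up.

```python
# Illustrative rank-k matching rate (the per-rank entries of a CMC curve).
def rank_k_accuracy(ranked_ids, true_ids, k=1):
    """ranked_ids: per-query lists of gallery identities, best match
    first. Returns the fraction of queries whose true identity appears
    within the first k ranked identities."""
    hits = sum(1 for ranking, t in zip(ranked_ids, true_ids) if t in ranking[:k])
    return hits / len(true_ids)

# Two toy queries: the first is matched at rank 1, the second at rank 2.
rankings = [[7, 3, 5], [3, 7, 5]]
truth = [7, 7]
r1 = rank_k_accuracy(rankings, truth, k=1)  # 0.5
r2 = rank_k_accuracy(rankings, truth, k=2)  # 1.0
```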
Conclusion
In this paper, a novel VI-PReID model has been presented to explicitly exploit person body information, enhancing the discriminability of modality-shared features and further reducing cross-modality variations as well as intra-modality variations in a multi-task learning framework. Concretely, via the proposed multi-task learning framework, the inner relations between person mask prediction and VI-PReID are effectively exploited. This further facilitates our proposed VI-PReID model in extracting
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (46)
- Intelligent multi-camera video surveillance: a review, Pattern Recognit. Lett. (2013)
- Part-guided graph convolution networks for person re-identification, Pattern Recognit. (2021)
- Attributes-aided part detection and refinement for person re-identification, Pattern Recognit. (2020)
- An efficient framework for visible-infrared cross modality person re-identification, Signal Process. Image Commun. (2020)
- Modality adversarial neural network for visible-thermal person re-identification, Pattern Recognit. (2020)
- Enhancing the discriminative feature learning for visible-thermal cross-modality person re-identification, Neurocomputing (2020)
- Deep learning-based methods for person re-identification: a comprehensive review, Neurocomputing (2019)
- Deep features for person re-identification on metric learning, Pattern Recognit. (2021)
- Hetero-center loss for cross-modality person re-identification, Neurocomputing (2020)
- Video person re-identification for wide area tracking based on recurrent neural networks, IEEE Trans. Circuits Syst. Video Technol. (2017)
- A richly annotated pedestrian dataset for person retrieval in real surveillance scenarios, IEEE Trans. Image Process.
- Bi-directional center-constrained top-ranking for visible thermal person re-identification, IEEE Trans. Inf. Forensics Secur.
- Person re-identification by multi-scale feature representation learning with random batch feature mask, IEEE Trans. Cogn. Dev. Syst.
- Pose-guided visible part matching for occluded person ReID, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
- Look into person: joint body parsing & pose estimation network and a new benchmark, IEEE Trans. Pattern Anal. Mach. Intell.
- Human semantic parsing for person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Visible thermal person re-identification via dual-constrained top-ranking, in: Proceedings of the International Joint Conference on Artificial Intelligence
- Cross-modality paired-images generation and augmentation for RGB-infrared person re-identification, Neural Netw.
- RGB-infrared cross-modality person re-identification via joint pixel and feature alignment, in: Proceedings of the IEEE International Conference on Computer Vision
- Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Alleviating modality bias training for infrared-visible person re-identification, IEEE Trans. Multimed.
- RGB-infrared cross-modality person re-identification, in: Proceedings of the IEEE International Conference on Computer Vision
- Person recognition system based on a combination of body images from visible light and thermal cameras, Sensors
Nianchang Huang is currently pursuing the Ph.D. degree in School of Mechano-Electronic Engineering, Xidian University, China. His research interests include deep learning and multimodal image processing in computer vision.
Kunlong Liu received the B.S. degree from Hefei University of Technology, Hefei, China, in 2019. He is currently pursuing the M.S. degree in control engineering with Xidian University, Xi’an, China. His current research interests include deep learning and computer vision.
Yang Liu received the B.Eng., M.Sc., and Ph.D. degrees in signal and information processing from Xidian University, Xi'an, China, in 2013, 2015 and 2018, respectively. He is currently a Post-Doctoral Researcher at Xidian University, Xi'an, China. He has authored nearly 30 technical articles in refereed journals and proceedings, including IEEE Trans. Image Process., IEEE Trans. Cybernetics, PR, CVPR, AAAI, and IJCAI. His research interests include dimensionality reduction, pattern recognition, and deep learning.
Qiang Zhang received the B.S. degree in automatic control, the M.S. degree in pattern recognition and intelligent systems, and the Ph.D. degree in circuit and system from Xidian University, China, in 2001, 2004, and 2008, respectively. He was a Visiting Scholar with the Center for Intelligent Machines, McGill University, Canada. He is currently a professor with the Automatic Control Department, Xidian University, China. His current research interests include image processing and pattern recognition.
Jungong Han is currently a Full Professor and Chair in Computer Science at Aberystwyth University, UK. His research interests span the fields of video analysis, computer vision and applied machine learning. He has published over 180 papers, including 40+ IEEE Trans and 40+ A* conference papers.