
Pattern Recognition

Volume 128, August 2022, 108653

Cross-modality person re-identification via multi-task learning

https://doi.org/10.1016/j.patcog.2022.108653

Highlights

  • We take the initiative to investigate the importance and strategy of exploiting person body information for cross-modality VI-PReID.

  • We formulate a multi-task learning model by building an individual branch for each recognition task relevant to person Re-ID.

  • We design a task translation module to transfer the person body information from the person mask prediction branch into the VI-PReID branch.

Abstract

Despite promising preliminary results, existing cross-modality Visible-Infrared Person Re-IDentification (VI-PReID) models that incorporate semantic (person) masks simply use these masks as selection maps to separate person features from background regions. Such models make no effort to extract more modality-invariant person body features within the VI-PReID network itself, which leads to suboptimal results. In contrast, we aim to better capture person body information in the VI-PReID network itself by exploiting the inner relations between person mask prediction and VI-PReID. To this end, we present a novel multi-task learning model in which the person body features obtained by person mask prediction facilitate the extraction of discriminative modality-shared person body information for VI-PReID. On top of that, considering the task difference between person mask prediction and VI-PReID, we propose a novel task translation sub-network to transfer the discriminative person body information extracted by person mask prediction into VI-PReID. Doing so enables our model to better exploit discriminative and modality-invariant person body information. Thanks to these more discriminative modality-shared features, our method outperforms previous state-of-the-art methods by a significant margin on several benchmark datasets. Our findings validate the effectiveness of extracting discriminative person body features for the VI-PReID task.

Introduction

Person Re-IDentification (PReID) aims at matching a probe pedestrian image with those in a large-scale gallery collected by non-overlapping cameras at diverse locations. It is one of the core steps in many computer vision applications, including video surveillance [1] and people tracking [2].

Most existing PReID models [3], [4] concentrate on the visible image-based PReID task (VV-PReID), which assumes that visible (RGB) images are captured under good illumination conditions. However, this assumption makes VV-PReID models impractical for real-world applications, because visible cameras cannot capture discriminative information under poor lighting conditions. Therefore, most existing surveillance systems opt for dual-mode cameras, where visible cameras provide detailed visual characteristics in the daytime and infrared (IR) cameras capture discriminative information in dark environments. Consequently, recent research focus has shifted to the visible-infrared PReID (VI-PReID) problem, which overcomes the limitations of existing PReID techniques and further facilitates their real-life applications.

It turns out that extracting discriminative modality-shared person information from the input images is essential to solving two grand challenges in VI-PReID, i.e., cross-modality variations caused by the large modality discrepancy between RGB and thermal images, and intra-modality variations caused by different viewpoints, human pose changes, etc. [6], [7], [8], [9]. Based on the person retrieval dataset RAP [5], Fig. 1 summarizes most of the pedestrian attributes that may appear in RGB images and IR images, respectively. It can be seen that most modality-shared person features are closely related to person body information. This indicates that person body information is vital for extracting discriminative modality-shared features for VI-PReID. However, most existing VI-PReID models rely only on image-level identity labels to supervise training, which tends to learn features from the entire image but fails to mine precise modality-shared person body information. Therefore, the overall performance of such systems, especially the feature extraction part, is unsatisfactory. This motivates us to investigate better ways to exploit more modality-shared person body information for VI-PReID.
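To make the contrast concrete, the snippet below is a minimal sketch (not the authors' code) of the image-level identity supervision that most existing VI-PReID models rely on: a globally pooled feature is classified against identity labels, so nothing in the objective explicitly encourages modality-shared person body features. The feature dimension and identity count are assumptions for illustration.

```python
import torch
import torch.nn as nn

num_ids = 395        # e.g. the number of SYSU-MM01 training identities
backbone_dim = 2048  # assumed feature dimension (e.g. a ResNet-50 backbone)

classifier = nn.Linear(backbone_dim, num_ids)
id_loss = nn.CrossEntropyLoss()

feat = torch.randn(8, backbone_dim)        # globally pooled image-level features (batch of 8)
labels = torch.randint(0, num_ids, (8,))   # image-level identity labels
loss = id_loss(classifier(feat), labels)   # supervision comes only from whole-image ID labels
```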

While many single-modality VV-PReID models have paid attention to person body information, the study of extracting such information for cross-modality VI-PReID is still in its infancy. The major reason is that the transition from VV-PReID to VI-PReID is not straightforward. Concretely, as shown in Fig. 2(a), most existing VV-PReID models follow a two-step approach, where the first step is to use pre-trained models (e.g., pose estimation, key point detection and human parsing) to extract person body information. In the subsequent step, this person body information serves as a feature selection mask to help align body parts for the PReID task. This simple feature selection operation (multiplication) may work well for single-modality person ReID because it regulates the feature maps to focus on the person body region, such as the texture or appearance of the person body.
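The following is a minimal sketch of the feature selection (multiplication) step used in such two-step VV-PReID pipelines, assuming a binary person mask produced by an external pre-trained parser; tensor shapes and names are illustrative rather than taken from any specific model.

```python
import torch
import torch.nn.functional as F

feat = torch.randn(1, 256, 24, 12)        # backbone feature map (B, C, H, W)
mask = torch.rand(1, 1, 96, 48).round()   # binary person mask from an external parser

# Resize the mask to the feature resolution and multiply it onto the features,
# so only activations inside the person region survive.
mask_small = F.interpolate(mask, size=feat.shape[-2:], mode="nearest")
person_feat = feat * mask_small           # feature selection: background responses are zeroed out
```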

However, such primitive feature selection strategies suffer when applied to cross-modality VI-PReID for the following reasons. First, the amount of modality-invariant person information extracted by VI-PReID models is far less than the amount of single-modality person information extracted by VV-PReID models. After feature selection, some person information that is discriminative for VI-PReID may be mistakenly discarded. As a result, the simple feature selection strategy will not boost but may even degrade the performance of a cross-modality VI-PReID model. Second, using the extracted person body information only as selection masks cannot help the feature extractor of the VI-PReID model capture more modality-invariant features related to person bodies. Accordingly, as shown in Fig. 2(c), these VI-PReID models still cannot fully exploit features that contain precise person body information. The above analysis urges us to come up with a better strategy to extract modality-invariant person information for VI-PReID.

Actually, there are strong inner relations between person mask prediction and VI-PReID. On the one hand, VI-PReID can provide person-related semantic information, e.g., view-invariant person information, to aid person mask prediction in segmenting persons with different poses, views, etc. On the other hand, person mask prediction can impel VI-PReID to extract more discriminative modality-shared features, e.g., abundant person body information, which is the key to reducing cross-modality variations as well as intra-modality variations. To this end, as shown in Fig. 2(b), we propose a novel multi-task learning framework that simultaneously learns the two vision tasks from different supervisory signals under a uniform loss function, building an improved feature extractor for cross-modality person re-identification. In this framework, modality-invariant person body information (e.g., precise shapes, contours and appearances) is captured by exploiting the interactions between the two tasks. Accordingly, as shown in Fig. 2(c) and (d), our proposed VI-PReID model can exploit more person body information in the feature learning stage via multi-task learning than can be obtained by feature selection. This greatly boosts the performance of VI-PReID.
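As a hedged illustration of the multi-task idea (not the exact architecture of this paper), the sketch below shares one feature extractor between a person mask prediction head and a VI-PReID identity head and sums their losses into a single uniform objective. Layer sizes, the input resolution and the equal loss weighting are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultiTaskVIPReID(nn.Module):
    def __init__(self, num_ids=395, dim=256):
        super().__init__()
        # shared feature extractor used by both tasks
        self.shared = nn.Sequential(nn.Conv2d(3, dim, 3, 2, 1), nn.ReLU(),
                                    nn.Conv2d(dim, dim, 3, 2, 1), nn.ReLU())
        self.mask_head = nn.Conv2d(dim, 1, 1)     # person mask prediction branch
        self.reid_head = nn.Linear(dim, num_ids)  # VI-PReID branch (identity classifier)

    def forward(self, x):
        f = self.shared(x)
        mask_logits = self.mask_head(f)
        reid_logits = self.reid_head(f.mean(dim=(2, 3)))  # global average pooling
        return mask_logits, reid_logits

model = MultiTaskVIPReID()
img = torch.randn(4, 3, 256, 128)            # RGB or IR person images
gt_mask = torch.rand(4, 1, 64, 32).round()   # person mask supervision
gt_id = torch.randint(0, 395, (4,))          # identity supervision

mask_logits, reid_logits = model(img)
loss = nn.functional.binary_cross_entropy_with_logits(mask_logits, gt_mask) \
       + nn.functional.cross_entropy(reid_logits, gt_id)  # one uniform (summed) objective
```

Because both losses back-propagate through the shared extractor, the mask supervision pushes the shared features toward person body regions while the identity supervision keeps them discriminative for re-identification.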

Meanwhile, although the two tasks, i.e., person mask prediction and VI-PReID, are highly related, they are still quite different. Specifically, person mask prediction aims to segment all the persons in the given images from the background without distinguishing individual identities, while VI-PReID aims to identify the identity of each person in the given images. Therefore, directly introducing the features extracted by the person mask prediction branch into the VI-PReID branch may yield suboptimal results. Considering this, a task translation sub-network is specially designed to transfer the discriminative information extracted by the person mask prediction task into the VI-PReID task.
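A hypothetical sketch of the task translation idea is given below: features from the person mask prediction branch are passed through a small translation module before being fused into the VI-PReID branch, rather than injected directly. The specific layers shown are illustrative assumptions and do not reproduce the sub-network described later in the paper.

```python
import torch
import torch.nn as nn

class TaskTranslation(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # small adapter that maps mask-branch features into the Re-ID feature space
        self.translate = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.BatchNorm2d(dim), nn.ReLU(),
            nn.Conv2d(dim, dim, 1))

    def forward(self, mask_feat, reid_feat):
        # translate the mask-branch features first, then fuse them into the Re-ID branch
        return reid_feat + self.translate(mask_feat)

mask_feat = torch.randn(4, 256, 64, 32)   # from the person mask prediction branch
reid_feat = torch.randn(4, 256, 64, 32)   # from the VI-PReID branch
fused = TaskTranslation()(mask_feat, reid_feat)
```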

In summary, the main contributions of this paper are as follows:

  • (1)

    We take the initiative to investigate the importance of person body information and the strategy for exploiting it in cross-modality VI-PReID. Our strategy enhances the discriminability of the extracted modality-shared features and further reduces cross-modality variations as well as intra-modality variations.

  • (2)

    We formulate a novel multi-task learning model by building an individual branch for each recognition task relevant to person Re-ID. In doing so, our model can concurrently capture discriminative modality-invariant person body information via person mask prediction and VI-PReID. Unlike existing models, which only employ person body information for feature selection, our model exploits the inner relations between person mask prediction and VI-PReID, thus enabling interactions of person body information across the two tasks.

  • (3)

    A task translation sub-network is put in place to transfer the discriminative person body information, extracted by the person mask prediction sub-branch, into the VI-PReID sub-branch. This reduces the feature discrepancy between the two tasks, thus leading to better exploitation of person body information in the VI-PReID sub-branch.

The rest of this paper is organized as follows. Section 2 briefly reviews previous work on visible image-based person Re-ID and cross-modality person Re-ID, followed by the details of our newly proposed method in Section 3. Section 4 presents experiments that validate the proposed model. Finally, Section 5 concludes the paper.

Section snippets

VV-PReID

In recent years, deep learning methods for VV-PReID have shown significant superiority in retrieval accuracy. Summarizing the vast amount of existing research on VV-PReID is beyond the scope of this paper, and we refer interested readers to [10], [11] for recent surveys on these topics. Many VV-PReID models also try to exploit discriminative person body related information to reduce intra-modality variations. Generally, these models exploit person body related information via

Proposed model

In this section, we elaborate on how we devise our proposed multi-task learning based VI-PReID model to exploit the inner relations between person mask prediction and VI-PReID. As shown in Fig. 3, our proposed multi-task learning based model mainly contains five components, i.e., modality-specific feature extraction, modality-shared feature translation, the person mask prediction branch, the task translation module and the VI-PReID branch. Note that, in our model, the modality-specific

Datasets and evaluation metrics

Datasets. Two publicly available benchmarks (i.e., SYSU-MM01 [21] and RegDB [22]) are employed to evaluate the performance of our proposed network.

SYSU-MM01 is a large-scale and widely used VI-PReID dataset [21]. It employs four visible cameras and two infrared cameras to collect RGB images and IR images, respectively, in both indoor and outdoor environments. It selects 395 person identities as the training set and 96 person identities as the testing set. It contains two evaluation modes in

Conclusion

In this paper, a novel VI-PReID model has been presented to explicitly exploit person body information for enhancing the discriminability of modality-shared features and further reducing cross-modality variations as well as intra-modality variations within a multi-task learning framework. Concretely, via the proposed multi-task learning framework, the inner relations between person mask prediction and VI-PReID are effectively exploited. This further facilitates our proposed VI-PReID model extracting

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


References (46)

  • D. Li et al., A richly annotated pedestrian dataset for person retrieval in real surveillance scenarios, IEEE Trans. Image Process. (2019)
  • M. Ye et al., Bi-directional center-constrained top-ranking for visible thermal person re-identification, IEEE Trans. Inf. Forensics Secur. (2020)
  • Y. Wu et al., Person re-identification by multi-scale feature representation learning with random batch feature mask, IEEE Trans. Cogn. Dev. Syst. (2020)
  • S. Gao et al., Pose-guided visible part matching for occluded person ReID, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020)
  • X. Liang et al., Look into person: joint body parsing & pose estimation network and a new benchmark, IEEE Trans. Pattern Anal. Mach. Intell. (2018)
  • M.M. Kalayeh et al., Human semantic parsing for person re-identification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  • M. Ye et al., Visible thermal person re-identification via dual-constrained top-ranking, Proceedings of the International Joint Conference on Artificial Intelligence (2018)
  • Y. Yang et al., Cross-modality paired-images generation and augmentation for RGB-infrared person re-identification, Neural Netw. (2020)
  • G. Wang et al., RGB-infrared cross-modality person re-identification via joint pixel and feature alignment, Proceedings of the IEEE International Conference on Computer Vision (2019)
  • K. He et al., Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
  • Y. Huang et al., Alleviating modality bias training for infrared-visible person re-identification, IEEE Trans. Multimed. (2021)
  • A. Wu et al., RGB-infrared cross-modality person re-identification, Proceedings of the IEEE International Conference on Computer Vision (2017)
  • D.T. Nguyen et al., Person recognition system based on a combination of body images from visible light and thermal cameras, Sensors (2017)

    Nianchang Huang is currently pursuing the Ph.D. degree in School of Mechano-Electronic Engineering, Xidian University, China. His research interests include deep learning and multimodal image processing in computer vision.

    Kunlong Liu received the B.S. degree from Hefei University of Technology, Hefei, China, in 2019. He is currently pursuing the M.S. degree in control engineering with Xidian University, Xi’an, China. His current research interests include deep learning and computer vision.

    Yang Liu received the B.Eng., M.Sc., and Ph.D. degrees in signal and information processing from Xidian University, Xi'an, China, in 2013, 2015 and 2018, respectively. He is currently a Post-Doctoral Researcher at Xidian University, Xi'an, China. He has authored nearly 30 technical articles in refereed journals and proceedings, including IEEE Trans. Image Process., IEEE Trans. Cybernetics, PR, CVPR, AAAI, and IJCAI. His research interests include dimensionality reduction, pattern recognition, and deep learning.

    Qiang Zhang received the B.S. degree in automatic control, the M.S. degree in pattern recognition and intelligent systems, and the Ph.D. degree in circuit and system from Xidian University, China, in 2001, 2004, and 2008, respectively. He was a Visiting Scholar with the Center for Intelligent Machines, McGill University, Canada. He is currently a professor with the Automatic Control Department, Xidian University, China. His current research interests include image processing and pattern recognition.

    Jungong Han is currently a Full Professor and Chair in Computer Science at Aberystwyth University, UK. His research interests span the fields of video analysis, computer vision and applied machine learning. He has published over 180 papers, including 40+ IEEE Trans and 40+ A* conference papers.

    This work is supported by the National Natural Science Foundation of China under Grant no. 61773301.
