Neurocomputing

Volume 414, 13 November 2020, Pages 303-312

Person re-identification from virtuality to reality via modality invariant adversarial mechanism

https://doi.org/10.1016/j.neucom.2020.06.075

Highlights

  • A modality invariant adversarial mechanism for improving the multi-style Re-ID task.

  • Two new datasets from virtuality to reality for the multi-style Re-ID task.

  • Space transformation and different category classifiers for performance improvement.

Abstract

Person re-identification based on multi-style images helps in crime scene investigation, where only a virtual image (sketch or portrait) of the suspect is available for retrieving possible identities. However, due to the modality gap between multi-style images, standard person re-identification models cannot achieve satisfactory performance when directly applied to match virtual images with real photographs. To address this problem, we propose a modality invariant adversarial mechanism (MIAM) to remove the modality gap between multi-style images. Specifically, the MIAM consists of two parts: a space transformation module that transfers multi-style person images into a modality-invariant space, and an adversarial learning module “played” between a category classifier and a modality classifier to steer the representation learning. The modality classifier discriminates between real and virtual images, while the category classifier predicts the identities of the transformed input images. We further explore the space transformation for data augmentation to bridge the modality gap and boost performance. In addition, we build two new datasets for evaluating multi-style Re-ID. Extensive experimental results demonstrate that the proposed method improves the performance of existing feature learning networks. Further comparisons across the different modules in MIAM show that our approach generalizes well in alleviating the modality gap for multi-style Re-ID.

Introduction

Person re-identification is an important capability for surveillance systems [1], [2], [3]. Traditional person Re-ID aims to match images of the same identity across non-overlapping camera views. It remains a challenging problem due to illumination and pose variations among different views. However, in many cases law enforcement agencies have no photograph of a suspect, and only a virtual image (sketch or portrait) created with the help of an eyewitness or victim is available. Furthermore, with the popularity of image processing software, there is an urgent demand for Re-ID between artificial images and real person images. We describe this problem as multi-style person Re-ID from virtuality to reality, where “virtuality” denotes artificial images with artistic treatment, and “reality” denotes natural images captured by real surveillance systems.

Traditional Re-ID already suffers from severe variations across camera views. Compared to the traditional task, multi-style Re-ID inherits these problems and introduces even more severe challenges. Due to significant differences from the real images captured by surveillance systems, virtual images cannot be easily matched to the correct identity by traditional recognition methods [4]. This problem has been defined in the literature as the modality gap [5]. As shown in Fig. 1, besides the variations among different camera views, the modality gap between different image styles brings extra challenges that limit performance. Because different data sources typically have different statistical properties and distributions, their features are difficult to compare directly for matching [6].

One solution to the modality gap between data sources is data augmentation across sets, such as using CycleGAN [7] to translate images across camera views [8] or datasets [9]. However, such fixed augmentation schemes cannot provide flexible input changes that further promote feature learning. Another representative approach pre-trains a source encoder and then adapts a target encoder so that the two cannot be discriminated from each other [10]; yet the fixed classifier trained on the source domain also lacks generalization for cross-domain recognition [10]. Other works [11], [6] propose adversarial learning on the feature plane to enable flexible retrieval across modalities, but they usually require a pre-trained feature extractor to obtain good performance, which limits their practicality. Furthermore, we observe that adversarial learning on the feature plane cannot fully close the modality gap, as high-level features lack the underlying information of the original data. Therefore, in this paper we bridge the modality gap in the original data plane and propose an end-to-end framework that couples data transformation with classification to solve the multi-style Re-ID task.

To address the above problems, we propose a modality invariant adversarial mechanism (MIAM) for the multi-style Re-ID task, together with two new datasets built for this problem from real surveillance systems. MIAM is designed to handle the challenges caused by multi-style data sources and to improve the performance of existing feature learning networks. We observe that person Re-ID across data sources with multiple image styles can be regarded as a cross-modal task, so we treat the data sources as different modalities and bridge the modality gap within a unified network. Specifically, MIAM first adopts a space transformation module that transfers the data from different sources into a modality-invariant space, removing the inconsistency caused by the modality gap in multi-style Re-ID. Adversarial learning “played” between a category classifier and a modality classifier is then introduced to steer the representation learning, where the modality classifier discriminates between real and virtual images and the category classifier predicts the identities of the transformed input images. We also explore the space transformation for data augmentation to further bridge the modality gap, and evaluate several improved architectures to demonstrate the generalization of our method. The proposed datasets and MIAM model will be made publicly available.
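To make this adversarial interplay concrete, a minimal objective in the standard domain-adversarial form is sketched below. The notation is our own illustrative choice (G: space transformation, C: category classifier, D: modality classifier, λ: trade-off weight, ℓ_ce: cross-entropy) and is not necessarily the exact objective optimized in MIAM:

\min_{G,\,C}\;\max_{D}\;\; \mathbb{E}_{(x,\,y)}\big[\ell_{\mathrm{ce}}\big(C(G(x)),\,y\big)\big] \;-\; \lambda\,\mathbb{E}_{(x,\,m)}\big[\ell_{\mathrm{ce}}\big(D(G(x)),\,m\big)\big],

where x is an input image, y its identity label, and m ∈ {real, virtual} its modality label. Maximizing over D trains the modality classifier to separate real from virtual samples, while minimizing over G and C drives the space transformation to produce identity-discriminative images that D can no longer tell apart.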

The contributions of our work are summarized as follows: (1) we propose a modality invariant adversarial mechanism (MIAM) that incorporates space transformation and adversarial learning to address the modality gap and improve multi-style Re-ID; (2) we build two new virtuality-to-reality datasets for multi-style Re-ID; (3) we explore space transformation for data augmentation and different architectures of the category classifier for further performance improvement.

Section snippets

Related work

Existing approaches to traditional Re-ID can be divided into two categories: feature learning [12], [13], [14], [15], [16], [17], [18], [19], [20] and distance learning [21], [22], [23], [24], [25], [26], both of which learn projections of data items from different camera views into a common feature representation subspace in which their similarity can be assessed directly. However, over-fitting usually occurs when the system does not generalize well to out-of-set examples, especially on

Overview

In this section, we introduce the framework of our MIAM network in detail. As shown in Fig. 2, MIAM combines space transformation and adversarial learning to simultaneously remove the modality gap and improve the performance of existing feature learning networks. The space transformation module consists of an image generator that transfers the original images from the two inconsistent sources into a modality-invariant space. The adversarial learning plays between a category classifier and a modality classifier: the modality classifier discriminates between real and virtual images, while the category classifier predicts the identities of the transformed input images.
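A minimal training-step sketch of this interplay is given below, using toy placeholder modules; the names G, C, and D, the architectures, the loss weight lam, and the optimizer settings are our own hypothetical choices and do not reproduce the paper's implementation:

    import torch
    import torch.nn as nn

    # Illustrative placeholders for the components described above (not the paper's networks):
    # G - space transformation (image generator) mapping multi-style images into a shared space
    # C - category classifier predicting person identity from transformed images
    # D - modality classifier discriminating real vs. virtual transformed images
    NUM_IDS = 100  # hypothetical number of training identities
    G = nn.Conv2d(3, 3, kernel_size=3, padding=1)
    C = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, NUM_IDS))
    D = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 2))

    ce = nn.CrossEntropyLoss()
    opt_gc = torch.optim.Adam(list(G.parameters()) + list(C.parameters()), lr=1e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
    lam = 0.1  # adversarial weight (hypothetical value)

    def train_step(images, id_labels, modality_labels):
        """One adversarial step: D learns to separate modalities; G and C learn
        identity-discriminative, modality-confusing transformed images."""
        # 1) Update the modality classifier D on transformed images (G frozen).
        with torch.no_grad():
            transformed = G(images)
        opt_d.zero_grad()
        loss_d = ce(D(transformed), modality_labels)
        loss_d.backward()
        opt_d.step()

        # 2) Update G and C: minimize identity loss while pushing D toward confusion.
        opt_gc.zero_grad()
        transformed = G(images)
        loss_id = ce(C(transformed), id_labels)
        loss_adv = -ce(D(transformed), modality_labels)  # fool the modality classifier
        (loss_id + lam * loss_adv).backward()
        opt_gc.step()
        return loss_d.item(), loss_id.item()

In the full model the placeholders would be replaced by the actual generator and classifier networks; the sketch only illustrates how the modality classifier and the category classifier pull the space transformation in opposite directions.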

Experiments and results

To verify the effectiveness of our method, we build two new datasets for the virtuality-to-reality Re-ID problem. In this section, detailed experimental results on the two datasets are given to show the performance of our method. We explore different architectures of the category classifier in MIAM and compare their ability to improve existing feature learning networks. Further comparison results on the different modules in MIAM are also given to show the generalization ability of our approach in alleviating the modality gap.

Conclusion

In this paper, we focus on bridging the modality gap to improve performance on the multi-style person re-identification task. MIAM is proposed to deal with the severe modality gap between multi-style images. MIAM comprises two procedures: a space transformation (“G”) that transfers the original images from different data sources into a modality-invariant space, and adversarial learning between a category classifier (“L”) and a modality classifier (“D”) that adjusts the space transformation and steers the representation learning.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported in part by National Natural Science Foundation of China (NSFC, Grant No. 61771303), Science and Technology Commission of Shanghai Municipality (STCSM, Grant Nos. 20DZ1200203, 19DZ1209303, 18DZ1200102, 18DZ2270700), and SJTU-Yitu/Thinkforce Joint laboratory for visual computing and application.


References (49)

  • B. Wang et al., Adversarial cross-modal retrieval
  • J.Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial...
  • Z. Zhong, L. Zheng, Z. Zheng, S. Li, Y. Yang, Camera style adaptation for person re-identification, CoRR...
  • W. Deng et al., Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification
  • E. Tzeng et al., Adversarial discriminative domain adaptation
  • Y. Ganin, V.S. Lempitsky, Unsupervised domain adaptation by backpropagation, in: ICML, 2015, pp....
  • R. Quan et al., Auto-reid: Searching for a part-aware convnet for person re-identification
  • L. Chen, H. Yang, J. Zhu, Q. Zhou, S. Wu, Z. Gao, Deep spatial-temporal fusion network for video-based person...
  • H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, X. Tang, Spindle net: Person re-identification with human...
  • D. Li, X. Chen, Z. Zhang, K. Huang, Learning deep context-aware features over body and latent parts for person...
  • L. Zhao et al., Deeply-learned part-aligned representations for person re-identification
  • J. Zhang et al., Multi-shot pedestrian re-identification via sequential decision making
  • H. Fan et al., Unsupervised person re-identification: Clustering and fine-tuning, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2018
  • J. Lin, L. Ren, J. Lu, J. Feng, J. Zhou, Consistent-aware deep learning for person re-identification in a camera...
LIN CHEN received her B.Sc. (summa cum laude) degree in Electronic Information Science and Technology from Nankai University (NKU), Tianjin, China, in 2015. She is currently pursuing the Ph.D. degree with the Institute of Image Communication and Network Engineering, Department of Electronic Engineering, Shanghai Jiao Tong University (SJTU), Shanghai, China. Her research interests include computer vision, deep learning, video surveillance, and person re-identification. She received the 1st award of the WIDER challenge on person search at ECCV 2018 and the best poster award at IFTC 2018.

HUA YANG received the Ph.D. degree in Communication and Information from Shanghai Jiao Tong University in 2004, and the B.S. and M.S. degrees in Communication and Information from Harbin Engineering University, China, in 1998 and 2001, respectively. She is currently an Associate Professor in the Department of Electronic Engineering, Shanghai Jiao Tong University, China. Her current research interests include video coding and networking, computer vision, and smart video surveillance. She was the supervisor of the 1st award team for the WIDER challenge on person search at ECCV 2018 and received the best poster award at IFTC 2018.

ZHIYONG GAO received the B.S. and M.S. degrees in electrical engineering from the Changsha Institute of Technology (CIT), Changsha, China, in 1981 and 1984, respectively, and the Ph.D. degree from Tsinghua University, Beijing, China, in 1989. From 1994 to 2010, he held several senior technical positions in England, including a Principal Engineer with Snell & Wilcox, Petersfield, U.K., from 1995 to 2000, a Video Architect with 3DLabs, Egham, U.K., from 2000 to 2001, a Consultant Engineer with Sony European Semiconductor Design Center, Basingstoke, U.K., from 2001 to 2004, and a Digital Video Architect with Imagination Technologies, Kings Langley, U.K., from 2004 to 2010. Since 2010, he has been a Professor with Shanghai Jiao Tong University. His research interests include video processing and its implementation, video coding, digital TV, and broadcasting.
