Neurocomputing

Volume 414, 13 November 2020, Pages 303-312

Person re-identification from virtuality to reality via modality invariant adversarial mechanism

https://doi.org/10.1016/j.neucom.2020.06.075

Highlights

  • A modality invariant adversarial mechanism for improving the multi-style Re-ID task.

  • Two new datasets from virtuality to reality for the multi-style Re-ID task.

  • Space transformation and different category classifiers for performance improvement.

Abstract

Person re-identification based on multi-style images helps in crime scene investigation, where only a virtual image (sketch or portrait) of the suspect is available for retrieving possible identities. However, due to the modality gap between multi-style images, standard person re-identification models cannot achieve satisfactory performance when directly applied to match virtual images with real photographs. To address this problem, we propose a modality invariant adversarial mechanism (MIAM) to remove the modality gap between multi-style images. Specifically, the MIAM consists of two parts: a space transformation module that transfers multi-style person images into a modality-invariant space, and an adversarial learning module “played” between a category classifier and a modality classifier to steer the representation learning. The modality classifier discriminates between real and virtual images, while the category classifier predicts the identities of the transformed input images. We further explore the space transformation for data augmentation to bridge the modality gap and boost performance. In addition, we build two new datasets for evaluating multi-style Re-ID. Extensive experimental results demonstrate that the proposed method improves the performance of existing feature learning networks. Further comparisons across the different modules in MIAM show that our approach generalizes well in alleviating the modality gap for multi-style Re-ID.

Introduction

Person re-identification is an important capability for surveillance systems [1], [2], [3]. Traditional person Re-ID aims to match images of the same identity across non-overlapping camera views. It remains a challenging problem due to illumination and pose variations among different views. However, in many cases law enforcement agencies have no photograph of a suspect, and only a virtual image (sketch or portrait) created with the help of an eyewitness or victim is available. Furthermore, with the popularity of image processing software, there is an urgent demand for Re-ID between artificial images and real person images. We describe this problem as multi-style person Re-ID from virtuality to reality, where “virtuality” denotes artificial images with artistic treatment, and “reality” denotes natural images captured by real surveillance systems.

Traditional Re-ID already suffers from severe variations across camera views. Compared to the traditional task, multi-style Re-ID inherits these problems and introduces even more severe challenges. Due to significant differences from the real images captured by surveillance systems, virtual images cannot be easily matched to the correct identity by traditional recognition methods [4]. This problem has been defined in the literature as the modality gap [5]. As shown in Fig. 1, besides the variations among different camera views, the modality gap between different image styles brings extra challenges that limit performance. Because different data sources typically have different statistical properties and distributions, their features are difficult to compare directly for matching [6].

One solution to the modality gap between data sources is data augmentation across sets, such as using CycleGAN [7] to translate images across camera views [8] or datasets [9]. However, such fixed augmentation schemes cannot provide flexible input changes that further promote feature learning. Another representative approach pre-trains a source encoder and then adapts a target encoder so that the two cannot be discriminated from each other [10]; yet the fixed classifier trained on the source domain also lacks generalization for cross-domain recognition [10]. Other works [11], [6] propose adversarial learning on the feature plane to enable flexible retrieval across modalities, but they usually require a pre-trained feature extractor to obtain good performance, which limits their practicality. Furthermore, we observe that adversarial learning on the feature plane cannot fully close the modality gap, as high-level features lack the underlying information of the original data. Therefore, in this paper we bridge the modality gap in the original data plane and propose an end-to-end framework that couples data transformation with classification to solve the multi-style Re-ID task.

To address the above problems, we propose a modality invariant adversarial mechanism (MIAM) for the multi-style Re-ID task, together with two new datasets built for this problem from real surveillance systems. MIAM is designed to handle the challenges caused by multi-style data sources and to improve the performance of existing feature learning networks. We observe that person Re-ID across data sources with multiple image styles can be regarded as a cross-modal task, so we treat the data sources as different modalities and bridge the modality gap within a unified network. Specifically, MIAM first adopts a space transformation module that transfers the data from different sources into a modality-invariant space, removing the inconsistency caused by the modality gap in multi-style Re-ID. Adversarial learning “played” between a category classifier and a modality classifier is then introduced to steer the representation learning, where the modality classifier discriminates between real and virtual images and the category classifier predicts the identities of the transformed input images. We also explore the space transformation for data augmentation to further bridge the modality gap, and evaluate several improved architectures to demonstrate the generalization of our method. The proposed datasets and MIAM model will be made publicly available.
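To make this adversarial interplay concrete, a minimal objective in the standard domain-adversarial form is sketched below. The notation is our own illustrative choice (G: space transformation, C: category classifier, D: modality classifier, λ: trade-off weight, ℓ_ce: cross-entropy) and is not necessarily the exact objective optimized in MIAM:

\min_{G,\,C}\;\max_{D}\;\; \mathbb{E}_{(x,\,y)}\big[\ell_{\mathrm{ce}}\big(C(G(x)),\,y\big)\big] \;-\; \lambda\,\mathbb{E}_{(x,\,m)}\big[\ell_{\mathrm{ce}}\big(D(G(x)),\,m\big)\big],

where x is an input image, y its identity label, and m ∈ {real, virtual} its modality label. Maximizing over D trains the modality classifier to separate real from virtual samples, while minimizing over G and C drives the space transformation to produce identity-discriminative images that D can no longer tell apart.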

The contributions of our work are summarized as follows: (1) we propose a modality invariant adversarial mechanism (MIAM) that incorporates space transformation and adversarial learning to address the modality gap and improve multi-style Re-ID; (2) we build two new virtuality-to-reality datasets for multi-style Re-ID; (3) we explore space transformation for data augmentation and different architectures of the category classifier for further performance improvement.

Section snippets

Related work

Existing approaches to traditional Re-ID can be divided into two categories: feature learning [12], [13], [14], [15], [16], [17], [18], [19], [20] and distance learning [21], [22], [23], [24], [25], [26], both of which learn projections of data items from different camera views into a common feature representation subspace in which their similarity can be assessed directly. However, over-fitting usually occurs when the system does not generalize well to out-of-set examples, especially on

Overview

In this section, we introduce the framework of our MIAM network in detail. As shown in Fig. 2, MIAM combines space transformation and adversarial learning to simultaneously remove the modality gap and improve the performance of existing feature learning networks. The space transformation module consists of an image generator that transfers the original images from the two inconsistent sources into a modality-invariant space. The adversarial learning plays between a category classifier and a modality classifier: the modality classifier discriminates between real and virtual images, while the category classifier predicts the identities of the transformed input images.
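A minimal training-step sketch of this interplay is given below, using toy placeholder modules; the names G, C, and D, the architectures, the loss weight lam, and the optimizer settings are our own hypothetical choices and do not reproduce the paper's implementation:

    import torch
    import torch.nn as nn

    # Illustrative placeholders for the components described above (not the paper's networks):
    # G - space transformation (image generator) mapping multi-style images into a shared space
    # C - category classifier predicting person identity from transformed images
    # D - modality classifier discriminating real vs. virtual transformed images
    NUM_IDS = 100  # hypothetical number of training identities
    G = nn.Conv2d(3, 3, kernel_size=3, padding=1)
    C = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, NUM_IDS))
    D = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 2))

    ce = nn.CrossEntropyLoss()
    opt_gc = torch.optim.Adam(list(G.parameters()) + list(C.parameters()), lr=1e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
    lam = 0.1  # adversarial weight (hypothetical value)

    def train_step(images, id_labels, modality_labels):
        """One adversarial step: D learns to separate modalities; G and C learn
        identity-discriminative, modality-confusing transformed images."""
        # 1) Update the modality classifier D on transformed images (G frozen).
        with torch.no_grad():
            transformed = G(images)
        opt_d.zero_grad()
        loss_d = ce(D(transformed), modality_labels)
        loss_d.backward()
        opt_d.step()

        # 2) Update G and C: minimize identity loss while pushing D toward confusion.
        opt_gc.zero_grad()
        transformed = G(images)
        loss_id = ce(C(transformed), id_labels)
        loss_adv = -ce(D(transformed), modality_labels)  # fool the modality classifier
        (loss_id + lam * loss_adv).backward()
        opt_gc.step()
        return loss_d.item(), loss_id.item()

In the full model the placeholders would be replaced by the actual generator and classifier networks; the sketch only illustrates how the modality classifier and the category classifier pull the space transformation in opposite directions.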

Experiments and results

To verify the effectiveness of our method, we build two new datasets for the virtuality-to-reality Re-ID problem. In this section, detailed experimental results on the two datasets are given to show the performance of our method. We explore different architectures of the category classifier in MIAM and compare their ability to improve existing feature learning networks. Further comparison results on the different modules in MIAM are also given to show the generalization ability of our approach in alleviating the modality gap.

Conclusion

In this paper, we focus on bridging the modality gap to improve performance on the multi-style person re-identification task. MIAM is proposed to deal with the severe modality gap between multi-style images. MIAM comprises two procedures: a space transformation (“G”) that transfers the original images from different data sources into a modality-invariant space, and adversarial learning between a category classifier (“L”) and a modality classifier (“D”) that adjusts the space transformation and steers the representation learning.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported in part by National Natural Science Foundation of China (NSFC, Grant No. 61771303), Science and Technology Commission of Shanghai Municipality (STCSM, Grant Nos. 20DZ1200203, 19DZ1209303, 18DZ1200102, 18DZ2270700), and SJTU-Yitu/Thinkforce Joint laboratory for visual computing and application.


References (49)

  • B. Wang et al., Adversarial cross-modal retrieval
  • J.Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial...
  • Z. Zhong, L. Zheng, Z. Zheng, S. Li, Y. Yang, Camera style adaptation for person re-identification, CoRR...
  • W. Deng et al., Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification
  • E. Tzeng et al., Adversarial discriminative domain adaptation
  • Y. Ganin, V.S. Lempitsky, Unsupervised domain adaptation by backpropagation, in: ICML, 2015, pp....
  • R. Quan et al., Auto-reid: Searching for a part-aware convnet for person re-identification
  • L. Chen, H. Yang, J. Zhu, Q. Zhou, S. Wu, Z. Gao, Deep spatial-temporal fusion network for video-based person...
  • H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, X. Tang, Spindle net: Person re-identification with human...
  • D. Li, X. Chen, Z. Zhang, K. Huang, Learning deep context-aware features over body and latent parts for person...
  • L. Zhao et al., Deeply-learned part-aligned representations for person re-identification
  • J. Zhang et al., Multi-shot pedestrian re-identification via sequential decision making
  • H. Fan et al., Unsupervised person re-identification: Clustering and fine-tuning, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2018
  • J. Lin, L. Ren, J. Lu, J. Feng, J. Zhou, Consistent-aware deep learning for person re-identification in a camera...
LIN CHEN received her B.Sc. (summa cum laude) degree in Electronic Information Science and Technology from Nankai University (NKU), Tianjin, China, in 2015. She is currently pursuing the Ph.D. degree with the Institute of Image Communication and Network Engineering, Department of Electronic Engineering, Shanghai Jiao Tong University (SJTU), Shanghai, China. Her research interests include computer vision, deep learning, video surveillance, and person re-identification. She received the 1st award of the WIDER challenge on person search at ECCV 2018 and the best poster award at IFTC 2018.

HUA YANG received the Ph.D. degree in Communication and Information from Shanghai Jiao Tong University in 2004, and the B.S. and M.S. degrees in Communication and Information from Harbin Engineering University, China, in 1998 and 2001, respectively. She is currently an Associate Professor in the Department of Electronic Engineering, Shanghai Jiao Tong University, China. Her current research interests include video coding and networking, computer vision, and smart video surveillance. She was the supervisor of the 1st award team for the WIDER challenge on person search at ECCV 2018 and received the best poster award at IFTC 2018.

ZHIYONG GAO received the B.S. and M.S. degrees in electrical engineering from the Changsha Institute of Technology (CIT), Changsha, China, in 1981 and 1984, respectively, and the Ph.D. degree from Tsinghua University, Beijing, China, in 1989. From 1994 to 2010, he held several senior technical positions in England, including a Principal Engineer with Snell & Wilcox, Petersfield, U.K., from 1995 to 2000, a Video Architect with 3DLabs, Egham, U.K., from 2000 to 2001, a Consultant Engineer with Sony European Semiconductor Design Center, Basingstoke, U.K., from 2001 to 2004, and a Digital Video Architect with Imagination Technologies, Kings Langley, U.K., from 2004 to 2010. Since 2010, he has been a Professor with Shanghai Jiao Tong University. His research interests include video processing and its implementation, video coding, digital TV, and broadcasting.
