Neighbor similarity and soft-label adaptation for unsupervised cross-dataset person re-identification

doi:10.1016/j.neucom.2019.12.115

Neurocomputing

Volume 388, 7 May 2020, Pages 246-254

https://doi.org/10.1016/j.neucom.2019.12.115

Abstract

Most existing person re-identification algorithms rely on supervised model learning from a large amount of labeled training data for each camera pair. However, manual annotation requires expensive human labor, which limits the application of supervised methods in large-scale real-world deployments. To address this problem, we formulate a Neighbor Similarity and Soft-label Adaptation (NSSA) algorithm that transfers supervised information from a source domain to a new unlabeled target dataset. Specifically, we introduce a distance metric on the target domain that incorporates inner-domain neighbor similarity and an inter-domain soft-label adapted from the source domain. Unlabeled samples that are close under this metric are considered to share the same pseudo-id and are selected to fine-tune the model. Training is performed iteratively to incorporate more credible sample pairs and incrementally improve the model. Extensive experimental results validate the superiority of the proposed NSSA algorithm, which significantly outperforms state-of-the-art unsupervised and domain adaptation re-identification methods.

Introduction

Person re-identification (Re-ID) plays an important role in surveillance video analysis because of its wide range of real-world applications, such as searching for a target person and analyzing crowd flow. Given a query pedestrian image, the task is to match the same pedestrian across multiple non-overlapping cameras. Despite their variety, re-id methods share a common goal: learning a mapping from image space to feature space that pulls images of the same identity close to each other while pushing those of different identities apart.

Deep neural networks (DNNs) have shown prominent advantages in representation learning and have proven highly effective in supervised person re-identification [1], [4], [7], [21], [32], [33], [41], [46], [47]. Given a manually labeled identity for each image, an objective function based on similarity (e.g. pairwise [33], triplet [25] or quadruplet [3] loss) or classification (e.g. Softmax [39], [49] or OIM [40] loss) is applied to train a DNN model, which learns a feature representation of person images. However, manual annotation requires expensive human labor, especially in long-period multi-camera scenarios: for each person appearing in one camera, an annotator must search all other cameras to find out whether the person appears again. This annotation cost limits the application and expansion of supervised re-id methods in large-scale real-world scenarios.
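To make the similarity-based objectives mentioned above concrete, the following is a minimal sketch of a triplet loss on plain feature vectors. It is an illustration only: the function names and the margin value of 0.3 are assumptions chosen for this example, not settings reported in the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge-style triplet loss.

    The loss is zero once the anchor is closer to the positive (same
    identity) than to the negative (different identity) by at least
    `margin`. The margin value is an illustrative assumption.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

A triplet whose positive is already much closer than its negative contributes no loss, while a violated triplet is penalized in proportion to the violation.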

Hence, unsupervised cross-dataset re-id algorithms have been proposed in recent years. Given an annotated source dataset, the task is to learn a discriminative feature representation on the target dataset without any labels. This is challenging because of the gap between the source and target domains, including camera viewpoint, image quality, and changes in clothing style across regions and seasons. A common approach to cross-dataset problems is unsupervised domain adaptation, but it assumes that the source and target domains share the same set of classes. This assumption does not hold for person re-id, because the source and target datasets usually contain entirely different identities. A few methods [23], [34] assume that the datasets share the same semantic attributes, which can be learned from the source domain and transferred to the target domain. Other methods [6], [36], [52] train an image-to-image translation model and generate images with identities on the target domain.

In this work, we follow the self-supervised methods [9], [20] and propose Neighbor Similarity and Soft-label Adaptation (NSSA), a simple yet effective algorithm for the unsupervised cross-dataset person re-id problem. The method is illustrated in Fig. 1. First, a feature representation model $F$ is trained on the labeled source dataset and then applied to the target domain to extract features for every target sample. Because the target dataset has no supervised labels, the self-supervised mechanism relies on the initial feature distances to find similar pairs. Besides the commonly used Euclidean distance, we introduce a fused distance metric that also incorporates inner-domain neighbor similarity and an inter-domain soft-label, to obtain a better similarity measure. We then select samples under the assumption that a feature pair with a smaller distance has a higher probability of sharing the same identity label. The most credible samples, which are associated by a pseudo-id, are selected to fine-tune the model $F$. These steps are performed iteratively to incrementally improve the discriminability of model $F$.
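The iterative procedure described above can be outlined as a short loop. This is a structural sketch only: `extract`, `pairwise_distance`, `select` and `fine_tune` are hypothetical callables supplied by the caller and stand in for the feature extractor, the distance metric, the sample selection rule and the fine-tuning step. None of these names come from the paper.

```python
def self_training(model, target_images, extract, pairwise_distance,
                  select, fine_tune, num_iterations=3):
    """Skeleton of an iterative self-supervised adaptation loop.

    All four callables are hypothetical placeholders:
      extract(model, image)    -> feature vector
      pairwise_distance(feats) -> square distance matrix
      select(dist, iteration)  -> list of credible (i, j) pairs
      fine_tune(model, pairs)  -> updated model
    """
    history = []
    for iteration in range(num_iterations):
        # Re-extract target features with the current model.
        feats = [extract(model, image) for image in target_images]
        # Measure similarity between all target samples.
        dist = pairwise_distance(feats)
        # Keep only the pairs considered credible at this iteration.
        pairs = select(dist, iteration)
        # Update the model on the selected pseudo-labeled pairs.
        model = fine_tune(model, pairs)
        history.append(len(pairs))
    return model, history
```

Passing the components in as arguments keeps the loop independent of any particular network or metric, so each piece can be swapped without touching the schedule.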

It is worth noting that the fused distance metric in our method consists of three components: (1) The vanilla Euclidean distance between features, which is the only metric used in previous self-supervised cross-dataset re-id methods [9], [20]. (2) The inner-domain neighbor similarity, which explores the topological relationship between feature points and exploits the prior that features from the same identity should be close to each other and have similar neighbors. (3) The inter-domain soft-label, which spreads identity labels from the source domain to the target domain according to the cross-domain point-wise distance. The soft-label provides effective supervised information on the unlabeled target domain.
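The three components can be combined in many ways, and this excerpt does not give the exact formulas. The sketch below is therefore one plausible instantiation under stated assumptions, not the paper's definition: neighbor similarity is taken as the Jaccard overlap of k-nearest-neighbor sets, the soft-label as a softmax over negative distances to source samples accumulated per source identity, and the fusion as a weighted sum. The weights `w_nbr` and `w_soft`, the neighborhood size `k` and the `temperature` are all hypothetical parameters.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(index, feats, k):
    """Indices of the k nearest points to feats[index], excluding itself."""
    others = [j for j in range(len(feats)) if j != index]
    others.sort(key=lambda j: euclidean(feats[index], feats[j]))
    return set(others[:k])


def neighbor_dissimilarity(i, j, feats, k):
    """Assumed form: 1 - Jaccard overlap of the two k-NN sets.

    Each point is counted as a member of its own neighborhood.
    """
    ni = knn(i, feats, k) | {i}
    nj = knn(j, feats, k) | {j}
    return 1.0 - len(ni & nj) / len(ni | nj)


def soft_label(x, src_feats, src_ids, num_ids, temperature=1.0):
    """Assumed form: softmax of negative distances, summed per source id."""
    weights = [math.exp(-euclidean(x, s) / temperature) for s in src_feats]
    total = sum(weights)
    label = [0.0] * num_ids
    for w, pid in zip(weights, src_ids):
        label[pid] += w / total
    return label


def soft_label_distance(p, q):
    """Half the L1 distance between two soft-labels, which lies in [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def fused_distance(i, j, tgt_feats, src_feats, src_ids, num_ids,
                   k=2, w_nbr=1.0, w_soft=1.0):
    """Weighted sum of the three terms (the weighting is an assumption)."""
    d_euc = euclidean(tgt_feats[i], tgt_feats[j])
    d_nbr = neighbor_dissimilarity(i, j, tgt_feats, k)
    p = soft_label(tgt_feats[i], src_feats, src_ids, num_ids)
    q = soft_label(tgt_feats[j], src_feats, src_ids, num_ids)
    return d_euc + w_nbr * d_nbr + w_soft * soft_label_distance(p, q)
```

In this form, two target samples are close only if they are near each other in feature space, share most of their neighbors, and resemble the same source identities, which is the intuition the three components express.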

The unlabeled samples are brought into the training schedule progressively. The initial model $F$ is suboptimal on the target dataset, so not all feature points can be assigned an accurate pseudo-id in the first iteration. Therefore, only the samples with higher confidence are selected to fine-tune the model $F$. As the model improves during training, more credible samples are added to the training set.
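A minimal sketch of such a progressive schedule is given below. It assumes, purely for illustration, that confidence is measured by pairwise distance, that the share of selected pairs grows linearly with the iteration index, and that samples linked by a selected pair receive the same pseudo-id through a union-find grouping. The function names and the `start`, `step` and `cap` values are hypothetical.

```python
def fraction_at(iteration, start=0.2, step=0.1, cap=1.0):
    """Share of candidate pairs to keep at a given iteration (assumed linear)."""
    return min(cap, start + step * iteration)


def select_pairs(dist, fraction):
    """Return the given fraction of pairs (i, j), i < j, with smallest distance."""
    n = len(dist)
    pairs = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    keep = int(len(pairs) * fraction)
    return [(i, j) for _, i, j in pairs[:keep]]


def assign_pseudo_ids(n, pairs):
    """Group samples linked by selected pairs using union-find.

    Only samples that appear in at least one selected pair receive a
    pseudo-id; the rest are left out of this training round.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)
    linked = {s for pair in pairs for s in pair}
    return {s: find(s) for s in sorted(linked)}
```

Early iterations keep only the tightest pairs, so a weak initial model contributes few wrong pseudo-ids. Later iterations admit more pairs as the distances become more reliable.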

The remainder of the paper is organized as follows: Related works are reviewed in Section 2. In Section 3, we discuss the detailed implementation of our proposed model. The experimental analysis and comparison with the state-of-the-art methods are presented in Section 4.

Section snippets

Related works

Related works of the proposed method can be summarized into three categories: unsupervised person re-id, semi-supervised learning, and curriculum learning. We will explain the connections and differences between NSSA and these methods in the corresponding aspects.

Overview

We will introduce the proposed Neighbor Similarity and Soft-label Adaptation algorithm in this section. The architecture of the method is illustrated in Fig. 2, which contains 4 steps:

• Step (1). Model pre-training by supervised learning on the labeled source dataset. In this step, a person feature embedding model is trained by classification loss and domain adaptation loss. However, it is suboptimal on the target dataset due to the domain gap, so it needs to be fine-tuned.
• Step (2). Feature

Datasets and settings

Datasets and evaluation protocol. We evaluate our proposed model on three person re-id datasets: Market-1501 [48], DukeMTMC [31] and MSMT17 [37]. Market-1501 contains 32,668 labeled images of 1501 people, captured by 6 cameras. The standard training/test split (750 / 751 ids) and the single-query setting are adopted in our experiments. DukeMTMC has 8 cameras and 1404 identities with 36,411 images. Half of the identities are used for training and the other half for testing. MSMT17 has 4101

Conclusion

In this paper, we propose a simple yet effective algorithm, NSSA, for unsupervised cross-dataset person re-identification. We introduce the inner-domain neighbor similarity and the inter-domain soft-label adaptation to explore the topological relationship in addition to the vanilla Euclidean distance. The representation ability of the feature model improves through the iterative training process. Extensive experimental results on three real-world datasets demonstrate the advantage of the proposed model over

Declaration of Competing Interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled.

Acknowledgements

This paper is partially supported by NSFC (No. 61772330, 61533012, 61876109), the pre-research project (No. 61403120201), Shanghai authentication key Lab. (2017XCWZK01), and Technology Committee the interdisciplinary Program of Shanghai Jiao Tong University (YG2019QNA09).

Yiru Zhao received the B.S. degree in computer science from Tongji University, China, in 2015. He is currently pursuing the Ph.D. degree at Shanghai Jiao Tong University, China. His research interests include deep learning, image retrieval and machine learning.

References (56)

S. Ding et al.
Deep feature learning with relative distance comparison for person re-identification
Pattern Recognit.
(2015)
X. Yang et al.
Enhancing person re-identification in a self-trained subspace
ACM Trans. Multimed. Comput. Commun. Appl. (TOMM)
(2017)
E. Ahmed et al.
An improved deep learning architecture for person re-identification
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(2015)
Y. Bengio et al.
Curriculum learning
Proceedings of the 26th Annual International Conference on Machine Learning
(2009)
W. Chen et al.
Beyond triplet loss: a deep quadruplet network for person re-identification
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(2017)
D. Cheng et al.
Person re-identification by multi-channel parts-based CNN with improved triplet loss function
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(2016)
D.S. Cheng et al.
Custom pictorial structures for re-identification.
Proceedings of the BMVC
(2011)
W. Deng et al.
Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person reidentification
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(2018)
M. Ester et al.
A density-based algorithm for discovering clusters in large spatial databases with noise.
Proceedings of the KDD
(1996)
H. Fan et al.
Unsupervised person re-identification: clustering and fine-tuning
ACM Trans. Multimed. Comput. Commun. Appl. (TOMM)
(2018)
M. Farenzena et al.
Person re-identification by symmetry-driven accumulation of local features
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(2010)
I. Goodfellow et al.
Generative adversarial nets
Advances in Neural Information Processing Systems
(2014)
A. Gretton et al.
A kernel method for the two-sample-problem
Advances in Neural Information Processing Systems
(2007)
K. He et al.
Deep residual learning for image recognition
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(2016)
A. Hermans, L. Beyer, B. Leibe, In defense of the triplet loss for person re-identification, arXiv:1703.07737...
L. Jiang et al.
Self-paced learning with diversity
Advances in Neural Information Processing Systems
(2014)
E. Kodirov et al.
Person re-identification by unsupervised l₁ graph learning
Proceedings of the European Conference on Computer Vision
(2016)
E. Kodirov et al.
Dictionary learning with iterative Laplacian regularisation for unsupervised person re-identification.
Proceedings of the BMVC
(2015)
A. Krizhevsky et al.
Imagenet classification with deep convolutional neural networks
Advances in Neural Information Processing Systems
(2012)
M.P. Kumar et al.
Self-paced learning for latent variable models
Advances in Neural Information Processing Systems
(2010)
M. Li et al.
Unsupervised person re-identification by deep learning Tracklet association
Proceedings of the European Conference on Computer Vision
(2018)
W. Li et al.
DeepReID: Deep filter pairing neural network for person re-identification
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(2014)
Y.-J. Li et al.
Adaptation and re-identification network: an unsupervised deep transfer learning approach to person re-identification
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops
(2018)
S. Lin, H. Li, C.-T. Li, A.C. Kot, Multi-task mid-level feature alignment network for unsupervised cross-dataset person...
G. Lisanti et al.
Person re-identification by iterative re-weighted sparse ranking
IEEE Trans. Pattern Anal. Mach. Intell.
(2015)
H. Liu et al.
End-to-end comparative attention networks for person re-identification
IEEE Trans. Image Process.
(2017)
X. Liu et al.
Semi-supervised coupled dictionary learning for person re-identification
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(2014)
M. Long et al.
Learning transferable features with deep adaptation networks
Proceedings of the International Conference on Machine Learning
(2015)


Hongtao Lu is now a Professor in the Department of Computer Science and Engineering, Shanghai Jiao Tong University, China. His current research interests include computer vision, deep learning and machine learning. He has authored or co-authored more than 100 papers in journals and premier conferences.
