Elsevier

Neurocomputing

Volume 388, 7 May 2020, Pages 246-254

Neighbor similarity and soft-label adaptation for unsupervised cross-dataset person re-identification

https://doi.org/10.1016/j.neucom.2019.12.115

Abstract

Most existing person re-identification algorithms rely on supervised model learning from a large amount of labeled training data for each camera pair. However, manual annotation requires expensive human labor, which limits the application of supervised methods in large-scale real-world deployments. To address this problem, we formulate a Neighbor Similarity and Soft-label Adaptation (NSSA) algorithm that transfers supervised information from a labeled source domain to a new unlabeled target dataset. Specifically, we introduce a distance metric on the target domain that incorporates inner-domain neighbor similarity and an inter-domain soft-label adapted from the source domain. Unlabeled samples that are close under this metric are considered to share the same pseudo-id and are selected to fine-tune the model. Training is performed iteratively to incorporate more credible sample pairs and incrementally improve the model. Extensive experimental results validate the superiority of the proposed NSSA algorithm, which significantly outperforms state-of-the-art unsupervised and domain adaptation re-identification methods.

Introduction

Person re-identification (Re-ID) plays an important role in surveillance video analysis because of its wide range of real-world applications, such as searching for a target person and analyzing the trajectories of crowd flow. Given a query pedestrian image, the task is to match the same pedestrian across multiple non-overlapping cameras. Although re-id methods take various forms, they share the common goal of learning an optimal mapping from image space to feature space that pulls images of the same identity close to each other while pushing those of different identities apart.
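As a minimal illustration of the matching task described above, the sketch below ranks gallery features by Euclidean distance to a query feature. The function name `rank_gallery` and the toy two-dimensional features are assumptions made for illustration; they are not taken from the paper.

```python
import math


def rank_gallery(query, gallery):
    """Return gallery indices ordered from smallest to largest distance to the query."""
    return sorted(range(len(gallery)), key=lambda i: math.dist(query, gallery[i]))


# Toy features standing in for the output of a learned embedding model.
gallery = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.0]]
ranking = rank_gallery([1.0, 0.05], gallery)
print(ranking)
```

In a real system the features would come from the trained embedding model, and the top-ranked gallery images would be returned as candidate matches.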

Deep neural networks (DNNs) have shown prominent advantages in representation learning and have proven highly effective in supervised person re-identification [1], [4], [7], [21], [32], [33], [41], [46], [47]. Given a manually labeled identity for each image, an objective function based on similarity (e.g. pairwise [33], triplet [25] or quadruplet [3] loss) or classification (e.g. Softmax [39], [49] or OIM [40] loss) is applied to train a DNN model, which learns an optimal feature representation of person images. However, manual annotation requires expensive human labor, especially in long-period multi-camera scenarios: for each person appearing in one camera, an annotator must search all other cameras to find out whether the person appears again. This annotation cost limits the application and expansion of supervised re-id methods in large-scale real-world scenarios.
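For reference, the triplet loss mentioned above is commonly written in the following form. The notation is assumed here for illustration and is not taken from this paper: $f$ is the embedding model, $x_a$ an anchor image, $x_p$ an image of the same identity, $x_n$ an image of a different identity, and $m$ a margin.

```latex
\mathcal{L}_{\mathrm{tri}}
  = \max\bigl(0,\;
      \lVert f(x_a) - f(x_p) \rVert_2
    - \lVert f(x_a) - f(x_n) \rVert_2
    + m \bigr)
```

The loss is zero once the negative sample is at least $m$ farther from the anchor than the positive sample, which is one way to express the goal of pulling same-identity images together and pushing different identities apart.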

Hence, unsupervised cross-dataset re-id algorithms have been proposed in recent years. Given an annotated source dataset, this task aims at learning a discriminative feature representation on the target dataset without any labels. This is a challenging problem because of the domain gap between source and target, including differences in camera views, image quality, and dressing styles across regions and seasons. A common approach to cross-dataset problems is unsupervised domain adaptation, but it assumes that the source and target domains share the same set of classes. This assumption does not hold for person re-id because the source and target datasets usually contain entirely different identities. A few methods [23], [34] assume that the datasets share the same semantic attributes, which can be learned from the source domain and transferred to the target domain. Other methods [6], [36], [52] train an image-to-image translation model and generate images with identities in the target domain.

In this work, we follow the self-supervised methods [9], [20] and propose Neighbor Similarity and Soft-label Adaptation (NSSA), a simple yet effective algorithm for the unsupervised cross-dataset person re-id problem. The method is illustrated in Fig. 1. First, a feature representation model F is trained on the labeled source dataset and then applied to the target domain to extract features. Because the target dataset has no supervised labels, the self-supervised mechanism relies on the initial feature distances to obtain similar pairs. Besides the commonly used Euclidean distance, we introduce a fused distance metric that includes inner-domain neighbor similarity and an inter-domain soft-label to achieve a better similarity measure. We then select samples under the assumption that a feature pair with a smaller distance has a higher probability of sharing the same identity label. The most credible samples, which are associated by pseudo-id, are selected to fine-tune the model F. These steps are performed iteratively to incrementally improve the discriminability of model F.

It is worth noting that the fused distance metric in our method consists of three parts: (1) The vanilla Euclidean distance between features, which is the only distance used in previous self-supervised cross-dataset re-id methods [9], [20]. (2) The inner-domain neighbor similarity, which explores the topological relationship between feature points and exploits the prior that features of the same identity should be close to each other and have similar neighbors. (3) The inter-domain soft-label, which propagates identity labels from the source domain to the target domain according to the cross-domain point-wise distance. The soft-label provides effective supervised information on the unlabeled target domain.
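The three components can be sketched as below. This is an illustrative sketch only, not the paper's formulation: the Jaccard distance over k-nearest-neighbor sets, the softmax over negative distances to source identity centroids, the additive fusion, and the names `w_nbr`, `w_soft`, `temperature` and `source_centroids` are all assumptions, since the exact definitions are not given in this excerpt.

```python
import math


def knn_sets(feats, k):
    """For each point, the set holding itself and its k nearest other points."""
    sets = []
    for i, fi in enumerate(feats):
        others = sorted((j for j in range(len(feats)) if j != i),
                        key=lambda j: math.dist(fi, feats[j]))
        sets.append(set(others[:k]) | {i})
    return sets


def jaccard_distance(s, t):
    """1 - |s & t| / |s | t|; small when two points share most neighbours."""
    union = s | t
    return 1.0 - len(s & t) / len(union) if union else 0.0


def soft_labels(feats, source_centroids, temperature=1.0):
    """Assumed soft-label: softmax over negative distances to source centroids."""
    labels = []
    for f in feats:
        logits = [-math.dist(f, c) / temperature for c in source_centroids]
        top = max(logits)
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        labels.append([e / total for e in exps])
    return labels


def fused_distance(feats, source_centroids, k=2, w_nbr=1.0, w_soft=1.0):
    """Pairwise matrix: Euclidean + weighted neighbour term + weighted soft-label term."""
    n = len(feats)
    nbrs = knn_sets(feats, k)
    soft = soft_labels(feats, source_centroids)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            value = (math.dist(feats[i], feats[j])
                     + w_nbr * jaccard_distance(nbrs[i], nbrs[j])
                     + w_soft * math.dist(soft[i], soft[j]))
            dist[i][j] = dist[j][i] = value
    return dist
```

Under these assumptions, two target features that are close in Euclidean distance, share neighbors, and receive similar soft-labels obtain a small fused distance, while pairs that agree on only one of the three cues are pushed further apart.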

The unlabeled data samples are progressively added to the training schedule. The initial model F is suboptimal on the target dataset, so not all feature points can be assigned an accurate pseudo-id in the first iteration. Therefore, only the samples with higher confidence are selected to fine-tune the model F. As the model improves during training, more credible samples are added to the training set.
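A minimal sketch of this progressive selection is given below. It assumes a precomputed pairwise distance matrix and a linearly growing selection fraction; the function names, the schedule, and the grouping of selected pairs into pseudo-ids by connected components are hypothetical choices for illustration, not the procedure specified in the paper.

```python
def select_credible_pairs(dist, fraction):
    """Keep the given fraction of pairs (i, j), i < j, with the smallest distance."""
    n = len(dist)
    pairs = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    keep = max(1, int(len(pairs) * fraction)) if pairs else 0
    return [(i, j) for _, i, j in pairs[:keep]]


def assign_pseudo_ids(n, pairs):
    """Group selected samples into pseudo-ids via connected components (assumed)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)
    selected = sorted({idx for pair in pairs for idx in pair})
    return {idx: find(idx) for idx in selected}


def selection_schedule(start, step, iterations, cap=1.0):
    """Fraction of pairs admitted at each iteration, growing linearly up to cap."""
    return [min(cap, start + t * step) for t in range(iterations)]


# Toy distance matrix for three target samples.
dist = [[0.0, 1.0, 5.0],
        [1.0, 0.0, 4.0],
        [5.0, 4.0, 0.0]]
for fraction in selection_schedule(0.34, 0.33, 3):
    pairs = select_credible_pairs(dist, fraction)
    print(fraction, pairs, assign_pseudo_ids(len(dist), pairs))
```

In an actual training loop, the model would be fine-tuned on the selected samples after each selection step, and the distances would be recomputed from the updated features before the next iteration.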

The remainder of the paper is organized as follows: Related works are reviewed in Section 2. In Section 3, we discuss the detailed implementation of our proposed model. The experimental analysis and comparison with the state-of-the-art methods are presented in Section 4.

Related works

Related works of the proposed method can be summarized into three categories: unsupervised person re-id, semi-supervised learning, and curriculum learning. We will explain the connections and differences between NSSA and these methods in the corresponding aspects.

Overview

We will introduce the proposed Neighbor Similarity and Soft-label Adaptation algorithm in this section. The architecture of the method is illustrated in Fig. 2, which contains 4 steps:

  • Step (1). Model pre-training by supervised learning on the labeled source dataset. In this step, a person feature embedding model is trained by classification loss and domain adaptation loss. However, it is suboptimal on the target dataset due to the domain gap, so it needs to be fine-tuned.

  • Step (2). Feature

Datasets and settings

Datasets and evaluation protocol. We evaluate our proposed model on three person re-id datasets: Market-1501 [48], DukeMTMC [31] and MSMT17 [37]. Market-1501 contains 32,668 labeled images of 1501 people captured by 6 cameras. The standard training/test split (750 / 751 ids) and the single-query setting are adopted in our experiments. DukeMTMC has 8 cameras and 1404 identities with 36,411 images. Half of the identities are used for training and the other half for testing. MSMT17 has 4101

Conclusion

In this paper, we propose a simple yet effective algorithm, NSSA, for unsupervised cross-dataset person re-identification. We introduce inner-domain neighbor similarity and inter-domain soft-label adaptation to exploit the topological relationship among features in addition to the vanilla Euclidean distance. The representation ability of the feature model improves through the iterative training process. Extensive experimental results on three real-world datasets demonstrate the advantage of the proposed model over

Declaration of Competing Interest

We declare that we have no financial and personal relationships with other people or organizations that can inappropriately influence our work, there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled.

Acknowledgements

This paper is partially supported by NSFC (No. 61772330, 61533012, 61876109), the pre-research project (No. 61403120201), Shanghai authentication key Lab. (2017XCWZK01), and Technology Committee the interdisciplinary Program of Shanghai Jiao Tong University (YG2019QNA09).

Yiru Zhao received the B.S. degree in computer science from Tongji University, China, in 2015. He is currently pursuing the Ph.D. degree at Shanghai Jiao Tong University, China. His research interests include deep learning, image retrieval and machine learning.

References (56)

  • S. Ding et al., Deep feature learning with relative distance comparison for person re-identification, Pattern Recognit. (2015)
  • X. Yang et al., Enhancing person re-identification in a self-trained subspace, ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) (2017)
  • E. Ahmed et al., An improved deep learning architecture for person re-identification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
  • Y. Bengio et al., Curriculum learning, Proceedings of the 26th Annual International Conference on Machine Learning (2009)
  • W. Chen et al., Beyond triplet loss: a deep quadruplet network for person re-identification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  • D. Cheng et al., Person re-identification by multi-channel parts-based CNN with improved triplet loss function, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • D.S. Cheng et al., Custom pictorial structures for re-identification, Proceedings of the BMVC (2011)
  • W. Deng et al., Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person reidentification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
  • M. Ester et al., A density-based algorithm for discovering clusters in large spatial databases with noise, Proceedings of the KDD (1996)
  • H. Fan et al., Unsupervised person re-identification: clustering and fine-tuning, ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) (2018)
  • M. Farenzena et al., Person re-identification by symmetry-driven accumulation of local features, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2010)
  • I. Goodfellow et al., Generative adversarial nets, Advances in Neural Information Processing Systems (2014)
  • A. Gretton et al., A kernel method for the two-sample-problem, Advances in Neural Information Processing Systems (2007)
  • K. He et al., Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • A. Hermans, L. Beyer, B. Leibe, In defense of the triplet loss for person re-identification, arXiv:1703.07737...
  • L. Jiang et al., Self-paced learning with diversity, Advances in Neural Information Processing Systems (2014)
  • E. Kodirov et al., Person re-identification by unsupervised l1 graph learning, Proceedings of the European Conference on Computer Vision (2016)
  • E. Kodirov et al., Dictionary learning with iterative Laplacian regularisation for unsupervised person re-identification, Proceedings of the BMVC (2015)
  • A. Krizhevsky et al., Imagenet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems (2012)
  • M.P. Kumar et al., Self-paced learning for latent variable models, Advances in Neural Information Processing Systems (2010)
  • M. Li et al., Unsupervised person re-identification by deep learning Tracklet association, Proceedings of the European Conference on Computer Vision (2018)
  • W. Li et al., DeepReID: Deep filter pairing neural network for person re-identification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014)
  • Y.-J. Li et al., Adaptation and re-identification network: an unsupervised deep transfer learning approach to person re-identification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2018)
  • S. Lin, H. Li, C.-T. Li, A.C. Kot, Multi-task mid-level feature alignment network for unsupervised cross-dataset person...
  • G. Lisanti et al., Person re-identification by iterative re-weighted sparse ranking, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
  • H. Liu et al., End-to-end comparative attention networks for person re-identification, IEEE Trans. Image Process. (2017)
  • X. Liu et al., Semi-supervised coupled dictionary learning for person re-identification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014)
  • M. Long et al., Learning transferable features with deep adaptation networks, Proceedings of the International Conference on Machine Learning (2015)

Hongtao Lu is a Professor in the Department of Computer Science and Engineering, Shanghai Jiao Tong University, China. His current research interests include computer vision, deep learning and machine learning. He has authored or co-authored more than 100 papers in journals and premier conferences.
