Neurocomputing

Volume 483, 28 April 2022, Pages 116-126

Momentum source-proxy guided initialization for unsupervised domain adaptive person re-identification

https://doi.org/10.1016/j.neucom.2022.01.013

Abstract

Unsupervised domain adaptive person re-identification (UDA Re-ID), which aims to adapt a model trained on a source domain to a target domain, is especially challenging because the identities of the two Re-ID domains do not overlap. State-of-the-art UDA Re-ID methods optimize a model pre-trained on the source domain with pseudo labels generated by clustering algorithms on the target domain. The drawback is that the initial parameters are learned only from the labeled source domain, neglecting target-domain information that can be easily obtained from unlabeled data. In order to better fit the target distribution while preventing over-fitting to the source one, we propose a novel momentum source-proxy guided initialization (MSPGI) approach that integrates information from unlabeled data into the pre-training process. Specifically, we assign soft labels to unlabeled data according to their similarity to the feature proxies of the source domain, based on the finding that different Re-ID datasets share commonalities. In addition, we instantiate the pretext task in unsupervised pre-training as constraining the predicted soft label to be consistent with the one calculated from the temporally averaged parameters of the model. Experiments are conducted with multiple downstream approaches, pushing forward the state-of-the-art results by an impressive margin on Market-1501 and DukeMTMC-reID. By making use of unlabeled data, MSPGI further improves the performance of a fully supervised network.

Introduction

Person re-identification (Re-ID) plays an important role in security and surveillance. However, effective deep learning based approaches require a large number of annotations, which are especially hard and time-consuming to collect for Re-ID, since pair-wise images of the same person under non-overlapping cameras are difficult to obtain. Hence, many researchers resort to unsupervised Re-ID methods. Although recent unsupervised Re-ID approaches [32], [5], [23] have narrowed the performance gap with their supervised counterparts, the results are still not satisfying. Unsupervised domain adaptive (UDA) Re-ID [18], [27] is proposed to make use of labeled data at hand. Under this setting, a labeled source domain and an unlabeled target domain are provided. Among the various UDA Re-ID approaches, the self-supervised learning scheme proposed in [23] has achieved a notable performance boost. It iterates between clustering deep features to assign pseudo-labels and retraining the network with them, as sketched below. Based on this pipeline, [30] enhanced the clustering process by building multiple clusters from global and local features, while [29], [9] addressed outliers in the clustering process via co-teaching or mutual teaching between two models. However, all of these approaches naively pre-train the network only with labeled source data in a fully supervised setting. We argue that there is room for improvement, since the information in unlabeled data has not been explored during the initialization stage. As illustrated in Fig. 1, we propose MSPGI, which integrates information from unlabeled data into the pre-training process via an unsupervised loss.
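
For concreteness, the following is a minimal sketch of one round of this clustering-then-retraining pipeline. The helper names (`backbone`, `classifier`) are hypothetical, and published pipelines typically use DBSCAN and triplet losses rather than this simplified k-means plus cross-entropy variant.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def self_training_round(backbone, classifier, target_images, num_clusters, optimizer, device="cpu"):
    """One round: cluster deep features to obtain pseudo-labels, then retrain on them."""
    backbone.eval()
    with torch.no_grad():
        # (N, D) embeddings of the unlabeled target images
        feats = F.normalize(backbone(target_images.to(device)), dim=1)

    # Step 1: generate pseudo-labels by clustering the embeddings.
    pseudo = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(feats.cpu().numpy())
    pseudo = torch.as_tensor(pseudo, dtype=torch.long, device=device)

    # Step 2: retrain the network with the pseudo-labels (a single pass shown here);
    # `classifier` is a hypothetical head with num_clusters outputs.
    backbone.train()
    logits = classifier(backbone(target_images.to(device)))
    loss = F.cross_entropy(logits, pseudo)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```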

A main purpose of pre-training is to learn representations that can be transferred to downstream approaches for performance gains and better generalization. Supervised pre-training is widely explored and dominant in UDA Re-ID. However, some recent works show promising results via unsupervised pre-training in multiple areas. MoCo [11] formulated unsupervised pre-training as designing pretext tasks and loss functions. Based on this formulation, a number of effective pretext tasks were proposed. For example, [21] solved jigsaw puzzles, instantiating the pretext task as predicting the relative positions of image patches, and [10] predicted the rotation angles of images. We propose to conduct supervised and unsupervised pre-training simultaneously.
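
As a point of reference for how such pretext tasks are instantiated, here is a generic sketch of the rotation-prediction task in the spirit of [10] (illustrative only, not the cited implementation; `model` is assumed to output four rotation logits).

```python
import torch
import torch.nn.functional as F

def rotation_pretext_batch(images):
    """Rotate each image (N, C, H, W) by 0/90/180/270 degrees; the rotation index is the label."""
    rotated, labels = [], []
    for k in range(4):  # k quarter-turns
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

def rotation_pretext_loss(model, images):
    """Cross-entropy loss for predicting which rotation was applied."""
    x, y = rotation_pretext_batch(images)
    return F.cross_entropy(model(x), y)
```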

To achieve this goal, we first assign soft labels to unlabeled data according to their similarity to the feature proxies of the source domain. Representing target-domain data with the source domain is feasible because different domains share commonalities in the Re-ID task: a model trained on the source domain already achieves a certain accuracy when directly transferred to the target domain. The idea of soft labels was proposed by [33]; however, we generalize the reference agent in [33] to a proxy. Specifically, we take inspiration from Proxy-NCA [20], where the proxies are defined over a space of points that approximates the source training data. In this paper, we choose one point as the proxy for each class in the source domain [20] and call these points source proxies.
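
A minimal sketch of this soft-label assignment is given below; the use of cosine similarity and the temperature `tau` are assumptions made for illustration, not necessarily the exact formulation in the paper.

```python
import torch
import torch.nn.functional as F

def soft_labels(target_feats, source_proxies, tau=0.05):
    """target_feats: (N, D) embeddings of unlabeled target images.
    source_proxies: (C, D) matrix with one proxy per source identity.
    Returns an (N, C) soft-label distribution over source identities."""
    f = F.normalize(target_feats, dim=1)
    p = F.normalize(source_proxies, dim=1)
    sims = f @ p.t() / tau              # temperature-scaled cosine similarities
    return F.softmax(sims, dim=1)
```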

Secondly, we formulate the pretext task as enforcing the predictive distribution of soft labels to approximate more reliable signals. Many kinds of signals can serve as such supervision in an unsupervised manner. For example, Hinton et al. [13] took the probability distribution from a deeper network. Laine et al. [16] proposed self-ensembled predictions, i.e., the moving average of the network-in-training's outputs over different epochs. Similarly, Tarvainen et al. [25] proposed self-ensembled parameters, which they called the "mean teacher". Taking inspiration from these works [16], [25], we temporally average the network parameters. Both the embeddings and the source proxies drawn from this set of parameters are more robust to noise, and can thus be expected to be better predictors of the unknown labels than the output of the network at the most recent training epoch. The source proxies computed on this set of parameters are called momentum source proxies. Finally, we push the prediction over source proxies to be as close as possible to that over momentum source proxies. In this way, MSPGI fits the network to the target distribution while preventing it from over-fitting to the source domain, thus forming a better initialization.
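
The sketch below illustrates the two ingredients, mean-teacher-style parameter averaging and a consistency term; the momentum value and the KL form of the consistency loss are assumptions for illustration rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(student, teacher, momentum=0.999):
    """Temporal averaging: theta_teacher <- m * theta_teacher + (1 - m) * theta_student."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def consistency_loss(student_probs, teacher_probs):
    """Push the soft labels predicted with the current parameters/proxies toward
    those computed from the temporally averaged (momentum) ones."""
    return F.kl_div(student_probs.clamp_min(1e-8).log(),
                    teacher_probs.detach(), reduction="batchmean")
```

In such a scheme, the teacher is typically initialized as a copy of the student and is never updated by gradients, only by the momentum rule above.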

We remark that the idea of using ensembled parameters has been investigated in the literature [16], [25]. However, our method differs from these works in two respects, i.e., the input data and the usage. The labeled and unlabeled images in [16], [25] come from the same domain, while ours come from different domains, so we use source proxies to bridge them. Another difference lies in the usage: [16], [25] use the consistency loss as a regularization method to reduce over-fitting in semi-supervised learning, whereas our method, in addition to reducing over-fitting, aims to inject generalization into the network to make it more suitable for downstream approaches. The ability of such a framework to inject generalization was demonstrated in [7].

The main contributions of this paper are summarized as follows. We highlight the importance of fitting the network to the target distribution during initialization, so that training begins from a more informative and general starting point. To address this issue, we propose MSPGI, which includes two core operations. Firstly, we propose to use similarities to source proxies as soft labels to represent target unlabeled data with source data; the source proxies are obtained without additional operations such as averaging or clustering, as they are parameters stored during network training. Secondly, we adopt temporally averaged parameters to form ensembled embeddings and source proxies that are more robust to noise, and we constrain the prediction over the source proxies and that over the momentum ones to be close to each other, thus fitting the target distribution by means of unsupervised learning. The proposed MSPGI achieves competitive performance on three popular benchmarks, DukeMTMC-reID, Market-1501 and MSMT17, compared with state-of-the-art approaches.


Related work

Unsupervised person re-identification. Approaches based on hand-crafted features are naturally unsupervised, but their performance on large-scale Re-ID datasets is poor, since designing view-invariant features is challenging under extreme changes across camera views. Recent works introduced pseudo-labels (generated by clustering) and deep convolutional neural networks (CNNs), which improved unsupervised Re-ID performance by a large margin. Yu et al. proposed CAMEL [31]

Method

In this section, we first briefly introduce the common practice of self-supervised learning based UDA Re-ID. Then we elaborate on the proposed MSPGI, which is composed of (i) assigning soft labels to unlabeled data according to their similarity to the source proxies and (ii) a pretext task that fits the target domain via unsupervised learning. The overall framework of the proposed MSPGI is illustrated in Fig. 2.
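
Since the detailed derivation is omitted in this excerpt, the sketch below only outlines how the two components could be combined in a single pre-training step, reusing `soft_labels`, `consistency_loss` and `ema_update` from the earlier sketches; the additive weighting `lam` and the `student.classify` head are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def pretrain_step(student, teacher, src_images, src_labels, tgt_images,
                  proxies_student, proxies_teacher, optimizer, lam=1.0):
    """One schematic step: supervised loss on source data plus the unsupervised
    consistency term on target data, followed by the momentum update."""
    # Supervised branch on the labeled source domain (classifier head assumed inside `student`).
    loss_sup = F.cross_entropy(student.classify(src_images), src_labels)

    # Unsupervised branch: soft labels w.r.t. current vs. momentum source proxies.
    q_student = soft_labels(student(tgt_images), proxies_student)
    with torch.no_grad():
        q_teacher = soft_labels(teacher(tgt_images), proxies_teacher)
    loss_unsup = consistency_loss(q_student, q_teacher)

    loss = loss_sup + lam * loss_unsup
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(student, teacher)
    return loss.item()
```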

Experiment

We evaluate the proposed MSPGI in terms of model performance, mainly taking UDAP [23] and MMT [9] as the downstream approaches. Below we introduce the standard datasets, evaluation metrics and implementation details, followed by an ablation study and a comparison with the state of the art.
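
For reference, the two standard Re-ID metrics, mean average precision (mAP) and rank-1 CMC accuracy, can be computed roughly as in the generic sketch below; this is a simplified illustration, not the exact evaluation protocol used in the paper.

```python
import numpy as np

def evaluate(query_feats, query_pids, gallery_feats, gallery_pids):
    """Generic mAP and rank-1 accuracy from L2-normalized features (the camera-aware
    filtering of the Market-1501/DukeMTMC-reID protocols is omitted for brevity)."""
    dist = -query_feats @ gallery_feats.T          # smaller = more similar
    aps, rank1 = [], 0.0
    for i in range(len(query_pids)):
        order = np.argsort(dist[i])                # gallery sorted by similarity
        matches = (gallery_pids[order] == query_pids[i]).astype(np.float64)
        if matches.sum() == 0:
            continue
        rank1 += matches[0]
        prec_at_k = np.cumsum(matches) / (np.arange(matches.size) + 1.0)
        aps.append(float((prec_at_k * matches).sum() / matches.sum()))
    return float(np.mean(aps)), rank1 / len(query_pids)
```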

Conclusion

In this work, we propose the momentum source-proxy guided initialization (MSPGI) approach, which explores the semantic prior existing in unlabeled data during network initialization. Specifically, we adopt a novel pretext task that enforces the source-proxy representation of unlabeled data to be similar to its momentum counterpart. Extensive experimental results on three large-scale benchmarks validate the effectiveness of our approach, whose performance is competitive with other state-of-the-art approaches.

CRediT authorship contribution statement

Jiali Xi: Conceptualization, Methodology, Software, Validation, Writing – original draft, Visualization. Qin Zhou: Writing – review & editing, Supervision, Project administration. Xinzhe Li: Writing – review & editing. Shibao Zheng: Supervision, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (NSFC, Grant Nos. 62071292 and 61771303) and the Science and Technology Commission of Shanghai Municipality (STCSM, Grant No. 18DZ2270700).

References (39)

  • P. Bachman et al., Learning with pseudo-ensembles, Advances in Neural Information Processing Systems, 2014.
  • Z. Bai, Z. Wang, J. Wang, D. Hu, E. Ding, Unsupervised multi-source domain adaptation for person re-identification, ...
  • M. Caron, P. Bojanowski, J. Mairal, A. Joulin, Unsupervised pre-training of image features on non-curated data, ...
  • W. Deng et al., Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification.
  • H. Fan et al., Unsupervised person re-identification: Clustering and fine-tuning, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2018.
  • H. Feng et al., Complementary pseudo labels for unsupervised domain adaptation on person re-identification, IEEE Trans. Image Process., 2021.
  • T. Furlanello, Z.C. Lipton, M. Tschannen, L. Itti, A. Anandkumar, Born again neural networks, 2018. arXiv preprint ...
  • Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, V. Lempitsky, ...
  • Y. Ge, D. Chen, H. Li, Mutual mean-teaching: Pseudo label refinery for unsupervised domain adaptation on person ...
  • S. Gidaris, P. Singh, N. Komodakis, Unsupervised representation learning by predicting image rotations, 2018. arXiv ...
  • K. He, H. Fan, Y. Wu, S. Xie, R. Girshick, Momentum contrast for unsupervised visual representation learning, 2019. ...
  • K. He et al., Deep residual learning for image recognition.
  • G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, 2015. arXiv preprint ...
  • X. Jin, C. Lan, W. Zeng, Z. Chen, Global distance-distributions separation for unsupervised person re-identification, ...
  • X. Jin et al., Style normalization and restitution for generalizable person re-identification.
  • S. Laine, T. Aila, Temporal ensembling for semi-supervised learning, 2016. arXiv preprint ...
  • H. Li et al., Attribute-aligned domain-invariant feature learning for unsupervised domain adaptation person re-identification, IEEE Trans. Inf. Forensics Secur., 2020.
  • Y.J. Li et al., Adaptation and re-identification network: An unsupervised deep transfer learning approach to person re-identification.
  • Y. Lin et al., Unsupervised person re-identification via softened similarity learning.

Jiali Xi received the B.S. degree in electrical engineering from Beijing University of Posts and Telecommunications, China, in 2017. She is currently pursuing the Ph.D. degree at Shanghai Jiao Tong University, China. She has been with the Alibaba DAMO Academy, Hangzhou, China, as a Research Intern since 2018. Her research interests include deep learning, machine learning, and computer vision.

Qin Zhou received her Ph.D. degree in Information and Communication Engineering from Shanghai Jiao Tong University in March 2019. Before that, she was a visiting student at Professor Haibin Ling's lab from October 2016 to July 2018. She is currently working in the Alibaba DAMO Academy as a senior algorithm engineer. Her research interests include computer vision, machine learning, and convex optimization.

Xinzhe Li received his B.S. degree in electronic information engineering from Dalian University of Technology, Dalian, China, in 2015. He is currently a Ph.D. student at the Department of Electronic Engineering, Shanghai Jiao Tong University (SJTU), Shanghai, China. His current research interests include few-shot learning, meta-learning, and data cleaning.

Shibao Zheng received the B.S. and M.S. degrees in electronic engineering from Xidian University, Xi'an, China, in 1983 and 1986, respectively. He is currently a professor and the vice director of the Elderly Health Information and Technology Institute of Shanghai Jiao Tong University (SJTU), Shanghai, China. He is also a professor committee member of the Shanghai Key Laboratory of Digital Media Processing and Transmission, and a commissioner of the Multimedia division of the Shanghai Communication Society. His current research interests include urban image surveillance systems, intelligent video analysis, spatial information systems, and elderly health technology.

This document is the result of the research projects funded by NSFC 62071292 and 61771303, and STCSM 18DZ2270700.
