Soft Pseudo-Label Shrinkage for Unsupervised Domain Adaptive Person Re-identification

https://doi.org/10.1016/j.patcog.2022.108615

Highlights

  • We changed the title to “Soft Pseudo-Label Shrinkage for Unsupervised Domain Adaptive Person Re-identification”.

Abstract

One effective way to tackle unsupervised domain adaptation (UDA) for person re-identification (Re-ID) is the clustering-based self-training approach, in which a model is trained with hard pseudo-labels obtained from a clustering method. With a hard pseudo-label, a sample is assigned only to the cluster with the highest probability, which makes training sensitive to incorrect clustering results caused by imperfect clustering algorithms. Soft pseudo-labels can mitigate this issue by representing the sample with the full range of class probabilities over all clusters. However, because such soft pseudo-labels span the full range of classes, they account for both hard and easy samples, which distracts the model from learning more discriminative features from the hard examples. To solve this issue, we propose a coarse-to-fine refinement mechanism that produces robust refined soft pseudo-labels by progressively focusing more on the hard samples and less on the easy samples. The proposed refined soft pseudo-labels can be readily integrated into the cross-entropy loss as strong supervision that guides the model to learn more discriminative features. Extensive experiments demonstrate that our proposed method outperforms the state-of-the-art unsupervised domain adaptation approaches on person Re-ID by a considerable margin. Code will be available at: http://github.com/Dingyuan-Zheng/ctf-UDA.

Introduction

Person re-identification (Re-ID) has attracted great attention in the past decade due to its wide applications in surveillance systems, such as tracing criminal suspects or missing persons. Given a target person, person Re-ID aims to retrieve all images of this person captured by various cameras. Although many existing approaches have achieved impressive performance on supervised person Re-ID [1], [2], [3], they depend on a large amount of labeled data, which is very costly to collect in real-world scenarios. Moreover, these methods perform well on a single dataset (domain), but their performance deteriorates significantly when the test images come from a different dataset (domain), owing to the significant distribution mismatch between the source and target domains. To solve these issues, unsupervised domain adaptive person re-identification methods have been proposed to effectively utilize a large amount of unlabelled data in the target domain and to transfer the discriminative ability learned from a labeled source dataset (domain) to an unlabelled target dataset (domain).

Although unsupervised domain adaptation (UDA) for person Re-ID has been investigated extensively in recent works [4], [5], it remains an open research topic due to two main challenges. One is the limited model transferability caused by the large domain misalignment between source and target domains; the other is the unknown number of identities and the complex intra-domain variations in the unlabelled target domain. The first issue is typically addressed by aligning feature distributions between source and target domains [6], [7], or by translating source-domain images into the style of target-domain images using generative adversarial networks (GAN) [8], [9], [10]. To tackle the second issue, images in the target domain are grouped using clustering methods, e.g., k-means or DBSCAN [11], and the resulting pseudo-labels are then used as supervision to train the network in a supervised manner [12]. Although these approaches achieve competitive results with heuristic training techniques [13], [14], [15], label noise is inevitably introduced by imperfect clustering algorithms, compromising the final identification performance.
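The clustering step described above can be sketched as follows. This is only a minimal illustration that uses a hand-written k-means (Lloyd's algorithm) on toy 2-D features. The function name `kmeans_pseudo_labels`, the toy data, and the fixed initial centroids are assumptions made for this example and are not taken from any cited method; practical systems cluster high-dimensional CNN features, often with DBSCAN.

```python
import math

def kmeans_pseudo_labels(features, centroids, n_iters=10):
    """Toy k-means: return a hard pseudo-label (cluster index) per feature."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(features)
    for _ in range(n_iters):
        # Assignment step: each sample takes the index of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(f, centroids[k]))
            for f in features
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [f for f, l in zip(features, labels) if l == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical 2-D "features" of unlabelled target-domain images.
features = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (2.4, 2.4)]
labels, centroids = kmeans_pseudo_labels(features, centroids=[(0.0, 0.0), (5.0, 5.0)])
print(labels)  # [0, 0, 0, 1, 1, 0]
```

The last sample lies almost midway between the two clusters, yet the hard pseudo-label commits it entirely to cluster 0. This is the kind of assignment where label noise can enter.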

To mitigate the label noise issue, recent works [16], [17], [18] introduce soft labels into the clustering-based UDA person Re-ID task. A soft label represents the person identity with multiple class probabilities instead of the single hard class index produced by a clustering method. Therefore, the selection of classes and the probability assigned to each class are crucial for a soft label. Early work [16] assigns a uniform weight (1/k) to each of the top-k nearest neighbors selected by feature distance, where each neighbor has a unique class identity. More recently, Zhang et al. [17] propose a more sophisticated probability allocation mechanism for the top-k nearest neighbors. To further improve the quality of soft labels, Ge et al. [18] propose to adopt the full range of pseudo-classes generated by clustering methods, with automatically learned weights, as the soft pseudo-labels, rather than relying only on the nearest k classes. Such labels are produced by the classifiers of two mutually learned networks.
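To make the two families of soft labels concrete, the sketch below contrasts a uniform top-k label (weight 1/k for each of the k nearest classes) with a full-range label obtained from a softmax over all pseudo-classes. It is a simplified illustration: the function names, the per-class distances, and the logits are hypothetical, and the softmax merely stands in for the learned classifier outputs used in [18].

```python
import math

def topk_uniform_soft_label(distances, k):
    """Give weight 1/k to the k nearest classes and 0 to all others."""
    nearest = sorted(range(len(distances)), key=lambda c: distances[c])[:k]
    label = [0.0] * len(distances)
    for c in nearest:
        label[c] = 1.0 / k
    return label

def full_range_soft_label(logits):
    """Softmax over all pseudo-classes (numerically stabilised)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical distances from one sample to four pseudo-class representatives.
distances = [0.3, 0.9, 0.4, 2.0]
uniform = topk_uniform_soft_label(distances, k=2)
print(uniform)  # [0.5, 0.0, 0.5, 0.0]

# Hypothetical classifier logits for the same sample.
full = full_range_soft_label([2.0, 0.5, 1.8, -1.0])
print([round(p, 3) for p in full])
```

The top-k label ignores how close each selected class actually is, whereas the full-range label assigns a non-zero probability to every class, including clearly unrelated ones.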

As pointed out in previous work [19], [20], [21], more discriminative features can be learned if the model is forced to focus on misclassified (hard) samples. Similarly, in the UDA person Re-ID task, if a person identity is represented by the full range of pseudo-classes, covering both hard and easy samples, the model is hindered from learning discriminative features in the target domain. Fig. 1 illustrates the easy and hard examples generated by a clustering method. A person image (red bounding box) is assigned to a cluster, which we denote as the anchor cluster. The images within the anchor cluster (red/blue bounding boxes in the anchor cluster) show only slight intra-ID appearance differences. However, some other images of the same ground-truth identity (blue bounding boxes in the hard cluster) are assigned to a different (hard) cluster due to different camera viewpoints and illumination conditions. These images with large intra-ID variance are the hard positive samples of this person. In addition, some person images (green bounding boxes) with different ground-truth identities and large inter-ID appearance differences are assigned to a distinct (easy) cluster, and these images are regarded as easy samples. We observe that the centroid of the hard cluster is normally closer to the anchor cluster centroid than that of the easy cluster. In this paper, we therefore propose to use the class centroid distance to distinguish easy classes from hard classes.
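The observation that hard clusters lie closer to the anchor cluster than easy ones suggests a simple ranking rule, sketched below. The helper names, the Euclidean distance, the toy 2-D centroids, and the fixed cut-off `n_hard` are all assumptions for illustration; the text above does not specify how the boundary between hard and easy classes is chosen.

```python
import math

def rank_classes_by_centroid_distance(centroids, anchor):
    """Return the other class indices sorted from nearest to farthest centroid."""
    others = [c for c in range(len(centroids)) if c != anchor]
    return sorted(others, key=lambda c: math.dist(centroids[c], centroids[anchor]))

def split_hard_easy(centroids, anchor, n_hard):
    """Treat the n_hard nearest classes as hard, the rest as easy."""
    ranked = rank_classes_by_centroid_distance(centroids, anchor)
    return ranked[:n_hard], ranked[n_hard:]

# Hypothetical 2-D centroids for four pseudo-classes; class 0 is the anchor.
centroids = [(0.0, 0.0), (0.5, 0.4), (4.0, 4.0), (1.5, 1.0)]
hard, easy = split_hard_easy(centroids, anchor=0, n_hard=1)
print(hard, easy)  # [1] [3, 2]
```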

Inspired by the above analysis, we propose a coarse-to-fine refinement network (CTFRN) to generate robust refined soft pseudo-labels, represented by refined pseudo-classes rather than the full range of pseudo-classes. In turn, such refined soft pseudo-labels are applied to the cross-entropy loss as strong supervision that guides the model to learn more discriminative features focusing on hard samples. Specifically, we first produce initial soft pseudo-labels that are represented by the full range of pseudo-classes, as in [18]. Then, we retain the pseudo-classes corresponding to hard samples while discarding the well-classified classes that mainly consist of easy samples. This process proceeds in a coarse-to-fine manner, as illustrated in Fig. 2. For example, in Fig. 2, person image P1, which is close to both clusters A and B, is assigned the hard pseudo-label B, and its initial full-class soft pseudo-label contains its probabilities of being classified into each of the classes. After coarse-to-fine refinement of this full-class label, the final refined soft pseudo-label of P1 keeps only the probabilities for pseudo-classes A and B. For P1, pseudo-classes A and B contain hard samples, and learning on hard samples helps the network learn discriminative features. Benefiting from the proposed CTFRN, the obtained model maximizes the transferability from the labeled source domain to the unlabelled target domain and thereby promotes the final identification performance.
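The P1 example can be written out numerically. The sketch below keeps only the probabilities of the retained pseudo-classes and then uses the result as the target of a soft cross-entropy. It is an assumption-laden illustration rather than the CTFRN implementation: the function names and all probability values are hypothetical, and renormalising the retained probabilities so that they sum to 1 is a choice made here, not something stated in the text.

```python
import math

def refine_soft_label(soft_label, keep):
    """Zero out classes not in `keep` and renormalise the rest to sum to 1."""
    kept = [p if c in keep else 0.0 for c, p in enumerate(soft_label)]
    total = sum(kept)
    return [p / total for p in kept]

def soft_cross_entropy(pred_probs, target):
    """Cross-entropy between predicted probabilities and a soft target."""
    return -sum(t * math.log(p) for t, p in zip(target, pred_probs) if t > 0)

# Hypothetical full-class soft label for P1 over classes [A, B, C, D].
full_label = [0.35, 0.45, 0.15, 0.05]
refined = refine_soft_label(full_label, keep={0, 1})  # keep A and B only
print([round(p, 4) for p in refined])  # [0.4375, 0.5625, 0.0, 0.0]

# Hypothetical model prediction for P1.
pred = [0.30, 0.50, 0.15, 0.05]
loss = soft_cross_entropy(pred, refined)
print(round(loss, 4))
```

With the refined target, the loss depends only on how the model distributes probability between the hard classes A and B; the easy classes C and D no longer contribute.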

The contributions of our work are threefold:

  • Observing that using hard pseudo-labels or full-class soft pseudo-labels for UDA on person Re-ID can distract the model from learning discriminative features, we propose to refine the soft pseudo-labels by retaining the pseudo-classes that contain hard samples while gradually discarding the well-classified classes that consist of easy samples, so that the model learns more discriminative features from hard examples.

  • We propose a coarse-to-fine progressive soft pseudo-label refinement mechanism, in which more pseudo-classes are selected at the beginning of training and fewer pseudo-classes are used at the end of training. We also propose to use temporally averaged pseudo-class centroids to distinguish easy classes from hard classes, which is more robust than simply using sample features.

  • Extensive experiments on four unsupervised domain adaptation Re-ID tasks demonstrate the superiority of our proposed method. Specifically, our method outperforms the state-of-the-art method [18] by 4.9%, 2.1%, 4.0%, and 3.1% in terms of mAP on Duke-to-Market, Market-to-Duke, Duke-to-MSMT, and Market-to-MSMT tasks, respectively.
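The second contribution, selecting more pseudo-classes early in training and fewer later while tracking temporally averaged centroids, could look roughly like the sketch below. Everything specific here is an assumption: the linear decay, the class counts, the exponential-moving-average form of the temporal average, and the momentum of 0.9 are illustrative choices, since this excerpt does not give the actual schedule or update rule.

```python
def num_retained_classes(epoch, total_epochs, k_start, k_end):
    """Linearly shrink the number of retained pseudo-classes over training."""
    if total_epochs <= 1:
        return k_end
    frac = epoch / (total_epochs - 1)
    return round(k_start + frac * (k_end - k_start))

def update_centroid(old, new, momentum=0.9):
    """Exponential moving average of a class centroid across epochs."""
    return [momentum * o + (1.0 - momentum) * n for o, n in zip(old, new)]

# Hypothetical schedule: from 500 retained classes down to 20 over 5 epochs.
schedule = [num_retained_classes(e, 5, k_start=500, k_end=20) for e in range(5)]
print(schedule)  # [500, 380, 260, 140, 20]

# One smoothing step for a hypothetical 2-D centroid.
centroid = update_centroid([1.0, 2.0], [2.0, 0.0], momentum=0.9)
print([round(v, 2) for v in centroid])  # [1.1, 1.8]
```

Averaging centroids over time damps the effect of a single noisy clustering round, which is one plausible reading of why centroids are described as more robust than individual sample features.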

Unsupervised domain adaptation on person Re-ID

Unsupervised domain adaptation (UDA) on person Re-ID has been extensively studied in recent works. In this section, we first review three typical approaches to this challenging task.

Learning via translation. One challenge that compromises model transferability is the large domain misalignment between source and target domains. Generative adversarial network (GAN) based methods [8], [9], [10], [22] have been extensively studied to bridge the domain gap in the person Re-ID task. Zhong et al.

Methodology

In the task of unsupervised domain adaptation on person re-identification, we are provided with a source domain training set $\{x_{s,i}\}_{i=1}^{N_s}$ associated with the ground-truth identity labels $\{y_{s,i}\}_{i=1}^{N_s}$, where $N_s$ is the number of images in the source domain training set. In the target domain, training images $\{x_{t,i}\}_{i=1}^{N_t}$ are also available without identity labels, where $N_t$ is the number of images. Our objective is to obtain a feature encoding module for person identification in the unlabelled

Datasets and evaluation metrics

Datasets. We evaluate our proposed coarse-to-fine soft pseudo-label refinement network on three widely used person Re-ID datasets. Specifically, Market1501 [30] contains 32,668 labeled images of 1501 identities captured by 6 cameras on the Tsinghua campus. It is split into a training set with 12,936 images of 751 identities and a test set with the remaining 19,732 images of 750 identities.

The DukeMTMC-reID [31] consists of 36,411 annotated images of 1404 identities shot from 8 cameras, for which

Conclusion and future work

In this paper, to tackle the label noise problem in clustering-based unsupervised domain adaptation (UDA) approaches for person Re-ID, we propose a coarse-to-fine refinement network (CTFRN) to generate refined soft pseudo-labels represented by fewer pseudo-classes from hard samples. Specifically, we progressively discard pseudo-classes that mainly consist of easy samples while retaining pseudo-classes that contain hard samples to represent the soft pseudo-label. The distance ranking between

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The work was supported by the National Natural Science Foundation of China under Grant 61972323, and by the Key Program Special Fund in XJTLU under Grants KSF-T-02 and KSF-P-02.

References (42)

  • L. Wei et al. Person transfer GAN to bridge domain gap for person re-identification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  • Y.-J. Li et al. Cross-dataset person re-identification via unsupervised pose disentanglement and adaptation. Proceedings of the IEEE International Conference on Computer Vision (2019)
  • M. Ester et al. A density-based algorithm for discovering clusters in large spatial databases with noise. KDD, volume 96 (1996)
  • X. Zhang et al. Self-training with progressive augmentation for unsupervised cross-domain person re-identification. Proceedings of the IEEE International Conference on Computer Vision (2019)
  • Y. Fu et al. Self-similarity grouping: A simple unsupervised cross domain adaptation approach for person re-identification. Proceedings of the IEEE International Conference on Computer Vision (2019)
  • Z. Zhong et al. Invariance matters: Exemplar memory for domain adaptive person re-identification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019)
  • X. Zhang et al. Memorizing comprehensively to learn adaptively: unsupervised cross-domain person re-id with multi-level memory. arXiv preprint arXiv:2001.04123 (2020)
  • Y. Ge et al. Mutual mean-teaching: pseudo label refinery for unsupervised domain adaptation on person re-identification. arXiv preprint arXiv:2001.01526 (2020)
  • T.-Y. Lin et al. Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision (2017)
  • A. Hermans et al. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737 (2017)
  • C. Wang et al. Mancs: A multi-task attentional network with curriculum sampling for person re-identification. Proceedings of the European Conference on Computer Vision (ECCV) (2018)

    Dingyuan Zheng received the B.S. degree in the Electronic Science and Technology from the Xi’an Jiaotong-Liverpool University, Suzhou, PR China, in 2015, and obtained the M.S. degree in the Microelectronics Systems Design from the University of Southampton, Southampton, U.K., in 2016. He is now a Ph.D. candidate in the Department of the Electrical and Electronic Engineering of the Xi’an Jiaotong-Liverpool University, Suzhou, PR China. His current research interests include computer vision, person re-identification and unsupervised domain adaptation.

    Jimin Xiao received the B.S. and M.E. degrees in telecommunication engineering from the Nanjing University of Posts and Telecommunications, Nanjing, China, in 2004 and 2007, respectively, and the Ph.D. degree in electrical engineering and electronics from the University of Liverpool, Liverpool, U.K., in 2013. From November 2013 to November 2014, he was a senior researcher with the Department of Signal Processing, Tampere University of Technology, Tampere, Finland, and an external researcher with the Nokia Research Center, Tampere, Finland. Since December 2014, he has been a faculty member with Xi’an Jiaotong-Liverpool University, Suzhou, China. His research interests include image and video processing, computer vision, and deep learning.

    Ke Chen is currently an associate professor at the South China University of Technology (SCUT), Guangzhou, China. He received his B.E. and M.E. degrees from Sun Yat-sen University in 2007 and 2009, respectively, and his Ph.D. degree from Queen Mary, University of London in 2013. Before joining SCUT, he worked at the Tampere University of Technology, Finland, for five years. He has published more than 70 papers in the field, including at top-tier venues such as CVPR, ICCV, and IJCAI. His research interests include computer vision, pattern recognition, neural dynamics and robotics.

    Xiaowei Huang is currently a Reader with the Department of Computer Science, University of Liverpool, Liverpool, UK. His research interests include the correctness (e.g., safety, trustworthiness, etc.) of autonomous systems. Specifically, they cover the verification of neural network-based deep learning on safety and security properties, practical analysis techniques (software testing, safety argument, certification, etc.) for machine learning techniques, interpretation and explanation of deep learning, and logic-based approaches for the specification, verification and synthesis of autonomous multiagent systems.

    Lin Chen received the B.E. degree from the University of Science and Technology of China, Hefei, China, in 2009, and the Ph.D. degree from the School of Computer Engineering, Nanyang Technological University, Singapore, in 2014. He is currently the Sr. Principal Scientist at Wyze Labs, Seattle, U.S., where he leads the Wyze AI team of talented and passionate scientists and engineers building smart home products by innovating in AIoT. His current research interests include computer vision and machine learning, in particular, deep learning with its application to computer vision tasks, such as object recognition, image/video retrieval, and classification.

    Yao Zhao received the B.S. degree from the Radio Engineering Department, Fuzhou University, Fuzhou, China, in 1989, and the M.E. degree from the Radio Engineering Department, Southeast University, Nanjing, China, in 1992, and the Ph.D. degree from the Institute of Information Science, Beijing Jiaotong University (BJTU), Beijing, China, in 1996, where he became an Associate Professor and a Professor in 1998 and 2001, respectively. From 2001 to 2002, he was a Senior Research Fellow with the Information and Communication Theory Group, Faculty of Information Technology and Systems, Delft University of Technology, Delft, The Netherlands. In 2015, he visited the Swiss Federal Institute of Technology, Lausanne, Switzerland. From 2017 to 2018, he visited the University of Southern California. He is currently the Director with the Institute of Information Science, BJTU. His current research interests include image/video coding, digital watermarking and forensics, video analysis and understanding, and artificial intelligence. Dr. Zhao is a fellow of the IET. He serves on the Editorial Boards of several international journals, including as an Associate Editor for the IEEE TRANSACTIONS ON CYBERNETICS, a Senior Associate Editor for the IEEE SIGNAL PROCESSING LETTERS, and an Area Editor for Signal Processing: Image Communication. He was named a Distinguished Young Scholar by the National Science Foundation of China in 2010 and was elected as a Chang Jiang Scholar of Ministry of Education of China in 2013.
