Person re-identification with part prediction alignment

https://doi.org/10.1016/j.cviu.2021.103172

Highlights

  • We propose a part-based person feature extraction network with Part Prediction Alignment (PPA). The network needs neither external datasets nor a pose estimator, which reduces the complexity of the training process.

  • We adopt a teacher–student network for global–local feature extraction. The extracted features are therefore more discriminative and achieve higher scores in the testing phase.

  • Experiments on three datasets demonstrate that the proposed network effectively improves re-id accuracy.

Abstract

The key to successful person re-identification (re-id) is extracting discriminative person features. Various part-level feature extraction methods have been proposed to capture local person features for re-id. A prerequisite of part feature extraction is that each part is well located. We believe that the ID predictions for different parts of the same image should be consistent. Instead of relying on an external dataset and a pose estimator for guidance, we propose a re-id model with Part Prediction Alignment (PPA), which aligns the predicted distributions across parts. Because global and local features contain different spatial information, we expect that combining the two will further improve re-id performance. We therefore adopt a teacher–student training strategy based on PPA for global–local feature extraction, in which the global feature extraction branch acts as the teacher and guides the training of the local feature branch. Experimental results on the Market-1501, DukeMTMC-reID and CUHK03 (including CUHK03_Detected and CUHK03_Labeled) datasets confirm the effectiveness of our proposal: we achieve Rank-1 accuracies of 92.4%, 85.1%, 65.5% and 69.2% on Market-1501, DukeMTMC-reID, CUHK03_Detected and CUHK03_Labeled, respectively.

Introduction

Person re-identification (re-id) aims at identifying a person of interest at other places; it is a challenging task in computer vision (Sun et al., 2018, Binghui et al., 2019, Ruibing et al., 2019, Tianlong et al., 2019). Recently, deep learning has become the most popular approach in the computer vision community owing to its high discriminative ability, in tasks such as object detection, object tracking and person re-id (Junwei et al., 2018, Yi et al., 2019, Zhang et al., 2020, Wang et al., 2018a). Many state-of-the-art re-id models achieve high accuracy based on deep-learned features (Wei et al., 2017, He et al., 2018, Qian et al., 2018, Cheng et al., 2020, Jieru et al., 2019, Wang et al., 2019). Person feature extraction can be roughly divided into two categories: global feature extraction and local feature extraction. Global features capture overall information but ignore the spatial structure of a person, so in recent years many re-id methods have mainly extracted local features for re-identification.

The key to extracting discriminative local features is that every part is located accurately (Wei et al., 2017). This relies on external information about the pedestrian: a pose estimator and human pose estimation datasets are needed to extract it. A pose estimator and external datasets undoubtedly increase the training complexity, so some recent re-id models abandon pose cues and still achieve competitive accuracy (Sun et al., 2018, Zhang and Huang, 2018, Kumar et al., 2017, Zheng et al., 2017). Fig. 1 shows some partition strategies. The purpose of accurate localization is to enhance the internal consistency of parts, so that the ID of each part is predicted precisely. Since the ultimate goal of pose estimation is to improve the prediction accuracy of each local area, we reconsider this problem from the perspective of per-part ID prediction: we aim to align the prediction results of the parts and thereby enhance their prediction consistency.

In the re-id community, person images are usually divided into six parts (head, upper body, lower body, upper legs, lower legs and feet). We believe that, for images of the same person, the ID predictions of all parts should be the same no matter how many parts the image is divided into. However, during training, the ID predictions of the parts can differ because of misalignment in the person images. As the second sub-picture of Fig. 1(d) shows, part 1 contains only background, so it is difficult for the network to judge the ID of part 1 in the ID prediction step. Based on these considerations, this paper proposes a network with Part Prediction Alignment (PPA) to extract part-level features for re-id. The architecture of the network is concise, with slight modifications to the ResNet-50 network.
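The six-part division mentioned above can be illustrated with a minimal sketch. It assumes uniform horizontal stripes and average pooling over a backbone feature map, which is a common partition strategy but is not confirmed here as the exact scheme used in this paper. The function name `split_into_parts` and the nested-list layout `feature_map[h][w][c]` are hypothetical.

```python
def split_into_parts(feature_map, num_parts=6):
    """Average-pool a feature map into `num_parts` horizontal stripes.

    feature_map[h][w] is a list of C channel values (hypothetical layout).
    Returns `num_parts` vectors of length C, ordered from top to bottom.
    """
    height = len(feature_map)
    if height % num_parts != 0:
        raise ValueError("height must be divisible by num_parts")
    rows_per_part = height // num_parts
    channels = len(feature_map[0][0])

    parts = []
    for p in range(num_parts):
        # Rows belonging to the p-th stripe.
        stripe = feature_map[p * rows_per_part:(p + 1) * rows_per_part]
        # Flatten the stripe into a list of spatial cells.
        cells = [cell for row in stripe for cell in row]
        # Average every channel over all cells of the stripe.
        pooled = [sum(cell[c] for cell in cells) / len(cells)
                  for c in range(channels)]
        parts.append(pooled)
    return parts
```

With rigid stripes like these, a badly cropped image can leave a stripe covering only background, which is the misalignment case described for part 1 in Fig. 1(d).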

Global features contain the global spatial information of an image but lack cues about the person’s spatial structure, while local features are the opposite. Our motivation is that combining the advantages of both will make the extracted features more discriminative and yield higher accuracy. Therefore, in this paper we propose to use a teacher–student network to extract global–local person features for re-id. The structure of the teacher–student network is shown in Fig. 3: the global feature network acts as the teacher and guides the learning of the local feature network. The network takes a whole image as input and outputs the person’s feature. Features extracted in this way include both kinds of information, which benefits the testing phase.
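One common way for a teacher branch to guide a student branch is to penalize the divergence between their softened ID predictions. The sketch below follows that generic knowledge-distillation formulation and is an assumption, since this excerpt does not state the loss used in the paper. The names `distillation_loss` and `temperature` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened prediction to the student's.

    Assumed setup: the global branch supplies teacher_logits and the local
    branch supplies student_logits, both over the same set of identities.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return kl_divergence(p_teacher, p_student)
```

The loss is zero when both branches produce the same logits and grows as the student's prediction drifts from the teacher's.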

Our contributions can be summarized as follows:

(1) We propose a part-based person feature extraction network with Part Prediction Alignment (PPA). The network needs neither external datasets nor a pose estimator for guidance, which reduces the complexity of the training process.

(2) We adopt a teacher–student network for global–local feature extraction, so that the extracted features contain both the global spatial information and the spatial structure cues of the person. The extracted features are therefore more discriminative and achieve higher scores during testing.

(3) Experiments on three datasets, Market-1501, DukeMTMC-reID and CUHK03, demonstrate that the proposed network effectively improves re-id accuracy.

Section snippets

Related works

In this section, we mainly discuss related work on part-based deep features and the teacher–student network.

Proposed method

In Section 3.1, we introduce the baseline of our proposal and Part Prediction Alignment (PPA). PPA focuses on the predicted ID distributions and aligns them across parts. Section 3.2 describes the teacher–student network for global–local feature extraction. The training strategy and the parameters are described in Section 3.3.
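To make the idea of aligning predicted distributions across parts concrete, here is a minimal sketch. It assumes the alignment is measured as the mean symmetric KL divergence over all pairs of part predictions. The exact PPA loss is defined in Section 3.1, which is not reproduced in this excerpt, so the formulation and the name `part_alignment_loss` are hypothetical.

```python
import math
from itertools import combinations


def softmax(logits):
    """Convert raw ID logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def part_alignment_loss(part_logits):
    """Mean symmetric KL divergence over all pairs of part predictions.

    part_logits: one list of ID logits per body part (hypothetical layout).
    The pairwise symmetric KL is an assumed choice, not the paper's stated loss.
    """
    if len(part_logits) < 2:
        raise ValueError("need at least two parts to align")
    probs = [softmax(logits) for logits in part_logits]
    pairs = list(combinations(probs, 2))
    return sum(kl_divergence(p, q) + kl_divergence(q, p)
               for p, q in pairs) / len(pairs)
```

Under this sketch, the loss vanishes when every part predicts the same ID distribution and becomes positive as soon as one part, for example a stripe containing only background, disagrees with the others.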

Datasets

We use three common benchmarks, Market-1501 (Zheng et al., 2015), DukeMTMC-reID (Ristani et al., 2016) and CUHK03 (Wei et al., 2014), to verify the proposed method. Table 1 gives detailed information on the three datasets. The ranking accuracy of a re-id model is measured by Rank-n and mean average precision (mAP); higher Rank-n and mAP scores indicate a better re-id model.
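The two metrics can be sketched as follows. The sketch assumes that, for each query, the gallery has already been sorted by ascending distance and reduced to a list of booleans marking whether each ranked gallery image shares the query's ID. Benchmark-specific steps such as excluding same-camera gallery images are omitted, and all function names are hypothetical.

```python
def rank_n_accuracy(ranked_matches, n):
    """Fraction of queries with at least one correct match in the top n."""
    hits = sum(1 for flags in ranked_matches if any(flags[:n]))
    return hits / len(ranked_matches)


def average_precision(flags):
    """Mean of the precision values taken at each correct match position."""
    hits = 0
    precisions = []
    for position, is_match in enumerate(flags, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / position)
    # A query with no correct gallery match contributes an AP of 0.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(ranked_matches):
    """Average of the per-query average precision."""
    return sum(average_precision(f) for f in ranked_matches) / len(ranked_matches)
```

For two queries with match flags `[True, False, True]` and `[False, True, False]`, Rank-1 is 0.5, Rank-2 is 1.0, and mAP is about 0.667.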

The Market-1501 dataset, released in 2015, contains 32,668 images. The person images are captured from 6

Conclusion

This paper makes two contributions to the extraction of discriminative person features. First, we propose a part-level feature extraction network with Part Prediction Alignment (PPA). This network requires neither additional datasets nor a pretrained pose estimator for guidance, which reduces the complexity of the re-id model. The global and local features contain different spatial information, and combining the two makes the person features more discriminative. Therefore, the second

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was partially supported by the National Key Research and Development Program of China (No. 2018YFB1308604), the National Natural Science Foundation of China (No. 61672215, No. 61976086) and the Hunan Science and Technology Innovation Project (No. 2017XK2102).

References (56)

  • Felzenszwalb, P.F., et al., 2010. Object detection with discriminatively trained part based models. IEEE Trans. Pattern Anal. Mach. Intell.
  • Gong, K., Liang, X., Zhang, D., Shen, X., Lin, L., 2017. Look into person: Self-supervised structure-sensitive learning...
  • He, L., Liang, J., Li, H., Sun, Z., 2018. Deep spatial feature reconstruction for partial person re-identification:...
  • Hermans, A., et al., 2017. In defense of the triplet loss for person re-identification.
  • Huang, Z., et al., 2017. Like what you like: Knowledge distill via neuron selectivity transfer.
  • Huang, H., et al., 2018. EANet: Enhancing alignment for cross-domain person re-identification.
  • Jieru, J., et al., 2019. Frustratingly Easy Person Re-Identification: Generalizing Person Re-ID in Practice.
  • Junwei, H., et al., 2018. Advanced deep-learning techniques for salient and category-specific object detection: A survey. IEEE Signal Process. Mag.
  • Kalayeh, M.M., Basaran, E., Gokmen, M., Kamasak, M.E., Shah, M., 2018. Human semantic parsing for person...
  • Kumar, V., Namboodiri, A., Paluri, M., Jawahar, C.V., 2017. Pose-Aware person recognition. In: CVPR. pp....
  • Lei, J.B., Caruana, R., 2014. Do deep nets really need to be deep? In: International Conference on Neural Information...
  • Li, D., Chen, X., Zhang, Z., Huang, K., 2017. Learning deep context-aware features over body and latent parts for...
  • Li, W., et al. Person re-identification by deep joint learning of multi-loss classification.
  • Liao, S., Yang Hu, X.Z., Li, S.Z., 2015. Person re-identification by local maximal occurrence representation and...
  • Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C., 2016. SSD: Single shot multibox...
  • Ma, A.J., Yuen, P.C., Li, J., 2013. Domain transfer support vector ranking for person re-identification without target...
  • Qian, X., et al. Pose-normalized image generation for person re-identification.
  • Ristani, E., Solera, F., Zou, R.S., Cucchiara, R., Tomasi, C., 2016. Performance measures and a data set for...