Person re-identification with part prediction alignment

doi:10.1016/j.cviu.2021.103172

Computer Vision and Image Understanding

Volume 205, April 2021, 103172

Highlights

• We propose a part-based person feature extraction network with Part Prediction Alignment (PPA). The network needs neither external datasets nor a pose estimator, which reduces the complexity of the training process.
• We adopt a teacher–student network for global–local feature extraction, so that the extracted features are more discriminative and achieve higher scores in the testing phase.
• Experiments on three datasets demonstrate that the proposed network effectively improves re-id accuracy.

Abstract

The key to successful person re-identification (re-id) is extracting discriminative person features. Various part-level feature extraction methods have been proposed to capture local person features for re-id. A prerequisite of part feature extraction is that each part is well located. We believe that ID predictions for different parts of the same image should be consistent. Instead of relying on external datasets and a pose estimator for guidance, we propose a re-id model with Part Prediction Alignment (PPA), which aligns the predicted distributions of the parts. Because global and local features contain different spatial information, we expect that combining the two will further improve re-identification performance. Therefore, we adopt a teacher–student training strategy based on PPA for global–local feature extraction, in which the global feature extraction branch acts as a teacher that guides the training of the local feature branch. Experimental results on the Market-1501, DukeMTMC-reID and CUHK03 (including CUHK03_Detected and CUHK03_Labeled) datasets confirm the effectiveness of our proposal: we achieve Rank-1 accuracy of 92.4%, 85.1%, 65.5% and 69.2% on Market-1501, DukeMTMC-reID, CUHK03_Detected and CUHK03_Labeled, respectively.

Introduction

Person re-identification (re-id) aims at identifying a person of interest at other places; it is a challenging task in computer vision (Sun et al., 2018, Binghui et al., 2019, Ruibing et al., 2019, Tianlong et al., 2019). Recently, deep learning has become the most popular approach in the computer vision community due to its high discriminative ability, in tasks such as object detection, object tracking and person re-id (Junwei et al., 2018, Yi et al., 2019, Zhang et al., 2020, Wang et al., 2018a). Many state-of-the-art re-id models achieve higher accuracy based on deep-learned features (Wei et al., 2017, He et al., 2018, Qian et al., 2018, Cheng et al., 2020, Jieru et al., 2019, Wang et al., 2019). Person feature extraction can be roughly divided into two categories: global feature extraction and local feature extraction. Global features focus on overall information but ignore the spatial structure of a person, so in recent years many re-id methods have mainly extracted local features for re-identification.

The key to extracting discriminative local features is that every part should be located accurately (Wei et al., 2017). This relies on external information about the pedestrian: a pose estimator and human pose estimation datasets are needed to extract it. A pose estimator and external datasets undoubtedly increase the training complexity, so some recent re-id models abandon pose cues and still achieve competitive accuracy (Sun et al., 2018, Zhang and Huang, 2018, Kumar et al., 2017, Zheng et al., 2017). Some partition strategies are shown in Fig. 1. The purpose of accurate localization is to enhance the internal consistency of the parts, so that the ID of each part is predicted precisely. Since the ultimate goal of pose estimation is to improve the prediction accuracy of each local area, we reconsider this problem from the perspective of per-part ID prediction: we aim to align the prediction results of the parts and thereby enhance their prediction consistency.

In the re-id community, person images are usually divided into six parts (head, upper body, lower body, upper legs, lower legs and feet). We believe that for images of the same person, no matter how many parts the image is divided into, the ID predictions of all parts should be the same. However, during training, the ID predictions of the parts differ because of the misalignment of person images. As the second sub-picture of Fig. 1(d) shows, part 1 contains only background, so in the ID prediction step the network has difficulty judging the ID of part 1. Based on these considerations, this paper proposes a network with Part Prediction Alignment (PPA) to extract part-level features for re-id. The architecture of the network is concise, with slight modifications to the ResNet-50 network.
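The six-part division described above can be illustrated with a minimal sketch. It assumes the parts are equal-height horizontal stripes of a single-channel feature map and that each stripe is average-pooled; the excerpt does not state how the paper partitions or pools, so the function name `part_pool` and these choices are purely illustrative.

```python
def part_pool(feature_map, num_parts=6):
    """Average-pool a 2-D feature map (a list of rows) into horizontal stripes.

    Assumption: parts are equal-height stripes; this is an illustrative
    choice, not necessarily the partition used in the paper.
    """
    height = len(feature_map)
    if height % num_parts != 0:
        raise ValueError("feature map height must be divisible by num_parts")
    stripe_height = height // num_parts
    pooled = []
    for part in range(num_parts):
        rows = feature_map[part * stripe_height:(part + 1) * stripe_height]
        values = [value for row in rows for value in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Toy 12 x 2 map whose entries equal their row index.
toy_map = [[float(i), float(i)] for i in range(12)]
part_features = part_pool(toy_map)
```

In a real network each stripe would yield a feature vector (one value per channel) rather than a single number; the scalar case is used here only to keep the sketch short.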

Global features contain the global spatial information of an image but lack cues about the person’s spatial structure, while local features are the opposite. Our motivation is that if we can combine the advantages of both, the extracted features will be more discriminative and achieve higher accuracy. Therefore, in this paper we propose to use a teacher–student network to extract global–local person features for re-id. The structure of the teacher–student network is shown in Fig. 3: the global feature network acts as a teacher and guides the learning of the local feature network. We take a whole image as the input and output the person’s feature. Features extracted in this way include both kinds of information, which benefits the testing phase.
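One common way to let a teacher branch guide a student branch is to penalize the divergence between their softened ID predictions. The sketch below assumes a temperature-scaled KL divergence, a standard knowledge-distillation formulation; the excerpt does not give the paper's actual guidance loss, so `distillation_loss` and the temperature value are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    """Hypothetical guidance term: the global (teacher) branch's softened
    ID distribution is the target for the local (student) branch."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return kl_divergence(p_teacher, p_student)


agree = distillation_loss([5.0, 1.0, 0.0], [5.0, 1.0, 0.0])
disagree = distillation_loss([5.0, 1.0, 0.0], [0.0, 1.0, 5.0])
```

The loss is zero when both branches predict the same distribution and grows as the student's prediction drifts from the teacher's.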

Our contributions can be summarized as follows:

(1) We propose a part-based person feature extraction network with Part Prediction Alignment (PPA). The network needs neither external datasets nor a pose estimator for guidance, which reduces the complexity of the training process.

(2) We adopt a teacher–student network for global–local feature extraction, so that the extracted features contain both the global spatial information and the spatial structure cues of the person. In this way, the extracted features are more discriminative and achieve higher scores during testing.

(3) Experiments on three datasets, Market-1501, DukeMTMC-reID and CUHK03, demonstrate that the proposed network effectively improves re-id accuracy.

Related works

In this section, we mainly discuss related work on part-based deep features and teacher–student networks.

Proposed method

In Section 3.1, we introduce the baseline of our proposal and the Part Prediction Alignment (PPA). PPA focuses on the predicted distributions and aligns them across the parts. Section 3.2 describes the teacher–student network for global–local feature extraction. The training strategy and the parameters are described in Section 3.3.
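To make the idea of aligning predicted distributions across parts concrete, the following sketch measures how far each part's ID distribution lies from the average over all parts. Using the mean KL divergence to the average distribution is an assumption made for illustration; the excerpt does not specify the paper's alignment loss, and `part_alignment_loss` is a hypothetical name.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def part_alignment_loss(part_probs):
    """Mean KL divergence from each part's ID distribution to the average
    distribution over all parts (an illustrative alignment measure)."""
    num_parts = len(part_probs)
    num_ids = len(part_probs[0])
    mean = [sum(p[k] for p in part_probs) / num_parts for k in range(num_ids)]
    return sum(kl(p, mean) for p in part_probs) / num_parts


# Three parts that agree on the identity.
consistent = [[0.7, 0.2, 0.1]] * 3
# Same, except one part (for example, a background-only stripe) disagrees.
inconsistent = [[0.7, 0.2, 0.1], [0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]

loss_consistent = part_alignment_loss(consistent)
loss_inconsistent = part_alignment_loss(inconsistent)
```

A term of this kind is near zero when all parts predict the same identity distribution and becomes positive when a misaligned part disagrees, which is the behaviour an alignment objective is meant to penalize.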

Datasets

We use three common benchmarks, Market-1501 (Zheng et al., 2015), DukeMTMC-reID (Ristani et al., 2016) and CUHK03 (Wei et al., 2014), to verify our proposed method. Table 1 shows the detailed information of the three datasets. The ranking accuracy of a re-id model is measured by Rank-n and mean average precision (mAP); higher Rank-n and mAP scores indicate a better re-id model.
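As a small illustration of the two metrics, the sketch below computes Rank-n and mAP from gallery identities that are already sorted by similarity to each query. It is a simplified version: standard re-id evaluation protocols also discard same-camera and junk gallery images, which is omitted here, and all function names are illustrative.

```python
def rank_n_hit(ranked_ids, query_id, n):
    """True if the query identity appears among the top-n gallery results."""
    return query_id in ranked_ids[:n]


def average_precision(ranked_ids, query_id):
    """Average of the precision values at each position holding a correct match."""
    hits = 0
    precision_sum = 0.0
    for position, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0


def evaluate(queries, n=1):
    """queries: list of (query_id, ranked_gallery_ids). Returns (Rank-n, mAP)."""
    rank_n = sum(rank_n_hit(ranked, qid, n) for qid, ranked in queries) / len(queries)
    mean_ap = sum(average_precision(ranked, qid) for qid, ranked in queries) / len(queries)
    return rank_n, mean_ap


toy_queries = [(1, [1, 2, 1]), (2, [3, 2, 2])]
rank1, mean_ap = evaluate(toy_queries, n=1)
```

In the toy data the first query is matched at position 1 and the second only at position 2, so Rank-1 is 0.5, while mAP also rewards the later correct matches.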

The Market-1501 dataset, released in 2015, contains 32668 images. The person images are captured from 6

Conclusion

This paper makes two contributions to extracting discriminative person features. First, we propose a part-level feature extraction network with Part Prediction Alignment (PPA). This network requires neither additional datasets nor a pretrained pose estimator for guidance, which reduces the complexity of the re-id model. The global and local features contain different spatial information, and combining the two makes the person features more discriminative. Therefore, the second

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was partially supported by the National Key Research and Development Program of China (No. 2018YFB1308604), the National Natural Science Foundation of China (No. 61672215, No. 61976086) and the Hunan Science and Technology Innovation Project (No. 2017XK2102).

References (56)

Barbosa, I.B., et al., 2018. Looking beyond appearances: Synthetic training data for deep CNNs in re-identification. Comput. Vis. Image Underst.
Zhang, Z., et al., 2018. Person re-identification based on heterogeneous part-based deep network in camera networks. IEEE Trans. Emerg. Top. Comput. Intell.
Zhong, W., et al., 2019. Combining multilevel feature extraction and multi-loss learning for person re-identification. Neurocomputing.
Binghui, C., Weihong, D., Jiani, H., 2019. Mixed high-order attention network for person re-identification. In:...
Chang, X., Hospedales, T.M., Xiang, T., 2018. Multi-level factorisation net for person re-identification. In: CVPR....
Cheng, D., et al., 2020. Fusion of multiple person re-id methods with model and data-aware abilities. IEEE Trans. Cybern.
Dai, J., et al., 2016. R-FCN: Object detection via region-based fully convolutional networks.
Das, A., Chakraborty, A., Roy-Chowdhury, A.K., 2014. Consistent re-identification in a camera network. In: European...
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Li, F.F., 2009. ImageNet: A large-scale hierarchical image database....
Di, W., et al., 2019. A novel deep model with multi-loss and efficient training for person re-identification. Neurocomputing.
Felzenszwalb, P.F., et al., 2010. Object detection with discriminatively trained part based models. IEEE Trans. Pattern Anal. Mach. Intell.
Gong, K., Liang, X., Zhang, D., Shen, X., Lin, L., 2017. Look into person: Self-supervised structure-sensitive learning...
He, L., Liang, J., Li, H., Sun, Z., 2018. Deep spatial feature reconstruction for partial person re-identification:...
Hermans, A., et al., 2017. In defense of the triplet loss for person re-identification.
Huang, Z., et al., 2017. Like what you like: Knowledge distill via neuron selectivity transfer.
Huang, H., et al., 2018. EANet: Enhancing alignment for cross-domain person re-identification.
Jieru, J., et al., 2019. Frustratingly Easy Person Re-Identification: Generalizing Person Re-ID in Practice.
Junwei, H., et al., 2018. Advanced deep-learning techniques for salient and category-specific object detection: A survey. IEEE Signal Process. Mag.
Kalayeh, M.M., Basaran, E., Gokmen, M., Kamasak, M.E., Shah, M., 2018. Human semantic parsing for person...
Kumar, V., Namboodiri, A., Paluri, M., Jawahar, C.V., 2017. Pose-Aware person recognition. In: CVPR. pp....
Lei, J.B., Caruana, R., 2014. Do deep nets really need to be deep? In: International Conference on Neural Information...
Li, D., Chen, X., Zhang, Z., Huang, K., 2017. Learning deep context-aware features over body and latent parts for...
Li, W., et al. Person re-identification by deep joint learning of multi-loss classification.
Liao, S., Yang Hu, X.Z., Li, S.Z., 2015. Person re-identification by local maximal occurrence representation and...
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C., 2016. SSD: Single shot multibox...
Ma, A.J., Yuen, P.C., Li, J., 2013. Domain transfer support vector ranking for person re-identification without target...
Qian, X., et al. Pose-normalized image generation for person re-identification.
Ristani, E., Solera, F., Zou, R.S., Cucchiara, R., Tomasi, C., 2016. Performance measures and a data set for...
