Learning refined attribute-aligned network with attribute selection for person re-identification

doi:10.1016/j.neucom.2020.03.057

Neurocomputing

Volume 402, 18 August 2020, Pages 124-133

Abstract

Effective person re-identification (Re-ID) is often required in real applications. Most existing approaches either assume that the detected pedestrian bounding boxes are well aligned or use limited human structural information (pose, attention, segmentation) to calibrate the misalignment. However, the value of attributes for pedestrian alignment remains underexplored. Furthermore, previous works have largely ignored the hierarchy of attributes, and appearance features and attribute features are often fused in a rigid way. This directly limits the discriminability and robustness of the feature representation. In this paper, we propose a Refined Attribute-aligned Network (RAN), which consists of a coarse-alignment module and a fine-alignment module. First, pre-trained part and attribute predictors are used to generate body parts and candidate attributes. Then the body parts are used for coarse alignment and the attributes are selected by an agent. The agent is optimized with a policy gradient algorithm, which maximizes the cumulative reward to increase the probability of proper attribute selection. Finally, for fine alignment, the attribute maps and body part features are aggregated by a bilinear-pooling layer to support accurate Re-ID. Extensive experimental results on multiple datasets, including CUHK03, DukeMTMC and Market-1501, demonstrate the superiority of our method over state-of-the-art methods.

Introduction

Smart cameras constitute one of the most important information technologies affecting various aspects of everyday life, including how people are tracked and monitored in different types of public spaces. The main objective of an intelligent Re-ID system is to retrieve pedestrian images across non-overlapping cameras at different times. Re-ID has become an increasingly vital visual analytics task with a wide range of real applications.

The misalignment of body parts (i.e., the body parts in a query image do not line up with those in the gallery images, as shown in Fig. 1(a)) can greatly degrade the performance of existing Re-ID systems. To overcome this limitation, recent approaches have tried to leverage localization information and combine the representations computed over the localized regions [15], [25]. Proper alignment plays an important role in supporting effective Re-ID. As shown in Fig. 1(b), the mainstream alignment strategies used in existing Re-ID methods can be broadly divided into three groups: attention-based, part-based and pose-based.

Some researchers employ attention mechanisms to refine the features via deep learning [24], [26], [48], [50], [51], [55]. The basic idea is to use high-level semantic cues to align pedestrians indirectly, which suppresses noisy features and emphasizes the meaningful local parts. Nevertheless, the learning of attention is not explicitly supervised, which makes it much less effective at improving feature quality.
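The re-weighting idea behind attention-based refinement can be sketched as softmax-weighted pooling over spatial locations. This is a generic illustration, not the mechanism of any cited method: the function name `attention_pool` and the assumption that attention scores are given as one scalar per location are hypothetical.

```python
import math
from typing import List


def attention_pool(feats: List[List[float]], scores: List[float]) -> List[float]:
    """Softmax-weight per-location features by their attention scores and sum them.

    feats[n] is the feature vector at spatial location n; scores[n] is an
    (assumed given) unnormalized attention score for that location.
    """
    if not feats or len(feats) != len(scores):
        raise ValueError("feats and scores must be non-empty and equally long")
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    channels = len(feats[0])
    return [sum(w * f[k] for w, f in zip(weights, feats)) for k in range(channels)]


# Two locations: equal scores reduce to plain averaging, while a dominant
# score makes the pooled feature follow that location.
uniform = attention_pool([[1.0, 0.0], [3.0, 2.0]], [0.0, 0.0])
peaked = attention_pool([[1.0, 0.0], [3.0, 2.0]], [0.0, 50.0])
```

Because the scores are only learned through the final loss in such methods, nothing forces high weights onto body regions, which is the lack of explicit supervision noted above.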

On the other hand, a group of works compute local representations by partitioning the pedestrian image into cells [12], [39], [43], [52], [56]. For instance, person bounding boxes are segmented into horizontal stripes or grids to extract features. Then, a metric is applied based on the relevance among the parts. However, these methods lack fine-grained part localization within the bounding box, which makes them unreliable for pedestrian alignment.
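As a minimal sketch of the stripe-based partitioning described above, the following averages a feature map over equal horizontal bands. The function name `stripe_features`, the nested-list layout `[row][col][channel]` and the use of average pooling are assumptions for illustration; the cited methods differ in their details.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [row][col][channel]


def stripe_features(fmap: FeatureMap, num_stripes: int) -> List[List[float]]:
    """Average-pool a feature map over `num_stripes` horizontal bands."""
    height = len(fmap)
    if num_stripes <= 0 or height < num_stripes:
        raise ValueError("need 1 <= num_stripes <= number of rows")
    channels = len(fmap[0][0])
    pooled = []
    for s in range(num_stripes):
        # Integer bounds so the stripes tile the rows without gaps or overlap.
        top = s * height // num_stripes
        bottom = (s + 1) * height // num_stripes
        cells = [cell for row in fmap[top:bottom] for cell in row]
        pooled.append([sum(c[k] for c in cells) / len(cells) for k in range(channels)])
    return pooled


# Toy 4x2 map with a single channel whose value equals the row index.
toy = [[[float(r)] for _ in range(2)] for r in range(4)]
stripes = stripe_features(toy, num_stripes=2)
```

The stripe boundaries are fixed by the box geometry alone, so a poorly cropped pedestrian shifts body parts into the wrong stripe, which is the weakness pointed out in the paragraph above.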

Another group of strategies uses a pose estimator to achieve proper alignment by detecting the key points of the pedestrian [14], [35], [46], [47], [57]. Pose-guided methods do provide exact key points for calibration. However, these strategies rely excessively on highly accurate pose estimation, and even state-of-the-art pose estimation frameworks are often error-prone on person re-identification datasets.

Human attribute learning [21], [22], [27], [31], [32], [36] has proven to be an effective way to improve person retrieval systems. Human attributes describe the visual properties of a distinguishable body part, item of clothing or accessory. In fact, attribute features contain detailed localization information that is complementary to the global and local features. Thus, making full use of attributes is especially effective for the fine-grained Re-ID task. However, conventional methods use fixed categories of attributes to aid Re-ID, which cannot cope with the large variance across pedestrian images. The variance can come from inaccurate pedestrian detection results, or from the pedestrians themselves through different types of clothing, occlusions, poses, or even camera viewpoints. Fusing fixed categories of attributes with the pedestrian feature therefore introduces noise whenever improper attributes are selected. In view of this, we introduce an attribute selection model that adaptively determines which attributes to use. However, this selection procedure is non-differentiable. To handle this problem, we introduce Reinforcement Learning (RL) techniques to optimize the attribute selection model. As shown in Fig. 1(c), the proposed RAN contains two alignment modules, i.e., the Coarse Alignment module and the Fine Alignment module.
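To illustrate why a policy gradient can train a discrete, non-differentiable selection step, here is a toy REINFORCE sketch. It is not the RAN agent: the independent Bernoulli policy, the moving-average baseline, the function names and in particular `toy_reward` (which pretends we know which attributes are useful) are all assumptions made for the example.

```python
import math
import random
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def toy_reward(mask: List[int], useful: List[bool]) -> float:
    """Hypothetical reward: +1 per useful attribute kept, -1 per noisy one kept."""
    return float(sum((1 if u else -1) for m, u in zip(mask, useful) if m))


def train_selector(useful: List[bool], steps: int = 2000, lr: float = 0.1,
                   seed: int = 0) -> List[float]:
    """Return the learned keep-probability of each candidate attribute."""
    rng = random.Random(seed)
    logits = [0.0] * len(useful)  # one Bernoulli logit per candidate attribute
    baseline = 0.0                # running mean reward, reduces gradient variance
    for _ in range(steps):
        probs = [sigmoid(z) for z in logits]
        # Sampling a binary mask is the non-differentiable action.
        mask = [1 if rng.random() < p else 0 for p in probs]
        reward = toy_reward(mask, useful)
        advantage = reward - baseline
        # REINFORCE: d/dz log pi(mask) = mask - prob for a Bernoulli policy.
        logits = [z + lr * advantage * (m - p)
                  for z, m, p in zip(logits, mask, probs)]
        baseline += 0.05 * (reward - baseline)
    return [sigmoid(z) for z in logits]


useful = [True, False, True, False]
probs = train_selector(useful)
print([round(p, 2) for p in probs])
```

The update only needs the sampled mask and a scalar reward, so no gradient has to flow through the selection itself. In expectation, the keep-probability of an attribute rises when keeping it raises the reward and falls otherwise.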

First, as described in Section 3.1, we pre-train a part predictor and an attribute extractor on RAP [16], which provides finely labeled attributes within body parts. The pre-trained part and attribute predictors can thus generate coarse body parts and candidate attributes. The part predictor outputs body localization grids for a pedestrian, from which part features are extracted and used for coarse alignment at the body-part level.

To mine the high-level semantic cues of attributes, we design an agent network for attribute selection to refine the part features. In particular, we use an offline search method to obtain the target attribute-selection strategy, from which the reward is generated. The agent is trained in a straightforward and explicitly supervised fashion, which gives RAN high robustness and generalization in handling the misalignment issue. Finally, we compute the bilinear mapping of the part features and the selected attribute maps for fine alignment, which produces the final feature representation for Re-ID. Extensive experiments on large-scale datasets, including CUHK03 [17], DukeMTMC [30], [59] and Market-1501 [58], demonstrate the effectiveness of the proposed RAN. Our main contributions are summarized as follows:

•
First, we propose the Refined Attribute-aligned Network (RAN) to handle misalignment in the Re-ID task in a coarse-to-fine fashion. The Coarse Alignment (CA) module preliminarily aligns the human body parts using part features. The Fine Alignment (FA) module uses the high-level semantic cues and localization information of the attribute features to further enhance the part features from CA.
•
Second, we propose an agent for attribute selection trained with reinforcement learning, which selects proper attribute features to fuse with the part-level features via bilinear pooling. The agent provides significant flexibility for attribute selection and further enhances the attribute features for fine alignment.
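The bilinear fusion of part features and attribute maps mentioned above can be sketched as a sum of outer products over spatial locations. This is plain (full) bilinear pooling with the commonly used signed square root and L2 normalization; the exact layer used in RAN is not specified in this excerpt, so the function name `bilinear_pool`, the input layout and the normalization are assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows = spatial locations, columns = channels


def bilinear_pool(part_feats: Matrix, attr_maps: Matrix) -> List[float]:
    """Sum the outer product of two feature sets over shared locations.

    Returns a flattened (C1 * C2)-dimensional vector after signed square
    root and L2 normalization.
    """
    if not part_feats or len(part_feats) != len(attr_maps):
        raise ValueError("inputs must cover the same, non-empty set of locations")
    c1, c2 = len(part_feats[0]), len(attr_maps[0])
    pooled = [0.0] * (c1 * c2)
    for p, a in zip(part_feats, attr_maps):
        for i in range(c1):
            for j in range(c2):
                pooled[i * c2 + j] += p[i] * a[j]
    # Signed square root dampens large activations while keeping the sign.
    pooled = [math.copysign(math.sqrt(abs(v)), v) for v in pooled]
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled] if norm > 0 else pooled


# Two locations, two part channels, one attribute channel.
fused = bilinear_pool([[1.0, 0.0], [0.0, 2.0]], [[1.0], [2.0]])
```

Each output entry pairs one part channel with one attribute channel, so a part response only contributes where the selected attribute map is active at the same location. The output size grows as C1 * C2, which is why compact approximations are often preferred for wide feature maps.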

Section snippets

Related work

In this section, we review recent attention-based, part-based and pose-based methods.

Attention-based alignment methods. A group of researchers exploit attention methods to handle misalignment by learning high-level semantic information. Ref. [23] introduced a comparative attention framework to compute the distance between query and gallery images. Combining the responses of different body parts into an attention map, Ref. [50] assembled the pixels with higher attention values to locate the

Learning framework

Our approach mainly consists of two parts, i.e., the Coarse Alignment module and the Fine Alignment module, as illustrated in Fig. 2. When a probe image is input to the CA module, the part predictor localizes the three body parts (head, upper body, lower body) and extracts the part features separately. Meanwhile, the attribute features are extracted as the input images pass through the FA module. Then an agent for attribute selection is used to conduct fine alignment. Eventually, the features

Datasets and evaluation protocols

Our proposed method is evaluated on three mainstream large-scale Person Re-ID benchmarks including Market-1501 [58], DukeMTMC-ReID [30], [59] and CUHK03 [17].

CUHK03 contains 14,097 images of 1,467 identities, captured by six cameras on the CUHK campus. The dataset provides two types of annotations, hand-labeled and DPM-detected bounding boxes, and we use the latter in this paper. Due to the time

Conclusion

In this work, we introduce a novel Refined Attribute-aligned Network (RAN) that targets the misalignment problem in Re-ID. With the guidance of attribute localization information, our approach extracts body part features for coarse alignment. Furthermore, we design an agent for attribute selection trained with reinforcement learning. The body part features and attribute maps are fused by a bilinear pooling operation, which realizes the complementation of localized part features and the fine-grained attribute

CRediT authorship contribution statement

Yuxuan Shi: Conceptualization, Methodology, Software, Writing - original draft. Hefei Ling: Supervision, Writing - review & editing. Lei Wu: Visualization, Investigation. Jialie Shen: Writing - review & editing. Ping Li: Project administration, Formal analysis.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported in part by the Natural Science Foundation of China under Grants U1536203 and 61972169, the National Key Research and Development Program of China (2016QY01W0200), and the Major Scientific and Technological Project of Hubei Province (2018AAA068 and 2019AAA051).

Yuxuan Shi obtained the B.S. degree from Wuhan University of Science and Technology, China in 2014. He received the M.S. degree from Wuhan University of Technology, China in 2017. Currently, he is pursuing his Ph.D. degree in the School of Computer Science and Technology at Huazhong University of Science and Technology, China. His research interests include computer vision and multimedia data analysis, image classification and person re-identification.

References (62)

X. Bai et al. Deep-Person: Learning discriminative deep features for person re-identification. Pattern Recognition (2020)
I.B. Barbosa et al. Looking beyond appearances: Synthetic training data for deep CNNs in re-identification. Computer Vision and Image Understanding (2018)
X. Fan et al. SphereReID: Deep hypersphere manifold embedding for person re-identification. Journal of Visual Communication and Image Representation (2019)
Y. Huang et al. DeepDiff: Learning deep difference features on human body parts for person re-identification. Neurocomputing (2017)
F. Letsch et al. Localizing salient body motion in multi-person scenes using convolutional neural networks. Neurocomputing (2019)
Y. Lin et al. Improving person re-identification by attribute and identity learning. Pattern Recognition (2019)
H. Ling et al. Improving person re-identification by multi-task learning. Neurocomputing (2019)
Y. Liu et al. A new patch selection method based on parsing and saliency detection for person re-identification. Neurocomputing (2020)
L. Lu et al. A two-level attention-based interaction model for multi-person activity recognition. Neurocomputing (2018)
T. Chen et al. ABD-Net: Attentive but diverse person re-identification. Proceedings of the IEEE International Conference on Computer Vision (2019)
Y. Chen et al. Person re-identification by deep learning multi-scale representations. Proceedings of the IEEE International Conference on Computer Vision (2017)
J. Deng et al. ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009)
P. Felzenszwalb et al. A discriminatively trained, multiscale, deformable part model. 2008 IEEE Conference on Computer Vision and Pattern Recognition (2008)
Y. Fu et al. Horizontal pyramid matching for person re-identification. Proceedings of the AAAI Conference on Artificial Intelligence (2019)
M. Geng et al. Deep transfer learning for person re-identification (2016)
L. He et al. Deep spatial feature reconstruction for partial person re-identification: Alignment-free approach. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

Hefei Ling obtained the B.S., M.S. and Ph.D. degrees from Huazhong University of Science and Technology (HUST), China in 1999, 2002 and 2005, respectively. He is currently a professor in the School of Computer Science and Technology, HUST. Prof. Ling was a visiting professor at University College London from 2008 to 2009. He has published more than 100 papers. He now serves as director of the Digital Media and Intelligent Technology Research Institute.

Lei Wu received the B.E. degree in Information and Computing Science from Wuhan University of Science and Technology, Wuhan, China in 2016. He is currently pursuing the Ph.D. degree at Huazhong University of Science and Technology, Wuhan, China. His research interests include information retrieval and non-convex optimization.

Jialie Shen is a Reader in Computer Science with the School of Electronics, Electrical Engineering and Computer Science, Queen’s University of Belfast, UK. His main research interests include information retrieval, video analytics and machine learning.

Ping Li is a lecturer in the School of Computer Science and Technology, Huazhong University of Science and Technology (HUST). He received his Ph.D. degree in Computer Application from HUST in 2009. His research interests include multimedia security, image retrieval and machine learning.

In defense of the triplet loss for person re-identification
Scalable metric learning via weighted approximate rank component analysis. European Conference on Computer Vision
Human semantic parsing for person re-identification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
A richly annotated dataset for pedestrian attribute recognition. arXiv preprint arXiv:1603.07054
DeepReID: Deep filter pairing neural network for person re-identification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Person re-identification by deep joint learning of multi-loss classification. arXiv preprint arXiv:1705.04724
Harmonious attention network for person re-identification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Person re-identification by local maximal occurrence representation and metric learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
End-to-end comparative attention networks for person re-identification. IEEE Transactions on Image Processing
HydraPlus-Net: Attentive deep features for pedestrian analysis. Proceedings of the IEEE International Conference on Computer Vision
Person re-identification using CNN features learned from combination of attributes. 2016 23rd International Conference on Pattern Recognition (ICPR)
Fast and scalable polynomial kernels via explicit feature maps. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767
Performance measures and a data set for multi-target, multi-camera tracking. European Conference on Computer Vision
Person re-identification by deep learning attribute-complementary information. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops