Elsevier

Neurocomputing

Volume 402, 18 August 2020, Pages 124-133

Learning refined attribute-aligned network with attribute selection for person re-identification

https://doi.org/10.1016/j.neucom.2020.03.057

Abstract

Effective person re-identification (Re-ID) is often required in real applications. Most existing approaches either assume that the detected pedestrian bounding boxes are well aligned or use limited human structural information (pose, attention, segmentation) to calibrate the misalignment. However, the value of attributes for pedestrian alignment remains underexplored. Furthermore, previous works have largely ignored the hierarchy of attributes, and appearance features and attribute features are often fused in a rigid way. This directly limits the discriminability and robustness of the feature representation. In this paper, we propose a Refined Attribute-aligned Network (RAN), which consists of a coarse-alignment module and a fine-alignment module. First, the pre-trained part and attribute predictors are used to generate body parts and candidate attributes. Then the body parts are used for coarse alignment and the attributes are selected by an agent. The agent is optimized with a policy gradient algorithm, which maximizes the cumulative reward to increase the probability of proper attribute selection. Finally, for fine alignment, the attribute maps and body part features are aggregated by a bilinear-pooling layer to support accurate Re-ID. Extensive experimental results on multiple datasets, including CUHK03, DukeMTMC and Market-1501, demonstrate the superiority of our method over state-of-the-art methods.

Introduction

Smart cameras constitute one of the most important information technologies affecting various aspects of everyday life, including how tracking and monitoring are carried out in different types of public spaces. Generally, the main objective of an intelligent Re-ID system is to retrieve pedestrian images across non-overlapping cameras at different times. Re-ID has become an increasingly vital visual analytics task with a wide range of real applications.

The misalignment of body parts (i.e., the body parts in a query image are misaligned with those in the gallery images, as shown in Fig. 1(a)) can greatly degrade the performance of existing Re-ID systems. To overcome this limitation, recent approaches have tried to leverage localization information and combine the representations extracted from the localized regions [15], [25]. Proper alignment plays an important role in supporting effective Re-ID. As shown in Fig. 1(b), the mainstream alignment strategies used in existing Re-ID methods can generally be divided into three groups: attention-based, part-based and pose-based.

A few researchers employ attention methods to refine the features via deep learning [24], [26], [48], [50], [51], [55]. The basic idea is to use high-level semantic cues to align pedestrians indirectly, which can suppress noisy features and emphasize the meaningful local parts. Nevertheless, the learning of attention is not explicitly supervised, which makes it much less effective at enhancing feature quality.

On the other hand, a group of works compute local representations by partitioning the pedestrian image into cells [12], [39], [43], [52], [56]. For instance, person bounding boxes are segmented into horizontal stripes or grids to extract features. Then, metric measurement is applied based on the relevance among the parts. However, these methods lack fine-grained part localization within the bounding box, which makes them unreliable for pedestrian alignment.
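As a rough illustration of the stripe-based partitioning described above, the following Python sketch average-pools a single-channel feature map into fixed horizontal stripes. The function name `stripe_features` and the toy maps are assumptions made for illustration and are not taken from any of the cited methods.

```python
from typing import List


def stripe_features(feature_map: List[List[float]], num_stripes: int) -> List[float]:
    """Average-pool a (height x width) map into fixed horizontal stripes.

    Hypothetical helper: real part-based models pool multi-channel CNN
    features, but the partitioning logic is the same.
    """
    height = len(feature_map)
    if num_stripes <= 0 or height == 0 or height % num_stripes != 0:
        raise ValueError("height must be a positive multiple of num_stripes")
    rows_per_stripe = height // num_stripes
    pooled = []
    for s in range(num_stripes):
        # Stripe boundaries are fixed; they do not follow the body.
        rows = feature_map[s * rows_per_stripe:(s + 1) * rows_per_stripe]
        values = [v for row in rows for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Toy example: the same "pedestrian", once well aligned and once shifted
# down by one row inside the bounding box.
aligned = [[1.0, 1.0], [1.0, 1.0], [5.0, 5.0], [5.0, 5.0]]
shifted = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [5.0, 5.0]]

print(stripe_features(aligned, 2))  # [1.0, 5.0]
print(stripe_features(shifted, 2))  # [0.5, 3.0]
```

Because the stripe boundaries are fixed, the one-row shift changes both stripe descriptors even though the content is the same, which is the localization weakness noted above.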

Another group of strategies uses a pose estimator to achieve proper alignment by detecting the key points of the pedestrian [14], [35], [46], [47], [57]. Pose-guided methods do provide exact key points for calibration. However, these strategies rely excessively on highly accurate pose estimation, and even state-of-the-art pose estimation frameworks are often error-prone on person re-identification datasets.

Human attribute learning [21], [22], [27], [31], [32], [36] has been proven to be an effective way to improve person retrieval systems. Human attributes describe the visual properties of a distinguishable part of the body, clothes or accessories. In fact, attribute features contain detailed localization information, which is complementary to the global and local features. Thus, making full use of attributes is especially effective for the fine-grained Re-ID task. However, conventional methods use fixed categories of attributes to aid the Re-ID task, which cannot cope with the great variance among pedestrian images. The variance can come from inaccurate pedestrian detection results, or from the pedestrians themselves due to different types of clothing, occlusions, poses, or even camera viewpoints. Fusing fixed categories of attributes with pedestrian features therefore introduces noise when improper attributes are selected. In view of this, we introduce an attribute selection model to adaptively determine which attributes to use. However, this selection procedure is non-differentiable. To handle this problem, we introduce Reinforcement Learning (RL) techniques to optimize the attribute selection model. As shown in Fig. 1(c), the proposed RAN contains two alignment modules, i.e., the Coarse Alignment module and the Fine Alignment module.
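To make the role of RL concrete, the sketch below shows a minimal REINFORCE-style policy gradient update for a discrete selection, which is one common way to train through a non-differentiable sampling step. It is a toy under stated assumptions, not the RAN agent: the policy here is a plain vector of logits rather than a network conditioned on the image, the reward function is a hypothetical stand-in for the reward described in the paper, and the names `softmax` and `train_selector` are illustrative.

```python
import math
import random
from typing import Callable, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_selector(
    num_attributes: int,
    reward_fn: Callable[[int], float],
    steps: int = 200,
    lr: float = 0.5,
    seed: int = 0,
) -> List[float]:
    """Train a softmax policy over attribute indices with REINFORCE.

    Assumption: one attribute is sampled per step and the reward depends
    only on that choice.
    """
    rng = random.Random(seed)
    logits = [0.0] * num_attributes
    for _ in range(steps):
        probs = softmax(logits)
        # Sampling is the non-differentiable step.
        action = rng.choices(range(num_attributes), weights=probs)[0]
        reward = reward_fn(action)
        for i in range(num_attributes):
            # Gradient of log pi(action) w.r.t. logit i for a softmax policy.
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad_log_prob
    return softmax(logits)


# Hypothetical reward: attribute 0 is the "proper" one for this toy case.
final_probs = train_selector(3, lambda a: 1.0 if a == 0 else 0.0)
print(final_probs)
```

With this reward, only selections of attribute 0 trigger an update, so its selection probability can only grow from the uniform starting point; this mirrors the stated goal of increasing the probability of proper attribute selection.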

First, as described in Section 3.1, we pre-train a part predictor and an attribute extractor on RAP [16], a dataset that includes finely labeled attributes within body parts. The pre-trained part and attribute predictors can thus generate coarse body parts and candidate attributes. The part predictor outputs body localization grids of a pedestrian, which are used to extract part features for coarse alignment at the body-part level.

In order to mine the high-level semantic cues of attributes, we design an agent network for attribute selection to refine the part features. In particular, we use an off-line search method to find the objective (attribute selection) strategy, which is used to generate the reward. The agent is trained in a straightforward and explicitly supervised fashion, which gives RAN high robustness and generalization in handling the misalignment issue. Finally, we compute the bilinear mapping of the part features and the selected attribute maps for fine alignment, which produces the final feature representation for Re-ID. Extensive experiments on large-scale datasets, including CUHK03 [17], DukeMTMC [30], [59] and Market-1501 [58], demonstrate the effectiveness of the proposed RAN. To summarize, our main contributions are as follows:

  • First, we propose the Refined Attribute-aligned Network (RAN) to handle misalignment in the Re-ID task in a coarse-to-fine fashion. The Coarse Alignment (CA) module preliminarily aligns the human body parts using part features. The Fine Alignment (FA) module uses the high-level semantic cues and localization information of the attribute features to further enhance the part features from CA.

  • Second, an agent trained with reinforcement learning is proposed for attribute selection; it selects proper attribute features to fuse with the part-level features via bilinear pooling. The agent provides significant flexibility for attribute selection and further enhances the attribute features for fine alignment.
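The fusion step named in the contributions above can be sketched as plain bilinear pooling: at each spatial location the outer product of the part feature vector and the attribute vector is taken, and the results are summed over locations. The code below is a minimal sketch under stated assumptions. The signed square root and L2 normalization are a common post-processing choice for bilinear features and are assumed here rather than taken from the paper, and the function name `bilinear_pool` is illustrative. The layer used in RAN may differ, for example by using a compact approximation.

```python
import math
from typing import List, Sequence


def bilinear_pool(
    part_feats: Sequence[Sequence[float]],
    attr_maps: Sequence[Sequence[float]],
) -> List[float]:
    """Sum outer products over locations, then normalize.

    part_feats[l] is the part feature vector (length c1) at location l;
    attr_maps[l] is the attribute vector (length c2) at the same location.
    Returns a flattened vector of length c1 * c2.
    """
    if not part_feats or len(part_feats) != len(attr_maps):
        raise ValueError("inputs must be non-empty and cover the same locations")
    c1, c2 = len(part_feats[0]), len(attr_maps[0])
    pooled = [0.0] * (c1 * c2)
    for p, a in zip(part_feats, attr_maps):
        for i in range(c1):
            for j in range(c2):
                pooled[i * c2 + j] += p[i] * a[j]
    # Assumed post-processing: signed square root, then L2 normalization.
    pooled = [math.copysign(math.sqrt(abs(v)), v) for v in pooled]
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled] if norm > 0 else pooled


# Two locations, 2-dim part features, 2-dim attribute responses.
fused = bilinear_pool([[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [2.0, 0.0]])
print(fused)  # [1.0, 0.0, 0.0, 0.0]
```

Each output entry pairs one part channel with one attribute channel, so the fused descriptor responds only where both are active at the same location.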

Section snippets

Related work

In this section, we review recent attention-based, part-based and pose-based methods.

Attention-based alignment methods. A group of researchers exploit attention methods to handle the misalignment by learning high-level semantic information. Ref. [23] introduced a comparative attention framework to compute the distance between query and gallery images. Combining the response of different body parts as attention map, Ref. [50] assembled the pixels with higher attention values to locate the

Learning framework

Our approach mainly consists of two parts, i.e., the Coarse Alignment (CA) module and the Fine Alignment (FA) module, as illustrated in Fig. 2. When a probe image is input to the CA module, the part predictor localizes the three body parts (head, upper body, lower body) and extracts the part features separately. Meanwhile, the attribute features are extracted when input images go through the FA module. Then an agent for attribute selection is proposed to conduct fine alignment. Eventually, the features

Datasets and evaluation protocols

Our proposed method is evaluated on three mainstream large-scale Person Re-ID benchmarks including Market-1501 [58], DukeMTMC-ReID [30], [59] and CUHK03 [17].

The CUHK03 dataset contains 14,097 images of 1,467 identities, captured by 6 cameras on the CUHK campus. It provides two types of annotations, hand-labeled and DPM-detected bounding boxes, and we use the latter in this paper. Due to the time

Conclusion

In this work, we introduce a novel Refined Attribute-aligned Network (RAN), which focuses on misalignment in Re-ID. With the guidance of attribute localization information, our approach extracts body part features for coarse alignment. Furthermore, we design an agent for attribute selection with reinforcement learning. The body part features and attribute maps are fused by a bilinear pooling operation, which realizes the complementation of localized part features and the fine-grained attribute

CRediT authorship contribution statement

Yuxuan Shi: Conceptualization, Methodology, Software, Writing - original draft. Hefei Ling: Supervision, Writing - review & editing. Lei Wu: Visualization, Investigation. Jialie Shen: Writing - review & editing. Ping Li: Project administration, Formal analysis.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported in part by the Natural Science Foundation of China under Grants U1536203 and 61972169, the National Key Research and Development Program of China (2016QY01W0200), and the Major Scientific and Technological Project of Hubei Province (2018AAA068 and 2019AAA051).

Yuxuan Shi obtained the B.S. degree from Wuhan University of Science and Technology, China in 2014. He received the M.S. degree from Wuhan University of Technology, China in 2017. Currently, he is pursuing his Ph.D. degree in the School of Computer Science and Technology at Huazhong University of Science and Technology, China. His research interests include computer vision and multimedia data analysis, image classification and person re-identification.

References (62)

  • Y. Chen et al. Person re-identification by deep learning multi-scale representations. Proceedings of the IEEE International Conference on Computer Vision (2017)
  • J. Deng et al. ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009)
  • P. Felzenszwalb et al. A discriminatively trained, multiscale, deformable part model. 2008 IEEE Conference on Computer Vision and Pattern Recognition (2008)
  • Y. Fu et al. Horizontal pyramid matching for person re-identification. Proceedings of the AAAI Conference on Artificial Intelligence (2019)
  • M. Geng et al. Deep transfer learning for person re-identification (2016)
  • L. He et al. Deep spatial feature reconstruction for partial person re-identification: Alignment-free approach. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  • A. Hermans et al. In defense of the triplet loss for person re-identification (2017)
  • C. Jose et al. Scalable metric learning via weighted approximate rank component analysis. European Conference on Computer Vision (2016)
  • M.M. Kalayeh et al. Human semantic parsing for person re-identification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  • D. Li et al. A richly annotated dataset for pedestrian attribute recognition. arXiv preprint arXiv:1603.07054 (2016)
  • W. Li et al. DeepReID: Deep filter pairing neural network for person re-identification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014)
  • W. Li et al. Person re-identification by deep joint learning of multi-loss classification. arXiv preprint arXiv:1705.04724 (2017)
  • W. Li et al. Harmonious attention network for person re-identification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  • S. Liao et al. Person re-identification by local maximal occurrence representation and metric learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
  • H. Liu et al. End-to-end comparative attention networks for person re-identification. IEEE Transactions on Image Processing (2017)
  • X. Liu et al. HydraPlus-Net: Attentive deep features for pedestrian analysis. Proceedings of the IEEE International Conference on Computer Vision (2017)
  • T. Matsukawa et al. Person re-identification using CNN features learned from combination of attributes. 2016 23rd International Conference on Pattern Recognition (ICPR) (2016)
  • N. Pham et al. Fast and scalable polynomial kernels via explicit feature maps. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2013)
  • J. Redmon et al. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
  • E. Ristani et al. Performance measures and a data set for multi-target, multi-camera tracking. European Conference on Computer Vision (2016)
  • A. Schumann et al. Person re-identification by deep learning attribute-complementary information. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2017)
Hefei Ling obtained the B.S., M.S. and Ph.D. degrees from Huazhong University of Science and Technology, China in 1999, 2002 and 2005, respectively. He is currently a professor in the School of Computer Science and Technology, HUST. Prof. Ling was a visiting professor at University College London from 2008 to 2009. He has published more than 100 papers. He now serves as director of the Digital Media and Intelligent Technology Research Institute.

Lei Wu received the B.E. degree in Information and Computing Science from Wuhan University of Science and Technology, Wuhan, China in 2016. He is currently pursuing the Ph.D. degree at Huazhong University of Science and Technology, Wuhan, China. His research interests include information retrieval and non-convex optimization.

Jialie Shen is a Reader in Computer Science with the School of Electronics, Electrical Engineering and Computer Science, Queen’s University of Belfast, UK. His main research interests include information retrieval, video analytics and machine learning.

Ping Li is a lecturer in the School of Computer Science and Technology, Huazhong University of Science and Technology (HUST). He received his Ph.D. degree in Computer Application from HUST in 2009. His research interests include multimedia security, image retrieval and machine learning.
