Person Re-identification with Global-Local Background_bias Net

https://doi.org/10.1016/j.jvcir.2020.102961

Highlights

  • A novel local segmentation network, FPSN, reduces the interference of background features in metric learning.

  • To obtain more accurate and complete feature information, GASN combines foreground features, panoramic features, local features, and global features.

  • The BCN constrains the relationship between foreground features, background features, and panorama features.

Abstract

Person Re-identification (Re-ID) is an important technique in intelligent video surveillance. Variations in camera viewpoint and body pose cause problems such as body misalignment, diverse background clutter, and partial body occlusion. To address these problems, we propose the Global-Local Background_bias Net (GLBN), a novel network architecture consisting of three modules: Foreground Partial Segmentation Net (FPSN), Global Aligned Supervision Net (GASN) and Background_bias Constraint Net (BCN). Firstly, to enhance the adaptability of foreground features and reduce background interference, FPSN performs local segmentation on the foreground image. Secondly, global features generated by GASN are used to supervise the learning of local features. Finally, BCN constrains the background information to further reduce its impact. Extensive experiments on the mainstream evaluation datasets Market1501, DukeMTMC-reID and CUHK03 indicate that our method is efficient and robust.

Introduction

Person Re-identification (Re-ID) is a technique that uses computer vision technology to predict whether a given query image and other images from different cameras in the same database belong to the same person [1]. The appearance of a person image can vary greatly with factors such as camera viewpoint, body pose, illumination, occlusion, low image resolution, and background clutter, which poses a huge challenge for person Re-ID [2], [3].

In the early days, most methods directly learned the features of the whole image as a single data sample in person Re-ID, so that each pixel of the image had the same decision-making effect. However, experiments have shown that when calculating the similarity between images, the pixel values of the background area have a non-negligible effect. Several pedestrian images from the Market1501 dataset are shown in Fig. 1. We observe that backgrounds of the same person can differ greatly, while backgrounds of different persons can be quite similar. This leads to high similarity between different persons with similar backgrounds and noticeable differences between images of the same person with different backgrounds, so the image similarity learned by the network acquires a certain bias. To deal with this problem, we use a foreground segmentation network to create a new person Re-ID dataset with the background information removed. As the second row of Fig. 1 shows, the foreground segmentation method ignores the background, bringing the feature distances between images of the same pedestrian closer and pushing those of different pedestrians farther apart.
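As a minimal illustration of this preprocessing step (a sketch only, not the paper's actual segmentation network), a binary foreground mask produced by any segmentation model can be used to zero out background pixels; the function and parameter names here are hypothetical:

```python
def remove_background(image, mask):
    """Zero out pixels outside the foreground mask.

    image: H x W x C nested lists of pixel values.
    mask:  H x W list of 0/1 flags (1 = foreground).
    Returns a new image in which background pixels are set to 0.
    """
    return [
        [
            [channel if mask[y][x] else 0 for channel in image[y][x]]
            for x in range(len(image[y]))
        ]
        for y in range(len(image))
    ]
```

With the background zeroed, two images of the same person with different backgrounds share more of their pixel statistics, which is exactly the effect Fig. 1 illustrates.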

In most previous person Re-ID methods, in addition to the influence of background bias on global feature information, attention has also been paid to the effect of local features on metric learning between different images. Many methods use convolutional neural networks (CNNs) to learn global feature representations in an end-to-end manner [2], while neglecting spatial structure. The main drawbacks of using only global features are as follows:

1) Different camera angles cause misalignment of pedestrian images. As shown in Fig. 2(a), because the camera angles are different, the same body part of the same pedestrian appears at different positions in the image, so the phenomenon of misalignment occurs in various parts of the body.

2) Inaccurate bounding boxes result in a lack of emphasis on local differences. In Fig. 2(b), the left image lacks body parts due to the inaccurate pedestrian bounding boxes, which makes it difficult to learn metrics between images.

3) Occlusion by obstacles introduces deviations in the feature distance between images of the same pedestrian. In Fig. 2(c), pedestrians are easily obscured by moving obstacles (such as cars or other persons) or stationary obstacles (such as trees or guardrails), so that only part of the body is visible in the pedestrian image.

Recently, many scholars have realized the importance of local features. The Siamese Long Short-Term Memory architecture proposed in [4] processes image regions sequentially and enhances the discriminative capability of local feature representation by leveraging contextual information: the image is first divided vertically into several parts, and the divided parts are then fed into the network in order. Although this method improves the recognition ability of local feature representation, simply dividing the whole body into several fixed parts does not consider whether the parts are aligned, which may introduce differences between images of the same pedestrian. To handle misalignment, the pose invariant embedding (PIE) proposed by Zheng et al. [5] serves as a pedestrian descriptor: a PoseBox with 14 joints is detected with the convolutional pose machine [6], and corresponding keypoints are aligned by affine transformations. The Global-Local-Alignment Descriptor (GLAD) [7] addresses pedestrian posture changes by extracting keypoints and dividing the body into blocks, and finally fuses global and local features. These local feature alignment methods all require additional skeleton keypoints or a pose estimation model to locate the body parts [5], [7], [8].
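PoseBox-style alignment maps detected joints to canonical positions via an affine transform. A minimal pure-Python sketch (hypothetical helper names; real pipelines estimate the transform from all matched joints by least squares) solves for the affine parameters exactly from three keypoint correspondences:

```python
def affine_from_3pts(src, dst):
    """Estimate the 2D affine transform mapping three src keypoints to dst.

    src, dst: lists of three (x, y) tuples.
    Returns (a, b, tx, c, d, ty) such that u = a*x + b*y + tx
    and v = c*x + d*y + ty.
    """
    def solve3(rows, rhs):
        # Cramer's rule for a 3x3 linear system.
        def det(m):
            return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                    - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                    + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
        d0 = det(rows)
        sol = []
        for col in range(3):
            m = [list(r) for r in rows]
            for i in range(3):
                m[i][col] = rhs[i]
            sol.append(det(m) / d0)
        return sol

    rows = [[x, y, 1.0] for x, y in src]
    a, b, tx = solve3(rows, [u for u, _ in dst])
    c, d, ty = solve3(rows, [v for _, v in dst])
    return a, b, tx, c, d, ty

def apply_affine(point, t):
    """Apply the affine transform t to a single (x, y) point."""
    a, b, tx, c, d, ty = t
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)
```

Applying the estimated transform to every pixel (or to the remaining joints) warps the body into a canonical pose so that the same body part occupies the same position across images.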

To locate body parts more accurately and divide them more finely, we adopt the Human Pose Estimator (HPE) [9] to estimate 18 joints. As shown in Fig. 2, the foreground images in the second row are segmented into three body parts based on these 18 joints. On the one hand, local features mitigate the effect of obstacle occlusion and inaccurate bounding boxes; on the other hand, keypoints can be used in the Partial Block to align misaligned images. Therefore, combining the global and the local is crucial for person Re-ID. To align locally, some methods use strip or grid segmentation to reduce the effect of partial offset [4], [10]. However, partial segmentation alone is not enough: due to occlusion and inaccurate bounding boxes, missing body parts in person images are inevitable, which naturally leads to a large comparison error in the network. As shown in Fig. 2(b) and (c), even after joint-based local segmentation and alignment, parts may still be missing. According to the occlusion mechanism, and combining the local component features of occluded pedestrians, we assign different weight coefficients in the loss function to handle missing components. To combine local and global cues, we use the joint-alignment information to drive the network to learn aligned features from the original image.
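The occlusion-aware weighting described above can be sketched as follows: per-part distances are weighted by visibility flags, so parts missing in either image do not contribute to the comparison. The function and parameter names are illustrative, not the paper's exact loss:

```python
def part_distance(parts_a, parts_b, visible_a, visible_b):
    """Visibility-weighted distance over body-part features.

    parts_a, parts_b: equal-length lists of per-part feature vectors.
    visible_a, visible_b: per-part 0/1 visibility flags.
    A part contributes only when visible in both images; remaining
    weights are renormalized so the distance stays comparable.
    """
    def euclid(u, v):
        return sum((ui - vi) ** 2 for ui, vi in zip(u, v)) ** 0.5

    weights = [va * vb for va, vb in zip(visible_a, visible_b)]
    total = sum(weights)
    if total == 0:
        # No shared visible parts; a caller would fall back to global features.
        return 0.0
    return sum(w * euclid(u, v)
               for w, u, v in zip(weights, parts_a, parts_b)) / total
```

Renormalizing by the surviving weights keeps distances comparable between fully visible pairs and partially occluded ones.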

Background bias raises the following challenge: different pedestrians may appear similar while images of the same pedestrian may appear different. The proposed GLBN model uses the foreground segmentation network to remove the background in person Re-ID, which alleviates the background bias problem and makes the model robust to background changes. Meanwhile, a deep recognition framework based on joint alignment is proposed and applied to foreground segmentation, which improves person Re-ID performance.

In summary, we propose a network architecture named Global-Local Background_bias Net (GLBN), which consists of a joint-aligned local network named Foreground Partial Segmentation Net (FPSN), a global network named Global Aligned Supervision Net (GASN), and a constrained background network named Background_bias Constraint Net (BCN). Firstly, FPSN not only learns network features but also acts as a regulator to guide the global network to learn semantically aligned features. Secondly, GASN adds a supervising function for the fusion of global, local, foreground and panorama features, supervising the learning of local features and promoting joint optimization of the entire network. Finally, BCN restricts the relationship between foreground, background, and panorama features, shortening the distance between foreground features and panorama features while widening the distance between background features and panorama features.
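The BCN constraint on foreground, background, and panorama features can be illustrated with a triplet-style hinge loss: the panorama feature should lie closer to the foreground feature than to the background feature by at least a margin. This is one plausible formulation for illustration; the paper's exact loss may differ:

```python
def background_bias_loss(f_fg, f_bg, f_pan, margin=1.0):
    """Hinge-style constraint in the spirit of BCN (illustrative).

    f_fg, f_bg, f_pan: foreground, background, and panorama feature
    vectors for the same image. The loss is zero once the panorama
    feature is at least `margin` closer to the foreground feature
    than to the background feature.
    """
    def euclid(u, v):
        return sum((ui - vi) ** 2 for ui, vi in zip(u, v)) ** 0.5

    return max(0.0, euclid(f_fg, f_pan) - euclid(f_bg, f_pan) + margin)
```

Minimizing this term pulls panorama features toward foreground features and pushes them away from background features, which is the stated goal of BCN.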

The rest of this article is laid out as follows. The second section discusses work related to our approach. The third section presents the overall framework of the network and its main components. The fourth section introduces the results of several experiments and analyzes their validity. The final section summarizes the paper and discusses future work.

Section snippets

Related work

Most person Re-ID approaches focus on retrieving a person-of-interest from a database of collected individual images. In fact, in cross-camera surveillance applications, in addition to the individual Re-ID task, there is another kind of person Re-ID task that matches a group of persons across different camera views [11]. The main task of this article is to implement the individual person Re-ID to deal with obstacles brought by the appearance changes of individuals.

Person Re-ID is to compare the

Proposed method

We use ResNet [17] as the backbone to extract image features. Using global information to supervise local information and local information to guide global information, we propose the Global-Local Background_bias Net (GLBN), a network consisting of three parts, depicted in Fig. 3. The three parts are Foreground Partial Segmentation (FPS), which contains three sub-networks for learning pedestrian part information, Global Aligned Supervision (GAS) to obtain

Experiment

In this section, we mainly focus on six aspects below:

(1) Person Re-ID datasets and evaluation protocol; (2) experimental implementation details; (3) verification of model validity; (4) robustness analysis with adaptive weights; (5) comparison with the latest methods; (6) in-depth analysis of the method.

Conclusion

Pose changes, misalignment, occlusion, and background clutter can greatly influence person Re-ID recognition accuracy. The algorithm must therefore extract useful foreground information and match its local features, so that local and global features together form the feature representation of the pedestrian image and yield a better recognition effect. In this paper, we introduce the noise-weakened person Re-ID model with adaptive partial

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

We express our sincere thanks to the anonymous reviewers for their helpful comments and suggestions to raise the standard of our paper.

Compliance with ethical standards

Funding: This study was funded by the National Natural Science Foundation of China (grant number 61672202) and State Key Program of NSFC-Shenzhen Joint Foundation (grant number U1613217).

References (52)

  • D. Yi et al., Deep metric learning for practical person re-identification, in: Proceedings of the International Conference on Pattern Recognition (ICPR), 2014.
  • W. Lin et al., Group re-identification with multi-grained matching and integration, IEEE Transactions on Cybernetics, 2019.
  • M. Koestinger et al., Large scale metric learning from equivalence constraints.
  • D. Chung et al., A two stream Siamese convolutional neural network for person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • H. Liu et al., End-to-end comparative attention networks for person re-identification, IEEE Trans. Image Process., 2017.
  • W. Chen, X. Chen, J. Zhang, K. Huang, Beyond triplet loss: a deep quadruplet network for person re-identification, ...
  • C. Shen et al., Deep siamese network with multi-level similarity perception for person re-identification.
  • E. Ristani, C. Tomasi, Features for multi-target multi-camera tracking and re-identification, in: Proc. IEEE CVPR, Jun. ...
  • K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference ...
  • W. Lin et al., Learning correspondence structures for person re-identification, IEEE Trans. Image Process., 2017.
  • T.F. Chan et al., Active contours without edges, IEEE Transactions on Image Processing, 2001.
  • S. Lankton, D. Nain, A. Yezzi, Hybrid geodesic region-based curve evolutions for image segmentation, in: Proceedings of ...
  • L. Fan et al., Foreground object segmentation from dense multi-view images, J. Comput.-Aided Design & Comput. Graphics, 2009.
  • H. Wang et al., Visual localization and segmentation based on foreground/background modeling, in: IEEE International Conference on Acoustics, Speech & Signal Processing, 2010.
  • D. Tang, H. Fu, X. Cao, Topology preserved regular superpixel, in: Proc. IEEE Int. Conf. Multimedia Expo (ICME), ...
  • H. Fu, X. Cao, Z. Tu, D. Lin, Symmetry constraint for foreground extraction, IEEE Transactions on Cybernetics, ...