Person Re-identification with Global-Local Background_bias Net
Introduction
Person Re-identification (Re-ID) is a technique that uses computer vision to predict whether a given query image and other images captured by different cameras in the same database belong to the same person [1]. The appearance of a person image can vary greatly due to factors such as camera viewpoint, body pose, illumination, occlusion, low image resolution, and background clutter, which poses a huge challenge for person Re-ID [2], [3].
In the early days, most methods directly learned features of the whole image as a single data sample for person Re-ID, so that every pixel contributed equally to the decision. However, experiments have shown that when calculating the similarity between images, the pixel values of the background area have a non-negligible effect. Several pedestrian images from the Market1501 dataset are shown in Fig. 1. We can observe that the background deviation between images of the same person can be large, while that between different persons can be small. This leads to high similarity between different persons with similar backgrounds and noticeable differences between images of the same person with different backgrounds, so the image similarity learned by the network carries a certain bias. To deal with this problem, we use a foreground segmentation network to create a new dataset that removes background information for person Re-ID. As the second row of Fig. 1 shows, ignoring the background brings the features of the same pedestrian closer together and pushes those of different pedestrians farther apart.
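The paper does not give code for this background-removal step; the following is a minimal sketch of the idea, assuming the foreground segmentation network outputs a binary person mask (the function name and mask convention are ours, not the paper's):

```python
import numpy as np

def remove_background(image, mask):
    """Zero out background pixels of an H x W x 3 image given a binary
    H x W foreground mask (1 = person, 0 = background)."""
    return image * mask[:, :, np.newaxis]

# Toy example: a 4x4 "image" with a 2x2 foreground region.
img = np.ones((4, 4, 3), dtype=np.float32)
mask = np.zeros((4, 4), dtype=np.float32)
mask[1:3, 1:3] = 1.0
fg = remove_background(img, mask)  # only the 2x2 region keeps its values
```

In practice the mask would come from a trained segmentation model rather than being hand-built as in this toy example.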
Most previous person Re-ID methods, besides considering the influence of background bias on global features, also pay attention to the effect of local features on metric learning between images. Many methods use convolutional neural networks (CNNs) to learn global feature representations in an end-to-end manner [2], while neglecting spatial structure. The main drawbacks of using only global features are as follows:
1) Different camera angles cause misalignment of pedestrian images. As shown in Fig. 2(a), because the camera angles are different, the same body part of the same pedestrian appears at different positions in the image, so the phenomenon of misalignment occurs in various parts of the body.
2) Inaccurate bounding boxes result in a lack of emphasis on local differences. In Fig. 2(b), the left image lacks body parts due to the inaccurate pedestrian bounding boxes, which makes it difficult to learn metrics between images.
3) Occlusion by obstacles introduces deviations in the feature distance between images of the same pedestrian. In Fig. 2(c), pedestrians are easily obscured by moving obstacles (such as cars or other persons) or stationary obstacles (such as trees or guardrails), so only part of the body appears in the pedestrian image.
Recently, many scholars have recognized the importance of local features. The Siamese Long Short-Term Memory architecture proposed in [4] processes image regions sequentially and enhances the discriminative capability of local feature representations by leveraging contextual information. It first divides the image vertically into several parts, then feeds the divided images into the network in order. Although this improves the recognition ability of local feature representations, simply dividing the whole body into several fixed parts does not consider whether the parts are aligned, which may introduce differences among images of the same pedestrian. To address misalignment, the pose invariant embedding (PIE) proposed by Zheng et al. [5] serves as a pedestrian descriptor: a PoseBox with 14 joints is detected with the convolutional pose machine [6], and corresponding keypoints are aligned by affine transformations. The Global-Local-Alignment Descriptor (GLAD) [7] handles pedestrian posture changes by extracting keypoints and dividing the body into blocks, and finally fuses global and local features. These local feature alignment methods all require additional skeleton keypoints or a pose estimation model to locate the body parts [5], [7], [8].
To locate body parts more accurately and partition them more finely, we adopt the Human Pose Estimator (HPE) [9] to estimate 18 joints. As shown in Fig. 2, the foreground images in the second row are segmented into three body parts based on these 18 joints. On the one hand, local part features remain informative when obstacles occlude the body or the bounding boxes are inaccurate; on the other hand, for misaligned images, the keypoints can be used for alignment in the Partial Block. Combining global and local information is therefore crucial for person Re-ID. For local alignment, some methods use strip or grid segmentation to reduce the effect of partial offset [4], [10]. However, partial segmentation alone is not enough: due to occlusion and inaccurate bounding boxes, missing body parts in person images are inevitable, which naturally leads to large comparison errors in the network. As shown in Fig. 2(b) and (c), even after joint-based segmentation and alignment, parts can still be missing. To handle missing parts, we assign different weight coefficients in the loss function according to which local parts of an occluded pedestrian are visible. To combine local and global information, we use the joint-alignment information to drive the network to learn aligned features from the original image.
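As an illustration of the part-division and loss-weighting idea described above, the following sketch partitions a pedestrian image into three horizontal body bands from the 18 estimated joints and down-weights parts with few visible joints. The joint grouping, part names, and weighting rule are our assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def partition_by_joints(image, joints, visible):
    """Split a pedestrian image into head / torso / legs horizontal bands
    using the y-coordinates of detected joints. `joints` is an (18, 2)
    array of (x, y) positions; `visible` flags joints actually detected."""
    # Hypothetical grouping of the 18 HPE joints into three body regions.
    groups = {"head": range(0, 5), "torso": range(5, 12), "legs": range(12, 18)}
    parts, weights = {}, {}
    for name, idx_range in groups.items():
        idx = [i for i in idx_range if visible[i]]
        if not idx:                       # part fully occluded or cropped out
            parts[name], weights[name] = None, 0.0
            continue
        ys = joints[idx, 1]
        top, bottom = int(ys.min()), int(ys.max()) + 1
        parts[name] = image[top:bottom]
        # Down-weight parts with few visible joints in the loss.
        weights[name] = len(idx) / len(idx_range)
    return parts, weights
```

A fully occluded part receives weight 0.0 and is excluded from the metric comparison, which mirrors the paper's idea of setting different loss weights for missing components.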
Background bias poses the challenge that different pedestrians with similar backgrounds appear similar, while the same pedestrian against different backgrounds appears different. To address it, the proposed GLBN model uses a foreground segmentation network to remove the background in person Re-ID, which alleviates the background bias problem and makes the model robust to background changes. Meanwhile, a deep recognition framework based on joint alignment is proposed and applied to the foreground segmentation, which improves person Re-ID performance.
In summary, we propose a network architecture named Global-Local Background_bias Net (GLBN), which consists of a joint-aligned local network named Foreground Partial Segmentation Net (FPSN), a global network named Global Aligned Supervision Net (GASN), and a constrained background network named Background_bias Constraint Net (BCN). Firstly, FPSN not only learns local features but also acts as a regulator guiding the global network to learn semantically aligned features. Secondly, GASN supervises the fusion of global, local, foreground, and panorama features, which supervises the learning of local features and promotes joint optimization of the entire network. Finally, BCN constrains the relationships among foreground, background, and panorama features, shortening the distance between foreground and panorama features while widening the distance between background and panorama features.
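The BCN constraint described above can be sketched as a triplet-style margin term that pulls foreground features toward panorama features and pushes background features away. This is our illustrative reading of the constraint; the exact loss in the paper may differ, and the margin value here is arbitrary:

```python
import numpy as np

def bcn_constraint(f_fg, f_bg, f_pan, margin=0.3):
    """Triplet-style sketch of the Background_bias Constraint idea:
    the panorama (whole-image) feature should lie closer to the
    foreground feature than to the background feature, by a margin."""
    d_fg = np.linalg.norm(f_pan - f_fg)   # should be small
    d_bg = np.linalg.norm(f_pan - f_bg)   # should be large
    return max(0.0, d_fg - d_bg + margin)
```

When the background feature is already far from the panorama feature the term vanishes; when foreground and background features are equally close, the loss equals the margin and gradients would push them apart in training.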
The rest of this article is organized as follows. The second section discusses work related to our approach. The third section presents the overall framework of the network and its main components. The fourth section reports results from several experiments and analyzes their validity. The final section summarizes the paper and discusses future work.
Related work
Most person Re-ID approaches focus on retrieving a person-of-interest from a database of collected individual images. In fact, in cross-camera surveillance applications, besides the individual Re-ID task, there is another kind of person Re-ID task that matches a group of persons across different camera views [11]. The main task of this article is individual person Re-ID, dealing with the challenges brought by appearance changes of individuals.
Person Re-ID is to compare the
Proposed method
We use ResNet [17] as the backbone to extract image features. Using global information to supervise local information and local information to guide global information, we propose the Global-Local Background_bias Net (GLBN), a network consisting of three parts, depicted in Fig. 3: Foreground Partial Segmentation (FPS), which contains three sub-networks for learning pedestrian part information, Global Aligned Supervision (GAS) to obtain
Experiment
In this section, we mainly focus on six aspects below:
(1) Person Re-ID datasets and evaluation protocol; (2) experimental implementation details; (3) verification of model validity; (4) robustness analysis with adaptive weights; (5) comparison with the latest methods; (6) in-depth analysis of the method.
Conclusion
Problems of person Re-ID due to the pose change, alignment, occlusion, and background clutters may have a great influence on recognition accuracy. This requires the algorithm to extract useful foreground information and match its local features so that the local features and the global features together form the feature representation of the pedestrian image to obtain a better recognition effect. In this paper, we introduce the noise-weakened person Re-ID model with adaptive partial
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
We express our sincere thanks to the anonymous reviewers for their helpful comments and suggestions to raise the standard of our paper.
Compliance with ethical standards
Funding: This study was funded by the National Natural Science Foundation of China (grant number 61672202) and State Key Program of NSFC-Shenzhen Joint Foundation (grant number U1613217).
References (52)
- et al., Deepreid: Deep filter pairing neural network for person re-identification, in: Computer Vision and Pattern Recognition (CVPR), 2014.
- et al., Person re-identification by local maximal occurrence representation and metric learning, in: Computer Vision and Pattern Recognition (CVPR), 2015.
- et al., Person re-identification by deep learning multi-scale representations, in: ICCV W. on Cross-domain Human Identification, 2016.
- et al., Scalable person re-identification: A benchmark.
- et al., A siamese long short-term memory architecture for human re-identification.
- L. Zheng, Y. Huang, H. Lu, Y. Yang, Pose invariant embedding for deep person re-identification, arXiv preprint...
- S.-E. Wei, V. Ramakrishna, T. Kanade, Y. Sheikh, Convolutional pose machines, arXiv preprint arXiv:1602.00134, ...
- et al., GLAD: Global-local-alignment descriptor for pedestrian retrieval, in: ACM MM, 2017.
- et al., Spindle net: Person re-identification with human body region guided feature decomposition and fusion.
- Z. Cao, T. Simon, S.-E. Wei, Y. Sheikh, Realtime multiperson 2d pose estimation using part affinity fields, in: ...
- Deep metric learning for practical person re-identification, in: Proceedings of International Conference on Pattern Recognition (ICPR).
- Group re-identification with multi-grained matching and integration, IEEE Transactions on Cybernetics.
- Large scale metric learning from equivalence constraints.
- A two stream Siamese convolutional neural network for person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- End-to-end comparative attention networks for person re-identification, IEEE Trans. Image Process.
- Deep siamese network with multi-level similarity perception for person re-identification.
- Learning correspondence structures for person re-identification, IEEE Trans. Image Process.
- Active contours without edges, IEEE Trans. Image Process.
- Foreground object segmentation from dense multi-view images, J. Comput.-Aided Design & Comput. Graphics.
- Visual localization and segmentation based on foreground/background modeling, in: IEEE International Conference on Acoustics, Speech & Signal Processing (ICASSP).