Masking for better discovery: Weakly supervised complementary body regions mining for person re-identification

https://doi.org/10.1016/j.eswa.2022.116636

Highlights

  • Weakly supervised data augmentation based network for fine-grained person re-identification.

  • Modeling the spatial relationship of a fine-grained region for person re-identification.

  • Discovering more complementary feature representation in a weakly supervised fashion.

Abstract

Person re-identification still faces several challenges related to factors such as complex poses, occlusion, misalignment and poor detection. Recent works in the literature focus on extracting local information from the human body, but most of them rely on full supervision during training.

In this paper, we propose a new end-to-end trainable neural network, named Attention Dropping Network (ADN), which discovers diverse and rich visual cues without extra human semantic parsing. ADN aims to find fine-grained local information to address the common person re-identification challenges. Concretely, our network consists of two branches: the Attention Global Branch learns pixel-level local regions based on a fine-grained attention mechanism, while the Feature Dropping Branch learns additional, otherwise missed features in a weakly supervised manner. The fine-grained attention mechanism allows our model to be robust to complex pose variations and to ignore redundant background. Extensive experiments on three benchmark datasets (Market-1501, DukeMTMC-reID, and CUHK03) demonstrate the effectiveness and robustness of the proposed network in handling complex poses, misalignment and occlusion.

Introduction

Person re-identification (Re-ID) aims to find a specific pedestrian across non-overlapping surveillance cameras deployed at different locations (Gong et al., 2014). It has very important applications in the field of video surveillance (Bialkowski et al., 2012). Nonetheless, person Re-ID still faces many challenges in accurately differentiating specific targets across different surveillance scenarios. Pedestrian images often suffer from background clutter (Ghorbel et al., 2019, Tian et al., 2018), illumination variation (Huang et al., 2019), strong occlusion (Huang et al., 2018) and unconstrained poses (Cho & Yoon, 2016). All of these problems can affect the performance of person Re-ID. In fact, images that belong to the same person are usually captured under significantly different viewpoints and poses, which makes the intra-class variance high (Wang et al., 2015). In addition, images that belong to different classes can be very similar apart from some minor differences, making the inter-class variance very low. The main challenge of any person Re-ID system is therefore to build a rich and suitable feature representation that is robust to all the problems cited above and overcomes inter-class confusion and intra-class variation. Apart from the difficulty of extracting features, the small sample size of some person Re-ID databases can be an obstacle to learning a good model of each identity’s intra-class variability (Gong et al., 2014).

In the literature, several works have been devoted to dealing with these specific challenges. In recent years, convolutional neural networks (CNNs) have been widely used to extract multi-level features, which yields a more relevant representation of each image. Wang et al. (2018) showed that deep learning (CNN-based) methods achieve higher recognition rates than traditional methods. Some existing person Re-ID methods jointly learn discriminative features and a similarity measure (Ahmed et al., 2015, Hermans et al., 2017, Zheng et al., 2017). Several recent works have shown that it is crucial to embed local information from person images to deal with large appearance variations (Huang et al., 2017, Li et al., 2017, Yao et al., 2019). Similarly, others rely on additional spatial information to mitigate background clutter and arbitrary misalignment (Wei et al., 2018, Zheng et al., 2019). Semantic segmentation has also been used to guide a model to extract features from specific regions of the image (Gao et al., 2019, Ghorbel et al., 2019, Ghorbel et al., 2020, Song et al., 2018), but such methods usually require large labeled datasets and incur considerable additional computational cost. Recently, several works have adopted attention mechanisms, which provide an effective way to enhance Re-ID performance (Jiang et al., 2020, Li et al., 2019, Li, Zhu et al., 2018, Si et al., 2018, Xu et al., 2018). However, most of these works overlook the fine-grained details that may contain important distinguishing features. In the same direction, weakly supervised learning (Cao et al., 2015, Zhang et al., 2018), which finds the most discriminative region of an image using only image-level labels, has gained much popularity as a solution to the scarcity of labeled data in computer vision. Such methods have proven effective for object localization (Gao et al., 2018, Zhang, Yang et al., 2018), but they only cover the most discriminative part of an object rather than the entire object region. Other works have been devoted to designing re-ranking methods (Mansouri et al., 2021, Zhong et al., 2017) that enhance person Re-ID performance by re-sorting the ranking list of any baseline method.

In this work, we propose the Attention Dropping Network (ADN), a weakly supervised method that discovers diverse and rich visual cues without extra human semantic parsing. ADN extracts complementary visual features that can enhance the traditional global representation for person re-identification. Fig. 1 shows two images of the same person with different poses and backgrounds, and illustrates how ADN overcomes the specific person Re-ID challenges by focusing on the most discriminative regions of the image. In addition, our proposed network focuses on finer details of the person image and thus extracts more discriminative features. Our method learns pixel-level local regions based on a fine-grained attention mechanism that extracts more useful fine-grained foreground features and reduces the impact of redundant background, which helps distinguish persons dressed in very similar clothes. The Attention Dropping Network is a two-branch network consisting of an Attention Global Branch (AGB) and a Feature Dropping Branch (FDB), where the attention-based dropping data augmentation is applied. The AGB extracts fine-grained features based on M local attention maps. In this branch, a bilinear pooling module models the local pairwise interactions between the attention maps and every feature channel while preserving the spatial relationship. Since the attention mechanism focuses on the most relevant features, the AGB may skip less salient features that could still contribute to the final prediction. To remedy this problem, we introduce a second branch, the FDB, through which additional detailed local features are learned in a weakly supervised manner. Explicitly, in the FDB we randomly select one attention map and erase from the image the most discriminative region according to this map. This operation is applied during the training phase to encourage the network to learn from the rest of the image, which may contain other useful features ignored by the first branch.
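The following is a minimal PyTorch sketch of this attention-guided dropping. It is illustrative only: the function name `attention_drop` and the parameter `drop_threshold` are assumptions, not the authors' implementation.

```python
# Minimal sketch of attention-based dropping (illustrative, not the authors' code).
import torch


def attention_drop(images, attention_maps, drop_threshold=0.5):
    """Erase the most discriminative region indicated by one randomly chosen attention map.

    images:         (B, 3, H, W) input batch
    attention_maps: (B, M, h, w) the M local attention maps of the AGB
    """
    B, M, _, _ = attention_maps.shape
    H, W = images.shape[2:]

    # Randomly pick one of the M attention maps per image.
    idx = torch.randint(0, M, (B,), device=images.device)
    selected = attention_maps[torch.arange(B), idx].unsqueeze(1)   # (B, 1, h, w)

    # Upsample to image resolution and normalize each map to [0, 1].
    selected = torch.nn.functional.interpolate(
        selected, size=(H, W), mode="bilinear", align_corners=False)
    selected = selected / (selected.amax(dim=(2, 3), keepdim=True) + 1e-6)

    # Keep only the pixels where the attention is BELOW the threshold,
    # i.e. erase the most discriminative region from the image.
    drop_mask = (selected < drop_threshold).float()
    return images * drop_mask
```

Applying this mask only during training forces the network to mine complementary regions instead of relying solely on the single most discriminative one.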

The main contributions of this paper are as follows:

(1) Modeling the spatial relationship of a fine-grained region through a bilinear pooling operation between the most discriminative region’s attention map and the feature maps extracted from the whole image (see the sketch after this list).

(2) Discovering a more complementary feature representation through the Feature Dropping Branch, an attentive feature learning module operating in a weakly supervised fashion.

(3) Our proposed network is trained in an end-to-end fashion and achieves state-of-the-art performance on three public person re-identification datasets.
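As a companion to contribution (1), the following is a minimal sketch of bilinear pooling between the M attention maps and the backbone feature maps. The function name and tensor shapes are illustrative assumptions about how such a pooling can be realized, not the authors' exact module.

```python
# Minimal sketch of bilinear attention pooling (illustrative assumptions).
import torch


def bilinear_attention_pooling(feature_maps, attention_maps):
    """feature_maps:   (B, C, h, w) backbone features of the whole image
       attention_maps: (B, M, h, w) local attention maps
       returns:        (B, M, C) one pooled part descriptor per attention map
    """
    # Pairwise interaction: every attention map weights every feature channel
    # at every spatial location, so the spatial relationship is preserved
    # before pooling.
    part_features = torch.einsum("bmhw,bchw->bmc", attention_maps, feature_maps)
    # Average over the spatial grid (global average pooling of the product).
    part_features = part_features / (feature_maps.shape[2] * feature_maps.shape[3])
    return part_features
```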

Section snippets

Related work

In this section, we present a review of person re-identification and fine-grained classification methods.

Proposed Attention Dropping Network (ADN)

The overall network architecture of the proposed Attention Dropping Network (ADN) is illustrated in Fig. 3. The proposed ADN model consists of a basic feature extractor backbone followed by two branches: an Attention Global Branch (AGB) and a Feature Dropping Branch (FDB) where an attention-based dropping data augmentation is applied. The AGB aims to extract the global feature representations. Generally, the images used for person Re-ID do not provide labels for every pixel in the image which
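To make the two-branch structure concrete, the following is a minimal sketch of a forward pass that reuses the two helpers sketched earlier. The ResNet-50 backbone, the 1x1 convolutional attention head and the classifier layout are assumptions chosen for illustration, not the authors' exact architecture.

```python
# Minimal sketch of a two-branch (AGB + FDB) forward pass (illustrative only).
import torch
import torchvision


class ADNSketch(torch.nn.Module):
    def __init__(self, num_attention_maps=32, num_classes=751):
        super().__init__()
        # Assumed backbone: ResNet-50 without its average pooling and FC head
        # (torchvision >= 0.13 API).
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = torch.nn.Sequential(*list(backbone.children())[:-2])
        # 1x1 convolution producing the M local attention maps (an assumption).
        self.attention = torch.nn.Conv2d(2048, num_attention_maps, kernel_size=1)
        # num_classes = number of training identities (e.g. 751 on Market-1501).
        self.classifier = torch.nn.Linear(num_attention_maps * 2048, num_classes)

    def forward(self, images):
        feats = self.backbone(images)                       # (B, 2048, h, w)
        attn = torch.relu(self.attention(feats))            # (B, M, h, w)

        # Attention Global Branch: bilinear pooling of attention and features.
        parts = bilinear_attention_pooling(feats, attn)     # (B, M, 2048)
        logits_agb = self.classifier(parts.flatten(1))

        if not self.training:
            return logits_agb

        # Feature Dropping Branch: re-run on images whose most discriminative
        # region has been erased, so complementary cues must be learned.
        # Detaching the attention maps treats the dropping as pure data
        # augmentation (a design choice of this sketch).
        dropped = attention_drop(images, attn.detach())
        feats_d = self.backbone(dropped)
        attn_d = torch.relu(self.attention(feats_d))
        parts_d = bilinear_attention_pooling(feats_d, attn_d)
        logits_fdb = self.classifier(parts_d.flatten(1))
        return logits_agb, logits_fdb
```

In training mode both branches return identity logits, so classification losses can be applied to each branch; at test time only the AGB output is used in this sketch.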

Experiments

In this section, we empirically evaluate the proposed Attention Dropping Network (ADN) on three benchmark datasets: Market-1501 (Zheng et al., 2015), DukeMTMC-reID (Ristani et al., 2016) and CUHK03 (Li et al., 2014).

Market-1501 (Zheng et al., 2015) is composed of 32,668 pedestrian images of 1501 different identities. It is partitioned into 751 identities for the training stage and 750 for the testing stage. On average, there are 3.6 images for each person under each camera. Images are
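For reference, the following is a minimal sketch of the rank-1 evaluation commonly used on these benchmarks, based on a query-to-gallery cosine ranking. It is a simplified, assumed protocol: the same-camera and junk-image filtering of the official Market-1501 evaluation is omitted, and this is not the authors' evaluation code.

```python
# Minimal sketch of rank-1 accuracy from a cosine-distance ranking
# (simplified; omits the official same-camera/junk filtering).
import torch


def rank1_accuracy(query_feats, query_ids, gallery_feats, gallery_ids):
    """All features are L2-normalized (N, D) tensors; ids are (N,) integer tensors."""
    sims = query_feats @ gallery_feats.t()                  # (Nq, Ng) cosine similarities
    best = sims.argmax(dim=1)                               # index of the top-1 gallery match
    return (gallery_ids[best] == query_ids).float().mean().item()
```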

Conclusion

We proposed in this paper a novel end-to-end trainable person Re-ID model, called ADN, that aims to discover discriminative body regions in a weakly supervised fashion without extra human semantic parsing. Our aim is to focus on fine-grained features while remaining robust to background clutter and pose variations. The proposed network consists of two branches. On the one hand, the Attention Global Branch leverages both an attention mechanism and bilinear pooling to find fine-grained local information in a robust way

CRediT authorship contribution statement

Mahmoud Ghorbel: Conception and design of study, Acquisition of data, Analysis and/or interpretation of data, Writing – original draft, Writing – review & editing. Sourour Ammar: Conception and design of study, Analysis and/or interpretation of data, Writing – original draft, Writing – review & editing. Yousri Kessentini: Conception and design of study, Analysis and/or interpretation of data, Writing – original draft, Writing – review & editing. Mohamed Jmaiel: Conception and design of study,

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This project is carried out under the MOBIDOC scheme, funded by the EU through the EMORI program and managed by the ANPR (Tunisia). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. All authors approved the version of the manuscript to be published.

References (52)

  • Gao, M., Li, A., Yu, R., Morariu, V. I., & Davis, L. S. (2018). C-WSL: Count-guided weakly supervised localization. In...
  • Ghorbel, M., et al. Improving person re-identification by background subtraction using two-stream convolutional networks.
  • Ghorbel, M., et al. Fusing local and global features for person re-identification using multi-stream deep neural networks.
  • Ghosh, P., et al. Understanding center loss based network for image retrieval with few training data.
  • Gong, S., et al. The re-identification challenge.
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE...
  • Hermans, A., et al. (2017). In defense of the triplet loss for person re-identification.
  • Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer...
  • Huang, H., et al. Adversarially occluded samples for person re-identification.
  • Huang, Y., et al. Illumination-invariant person re-identification.
  • Kim, D., et al. Two-phase learning for weakly supervised object localization.
  • Krizhevsky, A., et al. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems.
  • Li, D., Chen, X., Zhang, Z., & Huang, K. (2017). Learning deep context-aware features over body and latent parts for...
  • Li, K., et al. (2018). Tell me where to look: Guided attention inference network.
  • Li, W., Zhao, R., Xiao, T., & Wang, X. (2014). DeepReID: Deep filter pairing neural network for person...
  • Li, W., Zhu, X., & Gong, S. (2018). Harmonious attention network for person re-identification. In Proceedings of the...