Masking for better discovery: Weakly supervised complementary body regions mining for person re-identification

https://doi.org/10.1016/j.eswa.2022.116636

Highlights

  • Weakly supervised data augmentation based network for fine-grained person re-identification.

  • Modeling the spatial relationship of a fine-grained region for person re-identification.

  • Discovering more complementary feature representation in a weakly supervised fashion.

Abstract

Person re-identification still faces several challenges related to factors such as complex poses, occlusion, misalignment and poor detection. Recent works in the literature focus on extracting local information from the human body, but most of them rely on full supervision during training.

In this paper, we propose a new end-to-end trainable neural network, named Attention Dropping Network (ADN), which discovers diverse and rich visual cues without extra human semantic parsing. ADN aims to find fine-grained local information to address the common person re-identification challenges. Concretely, our network consists of two branches: the Attention Global Branch learns pixel-level local regions based on a fine-grained attention mechanism, while the Feature Dropping Branch learns additional, otherwise missed features in a weakly supervised manner. The fine-grained attention mechanism allows our model to be robust to complex pose variations and to ignore redundant background. Extensive experiments on three benchmark datasets (Market-1501, DukeMTMC-reID, and CUHK03) demonstrate the effectiveness and robustness of the proposed network in handling complex poses, misalignment and occlusion.

Introduction

Person re-identification (Re-ID) aims to find a specific pedestrian across non-overlapping surveillance cameras deployed at different locations (Gong et al., 2014). It has very important applications in the field of video surveillance (Bialkowski et al., 2012). Nonetheless, person Re-ID still faces many challenges in accurately differentiating specific targets across different surveillance scenarios. Pedestrian images often suffer from background clutter (Ghorbel et al., 2019, Tian et al., 2018), illumination variation (Huang et al., 2019), strong occlusion (Huang et al., 2018) and unconstrained poses (Cho & Yoon, 2016). All of these problems can affect the performance of person Re-ID. In fact, images that belong to the same person are usually captured under significantly different viewpoints and poses, which makes the intra-class variance high (Wang et al., 2015). In addition, images that belong to different classes can be very similar apart from some minor differences, making the inter-class variance very low. The main challenge of any person Re-ID system is therefore to build a rich and suitable feature representation that is robust to all the problems cited above and overcomes inter-class confusion and intra-class variation. Apart from the difficulty of extracting features, the small sample size of some person Re-ID databases can be an obstacle to learning a good model of each identity’s intra-class variability (Gong et al., 2014).

In the literature, several works have been devoted to dealing with these specific challenges. In recent years, convolutional neural networks (CNNs) have been widely used to extract multi-level features, which yields a more relevant representation of each image. Wang et al. (2018) showed that deep learning (CNN-based) methods achieve higher recognition rates than traditional methods. Some existing person Re-ID methods jointly learn discriminative features and a similarity measure (Ahmed et al., 2015, Hermans et al., 2017, Zheng et al., 2017). Several recent works have shown that it is crucial to embed local information from person images to deal with large appearance variations (Huang et al., 2017, Li et al., 2017, Yao et al., 2019). Similarly, others rely on additional spatial information to mitigate background clutter and arbitrary misalignment (Wei et al., 2018, Zheng et al., 2019). Semantic segmentation has also been used to guide a model to extract features from specific regions of the image (Gao et al., 2019, Ghorbel et al., 2019, Ghorbel et al., 2020, Song et al., 2018), but such methods usually require large labeled datasets and incur considerable additional computational cost. Recently, several works have adopted attention mechanisms, which provide an effective way to enhance Re-ID performance (Jiang et al., 2020, Li et al., 2019, Li, Zhu et al., 2018, Si et al., 2018, Xu et al., 2018). However, most of these works overlook the fine-grained details that may contain important distinguishing features. In the same direction, weakly supervised learning (Cao et al., 2015, Zhang et al., 2018), which finds the most discriminative region of an image using only image-level labels, has gained much popularity as a solution to the scarcity of labeled data in computer vision. Such methods have proven effective for object localization (Gao et al., 2018, Zhang, Yang et al., 2018), but they only cover the most discriminative part of an object rather than the entire object region. Other works have been devoted to designing re-ranking methods (Mansouri et al., 2021, Zhong et al., 2017) that enhance person Re-ID performance by re-sorting the ranking list of any baseline method.

In this work, we propose the Attention Dropping Network (ADN), a weakly supervised method that discovers diverse and rich visual cues without extra human semantic parsing. ADN extracts complementary visual features that can enhance the traditional global representation for person re-identification. Fig. 1 shows two images of the same person with different poses and backgrounds, and illustrates how ADN overcomes the specific person Re-ID challenges by focusing on the most discriminative regions of the image. In addition, our proposed network focuses on finer details of the person image and thus extracts more discriminative features. Our method learns pixel-level local regions based on a fine-grained attention mechanism that extracts more useful fine-grained foreground features and reduces the impact of redundant background, which helps distinguish persons dressed in very similar clothes. The Attention Dropping Network is a two-branch network consisting of an Attention Global Branch (AGB) and a Feature Dropping Branch (FDB), where the attention-based dropping data augmentation is applied. The AGB extracts fine-grained features based on M local attention maps. In this branch, a bilinear pooling module models the local pairwise interactions between the attention maps and every feature channel while preserving the spatial relationship. Since the attention mechanism focuses on the most relevant features, the AGB may skip less salient features that could still contribute to the final prediction. To remedy this problem, we introduce a second branch, the FDB, through which additional detailed local features are learned in a weakly supervised manner. Explicitly, in the FDB we randomly select one attention map and erase from the image the most discriminative region according to this map. This operation is applied during the training phase to encourage the network to learn from the rest of the image, which may contain other useful features ignored by the first branch.
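The following is a minimal PyTorch sketch of this attention-guided dropping. It is illustrative only: the function name `attention_drop` and the parameter `drop_threshold` are assumptions, not the authors' implementation.

```python
# Minimal sketch of attention-based dropping (illustrative, not the authors' code).
import torch


def attention_drop(images, attention_maps, drop_threshold=0.5):
    """Erase the most discriminative region indicated by one randomly chosen attention map.

    images:         (B, 3, H, W) input batch
    attention_maps: (B, M, h, w) the M local attention maps of the AGB
    """
    B, M, _, _ = attention_maps.shape
    H, W = images.shape[2:]

    # Randomly pick one of the M attention maps per image.
    idx = torch.randint(0, M, (B,), device=images.device)
    selected = attention_maps[torch.arange(B), idx].unsqueeze(1)   # (B, 1, h, w)

    # Upsample to image resolution and normalize each map to [0, 1].
    selected = torch.nn.functional.interpolate(
        selected, size=(H, W), mode="bilinear", align_corners=False)
    selected = selected / (selected.amax(dim=(2, 3), keepdim=True) + 1e-6)

    # Keep only the pixels where the attention is BELOW the threshold,
    # i.e. erase the most discriminative region from the image.
    drop_mask = (selected < drop_threshold).float()
    return images * drop_mask
```

Applying this mask only during training forces the network to mine complementary regions instead of relying solely on the single most discriminative one.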

The main contributions of this paper are as follows:

(1) Modeling the spatial relationship of a fine-grained region through a bilinear pooling operation between the most discriminative region’s attention map and the feature maps extracted from the whole image (see the sketch after this list).

(2) Discovering a more complementary feature representation through the Feature Dropping Branch, an attentive feature learning module operating in a weakly supervised fashion.

(3) Our proposed network is trained in an end-to-end fashion and achieves state-of-the-art performance on three public person re-identification datasets.
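As a companion to contribution (1), the following is a minimal sketch of bilinear pooling between the M attention maps and the backbone feature maps. The function name and tensor shapes are illustrative assumptions about how such a pooling can be realized, not the authors' exact module.

```python
# Minimal sketch of bilinear attention pooling (illustrative assumptions).
import torch


def bilinear_attention_pooling(feature_maps, attention_maps):
    """feature_maps:   (B, C, h, w) backbone features of the whole image
       attention_maps: (B, M, h, w) local attention maps
       returns:        (B, M, C) one pooled part descriptor per attention map
    """
    # Pairwise interaction: every attention map weights every feature channel
    # at every spatial location, so the spatial relationship is preserved
    # before pooling.
    part_features = torch.einsum("bmhw,bchw->bmc", attention_maps, feature_maps)
    # Average over the spatial grid (global average pooling of the product).
    part_features = part_features / (feature_maps.shape[2] * feature_maps.shape[3])
    return part_features
```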

Section snippets

Related work

In this section, we present a review of person re-identification and fine-grained classification methods.

Proposed Attention Dropping Network (ADN)

The overall network architecture of the proposed Attention Dropping Network (ADN) is illustrated in Fig. 3. The proposed ADN model consists of a basic feature extractor backbone followed by two branches: an Attention Global Branch (AGB) and a Feature Dropping Branch (FDB) where an attention-based dropping data augmentation is applied. The AGB aims to extract the global feature representations. Generally, the images used for person Re-ID do not provide labels for every pixel in the image which
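To make the two-branch structure concrete, the following is a minimal sketch of a forward pass that reuses the two helpers sketched earlier. The ResNet-50 backbone, the 1x1 convolutional attention head and the classifier layout are assumptions chosen for illustration, not the authors' exact architecture.

```python
# Minimal sketch of a two-branch (AGB + FDB) forward pass (illustrative only).
import torch
import torchvision


class ADNSketch(torch.nn.Module):
    def __init__(self, num_attention_maps=32, num_classes=751):
        super().__init__()
        # Assumed backbone: ResNet-50 without its average pooling and FC head
        # (torchvision >= 0.13 API).
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = torch.nn.Sequential(*list(backbone.children())[:-2])
        # 1x1 convolution producing the M local attention maps (an assumption).
        self.attention = torch.nn.Conv2d(2048, num_attention_maps, kernel_size=1)
        # num_classes = number of training identities (e.g. 751 on Market-1501).
        self.classifier = torch.nn.Linear(num_attention_maps * 2048, num_classes)

    def forward(self, images):
        feats = self.backbone(images)                       # (B, 2048, h, w)
        attn = torch.relu(self.attention(feats))            # (B, M, h, w)

        # Attention Global Branch: bilinear pooling of attention and features.
        parts = bilinear_attention_pooling(feats, attn)     # (B, M, 2048)
        logits_agb = self.classifier(parts.flatten(1))

        if not self.training:
            return logits_agb

        # Feature Dropping Branch: re-run on images whose most discriminative
        # region has been erased, so complementary cues must be learned.
        # Detaching the attention maps treats the dropping as pure data
        # augmentation (a design choice of this sketch).
        dropped = attention_drop(images, attn.detach())
        feats_d = self.backbone(dropped)
        attn_d = torch.relu(self.attention(feats_d))
        parts_d = bilinear_attention_pooling(feats_d, attn_d)
        logits_fdb = self.classifier(parts_d.flatten(1))
        return logits_agb, logits_fdb
```

In training mode both branches return identity logits, so classification losses can be applied to each branch; at test time only the AGB output is used in this sketch.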

Experiments

In this section, we empirically evaluate the proposed Attention Dropping Network (ADN) on three benchmark datasets: Market-1501 (Zheng et al., 2015), DukeMTMC-reID (Ristani et al., 2016) and CUHK03 (Li et al., 2014).

Market-1501 (Zheng et al., 2015) is composed of 32,668 pedestrian images of 1501 different identities. It is partitioned into 751 identities for the training stage and 750 for the testing stage. On average, there are 3.6 images for each person under each camera. Images are
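For reference, the following is a minimal sketch of the rank-1 evaluation commonly used on these benchmarks, based on a query-to-gallery cosine ranking. It is a simplified, assumed protocol: the same-camera and junk-image filtering of the official Market-1501 evaluation is omitted, and this is not the authors' evaluation code.

```python
# Minimal sketch of rank-1 accuracy from a cosine-distance ranking
# (simplified; omits the official same-camera/junk filtering).
import torch


def rank1_accuracy(query_feats, query_ids, gallery_feats, gallery_ids):
    """All features are L2-normalized (N, D) tensors; ids are (N,) integer tensors."""
    sims = query_feats @ gallery_feats.t()                  # (Nq, Ng) cosine similarities
    best = sims.argmax(dim=1)                               # index of the top-1 gallery match
    return (gallery_ids[best] == query_ids).float().mean().item()
```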

Conclusion

We proposed in this paper a novel end-to-end trainable person Re-ID model, called ADN, that aims to discover discriminative body regions in a weakly supervised fashion without extra human semantic parsing. Our aim is to focus on fine-grained features while remaining robust to background clutter and pose variations. The proposed network consists of two branches. On the one hand, the Attention Global Branch leverages both an attention mechanism and bilinear pooling to find fine-grained local information in a robust way

CRediT authorship contribution statement

Mahmoud Ghorbel: Conception and design of study, Acquisition of data, Analysis and/or interpretation of data, Writing – original draft, Writing – review & editing. Sourour Ammar: Conception and design of study, Analysis and/or interpretation of data, Writing – original draft, Writing – review & editing. Yousri Kessentini: Conception and design of study, Analysis and/or interpretation of data, Writing – original draft, Writing – review & editing. Mohamed Jmaiel: Conception and design of study,

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This project is carried out under the MOBIDOC scheme, funded by the EU through the EMORI program and managed by the ANPR (Tunisia). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. All authors approved the version of the manuscript to be published.

References (52)

  • Gao, M., Li, A., Yu, R., Morariu, V. I., & Davis, L. S. (2018). C-WSL: Count-guided weakly supervised localization. In...
  • Ghorbel, M., et al. Improving person re-identification by background subtraction using two-stream convolutional networks.
  • Ghorbel, M., et al. Fusing local and global features for person re-identification using multi-stream deep neural networks.
  • Ghosh, P., et al. Understanding center loss based network for image retrieval with few training data.
  • Gong, S., et al. The re-identification challenge.
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE...
  • Hermans, A., et al. (2017). In defense of the triplet loss for person re-identification.
  • Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer...
  • Huang, H., et al. Adversarially occluded samples for person re-identification.
  • Huang, Y., et al. Illumination-invariant person re-identification.
  • Kim, D., et al. Two-phase learning for weakly supervised object localization.
  • Krizhevsky, A., et al. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems.
  • Li, D., Chen, X., Zhang, Z., & Huang, K. (2017). Learning deep context-aware features over body and latent parts for...
  • Li, K., et al. (2018). Tell me where to look: Guided attention inference network.
  • Li, W., Zhao, R., Xiao, T., & Wang, X. (2014). DeepReID: Deep filter pairing neural network for person...
  • Li, W., Zhu, X., & Gong, S. (2018). Harmonious attention network for person re-identification. In Proceedings of the...