Scale-invariant batch-adaptive residual learning for person re-identification

doi:10.1016/j.patrec.2019.11.032

Pattern Recognition Letters

Volume 129, January 2020, Pages 279-286

Highlights

• Integration of scale-invariant (SI) convolution into residual architectures.
• A batch-adaptive triplet loss function for better deep metric learning.
• Design of two residual architectures: the deeper SI-TriNet and the shallower SISR-32.
• Superior performance over the state of the art on two benchmark datasets.

Abstract

The problem of person re-identification (re-ID) deals with matching similar persons across probe and gallery sets. The underlying pattern matching task becomes more complex because similar persons can appear at different scales in the two sets. In this paper, we address this challenging problem of scale-invariant person re-ID. As a solution, we propose two scale-invariant residual networks with a new loss function for deep metric learning. The first network, termed Scale-Invariant Triplet Network (SI-TriNet), is deeper and is trained starting from pre-trained weights. In contrast, the second network, named Scale-Invariant Siamese ResNet-32 (SISR-32), is shallower and is trained from scratch. Deep metric learning for both networks is realized through a batch-adaptive triplet loss function. Extensive comparisons and ablation studies on the benchmark Market-1501 and CUHK03 datasets clearly demonstrate the effectiveness of the proposed formulation.

Introduction

The problem of person re-identification (re-ID) deals with appearance-based matching of pedestrians from two camera views, termed probe and gallery [1], [2]. The re-identification problem is difficult to solve because the same person can look very different in these views due to variations in pose, illumination and viewpoint. Furthermore, these persons can also undergo scale variations, which further distort their appearance attributes. Variations in scale may occur due to factors such as inaccurate localization of a person within a detected bounding box and differences in the physical distance of a person from different cameras in the 3D world. This scale variation greatly increases the complexity of the re-ID task.

Several existing methods have addressed the re-ID problem with two main foci, namely, (a) robust feature descriptor generation [1], [3] and (b) better metric learning [4], [5], [6]. However, none of these works has explicitly addressed the issue of scale variation. Typically, the existing approaches train a deep Siamese neural network for robust feature extraction. Residual networks (ResNets) [7] have emerged as a popular choice of deep network, as they yield performance comparable to other deep models with far fewer model parameters. Another crucial factor in achieving accurate person re-ID is metric learning. The triplet loss function has become popular for better metric learning [8]. More recently, the authors of [6] used a batch-hard triplet loss to mine the most relevant (hardest) triplets within a batch. However, even such a model is found to be susceptible to outliers.

In this paper, we address the problem of scale-invariant person re-ID using scale-invariant residual networks and a new loss function for deep metric learning. Our main contributions are the following: (a) we propose two scale-invariant residual architectures and establish that these networks have better gradient activation than conventional ResNets; and (b) we introduce a batch-adaptive triplet loss function with better triplet mining capability (than the existing batch-hard triplet loss [6]) within a Siamese configuration.

Section snippets

Related work

Most earlier works in re-ID [1], [4] are based on hand-crafted systems. However, they have failed to yield good results in complex scenarios. Inspired by the excellent performance of Convolutional Neural Networks (CNNs) in image classification tasks, many recent works in re-ID have explored deep learning as a part of their solution; see, for example, the works reported in [2], [9], [10], [11]. However, the accuracy of such models is limited by the unavailability of large training data.

Proposed architecture

Feature descriptors (filter kernels) in a CNN are able to detect relevant features irrespective of their spatial locations. However, this behavior (at a given scale) cannot be automatically guaranteed while dealing with more than one scale. Learning feature detectors (filter kernels) that can respond to similar patterns at multiple scales is more likely to improve the recognition task in current deep architectures. Such a neural network can be termed as a scale-invariant convolution neural
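
The snippet stops before the authors define their layer, so the following is only a minimal NumPy sketch of one common way to obtain such behaviour, assumed here rather than taken from the paper: apply one shared kernel to several rescaled copies of the input, bring each response back to the original resolution, and keep the per-pixel maximum over scales. The function names, the nearest-neighbour resampling and the scale set (0.5, 1.0, 2.0) are illustrative assumptions.

```python
import numpy as np


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2-D array to (out_h, out_w)."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows[:, None], cols[None, :]]


def correlate_same(img, kernel):
    """Zero-padded 2-D cross-correlation; output has the same size as img."""
    kh, kw = kernel.shape
    h, w = img.shape
    padded = np.pad(img, ((kh // 2, kh - 1 - kh // 2),
                          (kw // 2, kw - 1 - kw // 2)))
    out = np.zeros((h, w), dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out


def scale_invariant_conv(img, kernel, scales=(0.5, 1.0, 2.0)):
    """Apply ONE kernel at several input scales and max-pool over scales.

    The scale set is an assumption made for illustration only.
    """
    h, w = img.shape
    responses = []
    for s in scales:
        sh, sw = max(1, round(h * s)), max(1, round(w * s))
        scaled = resize_nearest(img, sh, sw)        # rescale the input
        resp = correlate_same(scaled, kernel)       # same weights at every scale
        responses.append(resize_nearest(resp, h, w))  # back to original size
    return np.max(np.stack(responses), axis=0)      # strongest response per pixel


rng = np.random.default_rng(0)
img = rng.random((8, 6))
kernel = rng.standard_normal((3, 3))
out = scale_invariant_conv(img, kernel)
```

Because the scale 1.0 branch is an ordinary convolution, the pooled output can never be smaller than the single-scale response; the other branches only add responses to patterns that appear larger or smaller than the kernel's native size.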

Proposed deep metric learning

Weinberger and Saul [5] proposed metric learning for k-nearest neighbor classification via the “Large Margin Nearest Neighbor (LMNN)” loss. A problem with LMNN is that it cannot properly handle fixed targeted neighbors. FaceNet [8] introduced the “triplet loss”, which is more suitable for deep metric learning. Let g_a, g_p and g_n be an arbitrary anchor, positive and negative (triplet) person set, and let D_a,p and D_a,n represent the similarity measure (squared Euclidean distance) between the embeddings of the pairs (g_a,
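
For reference, the standard triplet loss of [8] and the batch-hard mining of [6], against which the proposed loss is positioned, can be sketched as follows. This is a minimal NumPy sketch of those two published baselines and not of the authors' batch-adaptive loss, whose definition is cut off in this snippet. The function names and the margin value of 0.2 are illustrative assumptions, and squared Euclidean distance is used to match the text above.

```python
import numpy as np


def squared_distances(emb):
    """Pairwise squared Euclidean distances between the rows of emb (N x d)."""
    diff = emb[:, None, :] - emb[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def triplet_loss(d_ap, d_an, margin=0.2):
    """Hinge triplet loss: max(0, D_a,p - D_a,n + margin)."""
    return np.maximum(0.0, d_ap - d_an + margin)


def batch_hard_triplet_loss(emb, labels, margin=0.2):
    """Batch-hard mining: per anchor, farthest positive and closest negative."""
    labels = np.asarray(labels)
    dist = squared_distances(np.asarray(emb, dtype=float))
    same = labels[:, None] == labels[None, :]
    pos_mask = same & ~np.eye(len(labels), dtype=bool)  # same identity, not self
    neg_mask = ~same                                    # different identity
    hardest_pos = np.where(pos_mask, dist, -np.inf).max(axis=1)
    hardest_neg = np.where(neg_mask, dist, np.inf).min(axis=1)
    # Anchors without at least one positive and one negative are skipped.
    valid = pos_mask.any(axis=1) & neg_mask.any(axis=1)
    if not valid.any():
        return 0.0
    return float(triplet_loss(hardest_pos[valid], hardest_neg[valid], margin).mean())


# Toy batch: two identities with two 2-D embeddings each.
emb = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [4.0, 3.0]])
labels = [0, 0, 1, 1]
loss = batch_hard_triplet_loss(emb, labels, margin=0.2)
```

In this toy batch only the third embedding violates the margin (its hardest positive is at squared distance 16, its hardest negative at 9), so a single extreme sample determines the whole loss. This illustrates the sensitivity to outliers that the Introduction attributes to batch-hard mining.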

Triplet siamese configuration

Deeper neural networks generally perform better than shallower ones, but such networks are difficult to train due to over-fitting resulting from the unavailability of sufficient data. A better option is to train the network under consideration starting from an existing pre-trained model. In this work, we propose two SI residual networks, one deeper and one shallower. The deeper SI network is developed from the pre-trained ResNet-50 [7] architecture while the shallower network is built from stacking
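
The defining property of a triplet Siamese configuration is that the anchor, positive and negative branches share one set of weights. The sketch below illustrates only that property: a toy linear layer with ReLU stands in for the ResNet-50 or SISR-32 backbone, no pre-trained weights are loaded, and the class and function names are hypothetical.

```python
import numpy as np


class SharedEmbedding:
    """Toy stand-in for the backbone: one linear layer followed by ReLU."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((in_dim, out_dim)) * 0.1

    def __call__(self, x):
        return np.maximum(0.0, x @ self.W)


def triplet_forward(net, anchor, positive, negative):
    """Pass all three inputs through the SAME network (shared weights)."""
    e_a, e_p, e_n = net(anchor), net(positive), net(negative)
    d_ap = np.sum((e_a - e_p) ** 2, axis=-1)  # squared distance anchor-positive
    d_an = np.sum((e_a - e_n) ** 2, axis=-1)  # squared distance anchor-negative
    return d_ap, d_an


net = SharedEmbedding(in_dim=4, out_dim=3)
x = np.ones((2, 4))
d_ap, d_an = triplet_forward(net, x, x, 2.0 * x)
```

Since every branch uses the same `net`, identical inputs always map to identical embeddings, and a gradient step on the loss would update a single set of parameters rather than three.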

Experimental results

In this section, we present a brief discussion of the datasets, followed by the evaluation protocol and training details. Then, we show detailed comparisons with several state-of-the-art methods and also include two ablation studies.

Conclusion

In this paper, we proposed two scale-invariant residual networks for robust person re-ID. We also introduced a new triplet loss function for better metric learning. Superior performance over the current state-of-the-art approaches on the benchmark Market-1501 and CUHK03 datasets indicates the effectiveness of our formulation. In the future, we plan to extend the proposed framework to re-ID problems in open settings.

Declaration of competing interest

We hereby declare that we do not have any conflict of interest for this manuscript.

References (25)

[1] C. Zhao et al., Multilevel triplet deep learning model for person re-identification, Pattern Recognit. Lett. (2019)
[2] L. Zhang et al., Learning a discriminative null space for person re-identification, CVPR (2016)
[3] R.R. Varior et al., Gated siamese convolutional neural network architecture for human re-identification, ECCV (2016)
[4] M. Kostinger et al., Large scale metric learning from equivalence constraints, CVPR (2012)
[5] K.Q. Weinberger et al., Distance metric learning for large margin nearest neighbor classification, J. Mach. Learn. Res. (2009)
[6] A. Hermans, L. Beyer, B. Leibe, In defense of the triplet loss for person re-identification, (2017)....
[7] K. He et al., Deep residual learning for image recognition, CVPR (2016)
[8] F. Schroff et al., FaceNet: A unified embedding for face recognition and clustering, CVPR (2015)
[9] Y. Zhang et al., Deep mutual learning, CVPR (2018)
[10] W. Li et al., Harmonious attention network for person re-identification, CVPR (2018)
[11] Y. Wang et al., Resource aware person re-identification across multiple resolutions, CVPR (2018)
[12] W. Li et al., Person re-identification by deep joint learning of multi-loss classification, Proceedings of IJCAI (2017)


^☆: Handled by Associate Editor S. Sarkar.
