Elsevier

Pattern Recognition Letters

Volume 129, January 2020, Pages 279-286

Scale-invariant batch-adaptive residual learning for person re-identification

https://doi.org/10.1016/j.patrec.2019.11.032

Highlights

  • Integration of scale invariant (SI) convolution in residual architectures.

  • Proposal of a batch-adaptive triplet loss function for better deep metric learning.

  • Design of two residual architectures, one deeper SI-TriNet and the other shallower SISR-32.

  • Superior performance over state-of-the-art methods on two benchmark datasets.

Abstract

The problem of person re-identification (re-ID) deals with matching two similar persons in probe and gallery sets. The underlying pattern matching task becomes more complex because similar persons can appear at different scales in the two sets. In this paper, we address this challenging problem of scale-invariant person re-ID. As a solution, we propose two scale-invariant residual networks with a new loss function for deep metric learning. The first network, termed Scale-Invariant Triplet Network (SI-TriNet), is deeper and is trained from pre-trained weights. In contrast, the second network, named Scale-Invariant Siamese Resnet-32 (SISR-32), is shallower and is trained from scratch. Deep metric learning for both networks is realized through a batch-adaptive triplet loss function. Extensive comparisons and ablation studies on the benchmark Market-1501 and CUHK03 datasets clearly demonstrate the effectiveness of the proposed formulation.

Introduction

The problem of person re-identification (re-ID) deals with appearance-based matching of pedestrians from two camera views, termed probe and gallery [1], [2]. The re-identification problem is difficult to solve because the same person can look very different in these views due to variations in pose, illumination and viewpoint. Furthermore, these persons can also undergo scale variations, which add further distortions to their appearance attributes. Variations in scale may occur due to factors like inaccurate localization of a person within a detected bounding box and variations in the physical distance of a person from different cameras in the 3D world. This scale variation greatly increases the complexity of the re-ID task.

Several existing methods have addressed the re-ID problem with two main focuses, namely, (a) robust feature descriptor generation [1], [3] and (b) better metric learning [4], [5], [6]. However, none of these works has explicitly addressed the issue of scale variation. Typically, the existing approaches train a deep Siamese neural network for robust feature extraction. Residual networks (ResNets) [7] have emerged as a popular deep network because they yield performance comparable to other deep models with far fewer model parameters. Another crucial factor, which plays a key role in achieving accurate person re-ID, is metric learning. The triplet loss function has become popular for better metric learning [8]. More recently, in [6], the authors use a batch-hard triplet loss to mine the most relevant (hardest) triplets within a batch. However, even such a model is found to be susceptible to outliers.
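To make the batch-hard mining idea of [6] concrete, the following is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the implementation used in [6] or in this paper: the function names, the margin value and the use of squared Euclidean distance are all assumptions chosen for readability.

```python
from typing import List, Sequence


def squared_euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def batch_hard_triplet_loss(
    embeddings: List[Sequence[float]],
    labels: List[int],
    margin: float = 0.2,  # assumed value, not taken from the paper
) -> float:
    """Mean hinge loss over anchors, each paired with its hardest
    positive and hardest negative found inside the same batch."""
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        # Distances to every other sample that shares the anchor's identity.
        pos = [squared_euclidean(anchor, e)
               for j, (e, l) in enumerate(zip(embeddings, labels))
               if l == label and j != i]
        # Distances to every sample of a different identity.
        neg = [squared_euclidean(anchor, e)
               for e, l in zip(embeddings, labels) if l != label]
        if not pos or not neg:
            continue  # this anchor cannot form a valid triplet in the batch
        # Hardest positive = farthest same-identity sample;
        # hardest negative = closest different-identity sample.
        losses.append(max(0.0, max(pos) - min(neg) + margin))
    return sum(losses) / len(losses) if losses else 0.0
```

Because each anchor contributes only its single farthest positive and single closest negative, one mislabeled or badly cropped sample can determine the loss for an anchor, which is one way to read the susceptibility to outliers noted above.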

In this paper, we address the problem of scale-invariant person re-ID using scale-invariant residual networks and a new loss function for deep metric learning. Our main contributions are the following: (a) we propose two scale-invariant residual architectures and establish that these networks have better gradient activation than conventional ResNets; and (b) we introduce a batch-adaptive triplet loss function with better triplet mining capability (compared with the existing batch-hard triplet loss [6]) within a Siamese configuration.

Section snippets

Related work

Most earlier works in re-ID [1], [4] are based on hand-crafted systems. However, they have failed to yield good results in complex scenarios. Inspired by the excellent performance of Convolutional Neural Networks (CNNs) in image classification tasks, many recent works in re-ID have explored deep learning as a part of their solution. For example, see the works reported in [2], [9], [10], [11]. However, the accuracy of such models is limited by the unavailability of large training data.

Proposed architecture

Feature descriptors (filter kernels) in a CNN are able to detect relevant features irrespective of their spatial locations. However, this behavior (at a given scale) cannot be automatically guaranteed while dealing with more than one scale. Learning feature detectors (filter kernels) that can respond to similar patterns at multiple scales is more likely to improve the recognition task in current deep architectures. Such a neural network can be termed as a scale-invariant convolution neural
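The idea of a filter that responds to similar patterns at several scales can be illustrated with a small 1-D sketch. The snippet above is truncated, so the scheme below is an assumption rather than the paper's layer: it uses one common formulation in which the same kernel is applied to rescaled copies of the input, the responses are resampled back to the original length, and the maximum over scales is kept. The function names, the nearest-neighbour resampling and the scale set are hypothetical choices.

```python
from typing import List, Sequence


def resize_nearest(signal: Sequence[float], new_len: int) -> List[float]:
    """Resample a 1-D signal to new_len samples by nearest-neighbour lookup."""
    old_len = len(signal)
    return [signal[min(old_len - 1, int(i * old_len / new_len))]
            for i in range(new_len)]


def conv1d_same(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Cross-correlate signal with kernel, zero-padded so that the
    output has the same length as the input."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def scale_invariant_conv1d(
    signal: Sequence[float],
    kernel: Sequence[float],
    scales: Sequence[float] = (0.5, 1.0, 2.0),  # assumed scale set
) -> List[float]:
    """Apply one shared kernel at several input scales and keep,
    per position, the strongest response across scales."""
    n = len(signal)
    responses = []
    for s in scales:
        # Rescale the input, but never below the kernel length.
        scaled = resize_nearest(signal, max(len(kernel), round(n * s)))
        response = conv1d_same(scaled, kernel)
        # Bring the response back to the original resolution.
        responses.append(resize_nearest(response, n))
    # Max over scales: the kernel weights are shared, only the input changes.
    return [max(column) for column in zip(*responses)]
```

Since scale 1.0 is part of the assumed scale set, the output at each position is at least as large as that of an ordinary single-scale convolution with the same kernel; the extra scales only add responses to patterns that appear larger or smaller than the kernel expects.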

Proposed deep metric learning

Weinberger and Saul [5] proposed metric learning for k-nearest neighbor classification via the “Large Margin Nearest Neighbor (LMNN)” loss. A problem with LMNN is that it cannot properly handle fixed targeted neighbors. FaceNet [8] introduced the “triplet loss”, which is more suitable for deep metric learning. If ga, gp and gn are an arbitrary anchor, positive and negative (triplet) person set and Da,p and Da,n represent the similarity measure (squared Euclidean distance) between the embeddings of the pairs (ga,
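As the snippet above breaks off before the loss is written out, the following sketch shows the standard triplet loss in its widely used hinge form, max(0, Da,p - Da,n + margin), with squared Euclidean distances as described. It is a hedged illustration, not the paper's batch-adaptive loss; the function name and the margin value are assumptions.

```python
from typing import Sequence


def triplet_loss(
    g_a: Sequence[float],
    g_p: Sequence[float],
    g_n: Sequence[float],
    margin: float = 0.2,  # assumed value, not taken from the paper
) -> float:
    """Hinge-form triplet loss on already-computed embeddings."""
    # D_ap: squared Euclidean distance between anchor and positive.
    d_ap = sum((a - p) ** 2 for a, p in zip(g_a, g_p))
    # D_an: squared Euclidean distance between anchor and negative.
    d_an = sum((a - n) ** 2 for a, n in zip(g_a, g_n))
    # Zero once the negative is farther than the positive by at least the margin.
    return max(0.0, d_ap - d_an + margin)
```

The loss vanishes for triplets that already satisfy the margin, so only violating triplets contribute a gradient; this is why the choice of which triplets to mine within a batch matters for training.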

Triplet siamese configuration

Deeper neural networks generally perform better than shallower ones, but such networks are difficult to train due to over-fitting resulting from the unavailability of sufficient data. A better option is to train the network under consideration starting from existing pre-trained models. In this work, we propose two SI residual networks, one deeper and one shallower. The deeper SI network is developed from the pre-trained ResNet-50 [7] architecture while the shallower network is built from stacking

Experimental results

In this section, we present a brief discussion of the datasets, followed by the evaluation protocol and training details. Then, we show detailed comparisons with several state-of-the-art methods and also include two ablation studies.

Conclusion

In this paper, we proposed two scale-invariant residual networks for robust person re-ID tasks. We also introduced a new triplet loss function for better metric learning. Superior performance over the current state-of-the-art approaches on the benchmark Market-1501 and CUHK03 datasets indicates the effectiveness of our formulation. In the future, we plan to extend the proposed framework to re-ID problems in open settings.

Declaration of competing interest

We hereby declare that we do not have any conflict of interest for this manuscript.

References (25)

  • C. Zhao et al.

    Multilevel triplet deep learning model for person re-identification

    Pattern Recognit. Lett.

    (2019)
  • L. Zhang et al.

    Learning a discriminative null space for person re-identification

    CVPR

    (2016)
  • R.R. Varior et al.

    Gated siamese convolutional neural network architecture for human re-identification

    ECCV

    (2016)
  • M. Kostinger et al.

    Large scale metric learning from equivalence constraints

    CVPR

    (2012)
  • K.Q. Weinberger et al.

    Distance metric learning for large margin nearest neighbor classification

    J. Mach. Learn. Res.

    (2009)
  • A. Hermans, L. Beyer, B. Leibe

    In defense of the triplet loss for person re-identification

    (2017)....
  • K. He et al.

    Deep residual learning for image recognition

    CVPR

    (2016)
  • F. Schroff et al.

    FaceNet: A unified embedding for face recognition and clustering

    CVPR

    (2015)
  • Y. Zhang et al.

    Deep mutual learning

    CVPR

    (2018)
  • W. Li et al.

    Harmonious attention network for person re-identification

    CVPR

    (2018)
  • Y. Wang et al.

    Resource aware person re-identification across multiple resolutions

    CVPR

    (2018)
  • W. Li et al.

    Person re-identification by deep joint learning of multi-loss classification

    Proceedings of IJCAI

    (2017)
Cited by (11)

    • AVPL: Augmented visual perception learning for person Re-identification and beyond

      2022, Pattern Recognition
      Citation Excerpt :

      And the object function required that the feature distance of positive sample pairs must be smaller than that of the negative sample pairs. Sikdar et al. [27] propose a batch adaptive triplet loss with better triplet mining capability within a Siamese configuration. In this work, the positives are weighted based on their hardness with respect to the anchor (i.e. similarity).

    • HMMN: Online metric learning for human re-identification via hard sample mining memory network

      2021, Engineering Applications of Artificial Intelligence
      Citation Excerpt :

      In addition, Li et al. (2018), Lawen et al. (2020), Zhou et al. (2021) and Li et al. (2021) have been published describing the design of small networks for person re-identification problem to reduce heavy computational resource consumption, which is more suitable for edge computing. It is worth noting the jobs (Sikdar et al., 2020; Sikdar and Chowdhury, 2020), which introduce an open-set person re-identification problem and propose batch-adaptive triplet mining technique for person re-identification. Traditional or close-set re-ID systems are not equipped to handle such cases and raise several false alarms as a result.

    • Appearance feature enhancement for person re-identification

      2021, Expert Systems with Applications
      Citation Excerpt :

      E.g., Lv et al. (Lv, Li, Nai, Chen, & Yuan, 2020) proposed an expanded neighborhoods distance (END) to re-rank the re-ID results to address the problem of low intra-class similarity and high inter-class similarity. Besides, some researchers (Sikdar & Chowdhury, 2020) carefully designed the scale-invariant residual network to extract scale-invariant deep features. Also some researchers deploy the CNN architecture into unsupervised Re-ID tasks.

    • Unsupervised learning of local features for person re-identification with loss function

      2023, International Journal of Autonomous and Adaptive Communications Systems

    Handled by Associate Editor S. Sarkar.
