Person re-identification based on multi-region-set ensembles

https://doi.org/10.1016/j.jvcir.2016.06.009

Highlights

  • We propose a simple and effective strategy to combine semantic regions with their contextual information.

  • We achieve a higher rank-1 recognition rate and competitive performance compared with the state of the art on four challenging datasets.

Abstract

Person re-identification is an important topic in video surveillance. We present a new feature representation for person re-identification based on multi-region-set ensembles (MRSE), which combines semantic regions with their contextual information. The motivation of this paper is that people can often recognize whether two images show the same person from one or several local regions of the appearance. Our approach consists of three steps. First, we segment the person into semantic regions such as “hair”, “face”, “up-cloth” (upper clothes) and “lo-cloth” (lower clothes). After obtaining these regions, we form multiple sets of different combinations by selecting a few regions and concatenating their features. We then combine the distances over all sets to compute the similarity of a query-gallery image pair. Finally, we achieve a higher rank-1 recognition rate and competitive performance compared with the state of the art on four challenging datasets: iLIDS, VIPeR, CAVIAR4REID and 3DPeS.

Introduction

Person re-identification (re-ID), i.e., matching a given probe image against a gallery of candidate images captured by disjoint cameras, has become increasingly important for solving many visual problems in video surveillance, such as tracking criminals [1] and human analysis [2]. As shown in Fig. 1, this task is challenging because different persons may have similar appearance, while the same person can vary in viewpoint, pose, illumination and occlusion across cameras.

To address these challenges, most existing work on person re-identification can be divided into two stages: (1) feature representations that are robust to these variations, and (2) discriminative metric learning models. A good overview can be found in [3]. We briefly review these methods as follows.

For feature representation, histograms of color and texture are the most widely used appearance-based features for person re-identification. They are combined to describe the appearance of a person image across different camera views [4], [5], [6], [7], [8], [9], [10]. Farenzena et al. [4] presented an appearance-based method which combined the spatial arrangement of colors into stable regions. Bazzani et al. [5] focused on the overall chromatic content by incorporating global and local color histograms of the human appearance. Recently, Yang et al. [11] proposed a novel color descriptor based on salient color names which outperformed the state of the art for person re-identification. In addition, Liu et al. [12] combined appearance features with gait features capturing shape and temporal information to enhance person re-identification.

To improve robustness against low resolution, occlusions, and pose, viewpoint and illumination changes, some researchers apply the following pipeline [13], [14]: (1) divide a person image into several horizontal stripes of equal size, (2) compute an appearance-based feature for each stripe, and (3) concatenate them to form the feature of the whole image, as shown in Fig. 2(a) and sketched in the code below.
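As an illustration only (not code from [13] or [14]), the following minimal Python sketch implements this stripe pipeline with an HSV color histogram per stripe; the stripe count and bin numbers are assumptions chosen for the example.

    import numpy as np
    import cv2  # OpenCV, used here only to convert BGR to HSV

    def stripe_color_feature(image_bgr, n_stripes=6, bins=(8, 8, 8)):
        # (1) divide the image into equal-height horizontal stripes,
        # (2) compute an HSV color histogram for each stripe,
        # (3) concatenate the per-stripe histograms into one descriptor.
        hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
        stripe_h = hsv.shape[0] // n_stripes
        feats = []
        for i in range(n_stripes):
            stripe = hsv[i * stripe_h:(i + 1) * stripe_h]
            hist, _ = np.histogramdd(
                stripe.reshape(-1, 3).astype(np.float64),
                bins=bins,
                range=((0, 180), (0, 256), (0, 256)))  # OpenCV hue lies in [0, 180)
            hist = hist.ravel()
            feats.append(hist / (hist.sum() + 1e-8))  # L1-normalize each stripe
        return np.concatenate(feats)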

However, pedestrians are non-rigid, so their poses may vary considerably under different conditions, and the same person seen by non-adjacent cameras may appear against different backgrounds. Person re-ID therefore still faces challenges such as pose variation and complicated backgrounds, even when the image is divided into several horizontal stripes. Image segmentation is a critical preprocessing step in computer vision [15], [16], [17], and it has been introduced into person re-identification to suppress the interference of complicated backgrounds [18], [11]. Chang et al. [18] proposed a single-shot person re-identification algorithm based on pedestrian segmentation, where the pedestrian foreground was segmented by combining shape prior information with color seeds; they then divided the person image into several horizontal stripes. Yang et al. [11] extracted features from horizontal stripes of the foreground region in the same way. However, horizontal stripes still lack semantic information even after the image background is removed.

After extracting features of a person image pair, matching persons can be cast as a metric learning problem that minimizes intra-class distances while maximizing inter-class distances [19], [9], [20], [21], [14]. For instance, Zheng et al. [19] learned the metric between a pair of person images as a relative distance comparison (RDC) problem: they maximized the likelihood that a true match pair has a smaller distance than a wrong match pair, in a soft discriminant manner. In [20], Li et al. showed that traditional metric learning methods based on a fixed threshold are insufficient and sub-optimal for the verification problem, and learned a joint model of a distance metric and a locally adaptive thresholding rule. Pedagadi et al. [21] proposed a metric learning algorithm using unsupervised PCA and supervised LFDA dimensionality reduction in a manifold learning framework. Xiong et al. [14] used kernel-based metric learning methods in conjunction with a ranking ensemble voting scheme, which outperforms the state of the art for re-ID. Although these methods have achieved dramatic progress, the performance of re-ID still depends to a great extent on the feature representation. This paper focuses on presenting a refined feature representation, so any metric learning approach can be used.
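For intuition, the sketch below shows the generic form such a learned metric takes: a Mahalanobis-style distance with a matrix M fitted from labelled pairs. It is a simplified stand-in, not the formulation of [19], [20], [21] or [14]; the closed-form estimate from matched-pair differences is only an assumption for the example.

    import numpy as np

    def fit_simple_metric(matched_diffs, reg=1e-3):
        # matched_diffs: array of shape (n_pairs, dim) holding x_i - y_i for
        # same-person pairs; M is estimated as the (regularized) inverse of
        # their covariance, a crude surrogate for a learned metric.
        cov = np.cov(matched_diffs, rowvar=False)
        return np.linalg.inv(cov + reg * np.eye(cov.shape[0]))

    def pair_distance(x, y, M):
        # squared Mahalanobis distance between two feature vectors
        d = x - y
        return float(d @ M @ d)

Whatever the learning algorithm, the resulting metric is used in the same way: true match pairs should obtain smaller pair_distance values than wrong match pairs, and the gallery is ranked by this distance.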

Since dividing a person image into several horizontal stripes still struggles to represent appearance under different viewpoints, we seek to reduce background interference and develop a refined feature representation by extracting several semantic regions and their contextual information, as shown in Fig. 2(b). We first segment the person into regions such as “hair”, “face”, “up-cloth” (upper clothes) and “lo-cloth” (lower clothes). After obtaining these regions, we form multiple sets of different combinations by selecting a few regions and concatenating their features through multi-region-set ensembles; a sketch of this construction follows below. Experimental results demonstrate the effectiveness of our proposed method.
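The following sketch illustrates this construction under stated assumptions: per-region feature vectors are taken as given, and the enumeration of region subsets and the simple summation of per-set distances are illustrative choices rather than the exact formulation of Section 3.

    import itertools
    import numpy as np

    REGIONS = ("hair", "face", "up-cloth", "lo-cloth")  # semantic regions

    def region_sets(min_size=2):
        # enumerate the region combinations that form the multiple sets
        return [s for r in range(min_size, len(REGIONS) + 1)
                for s in itertools.combinations(REGIONS, r)]

    def set_feature(region_feats, subset):
        # concatenate the features of the regions belonging to one set
        return np.concatenate([region_feats[name] for name in subset])

    def mrse_distance(query_feats, gallery_feats,
                      dist=lambda a, b: float(np.linalg.norm(a - b))):
        # combine (here: sum) the per-set distances of a query-gallery pair;
        # a learned metric can replace the Euclidean distance
        return sum(dist(set_feature(query_feats, s), set_feature(gallery_feats, s))
                   for s in region_sets())

Here query_feats and gallery_feats map each region name to its feature vector (e.g., histogram features such as those sketched above).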

This paper is organized as follows: Section 2 briefly describes the related work. Section 3 details the proposed method. Section 4 presents the experimental evaluation of the proposed approach, and Section 5 summarizes the proposed method and future work.

Section snippets

Related work

Because person re-identification has important applications in video surveillance, it has attracted much attention in recent years, and the literature on the problem is vast. A detailed survey is beyond the scope of this paper; we review some related work on multiple-visual-feature fusion models [4], [22] and part-based models [23], [24].

Our model

In this section, we describe our model for person re-identification. The output of our model is a ranking of the gallery of candidate images for a given probe image. During the training stage, the input is a set of person image pairs. We first segment the person into semantic regions and form multiple sets of different combinations by selecting a few regions and concatenating their features. We then treat these multiple sets as training data for a multi-region-set metric

Datasets

In this section, we evaluate our method on four publicly available datasets (iLIDS [32], VIPeR [31], CAVIAR4REID [33], 3DPeS [34]), which can be downloaded from the authors’ homepages.

In the i-LIDS dataset [32], the images were captured from an airport arrival

Conclusion

In this paper, we presented a new method for person re-identification based on semantic region extraction and multi-region-set ensembles, which differs from traditional methods that divide a person image into several stripes of equal size. We focused on how to closely combine semantic regions with their contextual information to improve performance. Experimental results show the effectiveness of our proposed method. The limitation of our proposed method is that it

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (Nos. 61271289 and 61502084), and by the Program for Science and Technology Innovative Research Team for Young Scholars in Sichuan Province, China (No. 2014TD0006).

References (37)

  • M. Hirzer et al., Person re-identification by efficient impostor-based metric learning.

  • Y. Yang et al., Salient color names for person re-identification.

  • R. Satta, Appearance descriptors for person re-identification: a comprehensive review. Available from: ...

  • F. Xiong et al., Person re-identification using kernel-based metric learning methods.

  • F. Meng et al., Object co-segmentation based on shortest path algorithm and saliency model, IEEE Trans. Multimedia (2012).

  • H. Li et al., Unsupervised multiclass region cosegmentation via ensemble clustering and energy minimization, IEEE Trans. Circ. Syst. Video Technol. (2014).

  • H. Li et al., Repairing bad co-segmentation using its quality evaluation and segment propagation, IEEE Trans. Image Process. (2014).

  • Y.-C. Chang et al., Single-shot person re-identification based on improved random-walk pedestrian segmentation.
