Neurocomputing

Volume 307, 13 September 2018, Pages 25-37

Person re-identification by enhanced local maximal occurrence representation and generalized similarity metric learning

https://doi.org/10.1016/j.neucom.2018.04.013

Abstract

To solve the challenging person re-identification problem, great efforts have been devoted to feature representation and metric learning. However, existing feature extractors are either stripe-based or dense-block-based, so fine details and coarse appearance are not well integrated. Moreover, metrics are generally learned from either the distance view or the bilinear similarity view alone; few works have exploited the complementary effects of combining them. To address these issues, we propose a new feature representation termed enhanced Local Maximal Occurrence (eLOMO), which fuses a new overlapping-stripe-based descriptor with the Local Maximal Occurrence (LOMO) feature extracted from dense blocks. This integration makes eLOMO resemble the coarse-to-fine recognition mechanism of the human visual system, providing a more discriminative descriptor for re-identification. In addition, we show the advantages of learning a generalized similarity that combines the Mahalanobis distance and the bilinear similarity. Specifically, we derive a logistic metric learning method to jointly learn a distance metric and a bilinear similarity metric, which exploits both the distance and angle information in the training data. By learning in the intra-class subspace, the proposed method can be solved efficiently by coordinate descent optimization. Experiments on four challenging datasets, including VIPeR, PRID450S, QMUL GRID, and CUHK01, show that the proposed method significantly outperforms state-of-the-art approaches.

Introduction

Person re-identification is the task of matching individuals across disjoint camera views over distributed spaces, which plays an important role in intelligent video surveillance. Although it is assumed that people do not change clothes in different camera views, person re-identification still remains a challenging problem due to large appearance variations caused by illumination, pose, viewpoint, and occlusion.

Great efforts have been devoted for years to tackling person re-identification along two directions. One is to design visual descriptors that are robust against cross-view variations, and the other is to learn a discriminant similarity/distance function to determine whether an image pair belongs to the same person or not. For visual descriptors, a number of feature representations have been proposed, such as Symmetry-Driven Accumulation of Local Features (SDALF) [1], Mid-level Filter (MLF) [2], Biologically Inspired Features (BIF) [3], Salient Color Names (SCN) [4], Local Maximal Occurrence (LOMO) [5], and the Gaussian of Gaussian (GOG) descriptor [6]. Most of them are extracted from either horizontal stripes or dense blocks. Although impressive advances have been made, designing a more robust yet discriminative descriptor remains an open problem.

As for similarity/distance function learning, a number of metric learning algorithms have been devised [5], [7], [8], [9], [10], [11], [12], [13], [14]. Some of them, like [10], [11], [13], [14], focus on learning a Mahalanobis distance metric from distance constraints, while others, like [7], [12], seek a bilinear similarity metric by exploiting the angle information between instances in high-dimensional feature space. However, most of these works fail to exploit the complementary effects of combining the two views, and considering only one of them may lead to a less discriminative similarity measure.

In this paper, we propose an efficient feature representation termed enhanced Local Maximal Occurrence (eLOMO) and a Generalized Similarity Metric Learning (GSML) method for person re-identification. The eLOMO integrates a new overlapping-stripe-based descriptor with the existing LOMO [5] feature. The stripe-based descriptor better exploits coarse appearance information from larger regions, while LOMO is good at capturing the fine details of dense blocks; their fusion therefore yields a coarse-to-fine representation in line with the human recognition mechanism. To learn a discriminant similarity function, we combine the Mahalanobis distance and the bilinear similarity, so that the distance and angle information of the training data are exploited simultaneously. The proposed method is formulated as a logistic metric learning problem with Positive Semi-definite (PSD) constraints, and we derive an efficient coordinate descent algorithm to solve it based on the Accelerated Proximal Gradient (APG) optimization method. To suppress the large intra-class variations of cross-view appearances, we project samples into the intra-class subspace before learning. The pipeline of the proposed method is shown in Fig. 1.
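As a concrete illustration, such a combined measure can be thought of as a bilinear similarity term minus a squared Mahalanobis distance term. The following is a minimal NumPy sketch of that form, assuming hypothetical matrices M (PSD, for the distance term) and W (for the bilinear term); it is not the paper's exact objective or its learned solution.

```python
import numpy as np

def generalized_similarity(x, z, M, W):
    """Combined measure: bilinear similarity minus squared Mahalanobis distance.

    M is assumed positive semi-definite (distance term); W carries the
    angle/bilinear information. Illustrative form only.
    """
    diff = x - z
    return x @ W @ z - diff @ M @ diff

# Toy example in 3-D: identity M and W reduce the measure to
# inner product minus squared Euclidean distance.
rng = np.random.default_rng(0)
x = rng.normal(size=3)
z = x + 0.1 * rng.normal(size=3)   # near-duplicate of x (a positive pair)
M = np.eye(3)
W = np.eye(3)
score_pos = generalized_similarity(x, z, M, W)
score_neg = generalized_similarity(x, -z, M, W)  # dissimilar pair
print(score_pos > score_neg)       # the positive pair scores higher
```

With identity matrices the measure degenerates to a standard inner product penalized by Euclidean distance; the point of GSML is to learn M and W jointly so that both terms become discriminative.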

We conduct extensive experiments to validate the efficacy of the proposed method. Experimental results show that the proposed method achieves significant improvements over existing approaches on four challenging person re-identification datasets, namely VIPeR [15], PRID450S [16], QMUL GRID [17], and CUHK01 [2].

The rest of this paper is organized as follows. In Section 2 we briefly review the related works and discuss their differences with our method. Section 3 introduces the eLOMO feature representation. Section 4 presents the GSML in detail. The experimental results and the analysis of our method are reported in Section 5. Finally, we draw some conclusions in Section 6.

Section snippets

Related work

Given one probe image containing an individual of interest, the task of person re-identification is to find its true match (or usually the best match) from a large number of gallery images. Existing works for solving this problem generally follow a two-step paradigm. Firstly, a robust and distinctive feature representation is extracted for every pedestrian image. Secondly, the similarity/distance for each probe-gallery image pair is measured by a certain metric, which is then used to rank the …

Enhanced local maximal occurrence representation

Similar to the coarse-to-fine recognition mechanism of the human visual system, a discriminative feature representation for visual learning should take both fine details and holistic appearance information into consideration. The advantage is that they can work cooperatively to capture the invariance of pedestrian appearance across camera views, which greatly helps to identify the target of interest. Although some descriptors like LOMO and GOG have considered computing …
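To make the stripe-based side of such a representation concrete, the sketch below computes a gray-level histogram over overlapping horizontal stripes of a pedestrian image. The parameters (`n_stripes`, `overlap`, `n_bins`) are hypothetical illustrations of the overlapping-stripe idea; the actual eLOMO descriptor uses richer color and texture channels and maximal-occurrence pooling, which are not reproduced here.

```python
import numpy as np

def overlapping_stripe_histograms(image, n_stripes=6, overlap=0.5, n_bins=16):
    """Gray-level histogram per overlapping horizontal stripe.

    `image` is a 2-D array of pixel values in [0, 256). Hypothetical
    parameters; a sketch of the overlapping-stripe idea, not eLOMO itself.
    """
    h = image.shape[0]
    # Stripe height chosen so that n_stripes stripes with the given
    # overlap roughly cover the image height.
    stripe_h = int(h / (n_stripes - (n_stripes - 1) * overlap))
    step = max(int(stripe_h * (1 - overlap)), 1)
    feats = []
    for top in range(0, h - stripe_h + 1, step):
        stripe = image[top:top + stripe_h]
        hist, _ = np.histogram(stripe, bins=n_bins, range=(0, 256))
        feats.append(hist / hist.sum())   # L1-normalize each stripe
    return np.concatenate(feats)
```

Because adjacent stripes share half their rows, appearance cues near stripe boundaries contribute to two local descriptors instead of being split between them, which is the motivation for overlapping rather than disjoint stripes.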

Problem formulation

Let {X, Z, Y} be a given cross-view training set, where X ∈ ℝ^{d×n} and Z ∈ ℝ^{d×m} are the feature matrices of the probe set and the gallery set, with n and m samples in a d-dimensional feature space, respectively; Y ∈ ℝ^{n×m} is the matching label matrix between X and Z, with y_ij = 1 if (x_i, z_j) is a positive pair (i.e., x_i and z_j represent the same person), and y_ij = −1 otherwise. The re-identification task is to learn a similarity function f(x_i, z_j) to measure the similarity between each pair {(x_i, z_j)}_{i,j=1}^{n,m}.
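The matching label matrix Y follows directly from the person identities of the probe and gallery images. A minimal sketch, using a hypothetical helper `matching_labels` introduced only for illustration:

```python
import numpy as np

def matching_labels(probe_ids, gallery_ids):
    """Build the label matrix Y with y_ij = +1 when probe i and gallery j
    share the same person identity, and y_ij = -1 otherwise.

    `matching_labels` is a hypothetical helper for illustration only.
    """
    p = np.asarray(probe_ids)[:, None]    # shape (n, 1)
    g = np.asarray(gallery_ids)[None, :]  # shape (1, m)
    return np.where(p == g, 1, -1)        # broadcast comparison to (n, m)

# Three probe identities matched against four gallery images.
Y = matching_labels([1, 2, 3], [3, 1, 2, 1])
print(Y)
```

Each row of Y corresponds to one probe image, so positive pairs are sparse: in the single-shot setting every row contains exactly one +1 entry.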

Experiments

We evaluate the proposed method on four widely used person re-identification datasets: VIPeR [15], PRID450S [16], QMUL GRID [17], and CUHK01 [2]. Fig. 5 shows some image pairs randomly selected from these datasets. Performance is evaluated by the Cumulative Matching Characteristic (CMC) curve, which represents the expectation of finding the right match within the top r ranks. To obtain robust performance for comparison, we repeat the experimental procedure 10 times with random …
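The CMC curve itself is straightforward to compute from a probe-gallery score matrix. A minimal single-shot sketch (assuming exactly one true match per probe), with toy inputs that are not from any of the datasets above:

```python
import numpy as np

def cmc_curve(scores, labels):
    """Cumulative Matching Characteristic from a probe-gallery score matrix.

    scores : (n_probe, n_gallery) similarities, higher means more similar.
    labels : (n_probe, n_gallery) booleans, True where identities match.
    Returns cmc where cmc[r-1] is the fraction of probes whose true match
    appears within the top r ranks (single-shot: one true match per probe).
    """
    n_probe, n_gallery = scores.shape
    ranks = np.empty(n_probe, dtype=int)
    for i in range(n_probe):
        order = np.argsort(-scores[i])                 # best match first
        ranks[i] = np.flatnonzero(labels[i][order])[0]  # rank of true match
    return np.cumsum(np.bincount(ranks, minlength=n_gallery)) / n_probe

# Toy example: two probes against a three-image gallery.
scores = np.array([[0.9, 0.2, 0.1],
                   [0.3, 0.8, 0.4]])
labels = np.array([[True, False, False],
                   [False, False, True]])
print(cmc_curve(scores, labels))   # [0.5, 1.0, 1.0]
```

Here the first probe's true match is ranked first and the second probe's is ranked second, giving a rank-1 rate of 50% and a rank-2 rate of 100%.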

Conclusion

In this paper, we have proposed a discriminative and robust feature representation termed eLOMO, and an effective metric learning method called GSML for person re-identification. The eLOMO fuses the features extracted from both horizontal stripes and dense blocks, such that the fine details and holistic appearance information can be integrated together to enhance the discrimination. The proposed GSML jointly learns a Mahalanobis distance metric and a bilinear similarity metric to simultaneously …

Acknowledgment

This work was partially supported by the National Natural Science Foundation of China (NSFC Grant Nos. 61773272, 61272258, 61301299, 61572085, 61170124, 61272005), the Provincial Natural Science Foundation of Jiangsu (Grant Nos. BK20151254, BK20151260), the Science and Education Innovation based Cloud Data Fusion Foundation of the Science and Technology Development Center of the Ministry of Education (2017B03112), the Six Talent Peaks Project in Jiangsu Province (DZXX-027), the Key Laboratory of Symbolic Computation and …

References (58)

  • B. Ma et al.

    Covariance descriptor based on bio-inspired features for person re-identification and face verification

    Image Vis. Comput.

    (2014)
  • A. Bedagkar-Gala et al.

    A survey of approaches and trends in person re-identification

    Image Vis. Comput.

    (2014)
  • M. Farenzena et al.

    Person re-identification by symmetry-driven accumulation of local features

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2010)
  • R. Zhao et al.

    Learning mid-level filters for person re-identification

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2014)
  • Y. Yang et al.

    Salient color names for person re-identification

    Proceedings of the European Conference on Computer Vision

    (2014)
  • S. Liao et al.

    Person re-identification by local maximal occurrence representation and metric learning

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2015)
  • T. Matsukawa et al.

    Hierarchical Gaussian descriptor for person re-identification

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2016)
  • J. Chen et al.

    Relevance metric learning for person re-identification by exploiting listwise similarities

    IEEE Trans. Image Process.

    (2015)
  • M. Hirzer et al.

    Relaxed pairwise learned metric for person re-identification

    Proceedings of the European Conference on Computer Vision

    (2012)
  • M. Köstinger et al.

    Large scale metric learning from equivalence constraints

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2012)
  • S. Liao et al.

    Efficient PSD constrained asymmetric metric learning for person re-identification

    Proceedings of the IEEE International Conference on Computer Vision

    (2015)
  • A. Mignon et al.

    PCCA: a new approach for distance learning from sparse pairwise constraints

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2012)
  • H.V. Nguyen et al.

    Cosine similarity metric learning for face verification

    Proceedings of the Asian Conference on Computer Vision

    (2010)
  • K.Q. Weinberger et al.

    Distance metric learning for large margin nearest neighbor classification

    J. Mach. Learn. Res.

    (2009)
  • W.-S. Zheng et al.

    Reidentification by relative distance comparison

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2013)
  • D. Gray et al.

    Viewpoint invariant pedestrian recognition with an ensemble of localized features

    Proceedings of the European Conference on Computer Vision

    (2008)
  • P.M. Roth et al.

    Mahalanobis Distance Learning for Person Re-identification

    (2014)
  • C.C. Loy et al.

    Multi-camera activity correlation analysis

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2009)
  • G. Doretto et al.

    Appearance-based person reidentification in camera networks: problem overview and current approaches

    J. Ambient Intell. Humaniz. Comput.

    (2011)
  • S. Gong et al.

    Person Re-identification

    (2014)
  • L. Zheng, Y. Yang, A.G. Hauptmann, Person re-identification: past, present and future, arXiv:1610.02984...
  • Y.C. Chen et al.

    Mirror representation for modeling view-specific transform in person re-identification

    Proceedings of the International Conference on Artificial Intelligence

    (2015)
  • G. Lisanti et al.

    Person re-identification by iterative re-weighted sparse ranking

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2015)
  • Z. Li et al.

    Learning locally-adaptive decision functions for person verification

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2013)
  • R. Zhao et al.

    Person re-identification by salience matching

    Proceedings of the IEEE International Conference on Computer Vision

    (2013)
  • S.Z. Chen et al.

    Deep ranking for person re-identification via joint representation learning

    IEEE Trans. Image Process.

    (2016)
  • T. Xiao et al.

    Learning deep feature representations with domain guided dropout for person re-identification

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2016)
  • Y.C. Chen, X. Zhu, W.S. Zheng, J.H. Lai, Person re-identification by camera correlation aware feature augmentation,...
  • C. Sun et al.

    Person re-identification via distance metric learning with latent variables

    IEEE Trans. Image Process.

    (2017)

    Husheng Dong received his M.S. degree from the School of Computer Science & Technology, Soochow University, in 2008, and is now pursuing the Ph.D. degree. He is also a teacher at the Suzhou Institute of Trade & Commerce. His research interests include computer vision, image and video processing, and machine learning.

    Ping Lu received her B.Eng. and M.S. degrees from the School of Computer Science and Technology, Soochow University, in 2002 and 2005, respectively. She is an associate professor at the Suzhou Institute of Trade & Commerce. Her research interests include digital image processing and pattern recognition.

    Shan Zhong received her M.S. and Ph.D. degrees from Jiang University (2007) and Soochow University (2017), respectively. She is now a teacher at the Changshu Institute of Technology. Her research interests include machine learning and deep learning.

    Chunping Liu received her Ph.D. degree in pattern recognition and artificial intelligence from Nanjing University of Science & Technology in 2002. She is now a professor in the School of Computer Science & Technology, Soochow University. Her research interests include computer vision, image analysis and recognition, in particular in the domains of visual saliency detection, object detection and recognition, and scene understanding.

    Yi Ji received her M.S. degree from the National University of Singapore and her Ph.D. degree from INSA de Lyon, France. She is now an associate professor in the School of Computer Science & Technology of Soochow University. Her research areas are 3D action recognition and complex scene understanding.

    Shengrong Gong received his M.S. degree from Harbin Institute of Technology in 1993, and his Ph.D. degree from Beihang University in 2001. He is the dean of School of Computer Science and Engineering, Changshu Institute of Technology, and a professor and doctoral supervisor of School of Computer Science & Technology, Soochow University. His research interests include image and video processing, pattern recognition, and computer vision.
