Elsevier

Neurocomputing

Volume 347, 28 June 2019, Pages 109-118

Improving person re-identification by multi-task learning

https://doi.org/10.1016/j.neucom.2019.01.027

Abstract

For person re-identification (re-ID), the core task is to learn effective representations of a person image. Since multi-task learning can achieve strong performance in seeking robust features, we propose a novel Multi-Task Learning Network (MTNet) with four different losses for re-ID. MTNet is an end-to-end deep learning framework in which all parameters and losses are jointly optimized. Our method combines two tasks closely related to person re-identification, the pedestrian identity task and the pedestrian attribute task, which provide complementary information from different perspectives by integrating multiple contexts. Attributes focus on specific aspects of a person, while identity pays more attention to overall contour and appearance. Meanwhile, both classification and verification losses are employed to optimize the distances between samples: identification losses construct a large class space, while verification losses refine that space by minimizing the distance between similar images and maximizing the distance between dissimilar ones. In the experiments, MTNet achieves state-of-the-art results on two typical datasets, Market1501 [1] and DukeMTMC-reID [2].

Introduction

Person re-identification is of potential significance in security applications. It is usually treated as an image retrieval problem, which matches a person across different cameras and ranks the gallery images according to their similarities. Existing methods mainly focus on extracting robust representations [3], [4], [5], [6], [7], [8], [9], [10], [11] and learning matching functions or metrics [5], [12], [13], [14], [15], [16], [17]. Given its excellent performance on computer vision tasks, deep learning has also been adopted by the re-ID community [6], [18], [19], [20], [21] and has produced promising results.

This paper aims at learning robust representations and improving the performance of person re-ID on large-scale datasets. Both person identity and attributes are essential information in surveillance videos. In general, the identity feature can be regarded as a global feature that depends on the overall contour and appearance, while attributes describe features related only to certain aspects of a pedestrian. Identity features carry overall information, whereas attribute features focus on details; hence identity features are more effective for re-ID, and attribute features are more effective for attribute recognition. As mentioned in [22], multi-task learning can obtain more robust features, and these two tasks complement each other. As shown in Fig. 1, the identity task and the attribute task can help each other achieve better results. For example, the classification system fails on the first query example because of similar appearance, such as the yellow upper clothing and black shorts. Nevertheless, with the gender attribute, the boy in yellow clothing is excluded.

To our knowledge, there are three kinds of deep learning methods for person re-identification. The classification model [1], [6], [19] aims to distinguish different samples. In these methods, images from different categories may lie very close to each other in the feature space, which makes it difficult and challenging to correctly identify new samples or new identities. The verification model [23], [24] is therefore proposed to minimize the distance between similar images and maximize the distance between dissimilar ones. However, verification models are weak at expanding the feature space. Besides, a number of previous works [18], [25] treat person re-identification as a binary classification task or a similarity regression task: given an input pair of images, the network determines whether the two images show the same person.
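The complementary roles of the two loss families can be illustrated with a minimal NumPy sketch (the function names and the contrastive margin are illustrative assumptions, not the paper's exact formulation): a softmax cross-entropy identification loss builds the class space, while a contrastive-style verification loss shapes pairwise distances.

```python
import numpy as np

def identification_loss(logits, label):
    # Softmax cross-entropy: pushes each image toward its identity class,
    # carving out a large class space.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def verification_loss(f1, f2, same, margin=1.0):
    # Contrastive-style loss: pulls features of the same person together
    # and pushes different persons at least `margin` apart.
    d = np.linalg.norm(f1 - f2)
    return 0.5 * d**2 if same else 0.5 * max(0.0, margin - d)**2
```

Note that the verification loss vanishes for a dissimilar pair already separated by more than the margin, which is exactly why it cannot expand the feature space on its own.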

In our paper, MTNet takes advantage of the identity and attribute tasks by combining the classification and verification approaches. In our framework, verification is embedded in the attribute branch, which compensates, to a certain extent, for the shortcomings of a plain attribute identification model. As shown in Fig. 1, we define an identity label and a set of attribute labels for every pedestrian (the second boxes). Based on these labels, we use four independent branches to train a robust multi-task network (the third boxes), which then performs both the person re-identification task and the person attribute recognition task (the fourth boxes).

Our main contributions are summarized as follows:

  • We propose a multi-task learning network (MTNet). It simultaneously learns an end-to-end CNN embedding for person re-ID and an attribute prediction model. As shown in Fig. 2, four different tasks are integrated into MTNet: identity classification, identity verification, attribute classification and attribute verification.

  • We employ a verification loss for the attribute task, which to our knowledge is the first time this has been done in the field of person re-ID. The attribute verification task not only assists the attribute classification task but also promotes the identity verification task.

  • We achieve state-of-the-art results on two large-scale person re-ID datasets, Market1501 [1] and DukeMTMC-reID [2].
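The joint objective behind the four branches can be sketched as a weighted sum of the four losses. This is a minimal NumPy sketch under stated assumptions: the loss weights, the contrastive margin, and the use of one softmax per attribute are illustrative choices, since the section snippets do not give the exact weighting.

```python
import numpy as np

def cross_entropy(logits, label):
    # Numerically stable softmax cross-entropy for one sample.
    z = logits - logits.max()
    return -(z - np.log(np.exp(z).sum()))[label]

def contrastive(f1, f2, same, margin=1.0):
    # Verification loss on a feature pair.
    d = np.linalg.norm(f1 - f2)
    return 0.5 * d**2 if same else 0.5 * max(0.0, margin - d)**2

def mtnet_loss(id_logits, id_label, attr_logits, attr_labels,
               id_pair, attr_pair, same_id, same_attr,
               lambdas=(1.0, 1.0, 1.0, 1.0)):
    # Four jointly optimized sub-task losses (equal weights assumed).
    l_idc = cross_entropy(id_logits, id_label)            # identity classification
    l_idv = contrastive(*id_pair, same_id)                # identity verification
    l_atc = sum(cross_entropy(lg, lb)                     # attribute classification
                for lg, lb in zip(attr_logits, attr_labels))
    l_atv = contrastive(*attr_pair, same_attr)            # attribute verification
    return sum(w * l for w, l in zip(lambdas, (l_idc, l_idv, l_atc, l_atv)))
```

Because all four terms are differentiable functions of the shared CNN features, the whole network can be trained end-to-end with a single backward pass, as the abstract states.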


Related work

This section briefly reviews several closely related lines of work: classification-based methods, verification-based methods and attribute-based methods.

Approach

In this section, we introduce our whole network architecture and the definition of the multiple tasks.

Datasets and evaluation metrics

The Market1501 dataset [1] is a large-scale person re-ID dataset containing 32,668 gallery images and 3368 query images captured by 6 cameras. Each annotated identity appears in at least two cameras, so that cross-camera search can be performed. The images are automatically detected by the deformable part model (DPM) instead of hand-drawn bounding boxes, which is closer to a realistic setting. Following the default setting, there are 12,936 cropped images of 751 identities for training.
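The standard evaluation metrics for these datasets, cumulative matching characteristic (CMC) and mean average precision (mAP), can be computed as in this minimal single-query sketch. Note this is a simplification: the official protocol also filters out gallery images from the same camera as the query, which is omitted here for brevity.

```python
import numpy as np

def evaluate(dist, q_ids, g_ids):
    """Single-query CMC rank-1 and mAP from a query-by-gallery
    distance matrix (rows: queries, columns: gallery)."""
    aps, rank1 = [], 0
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                       # rank gallery by distance
        matches = (g_ids[order] == q_ids[i]).astype(float)
        if matches[0] == 1:                               # correct match at rank 1
            rank1 += 1
        hits = np.cumsum(matches)
        precision = hits / (np.arange(len(matches)) + 1)  # precision at each rank
        aps.append((precision * matches).sum() / max(matches.sum(), 1))
    return rank1 / dist.shape[0], float(np.mean(aps))
```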

Conclusions

In this paper, we propose a novel multi-task learning network for person re-identification that learns multiple kinds of complementary information end-to-end. The four different sub-tasks mutually benefit from the multi-task learning procedure. The network includes an identity classification loss, an identity verification loss, an attribute classification loss and an attribute verification loss. We introduce the attribute losses to improve the detailed discriminative ability of the learned features.

Acknowledgments

This work was supported in part by the Natural Science Foundation of China under Grants U1536203 and 61672254, in part by the National Key Research and Development Program of China (2016QY01W0200), and in part by the Major Scientific and Technological Project of Hubei Province (2018AAA068).

Hefei Ling obtained the B.S., M.S., and Ph.D. degrees from Huazhong University of Science and Technology (HUST), China, in 1999, 2002 and 2005, respectively. He is currently a professor in the School of Computer Science and Technology, HUST. Prof. Ling was a visiting professor at University College London from 2008 to 2009. He has published more than 100 papers. He now serves as director of the Digital Media and Intelligent Technology Research Institute.

References (52)

  • B. Ma et al., Covariance descriptor based on bio-inspired features for person re-identification and face verification, IVC (2014)
  • R. Layne et al., Person re-identification by attributes, Proceedings of the BMVC (2014)
  • L. Zheng et al., Scalable person re-identification: a benchmark, Proceedings of the ICCV (2015)
  • Z. Zheng et al., Unlabeled samples generated by GAN improve the person re-identification baseline in vitro, Proceedings of the ICCV (2017)
  • Y. Yang et al., Salient color names for person re-identification, Proceedings of the ECCV (2014)
  • S. Liao et al., Person re-identification by local maximal occurrence representation and metric learning, Proceedings of the CVPR (2015)
  • Y. Sun et al., SVDNet for pedestrian retrieval, Proceedings of the ICCV (2017)
  • X.-Y. Jing et al., Face and palmprint pixel level fusion and Kernel DCV-RBF classifier for small sample biometric recognition, Pattern Recognit. (2007)
  • X.-Y. Jing et al., A face and palmprint recognition approach based on discriminant DCT feature extraction, IEEE Trans. Syst. Man Cybern. Part B (Cybern.) (2004)
  • X. Zhu et al., Video-based person re-identification by simultaneously learning intra-video and inter-video distance metrics, IEEE Trans. Image Process. (2018)
  • X. Zhu et al., Image to video person re-identification by learning heterogeneous dictionary pair with feature projection matrix, IEEE Trans. Inf. Forensics Secur. (2018)
  • X.-Y. Jing et al., Super-resolution person re-identification with semi-coupled low-rank discriminant dictionary learning, IEEE Trans. Image Process. (2017)
  • W.S. Zheng et al., Person re-identification by probabilistic relative distance comparison, Proceedings of the CVPR (2011)
  • M. Köstinger et al., Large scale metric learning from equivalence constraints, Proceedings of the CVPR (2012)
  • Z. Li et al., Learning locally-adaptive decision functions for person verification, Proceedings of the CVPR (2013)
  • G. Lisanti et al., Matching people across camera views using Kernel canonical correlation analysis, Proceedings of the ICDSC (2014)
  • S. Liao et al., Efficient PSD constrained asymmetric metric learning for person re-identification, Proceedings of the ICCV (2015)
  • Y. Shen et al., Person re-identification with correspondence structure learning, Proceedings of the ICCV (2015)
  • E. Ahmed et al., An improved deep learning architecture for person re-identification, Proceedings of the CVPR (2015)
  • T. Xiao et al., Learning deep feature representations with domain guided dropout for person re-identification, Proceedings of the CVPR (2016)
  • F. Wang et al., Joint learning of single-image and cross-image representations for person re-identification, Proceedings of the CVPR (2016)
  • R.R. Varior et al., Gated siamese convolutional neural network architecture for human re-identification, Proceedings of the ECCV (2016)
  • S. Ruder, An overview of multi-task learning in deep neural networks, in: ...
  • H. Liu et al., Deep supervised hashing for fast image retrieval, Proceedings of the CVPR (2016)
  • A. Hermans, L. Beyer, B. Leibe, In defense of the triplet loss for person re-identification, in: ...
  • W. Li et al., DeepReID: deep filter pairing neural network for person re-identification, Proceedings of the CVPR (2014)

Ziyang Wang received the B.E. degree in Software Engineering from Shandong University, Weihai, China, in 2016. He is currently pursuing the M.S. degree at Huazhong University of Science and Technology, Wuhan, China. His research interests include computer vision and multimedia data analysis, such as large-scale multimedia indexing and retrieval.

Ping Li is a lecturer in the School of Computer Science and Technology, Huazhong University of Science and Technology (HUST). He received his Ph.D. degree in Computer Application from HUST in 2009. His research interests include multimedia security, image retrieval and machine learning.

Yuxuan Shi is currently a Ph.D. student in the School of Computer Science and Technology at Huazhong University of Science and Technology. He received the B.S. degree in electronic information engineering from Wuhan University of Science and Technology, Wuhan, China, and his M.S. in traffic information engineering and control from Wuhan University of Technology. His research interests include computer vision and multimedia data analysis, such as large-scale multimedia indexing and retrieval.

Jiazhong Chen received his M.S. and Ph.D. degrees from Huazhong University of Science and Technology (HUST), Wuhan, China, in 1999 and 2003. He is currently an associate professor in the School of Computer Science and Technology, HUST. His current research interests include computer vision, image processing, machine learning, and multimedia communications.
