Cross-level reinforced attention network for person re-identification

https://doi.org/10.1016/j.jvcir.2020.102775

Highlights

  • Fusing features of different levels lays the foundation for generating a more discriminative attention map.

  • Combining hard attention and soft attention reduces the influence of interfering information on the model.

  • The improved attention module operates in both the spatial and channel domains to achieve more comprehensive weight adjustment.

  • The lightweight person re-identification network achieves outstanding performance.

Abstract

The attention mechanism is a simple and effective way to enhance the discriminative performance of person re-identification (Re-ID). Most previous attention-based works have difficulty eliminating the negative effects of meaningless information. In this paper, a universal module, named Cross-level Reinforced Attention (CLRA), is proposed to alleviate this issue. First, we fuse features of different semantic levels using adaptive weights. The fused features, containing richer spatial and semantic information, can better guide the generation of the subsequent attention map. Then, we combine hard and soft attention to improve the ability to extract important information in the spatial and channel domains. Through the CLRA, the network can aggregate and propagate more discriminative semantic information. Finally, we integrate the CLRA with the Harmonious Attention CNN (HA-CNN) to form a novel Cross-level Reinforced Attention CNN (CLRA-CNN) for person Re-ID. Experimental results on several public benchmarks show that the proposed method achieves state-of-the-art performance.

Introduction

Person re-identification (Re-ID) aims to find a specific pedestrian across non-overlapping surveillance cameras deployed in different locations [1], [2], [3], [4]. It has important applications in the field of video surveillance. However, accurately distinguishing specific targets across different surveillance scenarios remains challenging. Pedestrian images often suffer from background clutter, changing illumination, severe occlusion and large pose variation [5], [6], [7], [8], all of which strongly affect pedestrian matching. Numerous works have been proposed to relieve this dilemma; among them, attention-based methods demonstrate convincing performance.

The attention mechanism provides an effective way to enhance Re-ID networks. According to how attention is computed, attention mechanisms are mainly divided into two types: soft attention and hard attention. Existing methods are usually based on soft attention [1], [9], [10], [11], [12], which aims to find the probability distribution over a series of elements and produce an attention map for the input features. Each position of the input features is assigned a probability, and the resulting distribution reflects the importance of different elements. All the information is reweighted adaptively before being aggregated. In this way, we can extract discriminative information and suppress interference from unimportant information, thereby improving the model. In contrast to the extensive research on soft attention, only a few works build on hard attention. Hard attention processes only the information that is considered most relevant and completely ignores the other parts. Using hard attention, we can not only eliminate the interference of irrelevant information but also achieve higher computational efficiency [9], [13].
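The contrast between the two attention types can be illustrated with a minimal, framework-free sketch on a 1-D feature vector. This is not the paper's implementation; the attention scores are assumed to be produced by some learned scoring function:

```python
import math

def soft_attention(features, scores):
    # Soft attention: every position keeps a weight from a probability
    # distribution (softmax over the scores), so nothing is discarded.
    exp_scores = [math.exp(s) for s in scores]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]
    return [w * f for w, f in zip(weights, features)]

def hard_attention(features, scores, k):
    # Hard attention: keep only the k highest-scoring positions and
    # completely ignore the rest (a discrete, non-differentiable choice).
    kept = set(sorted(range(len(scores)), key=lambda i: scores[i])[-k:])
    return [f if i in kept else 0.0 for i, f in enumerate(features)]
```

The soft output preserves every position at reduced magnitude, while the hard output zeros all but the top-k positions, which is why the latter cannot be trained directly by gradient descent.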

According to the attention domain, attention mechanisms can also be divided into spatial attention [14] and channel attention [1], [10]. As the name implies, spatial attention models the interdependence of different spatial positions. By introducing spatial attention, the convolution layer can sense the full spatial distribution of the input. Similarly, channel attention models the relationships between different channels [10]. Each channel map of high-level features can be viewed as a response to a specific class, and different semantic responses are correlated. The interdependence between channel maps can be exploited to improve the representation of specific semantic features.
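The channel-attention idea can be sketched in squeeze-and-excitation style [10]. As a simplifying assumption for illustration, a softmax over the pooled channel descriptors stands in for the small gating network that is learned in practice:

```python
import math

def channel_attention(feature_maps):
    # feature_maps: C channel maps, each an H x W grid (list of lists).
    # Squeeze: global average pooling yields one descriptor per channel.
    descriptors = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                   for ch in feature_maps]
    # Excite: a softmax over channels stands in for the learned gating MLP.
    exp_d = [math.exp(d) for d in descriptors]
    total = sum(exp_d)
    weights = [e / total for e in exp_d]
    # Reweight: scale every channel map by its attention weight.
    return [[[w * v for v in row] for row in ch]
            for w, ch in zip(weights, feature_maps)]
```

Spatial attention follows the same reweighting pattern, except the probability distribution is computed over the H x W positions instead of over the C channels.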

Although many attention-based works have made great progress, some defects remain. First, most works [3], [10], [11], [12], [14], [15], [16] involve only a single type of attention mechanism. Because of this limitation, they often have difficulty obtaining an optimal weight distribution. Soft attention cannot completely remove interfering information, and confidence assignment remains a challenge. Meanwhile, hard attention selects features in a discrete way, so features cannot be chosen with gradient-based learning methods. On the other hand, some works consider only spatial attention or only channel attention and therefore cannot make full use of the available information.

To address these issues, we design a Cross-level Reinforced Attention (CLRA) module. It mainly consists of a Cross-level Feature Fusion (CLFF) module and a Reinforced Attention (RA) module. Low-level features contain more abundant spatial information, while the semantic information of high-level features is more representative [17]. The CLFF fuses features of different levels, enabling the subsequent attention module to capture the dependencies between features more accurately. The RA fuses soft attention with hard attention to exploit their respective strengths, reinforcing attention in both the spatial and channel domains. In this work, hard attention helps soft attention select features, enhancing the weight-distribution effect of soft attention; meanwhile, soft attention provides a stable environment and learning signal for training hard attention. This design improves both the prediction quality of the soft attention mechanism and the trainability of the hard attention mechanism. Using CLRA, the network can focus on more discriminative pedestrian features and alleviate interference from irrelevant factors. The contributions of this work are:

  • (1)

    We design a Cross-level Feature Fusion (CLFF) module, which aims to generate more discriminative features by combining the features of different levels. This module lays the foundation for generating a more discriminative attention map.

  • (2)

    We design a Reinforced Attention (RA) module with two parts: Reinforced Spatial Attention (RSA) and Reinforced Channel Attention (RCA). It combines the characteristics of soft and hard attention, using hard attention to help strengthen soft attention.

  • (3)

    We combine the CLFF and RA to form a Cross-level Reinforced Attention (CLRA) module and integrate it into HA-CNN [1] to optimize person Re-ID. We validate the proposed method on several common datasets: Market-1501 [18], DukeMTMC-ReID [19], CUHK03 [20] and MSMT17 [21]. The results verify that our method achieves excellent performance.
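The cross-level fusion in contribution (1) can be reduced to a toy sketch: mix same-shape low- and high-level feature vectors with an adaptive weight. The scalar `alpha` is assumed here to be given; in the real network the fusion weights are learned, and the actual CLFF is more elaborate than this:

```python
def cross_level_fusion(low_level, high_level, alpha):
    # Elementwise fusion of same-shape low- and high-level feature vectors.
    # alpha in [0, 1] is an adaptive weight (learned in the real network);
    # the fused feature mixes spatial detail (low) with semantics (high).
    assert len(low_level) == len(high_level) and 0.0 <= alpha <= 1.0
    return [alpha * l + (1.0 - alpha) * h
            for l, h in zip(low_level, high_level)]
```

The point of making `alpha` adaptive rather than fixed is that the network can decide, per fusion site, how much spatial detail versus semantic abstraction the subsequent attention map should be built from.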

Section snippets

Related work

One foundation of person Re-ID is capturing long-range relationships between features. Many existing works model this relationship by increasing the depth of the network, which is relatively costly. Consequently, more and more attention-based methods have been proposed to optimize Re-ID. The attention mechanism can model long-range dependencies and has been widely used in natural language processing (NLP) in recent years [15], [22], [23]. In particular, Vaswani et al. [12]

Overview

The proposed CLRA module mainly consists of two parts: Cross-level Feature Fusion (CLFF) and Reinforced Attention (RA). By merging features of different levels, the CLFF obtains new features rich in both semantic and spatial information, which better guide the generation of the subsequent attention map. The RA is composed of Reinforced Spatial Attention (RSA) and Reinforced Channel Attention (RCA). Given that RSA and RCA are largely independent, we use two branches to
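One plausible reading of "hard attention helps soft attention select features" can be sketched as a two-stage operator: a hard top-k mask first narrows the candidate set, then a softmax reweights only the survivors. This staging is an illustrative assumption, not the paper's exact RSA/RCA formulation:

```python
import math

def reinforced_attention(features, scores, k):
    # Stage 1 (hard): keep only the k highest-scoring positions.
    kept = sorted(range(len(scores)), key=lambda i: scores[i])[-k:]
    # Stage 2 (soft): a softmax over the surviving scores reweights them,
    # so interference pruned in stage 1 never dilutes the distribution.
    exp_kept = {i: math.exp(scores[i]) for i in kept}
    total = sum(exp_kept.values())
    return [features[i] * exp_kept[i] / total if i in exp_kept else 0.0
            for i in range(len(features))]
```

Applied over spatial positions this corresponds to the RSA branch, and applied over channels to the RCA branch; the two branches then run in parallel because their domains are largely independent.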

Experiments

In this section, the performance of CLRA-CNN is compared with some state-of-the-art methods on several common datasets. We use PyTorch to implement the proposed method and run the experiments on three NVIDIA TITAN Xp GPUs. A detailed ablation analysis is then conducted to validate the effectiveness of the CLRA components.

Conclusion

In this work, we propose a novel attention-based network, the Cross-level Reinforced Attention Network (CLRA-CNN), to effectively learn feature representations and attention selection for person re-identification. In contrast to other methods, we fuse features of different levels to guide the generation of the attention module. Meanwhile, we combine soft attention and hard attention to achieve a more reasonable weight distribution. In this way, we can make full use of these

CRediT authorship contribution statement

Min Jiang: Conceptualization, Software, Writing - review & editing, Funding acquisition. Cong Li: Methodology, Software, Writing - original draft. Jun Kong: Supervision, Project administration, Funding acquisition. Zhende Teng: Data curation, Validation. Danfeng Zhuang: Investigation, Data curation.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (61362030, 61201429, 61876072), China Postdoctoral Science Foundation (2015M581720, 2016M600360), Jiangsu Postdoctoral Science Foundation (1601216C), Scientific and Technological Aid Program of Xinjiang (2017E0279).

References (47)

  • W. Zhong et al.

    Discriminative representation learning for person re-identification via multi-loss training

    J. Vis. Commun. Image Represent.

    (2019)
  • W. Li et al.

    Harmonious attention network for person re-identification

  • L. Zheng, Y. Yang, A.G. Hauptmann, Person re-identification: Past, present and future, arXiv preprint arXiv:1610.02984...
  • J. Xu et al.

    Attention-aware compositional network for person re-identification

  • R. Zhao et al.

    Person re-identification by saliency learning

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2017)
  • T. Xiao et al.

    Learning deep feature representations with domain guided dropout for person re-identification

  • R.R. Varior et al.

    Gated siamese convolutional neural network architecture for human re-identification

  • L. Zhang et al.

    Learning a discriminative null space for person re-identification

  • M. Malinowski et al.

    Learning visual question answering by bootstrapping hard attention

  • J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE conference on computer vision and...
  • Y. Chen, Y. Kalantidis, J. Li, S. Yan, J. Feng, A²-Nets: Double attention networks, in: Advances in Neural Information...
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you...
  • T. Shen et al.

    Reinforced self-attention network: a hybrid of hard and soft attention for sequence modeling

  • F. Wang et al.

    Residual attention network for image classification

  • J. Fu et al.

    Dual attention network for scene segmentation

  • D. Li et al.

    Learning deep context-aware features over body and latent parts for person re-identification

  • C. Yu et al.

    Learning a discriminative feature network for semantic segmentation

  • L. Zheng et al.

    Scalable person re-identification: A benchmark

  • Z. Zheng et al.

    Unlabeled samples generated by GAN improve the person re-identification baseline in vitro

  • W. Li et al.

    DeepReID: Deep filter pairing neural network for person re-identification

  • L. Wei et al.

    Person transfer GAN to bridge domain gap for person re-identification

  • T. Luong et al.

    Effective approaches to attention-based neural machine translation

  • R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, Y. Wu, Exploring the limits of language modeling, arXiv preprint...
This paper has been recommended for acceptance by Zicheng Liu.
