Neurocomputing

Volume 465, 20 November 2021, Pages 184-194

Graph similarity rectification for person search

https://doi.org/10.1016/j.neucom.2021.08.136

Abstract

In the person search task, it is hard to retrieve query persons that undergo large visual changes. To tackle this problem, we propose to exploit context information to rectify the original individual similarity for better retrieval. To this end, we model a query frame and a gallery frame as a graph pair, and design Siamese Residual Graph Convolutional Networks (SR-GCN) to aggregate context information into a graph similarity that complements the original similarity. To model the relationships between context persons, we define a joint similarity adjacency matrix that assigns the proposed joint similarity as the edge weight, measuring the contribution a context person makes to the aggregation. As a result, a context node that is more likely to be a co-traveler of the target node contributes more to the matching of the target node. To further enhance the discriminative power of individual features, we also design a Random Proxy Center loss, which explicitly constrains the intra-class variations to be smaller than the inter-class variations in the feature space and can make use of unlabeled samples. Experimental results on two public datasets show that our approach performs favorably against state-of-the-art methods.

Introduction

Given a frame containing the query person, the person search task aims to locate the query person among all gallery video frames. Different from person Re-ID [1], which directly compares person image patches cropped by manual annotations or person detectors (e.g. DPM [2] and ACF [3]), person search integrates person detection and person Re-ID into a unified task that is closer to real-world surveillance applications. Generally, existing person search methods can be categorized into one-step and two-step methods. One-step methods jointly handle person detection and Re-ID with an end-to-end model, while two-step methods perform the two subtasks separately. As pointed out by [4], since person detection focuses on the differences between foreground persons and the background, whereas Re-ID focuses on the differences among persons, jointly solving the two subtasks in an end-to-end one-step framework leads to sub-optimal performance. Thus, we adopt the two-step framework for person search in this paper.

Compared with person Re-ID, one advantage of person search is that additional context information can be exploited, since we directly retrieve the target person in the original video frames. The intuition behind this is that individuals appearing together in one frame are likely to appear together in other frames as well [5]. We analyze the quantity distribution of context co-travelers on the two person search datasets. As shown in Fig. 1, many context co-travelers appear together with specific persons in more than one frame. The context information from co-travelers can help to re-identify the query person, especially when the query person undergoes large appearance changes caused by pose and viewpoint variations or occlusion. Fig. 2 presents an example of how context information can be utilized to improve person search. As shown in Fig. 2, if we only consider the individual similarity, we match the query person (within the red bounding-box in the query frame) to the wrong person (within the yellow bounding-box in the negative frame) rather than the right person (within the red bounding-box in the positive frame), because the similarity with the negative candidate (0.4) is higher than that with the positive candidate (0.3) due to the pose change. However, if we take the context persons into consideration, the similarity with the positive candidate can be rectified to be higher than that with the negative candidate, as shown in Fig. 2(c); a small numeric sketch of this rectification follows.
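
The snippet below is a minimal illustration of the rectification idea only; the fusion weight alpha and the graph-similarity values are hypothetical assumptions, not numbers from the paper.

# Minimal sketch of similarity rectification; alpha and the graph-similarity
# values below are illustrative assumptions, not values from the paper.
def rectify(individual_sim, graph_sim, alpha=0.5):
    # Use the context-based graph similarity as a complement of the
    # original individual similarity.
    return (1 - alpha) * individual_sim + alpha * graph_sim

# Fig. 2 example: the negative candidate wins on individual similarity alone
# (0.4 > 0.3) but loses once hypothetical context support is added.
print(rectify(0.3, 0.9))  # positive candidate -> 0.60
print(rectify(0.4, 0.1))  # negative candidate -> 0.25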

Yan et al. [6] also propose to integrate context information for person search. However, the complex distribution of co-travelers is not discussed in [6], which limits the performance of their method. As shown in Fig. 1, the quantity distribution of context co-travelers is nonuniform and varies from dataset to dataset. Besides, only a small number of true matches exist for a query person in the whole gallery set.

To make better use of context co-travelers, we propose Siamese Residual Graph Convolutional Networks (SR-GCN) to learn a graph similarity based on the context co-travelers. Specifically, for a query frame and a gallery frame, we first select the valuable query-gallery pairs and build a graph pair to model the context relationships between these pairs. The graph pair consists of a source graph (for the query frame) and a target graph (for the gallery frame), where the query instance and its best candidate are taken as the center nodes of the two graphs, respectively, and the other instances are context nodes. To deal with the complex distribution of context co-travelers, we determine the number of nodes in the source and target graphs according to the number of instances appearing in the query frame, and propose the joint similarity adjacency matrix to model the relationships between the center node and the context nodes. The joint similarity adjacency matrix assigns the proposed joint similarity as the edge weight to measure the contribution a context node makes to the aggregation. Lastly, to further alleviate the negative impact of noisy nodes, we use the graph similarity learned by SR-GCN as a supplement to rectify the original similarity of the target query-gallery pair rather than directly using it as the final similarity; a hedged sketch of this design is given below. Experimental results demonstrate that the proposed SR-GCN exploits the context instances more effectively for better retrieval results.
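
The following PyTorch-style sketch mimics this aggregation under explicit assumptions: the joint similarity is approximated here by non-negative pairwise cosine similarity of node features, a single shared residual layer is used, and the center node is placed at index 0; none of these details is taken verbatim from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualGCNLayer(nn.Module):
    # One graph-convolution layer with a residual connection, a stand-in for the
    # SR-GCN building block (layer width and normalization are assumptions).
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # Row-normalize the joint-similarity adjacency matrix so each node
        # aggregates a weighted average of its neighbors' features.
        adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return F.relu(x + self.fc(adj @ x))

def joint_similarity_adjacency(feats):
    # Hypothetical stand-in for the joint similarity: non-negative pairwise
    # cosine similarity between the node features.
    f = F.normalize(feats, dim=-1)
    return (f @ f.t()).clamp(min=0)

def graph_similarity(src_feats, tgt_feats, layer):
    # Run the shared (Siamese) layer on both graphs and compare the two center
    # nodes (index 0 by convention here) to obtain the graph similarity.
    src = layer(src_feats, joint_similarity_adjacency(src_feats))
    tgt = layer(tgt_feats, joint_similarity_adjacency(tgt_feats))
    return F.cosine_similarity(src[0:1], tgt[0:1]).item()

# Example with random features (5 instances per frame, 256-d descriptors):
# layer = ResidualGCNLayer(256)
# sim = graph_similarity(torch.randn(5, 256), torch.randn(5, 256), layer)

The resulting graph similarity would then be fused with the individual similarity, as in the rectification sketch after the Fig. 2 discussion above.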

Besides the context aggregation strategy, the quality of individual features is also critical for person search. We enhance the discriminative power of individual representations by designing a new metric learning loss, the Random Proxy Center (RPC) loss. To exploit the unlabeled samples provided in person search datasets, the RPC loss treats unlabeled samples as negatives for any labeled sample when constructing triplets, which allows the unlabeled samples to directly take part in the optimization. The widely-used OIM loss [7] in the person search area is essentially a cross-entropy loss, which does not yield sufficiently compact intra-class features [8]. Compared with the OIM loss, our RPC loss makes the CNN model learn more compact intra-class features and better separated inter-class features by directly optimizing the intra-class and inter-class variations; a hedged sketch of such a loss follows.
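
As a rough illustration only, the sketch below builds triplets with a "proxy" negative drawn from other class centers or the unlabeled pool; the sampling rule, the squared Euclidean distance and the margin value are assumptions rather than the paper's exact formulation.

import torch
import torch.nn.functional as F

def rpc_like_loss(feats, labels, centers, unlabeled_feats, margin=0.3):
    # Pull each labeled feature toward its own class center and push it away
    # from a randomly chosen proxy (another class center or an unlabeled
    # feature), so that intra-class distance + margin < inter-class distance.
    pos = centers[labels]                                    # (N, D) own-class centers
    neg_pool = torch.cat([centers, unlabeled_feats], dim=0)  # (C + U, D) candidate proxies
    neg_idx = torch.randint(0, neg_pool.size(0), (feats.size(0),))
    # Avoid sampling a feature's own class center as its negative proxy.
    neg_idx = torch.where(neg_idx == labels, (neg_idx + 1) % neg_pool.size(0), neg_idx)
    neg = neg_pool[neg_idx]
    d_pos = (feats - pos).pow(2).sum(dim=1)
    d_neg = (feats - neg).pow(2).sum(dim=1)
    return F.relu(d_pos - d_neg + margin).mean()

In practice the class centers could be maintained in a lookup table in the spirit of OIM; here they are simply passed in as a tensor for clarity.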

In summary, our contributions are as follows: (1) We propose to model the context instances in the query and gallery frames as a graph pair and design Siamese Residual Graph Convolutional Networks (SR-GCN) to integrate the context information for better retrieval. Within SR-GCN, we propose the joint similarity adjacency matrix to describe the complex relationships between context instances for high-quality context aggregation. (2) We propose the Random Proxy Center (RPC) loss for better feature learning. The RPC loss can exploit unlabeled samples and achieves better performance than the widely-used OIM loss and several other loss functions in the person search task. (3) Our approach performs favorably against state-of-the-art methods on two public person search datasets, and extensive ablation studies demonstrate the effectiveness of the proposed method.

Section snippets

Related work

Person re-identification. Early person Re-ID methods used hand-crafted features [9], [10] to represent person images or manually designed distance metrics [11], [12], [13]. After deep learning achieved great success, it was also applied to the Re-ID task and has become the mainstream approach in the Re-ID community. Some works [14], [15], [16], [17], [18], [19], [20], [21] focus on extracting robust and discriminative features of person images. Some methods [22], [23], [24],

Method

The whole person search framework is illustrated in Fig. 3. We first employ Faster R-CNN to detect persons, and then adopt an Se-ResNet-50 [47] trained with the proposed RPC loss to extract CNN features for each detected person. Finally, the proposed SR-GCN model processes the context information based on these CNN features to generate the graph similarity, which is used to rectify the original individual similarity. The Faster R-CNN person detector, Se-ResNet-50 and the SR-GCN

Datasets and evaluation protocol

PRW. The PRW dataset [33] is obtained from videos taken at a university with six cameras. It provides 11,816 frames with 43,110 bounding-boxes belonging to 932 labeled identities and many unlabeled identities. Among them, 15,575 bounding-boxes belonging to 482 labeled identities, together with many unlabeled bounding-boxes, form the training set, and 6,112 frames with 2,057 query persons form the testing set.

CUHK-SYSU. The CUHK-SYSU dataset [7] is collected from both

Discussion

In this paper, we propose the SR-GCN model to exploit the context information between a query frame and a gallery frame, and design the RPC loss to extract more discriminative CNN features. Experimental results validate the effectiveness of the proposed method.

CGPS [6] also proposes to exploit the context information for person search. To this end, CGPS directly models the person-to-person matching problem as a fixed-size group-to-group matching problem, ignoring the complex distribution of

Conclusion

In this paper, we propose a two-step person search method. The Faster R-CNN is first employed to detect possible persons, and then the Se-ResNet-50 model trained with the proposed RPC loss function is used to extract robust and discriminative CNN features. Finally, the SR-GCN model aggregates the context information to rectify the individual similarity of the target query-gallery pair to further improve the retrieval performance. Extensive experiments on two public person search datasets

CRediT authorship contribution statement

Chuang Liu: Conceptualization, Methodology, Software, Validation, Writing - original draft. Hua Yang: Writing - review & editing, Project administration, Supervision. Ji Zhu: Writing - review & editing. Xinzhe Li: Writing - review & editing. Zhigang Chang: Writing - review & editing. Shibao Zheng: Project administration, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported in part by National Natural Science Foundation of China (NSFC, Grant Nos. 61771303, 62071292), Science and Technology Commission of Shanghai Municipality (STCSM, Grant Nos. 19DZ1209303, 20DZ1200203, 18DZ2270700, 2021SHZDZX0102), and SJTU Yitu/Thinkforce Joint Laboratory for Visual Computing and Application.


References (55)

  • L. Chen et al., Person re-identification from virtuality to reality via modality invariant adversarial mechanism, Neurocomputing (2020).
  • Z. Chang et al., Weighted bilinear coding over salient body parts for person re-identification, Neurocomputing (2020).
  • S. Gong, M. Cristani, S. Yan, C.L. Chen, Person Re-Identification, Springer, 2014.
  • P.F. Felzenszwalb et al., Object detection with discriminatively trained part-based models, IEEE Trans. Pattern Anal. Mach. Intell. (2010).
  • P. Dollár et al., Fast feature pyramids for object detection, IEEE Trans. Pattern Anal. Mach. Intell. (2014).
  • D. Chen et al., Person search via a mask-guided two-stream CNN model.
  • R. Mazzon et al., Detection and tracking of groups in crowd.
  • Y. Yan et al., Learning context graph for person search.
  • T. Xiao et al., Joint detection and identification feature learning for person search.
  • Y. Wen, K. Zhang, Z. Li, Y. Qiao, A discriminative feature learning approach for deep face recognition, in: European ...
  • X. Wang et al., Shape and appearance context modeling.
  • R. Zhao et al., Unsupervised salience learning for person re-identification.
  • D. Gray et al., Viewpoint invariant pedestrian recognition with an ensemble of localized features.
  • S. Liao et al., Efficient PSD constrained asymmetric metric learning for person re-identification.
  • W.-S. Zheng et al., Person re-identification by probabilistic relative distance comparison.
  • S. Liao et al., Person re-identification by local maximal occurrence representation and metric learning.
  • J. Liu et al., Multi-scale triplet CNN for person re-identification.
  • Y. Yan et al., Person re-identification via recurrent feature aggregation.
  • D. Chen et al., Similarity learning with spatial constraints for person re-identification.
  • W.-S. Zheng et al., Towards open-world person re-identification by one-shot group-based verification, IEEE Trans. Pattern Anal. Mach. Intell. (2016).
  • Z. Zhou et al., See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification.
  • L. Chen, H. Yang, Q. Xu, Z. Gao, Harmonious attention network for person re-identification via complementarity between ...
  • W. Lin et al., Learning correspondence structures for person re-identification, IEEE Trans. Image Process. (2017).
  • C. Su et al., Pose-driven deep convolutional model for person re-identification.
  • L. Wei et al., Global-local-alignment descriptor for pedestrian retrieval.
  • G. Wang et al., Learning discriminative features with multiple granularities for person re-identification.
  • E. Ahmed et al., An improved deep learning architecture for person re-identification.

Chuang Liu received the Bachelor’s Degree from Xidian University, China, and is currently pursuing the Ph.D. degree at the Department of Electronic Engineering, Shanghai Jiao Tong University, China. His research interests are computer vision tasks like person detection, person re-identification and person search.

Hua Yang received the Ph.D. degree in communication and information from Shanghai Jiaotong University in 2004, and both the B.S. and M.S. degrees in communication and information from Harbin Engineering University, China, in 1998 and 2001, respectively. She is currently an associate professor in the Department of Electronic Engineering, Shanghai Jiaotong University, China. She received the first prize of Shanghai technical invention in 2017 and, as an advisor, the championship of the WIDER person search challenge at ECCV 2018. Her current research interests include computer vision, machine learning, and smart video surveillance applications.

Ji Zhu received the B.E. degree from the School of Electronic Engineering, Xidian University, China, and is currently pursuing the Ph.D. degree at the Department of Electronic Engineering, Shanghai Jiao Tong University, China. He also works as a research scientist at Visbody. His research interests include computer vision, deep learning, and computer graphics.

Xinzhe Li received his B.S. degree in electronic information engineering from Dalian University of Technology, Dalian, China, in 2015. He is currently a Ph.D. student at the Department of Electronic Engineering, Shanghai Jiao Tong University (SJTU), Shanghai, China. His current research interests include few-shot learning, meta-learning and data cleaning.

Zhigang Chang is a Ph.D. candidate in the Department of Electronic Engineering, Shanghai Jiao Tong University, under the supervision of Prof. Shibao Zheng. Before that, he graduated from South China University of Technology in Electronic Information Engineering in 2016. He is now a visiting student in the MM Lab at Nanyang Technological University under the supervision of Associate Prof. Chen Change Loy. His research interests are computer vision tasks like pedestrian detection and person re-identification.

Shibao Zheng received his B.S. degree in communication engineering from Xidian University, Xi’an, and M.S. degree in signal and information processing from the 54th institute of CETC, Shijiazhuang, China, in 1983 and 1986, respectively. He is currently a professor in the Department of Electronic Engineering and vice director of the Elderly Health Information and Technology Institute, Shanghai Jiao Tong University (SJTU), Shanghai, China. He is also a professor committee member of the Shanghai Key Laboratory of Digital Media Processing and Transmission, and a consultant expert of the Ministry of Public Security in the field of video surveillance. His current research interests include urban video surveillance systems, intelligent video analysis, and elderly health technology.
