Neurocomputing

Volume 423, 29 January 2021, Pages 255-263

Salient points driven pedestrian group retrieval with fine-grained representation

https://doi.org/10.1016/j.neucom.2020.09.054

Highlights

  • Explores group retrieval, a less studied issue in the crowd video analysis field.

  • A non-uniform sampling process is used to divide a group into fine-grained units.

  • A cost-trim process is used to reduce the influence of high-cost matching pairs.

Abstract

People often take part in social activities in the form of groups in public areas. Since groups are the primary constituent units of a crowd, group retrieval has become one of the urgent issues for security departments. In this paper, a collection of stable individuals with some social relationship, called a group, is selected as the research object, and a novel task of pedestrian group retrieval is introduced. Unlike individual person matching, groups often show high aggregation due to their inherent characteristics, so occlusions among group members are more serious. As a result, the performance of detection and matching based on individual persons is degraded. Meanwhile, group matching must also handle problems such as variations in shape or configuration. We therefore suggest that the group as a whole can be decomposed into a fine-grained representation, and we design a salient points driven framework for pedestrian group retrieval. The work focuses on extracting the overall appearance characteristics of a deformable pedestrian collection and on matching groups at varying scales. Experiments on the Pedestrian-Groups2 dataset and the Road Group dataset demonstrate the effectiveness of the proposed framework for pedestrian group retrieval.

Introduction

Group retrieval is a less studied issue in crowded scene analysis; its major goal is to search for and match the same pedestrian collective across images from different non-overlapping cameras. In the past decade, crowd understanding and analysis have been an active field in computer vision and have undergone rapid development in various aspects, including crowd event detection [1], [2], [3], [4], crowd counting [5], [6], [7], [8] and segmentation [9], [10].

As one of the major constituent units of a crowd, groups have become an important research object in the field of public safety management. They contain information that facilitates the understanding of collective phenomena, which has raised great interest among researchers in the study of groups, such as group detection [11], [12], [13], [14] or group activity recognition [15], [16], [17]. However, the understanding of groups remains challenging, especially in the aspect of cross-camera group retrieval, which has so far received few studies. Research on group retrieval is of practical value for security management, such as tracing the source of an anomalous crowd or detecting the movement route of a group. In recent years in particular, terrorist attacks, violent incidents and premeditated gang crimes have often been committed by groups, resulting in more severe consequences and impact on society than incidents caused by single individuals. Therefore, group retrieval has become one of the urgent issues for security departments. This paper conducts an exploration into this less studied issue and proposes a salient points driven framework for pedestrian group retrieval.

As defined in [14], a pedestrian group is a cluster of members who tend to move together for a sustained period of time. They usually have some kind of social relationship, e.g. friends or family members. In general, a crowd in a public area has two kinds of constituent units: unorganized individuals and groups. Individual person re-identification has been widely studied [18], [19], [20]. A group generally refers to a collection composed of more than one individual; here, we mainly focus on small groups composed of several pedestrians. Furthermore, there is no consistent definition of a group. Following the interpretation in [14], we prefer to ignore collections of people who merely happen to walk together in a public area, because, for practical application purposes, security management departments are more interested in stable groups with some kind of relationship.

As shown in Fig. 1(f), groups display aggregation characteristics in real public areas; this characteristic results in the frequent occurrence of serious intra-group occlusion. Therefore, compared with individual re-identification, group re-identification faces many challenges beyond the difference in the number of individuals. Observing Fig. 1(a) to (d), occlusion occurs frequently in each group; furthermore, as marked by the green boxes in the bottom row of Fig. 1(a) and the top of (c), some group members can be occluded even more seriously. At the same time, comparing the individuals marked by yellow boxes in the two images of Fig. 1(b), the occluded part usually changes across cameras; in Fig. 1, rectangle boxes of the same color in each column indicate the same person in different camera scenes. As shown in Fig. 1(a) and (d), there are significant changes in the relative locations of individuals within a group. In addition, the individuals in Fig. 1(b) and (c) also move farther from or closer to each other while walking. Therefore, the appearance of a group exhibits the changes of configuration and shape illustrated in Fig. 1(e). Finally, it is clear from each column of Fig. 1 that groups may contain different numbers of individuals and thus present different matching scales.

In summary, there are three challenges in group retrieval. The first and greatest challenge is the high aggregation of individuals in a group, which causes them to heavily occlude each other, with occlusion patterns that generally vary randomly; individual-based matching methods become increasingly hard to apply in such cases. The second challenge is that the relative positions and distances of individuals may change with movement, and the group structure changes accordingly; this makes it impossible to measure similarity with a global matching method. The third challenge results from the inconsistent scales when matching different groups due to differences in the number of individuals.

This paper extends our work presented in [21] by adding the following content: (1) the proposed method is extended with a cost-trim process to further improve performance; (2) a new dataset named Pedestrian-Groups2 is constructed, which extends the dataset introduced in our previous work; the number of pedestrian groups increases from 30 to 134, and it covers more varied group situations; (3) new experiments are conducted on the Pedestrian-Groups2 dataset, and additional experiments on the Road Group dataset are run to evaluate the effectiveness of our method. We first find the key points of salient features in the group image using a key point detection algorithm. Combined with a perspective transformation, the group appearance is sampled with these salient key points as centers, yielding a series of non-uniform image blocks that form the group appearance representation collection. We then extract appearance feature vectors such as color, texture and structure for each block in the collection, and carry out clustering analysis on these features after normalization; the obtained cluster centers are collected as the group entire descriptor (GED). Finally, for two groups to be matched, we compute the optimal GED matching distance and, after a cost-trim process, use it as a metric for group identification.
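
To make the sampling step concrete, the following is a minimal sketch of salient-point-driven non-uniform sampling. It assumes OpenCV SIFT keypoints as the salient points and a block size scaled by each keypoint's detected scale as a rough stand-in for the perspective transformation; the detector, block sizes and thresholds are illustrative and may differ from the settings used in the paper.

import cv2

def sample_salient_patches(group_img, base_size=24, max_patches=200):
    """Crop non-uniform image blocks centred on salient keypoints of a group image."""
    gray = cv2.cvtColor(group_img, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create(nfeatures=max_patches)   # assumed salient point detector
    keypoints = sift.detect(gray, None)

    h, w = gray.shape
    patches = []
    for kp in keypoints:
        cx, cy = int(kp.pt[0]), int(kp.pt[1])
        # Larger detected keypoint scale -> larger block, so block sizes are non-uniform.
        half = max(int(base_size * max(kp.size / 10.0, 1.0)) // 2, 1)
        x0, x1 = max(cx - half, 0), min(cx + half, w)
        y0, y1 = max(cy - half, 0), min(cy + half, h)
        if x1 - x0 > 4 and y1 - y0 > 4:
            patches.append(group_img[y0:y1, x0:x1])
    return patches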

The main contributions of our work are summarized as follows: (1) a local sampling method for group characteristics and a salient points driven framework for group retrieval are proposed to generate the group appearance representation collection and to solve the problem of matching groups across different cameras; (2) to further improve matching performance, a cost-trim process is proposed to reduce the influence of high-cost matching element pairs; and (3) we construct a new cross-camera group dataset, Pedestrian-Groups2, for evaluating the proposed methods; it includes 135 pedestrian groups and covers various situations such as mutual occlusion, changes of group shape, and changes of relative position or distance.

Section snippets

Related works

In the field of public video surveillance, many works have been proposed for crowded scene analysis [22], [1], [23], in which crowds are studied as an entity at a macroscopic level. Nevertheless, as groups are the primary constituent units of a crowd, some works [24], [14], [25] tend to leverage group-level characteristics for crowd analysis. The research in [24] focuses on the impacts of small groups and social relations on crowd dynamics in emergency evacuation. In [14], the properties of

Extraction of group appearance representation

As discussed above, the relative locations of individuals in a walking group change easily. Therefore, the overall appearance may be quite different under two non-overlapping cameras, even for the same group. Since a moving group is compressible and deformable, much like a fluid, when we identify groups we generally first pay attention to some discriminating local parts that distinguish them from others instead of directly performing an overall matching. Then, those characteristics

Group matching

The acquisition of the group appearance representation collection enables a complete group to be represented by a set of image units of smaller granularity. However, due to differences in scenario or pedestrian group, the number of appearance blocks obtained from each group image differs. Therefore, the appearance feature vectors extracted from the image blocks cannot be directly applied to group matching. We solve the problem in two steps. Firstly, clustering analysis is carried
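
As an illustration of these two steps, the sketch below builds a group entire descriptor (GED) by clustering per-block color histograms with k-means, and then computes a trimmed optimal matching distance between two GEDs using the Hungarian algorithm. The feature type, cluster count and trim ratio are assumptions for illustration, not the paper's exact settings.

import cv2
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def block_feature(block, bins=8):
    """L1-normalised joint colour histogram of one appearance block (assumed feature)."""
    hist = cv2.calcHist([block], [0, 1, 2], None, [bins] * 3, [0, 256] * 3).flatten()
    return hist / (hist.sum() + 1e-8)

def group_entire_descriptor(blocks, n_clusters=10):
    """Cluster the block features; the cluster centres serve as the GED."""
    feats = np.stack([block_feature(b) for b in blocks])
    k = min(n_clusters, len(feats))
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(feats).cluster_centers_

def trimmed_matching_distance(ged_a, ged_b, trim_ratio=0.2):
    """Optimal GED-to-GED assignment cost after dropping the highest-cost pairs."""
    cost = np.linalg.norm(ged_a[:, None, :] - ged_b[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)            # optimal one-to-one matching
    pair_costs = np.sort(cost[rows, cols])
    keep = max(1, int(round(len(pair_costs) * (1.0 - trim_ratio))))
    return pair_costs[:keep].mean()                     # cost-trim: ignore the worst pairs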

Experiments

Group retrieval is one of the less studied issues in the crowd video analysis field. In this section, we evaluate our proposed method on two datasets; the framework of the proposed approach is shown in Fig. 6. The cumulative matching characteristics (CMC) are computed to evaluate the performance. The CMC curve gives the ratio of probe groups whose correct match is found within the top r entries of the group image gallery.
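
For reference, a minimal sketch of the CMC computation described above is given below. It assumes each probe group has exactly one correct match in the gallery; variable names are illustrative.

import numpy as np

def cmc_curve(dist_matrix, probe_ids, gallery_ids, max_rank=20):
    """dist_matrix[i, j]: distance between probe group i and gallery group j."""
    order = np.argsort(dist_matrix, axis=1)              # gallery sorted by ascending distance
    ranked_ids = np.asarray(gallery_ids)[order]
    hits = ranked_ids == np.asarray(probe_ids)[:, None]  # True where the correct match sits
    first_hit = hits.argmax(axis=1)                      # 0-based rank of the correct match
    return np.array([(first_hit < r).mean() for r in range(1, max_rank + 1)])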

Conclusion

Group retrieval across non-overlapping cameras is an emerging research issue concerning the application of computer vision technology in the field of crowd analysis. In this paper, a local non-uniform sampling method is adopted to decompose the deformable group, and a group retrieval framework driven by salient points is proposed. Experimental evaluations are conducted on the Pedestrian-Groups2 dataset and the Road Group dataset. The experimental results suggest our method can effectively identify the group

CRediT authorship contribution statement

Xiao-Han Chen: Conceptualization, Methodology, Software, Data curation, Writing - original draft. Jian-Huang Lai: Project administration, Funding acquisition, Supervision, Resources, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Key Research and Development Program of China (2016YFB1001003) and the NSFC (62076258).

Xiao-Han Chen received the Ph.D. degree in computer science and technology in 2020 from Sun Yat-sen University, China. He is currently a teacher with the Faculty of Mathematics and Computer Science, Guangdong Ocean University, China. His research interests include computer vision and machine learning, in particular crowd analysis in video surveillance.

References (44)

  • J. Liu et al., Decidenet: Counting varying density crowds through attention guided detection and density estimation
  • B. Sheng et al., Crowd counting via weighted VLAD on a dense attribute feature map, IEEE Transactions on Circuits and Systems for Video Technology (2018)
  • W. Lin et al., A diffusion and clustering-based approach for finding coherent motions and understanding crowd scenes, IEEE Transactions on Image Processing (2016)
  • S. Ali, M. Shah, A lagrangian particle dynamics approach for crowd flow segmentation and stability analysis, in:...
  • X. Li, M. Chen, F. Nie, Q. Wang, A multiview-based parameter free framework for group detection., in: AAAI, 2017, pp....
  • F. Solera et al., Socially constrained structural learning for groups detection in crowd, IEEE Transactions on Pattern Analysis and Machine Intelligence (2016)
  • J. Shao et al., Learning scene-independent group descriptors for crowd understanding, IEEE Transactions on Circuits and Systems for Video Technology (2017)
  • M. Qi et al., stagNet: An attentive semantic RNN for group activity recognition
  • X. Li et al., Sbgar: Semantics based group activity recognition
  • T. Shu, S. Todorovic, S.-C. Zhu, CERN: confidence-energy recurrent network for group activity recognition, in: IEEE...
  • S. Liao et al., Person re-identification by local maximal occurrence representation and metric learning
  • M. Farenzena, L. Bazzani, A. Perina, V. Murino, M. Cristani, Person re-identification by symmetry-driven accumulation...

Jian-Huang Lai received the Ph.D. degree in mathematics in 1999 from Sun Yat-sen University, China. He joined Sun Yat-sen University in 1989 as an assistant professor, where he is currently a Professor with the School of Data and Computer Science. His current research interests are in the areas of computer vision, digital image processing, pattern recognition, multimedia communication and multiple target tracking. He has published over 100 scientific papers in international journals and conferences on image processing and pattern recognition, e.g., IEEE TPAMI, IEEE TNN, IEEE TKDE, IEEE TIP, IEEE TSMC (Part B), IEEE TCSVT, Pattern Recognition, ICCV, CVPR, and ICDM. He serves as a vice director of the Image and Graphics Association of China. He is a Fellow of the Image and Graphics Society of China. He is also a senior member of the IEEE.
