1 Introduction

In Pattern Recognition and Computer Vision, several works focus on automating scene analysis [1, 2]. They build on social, biological and psychological theories. These works have shown that crowded scenes are composed of small groups of people, and that the behavior of each group is determined by the interactions among its members [3].

The detection of small groups is an important step in scene analysis, since it enables high-level semantic interpretation. It has several applications in video surveillance, such as anomaly detection and video classification [4,5,6,7].

The Standing Conversational Group (a.k.a. F-Formation) is a kind of small group that has attracted great interest in the scientific community. An F-Formation is composed of stationary people who interact through social signals (i.e. non-verbal expressions) [1]. Besides, while they interact, people form spatial and orientational patterns among themselves. Moreover, they have equal and exclusive access to a space inside the F-Formation [8].

An F-Formation is composed of three spaces: O-space, P-space and R-space (see Fig. 1(a)). The O-space is an empty space surrounded by people oriented toward it. This is the most important space, because most of the algorithms reported in the literature are based on it. The P-space surrounds the O-space, while the R-space is the complement of the P-space.

An F-Formation can take different geometrical forms: L-form, Face-to-Face, Side-by-Side and Circular form (see Figs. 1(b), (c), (d) and (e), respectively). When the number of people in an F-Formation is greater than two, the F-Formation commonly takes the circular form.

Fig. 1. Examples of F-Formation spaces (a) and F-Formation geometrical forms (b–e).

Several approaches have been proposed to detect F-Formations [9,10,11,12]. They use features such as the positions of the people on the ground plane and their head orientations.

The first approach is based on the Hough transform [9, 10], where an accumulator space is created and its local maxima are found through a voting strategy. Each local maximum represents an O-space center, to which people are assigned.

Other approaches are based on graph theory [13], where people and their relations are represented by vertices and edges, respectively. In this way, F-Formation detection is reduced to the maximal clique (i.e. dominant set) detection problem [11, 14].

The methods proposed in [12, 15] are based on game theory [16], where F-Formation detection is reduced to a clustering problem in an evolutionary setting.

The aforementioned approaches have high computational complexity, rely on complex theories and are difficult to reproduce. Little effort has been made in the literature to reduce these difficulties; to the best of our knowledge, only the authors of [17] attempt to address them. However, their method requires a large number of parameters and its group detection is not automatic. Furthermore, it is designed for detecting F-Formations in image sequences. Building on [17], we propose a solution that reduces the number of parameters and detects F-Formations automatically.

The main contributions of this paper are: (1) a new method for detecting F-Formations in an image, (2) a new image representation, in which a membership function for computing the social relations between people is introduced, and (3) an automatic clustering for associating people with their O-space.

The rest of this paper is organized as follows. Section 2 provides some basic concepts. Section 3 describes the proposed method. The experimental results are discussed in Sect. 4. Finally, conclusions and some ideas for future work are presented in Sect. 5.

2 Basic Concepts

In this section, we present the concepts required to understand our proposal.

Definition 1

(Fuzzy relation). Let X and Y be two sets. A fuzzy relation from X to Y is a membership function \(\rho :X \times Y \rightarrow [0,1]\). If \(X=Y\), then \(\rho \) is called a fuzzy relation on X.

Based on Definition 1, a similarity relation can be defined as follows.

Definition 2

(Similarity relation (by Zadeh [18])). The fuzzy relation \(\rho \) on X is a similarity relation if, for all \(x,y \in X\), the following properties hold:

  • Reflexivity: \(\rho (x,x) = 1\)

  • Symmetry: \(\rho (x,y) = \rho (y,x)\)

  • Transitivity: \(\rho (x,y) \ge \displaystyle \max _{z \in X} \left\{ \min \left\{ \rho (x,z),\rho (z,y) \right\} \right\} \).
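As an illustration of Definition 2, the three properties can be checked numerically on a finite set when the relation is stored as a square matrix. The following is a minimal sketch in Python with NumPy; the function names `maxmin_compose` and `is_similarity_relation` and the tolerance `tol` are illustrative choices, not part of the original formulation.

```python
import numpy as np

def maxmin_compose(a, b):
    """Max-min composition: out[i, j] = max over z of min(a[i, z], b[z, j])."""
    return np.max(np.minimum(a[:, :, None], b[None, :, :]), axis=1)

def is_similarity_relation(m, tol=1e-12):
    """Check reflexivity, symmetry and max-min transitivity of a square fuzzy matrix."""
    m = np.asarray(m, dtype=float)
    reflexive = np.allclose(np.diag(m), 1.0, atol=tol)
    symmetric = np.allclose(m, m.T, atol=tol)
    # rho(x, y) must dominate the max-min composition of rho with itself
    transitive = np.all(m >= maxmin_compose(m, m) - tol)
    return bool(reflexive and symmetric and transitive)

ok = [[1.0, 0.8, 0.8],
      [0.8, 1.0, 0.9],
      [0.8, 0.9, 1.0]]
bad = [[1.0, 0.9, 0.2],
       [0.9, 1.0, 0.9],
       [0.2, 0.9, 1.0]]  # rho(0, 2) = 0.2 < min(0.9, 0.9)
print(is_similarity_relation(ok), is_similarity_relation(bad))
```

The second matrix is reflexive and symmetric but fails transitivity, which is the typical situation for a raw pairwise affinity matrix.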

A fuzzy relation is sometimes represented by a matrix, which is known as a fuzzy matrix (see Definition 3).

Definition 3

(Fuzzy matrix). A matrix M is an \(m \times n\) fuzzy matrix if each cell of M has a value in the interval [0, 1].

On a fuzzy matrix M, which represents a similarity relation, we define an F-Formation as follows:

Definition 4

(F-Formation). An F-Formation is a set of connected cell indexes of M whose corresponding cell values are greater than a given \(\alpha \in [0,1]\).

3 The Proposed Method

Given a database of images with the positions of the people on the floor and their orientations (i.e. the head or body orientations), our proposal carries out two steps: (1) building a representation in which the relations between people are modeled through fuzzy relations, which are later encoded in a fuzzy matrix (see Sect. 3.1), and (2) clustering the fuzzy relations to detect F-Formations (see Sect. 3.2).

3.1 Representation

Let \((x_k, y_k)\) and \(\sigma _k\) be the position and the orientation of a person \(p_k\), and let \(v_k = [x_k + r \cdot \cos {{\sigma }_k}, y_k + r \cdot \sin {{{\sigma }_k}}]\) be that person's vote point [9], where r is the vote length. The visual field interaction between two people \(p_i\) and \(p_j\), \(i,j \in [1,k]\), is computed from their frustum intersection (i.e. the vote point intersection).

In [17], the authors proposed an idea based on vote points, where each person's frustum is represented by a single vote point. However, this idea relies on the assumption of perfect alignment [9], under which some F-Formations could be missed (see Fig. 2(a)). For this reason, we propose an alternative to the idea of [17], in which each person's frustum is represented by three vote points. In this way, we avoid the assumption of perfect alignment (see Fig. 2(b)).

Fig. 2. Frustum model.

According to [17], a frustum intersection between \(p_i\) and \(p_j\), \(i,j \in [1,k]\), is valid if the following two rules hold: (1) both vote points \(v_i\) and \(v_j\) lie on the same side of the segment d that joins \(p_i\) and \(p_j\), and (2) the distance between \(p_i\) and \(p_j\) (i.e. the length of d) is longer than the distance between \(v_i\) and \(v_j\). Notice that these rules are satisfied only when \(r = d/2\).

In our proposal (see Algorithm 1), for each pair of people \(p_i\) and \(p_j\) we compute their vote points \(v_{i,l}\) and \(v_{j,l}\), \(l \in [1,3]\), using Eqs. 1, 2 and 3. To search for a valid frustum intersection, we use Eqs. 6 and 7 to build a matrix X whose elements are the signed distances between the vote points.

$$\begin{aligned} v_{k,1} = [x_k + r \cdot \cos {({\sigma }_k + \gamma )}, y_k + r \cdot \sin {({\sigma }_k+\gamma )}] \end{aligned}$$
(1)
$$\begin{aligned} v_{k,2} = [x_k + r \cdot \cos {{\sigma }_k}, y_k + r \cdot \sin {{{\sigma }_k}}] \end{aligned}$$
(2)
$$\begin{aligned} v_{k,3} = [x_k + r \cdot \cos {({\sigma }_k - \gamma )}, y_k + r \cdot \sin {({\sigma }_k - \gamma )}] \end{aligned}$$
(3)
$$\begin{aligned} \varGamma _{k, l} =({v_{k,l}}_x - x_i)*(y_j - y_i) - ({v_{k,l}}_y - y_i)*(x_j - x_i) \end{aligned}$$
(4)
$$\begin{aligned} dv_{i,j} = \sqrt{({v_{i,l}}_x - {v_{j,l}}_x)^2 + ({v_{i,l}}_y - {v_{j,l}}_y)^2} \end{aligned}$$
(5)
$$\begin{aligned} \rho _{i,j} = \left\{ \begin{array}{ll} 1 &{} \qquad \text {if } \left( \varGamma _{i,l} \ge 0 \text { and } \varGamma _{j,l} \ge 0 \text { or } \varGamma _{i,l} \le 0 \text { and } \varGamma _{j,l} \le 0 \right) \text { and } dv_{i,j} < d \\ -1 &{} \qquad \text {otherwise}\\ \end{array} \right. \end{aligned}$$
(6)
$$\begin{aligned} \kappa _{i,j} = \rho _{i,j} dv_{i,j} \end{aligned}$$
(7)
$$\begin{aligned} X = \left( \begin{array}{ccc} \kappa (v_{i,1}, v_{j,1}) &{} \kappa (v_{i,1}, v_{j,2}) &{} \kappa (v_{i,1}, v_{j,3}) \\ \kappa (v_{i,2}, v_{j,1}) &{} \kappa (v_{i,2}, v_{j,2}) &{} \kappa (v_{i,2}, v_{j,3}) \\ \kappa (v_{i,3}, v_{j,1}) &{} \kappa (v_{i,3}, v_{j,2}) &{} \kappa (v_{i,3}, v_{j,3}) \\ \end{array} \right) \end{aligned}$$
$$\begin{aligned} \mu _{i,j} = (1 - \frac{u}{d}) \exp ( \frac{-d}{h}) \end{aligned}$$
(8)
Algorithm 1.

To compute the social relation between two people, we propose the membership function \(\mu _{i,j}\) (see Eq. 8). The h values are taken from Hall's theory [19], which characterizes the social interactions between people by physical distances. The value of u is the minimum of the positive elements of X.

When all elements of X are negative, \(\mu _{i,j} = 0\). However, when \(i = j\) (i.e. \(d = 0\)), \(\mu _{i,j} = 1\). Notice that \(\mu _{i,j}\) is a fuzzy relation on a set of people, and our representation is a fuzzy matrix M with the \(\mu _{i,j}\) values.
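The computation described by Eqs. 1 to 8 can be sketched for a single pair of people as follows. This is a minimal Python illustration under several assumptions that the text does not fix: angles are expressed in radians, the vote length is set to \(r = d/2\) (as in the rules of [17]), an element of X equal to zero (coincident vote points) is treated as a valid intersection, and all function names are hypothetical.

```python
import math

def vote_points(x, y, sigma, r, gamma):
    """Eqs. 1-3: vote points at angles sigma + gamma, sigma and sigma - gamma."""
    return [(x + r * math.cos(a), y + r * math.sin(a))
            for a in (sigma + gamma, sigma, sigma - gamma)]

def side(v, p_i, p_j):
    """Eq. 4: the sign tells on which side of the segment p_i p_j the point v lies."""
    return (v[0] - p_i[0]) * (p_j[1] - p_i[1]) - (v[1] - p_i[1]) * (p_j[0] - p_i[0])

def kappa(v_i, v_j, p_i, p_j, d):
    """Eqs. 5-7: distance between two vote points, negated when the rules fail."""
    g_i, g_j = side(v_i, p_i, p_j), side(v_j, p_i, p_j)
    dv = math.dist(v_i, v_j)
    same_side = (g_i >= 0 and g_j >= 0) or (g_i <= 0 and g_j <= 0)
    rho = 1 if (same_side and dv < d) else -1
    return rho * dv

def membership(person_i, person_j, gamma, h):
    """Eq. 8 for two people given as (x, y, sigma) tuples."""
    (x_i, y_i, s_i), (x_j, y_j, s_j) = person_i, person_j
    p_i, p_j = (x_i, y_i), (x_j, y_j)
    d = math.dist(p_i, p_j)
    if d == 0:
        return 1.0  # same position (i = j)
    r = d / 2  # assumption: vote length tied to the pair distance
    X = [[kappa(a, b, p_i, p_j, d) for b in vote_points(x_j, y_j, s_j, r, gamma)]
         for a in vote_points(x_i, y_i, s_i, r, gamma)]
    valid = [k for row in X for k in row if k >= 0]  # assumption: zero counts as valid
    if not valid:
        return 0.0  # no valid frustum intersection
    u = min(valid)
    return (1 - u / d) * math.exp(-d / h)

# Two people 2 units apart: facing each other, then back to back.
print(membership((0, 0, 0.0), (2, 0, math.pi), gamma=math.pi / 6, h=70))
print(membership((0, 0, math.pi), (2, 0, 0.0), gamma=math.pi / 6, h=70))
```

In the face-to-face case the central vote points coincide up to rounding, so u is close to zero and the value is dominated by the factor \(\exp (-d/h)\); in the back-to-back case every entry of X is negative and the membership is zero.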

3.2 Clustering

We propose Algorithm 2 for clustering, which uses the ClusteringRF algorithm [17] to transform the input fuzzy matrix M into a similarity relation matrix \(M'\) (i.e. a fuzzy relation that fulfills the reflexivity, symmetry and transitivity properties [18]) and to generate a partition \(C_ \alpha = \{c_1, \ldots , c_k\}\) by an \(\alpha \)-cut.

To determine the number of clusters (i.e. the number of F-Formations within an image), we use a naive average of scores and select the partition \(C_ \alpha \) with the maximal \(w_\alpha \) value. Notice that \(|C_ \alpha |\) is the number of clusters, \(|c_k|\) is the number of elements in cluster k and \(M'(i,j)\) is a value of the fuzzy matrix \(M'\).
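To make the two ingredients of this step concrete, the sketch below shows one standard way of obtaining a similarity relation from a reflexive and symmetric fuzzy matrix, namely the max-min transitive closure, followed by an \(\alpha \)-cut that reads off the partition. Whether ClusteringRF [17] proceeds exactly in this way is an assumption; the score \(w_\alpha \) used to select the final partition is not reproduced here, and the function names are illustrative.

```python
import numpy as np

def transitive_closure(m):
    """Max-min transitive closure of a reflexive, symmetric fuzzy matrix."""
    m = np.asarray(m, dtype=float).copy()
    while True:
        composed = np.max(np.minimum(m[:, :, None], m[None, :, :]), axis=1)
        nxt = np.maximum(m, composed)
        if np.array_equal(nxt, m):
            return m
        m = nxt

def alpha_cut_partition(m_closed, alpha):
    """Group indexes whose closed relation value is greater than alpha."""
    n = m_closed.shape[0]
    assigned = set()
    clusters = []
    for i in range(n):
        if i in assigned:
            continue
        members = [j for j in range(n) if j == i or m_closed[i, j] > alpha]
        assigned.update(members)
        clusters.append(members)
    return clusters

M = [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.6],
     [0.1, 0.6, 1.0]]
M_closed = transitive_closure(M)
for alpha in (0.5, 0.7):
    print(alpha, alpha_cut_partition(M_closed, alpha))
```

Because the closed matrix is a similarity relation, every \(\alpha \)-cut yields disjoint classes, so each candidate \(\alpha \) produces one candidate partition \(C_ \alpha \) that a selection criterion can then score.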

Algorithm 2.

4 Experimental Results

In this section, we present the experimental evaluation of our proposed method, comparing its results against the best results reported in the literature on two real-world databases (Coffee Break [9] and GDet [9]) and one synthetic database (Synth [9]).

4.1 Databases

The Coffee Break database [9] was obtained from a real-world outdoor scenario, using a single camera with a resolution of \(1440 \times 1080\) px. It is composed of social events in which people interact while enjoying a cup of coffee. This database has 120 images annotated by psychologists using several questionnaires, where head orientations were estimated considering four directions: front, back, left and right.

The GDet database [9] was obtained from an indoor scenario, a vending machine area with several occlusions. It has 403 images, which were acquired by two low-resolution cameras (\(352 \times 328\) px) located at opposite corners of the room. The ground truth was generated by psychologists, where the head orientations were estimated considering four directions (front, back, left and right) and the positions of the people were computed by a particle filter tracking algorithm [20].

The Synth database [9] was generated by a trained expert and contains 100 situations, obtained from 10 different base situations by slightly varying the positions and head orientations of the people. It is important to highlight that there is no noise in this database.

4.2 Experiments

To evaluate our proposal, we use the validation protocol proposed in [12], where a group is correctly detected if at least \(\lceil T \cdot |G| \rceil \) of its members are found and no more than \(\lceil (1-T) \cdot |G| \rceil \) of the detected people are non-members. The value |G| is the cardinality of the labeled group and \(T = 2/3\). For each image, the precision p, the sensitivity s and the F1 score are computed from the numbers of true positives tp, false positives fp and false negatives fn:

$$\begin{aligned} p=\frac{tp}{tp + fp}, \quad s=\frac{tp}{tp + fn}, \quad F1=2\cdot \frac{p \cdot s }{p + s} \end{aligned}$$
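The protocol above can be sketched for one image as follows. This is an illustrative Python reading of the rule as stated in this section, not the reference implementation of [12]: it assumes that tp counts detected groups matched to a labeled group, fp counts unmatched detections and fn counts unmatched labeled groups, and it uses a simple greedy one-to-one matching. The threshold is kept as an exact fraction so that the ceilings are not affected by floating-point rounding; the function names are hypothetical.

```python
import math
from fractions import Fraction

def is_match(detected, truth, t=Fraction(2, 3)):
    """Tolerant match between a detected group and a labeled group G."""
    detected, truth = set(detected), set(truth)
    found = len(detected & truth)   # members of G that were found
    extra = len(detected - truth)   # detected people who are not members of G
    size = len(truth)
    return found >= math.ceil(t * size) and extra <= math.ceil((1 - t) * size)

def precision_sensitivity_f1(detected_groups, truth_groups, t=Fraction(2, 3)):
    """Per-image precision, sensitivity and F1 with greedy one-to-one matching."""
    unmatched = list(truth_groups)
    tp = 0
    for det in detected_groups:
        for g in unmatched:
            if is_match(det, g, t):
                unmatched.remove(g)
                tp += 1
                break
    fp = len(detected_groups) - tp
    fn = len(truth_groups) - tp
    p = tp / (tp + fp) if tp + fp else 0.0
    s = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * s / (p + s) if p + s else 0.0
    return p, s, f1

truth = [[1, 2, 3], [4, 5]]
detected = [[1, 2], [4, 5], [6, 7]]
print(precision_sensitivity_f1(detected, truth))
```

In this example the detection [1, 2] is accepted for the labeled group [1, 2, 3] because two of its three members are found, while [6, 7] counts as a false positive.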

Our experiments were implemented in C++ on Eclipse, using the OpenCV and Armadillo libraries, and run on a personal computer with an Intel(R) Core(TM) 2 Duo CPU at 1.83 GHz and 2 GB of RAM, under the Ubuntu 18.04.2 distribution.

Table 1 shows the results obtained by the related works reported in the literature, as well as the results achieved by our proposal. The first column of this table shows the names of the methods. The other three columns show the precision, sensitivity and F1 values achieved on Coffee Break, GDet and Synth, with the best result of each column highlighted.

Table 1. F1 measure results achieved by several methods over three databases

We varied the h values between the intimate and social spaces of Hall's theory [19] (i.e. [0, 360]). To generate the frustum of each person, the orientation \(\sigma _k\) is taken from the person's head and the vote points \(v_{k, g}, g \in \{1,3\}\) are computed with angles \(\sigma _k \pm \gamma \), where the \(\gamma \) values are between 0 and 60 for an effective visual field. Our best results are achieved with \(h = 70\) and \(\gamma = 30\) in Coffee Break, \(h = 10\) and \(\gamma = 60\) in GDet and \(h = 116\) and \(\gamma = 30\) in the Synth database.

We obtained different results because these databases represent different environments, in which the crowd level of the scene varies (i.e. the number of people, the occlusions and the interaction distances between them). For this reason, in practice, the parameters h and \(\gamma \) must be carefully selected in a pre-processing step. We recommend decreasing the value of h and increasing the value of \(\gamma \) as the crowd level increases.

As an example, Fig. 3 shows a result of our proposal on the Coffee Break database, where circles of the same color over the heads of the people indicate members of the same detected Standing Conversational Group. Notice that there are 4 small groups (green, red, yellow and blue), with cardinality between 2 and 3.

Fig. 3. A visual result of our method in Coffee Break database. (Color figure online)

5 Conclusions and Future Works

In this paper, we proposed a new method for detecting Standing Conversational Groups (F-Formations) in a still image. We based our proposal on fuzzy relation theory to build a new representation with three vote points. Moreover, we proposed a clustering over fuzzy relations to obtain the best number of F-Formations. We evaluated our proposal on two real-world databases (Coffee Break and GDet) and a synthetic one (Synth).

The results achieved by our proposal outperform the best ones reported in the literature on the GDet database and are similar to them on Coffee Break, while on Synth we obtained the best results. Based on our experiments, we can conclude that our proposal is an effective and simple solution. In the future, we will explore several internal clustering validation indexes to improve, as far as possible, the estimated number of F-Formations.