1 Introduction

Observing people interacting with their environment can provide clues about its 3D structure. Facets of this that have been studied within computer vision include inferring functional objects as “dark matter” [64], ground plane paths [30], and modeling human-object interactions for understanding events and participants from RGB-D video [61]. 3D representations enable answering questions that are awkward or not accessible with 2D representations. For example, one might want to ask if there are paths that can be taken that are not visible to security cameras. In this paper, we present a system that infers 3D locations that people look at, including ones not visible to the camera, from monocular, uncalibrated video. For example, we can infer the 3D location of an interesting poster that draws people’s gazes by observing the people passing by (Fig. 1).

Fig. 1.
figure 1

Temporal 3D scene understanding through joint inference of people’s locations, their head poses, and the locations of what they’re looking at. The gaze cones of the red person for the current (red) and previous times (faded red) intersect to help localize a target in 3D on the left wall. The hypothesis that they are looking at the same object from two different views makes this analogous to stereo vision. The blue person adds a third view. Furthermore, the hypothesis that the green person is looking at the red person enriches our understanding of the scene, and can help improve both the estimate of the green person’s head pose and the location of the red person. (Color figure online)

To this end, we develop a fully 3D Bayesian modeling approach that represents where people are, their head poses (and thus approximate gaze directions), and the 3D location they are looking at, which might be one of the other persons that we are tracking or an interesting location that attracts people’s visual attention in the scene. Our model further embodies the camera parameters of an assumed stationary monocular video camera, so that we can infer them rather than rely on having calibrated cameras.

Our joint inference approach is motivated by the following observations: (1) the 3D locations of what people might be looking at can help estimate gaze direction and therefore head pose; (2) other people in the scene are possible targets of visual attention, and if we are tracking them in 3D, joint inference of their locations and the gazes of others should be beneficial; and (3) scenes often contain likely locations of visual attention (e.g., a visually interesting poster), and multiple spatio-temporal gaze cones can help pinpoint them in 3D, analogously to multiple views (Fig. 1). We also make use of the following observations from Brau et al. [13] regarding tracking of people walking on a ground plane: (1) 3D representation simplifies handling occlusions (which become evidence instead of confounds); (2) 3D representation allows for a meaningful prior on velocity (and, here, head-turning angular velocity); and (3) one can infer camera parameters jointly with the scene, as walking people tend to maintain a fixed height and are thus like calibration probes that transport themselves to different depths.

We specify the joint probability of the latent model and the association of person detections across frames (Sect. 3). The data association implies a hypothesis for the number of people in the scene at each point in time. To compare models of differing dimensions in a principled way, we approximately marginalize out all the continuous model parameters. These include the locations of each person, their gaze angles, and the locations of the static points drawing visual attention that we are trying to discover from gazing behavior. We compute these approximate marginals by using MCMC sampling to find the maximum of the distribution and then applying the Laplace approximation. We combine this with multiple MCMC sampling strategies to explore the space of models (Sect. 4).

Because our goals are new, we contribute a modest data set with the 3D locations of what participants are looking at, which is not available in other data sets with people walking about (see Sect. 5 for further discussion). In the contributed data set, participants recorded what they were looking at while they were walking around, and we established the ground truth 3D locations for all targets (people and other objects) using ground truth 2D detections (Sect. 6).

Our Contributions include: (1) operationalizing the observation that multiple gaze angles estimated from head pose can be used to learn 3D locations that people look at; (2) extending the approach proposed by Brau et al. [13] to include head pose, a walking direction prior, and a more efficient sampling approach; (3) joint inference of head pose and 3D location of what people are looking at while walking; (4) inferring who is looking at whom or what (both anonymously defined); and (5) a new data set for what people are looking at while they walk around, and where those objects or people are in 3D.

2 Related Work

Multiple Target Tracking (MOT). Despite significant progress, multiple-target tracking remains a challenge due to issues such as noisy and complex evidence, occlusion, abrupt motion, and an unknown number of targets. This work is in the tracking-by-detection paradigm [3, 4, 9, 13, 17, 31, 37, 44, 46, 54, 66, 69]. Typically, these approaches first acquire the image locations of people in a video sequence, and then find the tracks of each underlying target by solving the data association problem and inferring the target locations. Both 2D and 3D models have been used to represent the underlying targets. Effectively working in 2D requires explicit modeling of occluded targets (e.g., [37, 69]). Conversely, 3D models can treat occlusions and smooth motion naturally [13, 28].

Head Pose Estimation. There is a rich history of methods for estimating head pose from single images (e.g., [11, 12, 21, 22, 25, 26, 33, 34, 38, 39]). In video, information flow between frames has been exploited by a number of researchers (e.g., [6, 57, 65, 70]). More similar to our approach are model-based tracking methods that fit a 3D model to features tracked across a video (e.g., [32, 45, 56, 62, 63]). Head and body pose have also been estimated jointly via correlations between the outputs of body pose and head pose classifiers [14, 15]. In contrast, we model this coupling through a joint distribution on 3D body and head poses.

Head pose is a strong cue for visual focus of attention (VFoA) recognition, which has potential applications such as measuring the attractiveness of advertisements or shop displays in public spaces, as well as analyzing the social dynamics of meetings. Much research in VFoA focuses on dynamic meeting scenarios, where people usually sit around meeting tables while being video-recorded by multiple cameras [5, 7, 8, 19, 42, 43, 51, 52, 53, 58, 59]. Most of these methods exploit context-related information from speech and motion activity, and the set of potential VFoA targets is predefined and discrete, with known locations. In addition, the number of people in the scene is fixed, and they are considered to be seated in typically known locations, which makes sense given the application.

VFoA estimation has also been considered in surveillance settings in the context of understanding behavior [10, 27, 48, 49], where, so far, visual attention has been limited to image coordinates and to one person at a time. Notably, Benfold and Reid [10] use a camera calibrated to the ground plane to estimate a visual attention map representing the amount of attention received by each square meter of the ground in a town center scene. Similar to us, they identify interesting regions in the scene based on the inferred visual attention map. However, while the map can be projected into the video for visualization, 3D location is not inferred.

Another application of estimating VFoA is human-robot interaction, which involves both person-to-person and robot-to-person interactions [36, 47, 67]. Approaches in this domain often assume known head poses (orientations and locations) of the targets (persons, robots, and objects). For example, Massé et al. proposed a switching Kalman filter formulation to jointly estimate the gaze and the VFoA of several persons from observed head poses and object locations [36]. In addition, they assume that the numbers of persons and objects are known and remain constant over time. In contrast, we propose simultaneously inferring the number of targets and their locations in the scene while estimating their VFoAs using image evidence.

3 Statistical Model

Figure 2 shows our generative statistical model for temporal scene understanding using probabilistic graphical modeling notation. The scene consists of multiple people moving on the ground plane throughout the video. At each frame, each person may have their visual attention on another person or on one of several static objects that are located in 3D space. We model the visual focus of attention and the static objects explicitly. At each frame, each person may also generate a detection box, and the data association groups these detection boxes by person (or noise). Finally, we model the camera, which projects the scene onto the image plane, generating the observed data.

Fig. 2.
figure 2

Generative graphical model for temporal scene understanding. We use bold font for aggregate variables (e.g., \(\mathbf {z}\) represents state vectors for each person for each frame). The data association, \(\omega \), specifies the number of people and which detections (body, face) are associated with them. \(\omega \) depends on hyper-parameters collectively denoted by \(\gamma \) (Sect. 3.1). \(\varvec{\chi }\) is the set of static 3D points that people look at. The visual focus of attention (VFoA), \(\varvec{\xi }\), of each person is, for each frame, either one of these 3D points or another person. The temporal scene \(\mathbf {z}\) consists of the 3D state (location, size, head pose) of each person at each frame (Sect. 3.2). \(\mathbf {z}\) projects onto 2D to create model frames via the camera C, generating person detections, B, optical flow, \({I^f}\), and face landmarks, \(I^k\) (Sect. 3.4).

We place prior distributions on each of the model variables mentioned above. Similarly, for each type of data we use, we have a likelihood function that captures its dependence on the model. We combine these functions to get the posterior distribution, which we maximize (see Sect. 4).

3.1 Association

Following previous work [13], we define an association \(\omega = \{\tau _r \subset B\}_{r=0}^m\) to be a partition of B, the set of all detections (body, face) for the entire video. Here, each \(\tau _r\), \(r = 1, \dots , m\), called a track, is the set of detections which are associated with person r, and \(\tau _0\) is the set of spurious detections, generated by a noise process [41]. The prior distribution \(p(\omega )\) has hyperparameters \(\lambda _A\), \(\kappa \), \(\theta \), and \(\lambda _N\), representing the expected detections per person per frame, new tracks per frame, track length, and noise detections per frame [13].
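
To make the role of these hyperparameters concrete, the sketch below evaluates a log-prior of this general shape. The parametric forms (Poisson births and noise counts, geometric track lengths, per-frame Bernoulli detections) and all default values are illustrative assumptions, not the exact construction of [13].

```python
import numpy as np
from scipy.stats import geom, poisson

def log_association_prior(tracks, noise_count, n_frames,
                          lam_A=0.9, kappa=0.1, theta=30.0, lam_N=2.0):
    """Hedged sketch of a log-prior over associations omega.

    Hypothetical parametric choices (for illustration only):
      - new tracks per frame                 ~ Poisson(kappa)
      - track length in frames               ~ Geometric(1 / theta)
      - detection present at a tracked frame ~ Bernoulli(lam_A)
      - spurious detections per frame        ~ Poisson(lam_N)
    `tracks` is a list of (start_frame, end_frame, n_detections) tuples and
    `noise_count` is the total number of detections assigned to tau_0.
    """
    logp = 0.0
    births = np.zeros(n_frames)
    for start, end, n_det in tracks:
        births[start] += 1
        length = end - start + 1
        logp += geom.logpmf(length, 1.0 / theta)                     # track length
        logp += n_det * np.log(lam_A) + (length - n_det) * np.log(1.0 - lam_A)
    logp += poisson.logpmf(births, kappa).sum()                      # births per frame
    logp += poisson.logpmf(noise_count, lam_N * n_frames)            # noise detections
    return logp
```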

3.2 Scene and VFoA

Our 3D scene model consists of a set of moving persons, represented using 3D cylinders and ellipsoids, which we call the temporal scene, and a set of static objects, represented by 3D points. These objects are assumed to command attention from the people in the scene, which we model explicitly for each person at each frame, and call visual focus of attention (VFoA).

Static Objects. The scene contains a set of \(\widehat{m}\) static objects, denoted by \(\varvec{\chi }= (\varvec{\chi }_1, \dots , \varvec{\chi }_{\widehat{m}})\), \(\varvec{\chi }_r \in \mathbb {R}^3\). Since we do not have any prior information regarding their locations, we set a uniform distribution on their positions over the visible 3D space. We model interesting locations as independent from each other by using a joint prior of \(p(\varvec{\chi }) = p(\widehat{m}) \prod _{r=1}^{\widehat{m}} p(\varvec{\chi }_r)\), where \(p(\widehat{m})\) is Poisson.
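
A minimal sketch of evaluating this prior, assuming an axis-aligned bounding volume for the visible space and a hypothetical Poisson rate:

```python
import numpy as np
from scipy.stats import poisson

def log_static_object_prior(chi, scene_bounds, rate=3.0):
    """Sketch of p(chi) = p(m_hat) * prod_r p(chi_r): Poisson on the number of
    static objects and a uniform density over the visible 3D volume.
    `chi` is an (m_hat, 3) array of 3D points; `scene_bounds` is a sequence of
    (lo, hi) pairs, one per axis; `rate` is a hypothetical Poisson mean."""
    m_hat = len(chi)
    volume = np.prod([hi - lo for lo, hi in scene_bounds])
    return poisson.logpmf(m_hat, rate) - m_hat * np.log(volume)
```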

Visual Focus of Attention (VFoA). The scene also contains m people, one for each association track \(\tau _r \in \omega \). Each person has a VFoA at each frame that encodes who or what they are observing, if anything. We use \(\xi _{rj} \in \{0, \dots , m + \widehat{m}\}\) to denote the VFoA of person r at frame j, e.g., \(\xi _{rj} = r'\) indicates person r is looking at person or object \(r'\) at frame j, where values of \(1 \le \xi _{rj} \le m\) indicate focus on a person, \(m < \xi _{rj} \le m + \widehat{m}\) on an object, and \(\xi _{rj} = 0\) indicates no focus. A priori, people tend to focus on the same visual target in consecutive frames, and we set a simple Markov prior on \(\varvec{\xi }_r = (\xi _{r1}, \dots , \xi _{r l_r})\), where \(\xi _{rj} = \xi _{r,j-1}\) with high probability. The prior for the entire VFoA set is \(p(\varvec{\xi }\, \vert \,\omega ) = \prod _{r=1}^m p(\varvec{\xi }_r)\).
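
The following sketch shows how the prior on one person’s VFoA sequence might be evaluated; the stay probability and the uniform-switch assumption are illustrative choices, not values from the model.

```python
import numpy as np

def log_vfoa_prior(xi_r, n_targets, p_stay=0.95):
    """Sketch of the Markov prior on a VFoA sequence xi_r with entries in
    {0, ..., n_targets}: with probability p_stay the focus is unchanged
    between consecutive frames; otherwise it switches uniformly at random."""
    xi_r = np.asarray(xi_r)
    logp = -np.log(n_targets + 1)                         # uniform initial focus
    same = xi_r[1:] == xi_r[:-1]
    logp += same.sum() * np.log(p_stay)
    logp += (~same).sum() * (np.log(1.0 - p_stay) - np.log(n_targets))
    return logp
```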

Temporal Scene. Each person r has temporal 3D state \(\mathbf {w}_r = (\mathbf {w}_{r1}, \dots , \mathbf {w}_{rl_r})\), where each single-frame state consists of the person’s ground-plane position \(\mathbf {x}_{rj} \in \mathbb {R}^2\), body yaw \(q_{rj}\), head pitch \(p_{rj}\), and head yaw \(y_{rj}\), so that \(\mathbf {w}_{rj} = (\mathbf {x}_{rj}, q_{rj}, p_{rj}, y_{rj})\), \(j=1, \dots , l_r\). Importantly, the head yaw \(y_{rj}\) is measured relative to the body yaw \(q_{rj}\), i.e., \(y_{rj} = 0\) when person r at frame j is looking straight ahead. Additionally, each person has three size dimensions: width, height, and thickness, denoted by \(d^{\mathsf {w}}_r\), \(d^{\mathsf {h}}_r\), and \(d^{\mathsf {g}}_r\). We will denote the full 3D configuration of track \(\tau _r\) by \(\mathbf {z}_r = (\mathbf {w}_r, d^{\mathsf {w}}_r, d^{\mathsf {h}}_r, d^{\mathsf {g}}_r)\). Conceptually, at any given frame j, this can be thought of as a \(d^{\mathsf {w}}_r \times d^{\mathsf {h}}_r \times d^{\mathsf {g}}_r\) cylinder whose “front” side is oriented at angle \(q_{rj}\), with an ellipsoid on top that has a pitch of \(p_{rj}\) and a yaw of \(y_{rj}\) (Fig. 3).

We call \(\mathbf {x}_r = (\mathbf {x}_{r1}, \dots , \mathbf {x}_{r l_r})\) the trajectory of person r, and place a Gaussian process (GP) prior on it to promote smoothness. We use analogous definitions for the body angle trajectory \(\mathbf {q}_r\), the head pitch trajectory \(\mathbf {p}_r\), and the head yaw trajectory \(\mathbf {y}_r\) (e.g., for body angle, \(\mathbf {q}_r = (q_{r1}, \dots , q_{r l_r})\)). We use similar smooth GP priors for these trajectories. Importantly, the priors on the head angle trajectories \(\mathbf {p}_r\) and \(\mathbf {y}_r\) depend on which objects they observe, encoded by \(\varvec{\xi }_r\), and their locations, which are contained in \(\varvec{\chi }\) and \(\mathbf {x}_{-r}\) (all trajectories except \(\mathbf {x}_r\)); e.g., for head pitch, \(p(\mathbf {p}_r \, \vert \,\varvec{\xi }_r, \varvec{\chi }, \mathbf {x}_{-r})\). We express this dependence by setting the mean of the GP prior to an angle pointing in the direction of the observed object, if any, at each frame.
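
As an illustration, the per-frame GP mean for the head angles can be computed as the angles that point from a person’s head toward the attended target, as in the sketch below; the coordinate conventions and the `head_height` stand-in for eye level are assumptions.

```python
import numpy as np

def gaze_mean_angles(person_xz, body_yaw, head_height, target_xyz):
    """Sketch of the GP prior mean for head yaw and pitch at one frame: the
    angles pointing from the person's head toward the attended 3D target.
    World coordinates follow Sect. 3.3 (ground = xz-plane, y up); the returned
    yaw is relative to the body yaw, matching the parameterization of w_rj."""
    dx = target_xyz[0] - person_xz[0]
    dz = target_xyz[2] - person_xz[1]
    dy = target_xyz[1] - head_height
    yaw_world = np.arctan2(dx, dz)                        # yaw about the vertical axis
    pitch = np.arctan2(dy, np.hypot(dx, dz))              # elevation toward the target
    yaw_rel = (yaw_world - body_yaw + np.pi) % (2 * np.pi) - np.pi  # wrap to (-pi, pi]
    return yaw_rel, pitch
```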

Fig. 3.
figure 3

3D model for a person (left) and its projection into the image plane (right). Person r at time (frame) j consists of a cylinder at position \(\mathbf {x}_{rj}\), of width \(d^{\mathsf {w}}_r\), height \(d^{\mathsf {h}}_r\), and thickness \(d^{\mathsf {g}}_r\) (not illustrated), with body angle \(q_{rj}\) (the black stripe on the cylinder represents its “front”) relative to the z-axis of the world. Further, person r’s head, represented by the ellipsoid, has yaw \(y_{rj}\) relative to the front of the cylinder and pitch \(p_{rj}\), indicated by the red arc. Its projection under camera C yields three boxes: model box \(h_{rj}\), model body box \(o_{rj}\), and model face box \(g_{rj}\).

The prior over a person’s full physical state, \(p(\mathbf {z}_r \, \vert \,\varvec{\xi }_r, \varvec{\chi }, \mathbf {x}_{-r}, \omega )\), expands to \(p(d^{\mathsf {w}}_r, d^{\mathsf {h}}_r, d^{\mathsf {g}}_r) p(\mathbf {x}_r \, \vert \,\omega ) p(\mathbf {q}_r \, \vert \,\omega ) p(\mathbf {p}_r \, \vert \,\varvec{\xi }_r, \varvec{\chi }, \mathbf {x}_{-r}, \omega ) p(\mathbf {y}_r \, \vert \,\varvec{\xi }_r, \varvec{\chi }, \mathbf {x}_{-r}, \omega )\), by conditional independence of the state variables given the context variables. We condition on \(\omega \) as it encodes track length probability. Our overall state prior includes an energy function that makes trajectory intersection unlikely, which is better for inference than a simple constraint (details omitted). Excluding the energy function, the overall prior is: \(p(\mathbf {z}\, \vert \,\varvec{\xi }, \varvec{\chi }, \omega ) = \prod _{r = 1}^m p(\mathbf {z}_r \, \vert \,\varvec{\xi }_r, \varvec{\chi }, \mathbf {x}_{-r}, \omega )\), where m is the number of people in the scene.

3.3 Camera

We use a standard perspective camera model [23] with the simplifying assumptions used by Del Pero et al. [18]. Specifically, the world coordinate origin is on the ground plane (we use the xz-plane), and the camera center is \((0, \eta , 0)\), with pitch \(\psi \), and focal length f. This simplified camera has unit aspect ratio, and roll, yaw, axis skew, and principal point offset are all zero. We denote the camera parameters as \(C = (\eta , \psi , f)\) and give them vague normal priors whose parameters we set manually.
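
A sketch of this projection under the simplified camera, with illustrative sign conventions:

```python
import numpy as np

def project_point(X_world, eta, psi, f):
    """Sketch of the simplified perspective camera C = (eta, psi, f) of
    Sect. 3.3: camera center at (0, eta, 0), pitch psi about the x-axis, focal
    length f, unit aspect ratio, and zero roll, yaw, skew, and principal point
    offset. Sign conventions here are illustrative assumptions."""
    Xc = np.asarray(X_world, dtype=float) - np.array([0.0, eta, 0.0])
    R = np.array([[1.0, 0.0, 0.0],
                  [0.0, np.cos(psi), -np.sin(psi)],
                  [0.0, np.sin(psi),  np.cos(psi)]])
    x, y, z = R @ Xc                                      # point in camera coordinates
    return np.array([f * x / z, f * y / z])               # perspective division
```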

3.4 Data and Likelihood

We use three sources of evidence: person detectors, face landmarks associated with person detections, and optical flow. A person detector [20] provides bounding boxes \(B_t = \{b_{t1}, \dots , b_{tN_t}\}\), \(t=1, \dots , T\), where \(N_t\) is the number of detections at frame t. We define \(B = \cup _{t=1}^T B_t\) to be the set of all such boxes. We parameterize each box \(b_{tj}\) by \((b_{tj}^{\text {x}}, b_{tj}^{\text {top}}, b_{tj}^{\text {bot}})\), representing the x-coordinate of the center, and the y-coordinates of the top and bottom, respectively.

A face landmark detector [71] provides five 2D points for each face, \(\mathbf {k}_{ti} = (k^1_{ti}, \dots , k^5_{ti})\), representing centers of the eyes, the corners of the mouth, and the tip of the nose, of the ith detection at frame t. We use \(I^k_t = \{\mathbf {k}_{t1}, \dots , \mathbf {k}_{tN}\}\) to represent all face landmarks detected at frame t, and define \(I^k = \{I^k_1, \dots , I^k_T\}\). A dense optical flow estimator [35] provides velocity vectors \(I^f_t = \{v_{t1}, \dots , v_{tN_I}\}\) for each frame \(t = 1, \dots , T-1\), where \(N_I\) is the number of pixels in the frame. We also define \(I = (I^f, I^k)\).

To compute the data likelihood from evidence in 2D frames, we first convert the 3D model to 2D at each time point, by projecting the 3D scene \(\mathbf {z}\) on to the image (via the camera C) as follows.

Model Boxes. For each person r at frame j, we compute a set of points on the surface of their body cylinder and head ellipsoid and project them into the image. We then find a tight bounding box on the image plane, \(h_{rj}\), called the model box. Similarly, using the cylinder and ellipsoid separately, we compute a model body box, \(o_{rj}\), and a model face box, \(g_{rj}\) (see Fig. 3). Using this formulation, we can reason about occlusion in 3D, as we can efficiently compute the non-occluded regions of boxes [13], denoted by \(\widehat{o}_{rj}\) (body) and \(\widehat{g}_{rj}\) (face).

Face Features. We project five face locations on the ellipsoid, representing the centers of the eyes, the corners of the mouth, and the tip of the nose (see Fig. 3). We denote the projected face features by \(\mathbf {m}_{rj} = (m_{rj}^1, \dots , m_{rj}^5)\), using a special value when a feature is not visible to the camera.

Image Plane Motion Directions. We define two 2D direction vectors, called model body vector and model face vector, which represent the 3D motion of the body cylinder (respectively, face ellipsoid) projected onto the image. To compute the model face vector for person r at its jth frame, we pick a visible point on the head ellipsoid and project that point onto the image at frames j and \(j + 1\). Then, the model face vector \(c_{rj}\) is given by the difference between the two projected points. We perform the analogous computation using the body cylinder to get the model body vector \(u_{rj}\).

Likelihood. We define a likelihood function for each of the data sources discussed above, \(p(B \, \vert \,\omega , \mathbf {z}, C)\), \(p(I^f \, \vert \,\mathbf {z}, C)\), and \(p(I^k \, \vert \,\mathbf {z}, C)\). Since B, \(I^f\), and \(I^k\) are conditionally independent given \(\mathbf {z}\) and C (see Fig. 2), the total likelihood function is given by a product of these three functions.

Detection Box Likelihood. We assume each assigned detection box has i.i.d. Laplace-distributed errors with respect to its assigned model box in the x-coordinate of its center and the y-coordinates of its top and bottom. Our likelihood includes a video-specific noise rate for box detections and a detector-specific miss rate, both of which are critical for inferring the number of tracks [13].
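
As a concrete illustration of the box term, a minimal sketch with hypothetical Laplace scales (in practice these, like the noise and miss rates, would be calibrated):

```python
import numpy as np
from scipy.stats import laplace

def log_box_likelihood(det_box, model_box, scales=(5.0, 5.0, 5.0)):
    """Sketch of the likelihood of one assigned detection box: i.i.d. Laplace
    errors on (x_center, y_top, y_bottom) relative to the model box h_rj.
    Boxes are (x, top, bottom) tuples in pixels; the scales are hypothetical."""
    errs = np.asarray(det_box, dtype=float) - np.asarray(model_box, dtype=float)
    return sum(laplace.logpdf(e, scale=s) for e, s in zip(errs, scales))
```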

Face Landmark Likelihood. We associate landmark \(\mathbf {k}_{ti}\) with person r at frame t if its centroid is near the center of model face box \(g_{rt}\). Then, we assume a Gaussian noise model around each of the model face features \(\mathbf {m}_{rj}\). Specifically, for every \(\mathbf {k} \in I^k\), \(k^i \sim \mathcal {N}(m^i_{rj}, \Sigma _{I^k}^i)\) for \(i = 1, \dots , 5\), where \(m^i_{rj}\) is the model face feature assigned to \(k^i\). Assuming independence of all landmarks, we get a landmark likelihood of

$$\begin{aligned} p(I^k \, \vert \,\mathbf {z}, C) = \prod _{\mathbf {k} \in I^k} p(\mathbf {k} \, \vert \,\mathbf {m}(\mathbf {k})), \end{aligned}$$
(1)

where \(\mathbf {m}(\mathbf {k})\) is the predicted face feature for landmark \(\mathbf {k}\). Because we link faces to boxes, noisy detections are not relevant. However, the probability of missing a face detection, conditioned on the model (and box), depends strongly on whether the face is frontal or sufficiently in profile that only one eye is visible. Hence, we calibrate the miss rate for these two cases using held-out data.

Optical Flow Likelihood. We place a Laplace distribution on the difference between the non-occluded model body vectors and the average optical flow in the corresponding model body box, and similarly for model face vectors [13].
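
A sketch of one such flow term, assuming the non-occluded box is given in pixel coordinates and using a hypothetical Laplace scale:

```python
import numpy as np
from scipy.stats import laplace

def log_flow_likelihood(flow, box, model_vector, scale=1.0):
    """Sketch of the optical flow term for one person at one frame: a Laplace
    penalty on the difference between the model body (or face) vector and the
    average flow inside the corresponding non-occluded model box.
    `flow` is an (H, W, 2) array of per-pixel velocities and `box` is
    (x1, y1, x2, y2) in pixels; `scale` is a hypothetical value."""
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    avg_flow = flow[y1:y2, x1:x2].reshape(-1, 2).mean(axis=0)
    diff = avg_flow - np.asarray(model_vector, dtype=float)
    return laplace.logpdf(diff, scale=scale).sum()
```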

4 Inference

We wish to find the MAP estimate of \(\omega \) as a good solution to the data association problem. In addition, we need to infer the camera parameters C, and the association prior parameters \(\gamma = (\kappa , \theta , \lambda _N)\), which we want to be video specific. We add to this block of parameters, which do not vary in dimension, the discrete VFoA variables \(\varvec{\xi }\). Hence, we seek \((\omega , \gamma , C, \varvec{\xi })\) that maximizes the posterior

$$\begin{aligned} p(\omega , \gamma , C, \varvec{\xi }\, \vert \,B, I) \propto p(\omega \, \vert \, \gamma ) p(\gamma ) p(C) p(\varvec{\xi }\, \vert \,\omega ) p(B, I \, \vert \,\omega , C, \varvec{\xi }), \end{aligned}$$
(2)

where the marginal data likelihood \(p(B, I \, \vert \,\omega , C, \varvec{\xi })\) is given by

$$\begin{aligned} \int p(B \, \vert \,\omega , \mathbf {z}, C) p(I \, \vert \,\mathbf {z}, C) p(\mathbf {z}\, \vert \,\varvec{\xi }, \varvec{\chi }, \omega ) p(\varvec{\chi }) \, \mathrm {d}\varvec{\chi }\,\mathrm {d}\mathbf {z}. \end{aligned}$$
(3)

4.1 Block Sampling over \(\gamma \), \(\omega \), C, and \(\varvec{\xi }\)

Since expression (2) has no closed form, we approximate its maximum using MCMC block sampling, which successively draws samples from the conditional distributions \(p(\gamma \, \vert \,\omega )\), \(p(\omega \, \vert \,\gamma , \varvec{\xi }, C, B, I)\), \(p(C \, \vert \,\omega , \varvec{\xi }, B, I)\), and \(p(\varvec{\xi }\, \vert \,\omega , C, B, I)\). During sampling, we are required to evaluate the posterior (2), which contains the integral in expression (3). Since this integral cannot be performed analytically, nor can it be computed numerically due to the high dimensionality of \((\mathbf {z}, \varvec{\chi })\), we estimate its value using the Laplace-Metropolis approximation [24]. This approximation requires obtaining the best 3D scene \((\mathbf {z}^*, \varvec{\chi }^*)\) with respect to the posterior distribution \(p(\mathbf {z}, \varvec{\chi }\, \vert \,B, I, \omega , C, \varvec{\xi })\), which we estimate using MCMC (see Sect. 4.2), keeping track of the best scene across samples.
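
For reference, the Laplace approximation we rely on has the following form, assuming the mode \((\mathbf {z}^*, \varvec{\chi }^*)\) and the negative Hessian there (or, as in Laplace-Metropolis, a covariance estimated from the posterior samples) are available:

```python
import numpy as np

def laplace_log_marginal(log_post_at_mode, neg_hessian_at_mode):
    """Sketch of the Laplace(-Metropolis) estimate of the log of the integral
    in Eq. (3): log I ~= h(theta*) + (d/2) log(2*pi) - (1/2) log |H|, where
    h is the unnormalized log posterior over (z, chi), theta* its mode, and H
    the negative Hessian of h at theta* (equivalently, +(1/2) log |Sigma| when
    a posterior covariance Sigma is estimated from the MCMC samples)."""
    d = neg_hessian_at_mode.shape[0]
    sign, logdet = np.linalg.slogdet(neg_hessian_at_mode)
    assert sign > 0, "the negative Hessian at the mode should be positive definite"
    return log_post_at_mode + 0.5 * d * np.log(2.0 * np.pi) - 0.5 * logdet
```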

We use Gibbs sampling to draw samples of the association parameters \(\gamma \) directly from the conditional posterior \(p(\gamma \, \vert \,\omega )\), an extension of the MCMCDA algorithm [40] to sample values for \(\omega \) from \(p(\omega \, \vert \,\gamma , \varvec{\xi }, C, B, I)\) [13], and random-walk Metropolis-Hastings (MH) to draw samples of the camera parameters \(\eta \), \(\psi \), and f from the distribution \(p(C \, \vert \,\omega , \varvec{\xi }, B, I)\).

We also use MH to sample \(\varvec{\xi }\) from \(p(\varvec{\xi }\, \vert \,\omega , C, B, I)\) using the following proposal mechanism. For each person r in the scene, at each frame j, we find the set of objects or persons in the current scene estimate \((\mathbf {z}^*, \varvec{\chi }^*)\) that intersect (up to a threshold) with person r’s gaze vector. Then, we build a distribution over these objects, which is biased towards the closer ones, as well as the VFoA in the previous frame. We draw a sample from this distribution and assign it to \(\varvec{\xi }_{rj}\). We then accept or reject the sample using the standard MH acceptance probability.
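
A sketch of this proposal for one person at one frame; the distance weighting and the bonus for the previous frame’s VFoA are hypothetical choices standing in for the bias toward closer targets and temporal persistence:

```python
import numpy as np

def propose_vfoa(distances, prev_vfoa, temperature=1.0, prev_bonus=2.0):
    """Sketch of the VFoA proposal. `distances` maps candidate target ids
    (objects or people whose current 3D estimate intersects the gaze vector up
    to a threshold) to their distance from the person; candidates that are
    closer, or that match the previous frame's VFoA, are proposed more often."""
    ids = list(distances.keys())
    weights = np.array([np.exp(-distances[i] / temperature) for i in ids])
    for n, i in enumerate(ids):
        if i == prev_vfoa:
            weights[n] *= prev_bonus                      # encourage temporal persistence
    weights /= weights.sum()
    return np.random.choice(ids, p=weights)               # then accept/reject with MH
```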

4.2 Estimating \((\mathbf {z}^*, \varvec{\chi }^*)\)

To approximate the MAP estimate \((\mathbf {z}^*, \varvec{\chi }^*)\), we alternate sampling over \(\mathbf {z}\) and \(\varvec{\chi }\) under the distribution

$$\begin{aligned} p(\mathbf {z}, \varvec{\chi }\, \vert \,B, I, \omega , C, \varvec{\xi }) \propto p(\varvec{\chi }) p(\mathbf {z}\, \vert \,\varvec{\xi }, \varvec{\chi }, \omega ) p(B, I \, \vert \,\mathbf {z}, \varvec{\chi }, \omega , C) . \end{aligned}$$
(4)

To sample over \(\varvec{\chi }\), we use random-walk MH to perturb the position of each interesting point \(\varvec{\chi }_r\). We also perform a birth move to introduce new points into the scene. First, we construct a set of candidate points by intersecting all gaze rays across all frames using the current estimate of the temporal 3D state of the persons in the scene \(\mathbf {z}\) (see Fig. 4). Then, we choose a point from the candidates uniformly at random and add it to \(\varvec{\chi }\). We also use a death move, where we remove an element from \(\varvec{\chi }\) uniformly at random.
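
The candidate construction amounts to (pseudo-)intersecting pairs of 3D gaze rays, as in the sketch below; the gap threshold is a hypothetical value.

```python
import numpy as np

def gaze_intersection_candidate(o1, d1, o2, d2, max_gap=0.5):
    """Sketch of a birth-move candidate: the midpoint of the shortest segment
    between two gaze rays o1 + t1*d1 and o2 + t2*d2 (head positions o, unit
    gaze directions d). Returns None if the rays are parallel, point backwards,
    or pass farther than `max_gap` meters apart."""
    o1, d1, o2, d2 = (np.asarray(v, dtype=float) for v in (o1, d1, o2, d2))
    A = np.array([[d1 @ d1, -(d1 @ d2)],
                  [d1 @ d2, -(d2 @ d2)]])
    b = np.array([(o2 - o1) @ d1, (o2 - o1) @ d2])
    try:
        t1, t2 = np.linalg.solve(A, b)                    # closest-point parameters
    except np.linalg.LinAlgError:
        return None                                       # parallel gazes: no candidate
    p1, p2 = o1 + t1 * d1, o2 + t2 * d2
    if t1 < 0 or t2 < 0 or np.linalg.norm(p1 - p2) > max_gap:
        return None
    return 0.5 * (p1 + p2)
```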

Fig. 4.
figure 4

Proposing static objects. On the left, we show a bird’s eye view of three people with their corresponding gaze vectors at frame 1. The intersection of two of them creates a candidate static object (red circle). On the right, we show frame 100 of the same video, which also contains three subjects generating four additional candidates. The three lighter lines are gazes recorded at previous times. The red circle is a candidate generated solely by gazes in the current frame. The three blue circles are candidates generated by intersecting gazes at the current frame with gazes from the previous frames. Finally, the light red circle is the candidate from frame 1. (Color figure online)

To explore the space of \(\mathbf {z}\), we use an efficient Gaussian process posterior sampling mechanism based on inducing points [55]. The basic idea is to construct a proposal distribution by drawing samples from the conditional GP prior at a set of inducing point locations that provide a low-dimensional representation of the function. We iterate over persons \(r = 1, \dots , m\) and over the different trajectories of each, \(\mathbf {x}_r\), \(\mathbf {q}_r\), \(\mathbf {p}_r\), and \(\mathbf {y}_r\), drawing a sample at each iteration. More specifically, for a given trajectory, say \(\mathbf {q}_r = (q_{r1}, \dots , q_{rl_r})\), we arbitrarily choose a subset of \((1, \dots , l_r)\) as inducing points, denoted by \((j_1, \dots , j_{l'_r})\). Then, for each inducing point \(j_c\), we draw a sample from the conditional GP prior, \(q_{rj_c}' \sim p(q_{rj_c} \, \vert \,\mathbf {q}_{rj_{-c}})\), and a sample from the predictive distribution, \(\mathbf {q}_r' \sim p(\mathbf {q}_r \, \vert \,\mathbf {q}_{rj_{-c}}, q_{rj_c}')\), where \(\mathbf {q}_{rj_{-c}}\) represents \(\mathbf {q}_r\) at the set of inducing points excluding \(j_c\). The sample is accepted or rejected using the MH acceptance ratio, evaluated with only the likelihood function \(p(B, I \, \vert \,\mathbf {z}, \varvec{\chi }, \omega , C)\).
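
A sketch of one such inducing-point move, under an assumed squared-exponential kernel (the GP hyperparameters here are placeholders); for brevity it returns the predictive mean over all frames, whereas a full move would draw from the predictive distribution before the MH accept/reject step.

```python
import numpy as np

def rbf_kernel(t1, t2, scale=1.0, length=10.0):
    """Squared-exponential kernel over frame indices (hypothetical hyperparameters)."""
    return scale ** 2 * np.exp(-0.5 * ((t1[:, None] - t2[None, :]) / length) ** 2)

def propose_trajectory(q, inducing, c, mean=0.0, jitter=1e-6):
    """Sketch of an inducing-point proposal for a trajectory q (e.g. body yaw
    over l_r frames): redraw the value at inducing point inducing[c] from the
    GP conditional given the other inducing points, then recompute the whole
    trajectory from the GP predictive given the updated inducing values."""
    q = np.asarray(q, dtype=float)
    frames = np.arange(len(q), dtype=float)
    u = np.asarray(inducing)
    keep = np.delete(u, c)                                # inducing points except j_c
    q_keep = q[keep] - mean

    K_kk = rbf_kernel(keep.astype(float), keep.astype(float)) + jitter * np.eye(len(keep))
    K_ck = rbf_kernel(np.array([float(u[c])]), keep.astype(float))
    cond_mean = mean + (K_ck @ np.linalg.solve(K_kk, q_keep)).item()
    cond_var = rbf_kernel(np.array([float(u[c])]), np.array([float(u[c])])).item() \
        - (K_ck @ np.linalg.solve(K_kk, K_ck.T)).item()
    q_c_new = np.random.normal(cond_mean, np.sqrt(max(cond_var, jitter)))

    # predictive mean over all frames given the updated inducing values
    u_all = np.concatenate([keep, [u[c]]]).astype(float)
    u_vals = np.concatenate([q_keep, [q_c_new - mean]])
    K_uu = rbf_kernel(u_all, u_all) + jitter * np.eye(len(u_all))
    K_fu = rbf_kernel(frames, u_all)
    return mean + K_fu @ np.linalg.solve(K_uu, u_vals)
```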

5 Evaluation Dataset and Measures

Several datasets exist for evaluating VFoA recognition in meeting scenarios [5, 7, 8, 29, 58, 59]. Since most of the participants in available meeting datasets are seated throughout the videos, these datasets are not well-suited for evaluating our system, which relies on the ability to detect standing people and is targeted at scenarios with a diversity of gaze directions in both pitch and yaw. Similarly, datasets such as the Vernissage Corpus [29], which simulates an art gallery scenario, contain many frames where only the upper bodies of the participants are visible. Data sets with walking persons, on the other hand, uniformly do not encode the 3D locations of what people are looking at. While data sets such as the challenging SALSA [1], cocktail party [68], and coffee break [16] have head pose annotations, this does not suffice for our goals. Thus, we created a new dataset with multiple participants moving freely about while looking at different static targets and each other.

5.1 A New Dataset for 3D Gaze

We captured and annotated six indoor and two outdoor video sequences. Each setting contained several static object locations, some of which were not visible to the camera. Video participants were asked to walk around and look at each other or the stationary objects, indicating when they started and stopped focusing on each target with an audio recording device. All 8 of our videos were between 40 and 90 seconds long, with 3 to 4 people and 5 to 8 objects total (including objects that were not visible). Indoor videos had an image resolution of 1920 \(\times \) 1080. Outdoor video resolution was 1440 \(\times \) 1080 (Fig. 5).

Fig. 5.
figure 5

From left to right, sample frames from two outdoor videos and two indoor videos. The outdoor videos were taken on top of a garage rooftop and within a library courtyard. The indoor videos were shot in a classroom and within a hallway. Each video participant walks inside the scene and records (via an audio recorder) what they are looking at – either another person or a stationary object. All objects in the indoor videos are visible to the camera and can be seen in the frames. Some of the objects in the outdoor videos are not visible to the camera.

Annotation and Ground Truth. We annotated bounding boxes around each target at each frame using the VATIC annotation tool [60]. We then estimated the ground truth for the 3D positions of each target and the camera parameters in each video by minimizing the reprojection error with respect to 3D locations and heights, using the tops and bottoms of the ground truth boxes. We also used the VFoA audio annotations described above to estimate the ground truth head orientations (pitch and yaw) of each person at every frame where the person was looking at a target. To determine the locations of points not visible to the camera, we measured their locations, together with those of visible points, in a shared coordinate system, and then mapped the invisible points into the camera coordinate system.

5.2 Evaluation Measures

Trajectory and Head Pose Evaluation. To evaluate the 3D trajectories of the inferred targets, we first find the best match between the inferred tracks and the ground truth tracks using the Hungarian method with pairwise Euclidean distances. We then use two conventional metrics for tracking: MOTA (for accuracy of the data associations) and MOTP (for precision of the estimated 3D tracks) [50]. Per convention, we set the MOTP threshold to 1 m. To evaluate head pose estimation, we compute the equivalent of MOTP for both yaw and pitch between the inferred head poses and their corresponding ground truth head poses (measured in degrees) at frames in which they are available.

To Evaluate VFoA Estimation, we compare the inferred VFoA of a tracked person to the ground truth VFoA at each frame where it exists. Let \(N_c\) be the number of frames where the VFoA is correctly estimated, \(N_m\) be the number of frames where we fail to infer a VFoA (misses), and \(N_e\) be the number of frames where we infer an incorrect VFoA. We then compute the following three scores for the VFoA estimation: \(\text {accuracy } = N_c / N, \text { mistakes } = N_e / N, \text { missed }= N_m / N,\) where N is the total number of frames that the ground truth for that person records that they were looking at one of the scene VFoA targets. Note that this excludes evaluating the VFoA when the tracked person is transitioning from looking at one target to another. For each video, we compute the average scores over all the tracked persons.
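
These scores can be computed per person as in the following sketch, where an entry of 0 (or None) denotes no inferred focus:

```python
def vfoa_scores(inferred, ground_truth):
    """Sketch of the per-person VFoA scores. `inferred` and `ground_truth` are
    per-frame target ids; frames without a ground-truth target (e.g. during
    transitions between targets) are excluded from N."""
    valid = [(i, g) for i, g in zip(inferred, ground_truth) if g not in (0, None)]
    N = len(valid)
    N_c = sum(1 for i, g in valid if i == g)               # correct
    N_m = sum(1 for i, g in valid if i in (0, None))       # missed
    N_e = N - N_c - N_m                                    # mistakes
    return {"accuracy": N_c / N, "mistakes": N_e / N, "missed": N_m / N}
```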

Evaluating Inferring Interesting Locations. Finally, we evaluate how well we can infer the interesting locations in a scene by first finding the best matching between the inferred interesting locations and the preset ground truth locations using the Hungarian method with a 1 m threshold. We then compute the recall and precision for the inferred interesting locations and their average distance to the ground truth locations.
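
A sketch of this evaluation using the Hungarian method (via scipy's linear_sum_assignment):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def object_discovery_scores(inferred, ground_truth, threshold=1.0):
    """Sketch of the object-discovery evaluation: match inferred 3D locations
    to ground-truth locations with the Hungarian method, count a match as
    correct if within `threshold` meters, and report precision, recall, and
    the mean distance of the correct matches."""
    D = cdist(np.atleast_2d(inferred), np.atleast_2d(ground_truth))
    rows, cols = linear_sum_assignment(D)
    dists = D[rows, cols]
    hits = dists <= threshold
    precision = hits.sum() / len(inferred)
    recall = hits.sum() / len(ground_truth)
    mean_dist = dists[hits].mean() if hits.any() else float("nan")
    return precision, recall, mean_dist
```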

6 Experiments and Results

We ran two sets of experiments to evaluate the performance of our method. We do not compare to others on our main tasks since we are not aware of any relevant published results. We first ran our algorithm and ablated variants on our dataset to assess the impact of different aspects of our approach. We then compared our person tracking performance against our previously published results [13] for people tracking alone, to check the effect of the extensions for gaze tracking and object discovery on basic tracking, using the well-known TUD dataset [2].

Experiments on Our Dataset. We experiment with enabling and disabling inference over three different parts of the model: the 3D head pose \((\mathbf {p}, \mathbf {y})\), the VFoA \(\varvec{\xi }\), and the static objects \(\varvec{\chi }\), and replace each with a baseline algorithm. We denote the entire model MGG (for “multiple gaze geometry”).

When we disable inference over \((\mathbf {p}, \mathbf {y})\), we simply set the head pose to match the walking direction at each frame (MGG-NO-HEAD). When disabling inference over \(\varvec{\xi }\), we set the VFoA of each person at each frame to the first object or person that intersects their gaze ray (MGG-NO-VO). Finally, when turning off inference over \(\varvec{\chi }\), we estimate the static objects by computing a histogram of the intersections of all the 3D gaze directions of all the people across all the frames, then taking the locations of the 5 bins with the most votes (MGG-BASELINE).

Table 1. Performance of different modes of our algorithm on our dataset. Numbers are averaged over eight videos. The first row shows our method with all parts enabled, while the next three rows each shows the algorithm with different aspects disabled, e.g., MGG-NO-HEAD is the stereo gaze algorithm without inferring head pose (see Sect. 6 for details). Each column shows a different evaluation measure. We evaluate using the MOTA (with 1.0 m threshold) and MOTP for distance and angles. For VFoA we use the measures defined in Sect. 5.2.
Table 2. Object discovery performance. Numbers are averaged over eight videos. The algorithms are the same as in Table 1, and the measures are defined in Sect. 5.2. We tabulate performance separately for objects not visible in any frame. The performance here may be favorably biased towards invisible objects because they tended to be behind the camera, and looking at them meant a more frontal image of the viewer, which entails better pose estimation.

Table 1 provides the tracking and head pose estimation results on our dataset. While MOTA and MOTP on position are comparable across all algorithms, the estimated yaw of the head is poor without head pose data. This is not surprising, as the participants in our videos often do not look straight ahead, partly due to the construction of the experiment. By jointly modeling position and pose, we maintain good performance on tracking while obtaining reasonable accuracy of head yaw, surpassing MGG-NO-HEAD by a significant amount (\(\sim 40~^{\circ }\)). The gain for pitch was more modest, but the absolute pitch error was smaller to begin with, being biased by our instructions and our environment. However, this is ecologically valid, as typical viewing angles are not that far from level.

Table 1 also provides the results for the estimated VFoA. On average, we can correctly identify the VFoA target 48% of the time, much better than the baseline (13%), and better than the ablated MGG-NO-VO version (31%). The latter result suggests, perhaps not surprisingly, that learning the 3D locations that people might be looking at provides additional information beyond gaze angles determined from image data alone.

Results for object discovery are shown in Table 2. Here we define success by correctly estimating the location within one meter. We correctly identified 48% of the instances that are available to be identified across the eight videos (recall). In addition, among the ones our method proposes as interesting locations, 59% are correct (precision). The average distance error is a little more than half a meter, which is driven by the choice of the one-meter threshold. Figure 6 shows some example frames of the resulting inferred 3D scene when running the full algorithm (MGG) compared with the baseline (MGG-BASELINE).

Fig. 6.
figure 6

Visualization of the inferred 3D targets in three scene settings. The top row shows a visualization of the results of the baseline algorithm (MGG-BASELINE), in which the yaw of the gaze direction is set based on the walking directions, and the static objects are estimated from the gaze intersections. The bottom row shows the results of the proposed method on the same frames of the same videos. The arrow on the head indicates the gaze direction and the arrow on the body cylinder indicates the body direction. A tracked person’s VFoA is indicated by a line segment from their head connecting to one of the discovered 3D points (yellow spheres) or one of the other tracked people. In the last column, the objects are outside the visible image area. (Color figure online)

Experiments on TUD Benchmark Videos. We compared tracking performance to a similar system for tracking only [13], to evaluate whether incorporating gaze tracking and object inference reduces the tracking performance. We found that we in fact do better on the TUD data, suggesting that the joint inference is helpful (Table 3).

Table 3. Tracking results on the TUD dataset. We compare to [13], which shows that joint inference over additional scene attributes yielded a tracking performance boost as well.

7 Conclusion

We demonstrated the feasibility of discovering interesting visual locations, specified in 3D, from multiple person gazes observed in monocular video. In particular, on a data set developed for the task, we found that we can infer what people are looking at 59% of the time, and where it is to within about 0.58 m. We also found that joint inference over the various scene attributes generally improved the accuracy of the individual estimates. In brief, gaze is both part of scene semantics and can help determine other aspects of scene semantics.