1 Introduction

Observing people interacting with their environment can provide clues about its 3D structure. Facets of this that have been studied within computer vision include inferring functional objects as “dark matter” [64], ground plane paths [30], and modeling human-object interactions for understanding events and participants from RGB-D video [61]. 3D representations enable answering questions that are awkward or not accessible with 2D representations. For example, one might want to ask if there are paths that can be taken that are not visible to security cameras. In this paper, we present a system that infers 3D locations that people look at, including ones not visible to the camera, from monocular, uncalibrated video. For example, we can infer the 3D location of an interesting poster that draws people’s gazes by observing the people passing by (Fig. 1).

Fig. 1.
figure 1

Temporal 3D scene understanding through joint inference of people’s locations, their head poses, and the locations of what they’re looking at. The gaze cones of the red person for the current (red) and previous times (faded red) intersect to help localize a target in 3D on the left wall. The hypothesis that they are looking at the same object from two different views makes this analogous to stereo vision. The blue person adds a third view. Furthermore, the hypothesis that the green person is looking at the red person enriches our understanding of the scene, and can help improve both the estimate of the green person’s head pose and the location of the red person. (Color figure online)

To this end, we develop a fully 3D Bayesian modeling approach that represents where people are, their head poses (and thus approximate gaze directions), and the 3D location they are looking at, which might be one of the other persons that we are tracking or an interesting location that attracts people’s visual attention in the scene. Our model further embodies the camera parameters of an assumed stationary monocular video camera, so that we can infer them rather than rely on having calibrated cameras.

Our joint inference approach is motivated by the following observations: (1) the 3D locations of what people might be looking at can help estimate gaze direction and therefore head pose; (2) other people in the scene are possible targets of visual attention, and if we are tracking them in 3D, joint inference of their locations and the gazes of others should be beneficial; and (3) scenes often contain likely locations of visual attention (e.g., a visually interesting poster), and multiple spatio-temporal gaze cones can help pinpoint them in 3D, analogously to multiple views (Fig. 1). We also make use of the following observations from Brau et al. [13] regarding tracking of people walking on a ground plane: (1) 3D representation simplifies handling occlusions (which become evidence instead of confounds); (2) 3D representation allows for a meaningful prior on velocity (and, here, head-turning angular velocity); and (3) one can infer camera parameters jointly with the scene, as walking people tend to maintain a fixed height and are thus like calibration probes that transport themselves to different depths.

We specify the joint probability of the latent model and the association of person detections across frames (Sect. 3). The data association implies a hypothesis for the number of people in the scene at each point in time. To compare models of differing dimensions in a principled way, we approximately marginalize out all the continuous model parameters. These include the locations of each person, their gaze angles, and the locations of the static points drawing visual attention that we are trying to discover from gazing behavior. We compute these approximate marginals by using MCMC sampling to find the maximum of the distribution and then applying the Laplace approximation. We combine this with multiple MCMC sampling strategies to explore the space of models (Sect. 4).

Because our goals are new, we contribute a modest data set with the 3D locations of what participants are looking at, which is not available in other data sets with people walking about (see Sect. 5 for further discussion). In the contributed data set, participants recorded what they were looking at while they were walking around, and we established the ground truth 3D locations for all targets (people and other objects) using ground truth 2D detections (Sect. 6).

Our Contributions include: (1) operationalizing the observation that multiple gaze angles estimated from head pose can be used to learn 3D locations that people look at; (2) extending the approach proposed by Brau et al. [13] to include head pose, a walking direction prior, and a more efficient sampling approach; (3) joint inference of head pose and 3D location of what people are looking at while walking; (4) inferring who is looking at whom or what (both anonymously defined); and (5) a new data set for what people are looking at while they walk around, and where those objects or people are in 3D.

2 Related Work

Multiple Target Tracking (MOT). Despite significant progress, multiple-target tracking remains a challenge due to issues such as noisy and complex evidence, occlusion, abrupt motion, and an unknown number of targets. This work is in the tracking-by-detection paradigm [3, 4, 9, 13, 17, 31, 37, 44, 46, 54, 66, 69]. Typically, these approaches first acquire the image locations of people in a video sequence, and then find the tracks of each underlying target by solving the data association problem and inferring the target locations. Both 2D and 3D models have been used to represent the underlying targets. Effectively working in 2D requires explicit modeling of occluded targets (e.g., [37, 69]). Conversely, 3D models can treat occlusions and smooth motion naturally [13, 28].

Head Pose Estimation. There is a rich history of methods for estimating head pose from single images (e.g., [11, 12, 21, 22, 25, 26, 33, 34, 38, 39]). In video, information flow between frames has been exploited by a number of researchers (e.g., [6, 57, 65, 70]). More similar to our approach are model-based tracking methods that fit a 3D model to features tracked across a video (e.g., [32, 45, 56, 62, 63]). Head and body pose have also been estimated jointly via correlations between the outputs of body pose and head pose classifiers [14, 15]. In contrast, we model this coupling through a joint distribution on 3D body and head poses.

Head pose is a strong cue for visual focus of attention (VFoA) recognition, which has potential applications such as measuring the attractiveness of advertisements or shop displays in public spaces, as well as analyzing the social dynamics of meetings. Much research in VFoA focuses on dynamic meeting scenarios, where people usually sit around meeting tables while being video-recorded by multiple cameras [5, 7, 8, 19, 42, 43, 51, 52, 53, 58, 59]. Most of these methods exploit context-related information from speech and motion activity, and the set of potential VFoA targets is predefined and discrete, with known locations. In addition, the number of people in the scene is fixed, and they are considered to be seated in typically known locations, which makes sense given the application.

VFoA estimation has also been considered in surveillance settings in the context of understanding behavior [10, 27, 48, 49], where, so far, visual attention has been limited to image coordinates and to one person at a time. Notably, Benfold and Reid [10] use a camera calibrated to the ground plane to estimate a visual attention map representing the amount of attention received by each square meter of the ground in a town center scene. Similar to us, they identify interesting regions in the scene based on the inferred visual attention map. However, while the map can be projected into the video for visualization, 3D location is not inferred.

Another application of estimating VFoA is human-robot interaction, which involves both person-to-person and robot-to-person interactions [36, 47, 67]. Approaches in this domain often assume known head poses (orientations and locations) of the targets (persons, robots, and objects). For example, Massé et al. proposed a switching Kalman filter formulation to jointly estimate the gaze and the VFoA of several persons from observed head poses and object locations [36]. In addition, they assume that the numbers of persons and objects are known and remain constant over time. In contrast, we propose simultaneously inferring the number of targets and their locations in the scene while estimating their VFoAs using image evidence.

3 Statistical Model

Figure 2 shows our generative statistical model for temporal scene understanding using probabilistic graphical modeling notation. The scene consists of multiple people moving on the ground plane throughout the video. At each frame, each person may have their visual attention on another person or on one of several static objects that are located in 3D space. We model the visual focus of attention and the static objects explicitly. At each frame, each person may also generate a detection box, and the data association groups these detection boxes by person (or noise). Finally, we model the camera, which projects the scene onto the image plane, generating the observed data.

Fig. 2.
figure 2

Generative graphical model for temporal scene understanding. We use bold font for aggregate variables (e.g., \(\mathbf {z}\) represents state vectors for each person for each frame). The data association, \(\omega \), specifies the number of people and which detections (body, face) are associated with them. \(\omega \) depends on hyper-parameters collectively denoted by \(\gamma \) (Sect. 3.1). \(\varvec{\chi }\) is the set of static 3D points that people look at. The visual focus of attention (VFoA), \(\varvec{\xi }\), of each person is, for each frame, either one of these 3D points or another person. The temporal scene \(\mathbf {z}\) consists of the 3D state (location, size, head pose) of each person at each frame (Sect. 3.2). \(\mathbf {z}\) projects onto 2D to create model frames via the camera C, generating person detections, B, optical flow, \({I^f}\), and face landmarks, \(I^k\) (Sect. 3.4).

We place prior distributions on each of the model variables mentioned above. Similarly, for each type of data we use, we have a likelihood function that captures its dependence on the model. We combine these functions to get the posterior distribution, which we maximize (see Sect. 4).

3.1 Association

Following previous work [13], we define an association \(\omega = \{\tau _r \subset B\}_{r=0}^m\) to be a partition of B, the set of all detections (body, face) for the entire video. Here, each \(\tau _r\), \(r = 1, \dots , m\), called a track, is the set of detections which are associated with person r, and \(\tau _0\) is the set of spurious detections, generated by a noise process [41]. The prior distribution \(p(\omega )\) has hyperparameters \(\lambda _A\), \(\kappa \), \(\theta \), and \(\lambda _N\), representing the expected detections per person per frame, new tracks per frame, track length, and noise detections per frame [13].
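
To make the role of these hyperparameters concrete, the sketch below evaluates a log-prior of this general shape. The parametric forms (Poisson births and noise counts, geometric track lengths, per-frame Bernoulli detections) and all default values are illustrative assumptions, not the exact construction of [13].

```python
import numpy as np
from scipy.stats import geom, poisson

def log_association_prior(tracks, noise_count, n_frames,
                          lam_A=0.9, kappa=0.1, theta=30.0, lam_N=2.0):
    """Hedged sketch of a log-prior over associations omega.

    Hypothetical parametric choices (for illustration only):
      - new tracks per frame                 ~ Poisson(kappa)
      - track length in frames               ~ Geometric(1 / theta)
      - detection present at a tracked frame ~ Bernoulli(lam_A)
      - spurious detections per frame        ~ Poisson(lam_N)
    `tracks` is a list of (start_frame, end_frame, n_detections) tuples and
    `noise_count` is the total number of detections assigned to tau_0.
    """
    logp = 0.0
    births = np.zeros(n_frames)
    for start, end, n_det in tracks:
        births[start] += 1
        length = end - start + 1
        logp += geom.logpmf(length, 1.0 / theta)                     # track length
        logp += n_det * np.log(lam_A) + (length - n_det) * np.log(1.0 - lam_A)
    logp += poisson.logpmf(births, kappa).sum()                      # births per frame
    logp += poisson.logpmf(noise_count, lam_N * n_frames)            # noise detections
    return logp
```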

3.2 Scene and VFoA

Our 3D scene model consists of a set of moving persons, represented using 3D cylinders and ellipsoids, which we call the temporal scene, and a set of static objects, represented by 3D points. These objects are assumed to command attention from the people in the scene, which we model explicitly for each person at each frame, and call visual focus of attention (VFoA).

Static Objects. The scene contains a set of \(\widehat{m}\) static objects, denoted by \(\varvec{\chi }= (\varvec{\chi }_1, \dots , \varvec{\chi }_{\widehat{m}})\), \(\varvec{\chi }_r \in \mathbb {R}^3\). Since we do not have any prior information regarding their locations, we set a uniform distribution on their positions over the visible 3D space. We model interesting locations as independent from each other by using a joint prior of \(p(\varvec{\chi }) = p(\widehat{m}) \prod _{r=1}^{\widehat{m}} p(\varvec{\chi }_r)\), where \(p(\widehat{m})\) is Poisson.
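
A minimal sketch of evaluating this prior, assuming an axis-aligned bounding volume for the visible space and a hypothetical Poisson rate:

```python
import numpy as np
from scipy.stats import poisson

def log_static_object_prior(chi, scene_bounds, rate=3.0):
    """Sketch of p(chi) = p(m_hat) * prod_r p(chi_r): Poisson on the number of
    static objects and a uniform density over the visible 3D volume.
    `chi` is an (m_hat, 3) array of 3D points; `scene_bounds` is a sequence of
    (lo, hi) pairs, one per axis; `rate` is a hypothetical Poisson mean."""
    m_hat = len(chi)
    volume = np.prod([hi - lo for lo, hi in scene_bounds])
    return poisson.logpmf(m_hat, rate) - m_hat * np.log(volume)
```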

Visual Focus of Attention (VFoA). The scene also contains m people, one for each association track \(\tau _r \in \omega \). Each person has a VFoA at each frame that encodes who or what they are observing, if anything. We use \(\xi _{rj} \in \{0, \dots , m + \widehat{m}\}\) to denote the VFoA of person r at frame j, e.g., \(\xi _{rj} = r'\) indicates person r is looking at person or object \(r'\) at frame j, where values of \(1 \le \xi _{rj} \le m\) indicate focus on a person, \(m < \xi _{rj} \le m + \widehat{m}\) on an object, and \(\xi _{rj} = 0\) indicates no focus. A priori, people tend to focus on the same visual target in consecutive frames, and we set a simple Markov prior on \(\varvec{\xi }_r = (\xi _{r1}, \dots , \xi _{r l_r})\), where \(\xi _{rj} = \xi _{r,j-1}\) with high probability. The prior for the entire VFoA set is \(p(\varvec{\xi }\, \vert \,\omega ) = \prod _{r=1}^m p(\varvec{\xi }_r)\).
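
The following sketch shows how the prior on one person’s VFoA sequence might be evaluated; the stay probability and the uniform-switch assumption are illustrative choices, not values from the model.

```python
import numpy as np

def log_vfoa_prior(xi_r, n_targets, p_stay=0.95):
    """Sketch of the Markov prior on a VFoA sequence xi_r with entries in
    {0, ..., n_targets}: with probability p_stay the focus is unchanged
    between consecutive frames; otherwise it switches uniformly at random."""
    xi_r = np.asarray(xi_r)
    logp = -np.log(n_targets + 1)                         # uniform initial focus
    same = xi_r[1:] == xi_r[:-1]
    logp += same.sum() * np.log(p_stay)
    logp += (~same).sum() * (np.log(1.0 - p_stay) - np.log(n_targets))
    return logp
```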

Temporal Scene. Each person r has temporal 3D state \(\mathbf {w}_r = (\mathbf {w}_{r1}, \dots , \mathbf {w}_{rl_r})\), where each single-frame state consists of the person’s ground-plane position \(\mathbf {x}_{rj} \in \mathbb {R}^2\), body yaw \(q_{rj}\), head pitch \(p_{rj}\), and head yaw \(y_{rj}\), so that \(\mathbf {w}_{rj} = (\mathbf {x}_{rj}, q_{rj}, p_{rj}, y_{rj})\), \(j=1, \dots , l_r\). Importantly, the head yaw \(y_{rj}\) is measured relative to the body yaw \(q_{rj}\), i.e., \(y_{rj} = 0\) when person r at frame j is looking straight ahead. Additionally, each person has three size dimensions: width, height, and thickness, denoted by \(d^{\mathsf {w}}_r\), \(d^{\mathsf {h}}_r\), and \(d^{\mathsf {g}}_r\). We will denote the full 3D configuration of track \(\tau _r\) by \(\mathbf {z}_r = (\mathbf {w}_r, d^{\mathsf {w}}_r, d^{\mathsf {h}}_r, d^{\mathsf {g}}_r)\). Conceptually, at any given frame j, this can be thought of as a \(d^{\mathsf {w}}_r \times d^{\mathsf {h}}_r \times d^{\mathsf {g}}_r\) cylinder whose “front” side is oriented at angle \(q_{rj}\), with an ellipsoid on top that has a pitch of \(p_{rj}\) and a yaw of \(y_{rj}\) (Fig. 3).

We call \(\mathbf {x}_r = (\mathbf {x}_{r1}, \dots , \mathbf {x}_{r l_r})\) the trajectory of person r, and place a Gaussian process (GP) prior on it to promote smoothness. We use analogous definitions for the body angle trajectory \(\mathbf {q}_r\), the head pitch trajectory \(\mathbf {p}_r\), and the head yaw trajectory \(\mathbf {y}_r\) (e.g., for body angle, \(\mathbf {q}_r = (q_{r1}, \dots , q_{r l_r})\)). We use similar smooth GP priors for these trajectories. Importantly, the priors on the head angle trajectories \(\mathbf {p}_r\) and \(\mathbf {y}_r\) depend on which objects they observe, encoded by \(\varvec{\xi }_r\), and their locations, which are contained in \(\varvec{\chi }\) and \(\mathbf {x}_{-r}\) (all trajectories except \(\mathbf {x}_r\)); e.g., for head pitch, \(p(\mathbf {p}_r \, \vert \,\varvec{\xi }_r, \varvec{\chi }, \mathbf {x}_{-r})\). We express this dependence by setting the mean of the GP prior to an angle pointing in the direction of the observed object, if any, at each frame.
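
As an illustration, the per-frame GP mean for the head angles can be computed as the angles that point from a person’s head toward the attended target, as in the sketch below; the coordinate conventions and the `head_height` stand-in for eye level are assumptions.

```python
import numpy as np

def gaze_mean_angles(person_xz, body_yaw, head_height, target_xyz):
    """Sketch of the GP prior mean for head yaw and pitch at one frame: the
    angles pointing from the person's head toward the attended 3D target.
    World coordinates follow Sect. 3.3 (ground = xz-plane, y up); the returned
    yaw is relative to the body yaw, matching the parameterization of w_rj."""
    dx = target_xyz[0] - person_xz[0]
    dz = target_xyz[2] - person_xz[1]
    dy = target_xyz[1] - head_height
    yaw_world = np.arctan2(dx, dz)                        # yaw about the vertical axis
    pitch = np.arctan2(dy, np.hypot(dx, dz))              # elevation toward the target
    yaw_rel = (yaw_world - body_yaw + np.pi) % (2 * np.pi) - np.pi  # wrap to (-pi, pi]
    return yaw_rel, pitch
```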

Fig. 3.
figure 3

3D model for a person (left) and its projection into the image plane (right). Person r at time (frame) j consists of a cylinder at position \(\mathbf {x}_{rj}\), of width \(d^{\mathsf {w}}_r\), height \(d^{\mathsf {h}}_r\), and thickness \(d^{\mathsf {g}}_r\) (not illustrated), with body angle \(q_{rj}\) (the black stripe on the cylinder represents its “front”) relative to the z-axis of the world. Further, person r’s head, represented by the ellipsoid, has yaw \(y_{rj}\) relative to the front of the cylinder and pitch \(p_{rj}\), indicated by the red arc. Its projection under camera C yields three boxes: model box \(h_{rj}\), model body box \(o_{rj}\), and model face box \(g_{rj}\).

The prior over a person’s full physical state, \(p(\mathbf {z}_r \, \vert \,\varvec{\xi }_r, \varvec{\chi }, \mathbf {x}_{-r}, \omega )\), expands to \(p(d^{\mathsf {w}}_r, d^{\mathsf {h}}_r, d^{\mathsf {g}}_r) p(\mathbf {x}_r \, \vert \,\omega ) p(\mathbf {q}_r \, \vert \,\omega ) p(\mathbf {p}_r \, \vert \,\varvec{\xi }_r, \varvec{\chi }, \mathbf {x}_{-r}, \omega ) p(\mathbf {y}_r \, \vert \,\varvec{\xi }_r, \varvec{\chi }, \mathbf {x}_{-r}, \omega )\), by conditional independence of the state variables given the context variables. We condition on \(\omega \) as it encodes track length probability. Our overall state prior includes an energy function that makes trajectory intersection unlikely, which is better for inference than a simple constraint (details omitted). Excluding the energy function, the overall prior is: \(p(\mathbf {z}\, \vert \,\varvec{\xi }, \varvec{\chi }, \omega ) = \prod _{r = 1}^m p(\mathbf {z}_r \, \vert \,\varvec{\xi }_r, \varvec{\chi }, \mathbf {x}_{-r}, \omega )\), where m is the number of people in the scene.

3.3 Camera

We use a standard perspective camera model [23] with the simplifying assumptions used by Del Pero et al. [18]. Specifically, the world coordinate origin is on the ground plane (we use the xz-plane), and the camera center is \((0, \eta , 0)\), with pitch \(\psi \), and focal length f. This simplified camera has unit aspect ratio, and roll, yaw, axis skew, and principal point offset are all zero. We denote the camera parameters as \(C = (\eta , \psi , f)\) and give them vague normal priors whose parameters we set manually.
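
A sketch of this projection under the simplified camera, with illustrative sign conventions:

```python
import numpy as np

def project_point(X_world, eta, psi, f):
    """Sketch of the simplified perspective camera C = (eta, psi, f) of
    Sect. 3.3: camera center at (0, eta, 0), pitch psi about the x-axis, focal
    length f, unit aspect ratio, and zero roll, yaw, skew, and principal point
    offset. Sign conventions here are illustrative assumptions."""
    Xc = np.asarray(X_world, dtype=float) - np.array([0.0, eta, 0.0])
    R = np.array([[1.0, 0.0, 0.0],
                  [0.0, np.cos(psi), -np.sin(psi)],
                  [0.0, np.sin(psi),  np.cos(psi)]])
    x, y, z = R @ Xc                                      # point in camera coordinates
    return np.array([f * x / z, f * y / z])               # perspective division
```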

3.4 Data and Likelihood

We use three sources of evidence: person detectors, face landmarks associated with person detections, and optical flow. A person detector [20] provides bounding boxes \(B_t = \{b_{t1}, \dots , b_{tN_t}\}\), \(t=1, \dots , T\), where \(N_t\) is the number of detections at frame t. We define \(B = \cup _{t=1}^T B_t\) to be the set of all such boxes. We parameterize each box \(b_{tj}\) by \((b_{tj}^{\text {x}}, b_{tj}^{\text {top}}, b_{tj}^{\text {bot}})\), representing the x-coordinate of the center, and the y-coordinates of the top and bottom, respectively.

A face landmark detector [71] provides five 2D points for each face, \(\mathbf {k}_{ti} = (k^1_{ti}, \dots , k^5_{ti})\), representing centers of the eyes, the corners of the mouth, and the tip of the nose, of the ith detection at frame t. We use \(I^k_t = \{\mathbf {k}_{t1}, \dots , \mathbf {k}_{tN}\}\) to represent all face landmarks detected at frame t, and define \(I^k = \{I^k_1, \dots , I^k_T\}\). A dense optical flow estimator [35] provides velocity vectors \(I^f_t = \{v_{t1}, \dots , v_{tN_I}\}\) for each frame \(t = 1, \dots , T-1\), where \(N_I\) is the number of pixels in the frame. We also define \(I = (I^f, I^k)\).

To compute the data likelihood from evidence in 2D frames, we first convert the 3D model to 2D at each time point, by projecting the 3D scene \(\mathbf {z}\) on to the image (via the camera C) as follows.

Model Boxes. For each person r at frame j, we compute a set of points on the surface of their body cylinder and head ellipsoid and project them into the image. We then find a tight bounding box on the image plane, \(h_{rj}\), called the model box. Similarly, using the cylinder and ellipsoid separately, we compute a model body box, \(o_{rj}\), and a model face box, \(g_{rj}\) (see Fig. 3). Using this formulation, we can reason about occlusion in 3D, as we can efficiently compute the non-occluded regions of boxes [13], denoted by \(\widehat{o}_{rj}\) (body) and \(\widehat{g}_{rj}\) (face).

Face Features. We project five face locations on the ellipsoid, representing the centers of the eyes, the corners of the mouth, and the tip of the nose (see Fig. 3). We denote the projected face features by \(\mathbf {m}_{rj} = (m_{rj}^1, \dots , m_{rj}^5)\), using a special value when a feature is not visible to the camera.

Image Plane Motion Directions. We define two 2D direction vectors, called model body vector and model face vector, which represent the 3D motion of the body cylinder (respectively, face ellipsoid) projected onto the image. To compute the model face vector for person r at its jth frame, we pick a visible point on the head ellipsoid and project that point onto the image at frames j and \(j + 1\). Then, the model face vector \(c_{rj}\) is given by the difference between the two projected points. We perform the analogous computation using the body cylinder to get the model body vector \(u_{rj}\).

Likelihood. We define a likelihood function for each of the data sources discussed above, \(p(B \, \vert \,\omega , \mathbf {z}, C)\), \(p(I^f \, \vert \,\mathbf {z}, C)\), and \(p(I^k \, \vert \,\mathbf {z}, C)\). Since B, \(I^f\), and \(I^k\) are conditionally independent given \(\mathbf {z}\) and C (see Fig. 2), the total likelihood function is given by a product of these three functions.

Detection Box Likelihood. We assume each assigned detection box has i.i.d. Laplace-distributed errors with respect to its assigned model box in the x-coordinate of its center and the y-coordinates of its top and bottom. Our likelihood includes a video-specific noise rate for box detections and a detector-specific miss rate, both of which are critical for inferring the number of tracks [13].
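
As a concrete illustration of the box term, a minimal sketch with hypothetical Laplace scales (in practice these, like the noise and miss rates, would be calibrated):

```python
import numpy as np
from scipy.stats import laplace

def log_box_likelihood(det_box, model_box, scales=(5.0, 5.0, 5.0)):
    """Sketch of the likelihood of one assigned detection box: i.i.d. Laplace
    errors on (x_center, y_top, y_bottom) relative to the model box h_rj.
    Boxes are (x, top, bottom) tuples in pixels; the scales are hypothetical."""
    errs = np.asarray(det_box, dtype=float) - np.asarray(model_box, dtype=float)
    return sum(laplace.logpdf(e, scale=s) for e, s in zip(errs, scales))
```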

Face Landmark Likelihood. We associate landmark \(\mathbf {k}_{ti}\) with person r at frame t if its centroid is near the center of model face box \(g_{rt}\). Then, we assume a Gaussian noise model around each of the model face features \(\mathbf {m}_{rj}\). Specifically, for every \(\mathbf {k} \in I^k\), \(k^i \sim \mathcal {N}(m^i_{rj}, \Sigma _{I^k}^i)\) for \(i = 1, \dots , 5\), where \(m^i_{rj}\) is the model face feature assigned to \(k^i\). Assuming independence of all landmarks, we get a landmark likelihood of

$$\begin{aligned} p(I^k \, \vert \,\mathbf {z}, C) = \prod _{\mathbf {k} \in I^k} p(\mathbf {k} \, \vert \,\mathbf {m}(\mathbf {k})), \end{aligned}$$
(1)

where \(\mathbf {m}(\mathbf {k})\) is the predicted face feature for landmark \(\mathbf {k}\). Because we link faces to boxes, noisy detections are not relevant. However, the probability of missing a face detection, conditioned on the model (and box), depends strongly on whether the face is frontal or sufficiently in profile that only one eye is visible. Hence, we calibrate the miss rate for these two cases using held-out data.

Optical Flow Likelihood. We place a Laplace distribution on the difference between the non-occluded model body vectors and the average optical flow in the corresponding model body box, and similarly for model face vectors [13].
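
A sketch of one such flow term, assuming the non-occluded box is given in pixel coordinates and using a hypothetical Laplace scale:

```python
import numpy as np
from scipy.stats import laplace

def log_flow_likelihood(flow, box, model_vector, scale=1.0):
    """Sketch of the optical flow term for one person at one frame: a Laplace
    penalty on the difference between the model body (or face) vector and the
    average flow inside the corresponding non-occluded model box.
    `flow` is an (H, W, 2) array of per-pixel velocities and `box` is
    (x1, y1, x2, y2) in pixels; `scale` is a hypothetical value."""
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    avg_flow = flow[y1:y2, x1:x2].reshape(-1, 2).mean(axis=0)
    diff = avg_flow - np.asarray(model_vector, dtype=float)
    return laplace.logpdf(diff, scale=scale).sum()
```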

4 Inference

We wish to find the MAP estimate of \(\omega \) as a good solution to the data association problem. In addition, we need to infer the camera parameters C, and the association prior parameters \(\gamma = (\kappa , \theta , \lambda _N)\), which we want to be video specific. We add to this block of parameters, which do not vary in dimension, the discrete VFoA variables \(\varvec{\xi }\). Hence, we seek \((\omega , \gamma , C, \varvec{\xi })\) that maximizes the posterior

$$\begin{aligned} p(\omega , \gamma , C, \varvec{\xi }\, \vert \,B, I) \propto p(\omega \, \vert \, \gamma ) p(\gamma ) p(C) p(\varvec{\xi }\, \vert \,\omega ) p(B, I \, \vert \,\omega , C, \varvec{\xi }), \end{aligned}$$
(2)

where the marginal data likelihood \(p(B, I \, \vert \,\omega , C, \varvec{\xi })\) is given by

$$\begin{aligned} \int p(B \, \vert \,\omega , \mathbf {z}, C) p(I \, \vert \,\mathbf {z}, C) p(\mathbf {z}\, \vert \,\varvec{\xi }, \varvec{\chi }, \omega ) p(\varvec{\chi }) \, \mathrm {d}\varvec{\chi }\,\mathrm {d}\mathbf {z}. \end{aligned}$$
(3)

4.1 Block Sampling over \(\gamma \), \(\omega \), C, and \(\varvec{\xi }\)

Since expression (2) has no closed form, we approximate its maximum using MCMC block sampling, which successively draws samples from the conditional distributions \(p(\gamma \, \vert \,\omega )\), \(p(\omega \, \vert \,\gamma , \varvec{\xi }, C, B, I)\), \(p(C \, \vert \,\omega , \varvec{\xi }, B, I)\), and \(p(\varvec{\xi }\, \vert \,\omega , C, B, I)\). During sampling, we are required to evaluate the posterior (2), which contains the integral in expression (3). Since this integral cannot be performed analytically, nor can it be computed numerically due to the high dimensionality of \((\mathbf {z}, \varvec{\chi })\), we estimate its value using the Laplace-Metropolis approximation [24]. This approximation requires obtaining the best 3D scene \((\mathbf {z}^*, \varvec{\chi }^*)\) with respect to the posterior distribution \(p(\mathbf {z}, \varvec{\chi }\, \vert \,B, I, \omega , C, \varvec{\xi })\), which we estimate using MCMC (see Sect. 4.2), keeping track of the best scene across samples.
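
For reference, the Laplace approximation we rely on has the following form, assuming the mode \((\mathbf {z}^*, \varvec{\chi }^*)\) and the negative Hessian there (or, as in Laplace-Metropolis, a covariance estimated from the posterior samples) are available:

```python
import numpy as np

def laplace_log_marginal(log_post_at_mode, neg_hessian_at_mode):
    """Sketch of the Laplace(-Metropolis) estimate of the log of the integral
    in Eq. (3): log I ~= h(theta*) + (d/2) log(2*pi) - (1/2) log |H|, where
    h is the unnormalized log posterior over (z, chi), theta* its mode, and H
    the negative Hessian of h at theta* (equivalently, +(1/2) log |Sigma| when
    a posterior covariance Sigma is estimated from the MCMC samples)."""
    d = neg_hessian_at_mode.shape[0]
    sign, logdet = np.linalg.slogdet(neg_hessian_at_mode)
    assert sign > 0, "the negative Hessian at the mode should be positive definite"
    return log_post_at_mode + 0.5 * d * np.log(2.0 * np.pi) - 0.5 * logdet
```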

We use Gibbs sampling to draw samples of the association parameters \(\gamma \) directly from the conditional posterior \(p(\gamma \, \vert \,\omega )\), an extension of the MCMCDA algorithm [40] to sample values for \(\omega \) from \(p(\omega \, \vert \,\gamma , \varvec{\xi }, C, B, I)\) [13], and random-walk Metropolis-Hastings (MH) to draw samples of the camera parameters \(\eta \), \(\psi \), and f from the distribution \(p(C \, \vert \,\omega , \varvec{\xi }, B, I)\).

We also use MH to sample \(\varvec{\xi }\) from \(p(\varvec{\xi }\, \vert \,\omega , C, B, I)\) using the following proposal mechanism. For each person r in the scene, at each frame j, we find the set of objects or persons in the current scene estimate \((\mathbf {z}^*, \varvec{\chi }^*)\) that intersect (up to a threshold) with person r’s gaze vector. Then, we build a distribution over these objects, which is biased towards the closer ones, as well as the VFoA in the previous frame. We draw a sample from this distribution and assign it to \(\varvec{\xi }_{rj}\). We then accept or reject the sample using the standard MH acceptance probability.
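
A sketch of this proposal for one person at one frame; the distance weighting and the bonus for the previous frame’s VFoA are hypothetical choices standing in for the bias toward closer targets and temporal persistence:

```python
import numpy as np

def propose_vfoa(distances, prev_vfoa, temperature=1.0, prev_bonus=2.0):
    """Sketch of the VFoA proposal. `distances` maps candidate target ids
    (objects or people whose current 3D estimate intersects the gaze vector up
    to a threshold) to their distance from the person; candidates that are
    closer, or that match the previous frame's VFoA, are proposed more often."""
    ids = list(distances.keys())
    weights = np.array([np.exp(-distances[i] / temperature) for i in ids])
    for n, i in enumerate(ids):
        if i == prev_vfoa:
            weights[n] *= prev_bonus                      # encourage temporal persistence
    weights /= weights.sum()
    return np.random.choice(ids, p=weights)               # then accept/reject with MH
```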

4.2 Estimating \((\mathbf {z}^*, \varvec{\chi }^*)\)

To approximate the MAP estimate \((\mathbf {z}^*, \varvec{\chi }^*)\), we alternate sampling over \(\mathbf {z}\) and \(\varvec{\chi }\) under the distribution

$$\begin{aligned} p(\mathbf {z}, \varvec{\chi }\, \vert \,B, I, \omega , C, \varvec{\xi }) \propto p(\varvec{\chi }) p(\mathbf {z}\, \vert \,\varvec{\xi }, \varvec{\chi }, \omega ) p(B, I \, \vert \,\mathbf {z}, \varvec{\chi }, \omega , C) . \end{aligned}$$
(4)

To sample over \(\varvec{\chi }\), we use random-walk MH to perturb the position of each interesting point \(\varvec{\chi }_r\). We also perform a birth move to introduce new points into the scene. First, we construct a set of candidate points by intersecting all gaze rays across all frames using the current estimate of the temporal 3D state of the persons in the scene \(\mathbf {z}\) (see Fig. 4). Then, we choose a point from the candidates uniformly at random and add it to \(\varvec{\chi }\). We also use a death move, where we remove an element from \(\varvec{\chi }\) uniformly at random.
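
The candidate construction amounts to (pseudo-)intersecting pairs of 3D gaze rays, as in the sketch below; the gap threshold is a hypothetical value.

```python
import numpy as np

def gaze_intersection_candidate(o1, d1, o2, d2, max_gap=0.5):
    """Sketch of a birth-move candidate: the midpoint of the shortest segment
    between two gaze rays o1 + t1*d1 and o2 + t2*d2 (head positions o, unit
    gaze directions d). Returns None if the rays are parallel, point backwards,
    or pass farther than `max_gap` meters apart."""
    o1, d1, o2, d2 = (np.asarray(v, dtype=float) for v in (o1, d1, o2, d2))
    A = np.array([[d1 @ d1, -(d1 @ d2)],
                  [d1 @ d2, -(d2 @ d2)]])
    b = np.array([(o2 - o1) @ d1, (o2 - o1) @ d2])
    try:
        t1, t2 = np.linalg.solve(A, b)                    # closest-point parameters
    except np.linalg.LinAlgError:
        return None                                       # parallel gazes: no candidate
    p1, p2 = o1 + t1 * d1, o2 + t2 * d2
    if t1 < 0 or t2 < 0 or np.linalg.norm(p1 - p2) > max_gap:
        return None
    return 0.5 * (p1 + p2)
```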

Fig. 4.
figure 4

Proposing static objects. On the left, we show a bird’s eye view of three people with their corresponding gaze vectors at frame 1. The intersection of two of them creates a candidate static object (red circle). On the right, we show frame 100 of the same video, which also contains three subjects generating four additional candidates. The three lighter lines are gazes recorded at previous times. The red circle is a candidate generated solely by gazes in the current frame. The three blue circles are candidates generated by intersecting gazes at the current frame with gazes from the previous frames. Finally, the light red circle is the candidate from frame 1. (Color figure online)

To explore the space of \(\mathbf {z}\), we use an efficient Gaussian process posterior sampling mechanism based on inducing points [55]. The basic idea is to construct a proposal distribution by drawing samples from the conditional GP prior at a set of inducing point locations that provide a low-dimensional representation of the function. We iterate over persons \(r = 1, \dots , m\) and over the different trajectories of each, \(\mathbf {x}_r\), \(\mathbf {q}_r\), \(\mathbf {p}_r\), and \(\mathbf {y}_r\), drawing a sample at each iteration. More specifically, for a given trajectory, say \(\mathbf {q}_r = (q_{r1}, \dots , q_{rl_r})\), we arbitrarily choose a subset of \((1, \dots , l_r)\) as inducing points, denoted by \((j_1, \dots , j_{l'_r})\). Then, for each inducing point \(j_c\), we draw a sample from the conditional GP prior, \(q_{rj_c}' \sim p(q_{rj_c} \, \vert \,\mathbf {q}_{rj_{-c}})\), and a sample from the predictive distribution, \(\mathbf {q}_r' \sim p(\mathbf {q}_r \, \vert \,\mathbf {q}_{rj_{-c}}, q_{rj_c}')\), where \(\mathbf {q}_{rj_{-c}}\) represents \(\mathbf {q}_r\) at the set of inducing points excluding \(j_c\). The sample is accepted or rejected using the MH acceptance ratio, evaluated with only the likelihood function \(p(B, I \, \vert \,\mathbf {z}, \varvec{\chi }, \omega , C)\).
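
A sketch of one such inducing-point move, under an assumed squared-exponential kernel (the GP hyperparameters here are placeholders); for brevity it returns the predictive mean over all frames, whereas a full move would draw from the predictive distribution before the MH accept/reject step.

```python
import numpy as np

def rbf_kernel(t1, t2, scale=1.0, length=10.0):
    """Squared-exponential kernel over frame indices (hypothetical hyperparameters)."""
    return scale ** 2 * np.exp(-0.5 * ((t1[:, None] - t2[None, :]) / length) ** 2)

def propose_trajectory(q, inducing, c, mean=0.0, jitter=1e-6):
    """Sketch of an inducing-point proposal for a trajectory q (e.g. body yaw
    over l_r frames): redraw the value at inducing point inducing[c] from the
    GP conditional given the other inducing points, then recompute the whole
    trajectory from the GP predictive given the updated inducing values."""
    q = np.asarray(q, dtype=float)
    frames = np.arange(len(q), dtype=float)
    u = np.asarray(inducing)
    keep = np.delete(u, c)                                # inducing points except j_c
    q_keep = q[keep] - mean

    K_kk = rbf_kernel(keep.astype(float), keep.astype(float)) + jitter * np.eye(len(keep))
    K_ck = rbf_kernel(np.array([float(u[c])]), keep.astype(float))
    cond_mean = mean + (K_ck @ np.linalg.solve(K_kk, q_keep)).item()
    cond_var = rbf_kernel(np.array([float(u[c])]), np.array([float(u[c])])).item() \
        - (K_ck @ np.linalg.solve(K_kk, K_ck.T)).item()
    q_c_new = np.random.normal(cond_mean, np.sqrt(max(cond_var, jitter)))

    # predictive mean over all frames given the updated inducing values
    u_all = np.concatenate([keep, [u[c]]]).astype(float)
    u_vals = np.concatenate([q_keep, [q_c_new - mean]])
    K_uu = rbf_kernel(u_all, u_all) + jitter * np.eye(len(u_all))
    K_fu = rbf_kernel(frames, u_all)
    return mean + K_fu @ np.linalg.solve(K_uu, u_vals)
```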

5 Evaluation Dataset and Measures

Several datasets exist for evaluating VFoA recognition in meeting scenarios [5, 7, 8, 29, 58, 59]. Since most of the participants in available meeting datasets are seated throughout the videos, these datasets are not well-suited for evaluating our system, which relies on the ability to detect standing people and is targeted at scenarios with a diversity of gaze directions in both pitch and yaw. Similarly, datasets such as the Vernissage Corpus [29], which simulates an art gallery scenario, contain many frames where only the upper bodies of the participants are visible. Data sets with walking persons, on the other hand, uniformly do not encode the 3D locations of what people are looking at. While data sets such as the challenging SALSA [1], cocktail party [68], and coffee break [16] have head pose annotations, this does not suffice for our goals. Thus, we created a new dataset with multiple participants moving freely about while looking at different static targets and each other.

5.1 A New Dataset for 3D Gaze

We captured and annotated six indoor and two outdoor video sequences. Each setting contained several static object locations, some of which were not visible to the camera. Video participants were asked to walk around and look at each other or the stationary objects, indicating when they started and stopped focusing on each target with an audio recording device. All 8 of our videos were between 40 and 90 seconds long, with 3 to 4 people and 5 to 8 objects total (including objects that were not visible). Indoor videos had an image resolution of 1920 \(\times \) 1080. Outdoor video resolution was 1440 \(\times \) 1080 (Fig. 5).

Fig. 5.
figure 5

From left to right, sample frames from two outdoor videos and two indoor videos. The outdoor videos were taken on top of a garage rooftop and within a library courtyard. The indoor videos were shot in a classroom and within a hallway. Each video participant walks inside the scene and records (via an audio recorder) what they are looking at – either another person or a stationary object. All objects in the indoor videos are visible to the camera and can be seen in the frames. Some of the objects in the outdoor videos are not visible to the camera.

Annotation and Ground Truth. We annotated bounding boxes around each target at each frame using the VATIC annotation tool [60]. We then estimated the ground truth for the 3D positions of each target and the camera parameters in each video by minimizing the reprojection error with respect to 3D locations and heights, using the tops and bottoms of the ground truth boxes. We also used the VFoA audio annotations described above to estimate the ground truth head orientations (pitch and yaw) of each person at every frame where the person was looking at a target. To determine the locations of points not visible to the camera, we measured their locations, together with those of visible points, in a shared coordinate system, and then mapped the invisible points into the camera coordinate system.

5.2 Evaluation Measures

Trajectory and Head Pose Evaluation. To evaluate the 3D trajectories of the inferred targets, we first find the best match between the inferred tracks and the ground truth tracks using the Hungarian method with pairwise Euclidean distances. We then use two conventional metrics for tracking: MOTA (for accuracy of the data associations) and MOTP (for precision of the estimated 3D tracks) [50]. Per convention, we set the MOTP threshold to 1 m. To evaluate head pose estimation, we compute the equivalent of MOTP for both yaw and pitch between the inferred head poses and their corresponding ground truth head poses (measured in degrees) at frames in which they are available.

To Evaluate VFoA Estimation, we compare the inferred VFoA of a tracked person to the ground truth VFoA at each frame where it exists. Let \(N_c\) be the number of frames where the VFoA is correctly estimated, \(N_m\) be the number of frames where we fail to infer a VFoA (misses), and \(N_e\) be the number of frames where we infer an incorrect VFoA. We then compute the following three scores for the VFoA estimation: \(\text {accuracy } = N_c / N, \text { mistakes } = N_e / N, \text { missed }= N_m / N,\) where N is the total number of frames that the ground truth for that person records that they were looking at one of the scene VFoA targets. Note that this excludes evaluating the VFoA when the tracked person is transitioning from looking at one target to another. For each video, we compute the average scores over all the tracked persons.
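
These scores can be computed per person as in the following sketch, where an entry of 0 (or None) denotes no inferred focus:

```python
def vfoa_scores(inferred, ground_truth):
    """Sketch of the per-person VFoA scores. `inferred` and `ground_truth` are
    per-frame target ids; frames without a ground-truth target (e.g. during
    transitions between targets) are excluded from N."""
    valid = [(i, g) for i, g in zip(inferred, ground_truth) if g not in (0, None)]
    N = len(valid)
    N_c = sum(1 for i, g in valid if i == g)               # correct
    N_m = sum(1 for i, g in valid if i in (0, None))       # missed
    N_e = N - N_c - N_m                                    # mistakes
    return {"accuracy": N_c / N, "mistakes": N_e / N, "missed": N_m / N}
```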

Evaluating Inferring Interesting Locations. Finally, we evaluate how well we can infer the interesting locations in a scene by first finding the best matching between the inferred interesting locations and the preset ground truth locations using the Hungarian method with a 1 m threshold. We then compute the recall and precision for the inferred interesting locations and their average distance to the ground truth locations.
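
A sketch of this evaluation using the Hungarian method (via scipy's linear_sum_assignment):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def object_discovery_scores(inferred, ground_truth, threshold=1.0):
    """Sketch of the object-discovery evaluation: match inferred 3D locations
    to ground-truth locations with the Hungarian method, count a match as
    correct if within `threshold` meters, and report precision, recall, and
    the mean distance of the correct matches."""
    D = cdist(np.atleast_2d(inferred), np.atleast_2d(ground_truth))
    rows, cols = linear_sum_assignment(D)
    dists = D[rows, cols]
    hits = dists <= threshold
    precision = hits.sum() / len(inferred)
    recall = hits.sum() / len(ground_truth)
    mean_dist = dists[hits].mean() if hits.any() else float("nan")
    return precision, recall, mean_dist
```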

6 Experiments and Results

We ran two sets of experiments to evaluate the performance of our method. We do not compare to others on our main tasks since we are not aware of any relevant published results. We first ran our algorithm and ablated variants on our dataset to assess the impact of different aspects of our approach. We then compared our person tracking performance against our previously published results [13] for people tracking alone, to check the effect of the extensions for gaze tracking and object discovery on basic tracking, using the well-known TUD dataset [2].

Experiments on Our Dataset. We experiment with enabling and disabling inference over three different parts of the model: the 3D head pose \((\mathbf {p}, \mathbf {y})\), the VFoA \(\varvec{\xi }\), and the static objects \(\varvec{\chi }\), and replace each with a baseline algorithm. We denote the entire model MGG (for “multiple gaze geometry”).

When we disable inference over \((\mathbf {p}, \mathbf {y})\), we simply set the head pose to match the walking direction at each frame (MGG-NO-HEAD). When disabling inference over \(\varvec{\xi }\), we set the VFoA of each person at each frame to the first object or person that intersects their gaze ray (MGG-NO-VO). Finally, when turning off inference over \(\varvec{\chi }\), we estimate the static objects by computing a histogram of the intersections of all the 3D gaze directions of all the people across all the frames, then taking the locations of the 5 bins with the most votes (MGG-BASELINE).

Table 1. Performance of different modes of our algorithm on our dataset. Numbers are averaged over eight videos. The first row shows our method with all parts enabled, while the next three rows each shows the algorithm with different aspects disabled, e.g., MGG-NO-HEAD is the stereo gaze algorithm without inferring head pose (see Sect. 6 for details). Each column shows a different evaluation measure. We evaluate using the MOTA (with 1.0 m threshold) and MOTP for distance and angles. For VFoA we use the measures defined in Sect. 5.2.
Table 2. Object discovery performance. Numbers are averaged over eight videos. The algorithms are the same as in Table 1, and the measures are defined in Sect. 5.2. We tabulate performance separately for objects not visible in any frame. The performance here may be favorably biased towards invisible objects because they tended to be behind the camera, and looking at them meant a more frontal image of the viewer, which entails better pose estimation.

Table 1 provides the tracking and head pose estimation results on our dataset. While MOTA and MOTP on position are comparable across all algorithms, the estimated yaw of the head is poor without head pose data. This is not surprising, as the participants in our videos often do not look straight ahead, partly due to the construction of the experiment. By jointly modeling position and pose, we maintain good performance on tracking while obtaining reasonable accuracy of head yaw, surpassing MGG-NO-HEAD by a significant amount (\(\sim 40~^{\circ }\)). The gain for pitch was more modest, but the absolute pitch error was smaller to begin with, being biased by our instructions and our environment. However, this is ecologically valid, as typical viewing angles are not that far from level.

Table 1 also provides the results for the estimated VFoA. On average, we can correctly identify the VFoA target 48% of the time, much better than the baseline (13%), and better than the ablated MGG-NO-VO version (31%). The latter result suggests, perhaps not surprisingly, that learning the 3D locations that people might be looking at provides additional information beyond gaze angles determined from image data alone.

Results for object discovery are shown in Table 2. Here we define success by correctly estimating the location within one meter. We correctly identified 48% of the instances that are available to be identified across the eight videos (recall). In addition, among the ones our method proposes as interesting locations, 59% are correct (precision). The average distance error is a little more than half a meter, which is driven by the choice of the one-meter threshold. Figure 6 shows some example frames of the resulting inferred 3D scene when running the full algorithm (MGG) compared with the baseline (MGG-BASELINE).

Fig. 6.
figure 6

Visualization of the inferred 3D targets in three scene settings. The top row shows a visualization of the results of the baseline algorithm (MGG-BASELINE), in which the yaw of the gaze direction is set based on the walking directions, and the static objects are estimated from the gaze intersections. The bottom row shows the results of the proposed method on the same frames of the same videos. The arrow on the head indicates the gaze direction and the arrow on the body cylinder indicates the body direction. A tracked person’s VFoA is indicated by a line segment from their head connecting to one of the discovered 3D points (yellow spheres) or one of the other tracked people. In the last column, the objects are outside the visible image area. (Color figure online)

Experiments on TUD Benchmark Videos. We compared tracking performance to a similar system for tracking only [13], to evaluate whether incorporating gaze tracking and object inference reduces the tracking performance. We found that we in fact do better on the TUD data, suggesting that the joint inference is helpful (Table 3).

Table 3. Tracking results on the TUD dataset. We compare to [13], which shows that joint inference over additional scene attributes yielded a tracking performance boost as well.

7 Conclusion

We demonstrated the feasibility of discovering interesting visual locations, specified in 3D, from multiple person gazes observed in monocular video. In particular, on a data set developed for the task, we found that we can infer what people are looking at 59% of the time, and where it is to within about 0.58 m. We also found that joint inference over the various scene attributes generally improved the accuracy of the individual estimates. In brief, gaze is both part of scene semantics and can help determine other aspects of scene semantics.