1 Introduction

Gaze direction is one of the most useful interaction cues in human-machine interaction (HMI). Gaze tracking has been widely applied in fields such as human–computer interaction [1,2,3], human-robot interaction [4,5,6], and virtual reality [7, 8]. Many existing gaze tracking technologies use invasive methods [9], such as wearing eye tracking glasses [10] or attaching electrooculogram sensors around the eyes [11]. However, many people cannot use such methods because they are sensitive to body-attached hardware [12]. Wearing hardware also limits people's activities and may cause discomfort. Some methods avoid direct contact with the hardware, such as eye trackers placed in front of the participant [13], but the calibration process of these devices is time consuming and may not work for some participants, such as very young children. Moreover, participants need to hold their head still during tracking, which is not applicable in HMI studies that require large head movements [12, 14]. Most of these devices are also expensive and only available to professionals. Thus, gaze tracking methods that use invasive hardware and/or require the head to be held still do not satisfy many HMI scenarios.

Since people tend to turn their head toward a target when looking at it, head orientation can also be used to indicate gaze direction. Previous research has used frontal head orientation to approximate gaze direction directly [15]. However, the literature [16, 17] and common sense tell us that differences exist between people's frontal head orientation and their gaze direction. Therefore, other studies modeled gaze direction as an uncertainty with an assumed probability distribution on top of head orientation [18], or captured clear eye images (which usually requires cameras positioned close to the participant) and estimated gaze angles on top of head orientation [19, 20]. Nevertheless, only a few works [16, 17] have studied the difference between head orientation and gaze direction carefully, and most of them examined this difference for attention tracking rather than the explicit relation between gaze and head orientation.

We aim to address these problems in this paper. First, we conducted a detailed evaluation of the difference between frontal head orientation and gaze direction. Based on this difference, we introduce a novel gaze direction estimation method that accommodates head rotation without requiring invasive hardware or clear eye images.

This new method approximates a person's gaze direction using his/her head orientation. Head orientation can be detected remotely and precisely using computer vision techniques [21]. In this paper, we chose the Microsoft Kinect, a low-cost device available to the public. Seven adults participated in a data collection experiment, where their synchronized gaze direction and head orientation data were recorded. These data were used to train mapping functions that take head orientation as input and output the corresponding gaze direction.

Compared to previous gaze direction estimation methods, the proposed method can be implemented directly on top of any existing head orientation estimation method and does not require extra hardware for gaze tracking. Since the computational cost of the mapping functions is low, real-time gaze tracking is possible when the functions are applied to real-time head orientation estimation. As remote head orientation estimation methods [21], such as the CSIRO Face Analysis SDK [22] and the Microsoft Kinect, are well developed, applying the gaze mapping functions on top of them easily yields non-invasive gaze tracking.

This paper is structured as follows: Sect. 2 describes the data collection and processing of this study. Section 3 presents the difference between gaze direction and frontal head orientation. Section 4 introduces how the gaze direction estimation functions were designed accordingly and reports their accuracy. Finally, Sect. 5 concludes this article and discusses future work.

2 Data Collection and Processing

Functions that map head orientation to gaze direction were derived by data fitting. We conducted a data collection experiment with seven adults recruited as participants.

In each experiment session, a participant was asked to look at a moving marker (a red spot) projected on a wall (a plane). The position of the marker was used to indicate the ground truth of the participant's gaze direction. Before the experimental trials, the marker was placed at the center of the display region, and the participant's head orientation when he/she was looking at the marker was recorded as a baseline value. This value was used to calibrate the data for each participant, as discussed later in this section. The marker's movement was arranged in multiple trials. In each trial, the marker started from a random position and moved horizontally, vertically, and diagonally in a random order. A participant might unconsciously anticipate the motion of the marker if it moved in the same direction for a long period. To avoid this, we kept each trial short, lasting from a few seconds to about a minute. Meanwhile, the moving speed of the marker was adjusted so that participants could follow its motion easily. After each trial, the marker disappeared and reappeared in another position to start the next trial, until the end of the experiment session. The combined path of the marker across all trials covered a region of 360 cm × 117 cm. Figure 1 illustrates an example of the path that the moving marker followed.

Fig. 1. Moving marker path example (Color figure online)

During the experiment, the participant was seated facing the central part of the display region. The center of the head was approximately 150 cm from the ground and 160 cm from the display region. Therefore, to view the whole display region, a participant's gaze needed to shift from \( - 48.37^\circ \) to \( 48.37^\circ \) horizontally and from \( - 18.81^\circ \) to \( 21.34^\circ \) vertically, as shown in Fig. 2. This simulates a typical gaze range when people communicate with other agents or pay attention to objects in front of them. When a participant was looking at the marker, his/her head orientation was estimated using the Microsoft Kinect. As shown in Fig. 2, the Kinect was placed at the bottom of the display region's central part, facing the participant. Thus, the participant's head was in the view of the Kinect for head pose estimation, while the Kinect did not block the participant's view of the display region. The head orientation data were recorded and synchronized with the participant's ground truth gaze direction, so that the resulting data pairs could be used to calculate the difference between gaze direction and frontal head orientation as well as to fit mathematical functions that map head orientation to gaze direction.
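For reference, these gaze limits follow directly from the setup geometry: the display region is 360 cm wide and 117 cm tall and was viewed from 160 cm, with the eye level lying roughly 54.5 cm above the bottom edge and 62.5 cm below the top edge of the region (values inferred from the reported angles rather than stated explicitly):

$$ \tan^{ - 1} \left( {180/160} \right) \approx 48.37^\circ ,\quad \tan^{ - 1} \left( {62.5/160} \right) \approx 21.34^\circ ,\quad \tan^{ - 1} \left( {54.5/160} \right) \approx 18.81^\circ . $$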

Fig. 2. Data collection experiment configuration

The frontal head orientation (a vector) \( \vec{f} \) was computed from the head rotation quaternions given by the Kinect for Windows Software Development Kit 2.0. The vector \( \vec{f} \) was projected horizontally and vertically, generating two angles, \( \alpha \) and \( \beta \), which represent the horizontal and vertical head rotation from the frontal direction, respectively. As mentioned in the last section, before the experimental trials, each participant's baseline head orientation was recorded while they were instructed to look at the center of the display region, which was approximately right in front of the participant. The baseline orientation was used to calibrate the participant's head orientation. This is necessary because different participants tended to look at the same spot in the display region with different head orientations. For example, some people may raise their head a little more than others, and some people may turn their head to one side a little more than to the other. Therefore, we recorded a participant's head orientation at the calibration point as \( \left( {\alpha_{c} ,\beta_{c} } \right) \), and this pair was subtracted from all head orientation data to eliminate the baseline differences, i.e., \( \alpha_{f} = \alpha - \alpha_{c} \), \( \beta_{f} = \beta - \beta_{c} + \Delta \beta \). The calibration point is slightly higher than the exact horizontal direction; \( \Delta \beta \) is the angle between the horizontal direction and the direction of the calibration point, which is approximately 1.43° in this study.
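As an illustration, the sketch below shows this projection and calibration step in Python. The quaternion-to-angle conversion assumes a forward axis of (0, 0, 1) and standard quaternion conventions, which may differ from the exact axis conventions of the Kinect SDK; the function names are ours.

```python
import numpy as np

def head_angles_from_quaternion(qw, qx, qy, qz):
    """Project the frontal head vector onto the horizontal and vertical planes
    to obtain the head rotation angles alpha (yaw) and beta (pitch) in degrees.
    Sketch only: rotates the assumed forward axis (0, 0, 1) by the quaternion."""
    fx = 2.0 * (qx * qz + qw * qy)
    fy = 2.0 * (qy * qz - qw * qx)
    fz = 1.0 - 2.0 * (qx * qx + qy * qy)
    alpha = np.degrees(np.arctan2(fx, fz))               # horizontal rotation
    beta = np.degrees(np.arctan2(fy, np.hypot(fx, fz)))  # vertical rotation
    return alpha, beta

def calibrate(alpha, beta, alpha_c, beta_c, delta_beta=1.43):
    """Subtract the participant's baseline orientation (alpha_c, beta_c) and
    compensate for the calibration point lying delta_beta degrees above the
    exact horizontal direction."""
    return alpha - alpha_c, beta - beta_c + delta_beta
```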

The calibrated head orientation angles \( \alpha_{f} \) and \( \beta_{f} \) were used to fit the mapping functions. As shown in Fig. 3, we define the position of the moving marker as \( (x,y) \), which is associated with the ground truth gaze direction \( (\alpha_{g} ,\beta_{g} ) \). Based on the geometry of the experimental setup,

Fig. 3. Illustration of the gaze angles and head orientation angles

$$ \alpha_{g} = \tan^{ - 1} (x/160), $$
(1)
$$ \beta_{g} = \tan^{ - 1} \left( {(y + 4)/160} \right). $$
(2)
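A small sketch of Eqs. (1) and (2), assuming the marker coordinates are measured in centimeters from the center of the display region (the function name and argument defaults are ours):

```python
import numpy as np

def ground_truth_gaze(x_cm, y_cm, distance_cm=160.0, center_offset_cm=4.0):
    """Ground-truth gaze angles (degrees) from the marker position (x, y),
    per Eqs. (1) and (2). center_offset_cm accounts for the calibration point
    lying slightly above the exact horizontal direction."""
    alpha_g = np.degrees(np.arctan2(x_cm, distance_cm))
    beta_g = np.degrees(np.arctan2(y_cm + center_offset_cm, distance_cm))
    return alpha_g, beta_g
```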

As illustrated in Fig. 3, the gaze direction does not exactly coincide with the head orientation.

Data streams of \( (\alpha_{g} ,\beta_{g} ) \) and \( (\alpha_{f} ,\beta_{f} ) \) were synchronized so that every ground truth gaze direction sample was paired with its synchronized and calibrated head orientation sample. From the 7 participants, 7281 pairs of data were collected initially. Even though the participants were instructed to stare at the moving marker as much as possible, we still observed occasional unconscious gaze shifts and corresponding head rotation shifts during the experiment. Therefore, to eliminate outliers, we divided the display region into 300 grid cells, each 12 cm (width) by 11.7 cm (height). For each grid cell, the following steps were executed:

  • Step 1: All the \( (\alpha_{f} ,\beta_{f} ) \) that correspond to \( (\alpha_{g} ,\beta_{g} ) \) within this grid cell were used to calculate their mean values \( (\bar{\alpha }_{f} ,\bar{\beta }_{f} ) \) and standard deviations \( (\sigma_{f}^{\alpha } ,\sigma_{f}^{\beta } ) \).

  • Step 2: For a head orientation data point \( (\alpha_{f}^{i} ,\beta_{f}^{i} ) \), if \( \left| {\alpha_{f}^{i} - \bar{\alpha }_{f} } \right| > 2\sigma_{f}^{\alpha } \) or \( \left| {\beta_{f}^{i} - \bar{\beta }_{f} } \right| > 2\sigma_{f}^{\beta } \), this point was removed from the dataset along with its paired gaze point.

After these two steps, 554 pairs (7.6%) were removed from the initial synchronized data, leaving 6727 pairs for fitting the mapping functions.
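For concreteness, a Python sketch of this grid-based filtering is given below; the paper does not provide code, so the array layout, the coordinate origin at the display center, and the function name are our assumptions.

```python
import numpy as np

def remove_outliers(gaze_xy, head_angles, grid_w=12.0, grid_h=11.7, region=(360.0, 117.0)):
    """Grid-based 2-sigma outlier removal following the two steps above.

    gaze_xy: (N, 2) marker positions (x, y) in cm, origin at the display center.
    head_angles: (N, 2) calibrated head angles (alpha_f, beta_f) in degrees.
    Returns a boolean mask of the samples to keep."""
    # Assign each gaze sample to a grid cell of the display region
    col = np.floor((gaze_xy[:, 0] + region[0] / 2) / grid_w).astype(int)
    row = np.floor((gaze_xy[:, 1] + region[1] / 2) / grid_h).astype(int)
    cell = row * int(region[0] / grid_w) + col

    keep = np.ones(len(head_angles), dtype=bool)
    for c in np.unique(cell):
        idx = np.where(cell == c)[0]
        mean = head_angles[idx].mean(axis=0)     # Step 1: per-cell means
        std = head_angles[idx].std(axis=0)       #         and standard deviations
        dev = np.abs(head_angles[idx] - mean)    # Step 2: flag points beyond 2 sigma
        keep[idx] = ~(dev > 2.0 * std).any(axis=1)
    return keep
```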

3 Difference Between Gaze Direction and Frontal Head Orientation

Figure 4 plots \( \alpha_{g} \) and \( \beta_{g} \) against \( (\alpha_{f} ,\beta_{f} ) \) in a three-dimensional space. We can see that the frontal head orientation and gaze direction are related but not coincident. The distributions of the points are also slightly non-linear.

Fig. 4. Plots of \( \alpha_{g} \) and \( \beta_{g} \) against \( (\alpha_{f} ,\beta_{f} ) \)

In addition, from the data, we found that the further a participant was looking away from his/her frontal direction, the larger the difference between his/her head orientation and gaze direction was. This is an important behavioral phenomenon that needs to be considered when approximating gaze direction using head orientation.

We divided the data into 21 sets in each of the horizontal and vertical directions, based on each sample's distance from the center of the display. Thus, each horizontal set covers a range of about 4.61°, and each vertical set covers a range of about 1.91°. Then, we calculated the difference between the frontal head orientation and the gaze direction within each set in each direction. Figure 5 shows the resulting patterns, where each bar indicates the difference in one set (indexed from −10 to 10). Subfigures (a) and (b) illustrate the average values of \( \left| {\alpha_{g} - \alpha_{f} } \right| \) and \( \left| {\beta_{g} - \beta_{f} } \right| \), respectively. Larger x-coordinates represent longer distances from the center of the display. In the horizontal direction, the bar at 0 indicates the difference in the center set ([−2.31°, 2.31°]), bars at negative x-coordinates are the differences on the left half, and bars at positive x-coordinates are the differences on the right half. In the vertical direction, the bar at 0 indicates the difference in the center set ([−0.96°, 0.96°]), bars at negative x-coordinates are the differences on the lower half, and bars at positive x-coordinates are the differences on the upper half.
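The per-set analysis can be sketched as follows (a hypothetical helper, assuming the sets are defined by equal-width bins over the gaze angle range):

```python
import numpy as np

def binned_differences(gaze_deg, head_deg, angle_range, n_bins=21):
    """Mean absolute gaze/head difference per angular bin, as in Fig. 5.

    gaze_deg, head_deg: 1-D arrays of gaze and head angles (degrees) along one
    direction (alpha or beta). angle_range: (min_deg, max_deg) of the display."""
    edges = np.linspace(angle_range[0], angle_range[1], n_bins + 1)
    bins = np.clip(np.digitize(gaze_deg, edges) - 1, 0, n_bins - 1)
    diff = np.abs(gaze_deg - head_deg)
    return np.array([diff[bins == b].mean() if np.any(bins == b) else np.nan
                     for b in range(n_bins)])
```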

Fig. 5. Differences between gaze direction and frontal head orientation

From Fig. 5, we can see that the differences between gaze direction and frontal head orientation became larger as the participants looked further away from the center in both the horizontal and vertical directions. In the horizontal direction, the difference in the range of [−20.75°, 16.14°] (bars −4 to 3) was below 8°. The minimum difference (4.31°) occurred at [−2.31°, 2.31°] (bar 0). The difference increased as the participants' gaze shifted to the sides; at bars −10 and 10, the differences were 16.71° and 16.26°, respectively. In the vertical direction, the difference in the range of [−10.51°, 8.60°] (bars −5 to 4) was below 6°. The minimum difference (3.86°) occurred at [−4.78°, −2.87°] (bar −2). The difference increased as the participants' gaze shifted upwards and downwards; at bars −10 and 10, the differences were 11.41° and 16.03°, respectively.

4 Derivation of the Mapping Functions for Gaze Direction Estimation

Two mapping functions \( F_{1} \) and \( F_{2} \) were derived as \( \alpha_{g} = F_{1} (\alpha_{f} ,\beta_{f} ) \) and \( \beta_{g} = F_{2} (\alpha_{f} ,\beta_{f} ) \) by fitting 2D surfaces to the point clouds. Linear and polynomial regressions were conducted with linear interpolation. The average errors of linear regression for \( F_{1} \) and \( F_{2} \) were 7.81° and 7.83°, respectively. The RMSDs decreased slightly with polynomial regression; however, orders beyond the second caused overfitting and were therefore not appropriate. We therefore chose second-order polynomial regression, for which the average errors of \( F_{1} \) and \( F_{2} \) were 7.74° and 7.63°, respectively. The mapping functions take the following form, with the coefficients listed in Table 1:

Table 1. Coefficients of \( F_{1} \) and \( F_{2} \) (rounded to 6 decimals)
$$ F = P_{00} + P_{10} \alpha_{f} + P_{01} \beta_{f} + P_{20} \alpha_{f}^{2} + P_{11} \alpha_{f} \beta_{f} + P_{02} \beta_{f}^{2} . $$
(3)
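A minimal least-squares sketch of this fit is shown below; the authors do not specify their fitting toolbox, so the function names and implementation are ours.

```python
import numpy as np

def fit_poly22(alpha_f, beta_f, target):
    """Least-squares fit of the second-order surface in Eq. (3).

    alpha_f, beta_f: arrays of calibrated head orientation angles (degrees).
    target: the corresponding gaze angles (alpha_g for F1, beta_g for F2).
    Returns the coefficients (P00, P10, P01, P20, P11, P02)."""
    A = np.column_stack([np.ones_like(alpha_f), alpha_f, beta_f,
                         alpha_f ** 2, alpha_f * beta_f, beta_f ** 2])
    coeffs, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coeffs

def evaluate_poly22(coeffs, alpha_f, beta_f):
    """Evaluate Eq. (3) to estimate a gaze angle from head orientation."""
    P00, P10, P01, P20, P11, P02 = coeffs
    return (P00 + P10 * alpha_f + P01 * beta_f
            + P20 * alpha_f ** 2 + P11 * alpha_f * beta_f + P02 * beta_f ** 2)
```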

Therefore, \( F_{1} \) and \( F_{2} \) can be used to approximate the participants' gaze direction from their frontal head orientation within the range of the experimental setup. Figure 6 shows the fitted surfaces that represent \( F_{1} \) and \( F_{2} \), respectively. The nonlinearity of the two functions can be observed in the graph, especially for \( F_{2} \). These two functions can easily be embedded into any existing head orientation estimation method for gaze tracking.

Fig. 6. Plots of mapping functions

The mapping functions were embedded into the existing Kinect head orientation estimation program. Due to their low computational cost, this gaze tracking method can work in real time (about 30 frames per second).

5 Conclusion and Discussion

Since people tend to turn their head towards a target when looking at it, gaze direction can be estimated from frontal head orientation. In this work, we first studied the relation, especially the difference, between people's gaze direction and their frontal head orientation. Then, we proposed a gaze direction estimation method based on frontal head orientation. Seven participants were recruited and instructed to look at a moving marker that indicated their ground truth gaze direction. Meanwhile, their head orientation when looking at the moving marker was recorded using a Microsoft Kinect. We found that people's gaze direction deviates from their head orientation when they look away from the frontal direction. To use head orientation to estimate gaze direction, mapping functions were fitted using the synchronized gaze and frontal head orientation data. Two second-order polynomial mapping functions were derived through regression. The average error of the proposed method is below 8°. The functions have low computational cost and are independent of the particular devices/hardware used for head orientation detection. Therefore, they can be applied to any existing head orientation estimation technique, such as a Microsoft Kinect, to achieve real-time gaze tracking at low cost. Compared with other gaze tracking methods, the proposed method does not require any hardware/sensors specifically for gaze tracking beyond those used for head orientation estimation. Since head orientation estimation has been studied for decades and many methods and products are available, extending them for gaze tracking with the proposed method is straightforward.

However, there are a few limitations of the current work that need to be addressed in the future. First, the average error (nearly 8°) is large compared with that of commercial eye trackers used under careful calibration with the head held still. Therefore, the proposed method may not be suitable for studies that require very accurate gaze tracking, such as tracking the exact point a participant is looking at. Instead, it can be used in scenarios where a rough and quick estimate of a person's gaze is enough, such as distinguishing which large objects a participant looks at and the gaze switching between these objects [16]. Since the objects in many human-machine interaction studies are separated by much more than 8° of visual angle, the proposed method has great potential to provide a quick solution for estimating gaze in these studies. Examples include: (1) human-robot interaction studies, where the system needs to distinguish whether a participant is looking at a robot or at another object positioned far away from the robot [14]; and (2) human-computer interaction studies, where participants need to look at different monitors during the interaction [12].

The second limitation of the current study is the small sample size. Only seven participants' data were used, and thus the fitted functions are tuned to this small group. To develop general functions that work well for a larger population, more data need to be collected from a larger sample in the future.

Another way to improve the current work is to study how the combination of head orientation and body posture affects gaze direction. The current work studied participants' looking behaviors while they were facing the object (i.e., the display region). However, people's gaze direction with respect to their head orientation may be influenced by body posture. For example, when a person is called from behind, he/she may turn the upper body halfway and rotate the head to look back, or may turn the whole body around without turning the head with respect to the body. These are more complex cases compared with the current study and have not been thoroughly and systematically addressed in previous research.

In summary, this paper introduced an easy and quick way to estimate gaze direction from frontal head orientation. The proposed method gives a coarse gaze direction estimate and can be applied in human-machine interaction scenarios that only require a rough gaze direction. It can be embedded into any existing head orientation estimation method and does not require extra hardware. Although limitations exist, we believe the current work is an important step towards more accurate, cost-effective, and non-invasive gaze tracking technologies.