1 Introduction

The visualisation and animation of human movements [1,2,3] have recently been garnering the attention of numerous researchers. In order to improve human-computer interaction, it is necessary to develop algorithms that can interpret the behavioural movements made by humans and also mimic these actions in a natural way.

Over the years it has become quite common to use large databases to train and test active appearance models (AAM) that pinpoint and track the locations of landmark points in a human face [4,5,6]. One of the more recent approaches, presented in [7], combined Lucas-Kanade optical flow with an active appearance model that utilizes gradient descent. As a different approach, the correlation between appearance and shape was used in [8]. Models of this type find and follow the necessary facial features easily when the head orientation is near-frontal, with only slight changes in angle, but tend to go awry when the proportions of the face change due to rotation.

To prevent this, a method using auxiliary attributes was proposed in [9]. The authors added traits such as gender, whether the person was wearing glasses, and whether the person was looking to the left, to the right or to the front. This extra information was included in their deep convolutional network. It was shown that this approach gives more accurate results, as the AAM could be aligned based on samples more similar to each other, rather than using a model without any data stratification.

Several techniques have been proposed for robust tracking that incorporate geometric constraints to correlate the positions of facial landmarks [10, 11]. Even though single-step face alignment methods have been proposed [12, 13], the most common and recent approach to face alignment is to model the relationship between texture and geometry with a cascade of regression functions [14, 15]. Many methods used RGB-D cameras to conduct real-time depth-based facial tracking and tried to register a dynamic expression model with the observed depth data [16,17,18].

However, the aforementioned methods of head pose estimation rely heavily on the lighting conditions. To achieve accurate head orientation recognition in real life, a method would have to be trained with a large variety of poses under all kinds of different lighting conditions. On the other hand, training with highly dispersed data could make the classification less reliable and produce false detections of angles or facial landmarks. To bridge that gap, the depth data and RGB data from Kinect were used in [19]. As a preliminary step the authors constructed a 3D head shape model, after which they took three RGB images of the subject: from the left, the right and the front. These images were fitted onto the 3D AAM. The 3D AAM was then aligned to the input 3D frame, using the RGB data as constraining parameters.

Fanelli et al. [20] proposed a head pose estimation method based purely on depth data. Fanelli et al. [21] also tried using their method with data from the Microsoft Kinect 1, but the sensor at that time gave fairly inaccurate results due to the low quality of the available depth information. With this work a depth database, BIWI, was also provided. Nevertheless, adding different AAM-type algorithms on top of Fanelli's framework has produced fairly accurate real-time facial landmark tracking applications [22, 23]. Now that the Kinect 2 is available, with much more accurate depth information and higher-resolution RGB images, the results of depth-based head pose estimation algorithms could be improved upon and further analysed to achieve faster and smoother facial landmark detection and tracking [24].

Another application of depth information is face recognition that does not depend on lighting conditions. Such real-time identification was proposed in [25]. The face was segmented out based on the depth discontinuity, and faces with a resolution lower than \(60\times 60\) were disregarded. A suitable number of canonical faces was then formed, and subsequently the iterative closest point (ICP) algorithm [26] was used to align gallery faces to the standardised faces. In the recognition stage, the ICP algorithm was used to align the probe face to the canonical ones, and the gallery face most similar to the aligned face was picked as the match. This proved to be a robust and computationally cheap method.

One of the most important points to understand about 3D sensors, and about algorithms trained on the data acquired by them, is that the RGB-D data provided by each sensor is unique: some outputs are denser, while others exhibit smaller errors. Because of this, a method trained on data from one sensor is incompatible with data from other sensors. The necessity for a head pose database for the Microsoft Kinect 2 arises from the fact that it is nowadays one of the most accurate and easily available RGB-D sensors. Additionally, for further development of depth-based recognition methods, a variety of databases with heads in different poses is needed; thus, in this paper, we present the SASE database, which is gathered with Kinect 2 and can contribute to future research within the depth-related facial recognition field.

The SASE database is composed of depth and RGB frames of 50 subjects performing different head poses in front of the sensor. The head poses have high variation in yaw, pitch and roll angles, resulting in a wide range of poses. For each subject more than 600 frames were captured, and most of them are labelled with the head location and rotation angles.

The rest of the paper is organized as follows: in Sect. 2 a short overview of available 3D face databases is presented. In Sect. 3 the acquisition of the SASE database and the method of calculating the ground truth values are described in detail; images of the setup and the recording process, as well as the necessary formulae, are also provided. A conclusion is drawn in the last section.

2 Brief Overview of Existing 3D Head Databases

The number of available depth databases is quite small, and most of them have been captured with sophisticated scanners that require a lot of time for data collection. These types of devices have very few real-time applications, which reduces the overall usefulness of datasets captured with them. Databases captured with high-definition laser scanners are: ND-2004 [27], BJUT-3D [28] and UMD-DB [29]. High-quality stereo-imaging systems were used for capturing the BU-3DFE [30], XM2VTSDB [31] and the Texas 3D-FRD [32].

Fig. 1. 3D and depth samples from databases: (a) BJUT-3D, (b) Kinect FDB, (d) BU-3DFE, (e) TEXAS 3DFDB, (f) York, (g) BIWI, (h) Warehouse and (i) Bosphorus.

Table 1. Summarized comparison of nine head pose databases.

Some databases were captured using a structured-light system instead of a depth camera, such as the 3D-RMA database [33] and the Bosphorus database [34]. While Bosphorus contains high-quality data, only 4000 depth points are provided in 3D-RMA. The Spacetime Faces dataset [35], which contains face meshes made up of 23000 vertices, was captured using synchronised cameras. No RGB or grayscale information is provided along with the previously mentioned datasets.

Databases captured using Kinect 1 include the Biwi Kinect Database [20] and KinectFaceDB [36]. FaceWarehouse [37] contains raw RGB-D data and also faces reconstructed using Kinect Fusion. There are also two online 3D databases where the sensor is not specified: the University of York 3D Face Database [38] and the 3dMD database [39], which as a project also contains 3D reconstructions of heads and entire human bodies. Samples from nine databases that had both 2D and 3D samples available are shown in Fig. 1. A detailed comparison of the aforementioned nine head pose databases is summarized in Table 1.

These databases are available for testing 3D facial recognition or head pose estimation algorithms. Regrettably, not all of them include varying head poses, one of them is missing 2D data (3D-RMA), and in the case of the others (captured with laser scanners) the RGB data is not well aligned with the 3D data. Another issue with the available databases is that, even though they can be used for testing 3D face recognition and head pose estimation, the fitted models are sensor dependent. It makes sense that a classifier trained on one type of input data would fail on test data acquired using a different scanning device.

Given that various methods for RGB-D head pose estimation exist, but all of them require a database captured with the same type of scanner, and that there is a lack of such publicly available collections, in the next section we present the novel SASE database, captured with Kinect 2, which attempts to address the aforementioned issues.

3 SASE Database Description

3.1 Overall Description

The database introduced in this paper contains RGB-D information (\(424\times 512\) 16-bit depth frames and \(1080\times 1920\) RGB frames) of different head poses of 50 subjects, obtained using the Microsoft Kinect 2 sensor. The subjects comprise 32 males and 18 females in the age range of 7–35 years. The subjects were asked to move their heads slowly in front of the device to achieve different combinations of yaw, pitch and roll rotation. Altogether, more than 600 frames of each subject were recorded. For those frames where the nose tip location was attainable, the ground truth 3D nose tip location and the head orientation described by the yaw, roll and pitch angles are provided, computed using the formulae shown in Sect. 3.4. The rest of the samples were retained, as more sophisticated methods such as ICP can be used in the future to label them. The depth information (scaled for display purposes) and corresponding RGB data can be seen in part (a) of Fig. 2.
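To illustrate how a depth/RGB frame pair of the stated resolutions can be loaded and how the 16-bit depth values can be scaled for display (as done for Fig. 2), the following is a minimal Python sketch. The file names, image format and display range are hypothetical, since the paper does not fix a distribution format.

```python
import cv2
import numpy as np

# Hypothetical file names; the paper does not specify the distribution format
# of the SASE frames, so adjust paths and extensions to the actual release.
depth = cv2.imread("subject01_frame0001_depth.png", cv2.IMREAD_UNCHANGED)  # 424x512, uint16, millimetres
rgb = cv2.imread("subject01_frame0001_rgb.png")                            # 1080x1920, uint8, BGR

# Map the 16-bit depth values into an 8-bit image for display, assuming a
# working range of roughly 0.5-4.5 m in front of the sensor.
near, far = 500, 4500
depth_vis = np.clip(depth.astype(np.float32), near, far)
depth_vis = ((depth_vis - near) / (far - near) * 255).astype(np.uint8)

cv2.imshow("depth (scaled for display)", depth_vis)
cv2.imshow("rgb", rgb)
cv2.waitKey(0)
cv2.destroyAllWindows()
```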

3.2 Kinect 2

The Microsoft Kinect 2 consists of three main components, namely an RGB camera, an IR emitter and an IR sensor. The RGB resolution of this new sensor is \(1080\times 1920\), i.e. a full HD image, compared to the Kinect 1's \(480\times 640\). The IR emitter and sensor employ time-of-flight technology to calculate the distance of each point, which results in 1 mm depth accuracy at around 1 m distance. Even though this version also gives false information at very abrupt edges (70+ degrees), the failure angles are steeper than those of the Kinect 1 [40].

3.3 Acquisition Details

In this section, the details of the setup and recording process are explained thoroughly. Overall, the recording process spanned about a month, as the subjects were recorded during a number of sessions which differed in the number of people captured.

The software used for the capture was a Python script written using a Kinect 2 Python library and OpenCV. The laptop used for the capturing process has an i5-4200U processor with an integrated graphics card and 8 GB of RAM. It also carried an SSD to speed up the frame rate. However, due to the restrictions of the laptop's processor, the frame rate was measured to be 5 fps.
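The paper does not name the exact Kinect 2 Python library that was used; as a rough illustration of what such a capture script might look like, the sketch below assumes the pykinect2 bindings, with hypothetical output file names and a simple per-subject frame budget.

```python
import numpy as np
import cv2
from pykinect2 import PyKinectV2, PyKinectRuntime

# Open both the colour and the depth streams of the Kinect 2.
kinect = PyKinectRuntime.PyKinectRuntime(
    PyKinectV2.FrameSourceTypes_Color | PyKinectV2.FrameSourceTypes_Depth)

frame_id = 0
while frame_id < 600:  # roughly the per-subject frame count reported in the paper
    if kinect.has_new_depth_frame() and kinect.has_new_color_frame():
        # Depth arrives as a flat uint16 buffer in millimetres (424x512).
        depth = kinect.get_last_depth_frame().reshape(424, 512)
        # Colour arrives as a flat BGRA buffer (1080x1920x4); drop the alpha channel.
        color = kinect.get_last_color_frame().reshape(1080, 1920, 4)[:, :, :3]

        np.save(f"depth_{frame_id:04d}.npy", depth)        # hypothetical file naming
        cv2.imwrite(f"rgb_{frame_id:04d}.png", color)
        frame_id += 1

kinect.close()
```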

The head poses in the database have yaw values varying from \(-75\) to 75 degrees, and pitch and roll values varying from \(-45\) to 45 degrees. These constraints were chosen because they represent the maximum angles that can be achieved under normal conditions by a human sitting in front of a camera and moving only the head, without changing their body position. The aforementioned restrictions were determined experimentally; they do not necessarily apply to all humans but rather represent an average trend.

The angle limitations are different for each subject, since not all people can rotate their head by exactly the same amount. In order to avoid this problem, all the participants were instructed in advance not to rotate their heads too far during the capturing process. Participants were also free to perform different facial expressions in the different poses while the data was being captured, in order to obtain a more natural database. This resulted in a collection of mostly neutral faces with some happy expressions mixed in. It is important to note that this database does not focus on representing various emotions and thus cannot be used for emotion recognition applications.

The sketch and the actual experimental setup can be seen in Fig. 3(a) and (b), respectively. The Kinect 2 was placed on a stand, and the subject sat approximately 1 m away from the camera. A white canvas screen was used as the background.

In order to label the database, five (in the case of facial hair, sometimes six) light blue stickers were placed on each participant's face: one on the forehead/between the eyebrows, one on the chin, one on the tip of the nose and two on the cheekbones/cheeks, as can be seen in part (b) of Fig. 2.

These locations were picked because they are visible to the camera from various angles. However, the exact placement and even the symmetry of the markers are unimportant, as the markers remained unchanged throughout the whole recording process of each subject. Only the marker on the nose tip was placed at exactly the same spot for each person, because the 3D coordinates of the nose tip are taken as the head location provided in the database labels.

Fig. 2. (a) Cutouts from the database, for rows as (pitch, yaw, roll): (\(-32\), 0, 3), (2, \(-49\), 2) and (3, \(-1\), \(-39\)), respectively, and (b) the placement of the coloured stickers.

Fig. 3. The (a) sketch and (b) real scene of the setup for acquiring data for the database.

The illumination was kept low in order not to over-illuminate the light blue markers and make them undetectable. This marker colour was chosen because it is easily distinguishable from the human face. The thickness of the stickers is negligible, so they do not cause notable occlusions in the depth information.

3.4 Optimisation and Ground Truth Values

In order to calculate the head poses, the initial pose of the person was taken as the reference pose. The initial pose has a frontal orientation, in which the subject is looking at the camera. Considering the noise of the sensor, 20 frames of this pose were captured and averaged to obtain a good starting value for further calculations. After this, the markers were used to calculate the pairwise differences between the averaged initial pose and the current pose (a minimal averaging sketch is given below).
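The averaging step can be sketched as follows, assuming the 3D marker coordinates have already been extracted per frame (see the colour-based detection sketch later in this subsection); the dictionary layout and marker names are purely illustrative.

```python
import numpy as np

def average_reference_markers(marker_frames):
    """Average 3D marker positions over the first frontal frames.

    marker_frames: list of dicts mapping marker name -> (x, y, z), one dict per
    frame (20 frames were used for the SASE reference pose).
    """
    names = marker_frames[0].keys()
    return {name: np.mean([f[name] for f in marker_frames], axis=0)
            for name in names}

def reference_vectors(ref_markers):
    """Vectors from the nose tip to every other averaged marker."""
    nose = np.asarray(ref_markers["nose"])
    return {name: np.asarray(p) - nose
            for name, p in ref_markers.items() if name != "nose"}
```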

Fig. 4. (a) The vectors used for the calculation and (b) the rotation angles yaw, pitch and roll.

The markers were detected using their colour information. As the pose changes, not all of the markers are visible to the acquisition device at all times. In order to be able to calculate the orientation of the head pose, at least three of the markers need to be visible. Their real-world coordinates and the vectors between them were calculated. Then the rotation matrix between the initial and current vectors was found, which was used to obtain the orientation of the head. These vectors are illustrated in Fig. 4, part (a). The central point of the head is considered to be the nose tip, because it is easy to locate using depth information and fits the intended application of the database.
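A colour-based detection step of this kind can be sketched with OpenCV as below; the HSV thresholds for "light blue" are only an illustrative guess, since the paper does not report the exact values. The resulting pixel centroids would still need to be mapped into 3D camera-space coordinates using the depth map and the sensor's coordinate mapping.

```python
import cv2
import numpy as np

def detect_markers(rgb_bgr, min_area=15):
    """Detect light blue sticker candidates in a BGR frame.

    Returns a list of (u, v) pixel centroids. Thresholds are illustrative.
    """
    hsv = cv2.cvtColor(rgb_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (90, 60, 120), (110, 255, 255))   # rough "light blue" band
    # OpenCV 4 signature: (contours, hierarchy)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    centroids = []
    for c in contours:
        if cv2.contourArea(c) < min_area:
            continue                      # reject small speckles
        m = cv2.moments(c)
        centroids.append((m["m10"] / m["m00"], m["m01"] / m["m00"]))
    return centroids
```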

The following optimisation process and the calculation of the rotation angles are described in detail in order to illustrate how the markers were used, and also to show why the markers could be placed at different positions for each subject.

In this paper, the head pose is viewed in a 3D Cartesian coordinate system. The x-axis is defined horizontally and parallel to the sensor, with the right side positive; the y-axis is defined vertically, pointing upwards; and the z-axis is defined perpendicular to both of these axes, so that they form a left-handed system. In this coordinate system, the head pose can be defined by a set of six parameters: the pitch, yaw and roll angles, as seen in Fig. 4 part (b), and the 3D location coordinates x, y, z.

In this database the nose marker is used for calculating the translation of the head. By subtracting the location of the nose from the locations of the remaining markers, the rotation of the head can be viewed as the rotation of an object around a fixed point in space. In this way, only the rotation angles remain to be determined.

For the calculation of the angles, all of the acquired markers are matched to their original positions. For the first few frames, the average vectors starting from the nose are calculated. In subsequent frames, all the vectors from the nose to the existing vertices (visible markers) are used to determine the angles by solving a simple optimisation problem.

From Euler’s fixed-point rotation theorem [41], it follows that any 3D rotation can be described as the product of three separate rotations, one around each axis. The pitch describes the rotation angle around the x-axis, the yaw around the y-axis and the roll around the z-axis. Thus, using the Euler fixed-point rotations, the matrix that describes the rotation by the pitch angle \(\alpha \) is:

$$\begin{aligned} R_\alpha = \begin{pmatrix} 1 &{} 0 &{} 0 \\ 0 &{} \cos \alpha &{} -\sin \alpha \\ 0 &{} \sin \alpha &{} \cos \alpha \end{pmatrix} \end{aligned}$$
(1)

Similarly, the matrices \(R_\beta \) and \(R_\gamma \) are defined for the yaw and roll angles, respectively.

By using matrix multiplication, the overall rotation matrix \(R(\alpha ,\beta ,\gamma ) = (r_{ij}^{\alpha \beta \gamma })\) is obtained as:

$$\begin{aligned} R(\alpha ,\beta ,\gamma ) = R_\alpha R_\beta R_\gamma \end{aligned}$$
(2)
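The composition in Eq. (2) can be written directly in code. The sketch below builds the three elementary matrices with NumPy; the pitch matrix mirrors Eq. (1), while the yaw and roll matrices use the standard elementary-rotation forms, which the paper only describes as "defined similarly".

```python
import numpy as np

def rotation_matrix(alpha, beta, gamma, degrees=True):
    """R(alpha, beta, gamma) = R_alpha @ R_beta @ R_gamma, as in Eq. (2).

    alpha = pitch (about x), beta = yaw (about y), gamma = roll (about z).
    """
    if degrees:
        alpha, beta, gamma = np.radians([alpha, beta, gamma])

    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)

    R_alpha = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])   # pitch, Eq. (1)
    R_beta = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])    # yaw (standard form)
    R_gamma = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])   # roll (standard form)
    return R_alpha @ R_beta @ R_gamma
```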

Thus, when the initial vectors form the matrix \(X = (x_{ij})\) and the new vectors form the matrix \(\tilde{X} = (\tilde{x}_{ij})\), the rotation can be written as:

$$\begin{aligned} \tilde{X} =R(\alpha ,\beta ,\gamma ) X \end{aligned}$$
(3)

In the case of more than three equations, the linear system may not be exactly solvable: it is an overdetermined linear equation system [42], which can instead be solved as a least-squares optimisation problem:

$$\begin{aligned} \underset{\alpha ,\beta ,\gamma }{{\text {argmin}}} \bigg [\sum _i \sum _j \Big (x_{ij} - \sum _k r_{ik}^{\alpha \beta \gamma } \tilde{x}_{kj}\Big )^2 \bigg ] \end{aligned}$$
(4)

The optimisation was performed with the default constrained optimisation method [43] provided by SciPy [44]. The minimum and maximum angle restrictions explained to the subjects were also fed into the optimisation process as bounds. In Fig. 5, various head poses of one of the subjects in the SASE database are illustrated with the respective rotated basis vectors. The blue vector is the rotated y-axis, the red vector is the rotated x-axis and the green vector is the rotated z-axis. For easier illustration, they were all projected onto the original xy-plane.
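As a concrete sketch of how such a bounded least-squares fit could be carried out with SciPy, the code below minimises the squared difference between the current nose-to-marker vectors and the rotated reference vectors (following Eq. (3)). The specific solver (L-BFGS-B), the data layout and the use of scipy's Rotation class, whose intrinsic "XYZ" order reproduces the \(R_\alpha R_\beta R_\gamma \) product of Eq. (2) with standard sign conventions, are assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.transform import Rotation

def residual(angles, X_ref, X_cur):
    """Sum of squared errors between current vectors and rotated reference vectors."""
    # Intrinsic "XYZ" order: R = R_pitch(x) @ R_yaw(y) @ R_roll(z), cf. Eq. (2).
    R = Rotation.from_euler("XYZ", angles, degrees=True).as_matrix()
    return np.sum((X_cur - R @ X_ref) ** 2)

def estimate_angles(X_ref, X_cur):
    """Estimate (pitch, yaw, roll) in degrees.

    X_ref, X_cur: 3 x N arrays of nose-to-marker vectors for the averaged
    reference pose and the current frame (only the visible markers).
    """
    bounds = [(-45, 45), (-75, 75), (-45, 45)]   # pitch, yaw, roll limits used in SASE
    res = minimize(residual, x0=np.zeros(3), args=(X_ref, X_cur),
                   bounds=bounds, method="L-BFGS-B")
    return res.x
```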

Fig. 5. Various head poses of a subject from the proposed SASE database.

4 Conclusion

We presented a 3D head pose database captured using the Kinect 2 camera, which can be used in a variety of relevant contexts, such as testing the performance of 3D head pose estimation algorithms. Due to the richness of the data provided by the Kinect camera, which includes both colour and depth, the database can be considered a useful resource for producing training sets. In fact, the main motivation for creating the database was the fact that no 3D head pose database captured with the second generation of the Kinect camera has been offered in the existing literature. For creating the database, 50 subjects were recorded while taking different head poses in front of the camera, which resulted in more than 600 sample frames per person and a total database size of more than 30000 multi-modal, head-pose-annotated frames.