Abstract
Head pose estimation has become increasingly important for facial and emotion recognition, as well as for human-computer interaction. Developing head pose estimation methods that use RGB and depth information requires a suitable 3D head pose database. A few datasets are available, such as the Biwi Kinect head pose database, which was composed using Kinect 1 and therefore offers low-quality depth information. In this paper, a new 3D head database, SASE, is introduced. The data in SASE were acquired with the Microsoft Kinect 2 camera and include both RGB and depth information. The SASE database comprises a total of 30000 frames with annotated markers, collected from 32 male and 18 female subjects. For each person, a large sample of head poses is included, within the bounds of \(-75\) to \(75^\circ \) of yaw and \(-45\) to \(45^\circ \) of pitch and roll around the respective axes. The acquisition of the database and its characteristics are explained in detail.
1 Introduction
The visualisation and animation of human movements [1,2,3] have recently been garnering the attention of numerous researchers. To improve human-computer interaction, it is necessary to develop algorithms that can interpret the behavioural movements made by humans and also mimic these actions in a natural way.
Over the years it has become quite common to use large databases to train and test active appearance models (AAM) that pinpoint and track the locations of landmark points in a human face [4,5,6]. One of the more recent approaches, presented in [7], combined Lucas-Kanade optical flow with an active appearance model that utilises gradient descent. As a different approach, the correlation between appearance and shape was used in [8]. These types of models locate and track the necessary facial features easily when the head orientation is near-frontal, with only slight changes in angle, but tend to fail when the proportions of the face change due to rotation.
To prevent this, a method using auxiliary attributes was proposed in [9]. The authors added traits such as gender, whether the person was wearing glasses, and whether the person was looking to the left, to the right or straight ahead. This extra information was included in their deep convolutional network. It was shown that this approach gives more accurate results, as the AAM could be aligned based on samples more similar to each other, rather than using a model without any data stratification.
There have been several techniques proposed on robust tracking by incorporating geometric constraints to correlate the position of facial landmarks [10, 11]. Even though single step face alignment methods have been proposed [12, 13], the most common and recent approach for face alignment is to model the relationship between texture and geometry with a cascade of regression functions [14, 15]. Many methods used RGB-D cameras to conduct real-time depth-based facial tracking and tried to register dynamic expression model with observed depth data [16,17,18].
However, the aforementioned methods of head pose estimation rely heavily on the lighting conditions. To achieve accurate head orientation recognition in real life, a method would have to be trained with a large variety of poses under all kinds of lighting conditions. On the other hand, training with highly dispersed data could make the classification less reliable and produce false detections of angles or facial landmarks. To bridge this gap, the depth and RGB data from the Kinect were used in [19]. As a preliminary step, the authors constructed a 3D head shape model, after which they took three RGB images of the subject: from the left, the right and the front. These images were fitted onto the 3D AAM. The 3D AAM was then aligned to the input 3D frame using the RGB data as constraining parameters.
Fanelli et al. [20] proposed a head pose estimation method based purely on depth data. Fanelli et al. [21] also applied their method to data from the Microsoft Kinect 1, but the sensor at that time gave fairly inaccurate results due to the low quality of the available depth information. With this work, a depth database, BIWI, was also provided. Nevertheless, adding different AAM-type algorithms on top of Fanelli's framework has produced fairly accurate real-time facial landmark tracking applications [22, 23]. Now that the Kinect 2 is available, with much more accurate depth information and higher-resolution RGB images, the results of depth-based head pose estimation algorithms could be improved upon and further analysed to achieve faster and smoother facial landmark detection and tracking [24].
Another application of depth information is face recognition that does not depend on lighting conditions. Such real-time identification was proposed in [25]. The face was segmented out based on depth discontinuity, and faces with a resolution lower than \(60\times 60\) were disregarded. Then a suitable number of canonical faces were formed, and consequently the iterative closest point (ICP) algorithm [26] was used to align gallery faces to the standardised faces. In the recognition stage, the ICP algorithm was used to align the probe face to the canonical ones, and the gallery face with the most similar alignment was picked as the match. This proved to be a robust and computationally cheap method.
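The core alignment step of such an ICP-based pipeline can be sketched as a minimal point-to-point loop: match each source point to its nearest destination point, solve for the best rigid transform (Kabsch), and repeat. The sketch below is illustrative only; the function name and parameters are ours, not from [25] or [26], and practical variants add trimming, sampling and early termination.

```python
import numpy as np

def icp_align(src, dst, iters=20):
    """Minimal point-to-point ICP sketch (illustrative, not the method of [25]).
    Iteratively matches each source point to its nearest destination point
    and solves for the rigid transform via Kabsch/SVD. Returns the
    accumulated rotation R and translation t mapping src onto dst."""
    src = src.copy()
    R_total, t_total = np.eye(3), np.zeros(3)
    for _ in range(iters):
        # Brute-force nearest-neighbour correspondences
        d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        matched = dst[d2.argmin(axis=1)]
        # Kabsch: optimal rotation between the centred point sets
        mu_s, mu_d = src.mean(0), matched.mean(0)
        H = (src - mu_s).T @ (matched - mu_d)
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ D @ U.T
        t = mu_d - R @ mu_s
        src = src @ R.T + t                      # apply p -> R p + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total
```

With a small initial misalignment and well-separated points, the nearest-neighbour matches are correct from the first iteration and one Kabsch step already recovers the transform.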
One of the most important facts about 3D sensors, and about algorithms trained on the data they acquire, is that the RGB-D output of each sensor is unique: some outputs are denser, while others have a lower error. Due to this, a method trained on data from one sensor is incompatible with data from other sensors. The need for a head pose database for the Microsoft Kinect 2 arises from the fact that it is nowadays one of the most accurate and easily available RGB-D sensors. Additionally, for further development of depth-based recognition methods, a variety of databases with heads in different poses is needed; thus, in this paper, we present the SASE database, which was gathered with the Kinect 2 and can contribute to future research in the depth-related facial recognition field.
The SASE database is composed of depth and RGB frames of 50 subjects performing different head poses in front of the sensor. The head poses have high variations in yaw, pitch and roll angles, resulting in a myriad of poses. For each subject, more than 600 frames were captured, and most of them are labelled with the head location and rotation angles.
The rest of the paper is organized as follows: in Sect. 2 a short overview of available 3D face databases is presented. In Sect. 3 the acquisition of the SASE database and the method of calculating the ground truth values are described in detail, together with images of the setup and the necessary formulae. A conclusion is drawn in the last section.
2 Brief Overview of Existing 3D Head Databases
The number of available depth databases is quite small, and most of them have been captured with sophisticated scanners that require a lot of time for data collection. These types of devices have very few real-time applications, which reduces the overall usefulness of datasets captured with them. Databases captured with high-definition laser scanners include ND-2004 [27], BJUT-3D [28] and UMB-DB [29]. High-quality stereo-imaging systems were used for capturing BU-3DFE [30], XM2VTSDB [31] and the Texas 3D-FRD [32].
Some databases were captured using a structured light system instead of a depth camera, such as the 3D-RMA database [33] and the Bosphorus database [34]. While Bosphorus contains high-quality data, only 4000 depth points are provided in 3D-RMA. The Spacetime Faces dataset [35], which contains face meshes made up of 23000 vertices, was captured using synchronised cameras. Neither RGB nor grayscale information is provided along with the previously mentioned datasets.
Databases captured using the Kinect 1 include the Biwi Kinect Database [20] and KinectFaceDB [36]. FaceWarehouse [37] contains raw RGB-D data as well as faces reconstructed using Kinect Fusion. There are also two online 3D databases where the sensor is not specified: the University of York 3D Face Database [38] and the 3dMD database [39], which as a project also contains 3D reconstructions of heads and entire human bodies. Samples from nine databases that had both 2D and 3D samples available are shown in Fig. 1. A detailed comparison of the aforementioned nine head pose databases is summarised in Table 1.
These databases are available for testing 3D facial recognition or head pose estimation algorithms. Regrettably, not all of them include varying head poses, one of them (3D-RMA) is missing 2D data, and in the case of the others (captured with laser scanners) the RGB data is not well aligned with the 3D data. Another issue with the available databases is that, even though they can be used for testing 3D face recognition and head pose estimation, the fitted models are sensor-dependent. It stands to reason that a classifier trained on one type of input data would fail on test data acquired using a different scanning device.
Given that various methods for RGB-D head pose estimation exist, but all require a database captured with the same type of scanner, and that there is a lack of such publicly available collections, in the next section we present the novel SASE database, captured with the Kinect 2, which attempts to address the aforementioned issues.
3 SASE Database Description
3.1 Overall Description
The database introduced in this paper contains RGB-D information (\(424\times 512\) 16-bit depth frames and \(1080\times 1920\) RGB frames) of different head poses of 50 subjects, 32 male and 18 female, in the age range of 7–35 years, obtained using the Microsoft Kinect 2 sensor. The subjects were asked to move their heads slowly in front of the device to achieve different combinations of yaw, pitch and roll rotation. Altogether, more than 600 frames were recorded for each subject. For those frames where the nose tip location was attainable, the ground truth of the 3D nose tip location and the head orientation, described by yaw, pitch and roll angles, is provided using the formulae shown in Sect. 3.4. The rest of the samples were retained, as more sophisticated methods such as ICP can be used in the future to label them. The depth information (scaled for display purposes) and the corresponding RGB data can be seen in part (a) of Fig. 2.
3.2 Kinect 2
The Microsoft Kinect 2 consists of three main components: an RGB camera, an IR emitter and an IR sensor. The RGB resolution of this new sensor is \(1080\times 1920\), a full HD image, compared to the Kinect 1's \(480\times 640\). The IR components employ time-of-flight technology to calculate the distance of each point, which results in 1 mm depth accuracy at around 1 m distance. Even though this version also gives false information at very abrupt edges (70+ degrees), the failure angles are steeper than those of the Kinect 1 [40].
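Working with such depth frames typically involves back-projecting a depth pixel to a 3D point with a pinhole camera model. The sketch below illustrates this; the intrinsic values are placeholders chosen for a \(512\times 424\) depth frame, not calibrated Kinect 2 parameters, and the function name is ours.

```python
import numpy as np

# Placeholder intrinsics for a 512x424 depth frame; real values vary per
# device and should come from calibration or the sensor SDK.
FX, FY = 365.0, 365.0      # focal lengths in pixels (assumed)
CX, CY = 256.0, 212.0      # principal point (assumed)

def depth_pixel_to_point(u, v, depth_mm):
    """Back-project a depth pixel (u, v) with depth in millimetres to a
    3D point in camera coordinates (metres), using a pinhole model."""
    z = depth_mm / 1000.0
    x = (u - CX) * z / FX
    y = (CY - v) * z / FY   # flip v so that y points upwards
    return np.array([x, y, z])
```

For example, the principal point at 1000 mm maps to \((0, 0, 1)\) m, and points further from the image centre spread out proportionally to their depth.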
3.3 Acquisition Details
In this section, the details of the setup and the recording process are explained thoroughly. Overall, the recording process spanned about a month, as the subjects were recorded during a number of sessions, which differed in the number of people captured.
The software used for the capture was a Python script written using the Kinect 2 Python library and OpenCV. The laptop used for the capturing process had an i5-4200U processor with an integrated graphics card and 8 GB of RAM. It also carried an SSD to speed up the frame rate. However, due to the limitations of the laptop's processor, the frame rate was measured to be 5 fps.
The head poses in the database have yaw values varying from \(-75\) to \(75^\circ \), with pitch and roll varying from \(-45\) to \(45^\circ \). These constraints were chosen because they represent the maximum angles that can be achieved under normal conditions by a human sitting in front of a camera and moving only the head, without changing body position. The aforementioned restrictions were determined experimentally and do not necessarily apply to all humans, but rather represent an average trend.
The angle limitations are different for each subject, as not all people can rotate their head by exactly the same amount. To avoid this problem, all participants were instructed in advance not to rotate their heads too far during the capturing process. Participants were also free to perform different facial expressions in the different poses, in order to obtain a more natural database. This resulted in a collection of mostly neutral faces with some happy expressions mixed in. It is important to note that this database does not focus on representing various emotions and thus cannot be used for emotion recognition applications.
The sketch and the actual experimental setup can be seen in Fig. 3(a) and (b), respectively. The Kinect 2 was placed on a stand, and the subject sat approximately 1 m distance away from the camera. A white canvas screen was used as background.
In order to label the database, five (in the case of facial hair, sometimes six) light blue stickers were stuck onto each participant's face: one on the forehead between the eyebrows, one on the chin, one on the tip of the nose and two on the cheekbones/cheeks, as can be seen in part (b) of Fig. 2.
These locations were picked because they are visible to the camera from various angles. However, the exact placement and even the symmetry of the markers are unimportant, as the markers remained unchanged throughout the whole recording process for each subject. Only the marker on the nose tip was placed at exactly the same spot for each person, because the 3D coordinates of the nose tip are taken as the head location provided in the database labels.
The illumination was kept low in order not to over-illuminate the light blue markers and make them undetectable. This marker colour was chosen because it is easily distinguishable from the human face. The thickness of the stickers is negligible, so they do not cause notable occlusions in the depth information.
3.4 Optimisation and Ground Truth Values
To calculate head poses, the initial pose of the person was taken as the reference pose. The initial pose has a frontal orientation, in which the subject is looking at the camera. Considering the noise of the sensor, 20 frames of this pose were captured and averaged to obtain a good starting value for further calculations. Afterwards, the markers were used to calculate the pairwise difference between the averaged initial pose and the current pose.
The markers are detected using their colour information. As the pose changes, not all of the markers are visible to the acquisition device at all times. In order to calculate the orientation of the head, at least three markers need to be visible. Their real-world coordinates and the vectors between them were calculated. Then the rotation matrix between the initial and current vectors was found and used to obtain the orientation of the head. These vectors are illustrated in Fig. 4, part (a). The central point of the head is considered to be the nose tip, because it is easy to locate using depth information and fits the application of the database.
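Colour-based marker detection of this kind can be sketched as a simple RGB threshold followed by a centroid computation. The threshold bounds and function name below are illustrative assumptions, not the values used for SASE, and per-marker separation (e.g. connected-component labelling to split the mask into individual stickers) is omitted for brevity.

```python
import numpy as np

def find_marker_centroid(rgb, lo=(90, 150, 180), hi=(160, 220, 255)):
    """Return the centroid (row, col) of pixels whose (R, G, B) values
    fall inside a light-blue range, or None if no pixel matches.
    Bounds are illustrative; a real pipeline would calibrate them and
    split the mask into one component per sticker."""
    mask = np.all((rgb >= lo) & (rgb <= hi), axis=-1)
    if not mask.any():
        return None
    rows, cols = np.nonzero(mask)
    return rows.mean(), cols.mean()
```

The centroid in image coordinates can then be back-projected with the depth frame to obtain the marker's real-world coordinates.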
The following optimisation process and the calculation of the rotation angles are described in detail in order to illustrate how the markers were used, and to show why the markers could be placed at different positions for each subject.
In this paper, the head pose is viewed in a 3D Cartesian coordinate system. The x-axis is defined horizontally, parallel to the sensor, with the right side positive; the y-axis is defined vertically, pointing upwards; and the z-axis is defined perpendicular to both of these axes, so that they form a left-handed system. In this coordinate system, the head pose can be defined by a set of six parameters: the pitch, yaw and roll angles, as seen in Fig. 4 part (b), and the 3D location coordinates x, y, z.
In this database, the nose marker is used for calculating the translation of the head. By subtracting the location of the nose from the rest of the markers, the rotation of the head can be viewed as the rotation of an object around a fixed point in space. This way, only the rotation angles remain to be determined.
To calculate the angles, all the acquired markers are matched to their original positions. For the first few frames, the average vectors starting from the nose are calculated. In subsequent steps, the vectors from the nose to all visible markers are used to determine the angles via a simple optimisation problem.
From Euler’s fixed point rotation theorem [41], it follows that any 3D rotation can be described as the product of three separate rotations around each axis. The pitch describes the rotation angle around the x-axis, yaw around the y-axis and roll around the z-axis. Thus, using the Euler fixed point rotations, the matrix that describes the rotation by the pitch angle, \(\alpha \), is:
\[ R_\alpha = \begin{pmatrix} 1 &{} 0 &{} 0 \\ 0 &{} \cos \alpha &{} -\sin \alpha \\ 0 &{} \sin \alpha &{} \cos \alpha \end{pmatrix} \]
Similarly, matrices \(R_\beta \) and \(R_\gamma \) for yaw and roll are defined, respectively.
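The three axis rotations and their composition can be written down directly in code. The sketch below is ours; in particular, the composition order \(R_\gamma R_\beta R_\alpha \) is an assumption, as the paper does not state the order explicitly.

```python
import numpy as np

def rot_x(a):
    """Pitch: rotation by angle a about the x-axis."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_y(b):
    """Yaw: rotation by angle b about the y-axis."""
    c, s = np.cos(b), np.sin(b)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_z(g):
    """Roll: rotation by angle g about the z-axis."""
    c, s = np.cos(g), np.sin(g)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rotation(a, b, g):
    """Compose the three axis rotations; the order is an assumption."""
    return rot_z(g) @ rot_y(b) @ rot_x(a)
```

Any such composition is orthogonal with determinant 1, i.e. a proper rotation.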
By using matrix multiplication, the overall rotation matrix \(R(\alpha \beta \gamma ) = (r_{i j}^{\alpha \beta \gamma })\) can be achieved by:
\[ R(\alpha \beta \gamma ) = R_\gamma R_\beta R_\alpha \]
So when the initial vectors are in the matrix \(X = (x_{i j})\) and the new vectors are in the matrix \(\tilde{X} = (\tilde{x}_{i j})\), then the rotation can be written as:
\[ \tilde{X} = R(\alpha \beta \gamma )\, X \]
When there are more than three equations, the linear system may not be uniquely solvable. It becomes an overdetermined linear equation system [42], which can be solved as a least-squares optimisation problem:
\[ \min _{\alpha , \beta , \gamma } \left\Vert R(\alpha \beta \gamma )\, X - \tilde{X} \right\Vert _F^2 \]
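Since the paper solves this with SciPy's bounded optimiser, a least-squares solver of this form could be sketched as follows. The function names, composition order and bounds handling are our assumptions, intended only to illustrate the shape of the problem.

```python
import numpy as np
from scipy.optimize import minimize

def rotation(a, b, g):
    """Rz(roll) @ Ry(yaw) @ Rx(pitch); the composition order is assumed."""
    ca, sa = np.cos(a), np.sin(a)
    cb, sb = np.cos(b), np.sin(b)
    cg, sg = np.cos(g), np.sin(g)
    rx = np.array([[1.0, 0.0, 0.0], [0.0, ca, -sa], [0.0, sa, ca]])
    ry = np.array([[cb, 0.0, sb], [0.0, 1.0, 0.0], [-sb, 0.0, cb]])
    rz = np.array([[cg, -sg, 0.0], [sg, cg, 0.0], [0.0, 0.0, 1.0]])
    return rz @ ry @ rx

def estimate_angles(X, X_new, bounds):
    """X, X_new: 3xN matrices of nose-to-marker vectors in the reference
    and current frame. Minimises ||R(a, b, g) X - X_new||_F^2 subject to
    per-angle bounds (in radians), using L-BFGS-B."""
    def cost(angles):
        return np.sum((rotation(*angles) @ X - X_new) ** 2)
    res = minimize(cost, x0=np.zeros(3), bounds=bounds, method='L-BFGS-B')
    return res.x
```

Because the cost is smooth and the head rotations start near the frontal reference, initialising at zero with the physically motivated angle bounds keeps the optimiser in the intended basin.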
The optimisation was performed using the default constrained optimisation method [43] provided by SciPy [44]. The minimum and maximum angle restrictions explained to the subjects were also fed into the optimisation process. In Fig. 5, various head poses of one of the subjects in the SASE database are illustrated with the corresponding rotated basis vectors. The blue vector is the rotated y-axis, the red vector is the rotated x-axis and the green vector is the rotated z-axis. For easier illustration, they were all projected onto the original xy-plane.
4 Conclusion
We presented a 3D head pose database captured with the Kinect 2 camera, which can be used in a variety of relevant contexts, such as testing the performance of 3D head pose estimation algorithms. Due to the richness of the data provided by the Kinect camera, which includes both colour and depth, the database can be considered a useful resource for producing training sets. In fact, the main motivation for creating the database was that no 3D head pose database captured with the second generation of the Kinect camera had been offered in the existing literature. For creating the database, 50 subjects were recorded while performing different head poses in front of the camera, which resulted in more than 600 sample frames per person, and a total database size of more than 30000 multi-modal, head-pose-annotated frames.
References
Cao, C., Wu, H., Weng, Y., Shao, T., Zhou, K.: Real-time facial animation with image-based dynamic avatars. ACM Trans. Graph. 35(4), 126 (2016)
Shuster, G.S., Shuster, B.M.: Avatar eye control in a multi-user animation environment. US Patent Ap. 14/961,744, 7 Dec 2015
Demirel, H., Anbarjafari, G.: Data fusion boosted face recognition based on probability distribution functions in different colour channels. EURASIP J. Adv. Signal Process. 2009, 25 (2009)
Yan, S., Liu, C., Li, S.Z., Zhang, H., Shum, H.Y., Cheng, Q.: Face alignment using texture-constrained active shape models. Image Vis. Comput. 21(1), 69–75 (2003)
Liu, X.: Generic face alignment using boosted appearance model. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE (2007)
Koutras, P., Maragos, P.: Estimation of eye gaze direction angles based on active appearance models. In: IEEE International Conference on Image Processing, pp. 2424–2428. IEEE (2015)
Adeshina, S.A., Cootes, T.F.: Automatic model matching using part based model constrained active appearance models for skeletal maturity. In: 2015 Twelve International Conference on Electronics Computer and Computation, pp. 1–5. IEEE (2015)
Zhou, H., Lam, K.M., He, X.: Shape-appearance-correlated active appearance model. Pattern Recogn. 56, 88–99 (2016)
Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Learning deep representation for face alignment with auxiliary attributes (2015)
Yang, H., Mou, W., Zhang, Y., Patras, I., Gunes, H., Robinson, P.: Face alignment assisted by head pose estimation. arXiv preprint arXiv:1507.03148 (2015)
Vlasic, D., Brand, M., Pfister, H., Popović, J.: Face transfer with multilinear models. ACM Trans. Graph. 24, 426–433 (2005). ACM
Sun, Y., Wang, X., Tang, X.: Deep convolutional network cascade for facial point detection. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3476–3483. IEEE (2013)
Zhang, J., Shan, S., Kan, M., Chen, X.: Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8690, pp. 1–16. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10605-2_1
Cao, X., Wei, Y., Wen, F., Sun, J.: Face alignment by explicit shape regression. Int. J. Comput. Vis. 107(2), 177–190 (2014)
Xiong, X., De la Torre, F.: Supervised descent method and its applications to face alignment. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 532–539. IEEE (2013)
Weise, T., Bouaziz, S., Li, H., Pauly, M.: Realtime performance-based facial animation. ACM Trans. Graph. 30, 77 (2011). ACM
Kazemi, V., Sullivan, J.: One millisecond face alignment with an ensemble of regression trees. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1867–1874. IEEE (2014)
Traumann, A., Daneshmand, M., Escalera, S., Anbarjafari, G.: Accurate 3D measurement using optical depth information. Electron. Lett. 51(18), 1420–1422 (2015)
Wang, H.H., Dopfer, A., Wang, C.C.: 3D AAM based face alignment under wide angular variations using 2D and 3D data. In: IEEE International Conference on Robotics and Automation, pp. 4450–4455. IEEE (2012)
Fanelli, G., Gall, J., Van Gool, L.: Real time head pose estimation with random regression forests. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 617–624. IEEE (2011)
Fanelli, G., Weise, T., Gall, J., Van Gool, L.: Real time head pose estimation from consumer depth cameras. In: Mester, R., Felsberg, M. (eds.) DAGM 2011. LNCS, vol. 6835, pp. 101–110. Springer, Heidelberg (2011). doi:10.1007/978-3-642-23123-0_11
Yang, F., Huang, J., Yu, X., Cui, X., Metaxas, D.: Robust face tracking with a consumer depth camera. In: IEEE International Conference on Image Processing, pp. 561–564. IEEE (2012)
Fanelli, G., Dantone, M., Van Gool, L.: Real time 3D face alignment with random forests-based active appearance models. In: IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, pp. 1–8. IEEE (2013)
Lusi, I., Anbarjafari, G., Meister, E.: Real-time mimicking of Estonian speaker’s mouth movements on a 3D avatar using kinect 2. In: International Conference on Information and Communication Technology Convergence, pp. 141–143. IEEE (2015)
Min, R., Choi, J., Medioni, G., Dugelay, J.L.: Real-time 3D face identification from a depth camera. In: International Conference on Pattern Recognition, pp. 1739–1742. IEEE (2012)
Chetverikov, D., Svirko, D., Stepanov, D., Krsek, P.: The trimmed iterative closest point algorithm. In: 16th International Conference on Pattern Recognition, 2002. Proceedings, vol. 3, pp. 545–548. IEEE (2002)
Faltemier, T.C., Bowyer, K.W., Flynn, P.J.: Using a multi-instance enrollment representation to improve 3D face recognition. In: IEEE International Conference on Biometrics: Theory, Applications, and Systems, pp. 1–6. IEEE (2007)
Baocai, Y., Yanfeng, S., Chengzhang, W., Yun, G.: BJUT-3D large scale 3D face database and information processing. J. Comput. Res. Dev. 6, 020 (2009)
Colombo, A., Cusano, C., Schettini, R.: UMB-DB: a database of partially occluded 3D faces. In: IEEE International Conference on Computer Vision Workshops, pp. 2113–2119. IEEE (2011)
Yin, L., Wei, X., Sun, Y., Wang, J., Rosato, M.J.: A 3D facial expression database for facial behavior research. In: International Conference on Automatic Face and Gesture Recognition, pp. 211–216. IEEE (2006)
Messer, K., Matas, J., Kittler, J., Luettin, J., Maitre, G.: XM2VTSDB: the extended M2VTS database. In: Second International Conference on Audio and Video-based Biometric Person Authentication, vol. 964, pp. 965–966. Citeseer (1999)
Gupta, S., Castleman, K.R., Markey, M.K., Bovik, A.C.: Texas 3D face recognition database. In: IEEE Southwest Symposium on Image Analysis & Interpretation, pp. 97–100. IEEE (2010)
3D RMA: 3D database. http://www.sic.rma.ac.be/~beumier/DB/3d_rma.html. Accessed 15 Apr 2016
Savran, A., Alyüz, N., Dibeklioğlu, H., Çeliktutan, O., Gökberk, B., Sankur, B., Akarun, L.: Bosphorus database for 3D face analysis. In: Schouten, B., Juul, N.C., Drygajlo, A., Tistarelli, M. (eds.) BioID 2008. LNCS, vol. 5372, pp. 47–56. Springer, Heidelberg (2008). doi:10.1007/978-3-540-89991-4_6
Zhang, L., Snavely, N., Curless, B., Seitz, S.M.: Spacetime faces: high-resolution capture for modeling and animation. In: Deng, Z., Neumann, U. (eds.) Data-Driven 3D Facial Animation, pp. 248–276. Springer, London (2008)
Min, R., Kose, N., Dugelay, J.L.: KinectFaceDB: a kinect database for face recognition. IEEE Trans. Syst. Man Cybern.: Syst. 44(11), 1534–1548 (2014)
Cao, C., Weng, Y., Zhou, S., Tong, Y., Zhou, K.: FaceWarehouse: a 3D facial expression database for visual computing. IEEE Trans. Vis. Comput. Graph. 20(3), 413–425 (2014)
University of York 3D face database. https://www-users.cs.york.ac.uk/nep/research/3Dface/tomh/3DFaceDatabase.html. Accessed 15 Apr 2016
3DMD head database. http://www.3dmd.com/. Accessed 15 Apr 2016
Smisek, J., Jancosek, M., Pajdla, T.: 3D with kinect. In: Fossati, A., Gall, J., Grabner, H., Ren, X., Konolige, K. (eds.) Consumer Depth Cameras for Computer Vision. Advances in Computer Vision and Pattern Recognition, pp. 3–25. Springer, London (2013)
Palais, B., Palais, R.: Euler’s fixed point theorem: the axis of a rotation. J. Fixed Point Theory Appl. 2(2), 215–220 (2007)
Trefethen, L.N., Bau III, D.: Numerical Linear Algebra, vol. 50. SIAM, Philadelphia (1997)
Zhu, C., Byrd, R.H., Lu, P., Nocedal, J.: Algorithm 778: L-BFGS-B: fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw. 23(4), 550–560 (1997)
SciPy: scientific Python. http://docs.scipy.org/doc/scipy/reference/index.html. Accessed 18 July 2016
Acknowledgement
This work has been partially supported by the Estonian Research Grant (PUT638) and the Spanish project TIN2013-43478-P.
© 2016 Springer International Publishing Switzerland
Lüsi, I., Escalera, S., Anbarjafari, G. (2016). SASE: RGB-Depth Database for Human Head Pose Estimation. In: Hua, G., Jégou, H. (eds.) Computer Vision – ECCV 2016 Workshops. LNCS, vol. 9915. Springer, Cham. https://doi.org/10.1007/978-3-319-49409-8_26
DOI: https://doi.org/10.1007/978-3-319-49409-8_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49408-1
Online ISBN: 978-3-319-49409-8