
1 Introduction

Visualization is an important pillar of multimedia computing. The devices used for visualizing multimedia content can be broadly categorized into (i) non-wearable and (ii) wearable computing devices. Non-wearable technologies employ two-dimensional (2-D) and/or three-dimensional (3-D) displays for multimedia content rendering [13]. Three-dimensional technologies offer an additional dimension for visualizing the data flow and the interplay of programs in complex multimedia applications. In the past few years, however, the trend has been shifting from non-wearable visualization toward wearable visualization experiences [12, 15]. The wearable devices used for watching multimedia content are commonly known as virtual reality (VR) headsets or head-mounted displays (HMDs). VR headsets are becoming popular because of lower prices, a more immersive experience, higher quality, better screen resolution, lower latency and better control. According to statistics, around 6.7 million people used VR headsets in 2015, and this number was expected to grow to 43 million users in 2016 and 171 million users in 2018 [31]. These headsets, for instance the Oculus Rift, PlayStation VR, Gear VR, HTC Vive and Google Cardboard [7] (see Fig. 1), are being used in many research and industrial applications, for example medical simulation [21], gaming [8, 24], 3-D movie experiences and scientific visualization [23]. The integration of multimedia content inside VR headsets is an interesting trend in the development of more immersive experiences. The research community in wearable virtual reality has been developing various hardware and software solutions to address different issues related to virtual reality [18, 27]. One major issue with VR headsets is that they occlude half of the human face (the upper face region); reconstructing the full human face under this occlusion is the main contribution of this work.

Fig. 1.

Virtual reality headsets: (a) Oculus Rift, (b) Gear VR, (c) HTC Vive, (d) Google Cardboard.

Fig. 2.

Our wearable virtual reality setup: it consists of two RGB cameras along with a VR headset. The front camera captures the lower face region and the side camera captures the profile of the eye region.

We propose an optical sensor-based solution to address this issue. Our solution consists of a wearable camera setup in which two cameras are used along with a VR headset, as shown in Fig. 2. One camera faces the lower face region (lips region) and the other captures the side view of the eye region. The full face is reconstructed using the optical information from the lower face region and the partial eye region. We have used a Google Cardboard for our prototype, but the solution can be extended to any VR headset.

Reconstructing a full face while wearing a VR headset is not an easy task [6]. The only efforts so far are by Hao Li et al. [18] and Xavier et al. [3]. The former used an RGB-D camera and eight strain sensors to animate facial feature movements. They animate the facial movements through an avatar; however, they do not reconstruct the real face of the person wearing the VR headset. The latter created a realistic-looking 3D human face model, where a system is trained to learn the facial expressions of the user from the lower part of the face only and is then used to deform the model accordingly during testing. Compared to Hao Li et al., they produce a 3D face model of the person rather than a 3D animated avatar. However, they estimate the upper face information (eye region) solely from the lower face information (mouth region). The literature reports no direct correlation between the upper and lower face regions [2, 34], so estimating one from the other is questionable. Furthermore, their training model is limited to a few (e.g. six) discrete facial expressions, whereas human facial expressions are continuous and combine different expressions with intermediary emotions [10, 20, 35]. Hence, reconstructing a human face from only a few emotions can be problematic.

Considering the above-mentioned limitations, we revisit the same question (face reconstruction while wearing a VR headset) with a new approach in which both upper and lower face information is considered during the training and testing phases. We use an asymmetrical principal component analysis (aPCA) algorithm [26] to reconstruct the original face from lips and eye information. The lips information is used to estimate the lower facial expression and the eye information is used for the upper facial expression. The proposed approach is validated with qualitative and quantitative evaluations. To the best of our knowledge, we are among the first to consider this problem.

The rest of the paper is organized as follows. Section 2 presents related work on face reconstruction under occlusion. Section 3 describes the asymmetrical principal component analysis (aPCA) model, including the training and testing phases. In Sect. 4, a qualitative and quantitative analysis of the proposed approach is performed. Section 5 presents the discussion and limitations. The conclusion is presented in Sect. 6.

2 Background and Related Work

Full-face reconstruction of a person wearing a VR headset is required in many multimedia applications, most prominently in video teleconferencing. Face reconstruction while using a VR headset has been studied relatively little because the headset obstructs a significant portion of the human face. The works closest to our research are by Hao et al. [18] and Xavier et al. [3]. Hao et al. developed a real-time facial animation system that augments a VR display with strain gauges and a head-mounted RGB-D camera for facial performance capture in virtual reality. Xavier et al. recently proposed a solution that reconstructs a real human face by (i) building a 3D texture model of the person, (ii) building an expression shape model, (iii) projecting the 3D face model onto the occluded face and (iv) finally combining the 3D face model with the occluded test image/video. In the remainder of this section, we present previously developed optical techniques for face reconstruction.

Optical systems frequently confront occluded faces due to accessories such as scarves or sunglasses, hands on the face, objects that persons carry, and external sources that partially occlude the camera view. Different computer vision techniques have been developed to counter face occlusion problems [16]. Texture-based face reconstruction techniques first detect the occluded region of the face and then run a recovery process to reconstruct the original face [11, 19]. The recovery stage exploits prior knowledge of the face and the non-occluded part of the input image to restore the full face image. Furthermore, model-based techniques have been proposed that exploit both shape and texture models of the face to reconstruct the original face [22, 29]. These techniques extract the facial feature points of the input face, fit the face model, and detect the occluded region according to these facial feature points.

Principal component analysis (PCA) has been a fundamental technique for face reconstruction, for example simple-PCA [17], FR-PCA [28], Kernel-PCA [4] and FW-PCA [9]. These methods train on non-occluded face images, creating an eigenspace from full-image pixel intensities and/or selected samples of pixel intensities. During the testing phase, the occluded image is mapped to the eigenspace and the principal component coefficients are restored by iteratively minimizing the error between the original image and the image reconstructed from the eigenspace. The above-mentioned PCA-based methods are successful as long as the occluded region remains small relative to the visible face region. However, the main challenge in our work is that the VR headset occludes a significant portion of the user's face (nearly the whole upper face), preventing effective face reconstruction with traditional PCA techniques. Hence, an alternative method is needed.
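
To make the general idea concrete, the following is a minimal, generic sketch of such an iterative PCA-based recovery; it is an illustration rather than the implementation of any specific method cited above, and the matrix shapes, mask convention and fixed iteration count are our own assumptions.

```python
# Illustrative sketch of iterative PCA-based recovery of an occluded face.
import numpy as np

def train_pca(faces, k):
    """faces: (N, D) matrix of vectorized non-occluded training faces."""
    mean = faces.mean(axis=0)
    centered = faces - mean
    # Eigenfaces via SVD of the centered training data
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]                      # mean (D,), basis (k, D)

def reconstruct(occluded_face, mask, mean, basis, n_iter=50):
    """mask: boolean vector, True for visible pixels."""
    estimate = occluded_face.astype(np.float64)
    estimate[~mask] = mean[~mask]            # initialize hidden pixels with the mean
    for _ in range(n_iter):
        coeffs = basis @ (estimate - mean)   # project onto the eigenspace
        recon = mean + basis.T @ coeffs      # back-project to image space
        estimate[~mask] = recon[~mask]       # keep observed pixels fixed
    return estimate
```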

3 Extended Asymmetrical Principal Component Analysis (aPCA)

Asymmetrical principal component analysis (aPCA) has previously been used for video encoding and decoding [30]. In this work, we extend aPCA to full-face reconstruction for VR applications. Our algorithm consists of two phases: (i) a training phase and (ii) a testing phase. In the training phase, we build an aPCA model from one full-frame and two half-frame video sequences. Here, the full frame refers to the full face of a person without the VR headset, and the half frames refer to the lips and eye regions. A person-specific aPCA training model is built by synchronously recording the full-frame and half-frame videos. The training model consists of mean faces and eigenspaces for the full and half frames. The full-frame eigenspace is created using the eigenvectors obtained from the half frames. The components spanning this space are called pseudo principal components, and this space has the same size as a full frame. A user-specific training model is constructed for each individual, who is asked to perform certain facial expressions during the training session. During the testing phase, the person wears a VR headset along with the wearable camera setup. A short calibration step aligns the test half frames with the trained half frames. When new half frames are presented to the trained model, their weights are found by projecting them onto the trained half-frame eigenspace. These weights, together with the full-frame mean face and the full-frame eigenspace, are used to reconstruct the original face with all facial deformations (most prominently eye and mouth deformations).

3.1 Training Phase

A user-specific training model is built from three synchronous video sequences. Our wearable training setup is shown in Fig. 3(a). It consists of two cameras, denoted Fc and Sc. Fc captures the full face and the lips region of a person, as shown in Fig. 4a and c, respectively. Sc captures the side view of an eye, as shown in Fig. 4b. Let ff, \(hf_{l}\) and \(hf_{e}\) denote the full-face, lips-region and eye-region information, respectively. The full-frame (ff) training is performed by exploiting information from the half frames (\(hf_{l}\) and \(hf_{e}\)).

Fig. 3.

Our setup: on the left, the training setup, where two cameras capture the full face, eye region and lips region; on the right, the testing setup, where the VR headset is attached to the wearable setup and two cameras capture the eye region and lips region.

Fig. 4.

A sample frame output of (a) full face, (b) eye region and (c) lips region.

Let \(I_{e}\) and \(I_{l}\) be the intensity values of the eye and lips regions, respectively. The combined intensity matrix \(I_{hf}\) is denoted by,

$$\begin{aligned} I_{hf}= [I_{e} \quad I_{l}]\, \end{aligned}$$
(1)

The mean \(I_{hfo}\) is calculated as,

$$\begin{aligned} I_{hfo}= \frac{1}{N} \sum _{n=1}^{N} I_{hf_{(n)}}\, \end{aligned}$$
(2)

where N is the total number of training frames. The mean is then subtracted from each frame in the training data, ensuring that the data is zero-centered.

$$\begin{aligned} \hat{I}_{hf}= I_{hf}- I_{hfo}\, \end{aligned}$$
(3)

Mathematically, PCA is an optimal transformation of the input data in the least-squares error sense. Due to space constraints, we direct readers to [26, 30] for more details. To this end, we need to find the eigenvectors of the covariance matrix (\(\hat{I}_{hf}\) \(\hat{I}_{hf}^{T}\)). This can be done with singular value decomposition (SVD) [33]. SVD is a factorization method that decomposes a matrix into the product of three matrices,

$$\begin{aligned} \hat{I}_{hf}= U\Sigma V^T\, \end{aligned}$$
(4)

where V = [\(b_{1}, b_{2}\,...\,b_{N}\)] is the matrix of eigenvectors and \(b_{n}\) denotes the n-th eigenvector. The eigenspace for the half frame, \(\phi _{hf}\) = [\(\phi _{hf}^{1}\) ... \(\phi _{hf}^{N}\)], is constructed by combining V with \(\hat{I}_{hf}\),

$$\begin{aligned} \phi _{hf}= \sum _{i} {b}_{i} \hat{I}_{hf_{(i)}}\, \end{aligned}$$
(5)

The half-frame coefficients (\(\alpha _{hf}\)) of the training frames are calculated as,

$$\begin{aligned} \alpha _{hf}= \phi _{hf}\,(I_{hf} - I_{hfo})^{T}\, \end{aligned}$$
(6)

Similarly, for the full-frame intensity values \(I_{ff}\), we follow Eqs. 2 and 3 to get,

$$\begin{aligned} I_{ffo}= & {} \frac{1}{N} \sum _{n=1}^{N} I_{ff_{(n)}}\, \end{aligned}$$
(7)
$$\begin{aligned} \hat{I}_{ff}= & {} I_{ff} - I_{ffo}\, \end{aligned}$$
(8)

The eigenspace for the full frame, \(\phi _{ff}\) = [\(\phi _{ff}^{1}\,\)...\(\,\phi _{ff}^{N}\)], is constructed by combining the half-frame eigenvectors V = [\(b_{1}\), \(b_{2}\) ... \(b_{N}\)] with \(\hat{I}_{ff}\). This eigenspace is thus spanned by half-frame components. The components spanning this space are called pseudo principal components, since they are not principal components of the full-frame data itself. This space has the same size as a full frame.

$$\begin{aligned} \phi _{ff}= \sum _{i} {b}_{i} \hat{I}_{ff_{(i)}}\, \end{aligned}$$
(9)

The full- and half-frame eigenspaces and mean intensity values are saved and used in the online testing session.
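
For clarity, the training phase (Eqs. 1–9) can be summarized in a few lines of NumPy. The sketch below stores frames as columns; the function name, data layout and returned dictionary are our own conventions and not part of the original implementation.

```python
# A minimal NumPy sketch of the aPCA training phase (Eqs. 1-9).
import numpy as np

def train_apca(I_e, I_l, I_ff):
    """I_e: (d_e, N), I_l: (d_l, N) half-frame intensities (columns = frames).
    I_ff: (D, N) full-frame intensities. N = number of training frames."""
    I_hf = np.vstack([I_e, I_l])                  # Eq. 1: combined half frames
    I_hfo = I_hf.mean(axis=1, keepdims=True)      # Eq. 2: half-frame mean
    I_hat_hf = I_hf - I_hfo                       # Eq. 3: zero-centering

    # Eq. 4: SVD of the centered half frames; columns of Vt.T are b_1..b_N
    _, _, Vt = np.linalg.svd(I_hat_hf, full_matrices=False)

    phi_hf = I_hat_hf @ Vt.T                      # Eq. 5: half-frame eigenspace
    alpha_hf = phi_hf.T @ (I_hf - I_hfo)          # Eq. 6: training coefficients

    I_ffo = I_ff.mean(axis=1, keepdims=True)      # Eq. 7: full-frame mean
    I_hat_ff = I_ff - I_ffo                       # Eq. 8
    phi_ff = I_hat_ff @ Vt.T                      # Eq. 9: pseudo principal
                                                  #        components (same V)
    return dict(I_hfo=I_hfo, phi_hf=phi_hf, alpha_hf=alpha_hf,
                I_ffo=I_ffo, phi_ff=phi_ff)
```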

3.2 Testing Phase

A VR display is attached to the wearable camera setup as shown in Fig. 3(b). The full frame (ff) is no longer available during the testing phase. The testing phase is sub-divided into a calibration phase and a reconstruction phase.

Calibration Phase: At the start of the testing phase, a manual calibration step is performed. The camera positions during the training phase can be inconsistent with the camera positions during the testing phase. To align the test half frames with the trained half frames, we propose the following calibration step.

The mean half frames from the training phase and the first half frames from the testing phase are used for calibration. Figure 5 (top row) shows half frames from the training phase and Fig. 5 (bottom row) shows half frames from the testing phase. We manually determine the feature parameters for both trained and test frames using lips and eye feature points. Let

$$\begin{aligned} P_{l}= [P_{l}^{1} \quad P_{l}^{2} \quad P_{l}^{3} \quad P_{l}^{4}]\, \end{aligned}$$
(10)
Fig. 5.

The half frames: the top row contains frames from the training phase and the bottom row contains frames from the testing phase.

Fig. 6.

Top row: lips geometry, bottom row: eye geometry. (a) Four feature points, (b) width, height and center, (c) geometry used for angle calculation.

$$\begin{aligned} w_{l}= & {} \sqrt{(P_{l_{x}}^{2} - P_{l_{x}}^{1})^{2} + (P_{l_{y}}^{2} - P_{l_{y}}^{1})^{2}}\,\nonumber \\ h_{l}= & {} \sqrt{(P_{l_{x}}^{4} - P_{l_{x}}^{3})^{2} + (P_{l_{y}}^{4} - P_{l_{y}}^{3})^{2}}\,. \end{aligned}$$
(11)

The center coordinates \(c_{l}\) are calculated as,

$$\begin{aligned} c_{l}= [w_{l}, \ h_{l}]\, \end{aligned}$$
(12)

\(w_{l}\), \(h_{l}\) and \(c_{l}\) are shown graphically in Fig. 6b (top row). The in-plane rotation angle \(\theta _{l}\) is calculated as,

$$\begin{aligned} \theta _{l} = \arctan \left( \frac{P_{l_{y}}^{2} - P_{l_{y}}^{1}}{P_{l_{x}}^{2} - P_{l_{x}}^{1}}\right) \end{aligned}$$
(13)

The angle \(\theta _{l}\) is shown graphically in Fig. 6c (top row). The scale \(s_{l}\), rotation \(R_{l}\) and translation \(t_{l}\) between the trained and test lips frames are calculated as,

$$\begin{aligned} s_{l}= & {} [ w_{l}^{test} / w_{l}^{train} ]\,\nonumber \\ R_{l}= & {} [ \theta _{l}^{test} - \theta _{l}^{train} ]\,\nonumber \\ t_{l}= & {} [ c_{l}^{test} - c_{l}^{train} ]\,. \end{aligned}$$
(14)

A similar procedure is applied to calculate the scale \(s_{e}\), rotation \(R_{e}\) and translation \(t_{e}\) between the trained and test eye frames,

$$\begin{aligned} s_{e}= & {} [ w_{e}^{test} / w_{e}^{train} ]\,\nonumber \\ R_{e}= & {} [ \theta _{e}^{test} - \theta _{e}^{train} ]\,\nonumber \\ t_{e}= & {} [ c_{e}^{test} - c_{e}^{train} ]\,. \end{aligned}$$
(15)
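
A minimal sketch of this calibration step for one region (lips or eye) is given below, assuming the four manually marked feature points are available as (x, y) pairs; np.arctan2 is used in place of the arctan in Eq. 13 for numerical robustness, and the function names are our own.

```python
# Sketch of the manual calibration step (Eqs. 10-15) for one facial region.
import numpy as np

def region_geometry(P):
    """P: (4, 2) array of the four manually marked feature points."""
    w = np.hypot(P[1, 0] - P[0, 0], P[1, 1] - P[0, 1])        # Eq. 11: width
    h = np.hypot(P[3, 0] - P[2, 0], P[3, 1] - P[2, 1])        # Eq. 11: height
    c = np.array([w, h])                                       # Eq. 12: center
    theta = np.arctan2(P[1, 1] - P[0, 1], P[1, 0] - P[0, 0])   # Eq. 13: angle
    return w, c, theta

def calibrate(P_train, P_test):
    """Scale, rotation and translation between trained and test frames."""
    w_tr, c_tr, th_tr = region_geometry(P_train)
    w_te, c_te, th_te = region_geometry(P_test)
    s = w_te / w_tr          # Eq. 14/15: scale
    R = th_te - th_tr        # Eq. 14/15: in-plane rotation (radians)
    t = c_te - c_tr          # Eq. 14/15: translation
    return s, R, t
```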

Reconstruction Phase: During the testing phase, only half-frame information is available (see Fig. 5, bottom row). This half-frame information, together with the model from the training phase, is used to reconstruct the original face of the person with the corresponding facial deformations. The first step is to adjust the test half frames according to the calibration step. Let \(I_{hfe}\) and \(I_{hfl}\) be the test half frames; then \(\bar{I}_{hfe}\) and \(\bar{I}_{hfl}\) are calculated as,

$$\begin{aligned} \bar{I}_{hfe}= & {} s_{e}\,R_{e}\,{I}_{hfe} + t_{e}\,\nonumber \\ \bar{I}_{hfl}= & {} s_{l}\,R_{l}\,{I}_{hfl} + t_{l}\,. \end{aligned}$$
(16)

The mean half frame (Eq. 2) is then subtracted from the calibrated test half frames (\(\bar{I}_{hf}\) = [\(\bar{I}_{hfe}\)   \(\bar{I}_{hfl}\)]),

$$\begin{aligned} I_{hf}^{t} = \bar{I}_{hf} - I_{hfo}\, \end{aligned}$$
(17)

where the superscript t denotes the testing phase. The coefficients (\(\alpha _{hf}^{t}\)) are calculated using the half-frame eigenspace \(\phi _{hf}\) (Eq. 5),

$$\begin{aligned} \alpha _{hf}^{t}= \phi _{hf}\,(\bar{I}_{hf} - I_{hfo})^{T}\, \end{aligned}$$
(18)

The entire frame containing the full-face information is constructed using the following equation,

$$\begin{aligned} I= I_{ffo} + \sum _{n=1}^{M} \alpha _{hf}^{t}\,\phi _{ff_{(n)}}\, \end{aligned}$$
(19)

where \(I_{ffo}\) is taken from Eq. 7, \(\alpha _{hf}^{t}\) from Eq. 18 and \(\phi _{ff}\) from Eq. 9. Here, M is the number of most significant principal components (eigenimages) used for reconstruction (\(M < N\)), and N is the total number of frames available for training. In our experiments, N is around 1000 frames and M is 25. Qualitative and quantitative results are presented in the evaluation section.
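
A minimal sketch of this reconstruction step (Eqs. 17–19) is shown below. It reuses the dictionary produced by the training sketch in Sect. 3.1 and assumes the geometric alignment of Eq. 16 has already been applied to the vectorized test half frame; the data layout (pixels × frames) follows the same convention as that sketch.

```python
# Sketch of full-frame reconstruction from one calibrated test half frame.
import numpy as np

def reconstruct_full_frame(I_bar_hf, model, M=25):
    """I_bar_hf: (d_hf, 1) calibrated test half frame.
    model: dict with I_hfo, phi_hf, I_ffo, phi_ff from training."""
    I_hf_t = I_bar_hf - model["I_hfo"]                 # Eq. 17
    alpha_t = model["phi_hf"].T @ I_hf_t               # Eq. 18: test coefficients
    # Eq. 19: full frame from the M most significant pseudo components
    I = model["I_ffo"] + model["phi_ff"][:, :M] @ alpha_t[:M]
    return I
```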

4 Evaluation

We have performed an experiment with five subjects. For each subject, a training session of two minutes is conducted using our wearable setup, as shown in Fig. 3(a). Each participant is asked to perform different (but natural) facial expressions, e.g. neutral, happy, sad, eye blink, etc. Three synchronous video sequences of the eye, mouth and full-face regions are recorded, as shown in Fig. 4. An offline training (Sect. 3.1) is performed for each individual using the recorded sequences. During the testing session, the person wears a VR headset along with our wearable setup, as shown in Fig. 3(b). The full face of the participant is reconstructed from the two half frames following Sect. 3.2. Qualitative and quantitative analyses are performed on the proposed approach: the qualitative analysis assesses the visual reconstruction quality of the human face, whereas the quantitative analysis measures the accuracy of the proposed approach.

4.1 Qualitative Analysis

75% of the data acquired during the training session is used for training and the remaining 25% is used for validation and testing. Note that a full analysis is not possible on the real test data, as the upper face region is completely occluded by the VR display. The data is qualitatively analyzed in three scenarios: (i) only the mouth information is used as a half frame during training (similar to [3]), (ii) only the eye information is used as a half frame during training, and (iii) both eye and mouth information are used as half frames during training. Figure 7 shows qualitative results on three users for the three scenarios. The left-most face is the original face, the second is the face reconstructed from mouth information (scenario i), the third is the face reconstructed from eye information (scenario ii), and the last is the face reconstructed from both mouth and eye information (scenario iii). The results clearly show that facial mimics do not depend on the mouth area alone; the facial mimics of a person are modelled more accurately by modeling both the mouth and eye regions. The qualitative results on the test data are shown in Fig. 8.

Fig. 7.

From left to right: (i) Original frame. (ii) Reconstructed frame from mouth information. (iii) Reconstructed frame from eye information. (iv) Reconstructed frame from eye and mouth information.

Fig. 8.

Reconstruction results from test data.

4.2 Quantitative Analysis

We have performed two types of quantitative analysis:

  1. Shape-based quantitative analysis.
  2. Appearance-based quantitative analysis.

In the shape-based quantitative analysis, we compare the differences between the original facial feature points and the reconstructed facial feature points. We use a constrained local model (CLM) [5] to capture these facial feature points. Figure 9 shows the CLM fitted to the human face; the left side of Fig. 9 shows the original face frame and the right side shows the reconstructed face frame. The shape-based analysis is performed on the 25% validation data of each individual and the results are presented in Table 1, computed according to the following equation.

$$\begin{aligned} d_{shape}= \frac{1}{F\,N} \sum _{f=1}^{F} \sum _{n=1}^{N} \left\| s_{n}^{f} - {\overline{s}}_{n}^{f}\right\| \, \end{aligned}$$
(20)

where s, \({\overline{s}}\), N and F refer to the spatial locations of the original facial feature points, the spatial locations of the reconstructed facial feature points, the number of facial feature points and the number of frames, respectively. For this work, N = 66 and F = 250500. The shape analysis shows a small difference between the facial feature points of the original and reconstructed faces, with an average difference of 1.5327 pixels.
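
As a worked illustration, the sketch below computes the shape error as the mean Euclidean distance between original and reconstructed CLM landmarks over all points and frames, consistent with the quantities defined above; the array names and shapes are our own assumptions.

```python
# Sketch of the shape-based error: mean landmark distance in pixels.
import numpy as np

def shape_error(s, s_bar):
    """s, s_bar: (F, N, 2) arrays of landmark coordinates
    for F frames and N = 66 feature points, each as (x, y)."""
    per_point = np.linalg.norm(s - s_bar, axis=2)   # (F, N) distances in pixels
    return per_point.mean()                          # average over points, frames
```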

Fig. 9.

Constrained local model (CLM) on a human face; left: original face, right: reconstructed face.

Fig. 10.

VR plus Embodied telepresence based video teleconferencing scenario.

Table 1. Shape based quantitative analysis.
Table 2. Appearance based quantitative analysis.

In the appearance-based quantitative analysis, we compare the intensity differences between the reconstructed face and the original test face. We use the mouth region for comparison, as the upper face region is occluded during real testing. Appearance quality is measured through the peak signal-to-noise ratio (PSNR). This ratio depends on the mean square error (\(mse_{app}\)) between the original and reconstructed faces. \(mse_{app}\) and PSNR are calculated according to:

$$\begin{aligned} mse_{app} = \sum _{j=1}^{h*v} \frac{(I_{j} - \bar{I}_{j})^2}{h*v}\, \end{aligned}$$
(21)

where h and v are the horizontal and vertical resolutions of the frames, respectively, \(I_{j}\) is the original test face and \(\bar{I}_{j}\) is the reconstructed face.

$$\begin{aligned} PSNR = 10 \cdot \log _{10} \left( \frac{(255)^2}{mse_{app}}\right) \, \end{aligned}$$
(22)

where 255 is the maximum pixel intensity value. The results are presented in Table 2. A higher PSNR value indicates a smaller intensity difference between the original and reconstructed faces.
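
A short sketch of this appearance metric (Eqs. 21–22) on a pair of mouth-region frames is given below; the variable names are assumptions.

```python
# Sketch of the appearance metric: MSE and PSNR between two frames.
import numpy as np

def psnr(original, reconstructed):
    """original, reconstructed: (h, v) uint8 frames of the mouth region."""
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    mse_app = np.mean(diff ** 2)                      # Eq. 21
    return 10.0 * np.log10(255.0 ** 2 / mse_app)      # Eq. 22
```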

5 Discussion

The qualitative and quantitative results show the validity of the proposed aPCA-based approach. The qualitative results are presented in the form of reconstructed face frames, whereas the quantitative results yield high PSNR values and small differences in facial feature point locations. PSNR values greater than 40 are considered good [25], and our experiments yield PSNR values between 48 and 75. Similarly, the shape-based analysis gives good results, with a maximum difference of 2.34 pixels.

We plan to use the reconstructed face of the wearer in a virtual reality (VR) based video teleconferencing application. We will use our embodied telepresence agent (ETA) [14] together with a VR headset for teleconferencing. The application scenario is shown in Fig. 10 and will be considered in future work. In this work, we have used two half frames for face reconstruction; the approach could be simplified by using only one half frame, at some cost in reconstruction quality.

We have cut out a portion of the cardboard for the eye camera. The eye camera is mounted such that it does not significantly affect the virtual reality experience. The eye camera is mounted externally in this work, but the approach can be extended by integrating a small camera inside the VR display, as in other works such as [1, 27]. Furthermore, we have developed our own wearable setup for the cameras. Initially, however, we mounted the two cameras directly on the Google Cardboard for testing purposes, as shown in Fig. 11. There were two issues with this setup: (i) the increase in weight and (ii) normalization issues between the training and testing video sequences. These issues will be considered in future work.

The current version of the work uses a manual calibration step. In future work, we plan to automate the calibration step by developing a feature point localization technique. Furthermore, a known problem with PCA is that it is very sensitive to lighting. To counter this problem, we are working on using edge feature information [32]. The half-frame image will be converted to an edge map using a Sobel filter (see Fig. 12), and the magnitude values of the edge image will be used to train the full face. During the testing phase, the half-frame edge map will then be used to reconstruct the full face of the person. This work is in progress and will be considered in a future publication. A minimal sketch of this preprocessing step is shown below.
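
The sketch assumes OpenCV for the Sobel filtering (any gradient operator would serve) and illustrates only the edge-map extraction, not the retrained aPCA model.

```python
# Sketch of the planned edge-based preprocessing: Sobel gradient-magnitude map.
import cv2
import numpy as np

def edge_map(half_frame_bgr):
    gray = cv2.cvtColor(half_frame_bgr, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)    # horizontal gradient
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)    # vertical gradient
    mag = np.sqrt(gx ** 2 + gy ** 2)                   # gradient magnitude
    return cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
```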

Fig. 11.

Modified Google Cardboard setup.

Fig. 12.

From left to right: (a) the RGB half frame, (b) the edge image of the half frame.

6 Conclusion

We have proposed a novel technique for face reconstruction when the face is occluded by a virtual reality (VR) headset. Full-face reconstruction is based on an asymmetrical principal component analysis (aPCA) framework, which exploits lips and eye appearance information. We estimate the upper facial expressions from partial eye information and the lower facial expressions from the lips information. The current version uses appearance information for face modeling; in future work, we plan to use a shape-based (or feature-based) technique for full-face reconstruction.