
1 Introduction

Facial landmark point detection, also known as face alignment, is an important component of many face-based applications, for example face recognition/verification [9], face animation [6], and expression recognition [2]. It has therefore been extensively studied [1, 4, 7, 15, 16, 22, 23, 28, 35, 36, 38, 41, 43]. However, diverse and severe changes in pose, occlusion, and illumination make facial landmark point detection an extremely challenging task in practice.

Existing facial landmark point detection methods can be roughly categorized into detection-based methods and regression-based methods. Detection-based methods depend on detecting each landmark point independently, which becomes difficult when multiple candidates (false positives) or no candidates (occlusion) are present; global shape constraints are then imposed to find an optimal configuration. In contrast, regression-based methods directly estimate the coordinates of all the facial points, and they can be further divided into two subcategories: methods that rely on an initial shape [23, 28] and methods that do not [31, 41]. In initial-shape-based methods, a mean shape obtained from the training set is used as the initialization, which is then gradually updated based on shape constraints and local context information. Thanks to advances in Deep Neural Networks (DNNs), researchers have started to use DNNs to map an input face image to the coordinates of landmark points in an end-to-end framework, so a mean shape is no longer needed. Such methods have achieved state-of-the-art performance.

Motivated by the successes of DNNs based facial landmark point detection, in this paper we propose a Bi-Level strategy for facial landmark point detection. Specifically, the first-level Convolutional Neural Networks (CNNs), termed Global CNNs, leverage the whole face to predict all landmark points simultaneously. Based on the prediction from the first-level CNNs, we sample patches around the predicted points and use the second-level CNNs to learn the mapping between these patches and the remaining displacement between the ground truth landmarks and those predicted by the Global CNNs. Since the second-level CNNs use local information, we term them Local CNNs. Moreover, as Fig. 1 shows, we found through experiments that using face images at different scales helps the prediction of different facial points. The detection accuracy of a given landmark point is related to whether a node in the last layer corresponds to a semantically meaningful region in the raw image for inferring the location of this point, and different points have different optimal resolutions. For the kernel sizes and network depth in our implementation, the \(96\times 96\) resolution is close to optimal for most points, while for some other points, say points 59–61, \(128\times 128\) is better, and for point 9, \(78\times 78\) is better. We hence propose to combine multiple CNNs for faces at different resolutions within a Multi-column CNNs (MCNNs) framework. Such a framework is applied to both the Global CNNs and the Local CNNs.

Fig. 1.

Performance comparisons of CNNs for faces at different resolutions (marked with different colors). Different landmark points clearly have different sensitivities to changes in face resolution. The error for each landmark point is calculated as the distance between the ground truth and the predicted coordinates, normalized by the width of the image.

The contributions of our framework can be summarized as follows: (i) Our work shows that the optimal resolutions of different points differ, so we propose a multi-column CNNs architecture that leverages faces at multiple resolutions for face alignment. (ii) Our bi-level architecture adopts a coarse-to-fine strategy for face alignment: we use the Global CNNs to roughly estimate all points and the Local CNNs to fine-tune the coordinates of all landmark points. (iii) We propose to predict the coordinates of all points simultaneously in the Local CNNs, which encodes the shape constraints among all facial points and thus further boosts face alignment accuracy.

2 Related Work

Among the many models proposed for facial landmark point detection, Active Appearance Model (AAM) based methods [12, 24] are very representative. AAM-like methods [12, 24] use Principal Component Analysis (PCA) to model the facial shape and texture, but the effectiveness of such global appearance models can be affected by severe pose variations as well as occlusion. Local models, including the Active Shape Model (ASM) [13, 16, 25] and the Constrained Local Model (CLM) [14], were then proposed to remedy the situation. These models use local features extracted from patches around the estimated landmark points to refine the estimates, together with certain imposed global shape constraints. Such methods can effectively handle local occlusion, but their accuracy can be affected by non-discriminative local patches/regions. Within the CLM framework, Saragih et al. [27] proposed a better optimization method for the problem in CLM, which achieves better performance than the original CLM. In [3], discriminative response map fitting with a constrained local model (DRMF) is proposed, which learns dictionaries of probability response maps and uses linear regression; this further improves CLM. Moreover, Xiong et al. [38] proposed a supervised descent method (SDM) to solve the nonlinear least squares optimization problem in facial landmark point detection. It is also worth noting that all these local models depend on the initial shape, so they may get trapped in a local minimum. To resolve this issue, Cao et al. [6] proposed to use multiple initializations, Burgos-Artizzu et al. [5] adopted a restart strategy, and Zhu et al. [42] proposed a coarse-to-fine shape search over a shape space. These methods further boost the performance of facial landmark point detection.

Recently, deep learning has demonstrated successes in many computer vision tasks, including image classification [20], face recognition [29, 32, 33], as well as facial landmark point detection [31, 41]. In [31], Sun et al. proposed to use Deep Convolutional Neural Networks (DCNNs) to predict five landmark points; multiple CNNs are then trained to further refine the coordinates of each point. But since the second stage refines each point separately, global shape constraints among all the landmarks are not considered. Zhang et al. [41] proposed coarse-to-fine auto-encoder networks (CFAN) for facial landmark point detection. They use auto-encoders to predict the landmark points on a low-resolution face first, and then use features extracted around all landmark points to further refine their coordinates. But their method uses hand-crafted features, which may not be optimal for predicting landmark points.

3 Our Method

We here propose a Bi-Level MCNNs framework for facial landmark point detection, in which the Global CNNs are used to estimate approximate locations of all the landmark points, and the Local CNNs are then used to correct the errors of the predicted landmark points with the help of local information.

3.1 Global Convolutional Neural Networks

Given an image \(x\in \mathbb {R}^{m\times n}\), we denote the coordinates of the p landmark points as \(S_{g}(x) \in \mathbb {R}^{2p}\). DNNs based facial landmark point detection then aims at learning a nonlinear function F that maps x to \(S_g(x)\), i.e., \( F: x \mapsto F(x) \approx S_g(x)\). To learn F, we need to solve the following problem:

$$F^* = \arg \min _F \Vert S_g(x) - F(x)\Vert . \qquad (1)$$

In this work, we adopt CNNs to model F in view of the following good properties of CNNs for facial landmark point detection. (i) In practice, faces come with different illumination, poses, expressions, and even occlusion, so the capacity of F should be high; the multi-layer nonlinearity of CNNs satisfies this requirement. (ii) In CNNs, each filter works as a detector, which enables CNNs to localize the landmark points. (iii) In a CNNs framework, all points are predicted simultaneously, which allows the model to implicitly learn and enforce certain global shape constraints. Therefore, such a model can handle difficult situations such as missing points caused by occlusion or pose variation. (iv) Compared with auto-encoder based models [41], CNNs share the filters over the whole image, so both the model complexity and the computational complexity are significantly reduced.

Fig. 2.

The architecture of the Global CNNs. The dashed line indicates the pretraining procedure.

On one hand, to learn the global shape of a face, it is desirable that a node in a top layer correspond to a semantically meaningful part of the face, say an eye or the nose, so the size of the region that such a node corresponds to should not be too large. On the other hand, to precisely locate the coordinates of all facial landmark points, we need filters that can represent fine details around each point, so the receptive fields of such filters cannot be too large and higher-resolution faces are preferred. To resolve this dilemma, we propose to use multiple CNNs whose inputs are the face image resized to different resolutions (from the full resolution to several downsized ones). We combine their output features (the FC1 features in Fig. 2-(a)) to estimate the landmark points with a fully connected layer. We term such a framework Multi-column CNNs (MCNNs) based facial landmark point detection.

The overall structure of the proposed MCNNs is shown in Fig. 2-(a); it consists of four columns. The CNN in each column contains 4 convolutional layers, 3 max-pooling layers, and one fully connected layer at the end. For each column, the sizes of the filters and the numbers of feature maps are all the same and are specified in Fig. 2-(b). Motivated by the success of pretraining in RBM models [18], we also pretrain the CNN in each column by mapping its FC1 features to the ground-truth coordinates of the landmark points. Such pretraining is necessary because it helps find a good initialization of the parameters and avoids the gradient vanishing phenomenon in training deep neural networks (please refer to Fig. 5-(e) for the effect of pretraining). We also show some feature maps of the first convolutional layer Conv1 in Fig. 3; these maps show that regions corresponding to the face shape have larger responses.
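For concreteness, below is a minimal PyTorch-style sketch of one such column (the actual implementation uses Caffe; the filter sizes, channel counts, and the adaptive pooling are illustrative placeholders, since the exact values are specified in Fig. 2-(b)).

```python
# Illustrative sketch of one Global CNN column: 4 convolutional layers,
# 3 max-pooling layers, and one fully connected layer (FC1).
import torch.nn as nn

class GlobalColumn(nn.Module):
    def __init__(self, fc1_dim=256):                           # fc1_dim: placeholder
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 5), nn.ReLU(), nn.MaxPool2d(2),   # Conv1 + pool1
            nn.Conv2d(16, 32, 3), nn.ReLU(), nn.MaxPool2d(2),  # Conv2 + pool2
            nn.Conv2d(32, 48, 3), nn.ReLU(), nn.MaxPool2d(2),  # Conv3 + pool3
            nn.Conv2d(48, 64, 3), nn.ReLU(),                   # Conv4
        )
        # Adaptive pooling is a sketch convenience so every column can share
        # this code regardless of its input resolution (96x96, 128x128, ...);
        # it is not part of the paper's architecture.
        self.pool = nn.AdaptiveAvgPool2d(4)
        self.fc1 = nn.Linear(64 * 4 * 4, fc1_dim)

    def forward(self, x):
        return self.fc1(self.pool(self.features(x)).flatten(1))
```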

Remarks on MCNNs: The concept of multi-column CNNs was proposed by Ciresan et al. [11]. In [11], the CNNs in different columns have exactly the same architecture, whereas in this paper the CNNs in different columns take faces at different resolutions as inputs. Further, in [11] the outputs of the CNNs in different columns are averaged, while we use one fully connected layer over all concatenated features, either to estimate the coordinates of the landmark points (in the Global CNNs) or to estimate the displacement between the ground truth and the estimate of the Global CNNs (in the Local CNNs). As shown in Fig. 5-(d), such a fully connected layer does improve the facial landmark detection accuracy, which demonstrates the effectiveness of our network architecture.
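The fusion step can be sketched as follows, reusing the hypothetical GlobalColumn from the sketch above: the FC1 features of all four columns are concatenated and mapped to the 2p coordinates by a single fully connected layer, instead of averaging per-column predictions as in [11].

```python
# Fusion across columns (sketch): concatenate FC1 features, then one FC layer.
import torch
import torch.nn as nn

p = 68                                    # number of landmark points
columns = nn.ModuleList([GlobalColumn() for _ in range(4)])
fuse = nn.Linear(4 * 256, 2 * p)          # single FC over concatenated FC1

def global_predict(faces_at_scales):      # four tensors, one per resolution
    feats = [col(x) for col, x in zip(columns, faces_at_scales)]
    return fuse(torch.cat(feats, dim=1))  # S1: rough estimate of all points

# The averaging alternative of [11] would instead give each column its own
# output layer and return the mean of the four predictions; Fig. 5-(d)
# indicates the FC fusion is the better choice here.
```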

Fig. 3.
figure 3

Some feature maps in the Conv1 layer of the Global CNNs. Images are from the LFPW dataset.

3.2 Local Convolutional Neural Networks

The Global CNNs take the whole image as input and obtain an approximate estimate of all landmark points. After the Global CNNs, we use local information around each facial landmark to predict the displacement between the ground truth \(S_g\) and the current prediction \(S_1\). We denote this displacement as \({\varDelta }S \doteq S_g - S_1\). One possible design for the Local CNNs is a different CNN for each landmark point [31]. But such a strategy is not scalable, because there would be too many networks to optimize when there are many landmark points (68 in the typical standard). Moreover, predicting each landmark point separately with such CNNs may not work well for missing/occluded facial landmarks, because no global shape constraints are imposed.

In this paper, we propose to model and train the Local CNNs for all points simultaneously. For each facial point, we take an \(l\times l\) patch around the prediction \(S_1\) of the Global CNNs. We then stack all p patches to get an \(l\times l\times p\) cuboid, which we use as the input to the Local CNNs. However, the best patch size for predicting the displacement between the ground truth and the previous prediction can differ across landmark points: for landmark points with larger displacements, a larger patch size should be used, while for landmark points with smaller displacements, smaller patches suffice. To deal with such variability, similar to the Global CNNs, we also adopt an MCNNs architecture for the Local CNNs. The overall structure of the Local CNNs is shown in Fig. 4-(a). Specifically, we sample the aforementioned patch cuboids with the same size from faces at different resolutions, based on the results of the Global CNNs, and use these cuboids as the inputs to the different CNNs in the Local MCNNs. In the Local CNNs, the CNN in each column contains only two convolutional layers, one max-pooling layer, and one fully connected layer at the end, as shown in Fig. 4-(b). The CNN in each column of the Local MCNNs is also pretrained in a similar manner to the global case. We then concatenate the output FC1 features and map them to \({\varDelta }S\)Footnote 1. We denote the mapping function learned by the Local CNNs as \(f_L(\cdot )\), and denote the patch cuboids for a face \(x_i\) as \(P_{x_i}\). Combining the results of the Global MCNNs and the Local MCNNs, the final coordinates of the facial landmarks are \(S_2(x_i)=S_1(x_i)+f_L(P_{x_i})\).
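Building the input cuboid can be sketched as follows (a minimal sketch: the patch size l, the interleaved (x, y) layout of \(S_1\), and the border clamping are illustrative assumptions).

```python
# Stack an l x l patch around each predicted point into an l x l x p cuboid.
import numpy as np

def patch_cuboid(image, S1, l=16):
    """image: grayscale face; S1: predicted coordinates (x1, y1, x2, y2, ...);
    l: patch size (hypothetical default)."""
    h, w = image.shape[:2]
    p = len(S1) // 2
    cuboid = np.zeros((l, l, p), dtype=image.dtype)
    for i in range(p):
        cx = int(round(S1[2 * i]))
        cy = int(round(S1[2 * i + 1]))
        x0 = min(max(cx - l // 2, 0), w - l)   # clamp so the patch fits
        y0 = min(max(cy - l // 2, 0), h - l)
        cuboid[:, :, i] = image[y0:y0 + l, x0:x0 + l]
    return cuboid

# The Local CNNs regress delta_S from such cuboids, and the final prediction
# combines both levels: S2 = S1 + f_L(patch cuboids).
```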

Fig. 4.

The architecture of the Local CNNs. The dashed line indicates the pretraining procedure.

3.3 Comparison with Other Approaches

Our work differs from existing DNN based facial landmark point detection frameworks, including CFAN [41] and DCNN [31], as well as from cascaded regression methods with multi-scale feature extraction and concatenation such as SDM [38], in the following aspects.

Comparison with CFAN. The coarse-to-fine prediction strategy is commonly used in facial landmark point detection methods, including CFAN, but our method differs from CFAN in the following aspects: (i) Our framework is based on CNNs while CFAN is based on auto-encoders. (ii) Our Local CNNs use local patches as input to further improve the accuracy, while CFAN uses hand-crafted features as input in the last few stages. (iii) Our framework contains only two levels, while CFAN cascades multiple stages. (iv) CFAN employs multi-scale processing in a serial way, and each stage only uses the face at one resolution; our method instead uses multi-scale processing in a parallel way, and each level uses faces at several different resolutions. Such a multi-column strategy takes full advantage of our finding that different landmark points have different optimal resolutions.

Comparison with DCNN. (i) We propose to use MCNNs to take advantage of faces at multiple resolutions for facial landmark point detection, while Sun et al. [31] only use faces at one resolution for alignment. (ii) DCNN uses separate CNNs to refine separate points, but such a strategy does not scale to facial landmark point detection with many points, say 68; meanwhile, it may distort the overall face shape because the global shape constraints among all facial points are not considered. We propose to use the same CNNs for all points, which implicitly models, learns, and encodes the constraints among all points, and also makes our method more scalable to facial landmark point detection with many landmark points.

Comparison with SDM. (i) Non-deep methods like SDM employ linear regression to learn the mapping from face images to facial landmarks, while deep models use nonlinear functions to regress the facial landmarks with a lower error rate. (ii) SDM uses the mean shape as initialization, which can be far from the ground truth. It is worth noting that even our Global CNNs alone already achieve almost the same performance as SDM on LFPW, as shown in Figs. 5-(c) and 7-(b).

4 Experiments

The implementation of our method is based on the CAFFE framework developed by Jia et al. [19]. We augment the training data through translation, rotation, and scaling operations to greatly enlarge the training sets. Such a data augmentation strategy helps avoid over-fitting and makes the trained model more robust to pose and scale variations in practical face images. We keep the same data augmentation strategy as [41] and generate 19 images for each face using random translation, rotation, and scaling parameters within a given range, so the whole training set is enlarged 20-fold for all datasets.
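A minimal sketch of this augmentation is given below; the parameter ranges are illustrative assumptions, as the exact ranges are not specified here.

```python
# One randomly jittered copy of a face: random translation, rotation, and
# scaling applied consistently to the image and its landmarks (shape: p x 2).
import cv2
import numpy as np

def augment(image, landmarks, rng=np.random):
    h, w = image.shape[:2]
    angle = rng.uniform(-15, 15)              # degrees (illustrative range)
    scale = rng.uniform(0.9, 1.1)             # illustrative range
    tx, ty = rng.uniform(-0.05, 0.05, 2) * (w, h)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    M[:, 2] += (tx, ty)                       # add the random translation
    warped = cv2.warpAffine(image, M, (w, h))
    pts = landmarks @ M[:, :2].T + M[:, 2]    # same affine map on coordinates
    return warped, pts

# Generating 19 such copies per face enlarges the training set 20-fold.
```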

4.1 Evaluation Protocol

We first use 5 commonly used image-based facial landmark point detection datasets for evaluation: XM2VTS [24], LFPW [4], HELEN [21], AFW [43], and iBug [26]. Images in LFPW, HELEN, AFW, and iBug are collected under uncontrolled conditions, while images in XM2VTS are collected in a lab environment. We then evaluate our method on a recently proposed video-based facial landmark point detection dataset, 300-VW [30].

For image-based facial landmark point detection, we directly use the face bounding boxes provided on the iBug websiteFootnote 2; the ground truth annotations of the 68 landmark points are provided by [26].

We compare our method with the following baseline methods: (1) DRMF [3], (2) SDM [38], (3) the work of Zhu et al. [43], (4) the work of Yu et al. [40], (5) CFAN [41], and (6) DCNN [31]. Among these baselines, CFAN and DCNN use DNNs and achieve state-of-the-art performance for facial landmark point detection. Following [3, 41], we train the model with 68 landmark points and evaluate it with 66 landmark points (the inner corners of the mouth are not used in the evaluation stage). The reason may be that the relative positions of the inner and outer corners of the mouth are stable, so the coordinates of the inner corners can be predicted from those of the outer corners. The 68 landmark points are shown in Fig. 1. It is worth noting that the comparisons of our method with DRMF, SDM, the work of Zhu et al., the work of Yu et al., and CFAN are based on 66 points, but our comparison with DCNN is based on 5 points (two eye centers, two outer mouth corners, and the nose tip) because DCNN is not scalable to many points.Footnote 3

As for the evaluation metric, we use the commonly used normalized root mean squared error (NRMSE) to measure the error between the predicted landmark points and the ground truth. For the ith (\(i=1,\ldots ,p\)) landmark point, let (\(x^i_g\), \(y^i_g\)) be the ground truth coordinates, (\(x^i_p\), \(y^i_p\)) the predicted coordinates, and d the distance between the two eye centers; then the NRMSE is calculated as \(\text {NRMSE} = \frac{1}{p}\sum _{i=1}^p\frac{\sqrt{(x^i_g-x^i_p)^2+(y^i_g-y^i_p)^2}}{d}\).

Here d removes the effect of different image sizes when comparing different methods. A cumulative distribution function (CDF) of the NRMSE is then used to evaluate the different methods.
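The metric is straightforward to compute; below is a direct transcription of the formula above (an illustrative helper, not from the paper).

```python
# NRMSE: mean per-point Euclidean error, normalized by the inter-ocular
# distance d (the distance between the two eye centers).
import numpy as np

def nrmse(pred, gt, d):
    """pred, gt: (p, 2) arrays of landmark coordinates."""
    per_point = np.linalg.norm(pred - gt, axis=1)  # error for each point
    return per_point.mean() / d
```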

4.2 Evaluation of Different Components in Bi-Level MCNNs

The following experiment is designed to measure the contribution of each component of our Bi-Level MCNNs framework. We follow the commonly used experimental setup on LFPW [41] for the model evaluation experiments. Specifically, the training set comprises the LFPW training set [4], HELEN [21], and AFW [43], and the test set is the LFPW test set [4]. All the following experiments follow the same setting unless a different setting is specified.

Single-Column CNNs vs. MCNNs. We show the performance difference between single-column CNNs and MCNNs for both the Global CNNs and the Local CNNs in Fig. 5-(a) and (b). It is clear that the single-column CNN performs worse than the MCNNs in both cases, because the MCNNs leverage faces at different resolutions for global shape estimation in the Global CNNs and leverage patches sampled at different scales for displacement prediction in the Local CNNs. The detection accuracy improvement of the MCNNs over the best single-column CNN is about 8.7% at an NRMSE of 0.05 in the Global CNNs, which is already very significant for facial landmark point detection.

Global MCNNs vs. Local MCNNs. The performance of the Global CNNs and the Local CNNs is shown in Fig. 5-(c). The figure clearly shows that the Local CNNs improve upon the Global CNNs, because local information is used to correct the prediction errors of the Global CNNs. Further, the CDF of the Global CNNs is 0.45 at an NRMSE of 0.1, which is significantly better than the mean shape, and this performance is already very close to the cumulative error distribution curve of SDM [38] on LFPW (please refer to Fig. 7-(a)). This further demonstrates the effectiveness of MCNNs.

Fig. 5.

The effect of different components in our MCNNs, evaluated on LFPW. (Best viewed in color)

Averaging vs. FC layer. We also compare the averaging strategy and the fully connected (FC) layer after the multi-column CNNs on LFPW; the results are shown in Fig. 5-(d). The figure clearly shows that our strategy achieves better accuracy than the averaging strategy of MDNN [11], which demonstrates the effectiveness of our network architecture.

Importance of Pretraining for MCNNs. Figure 5-(e) shows the results of MCNNs with and without pretraining. As there are many parameters in MCNNs, if the MCNNs are trained without pretraining, their performance is even worse than that of each single-column CNN within the MCNNs. The reason may be that with so many parameters to optimize, the deep architecture suffers from gradient vanishing, so the optimization easily gets stuck in a poor local minimum. Similar to the pretraining of deep belief networks [18], the pretraining of MCNNs may provide a good initialization that lets the whole network converge to a good solution, and hence helps facial landmark point detection.
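The two-phase training can be sketched as follows, reusing the hypothetical GlobalColumn from the sketch in Sect. 3.1 (the head size, learning rate, and epoch count are illustrative placeholders).

```python
# Phase 1 (sketch): pretrain each column by mapping its FC1 features to the
# ground-truth coordinates through a temporary head; Phase 2 then trains all
# columns plus the fusion layer jointly.
import torch.nn as nn
import torch.optim as optim

def pretrain_column(column, loader, p=68, epochs=10):
    head = nn.Linear(256, 2 * p)               # discarded after pretraining
    params = list(column.parameters()) + list(head.parameters())
    opt = optim.SGD(params, lr=0.01)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x, s_g in loader:                  # face at this column's scale, S_g
            opt.zero_grad()
            loss_fn(head(column(x)), s_g).backward()
            opt.step()
    return column                              # good initialization for the MCNNs
```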

\({\varDelta }S\) regression vs. \(S_g\) regression in Local CNNs. In our Local CNNs, we use MCNNs to map the raw patches to the displacement \({\varDelta }S\) between the ground truth and the current estimate of the Global CNNs. Besides the \({\varDelta }S\) regression, we also try to use the patches to directly estimate \(S_g\). The results of these two strategies are shown in Fig. 5-(f). We can see that the \({\varDelta }S\) regression achieves higher accuracy. Since each patch no longer contains any global information, it would be quite difficult to predict the absolute coordinates of the landmark points in the whole image using local patches only. Therefore displacement regression is the better choice for local refinement.

Bi-Level is sufficient. We show the NRMSE on the training set of LFPW for both the Global CNNs and the Local CNNs in Fig. 6. We can see that the training NRMSE of the Global CNNs is not very low, so the performance can be further improved by refinement. With the help of the Local CNNs, however, the training NRMSE is significantly reduced and there is almost no room left for further improvement. This suggests that the Bi-Level MCNNs structure already has enough modeling capacity to fit the training data, and no deeper architectures are needed.Footnote 4

Time costs. We test the running time of our method on LFPW. Our algorithm is implemented with Spyder (Python 2.7) on an NVIDIA GeForce GTX 980 GPU platform. We run our program 22 times and report the average running time per image. The average running time of the Global CNNs is 3.15 ms, and that of the Local CNNs is 1.24 ms; on a CPU, the Global and Local CNNs take 9.51 ms and 11.09 ms, respectively. Following [17], the reported times do not include cropping and scaling (which cost about 0.7 ms and 0.9 ms, respectively).

Fig. 6.

The cumulative error distribution curves of our method on the training set of LFPW.

Fig. 7.

Performance comparison of different methods on LFPW, HELEN, and iBug with 68 landmark points. (Best viewed in color)

4.3 Performance Comparison on Image Based Landmark Point Detection Datasets

The dataset descriptions and experimental setups for LFPW, HELEN, and iBug are as follows. (i) The Labeled Face Parts in the Wild (LFPW) dataset [4] consists of 1432 face images taken under wild conditions, divided into 1132 training images and 300 test images. The faces in this dataset show large variations in pose, illumination, expression, and partial occlusion, and the dataset is intended to test facial point detection under uncontrolled conditions. [4] provides only image URLs, and some image links are no longer available; therefore, following [41], we use the 811 training samples and 224 test samples provided on the iBug website for training and testing. The DRMF method uses the tree-based face detector to obtain more accurate face detection. (ii) The HELEN dataset [21] is collected under wild conditions and contains 2330 images with large variations in pose, lighting, occlusion, and expression. Following the evaluation convention on this dataset [41], the images from the HELEN training set, the LFPW training set, and AFW are used to train our models, and the 330 images of the HELEN test set are used to evaluate all methods. (iii) The iBug dataset was recently released for a facial landmark detection challenge and contains 135 images. Following [42], we use the images from the HELEN training set, the LFPW training set, and AFW as the training set.

The performance of different methods on these datasets is shown in Fig. 7. We can see that our method achieves the best performance on LFPW and HELEN. On the very challenging iBug dataset, our method achieves accuracy comparable with coarse-to-fine shape searching [42], a non-CNN based method, but with less running time (4.39 ms vs. 28 ms). Specifically, on LFPW our method performs best, even better than SDM, which benefits from its supervised descent solution. Furthermore, comparing this figure with Fig. 5-(a), we can see that the predictions of the Global CNNs alone already match the final results of CFAN. In this experiment, the testing data was collected in the lab and has similar pose, expression, and illumination, while the training data has significant variations in pose, expression, and illumination. Although the distributions of the training and test data are very different, our method still achieves very good performance, which indicates that it is robust to the out-of-database challenge. On LFPW, our method achieves a prediction accuracy up to about 20% better than CFAN when the NRMSE is smaller than 0.05, a rather significant improvement. On HELEN, CFAN performs best among all existing methods, whereas our method outperforms CFAN and achieves the best performance overall, which further demonstrates the effectiveness of our method.

Table 1. Average error (%) under the 49-point setting
Table 2. Average error (%) under the 68-point setting

We show the average error (%) over all points for coarse-to-fine shape searching (CFSS) [42], project-out cascaded regression (POCR) [34], and our work under two commonly used settings (49 and 68 points) in Tables 1 and 2. Our method outperforms both CFSS and POCR, which demonstrates its effectiveness.

Moreover, we also show the detected landmark points for some representative images from LFPW in Fig. 8. We can see that our model is robust to variations in occlusion, expression, and pose.

Fig. 8.

Some facial landmark point detection results on LFPW, HELEN, and iBug.

4.4 Comparisons on 300-VW Video Based Landmark Points Detection Challenge

We also evaluate our method on the ICCV 2015 300 Videos in the Wild (300-VW) Challenge [10, 30, 34] to show its generalization ability. This competition aims at evaluating the ability of different systems to fit unseen persons under different poses, expressions, illuminations, occlusions, and image qualities. The videos are separated into three categories: (i) category one contains faces recorded under good lighting conditions but with various head poses and occlusions; (ii) category two contains faces captured under different illuminations and expressions; (iii) category three contains videos captured under arbitrary conditions, including changes in illumination, occlusion, expression, pose, etc. We treat each frame as an independent image, without taking advantage of the relationships between consecutive frames.Footnote 5 Following the setup of 300-VW, we train one single model with all the training data and test on the different testing categories. The complete 300-VW dataset has been releasedFootnote 6. We requested the training/test split used in the 300-VW challenge from the competition organizers, so the comparison between our method and the results reported in the competition is fair.

The performance of different methods is shown in Fig. 9. In category one, our method achieves accuracy comparable with the work of Yang et al. [39], which achieved the best accuracy in this category in the challenge. In category two, our work still achieves performance similar to that of Yang et al. [39], but is less accurate than the work of Xiao et al. [37], which achieved the highest accuracy in category two in the challenge; it is worth noting that [37] uses temporal information to track the landmark points. By considering the relationships between facial landmark points in neighbouring frames, the performance of our method could likely be further boosted. Even though temporal information is not used in our work, in category three, where videos are recorded under arbitrary conditions, such as occlusions, large pose variations, and sudden changes in expression, our method greatly outperforms the works of Yang et al. [39] and Xiao et al. [37] and achieves the highest accuracy. The performance on 300-VW further demonstrates the effectiveness of our method for facial landmark point detection.

Fig. 9.

Performance comparisons of different methods on 300-VW challenge (Best viewed in color)

5 Conclusion

We propose a multi-column coarse-to-fine CNN framework for facial landmark point detection, and this strategy demonstrates good performance on publicly available datasets. In the future, we will apply the proposed method to video-based face alignment, i.e., use the output of the previous frame as a rough estimate for the next frame and use the Local CNNs to fine-tune the landmark points. Further, the proposed framework can also be applied to pose estimation.