
1 Introduction

Facial expression is one of the most important modes of nonverbal communication between people. The human face is highly expressive: it can convey emotions such as happiness, anger, sadness, surprise, fear and disgust without a single word. The analysis of facial expressions has therefore been an attractive topic in computer vision, with a wide spectrum of potential applications including behavioural analysis, psychology and human-computer interaction. The process can be divided into three main tasks, i.e. face detection, feature extraction and classification. For many years, the analysis was carried out using hand-crafted low-level descriptors, which are appearance based or geometry based [8]. These methods struggle with challenges such as partial occlusion of facial regions, illumination variations and head pose variations. Advances in the deep learning domain address these issues [8]: deep models such as CNNs and RNNs have been used for both feature extraction and classification. Researchers continue to work on improving the results of these algorithms. Yu et al. [21] fused CNNs by learning a set of weights over the network responses. Zhao et al. [23] advocated the use of deep belief networks (DBNs) to learn features automatically, with recognition performed by a multilayer perceptron (MLP). Using two different types of CNN, Jung et al. [6] extracted temporal appearance features and temporal geometry features from image sequences. Song et al. [18] used 3D-CNNs for the 3D object detection task. In this paper, we build on this understanding of CNNs to develop a novel architecture that improves classification accuracy. Preprocessed images are fed into two channels, while normal (unprocessed) images are fed through the third channel.
After each CNN channel has extracted its features, we concatenate the fully connected layers of the three channels. In this way we combine the features extracted by the different CNN pipelines and train them jointly, which provides a significant boost to performance since richer information about the images is available. We have applied the algorithm to widely used benchmark FER datasets: JAFFE [12], CK+ [11] and the Oulu-CASIA database [22].

2 Proposed Method

The proposed methodology is divided into two subsections. The first explains the preprocessing techniques, while the second elaborates the feature extraction and classification methods. Figure 1 provides an in-depth view of the proposed approach, which is discussed in the following subsections.

Fig. 1.
figure 1

The figure shows the design of our proposed model. The input images are preprocessed and fed into the multi-channel CNN model, as shown.

2.1 Preprocessing

Preprocessing plays an important role, since most of the available datasets contain a lot of background data, which reduces the efficiency of expression recognition. Moreover, focusing only on the face and removing unrelated information from the images leads to a better training process for the classifier. Hence we crop the frontal faces from the images and enhance their content by applying beneficial filter combinations.

We begin by cropping the images. For face detection, Haar-cascade detection [3] is used. In this method, a cascade of classifiers is applied to every image at a particular window size, and a window that passes all stages of the cascade is concluded to contain a face region. After the images are cropped, image-processing filters are applied to remove noise and reduce unnecessary information in the images. These are as follows:

A. Median filtering: In median filtering [13], for each pixel the neighbourhood pixels are located, their values are sorted in ascending order, and the value of the pixel under consideration is replaced by the computed median. This process is repeated for every pixel of the image.
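The step above can be sketched in pure NumPy (the 3×3 neighbourhood size and reflect-padding at the borders are assumptions, since the paper does not state them):

```python
import numpy as np

def median_filter(img, k=3):
    """Replace each pixel by the median of its k x k neighbourhood.

    Pure-NumPy sketch of the median filtering step described above;
    borders are handled by reflect-padding, and k is assumed odd.
    """
    pad = k // 2
    padded = np.pad(img, pad, mode="reflect")
    out = np.empty_like(img)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.median(padded[i:i + k, j:j + k])
    return out

# A single impulse-noise pixel in a flat region is removed entirely:
noisy = np.zeros((5, 5))
noisy[2, 2] = 255.0
print(median_filter(noisy)[2, 2])  # -> 0.0
```

This illustrates why median filtering suits salt-and-pepper-style noise: the outlier value never appears in the output, unlike with mean filtering.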

B. Gaussian smoothing filter: Gaussian smoothing filters [4] are also used in our proposed model for noise reduction. Unlike median filtering, Gaussian smoothing is a linear method. For 2D images it has the isotropic (circularly symmetric) form:

$$\begin{aligned} G(x,y)=\frac{1}{2\pi \sigma ^2}e^{-\frac{x^2+y^2}{2\sigma ^2}} \end{aligned}$$

where \(\sigma \) controls the degree of smoothing. The method is a weighted averaging over each pixel's neighbourhood, giving more weight to the central pixels than to the farther neighbourhood pixels. Applying a Gaussian smoothing filter yields an image with gentle smoothing and preserved edges. The standard deviation for Gaussian smoothing is set to 0.025, determined experimentally.
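A minimal NumPy sketch of this weighted averaging follows; the 5×5 kernel size and reflect-padding are assumptions (note the paper's \(\sigma = 0.025\) suggests coordinates normalised to a unit range, whereas this sketch uses pixel units):

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Discrete approximation of the isotropic 2-D Gaussian G(x, y)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    return k / k.sum()              # normalise so the weights sum to 1

def gaussian_smooth(img, size=5, sigma=1.0):
    """Weighted averaging: convolve the image with the Gaussian kernel.

    Central pixels of each neighbourhood receive the largest weights,
    matching the description in the text.
    """
    pad = size // 2
    padded = np.pad(img, pad, mode="reflect")
    kernel = gaussian_kernel(size, sigma)
    out = np.empty(img.shape, dtype=float)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + size, j:j + size] * kernel)
    return out
```

Because the kernel is normalised, a flat region passes through unchanged while isolated fluctuations are averaged away.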

C. Sobel filter: Edges are regions in an image where there is a sharp change in pixel brightness. Sobel filters [7] are used to detect the nature of this change in intensity and thereby decide whether a region is an edge or not. Using this edge-detection method, we convert the image to a binary image with 0.5 as the threshold value. The threshold, obtained by experimentation, is chosen such that all edges not stronger than the threshold are ignored. For each pixel, gradient approximations are computed by convolving horizontal and vertical masks with the image; the final value combines both \(G_x\) and \(G_y\) according to the following formula:

$$\begin{aligned} |G|=\sqrt{G_x^2+G_y^2}, \quad A=\tan ^{-1}(G_y/G_x) \end{aligned}$$

Here \(G_x\) and \(G_y\) are the horizontal and vertical derivatives respectively, and \(A\), the orientation of the gradient, is calculated as above.

Discrete approximations of the masks are used for computation, in both Gaussian smoothing and Sobel edge detection.
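The Sobel step, using the standard 3×3 discrete masks and the 0.5 threshold mentioned above, can be sketched as follows (pixel intensities are assumed scaled to [0, 1], which the paper's 0.5 threshold implies but does not state):

```python
import numpy as np

# Standard 3x3 Sobel masks: discrete approximations of the
# horizontal (Gx) and vertical (Gy) derivatives.
KX = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]], dtype=float)
KY = KX.T

def sobel_edges(img, threshold=0.5):
    """Binary edge map: |G| = sqrt(Gx^2 + Gy^2), thresholded at 0.5.

    Edges whose gradient magnitude is not stronger than the
    threshold are ignored, as described in the text.
    """
    padded = np.pad(img, 1, mode="reflect")
    h, w = img.shape
    gx = np.empty((h, w))
    gy = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            win = padded[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(win * KX)
            gy[i, j] = np.sum(win * KY)
    mag = np.sqrt(gx**2 + gy**2)
    return (mag > threshold).astype(np.uint8)   # 1 = edge pixel
```

For example, on an image containing a vertical step from 0 to 1, only the columns adjacent to the step are marked as edges.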

2.2 Feature Extraction and Classification

After preprocessing is complete, feature extraction and classification are carried out. A CNN is used as the classifier; its convolutional layers use an increasing number of filters, with 32, 64 and 128 filter maps respectively.

The main structure of our model is a multi-channel CNN architecture, as shown in Fig. 1. The normal cropped images are given to the first channel. The second channel receives cropped, median-filtered images to which Sobel edge detection with a specific threshold value has been applied. The third channel receives images that have undergone median filtering followed by Gaussian smoothing. The fully connected layers of these three channels are then concatenated, merging the features extracted by the three channels. The information-rich, multi-channel network is then trained using the Adam optimiser.
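The channel structure and concatenation described above can be sketched with the Keras functional API. The 48×48 input size, the single conv-pool block per filter count, the 128-unit fully connected layer and the pooling choices are all assumptions for illustration; only the 32/64/128 filter counts, the three-channel concatenation, the seven classes and the Adam optimiser come from the text:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

def make_channel(name):
    """One CNN channel: conv blocks with 32, 64 and 128 filters,
    ending in a fully connected layer (width assumed)."""
    inp = layers.Input(shape=(48, 48, 1), name=name)
    x = inp
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)
        x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    return inp, x

# Three channels: raw crops, Sobel-edge images, smoothed images.
inputs, features = zip(*(make_channel(n)
                         for n in ("raw", "sobel", "smooth")))
merged = layers.concatenate(list(features))          # merge the FC layers
out = layers.Dense(7, activation="softmax")(merged)  # 7 expression classes
model = Model(list(inputs), out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```

Because the channels share no weights before the concatenation, each one learns its own representation of its preprocessed input, which is the property the discussion section relies on.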

3 Experimental Results and Discussion

In this section we discuss the effectiveness of the proposed approach and the reason for its efficiency. The method has been evaluated on three datasets, namely JAFFE [12], CK+ [11] and Oulu-CASIA dataset [22].

The Japanese Female Facial Expression database (JAFFE) contains 213 grayscale images of 10 female Japanese subjects. The Extended Cohn-Kanade (CK+) database has 593 image sequences of 123 subjects. Both JAFFE and CK+ cover seven expressions, namely: anger, happiness, sadness, surprise, fear, disgust and neutral. The third dataset used was the Oulu-CASIA database, which is organised by the illumination under which each video was taken: weak, strong and dark. Under the strong category, for each of the 80 subjects we chose, from a sequence of images of a specific expression at varying intensity, the last seven images showing the peak expression. The images fall into six basic expressions: all of the above, except neutral.

Fig. 2.
figure 2

The graphs compare the accuracy of models with different numbers of channels on the three chosen datasets.

To determine the optimum number of CNN channels, the normal images of the three chosen datasets were passed through 1-channel, 2-channel, 3-channel and 4-channel CNN models. For simplicity of comparison, normal images without any preprocessing were taken as inputs to the respective channels. As the results in the graph show, a clear contrast in accuracy was observed. The optimum number of channels was thus chosen to be three: with more than three channels, overfitting reduced the accuracy, while with fewer than three the model's learning efficiency and classification accuracy did not improve much, as the model could not learn the features comprehensively (Fig. 2).

To improve the performance of the three-channel model, the quality of the images was refined by various preprocessing methods [5], as mentioned in the section above. Since we wanted better feature extraction, leading to higher facial expression recognition accuracy, we explored different ways of information reduction, i.e. removing uncorrelated information while keeping only the necessary information, and of noise reduction in the images.

For noise reduction, different types of filters were considered. We used median filtering because it performed better than mean filtering: unlike the mean, which can take any value, the median is always one of the values present in the pixel's neighbourhood, making the filtered image more representative and less blurry. Hence, in the second and third channels, median filtering is applied before any other filter.

After applying the median filter in the second channel, we experimented with three major edge-detection methods: Sobel, Canny [2] and Prewitt [14]. The main aim of edge detection was information reduction, so we looked for the clearest and most well-defined edges, which were obtained with Sobel edge detection. Canny produced too many edges, and thus did not accomplish our goal of reducing uncorrelated information, while Prewitt produced too few edges and was therefore not suitable for good feature extraction.

In the third channel, applying a Gaussian smoothing filter in addition to median filtering gave an image with gentle smoothing and preserved edges. For example, black spots, stray strands of hair, etc. were almost entirely removed, yielding a better image that led to more efficient expression recognition (Fig. 3).

Table 1. Comparison of performance of our proposed approach and various state-of-the-art methods.
Fig. 3.
figure 3

As shown in the graphs above, for the JAFFE dataset the training and validation accuracy increases, and the training and validation loss decreases, as the number of epochs grows.

Regarding other FER methods, Table 1 clearly shows that the proposed method performs better than traditional hand-crafted methods of expression recognition, owing to the benefits of deep learning models [8]. In comparison with various deep learning models, too, the proposed model performs well. The multi-channel CNN model outperforms various single-channel deep learning models, as shown in the table, due to its core concept [9, 17, 19].

The idea of the multi-channel model is to connect the features extracted from different channels before the final output, so that the model can use the richer information obtained from the different channels. This is done by concatenating the fully connected layers of each channel to form one fully connected layer, which can then be used for classification. This process leads to superior feature extraction, which in turn boosts performance. The approach is also more flexible, as different types of input data can be merged in a simple way. During training, the weights of each channel are updated independently, so richer information is available during classification, and the proposed model is more robust to noise and errors because of its multiple, independent sources of information extraction. The fine-tuning of the images in the preprocessing stage, together with the proposed model, gave the network immunity against noise, which ultimately led to better recognition accuracy for FER.

4 Conclusion

In this paper, we have proposed an efficient method for facial expression recognition. Preprocessing steps such as face cropping and Sobel edge detection help the model focus on the information content only. Noise-reduction methods such as median filtering and Gaussian smoothing improve the overall quality of the image. Finally, when the preprocessed images were passed through our proposed multi-channel CNN architecture, the features learned through the three channels in tandem boosted feature learning and facial expression classification. The testing accuracy, as quoted, demonstrates the effectiveness of the proposed approach over many existing approaches.