
1 Introduction

Facial expression is one of the most important modes of nonverbal communication between people. The human face is highly expressive: it can convey emotions such as happiness, anger, sadness, surprise, fear and disgust without a single word. The analysis of facial expressions has therefore been an attractive topic in computer vision, with a wide spectrum of potential applications including behavioural analysis, psychology and human-computer interaction. The process can be divided into three main tasks, i.e. face detection, feature extraction and classification. For many years, the analysis was carried out using hand-crafted low-level descriptors, which are appearance based or geometry based [8]. These methods struggle with challenges such as partial occlusion of facial regions, illumination variations and head pose variations. Advances in the deep learning domain address these issues [8]: deep models such as CNNs and RNNs have been used for both feature extraction and classification. Researchers continue to work on improving the results of these algorithms. Yu et al. [21] fused CNNs by learning a set of weights over the network responses. Zhao et al. [23] advocated the use of deep belief networks (DBNs) to learn features automatically, with recognition performed by a multilayer perceptron (MLP). Using two different types of CNN, Jung et al. [6] extracted temporal appearance features and temporal geometry features from image sequences. Song et al. [18] used 3D-CNNs for the 3D object detection task. In this paper, we build on this understanding of CNNs to develop a novel architecture that improves classification accuracy. Preprocessed images are fed into two channels, while normal (unprocessed) images are fed through the third channel.
After each CNN channel has extracted its features, we concatenate the fully connected layers of the three channels. In this way we combine the features extracted by the different CNN pipelines and train them jointly, which provides a significant boost to performance since richer information about the images is available. We have applied the algorithm to widely used benchmark FER datasets: JAFFE [12], CK+ [11] and the Oulu-CASIA database [22].

2 Proposed Method

The proposed methodology is divided into two subsections. The first explains the preprocessing techniques, while the second elaborates the feature extraction and classification methods. Figure 1 provides an in-depth view of the proposed approach, which is discussed in the following subsections.

Fig. 1.
figure 1

The figure shows the design of our proposed model. The input images are preprocessed and fed into the multi-channel CNN model, as shown.

2.1 Preprocessing

Preprocessing plays an important role, since most of the available datasets contain a lot of background data, which reduces the efficiency of expression recognition. Moreover, focusing only on the face and removing unrelated information from the images leads to a better training process for the classifier. Hence we crop the frontal faces from the images and enhance their content by applying beneficial filter combinations.

We begin by cropping the images. For face detection, Haar-cascade detection [3] is used. In this method, a cascade of classifiers is applied to every image at a particular window size, and a window that passes all stages of the cascade is concluded to contain a face region. After the images are cropped, image-processing filters are applied to remove noise and reduce unnecessary information in the images. These are as follows:

A. Median filtering: In median filtering [13], for each pixel the neighbourhood pixels are located, their values are sorted in ascending order, and the value of the pixel under consideration is replaced by the computed median. This process is repeated for every pixel of the image.
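The step above can be sketched in pure NumPy (the 3×3 neighbourhood size and reflect-padding at the borders are assumptions, since the paper does not state them):

```python
import numpy as np

def median_filter(img, k=3):
    """Replace each pixel by the median of its k x k neighbourhood.

    Pure-NumPy sketch of the median filtering step described above;
    borders are handled by reflect-padding, and k is assumed odd.
    """
    pad = k // 2
    padded = np.pad(img, pad, mode="reflect")
    out = np.empty_like(img)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.median(padded[i:i + k, j:j + k])
    return out

# A single impulse-noise pixel in a flat region is removed entirely:
noisy = np.zeros((5, 5))
noisy[2, 2] = 255.0
print(median_filter(noisy)[2, 2])  # -> 0.0
```

This illustrates why median filtering suits salt-and-pepper-style noise: the outlier value never appears in the output, unlike with mean filtering.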

B. Gaussian smoothing filter: Gaussian smoothing filters [4] are also used in our proposed model for noise reduction. Unlike median filtering, Gaussian smoothing is a linear method. For 2D images it has the isotropic (circularly symmetric) form:

$$\begin{aligned} G(x,y)=\frac{1}{2\pi \sigma ^2}e^{-\frac{x^2+y^2}{2\sigma ^2}} \end{aligned}$$

where \(\sigma \) controls the degree of smoothing. The method is a weighted averaging over each pixel's neighbourhood, giving more weight to the central pixels than to the farther neighbourhood pixels. Applying a Gaussian smoothing filter yields an image with gentle smoothing and preserved edges. The standard deviation for Gaussian smoothing is set to 0.025, determined experimentally.
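A minimal NumPy sketch of this weighted averaging follows; the 5×5 kernel size and reflect-padding are assumptions (note the paper's \(\sigma = 0.025\) suggests coordinates normalised to a unit range, whereas this sketch uses pixel units):

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Discrete approximation of the isotropic 2-D Gaussian G(x, y)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    return k / k.sum()              # normalise so the weights sum to 1

def gaussian_smooth(img, size=5, sigma=1.0):
    """Weighted averaging: convolve the image with the Gaussian kernel.

    Central pixels of each neighbourhood receive the largest weights,
    matching the description in the text.
    """
    pad = size // 2
    padded = np.pad(img, pad, mode="reflect")
    kernel = gaussian_kernel(size, sigma)
    out = np.empty(img.shape, dtype=float)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + size, j:j + size] * kernel)
    return out
```

Because the kernel is normalised, a flat region passes through unchanged while isolated fluctuations are averaged away.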

C. Sobel filter: Edges are regions in an image where there is a sharp change in pixel brightness. Sobel filters [7] are used to detect the nature of this change in intensity and thereby decide whether a region is an edge or not. Using this edge-detection method, we convert the image to a binary image with 0.5 as the threshold value. The threshold, obtained by experimentation, is chosen such that all edges not stronger than the threshold are ignored. For each pixel, gradient approximations are computed by convolving horizontal and vertical masks with the image; the final value combines both \(G_x\) and \(G_y\) according to the following formula:

$$\begin{aligned} |G|=\sqrt{G_x^2+G_y^2}, \quad A=\tan ^{-1}(G_y/G_x) \end{aligned}$$

Here \(G_x\) and \(G_y\) are the horizontal and vertical derivatives respectively, and \(A\), the orientation of the gradient, is calculated as above.

Discrete approximations of the masks are used for computation, in both Gaussian smoothing and Sobel edge detection.
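The Sobel step, using the standard 3×3 discrete masks and the 0.5 threshold mentioned above, can be sketched as follows (pixel intensities are assumed scaled to [0, 1], which the paper's 0.5 threshold implies but does not state):

```python
import numpy as np

# Standard 3x3 Sobel masks: discrete approximations of the
# horizontal (Gx) and vertical (Gy) derivatives.
KX = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]], dtype=float)
KY = KX.T

def sobel_edges(img, threshold=0.5):
    """Binary edge map: |G| = sqrt(Gx^2 + Gy^2), thresholded at 0.5.

    Edges whose gradient magnitude is not stronger than the
    threshold are ignored, as described in the text.
    """
    padded = np.pad(img, 1, mode="reflect")
    h, w = img.shape
    gx = np.empty((h, w))
    gy = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            win = padded[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(win * KX)
            gy[i, j] = np.sum(win * KY)
    mag = np.sqrt(gx**2 + gy**2)
    return (mag > threshold).astype(np.uint8)   # 1 = edge pixel
```

For example, on an image containing a vertical step from 0 to 1, only the columns adjacent to the step are marked as edges.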

2.2 Feature Extraction and Classification

After preprocessing is complete, feature extraction and classification are carried out. A CNN is used as the classifier; its convolutional layers use an increasing number of filters, with 32, 64 and 128 filter maps respectively.

The main structure of our model is a multi-channel CNN architecture, as shown in Fig. 1. The normal cropped images are given to the first channel. The second channel receives cropped, median-filtered images to which Sobel edge detection with a specific threshold value has been applied. The third channel receives images that have undergone median filtering followed by Gaussian smoothing. The fully connected layers of these three channels are then concatenated, merging the features extracted by the three channels. The information-rich, multi-channel network is then trained using the Adam optimiser.
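The channel structure and concatenation described above can be sketched with the Keras functional API. The 48×48 input size, the single conv-pool block per filter count, the 128-unit fully connected layer and the pooling choices are all assumptions for illustration; only the 32/64/128 filter counts, the three-channel concatenation, the seven classes and the Adam optimiser come from the text:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

def make_channel(name):
    """One CNN channel: conv blocks with 32, 64 and 128 filters,
    ending in a fully connected layer (width assumed)."""
    inp = layers.Input(shape=(48, 48, 1), name=name)
    x = inp
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)
        x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    return inp, x

# Three channels: raw crops, Sobel-edge images, smoothed images.
inputs, features = zip(*(make_channel(n)
                         for n in ("raw", "sobel", "smooth")))
merged = layers.concatenate(list(features))          # merge the FC layers
out = layers.Dense(7, activation="softmax")(merged)  # 7 expression classes
model = Model(list(inputs), out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```

Because the channels share no weights before the concatenation, each one learns its own representation of its preprocessed input, which is the property the discussion section relies on.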

3 Experimental Results and Discussion

In this section we discuss the effectiveness of the proposed approach and the reason for its efficiency. The method has been evaluated on three datasets, namely JAFFE [12], CK+ [11] and Oulu-CASIA dataset [22].

The Japanese Female Facial Expression database (JAFFE) contains 213 grayscale images of 10 female Japanese subjects. The Extended Cohn-Kanade (CK+) database has 593 image sequences of 123 subjects. Both JAFFE and CK+ cover seven expressions, namely: anger, happiness, sadness, surprise, fear, disgust and neutral. The third dataset used was the Oulu-CASIA database, which is organised by the illumination under which each video was taken: weak, strong and dark. Under the strong category, for each of the 80 subjects we chose, from a sequence of images of a specific expression at varying intensity, the last seven images showing the peak expression. The images fall into six basic expressions: all of the above, except neutral.

Fig. 2.
figure 2

The graphs compare the accuracy of models with different numbers of channels on the three chosen datasets.

To determine the optimum number of CNN channels, the normal images of the three chosen datasets were passed through 1-channel, 2-channel, 3-channel and 4-channel CNN models. For simplicity of comparison, normal images without any preprocessing were taken as inputs to the respective channels. As the results in the graph show, a clear contrast in accuracy was observed. The optimum number of channels was thus chosen to be three: with more than three channels, overfitting reduced the accuracy, while with fewer than three the model's learning efficiency and classification accuracy did not improve much, as the model could not learn the features comprehensively (Fig. 2).

To improve the performance of the three-channel model, the quality of the images was refined by various preprocessing methods [5], as mentioned in the section above. Since we wanted better feature extraction, leading to higher facial expression recognition accuracy, we explored different ways of information reduction, i.e. removing uncorrelated information while keeping only the necessary information, and of noise reduction in the images.

For noise reduction, different types of filters were considered. We used median filtering because it performed better than mean filtering: unlike the mean, which can take any value, the median is always one of the values present in the pixel's neighbourhood, making the filtered image more representative and less blurry. Hence, in the second and third channels, median filtering is applied before any other filter.

After applying the median filter in the second channel, we experimented with three major edge-detection methods: Sobel, Canny [2] and Prewitt [14]. The main aim of edge detection was information reduction, so we looked for the clearest and most well-defined edges, which were obtained with Sobel edge detection. Canny produced too many edges, and thus did not accomplish our goal of reducing uncorrelated information, while Prewitt produced too few edges and was therefore not suitable for good feature extraction.

In the third channel, applying a Gaussian smoothing filter in addition to median filtering gave an image with gentle smoothing and preserved edges. For example, black spots, stray strands of hair, etc. were almost entirely removed, yielding a better image that led to more efficient expression recognition (Fig. 3).

Table 1. Comparison of performance of our proposed approach and various state-of-the-art methods.
Fig. 3.
figure 3

As shown in the graphs above, for the JAFFE dataset the training and validation accuracy increases, and the training and validation loss decreases, as the number of epochs grows.

Regarding other FER methods, Table 1 clearly shows that the proposed method performs better than traditional hand-crafted methods of expression recognition, owing to the benefits of deep learning models [8]. In comparison with various deep learning models, too, the proposed model performs well. The multi-channel CNN model outperforms various single-channel deep learning models, as shown in the table, due to its core concept [9, 17, 19].

The idea of the multi-channel model is to connect the features extracted from different channels before the final output, so that the model can use the richer information obtained from the different channels. This is done by concatenating the fully connected layers of each channel to form one fully connected layer, which can then be used for classification. This process leads to superior feature extraction, which in turn boosts performance. The approach is also more flexible, as different types of input data can be merged in a simple way. During training, the weights of each channel are updated independently, so richer information is available during classification, and the proposed model is more robust to noise and errors because of its multiple, independent sources of information extraction. The fine-tuning of the images in the preprocessing stage, together with the proposed model, gave the network immunity against noise, which ultimately led to better recognition accuracy for FER.

4 Conclusion

In this paper, we have proposed an efficient method for facial expression recognition. Preprocessing steps such as face cropping and Sobel edge detection help the model focus on the information content only. Noise-reduction methods such as median filtering and Gaussian smoothing improve the overall quality of the image. Finally, when the preprocessed images were passed through our proposed multi-channel CNN architecture, the features learned through the three channels in tandem boosted feature learning and facial expression classification. The testing accuracy, as quoted, demonstrates the effectiveness of the proposed approach over many existing approaches.