1 Introduction

Communication is of vital importance, whether for a human or a machine [13]. While machines communicate through bits and bytes, humans communicate in several ways, such as speech, body gestures and facial expressions. Psychologically, emotions can be understood as subjective experiences accompanied by physiological, behavioral and cognitive changes and reactions. Each emotion produces a distinctive pattern in brain activation signals. For example, when surprised, a person's heart rate may increase due to being startled, and their muscles may temporarily tense and relax. The annotation of such signals is a complex task, which can be supported by annotation tools and active learning algorithms [12, 20]. Facial expressions are one method of detecting the presence of emotion, but it is also possible to combine them with speech-based or gesture-based emotion detection [7]. Such a human emotion detection system requires multimodal data containing facial expressions, vocal expressions, body gestures, and physiological signals [15].

Thus, a complex system needs to be realized, which often requires high computation power [6, 10, 16]. Machine learning applications often depend on several libraries that utilize multi-core architectures, thereby reducing the overall processing time, and several machine learning libraries are officially supported on different architectures. In this paper, the Raspberry Pi (RPI) is chosen as the Single Board Computer (SBC) due to its low cost and low power architecture. 64-bit support on the RPI provides a larger address space as well as an instruction set that can operate on 64-bit data. Hence, a 64-bit Linux operating system is needed, with machine learning libraries that support inference on the ARMv8 architecture. Additionally, a Deep Neural Network (DNN) needs to be trained to extract features using Convolutional Neural Networks (CNN) for Facial Emotion Recognition (FER) classification on the FER2013 public dataset. Several DNN architectures are evaluated for inference. With the availability of the Neural Compute Stick (NCS) from Intel, loading the neural network onto the NCS with the SBC as the host system is also evaluated.

Further sections of the paper are organized as follows: Sect. 2 provides a short literature survey on emotion recognition. The Emotion Recognition System is described in Sect. 3. In Sect. 4, we discuss the results obtained for dataset evaluation, feature map outputs, DNN model training and performance evaluation. Further discussion, along with live video output for detecting emotions, is presented in Sect. 5. Finally, Sect. 6 concludes the paper with future work.

2 Background

Ekman et al. initially identified the basic emotion theory, comprising emotions that are constant across cultures and are identified through facial expressions [3]. Fasel et al. have made an extensive survey on automatic facial expression analysis [4]. A framework for facial expression analysis is discussed, which consists of (1) Face Acquisition, (2) Facial Feature Extraction and (3) Facial Expression Classification. Several methods available for each of these categories are well explained.

Haar-like features were proposed by Viola and Jones for detecting faces [21]. Training such a classifier takes considerable time because of the large dataset, typically around 5000 faces and 35000 non-faces, but once trained, inference of the face region is fast and computationally cheap compared to several other methods such as deep convolutional networks. Haar cascade methods are widely used in object detection [17].
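As an illustration, a minimal OpenCV sketch of Haar cascade face detection, using the frontal face cascade bundled with OpenCV; the file name and detector parameters are typical defaults, not values from the paper:

    import cv2

    # Load OpenCV's pre-trained frontal face Haar cascade
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    img = cv2.imread("input.jpg")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # detectMultiScale returns an (x, y, w, h) box for each detected face
    faces = cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    for (x, y, w, h) in faces:
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)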

An emotion recognition approach based on a video dataset, an annotated version of 30 movies along the valence and arousal axes, is proposed by Baveye et al. [2]. They use machine learning algorithms based on CNNs and Support Vector Machines (SVM) for regression, as well as a combination of both through transfer learning. With the availability of powerful embedded systems, researchers are investigating the performance of CNNs on embedded systems. Pena et al. have presented benchmarks for the networks GoogLeNet [18], AlexNet [8], Network in Network [24] and CIFAR10 [11] on the target devices RPI and Intel Joule 570X, and also using an NCS [14].

A deep cascaded multi-task framework is proposed by Zhang et al., where the model exploits the inherent correlation between face detection and alignment to boost network performance [25]. The design consists of three cascaded CNNs, where predictions are carried out in a coarse-to-fine manner. These networks are trained using the WIDER FACE [23] dataset for face detection and the Annotated Facial Landmarks in the Wild (AFLW) benchmark for face alignment. Although this multi-task cascaded CNN is complex, it has proven to meet real-time performance requirements as well. A Weighted Mixture Deep Neural Network (WMDNN) fuses the weights from two different CNNs: the first is a VGG16 network trained on the ImageNet database, and the other is a shallow CNN [22].

With all the advancements in CNNs and DNNs, it is clear that the objective of using these networks is tending towards real-time processing on devices such as Internet of Things (IoT) nodes, SBCs, etc. Since these devices are low powered, additional add-on devices are being designed for processing large DNNs, and research is moving towards optimizing hardware so that CNNs can run on low-power chips. Andri et al. provide a hardware accelerator specifically optimized for binary-weight CNNs [1]. Limiting the network to binary weights during training removes extensive computations, resulting in reduced bandwidth and storage.

3 Emotion Recognition System

Emotion Recognition System consists of three phases:

  • Face Detection

  • Emotion Feature Extraction

  • Emotion Classifier.

During the Face Detection phase, a bounding box of the detected face is obtained first. To obtain the bounding box, two approaches have been used: the Multi-Task Cascaded Convolutional Neural Network (MTCNN) and the Haar Cascade method. Once the bounding box is acquired in the image, the face region is cropped and resized to the DNN input size, i.e., \(48\times 48\) pixels. Since a DNN with multiple layers is used, Emotion Feature Extraction and the Emotion Classifier are combined, and the output of the Emotion Recognition DNN is the classifier output, which contains the probability estimates of the emotions.
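A minimal sketch of this pipeline, assuming a trained Keras model saved under the hypothetical file name ermodel_cls7_conv11.h5 and Haar cascade face detection; the pixel normalization step is an assumption, and the emotion label order follows the standard FER2013 mapping:

    import cv2
    import numpy as np
    from tensorflow.keras.models import load_model

    EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]

    model = load_model("ermodel_cls7_conv11.h5")  # hypothetical trained model file
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def classify_emotions(frame):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in cascade.detectMultiScale(gray, 1.3, 5):
            # Crop the face region and resize to the DNN input size of 48x48
            face = cv2.resize(gray[y:y + h, x:x + w], (48, 48))
            face = face.astype("float32") / 255.0   # assumed normalization
            face = face.reshape(1, 48, 48, 1)       # batch, height, width, channel
            probs = model.predict(face)[0]          # 7 class probabilities
            yield (x, y, w, h), dict(zip(EMOTIONS, probs))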

3.1 Deep Neural Networks

FER2013 Database. As part of the ICML 2013 challenge [5], a facial expression recognition database was introduced that was created using a search engine provided by Google, following the "in the wild" methodology of database creation. The dataset was prepared by Pierre-Luc Carrier and Aaron Courville. In total, there are 35,887 images provided in a CSV file whose columns contain the emotion type, the 1-D grayscale image of dimension \(48\times 48\), and the usage of the image for Training, Public (Validation) or Private (Testing), respectively.
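A minimal sketch of loading the CSV into image arrays; the column names emotion, pixels and Usage, with values Training, PublicTest and PrivateTest, follow the standard FER2013 layout:

    import numpy as np
    import pandas as pd

    # FER2013 ships as a single CSV with columns: emotion, pixels, Usage
    df = pd.read_csv("fer2013.csv")

    def to_images(rows):
        # Each row stores 48*48 = 2304 space-separated grayscale pixel values
        pixels = np.stack([np.asarray(s.split(), dtype=np.uint8)
                           for s in rows["pixels"]])
        return pixels.reshape(-1, 48, 48, 1), rows["emotion"].to_numpy()

    x_train, y_train = to_images(df[df["Usage"] == "Training"])
    x_val,   y_val   = to_images(df[df["Usage"] == "PublicTest"])
    x_test,  y_test  = to_images(df[df["Usage"] == "PrivateTest"])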

Figure 1 shows the dataset distribution for the 7 emotions. The distribution is imbalanced: for example, the Disgust emotion has the fewest data samples, while the Happy emotion has significantly more. Apart from Disgust and Happy, the remaining emotions appear reasonably balanced.

Fig. 1. FER2013 dataset distribution for 7 emotions

Models. The input images are of size \(48\times 48\) pixels and single channel, as they are grayscale. The DNN model consists of several CNN layers, pooling layers, and Fully Connected (FC) layers. The kernel filter size plays a crucial role in computational complexity, and hence it should be kept as small as possible. Cascading convolutional layers allows the model to learn more features while keeping the filter size minimal. We use Keras [9] as the machine learning API, with TensorFlow [19] as the backend, for model definition and training. A summary of the models, along with the number of parameters required for training, is shown in Table 1.

Table 1. Training Model and its parameter count

To begin with, the model is designed with a small number of features so that it first learns low-level features such as lines and edges, and the feature size then gradually increases. As an example, Fig. 2 shows the DNN architecture for all variants of the ER DNN models. The input to the network is a \(48\times 48\) grayscale image; convolution layers with kernel size \(3\times 3\) are used along with the Rectified Linear Unit (ReLU) activation, max pooling of size \(2\times 2\), and 'same' padding. The final block is a fully connected stage with two 1024-unit Dense layers, and the last layer of the model is the classification output for the 7 classes, consisting of the probability estimate for each emotion class.
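A minimal Keras sketch of such a model; the number of filters per stage (32/64/128) and the number of convolution layers shown are illustrative assumptions, since the exact per-variant configuration is given in Table 1:

    from tensorflow.keras import layers, models

    def build_ermodel(num_classes=7):
        # Sketch of a small variant; deeper variants cascade more 3x3 convs
        return models.Sequential([
            layers.Conv2D(32, (3, 3), padding="same", activation="relu",
                          input_shape=(48, 48, 1)),
            layers.MaxPooling2D((2, 2)),
            layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
            layers.MaxPooling2D((2, 2)),
            layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
            layers.MaxPooling2D((2, 2)),
            layers.Flatten(),
            layers.Dense(1024, activation="relu"),
            layers.Dense(1024, activation="relu"),
            layers.Dense(num_classes, activation="softmax"),
        ])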

Fig. 2. Emotion Recognition DNN Model: Feature extraction and classification based on CNN

4 Results

4.1 Dataset Validation Results

Model evaluation and model selection are the two factors considered for dataset validation. Model selection consists of hyper-parameter tuning for a specific class of models, such as neural networks or linear models, while model evaluation provides an unbiased estimate of the predictive power of a specific model.

Figure 3 shows the 10-fold cross-validation result for the neural network model shown in Fig. 2, an 11-CNN-layer network. The mean accuracy was 67.05%. As the standard deviation was 0.006 (0.6%), the test error expected during inference is within ±0.6%.
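A minimal sketch of the 10-fold cross-validation procedure using scikit-learn and the build_ermodel() sketch above; the optimizer, epoch count and batch size are assumptions, not values from the paper:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    # 10-fold cross validation over the training images (x_train, y_train
    # as loaded from the FER2013 CSV above)
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    scores = []
    for train_idx, val_idx in kfold.split(x_train, y_train):
        model = build_ermodel()                  # fresh model for each fold
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(x_train[train_idx], y_train[train_idx],
                  epochs=50, batch_size=64, verbose=0)  # assumed hyper-parameters
        _, acc = model.evaluate(x_train[val_idx], y_train[val_idx], verbose=0)
        scores.append(acc)

    print("mean %.2f%% +/- %.2f%%"
          % (np.mean(scores) * 100, np.std(scores) * 100))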

Fig. 3. Training and validation results for 10-fold cross validation

Fig. 4. Feature map outputs at several layers for ermodel_cls7_conv3

4.2 Feature Map Output

The feature maps that the DNN has learned can be visualized as shown in Fig. 4. To obtain these visualizations, an image of the emotion "Happy" was given as input to the ermodel_cls7_conv3 network model, which correctly classified the image as "Happy". For convolutional layers, the feature maps are the outputs of the ReLU activation; the maps shown in the figure represent a color map distribution of the ReLU output for convolutional layers 1, 2 and 3. In the color map, the minimum value of zero is indicated in navy blue and the maximum value in yellow. We can see that the DNN first learns to detect edges and gradually adapts to learn more detailed features.
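A minimal sketch of how such feature maps can be extracted and plotted with Keras and matplotlib; viridis is assumed as the color map, matching the navy-blue-to-yellow scale described above:

    import matplotlib.pyplot as plt
    from tensorflow.keras.models import Model

    # Expose the ReLU outputs of the convolutional layers of a trained model
    conv_layers = [l for l in model.layers if "conv" in l.name]
    probe = Model(inputs=model.input,
                  outputs=[l.output for l in conv_layers])

    # 'face' is a preprocessed (1, 48, 48, 1) "Happy" input, as in Sect. 3
    maps = probe.predict(face)
    for i, fmap in enumerate(maps):
        plt.figure()
        # Plot the first channel of each layer; viridis runs from navy
        # blue (0) to yellow (max), as in Fig. 4
        plt.imshow(fmap[0, :, :, 0], cmap="viridis")
        plt.title("conv layer %d" % (i + 1))
    plt.show()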

Table 2. Emotion Recognition Models with accuracies for Train, Validation and Test dataset for FER2013

4.3 DNN Models Training

Emotion Recognition models were trained using the FER2013 dataset. There were five network models, each configured with an increasing number of convolution layers as shown in Table 1. The model name can be interpreted as follows: ermodel stands for Emotion Recognition Model, cls7 gives the number of emotion classes, and convX represents X convolution layers in the model. A summary of the training results for these networks is shown in Table 2. The accuracy of ermodel_cls7_conv11 on the Test dataset was 69%, with a training accuracy of 95.28%; this is the highest among all the networks, with a parameter count of 4,317,543. ermodel_cls7_conv8 has the fewest parameters to be trained, i.e., 2,826,759, with accuracies of 90.73% and 67.9% on the Training and Test datasets respectively.

4.4 SBC Performance Measurements

To evaluate the inference time of the ER models on the NCS, the profiler tool provided by the NCSDK is used to measure the inference time required by the NCS. Table 3 provides a summary of the mvNCProfile tool output.
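A minimal sketch of running inference on the NCS using the NCSDK v1 Python API, assuming the trained model has already been converted to a graph file (here named ermodel.graph) with the NCSDK's mvNCCompile tool:

    import numpy as np
    from mvnc import mvncapi as mvnc

    # Open the first attached Neural Compute Stick
    devices = mvnc.EnumerateDevices()
    device = mvnc.Device(devices[0])
    device.OpenDevice()

    # Load the compiled graph (produced offline with mvNCCompile)
    with open("ermodel.graph", "rb") as f:
        graph = device.AllocateGraph(f.read())

    # Placeholder for a preprocessed 48x48 grayscale face (see Sect. 3 sketch);
    # the NCS expects half-precision input tensors
    face = np.zeros((48, 48, 1), dtype=np.float32)
    graph.LoadTensor(face.astype(np.float16), 'user object')
    probs, _ = graph.GetResult()   # 7 emotion class probabilities

    graph.DeallocateGraph()
    device.CloseDevice()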

Table 3. Emotion Recognition Models inference time on NCS device

All the trained models for emotion recognition were run on RPI 3B+ hardware to detect emotions through a live feed from the camera interface. Different hardware configurations were used to evaluate four parameters (a minimal timing sketch follows the list):

  • CPU Load: This metric evaluates the RPI CPU load when executing the DNN network for inference and any I/O operations, such as reading frames from the camera and loading the DNN onto the NCS.

  • DNN Load Time: This metric measures the initialization time of the application, which involves loading the DNN libraries and the Emotion Recognition model.

  • Frame Rate: This metric measures the RPI's capability to process live feed images from the camera in one second.

  • Processing Time: This metric measures the time taken to process a single image captured from the camera, including reading the camera image, face region detection (RPI/NCS), emotion recognition inference (RPI/NCS) and displaying the emotion graph in the application GUI.
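A minimal sketch of how frame rate and processing time can be measured on the RPI, reusing the classify_emotions() helper sketched in Sect. 3; the frame count is arbitrary:

    import time
    import cv2

    cap = cv2.VideoCapture(0)   # camera interface on the RPI
    frames, t0 = 0, time.time()
    while frames < 100:
        t_start = time.time()
        ok, frame = cap.read()
        if not ok:
            break
        for box, probs in classify_emotions(frame):  # sketch from Sect. 3
            pass                                     # draw emotion graph here
        print("processing time: %.0f ms" % ((time.time() - t_start) * 1000))
        frames += 1
    print("frame rate: %.1f fps" % (frames / (time.time() - t0)))
    cap.release()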

Figure 5(a)–(d) shows the plots of the above metrics. ER model inference is carried out using either the NCS or the RPI; face detection with the Haar Cascade method runs on the RPI, while MTCNN uses both the RPI and the NCS (dual).

Fig. 5. Performance parameters measured on RPI and NCS

5 Discussion

The ER system can be run successfully in several configurations, some of which come at the price of cost and computation. If total price is the limitation, the ER system can run completely on the RPI; in this case, using the Haar Cascade method for face detection is the best solution, as it gives 4.5 fps with a detection time of 222 ms. For an excellent frame rate and a detection time of 100 ms, using the NCS is more appropriate: in this configuration, Haar Cascade on the RPI and ER on the NCS gave a frame rate of 8 fps. Although ermodel_cls7_conv3 gives a good frame rate, it is better to use ermodel_cls7_conv8 or ermodel_cls7_conv11, as they showed better accuracy in detecting emotions during live video feed testing. Moreover, ermodel_cls7_conv8 has fewer trained parameters and hence requires less computation power and processing time. With the DNN models running completely on the NCS, CPU usage is very low, processing time and loading time are stable and independent of the ER model, and the frame rate is stable between 5 and 6 fps. However, this configuration is expensive to realize, as it requires more NCS devices and could also introduce communication overhead.

Figure 6 shows the ER graph for Angry, Fear, Happy, Sad, Surprise and Neutral. Two different subjects were asked to enact several emotions, with the ER application configured to use the Haar Cascade method for face detection and ermodel_cls7_conv11 as the ER model. Even though the Disgust emotion was enacted, the ER model was not able to classify it correctly. This is because the model was not trained with sufficient images for the Disgust class.

Fig. 6. ER Application showing the ER graph of six basic emotions

6 Conclusion

An Emotion Recognition System was realized on an SBC with low power requirements. The system recognizes emotions through a live video interface, successfully detecting face regions and predicting emotions. The work required a Linux operating system supporting ARMv8 on the RPI, so that the libraries required for DNN inference could be loaded successfully onto the SBC. Furthermore, the graphical user interface application that displays the camera feed and emotion graph was designed using multi-threading concepts, thereby utilizing the SBC's multi-core architecture. Using CNNs, several DNN architectures were realized to extract features from the FER dataset. The CNN feature maps reveal the features learned in the various CNN layers, which feed a classifier that is a simple MLP.

The choice of the RPI as the SBC has proven to provide sufficient CPU computation power for running the DNN algorithms. Results show that multiple DNN algorithms, such as MTCNN and ER, can perform well together with a frame rate of around 2 fps on the RPI. Given that changes in emotion are not instantaneous in real-world applications, the frame rates achieved for all configurations and ER DNN models are within an acceptable range. Furthermore, the best test accuracy achieved on the FER2013 dataset was around 69%, for the ermodel_cls7_conv11 ER model. The 10-fold CV result for this model was a mean accuracy of \(67.05\%\pm 0.6\%\), which indicates that the model is quite stable.