1 Introduction

Deep forest learning is a recent method, initiated by [14, 21], that approaches classification and regression problems by making a conventional shallow learner such as the random forest (an ensemble of decision trees) learn deeply. The prevalence of the Deep Neural Network (DNN) in Machine Learning (ML) and Artificial Intelligence (AI) cannot be overemphasised. Deep learning is said to be as old as the Artificial Neural Network (ANN), but it went into hibernation because of its computational complexity and its demand for large volumes of data [5]. In recent years, the availability of sophisticated computational resources and the advent of the internet, which makes it possible to collect large datasets, have played a remarkable role in bringing deep learning back to the forefront of machine learning. Deep learning has proven its worth in several areas of classification and regression, and it has outperformed conventional classifiers in most machine learning tasks, such as image processing, computer vision, pattern matching, biometrics, bioinformatics, and speech processing and recognition. Nevertheless, despite its computational prowess, its appetite for large datasets and computational resources remains a challenge. There is therefore a need to explore other machine learning models and to look for opportunities to enhance their efficiency and accuracy.

Deep forest is still very new in machine learning, which implies that its applications are yet to be fully explored. Forward Thinking Deep Random Forest and gcForest are the two widely available deep forest models. Both are reported to perform similarly to, if not better than, DNN in experiments on the MNIST dataset, with the additional advantages of low computational time, limited hyper-parameter tuning and dynamic adaptation to the quantity of available data. Our task in this paper is to develop a deep learning model from an ensemble of forests for the classification of facial expressions into the six basic emotions, relying on the inherent affinity of forests for multi-class problems. Facial expression recognition is a multi-class problem whose goal is to detect the human affective state from the deformation of the face caused by the response of the facial muscles to emotional states. To the best of our knowledge, this work is the first to apply a layer-by-layer ensemble of forests to the task of facial expression classification.

The rest of this paper is organised as follows. Section 2 describes related work in detail, covering the performance and limitations of some classification models on facial expression recognition data. Section 3 gives a brief introduction to random forest and describes the proposed deep forest framework for facial expression recognition. Section 4 discusses the databases used in the experiments, while Sect. 5 details the experiments performed and analyses the results. Section 6 concludes the work.

2 Related Works

The complexity of facial expressions and the subtle variations in their transitions give rise to several challenges in the field. One of the major challenges is the classification of facial expressions into the six classes proposed by [9]. Many classification and regression algorithms have been proposed to address this challenge, including Support Vector Machine (SVM) [15], boosting (AdaBoost) [7], Convolutional Neural Network (CNN) [2], Decision Tree [18], Random Forest [4] and Artificial Neural Network (ANN) [19], to mention a few. These classifiers have reportedly produced promising results, depending on the approach.

The impressive performance of the decision tree on classification problems has led to its application in several machine learning fields. [18] used a decision tree to classify features extracted with a distance-based feature extraction method. There are not many works on facial expression recognition with decision trees, because of the overfitting they exhibit on high-dimensional data [16]; the available ones use either a boosted version (AdaBoost) or an ensemble version (random forest). The decision tree has been greatly enhanced by the introduction of the random forest [3]. A random forest is an ensemble of learners in which each individual learner is considered a weak learner. The decision tree has been widely explored as the weak learner for random forest, which is likely the reason a random forest is described as an ensemble of decision trees. [6] extended the capability of random forest to a spatiotemporal setting, where subtle facial expression dynamics are more pronounced. Their model conditions the random forest pairwise on the expression labels of two successive frames, so that the transitional variation of the present expression in the sequence is minimized by conditioning on the most recent previous frame. [12] hybridized deep learning and decision trees, combining the representation learning of deep networks with the divide-and-conquer property of decision trees. Differentiable backpropagation was employed to enable the decision tree to achieve end-to-end learning while preserving representation learning at the lower layers of the network, so that the representation learning would minimize any uncertainty that could emerge from the split nodes and thus minimize the loss function. The concept of Deep Forest goes beyond the integration of decision trees into a Deep Neural Network as proposed in [12].
[14, 21] thoroughly highlighted the challenges encountered when implementing Deep Neural Networks: the computational cost of using backpropagation to train multiple layers of nonlinear activation functions, the massive memory consumption when training complex DNN models, overfitting and poor generalization on small volumes of data, and the complexity of hyperparameter tuning. There is therefore a need for a type of deep learning model that minimizes these challenges. [14] proposed a deep learning model (Forward Thinking Deep Random Forest, FTDRF), different from ANN, in which the neurons are replaced by a shallow classifier. The network is formed by layers of random forests, and the decision tree, which is the building block of the forest, is used in place of the neuron. The model is trained layer by layer, as opposed to the once-off training complexity and rigidity experienced in DNN. Likewise, the Deep Forest learning model (gcForest) proposed by [21] ensures diversity in its architecture, which consists of layers with different random forests. Both models successfully implement deep learning from random forests without backpropagation, although the way they achieve this differs slightly: gcForest connects to the subsequent layer using the output of the random forests of the preceding layer, whereas FTDRF connects using the output of the individual decision trees in the random forest of the preceding layer. As stated earlier, both models were reported to outperform DNN in performance evaluation experiments on the MNIST dataset.

3 Deep Forest Learning

Before providing the details of the Deep Forest learning operations, we first discuss the basic concept of the random forest.

3.1 Random Forest

Random forest was introduced by Breiman [3]. Before the advent of Breiman’s work, the tree learning algorithm (decision tree) already existed and was effective and efficient. A tree can be implemented either as a shallow tree or as a fully grown (deep) tree. A shallow tree has high bias and low variance and therefore tends to underfit, which is often addressed by boosting (AdaBoost), whereas a fully grown tree has low bias and high variance and therefore tends to overfit. Breiman built the ensemble idea on the earlier works of [1, 8, 20] and proposed the random forest algorithm, which is efficient for both regression and classification tasks. The algorithm implements bagging (bootstrap aggregation): it randomly creates several bootstrap samples from the raw data so that each new sample acts as another independent dataset drawn from the true distribution, fits a weak learner (decision tree) to each of the samples, and finally aggregates the outputs of the learners. The aggregation depends on the task: for a regression problem the aggregate is the average of all the learners’ outputs, and for classification the class with the highest vote is favoured. Random forest is known for its fast and easy implementation; it scales with the volume of data while maintaining sufficient statistical efficiency. It can be adopted for several prediction problems with little parameter tuning, and it retains its accuracy irrespective of data size.
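The bagging procedure can be illustrated with a minimal Python sketch. It is not Breiman's full algorithm: the weak learner is a hypothetical one-feature threshold stump (`fit_stump`) standing in for a decision tree, and all function names and the toy data are assumptions made for this example only.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def fit_stump(sample):
    """Hypothetical weak learner for a binary toy problem: threshold one
    feature at the midpoint between the two class means."""
    by_class = {}
    for x, y in sample:
        by_class.setdefault(y, []).append(x)
    if len(by_class) < 2:
        # Degenerate bootstrap sample with one class: constant prediction.
        label = next(iter(by_class))
        return lambda x: label
    (c0, xs0), (c1, xs1) = sorted(
        by_class.items(), key=lambda kv: sum(kv[1]) / len(kv[1])
    )
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: c0 if x < threshold else c1

def bagged_ensemble(data, n_learners, seed=0):
    """Fit one weak learner per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(data, rng)) for _ in range(n_learners)]

def predict_class(ensemble, x):
    """Classification: the class with the most votes wins."""
    votes = Counter(learner(x) for learner in ensemble)
    return votes.most_common(1)[0][0]

def predict_value(outputs):
    """Regression: average of the learners' outputs."""
    return sum(outputs) / len(outputs)

# Toy one-dimensional data: (feature, label) pairs.
data = [(0.1, "a"), (0.2, "a"), (0.3, "a"), (0.8, "b"), (0.9, "b"), (1.0, "b")]
ensemble = bagged_ensemble(data, n_learners=25)
```

Calling `predict_class(ensemble, x)` returns the majority label for a new point, while `predict_value` shows the averaging rule that would be used for regression outputs.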

3.2 Proposed Facial Expression Deep Forest Framework

The deep forest learning architecture, presented in Fig. 1, is a layer-by-layer architecture in which each layer comprises many forests; a layer links with its successor by passing the output of its forests as input to the next layer. This work enhances the deep forest model proposed by [21] by introducing trees with different features at strategic positions for better performance. The model consists of two phases: the feature learning phase and the deep forest learning phase. The feature learning phase performs feature extraction, similar to the convolution operation in DNN. It uses windows of different sizes to scan the raw face expression images in order to obtain a class representative vector. The class vector is an N-dimensional feature vector extracted from a member of a class and is then used to train the Deep Forest.
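The window scanning step of the feature learning phase can be sketched as follows. This is only an illustration of how windows of different sizes slide over an image; the function name `scan_windows`, the stride parameter and the toy 4 x 4 input are assumptions, and the step that turns each patch into a class vector with a forest is omitted.

```python
def scan_windows(image, size, stride=1):
    """Return every size x size patch of a 2D image (a list of rows),
    flattened row by row, scanning left to right and top to bottom."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            patch = [
                image[r][c]
                for r in range(top, top + size)
                for c in range(left, left + size)
            ]
            patches.append(patch)
    return patches

# Toy 4 x 4 "image"; a real input would be a grey-level face image.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches_2 = scan_windows(image, size=2)  # (4 - 2 + 1)^2 = 9 patches of length 4
patches_3 = scan_windows(image, size=3)  # (4 - 3 + 1)^2 = 4 patches of length 9
```

With stride 1, an H x W image scanned by a w x w window yields (H - w + 1)(W - w + 1) patches, so smaller windows produce more, shorter feature vectors.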

The second phase is the main deep forest structure: a cascade in the form of a progressive nested structure of different forests. The model implements four different classifiers (two Random Forest, one ExtraTree and one Logistic Regression classifier), and each of the forests contains 500 trees. The forests differ in how they select the feature used for the split at each node of the tree.

The learning principle of the deep forest is layer-to-layer connectivity: a layer communicates with its immediate predecessor by taking as input the forest output of the preceding layer. The efficiency of the cascade structure lies in its ability to concatenate the original input with the features inherited at each layer. The motive is to update each layer with the original pattern and to help the layer achieve reliable predictions. The concatenation of the original input thus enhances the generalization of the structure. Each layer is an ensemble of forests, and the connection from one layer to the next is made through the output of the forests. Forest processing starts with bagging (bootstrap aggregation). Given N data samples, n subsets of R samples each are drawn at random with replacement; each subset is used to train one tree, so the aggregate forest contains n trees. The growth of each tree starts from the root with the whole dataset; each node containing associated samples is then split into two with reference to a randomly selected feature. The two resulting subsets are distributed to the two child nodes, and the splitting continues until the leaf node contains a pure sample of one class or a predefined stopping condition is satisfied.
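The layer-to-layer connectivity can be sketched in Python as below. The forests here are hypothetical stand-ins (`make_toy_forest`) that simply map a feature vector to a two-class probability vector; they are not trained models, and the names and toy values are assumptions for illustration.

```python
def layer_output(forests, features):
    """Concatenate the class vectors returned by every forest in one layer."""
    output = []
    for forest in forests:
        output.extend(forest(features))
    return output

def cascade_forward(layers, original):
    """Pass one instance through the cascade.

    Every layer after the first receives the original feature vector
    concatenated with the class vectors of the preceding layer. Returns the
    last layer's class vectors and the input size seen by each layer.
    """
    features = list(original)
    input_sizes = []
    class_vectors = []
    for forests in layers:
        input_sizes.append(len(features))
        class_vectors = layer_output(forests, features)
        features = list(original) + class_vectors
    return class_vectors, input_sizes

def make_toy_forest(shift):
    """Hypothetical stand-in for a trained forest (two classes)."""
    def forest(features):
        p = (sum(features) / len(features) + shift) % 1.0
        return [p, 1.0 - p]
    return forest

# Three layers, each with two toy forests; the instance has three features.
layers = [[make_toy_forest(0.1), make_toy_forest(0.3)] for _ in range(3)]
class_vectors, input_sizes = cascade_forward(layers, original=[0.2, 0.4, 0.6])
```

In this toy setting the first layer sees 3 features, and every later layer sees 3 + 2 x 2 = 7 features, because the original input is re-attached at each layer rather than being replaced by the forest outputs.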

Fig. 1. Deep forest architecture

For each instance, an estimate of the class distribution is computed and then averaged across all trees of each forest. This average becomes the class vector, which is concatenated with the original feature vector and sent to the next layer of the cascade as input. Each forest therefore produces one class vector per instance, and the number of augmented features equals the number of classes multiplied by the number of forests in the layer. In order to control overfitting, K-fold cross-validation is used to generate the class vector for each forest. After each new layer is added, the performance of the whole cascade is evaluated. When there is no significant improvement in performance, training is halted. This accounts for the control that Deep Forest has over its architecture.
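The stopping rule that gives the cascade control over its own depth can be sketched as follows. The callback `evaluate_layer`, the threshold `min_gain` and the accuracy values are assumptions for illustration; the paper does not state what counts as a significant improvement.

```python
def grow_cascade(evaluate_layer, min_gain=1e-3, max_layers=20):
    """Add layers until validation accuracy stops improving by min_gain.

    evaluate_layer(k) is a hypothetical callback that trains layer k and
    returns the accuracy of the cascade with k layers.
    """
    best_acc, best_depth = float("-inf"), 0
    for k in range(1, max_layers + 1):
        acc = evaluate_layer(k)
        if acc - best_acc < min_gain:
            break  # no significant improvement: halt training
        best_acc, best_depth = acc, k
    return best_depth, best_acc

# Made-up accuracies for successive layers, for illustration only.
accuracies = [0.70, 0.82, 0.90, 0.93, 0.93, 0.95]
depth, acc = grow_cascade(lambda k: accuracies[k - 1])
```

Here growth stops at the fifth layer, which brings no gain, so the cascade keeps four layers. A more tolerant variant could wait for several non-improving layers before halting.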

3.3 Mathematical Illustration of the Framework

Data description. Let \(\chi = \mathbb{R}^m\) represent the input space, and let \(Y = \{y_1,\ldots ,y_c\}\) be the output space. Every sample \(x_i\in \chi \) has a corresponding \(y_i\in Y\), and the training sample \(\varDelta _m\) is:

$$\begin{aligned} \varDelta _m = \{(x_1,y_1),\ldots ,(x_m,y_m)\} \end{aligned}$$

At each layer there are forests, and each forest contains learning algorithms that can be regarded as functions whose output is the image of the input data. Each forest in the first layer, \(L_1\), contains a set of learning functions, say \(\alpha ^{l1}\), with general behaviour \(\alpha ^{l1} :\chi _i \rightarrow \chi _i^{l1}\), where \(\chi _i\) is the input data to layer 1 and \(\chi _i^{l1}\) is the image of \(\chi _i\). All functions in layer 1 are represented as:

$$\begin{aligned} \alpha ^{l1}= \{\alpha _1^{l1},\ldots ,\alpha _n^{l1}\} \end{aligned}$$
$$\begin{aligned} \chi _i^{l1}=\{\alpha _1^{l1}(\chi _i),\ldots ,\alpha _n^{l1}(\chi _i)\} \end{aligned}$$

this implies that a new dataset is obtained at layer 1, which means:

$$\begin{aligned} \varDelta _m^{l1}=\{(\chi _1^{l1},y_1),\ldots ,(\chi _m^{l1},y_m)\} \end{aligned}$$

The process continues as long as there is a significant improvement in the performance of the model at every successive layer. At every layer k where there is an appreciable improvement in performance, recall that the input to layer k is \(\chi _i^{l(k-1)}\):

$$\begin{aligned} \chi _i^{lk} = \chi _1^{lk}\times \chi _2^{lk}\times \cdots \times \chi _n^{lk} \end{aligned}$$
$$\begin{aligned} \chi _i^{lk} = \{\alpha _1^{lk}(\chi _i^{l(k-1)}),\ldots ,\alpha _n^{lk}(\chi _i^{l(k-1)})\} \end{aligned}$$

the output of layer k is:

$$\begin{aligned} \varDelta _m^{lk} = \{(\chi _1^{lk},y_1),\ldots ,(\chi _m^{lk},y_m)\} \end{aligned}$$

The cascade stops growing at layer n, where there is no significant increase in the performance of the model. At layer n, \(\chi _i^{l(n-1)}\) is expected to converge closely to \(y_i\). Note that the output of each layer is the average of the probability distributions for instances in the leaf nodes of the trees of each forest. Let \(P = \{p_1,\ldots ,p_t\}\) be the class probability vector at a leaf node of a tree. For each input sample \(\chi _i^{l(n-1)}\), the probability vector of the leaf node is given as:

$$\begin{aligned} P_i^{ln}(\chi _i^{l(n-1)}) = (P_1^{ln}(\chi _i^{l(n-1)}),\ldots ,P_t^{ln}(\chi _i^{l(n-1)})) \end{aligned}$$

then the output of forest \(\beta \) in layer \(l^n\) is the average of the probability vectors of all trees in the forest, as given in (1):

$$\begin{aligned} \beta (\chi _i^{l(n-1)}) = \frac{1}{J}\sum _{j=1}^{J} P_j(\chi _i^{l(n-1)}) \end{aligned}$$
(1)

where J is the number of trees in the forest, \(P_j\) is the probability vector returned by the j-th tree, and t is the number of class probabilities estimated at the leaf node.
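As a small numerical illustration of (1), the forest output is the element-wise mean of the per-tree probability vectors. The function name `forest_output` and the four three-class vectors below are made up for the example.

```python
def forest_output(tree_probs):
    """Average the J per-tree class-probability vectors element-wise."""
    J = len(tree_probs)      # number of trees in the forest
    t = len(tree_probs[0])   # number of class probabilities per tree
    return [sum(p[c] for p in tree_probs) / J for c in range(t)]

# Four hypothetical trees, three classes.
tree_probs = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
]
beta = forest_output(tree_probs)
```

For these values the averaged class vector is [0.5, 0.25, 0.25], which still sums to one because each per-tree vector does.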

4 Database

In this section we briefly introduce the two databases (BU-3DFE and CK+) used in the experiments. Figures 2 and 3 show sample expression images from BU-3DFE and CK+, respectively.

Fig. 2. Selected expression image samples from the BU-3DFE dataset. From left: Angry, Disgust, Fear, Happy, Sad and Surprise

4.1 Binghamton University 3D Facial Expression (BU-3DFE)

This database was introduced at Binghamton University by [17]. It contains 100 subjects with 2,500 facial expression models. Of the subjects, 56 were female and 44 were male; their ages range from 18 to 70 years, and they come from a variety of ethnic/racial ancestries, including White, Black, East-Asian, Middle-East Asian, Indian and Hispanic Latino. A 3D face scanner was used to capture seven expressions from each subject, and four intensity levels were captured for each of the six basic prototypical expressions. Associated with each expression shape model is a corresponding facial texture image captured from two views (about \(+45^\circ \) and \(-45^\circ \)). As a result, the database consists of 2,500 two-view texture images and 2,500 geometric shape models.

Fig. 3. Selected expression images for each of the emotion states from the CK+ dataset. From left: Angry, Disgust, Fear, Happy, Sad and Surprise

4.2 Cohn Kanade and Cohn Kanade Extension (CK and CK+) Database

[11] released a facial expression database in 2000. The database contains 97 subjects between the ages of 18 and 30; 65% were female and the remaining 35% were male. The subjects were drawn from people of different cultures and races. There were 486 sequences collected from the subjects, and each sequence starts from a neutral expression and ends at the peak of the expression. The peak of each expression was fully FACS coded and emotion labeled, but the labels were not validated. [13] itemized three challenges with the CK database: the emotion labels were not valid, because they did not depict what was actually performed; there was no common performance metric for evaluating algorithms; and there was no standard protocol for common databases. Having identified these challenges, [13] proposed an extension termed the extended Cohn Kanade (CK+) database. In CK+ the number of subjects was increased by 27% and the number of sequences by 22%. There were also slight changes in the metadata: the age of the subjects ranged between 18 and 50, 31% were male and 69% were female. The emotion labels were revised and validated using the FACS Investigator Guide as a reference and were confirmed by expert researchers. Leave-one-subject-out cross-validation and the area under the Receiver Operating Characteristic (ROC) curve were proposed as the evaluation protocol and metric for algorithms.

5 Experiment

The experiment was conducted on two datasets: the extended Cohn Kanade (CK+) and the Binghamton University 3D Facial Expression (BU-3DFE) datasets. We used only the 2D peak images for the six basic emotion states (Anger, Disgust, Fear, Happy, Sad, Surprise) from each dataset. The total number of expression images used from BU-3DFE is 600 (100 images per emotion, 56 female and 44 male). In the CK+ dataset, the total number of images extracted was 309, with the number of images per emotion varying (AN = 45, DI = 59, FE = 25, HA = 69, SA = 28, SU = 83). We split each of the extracted datasets into two: a training set (80%) and a validation set (20%). The training set was used to train the forests and the validation set was used for performance evaluation. The model depth (the number of layers) is determined automatically; each layer consists of three different pairs of forests, and each forest contains 500 trees.
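The 80/20 split can be sketched as below. The paper does not state whether the split was random or stratified by emotion; this sketch assumes a plain seeded random shuffle, and the function name and seed are hypothetical.

```python
import random

# Per-emotion image counts reported for CK+.
CK_PLUS_COUNTS = {"AN": 45, "DI": 59, "FE": 25, "HA": 69, "SA": 28, "SU": 83}

def train_validation_split(n_samples, train_fraction=0.8, seed=0):
    """Shuffle sample indices and cut them into training and validation parts.
    Assumption: a simple random split, not stratified by class."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(n_samples * train_fraction)
    return indices[:cut], indices[cut:]

n_ck = sum(CK_PLUS_COUNTS.values())            # total CK+ images
ck_train, ck_val = train_validation_split(n_ck)
bu_train, bu_val = train_validation_split(600)  # BU-3DFE images
```

Under this assumption the CK+ data divide into 247 training and 62 validation images, and the BU-3DFE data into 480 and 120. With classes as unbalanced as those of CK+, a stratified split would keep the per-emotion proportions equal in both parts.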

Before the images were fed in as input, preprocessing techniques such as face detection, face alignment and histogram equalization were applied to the data in order to minimise data redundancy and intensity variation that might degrade the performance of the system. As stated earlier, we split the input into training data and validation data. When growing the forests on the training set, we used 5-fold cross-validation to minimize the chance of overfitting.
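One common way to use K-fold cross-validation when growing a cascade is to generate each training sample's class vector from a model that never saw that sample. The sketch below assumes that scheme; the folds are contiguous and unshuffled for simplicity, and the learner (`fit_prior`, `predict_prior`) is a hypothetical stand-in that only remembers class frequencies.

```python
def kfold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def out_of_fold_class_vectors(X, y, k, fit, predict_proba):
    """For each sample, return the class vector predicted by a model that
    was trained on the other k - 1 folds only."""
    vectors = [None] * len(X)
    for held_out in kfold_indices(len(X), k):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in held_out:
            vectors[i] = predict_proba(model, X[i])
    return vectors

def fit_prior(X, y):
    """Hypothetical stand-in learner: class frequencies for labels 0 and 1."""
    return [y.count(c) / len(y) for c in (0, 1)]

def predict_prior(model, x):
    """Return the stored class frequencies regardless of the input."""
    return list(model)

X = list(range(10))
y = [0] * 5 + [1] * 5
vectors = out_of_fold_class_vectors(X, y, 5, fit_prior, predict_prior)
```

Because each class vector comes from held-out predictions, the features passed to the next layer are not fitted to the sample's own label, which is what limits overfitting.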

We tested the trained model on the validation set by passing each validation instance, as a representative feature, to the cascade forest classification process. The cascade forest returns probability predictions from each forest in its last layer. The mean of these predictions is computed, and the class with the maximum value is the predicted class. For performance evaluation we use accuracy as our metric and also employ the confusion matrix for a more detailed analysis of the results.
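The final decision rule can be written in a few lines of Python. The function name and the four probability vectors are made up for illustration; only the rule itself (mean over the last layer's forests, then the class with the maximum value) comes from the description above.

```python
EMOTIONS = ["Anger", "Disgust", "Fear", "Happy", "Sad", "Surprise"]

def predict_emotion(last_layer_vectors, labels=EMOTIONS):
    """Average the forests' probability vectors and pick the top class."""
    n = len(last_layer_vectors)
    mean = [sum(v[c] for v in last_layer_vectors) / n for c in range(len(labels))]
    best = max(range(len(labels)), key=mean.__getitem__)
    return labels[best], mean

# Hypothetical outputs of four forests in the last layer for one image.
vectors = [
    [0.1, 0.1, 0.1, 0.5, 0.1, 0.1],
    [0.0, 0.2, 0.1, 0.4, 0.1, 0.2],
    [0.1, 0.0, 0.1, 0.6, 0.1, 0.1],
    [0.2, 0.1, 0.1, 0.3, 0.1, 0.2],
]
label, mean = predict_emotion(vectors)
```

For these made-up vectors the averaged probability is largest for "Happy", so that label is returned.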

Furthermore, we investigated the effect of the number of classifiers on the behaviour of the Deep Forest model. Initially, on both datasets (CK+ and BU-3DFE), we used 4 forest classifiers. On the CK+ dataset this gave an average accuracy of 93.22%, with only 5 layers added and 7 estimators in each layer. When the number of each classifier was doubled, the accuracy remained the same, but ten layers were added, with 7 estimators in each layer. The BU-3DFE dataset behaved differently: the initial 4 classifiers gave an accuracy of 57.98%, with 8 layers added and 8 estimators in each layer. When the number of each classifier was doubled, the accuracy increased by almost 10%, and 10 layers were added with 8 estimators in each layer. A summary of the investigation is provided in Table 1.

Table 1. Summary of the investigation conducted on the Deep Forest model with increase in number of classifiers
Table 2. The result comparison of FERAtt (Facial Expression Recognition with Attention Net) with Deep Forest learning
Fig. 4. Confusion matrix of Deep Forest predictions on the BU-3DFE dataset

Fig. 5. Graph of the recognition rate against the number of predictions on the BU-3DFE test data

Fig. 6. Confusion matrix of Deep Forest predictions on the CK+ dataset

Fig. 7. Graph of the recognition rate against the number of predictions on the CK+ test data

5.1 Result

Figures 4 and 6 show the confusion matrices of the model's predictions on BU-3DFE and CK+, respectively. Figures 5 and 7 show the graphs of the average recognition rate on the test data of BU-3DFE and CK+.

In Fig. 4, the model's prediction accuracy is highest for surprise at 95%, followed by happy at 90% and disgust at 55%; sad and fear both have 50% prediction accuracy, and angry has the lowest at 40%. Figure 6 shows that the model gives 100% prediction accuracy for angry, disgust, fear and happy instances, 94% for surprise and 40% for sad.
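Per-emotion rates of this kind are the row-normalised diagonal of a confusion matrix, while overall accuracy is the diagonal sum divided by the number of samples. The sketch below uses made-up labels and predictions for three emotions, not the experimental data; the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def per_class_rate(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]

def accuracy(matrix):
    """Correct predictions divided by all predictions."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total

labels = ["Happy", "Sad", "Surprise"]
y_true = ["Happy"] * 4 + ["Sad"] * 2 + ["Surprise"] * 4
y_pred = ["Happy", "Happy", "Happy", "Surprise",
          "Sad", "Happy",
          "Surprise", "Surprise", "Surprise", "Surprise"]
matrix = confusion_matrix(y_true, y_pred, labels)
rates = per_class_rate(matrix)
overall = accuracy(matrix)
```

In this toy case the per-class rates are 0.75, 0.5 and 1.0 while overall accuracy is 0.8, which shows how a class with few samples can have a low rate without lowering overall accuracy much.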

We assess the performance of Deep Forest on facial expression classification by comparing it with a state-of-the-art DNN method, FERAtt [10]. Table 2 presents both our results and the FERAtt results. Deep Forest gives better accuracy (93.22%) than FERAtt (86.67%) on the CK+ dataset, while FERAtt's accuracy on the BU-3DFE dataset (75.22%) is higher than that of Deep Forest (65.53%). It should be noted, however, that FERAtt could not use a small dataset: its authors reported that the data were augmented and also combined with COCO data. FERAtt also requires a high-performance computing device such as a GPU to run in reasonable time, unlike Deep Forest, which performed its layer-by-layer learning on the available computing device (Intel(R) Core(TM) i7-4770S CPU @ 3.10 GHz, 8 GB RAM) in reasonable time.

The result of the experiment complements the claim of [21]: it shows that Deep Forest has an inherent capability for small datasets. The average prediction accuracy of the model is 93.22% on CK+ (309 images) and 65.53% on BU-3DFE (600 images). Although Deep Forest is challenged by memory consumption, it could be an alternative to DNN if its features are explored further.

6 Conclusion

We have presented a deep learning approach other than the popularly known DNN for facial expression recognition. Our work shows that Deep Forest can perform very well even in a wild environment and with a sparsely distributed and unbalanced dataset. The outcome of the further investigation conducted in the experiment is evidence of the dynamic control that Deep Forest has over its model. The result of this work offers insight into possibilities for enhancing the Deep Forest model, which is the focus of future work.