1 Introduction

Facial recognition is an interesting [9] and increasingly prevalent problem [18] because of its many real-world application areas, ranging from entertainment [31] and security control [1] to cosmetology [15] and biometrics [6, 14]. Age and gender classification of faces, in particular, has rapidly gained popularity [12]; the two attributes play a significant role in our social lives, since we rely on them in our daily interactions [20].

Age and gender classification tasks have been approached with many methods, many of which cannot solve the two problems accurately. Most of the popular approaches are handcrafted: they manually engineer features from the face to capture the discriminative information needed for the estimation task [16, 19, 25, 30]. The machine learning methods studied by many researchers for age and gender classification were efficient only on face images captured under controlled conditions; few of those methods were designed to handle the many challenges of unconstrained real-life imaging conditions, and those few achieved unsatisfactory results [4, 7].

Recently, Convolutional Neural Networks (CNNs) have proven to be the most suitable method for facial recognition, especially for age and gender classification. A CNN can classify the age and gender of face images by relying on its strong feature extraction capability [2, 5, 11, 21, 26, 29]. The availability of both large training datasets and high-end computing hardware has also helped the adoption of deep CNN methods for the classification task. Consequently, CNNs can classify unconstrained real-world age and gender tasks automatically, achieving significant performance gains over existing methods [17, 24, 27, 32]. We therefore present a CNN-based model (Fig. 1) for age group and gender classification of unfiltered real-life face images of individuals. Our main contributions are as follows:

  1. We propose a new CNN model for age and gender classification of unconstrained real-life faces, in which we treat the facial analysis task as a classification problem that considers each age group and gender as a class label.

  2. We design robust face detection and alignment algorithms that localize the face in the image, detect facial landmarks of unconstrained faces in real time, and transform the image into an output coordinate space.

  3. We also pre-train our model on a very large facial aging dataset containing unconstrained faces with age and gender labels, to learn the bias and particularities of the dataset and to avoid overfitting.

  4. Finally, we employ two popular benchmark datasets for training and validation. The experimental results on the OIU-Adience benchmark dataset for age and gender classification show that our novel CNN model achieves better performance than the state of the art on the same dataset and hence can satisfy the requirements of many real-world applications.

The remainder of this paper is arranged as follows: Sect. 2 briefly reviews the related works in age and gender classification; Sect. 3 describes our proposed approach; Sect. 4 presents the experiments and the experimental analysis on the OIU-Adience dataset of unconstrained faces with age and gender labels, and then discusses the achieved results; conclusions and future works are drawn in Sect. 5.

Fig. 1. The pipeline of our proposed model

2 Related Works

In the past years, several methods have been proposed to solve the age and gender classification problem. Some of those methods focus more on constrained images, while only a few study age and gender classification of unconstrained real-world faces. Recently, CNNs have received increasing attention in the computer vision community, especially for classifying the age and gender of face images from uncontrolled imaging environments [11]. To mention a few, Eidinger et al. [12] studied age and gender classification of face images acquired in challenging in-the-wild scenarios. They first collected face images of people labeled for age and gender from online image repositories. They then proposed a dropout-SVM approach for the estimation task, with a robust face alignment technique to prepare the in-the-wild images for better results. Their approach achieved a better result when compared to the state of the art. Levi and Hassner [20] investigated a five-layer CNN method to classify the age and gender of a person using faces collected from unconstrained settings. The model was trained and evaluated on the Adience benchmark for age and gender estimation; the results provide a remarkable baseline for CNN-based models and can improve with better system design. Subhani and Anto [29] proposed a five-layer CNN-based architecture for age and gender classification on Adience benchmark images using a direct convolutional neural network engineering approach. The model achieved a better result than the then state-of-the-art methods when evaluated on the Adience dataset. Zhang et al. [32] developed a novel CNN-based model for age group and gender classification of in-the-wild images, named “Residual Networks of Residual Networks (RoR)”. The RoR model was initially pretrained on the ImageNet dataset, then finetuned on the IMDb-WIKI dataset to learn the peculiarity of each dataset, before finally being finetuned on the Adience benchmark dataset. The experimental results set a new state of the art on the Adience dataset. In 2018, Duan et al. [10] proposed a novel hybrid model named CNN2ELM to predict the age and gender of face images. CNN2ELM includes three convolutional neural network (CNN) models and two extreme learning machine (ELM) structures. The models are pretrained on the ImageNet dataset before being finetuned on the IMDb-WIKI, MORPH-II, Adience benchmark, and LAP-2016 datasets. The three CNNs are used for feature extraction, while the two ELM structures classify the age group and gender.

Although most of the methods discussed above made considerable improvements in age and gender classification, and some are aimed at unconstrained imaging conditions, our novel CNN structure can still achieve a better result. It is suitable not only for constrained images but also for classifying the age and gender of unconstrained real-life facial images.

3 Proposed Approach

The approach for age group and gender classification of unconstrained real-life face images, as presented in Fig. 1, consists of the following main components:

3.1 Face Detection

The image preprocessing stage starts with face detection, which localizes the face in an input image before the key facial structures on the face region of interest are detected. To accomplish this task, we employ the dlib library, which uses a “pre-trained HOG + Linear SVM” detector. The detector, an improvement of [8] and [23], is an effective and reliable model for localizing the face in the image; it returns the (x, y)-coordinates of the bounding box of a face in an image.

3.2 Landmark Detection and Face Alignment

Given the face region from the face detection phase, we then apply a facial landmark method to detect the key facial structures in the face area of interest, including the mouth, right eyebrow, left eyebrow, right eye, left eye, nose, and jaw. The designed landmark detector algorithm detects facial landmarks of unconstrained faces in real time.

Before we pass our face images through our CNN model for training and evaluation, we also need to normalize and align them to obtain better accuracy. The goal is to warp and transform the images into an output coordinate space. Having obtained the (x, y)-coordinates of the eyes through landmark detection, we compute the angle between them and their midpoint. An affine transformation is then applied to warp the images into a new output coordinate space in which the faces are centered, equally scaled, and rotated so that the eyes lie along a horizontal line.
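The alignment step described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the 224-pixel output size, and the target eye position and spacing (`eye_y_frac`, `eye_dist_frac`) are assumptions chosen for the example, and the eye coordinates are assumed to come from a landmark detector.

```python
import math

def eye_alignment_matrix(left_eye, right_eye, out_size=224,
                         eye_y_frac=0.4, eye_dist_frac=0.3):
    """Build a 2x3 affine matrix that levels, scales and centers the eyes.

    left_eye / right_eye are (x, y) pixel coordinates of the eye that appears
    on the left / right side of the image. All keyword defaults are
    illustrative assumptions, not values taken from the paper.
    """
    (lx, ly), (rx, ry) = left_eye, right_eye
    dx, dy = rx - lx, ry - ly
    angle = math.atan2(dy, dx)                        # tilt of the eye line
    mid_x, mid_y = (lx + rx) / 2.0, (ly + ry) / 2.0   # midpoint between eyes
    scale = (eye_dist_frac * out_size) / math.hypot(dx, dy)

    # Rotation by -angle combined with uniform scaling.
    a = scale * math.cos(angle)
    b = scale * math.sin(angle)
    # Translation that sends the eye midpoint to its target output position.
    tx = 0.5 * out_size - (a * mid_x + b * mid_y)
    ty = eye_y_frac * out_size - (-b * mid_x + a * mid_y)
    return [[a, b, tx], [-b, a, ty]]

def apply_affine(matrix, point):
    """Map one (x, y) point through a 2x3 affine matrix."""
    x, y = point
    return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])

# Hypothetical eye positions on a tilted face.
m = eye_alignment_matrix((80.0, 120.0), (160.0, 100.0))
aligned_left = apply_affine(m, (80.0, 120.0))
aligned_right = apply_affine(m, (160.0, 100.0))
```

With these example inputs both eyes land on the same row (y of about 89.6), roughly 67.2 pixels apart and centered horizontally. The matrix uses the 2x3 layout accepted by common image-warping routines, so the same numbers could be used to warp the full image.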

Fig. 2. The pipeline of our proposed model

3.3 Architecture of Our CNN Model

In this section, we describe the design of our novel CNN structure, shown in Fig. 2. Our network architecture includes two stages: feature extraction and classification. The feature extraction stage contains convolutional layers, each followed by an activation layer (rectified linear unit (ReLU)), batch normalization (instead of the deprecated Local Response Normalization), a max-pooling layer, and dropout. The stage has four convolutional layers with their corresponding parameters, including the number of filters, the kernel size of each filter, and the stride. The first convolutional layer consists of 96 kernels of size \(7 \times 7 \) with a stride of \(4 \times 4\). The second, third and fourth convolutional layers follow the same structure as the first but with different numbers and sizes of filters. The second convolutional layer consists of 256 filters of size \(5 \times 5\); the third is nearly identical to the second but increases the number of filters to 384 and reduces the filter size to \(3 \times 3\). The fourth and last convolutional layer has 256 filters of size \(3 \times 3\). All the convolutional layers have a fixed dropout of 25% to improve generalization and reduce overfitting.

The classification stage contains two fully-connected layers that handle the age group and gender classification tasks. The first fully-connected layer contains 512 neurons, followed by a ReLU, batch normalization, and a dropout layer with a dropout ratio of 50%. The last fully-connected layer outputs 512 features, which are densely mapped to 8 neurons (age group) or 2 neurons (gender) for the classification tasks. A softmax with a cross-entropy loss function is adopted to obtain a probability for each class.
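To make the layer dimensions above concrete, the following sketch records the four convolutional layers and computes the output size and parameter count of the first one. It is only an illustration under stated assumptions: the text gives a stride only for the first layer, so the strides of layers two to four, the zero padding, and the three-channel (RGB) input are assumptions, and `ConvSpec`, `conv_output_size` and `conv_param_count` are hypothetical helper names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConvSpec:
    filters: int   # number of kernels in the layer
    kernel: int    # side length of each square kernel
    stride: int    # step between kernel applications

# Filter counts and kernel sizes follow the text; only the first stride is stated.
CONV_LAYERS = [
    ConvSpec(filters=96, kernel=7, stride=4),
    ConvSpec(filters=256, kernel=5, stride=1),   # stride is an assumption
    ConvSpec(filters=384, kernel=3, stride=1),   # stride is an assumption
    ConvSpec(filters=256, kernel=3, stride=1),   # stride is an assumption
]

def conv_output_size(size, kernel, stride, padding=0):
    """Spatial size of a convolution output along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_param_count(in_channels, spec):
    """Weights plus one bias per filter (batch-norm parameters not counted)."""
    return spec.filters * (spec.kernel * spec.kernel * in_channels + 1)

first = CONV_LAYERS[0]
first_map = conv_output_size(224, first.kernel, first.stride)   # 55 without padding
first_params = conv_param_count(3, first)                       # 14208 for RGB input
```

Under these assumptions a \(224 \times 224\) input yields a \(55 \times 55\) feature map after the first convolution, before any pooling is applied.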

Cross-Entropy: Cross-entropy loss measures the performance of a classification model whose output is a probability between 0 and 1. The loss decreases as the predicted probability converges to the correct label; the lower the cross-entropy, the better the predictions of the classification model match the labels.

In binary classification, where the number of classes N equals 2, it is defined as:

$$\begin{aligned} -{(z\log (p) + (1 - z)\log (1 - p))} \end{aligned}$$
(1)

but for multi-class classification with N > 2, we calculate a separate loss for each label of observation and then sum the outcome (see Eq. 2).

$$\begin{aligned} -\sum _{c=1}^Nz_{o,c}\log (p_{o,c}) \end{aligned}$$
(2)

where N is the number of classes, z is the binary indicator (1 if class label c is the actual classification for observation o, and 0 otherwise), log is the natural logarithm, and p is the predicted probability that observation o belongs to class c.
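As a small worked example of Eqs. 1 and 2, the sketch below computes a softmax over raw class scores and the corresponding cross-entropy values in plain Python. The function names and the sample numbers are illustrative assumptions, not values from our experiments.

```python
import math

def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(logits)                          # subtract max for stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def binary_cross_entropy(z, p):
    """Eq. 1: z is the true label (0 or 1), p the predicted probability of class 1."""
    return -(z * math.log(p) + (1 - z) * math.log(1 - p))

def categorical_cross_entropy(z_onehot, probs):
    """Eq. 2: sum over classes; terms with z = 0 contribute nothing."""
    return -sum(z * math.log(p) for z, p in zip(z_onehot, probs) if z)

# Hypothetical scores for three classes, with class 0 as the true label.
probs = softmax([2.0, 1.0, 0.1])
loss = categorical_cross_entropy([1, 0, 0], probs)

# A confident correct prediction gives a small loss, a confident wrong one a large loss.
good = binary_cross_entropy(1, 0.9)   # about 0.105
bad = binary_cross_entropy(1, 0.1)    # about 2.303
```

With N = 2 and a one-hot label, Eq. 2 reduces to Eq. 1, which the two functions reproduce for matching inputs.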

Table 1. Summary of the popular Facial Aging Databases

3.4 System Training

In this section, we present the training details of the two classifiers, for age group and gender, that predict the age group and the gender of unconstrained face images on the Adience dataset. The age classifier predicts one of eight age-group classes, while the gender classifier assigns one of two gender classes. We initially pre-train the two CNN-based classifiers on the very large IMDb-WIKI benchmark dataset, which contains unconstrained real-life faces with age and gender labels. This is important so that the two classifiers learn the bias from a large number of image samples, generalize to the test image samples, and reduce the risk of overfitting. We split the IMDb-WIKI dataset into two parts, 90% for training and 10% for validation, while 70% of the OIU-Adience images are used for training and the remaining 30% are split equally, 15% for validation and 15% for testing. The images in the datasets were first rescaled to \(256\times 256\) pixels, then cropped to \(224\times 224\) pixels before being passed into the network. We also train the network using a batch size of 64. The optimization of the proposed model for the classifiers is carried out using stochastic gradient descent (SGD) with mini-batches of size 256, a momentum value of 0.9, and a weight decay of 0.0005. The training starts with an initial learning rate of 0.0001, which is then decreased by a factor of 10 whenever there is no improvement in the accuracy. Training of the classifiers is terminated when the network begins to overfit on the validation set. To further improve our model performance, we employ data augmentation on both the training and testing images and also utilize dropout regularization. The SGD update is defined in Eq. 3:

$$\begin{aligned} \beta = \beta - \eta \cdot \nabla _\beta J( \beta ; x^{(i)}; y^{(i)}) \end{aligned}$$
(3)

where \( \eta \) is the learning rate and \( \nabla _\beta J \) is the gradient of the loss term with respect to the weight vector \( \beta \).
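The update in Eq. 3 and the learning-rate rule from this section can be sketched as follows. Note the assumptions: the text does not state how momentum and weight decay are combined with Eq. 3, so the sketch uses one common formulation (a velocity term with L2 weight decay added to the gradient), and the learning rate is reduced after a single epoch without improvement. The names `sgd_step` and `reduce_on_plateau` are hypothetical.

```python
def sgd_step(beta, grad, velocity, lr=1e-4, momentum=0.9, weight_decay=5e-4):
    """One SGD update on a list of weights.

    With momentum=0 and weight_decay=0 this reduces exactly to Eq. 3:
    beta <- beta - lr * grad.
    """
    new_velocity = [momentum * v - lr * (g + weight_decay * b)
                    for b, g, v in zip(beta, grad, velocity)]
    new_beta = [b + v for b, v in zip(beta, new_velocity)]
    return new_beta, new_velocity

def reduce_on_plateau(lr, accuracy_history, factor=10.0):
    """Divide lr by `factor` when the latest accuracy is no better than the best so far."""
    if len(accuracy_history) >= 2 and accuracy_history[-1] <= max(accuracy_history[:-1]):
        return lr / factor
    return lr

# Hypothetical two-weight model and gradient, using the hyperparameters from the text.
beta, velocity = [0.5, -0.3], [0.0, 0.0]
beta, velocity = sgd_step(beta, [0.2, -0.1], velocity)

# Validation accuracy stalled at 0.66, so the rate drops from 1e-4 to 1e-5.
lr = reduce_on_plateau(1e-4, [0.61, 0.66, 0.66])
```

In practice a deep learning framework's built-in optimizer and scheduler would perform these steps; the sketch only makes the arithmetic explicit.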

Fig. 3. Age group and gender distribution of face images in OIU-Adience dataset.

4 Experiments

In this section, we describe the specifications of the OIU-Adience and IMDb-WIKI benchmark databases that we employ, and the experimental analysis of our model on the OIU-Adience benchmark with age and gender labels.

4.1 Description of the Dataset

We employ two standard facial aging datasets to train and validate our approach. We initially train our model on IMDb-WIKI database [32] and then finetune it on the original OIU-Adience benchmark [12] of unconstrained facial images.

The OIU-Adience dataset [12] consists of about 26,000 face images from real-life, unconstrained environments. Hence, it reflects all the features expected of images collected from challenging uncontrolled scenarios, with a high degree of variation in noise, pose, and appearance, among others. It has eight different age categories (0–2, 4–6, 8–13, 15–20, 25–32, 38–43, 48–53, 60+) and two gender labels.

The IMDb-WIKI database [32] is the largest publicly available dataset for age estimation of people in the wild, containing more than half a million images with age labels between 0 and 100 years. The images were crawled from IMDb and Wikipedia; IMDb contributes 460,723 images of 20,284 celebrities and Wikipedia contributes 62,328 images. Because the images are obtained directly from the websites, the dataset contains many low-quality images, such as “human comic” images, sketch images, faces with severe facial masks, full-body images, multi-person images, and blank images. The specification of the datasets is highlighted in Table 1, while the detailed distribution of OIU-Adience images across the age and gender categories is presented in Fig. 3.

Table 2. Results in literature for Age group and Gender classification on OIU-Adience benchmark using classification accuracy.

4.2 Experimental Results and Discussion

A novel CNN model that classifies unconstrained face images by age group and gender has been proposed. Different empirical experiments have been carried out to evaluate the performance of the proposed approach in classifying a person to the correct age group and gender on the Adience dataset. The performance of the two classifiers is measured by two standard metrics common in the literature: confusion matrix and accuracy.

Confusion Matrix [22]. This evaluates the performance of the multi-class age group and binary gender classification models on sets of test images. The metric summarizes the performance of the classification algorithm in a table of predicted versus actual classes (four combinations in the binary case). We therefore present a confusion matrix for the eight-class (0–2, 4–6, 8–13, 15–20, 25–32, 38–43, 48–53, 60+) age grouping results and for the binary gender classification results. The metric reports the results of our proposed method on the OIU-Adience dataset for age group and gender classification.

Accuracy [28]. This calculates the closeness of the measured (predicted) value to the standard or known (ground truth) value. It is calculated as the percentage of face images that were classified into correct age-groups (or gender). It measures the proportion of true results (both true positives and true negatives) among the total number of face image samples tested (see Eq. 4).

$$\begin{aligned} \mathbf{Accuracy } = \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$
(4)

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives.
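The two metrics can be illustrated with a short sketch that builds a confusion matrix from label lists and derives the accuracy of Eq. 4 from it. The labels below are made up for illustration and are not results from our experiments; the function names are likewise hypothetical. For more than two classes, accuracy generalizes to the sum of the diagonal divided by the total number of samples.

```python
def confusion_matrix(actual, predicted, num_classes):
    """Rows index the actual class, columns the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

def accuracy(matrix):
    """Correct predictions (diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Made-up binary labels (0 and 1 stand for the two gender classes).
actual = [0, 0, 1, 1, 1, 0]
predicted = [0, 1, 1, 1, 0, 0]
cm = confusion_matrix(actual, predicted, num_classes=2)

# Treating class 1 as "positive" recovers the four terms of Eq. 4.
tn, fp = cm[0]
fn, tp = cm[1]
acc = (tp + tn) / (tp + tn + fp + fn)
```

Here `acc` equals `accuracy(cm)`, since the diagonal of the binary confusion matrix holds exactly TN and TP.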

It is important to note that the variation in the classification results for age and gender, as presented in Fig. 5(a) and (b) respectively, is attributed to the uneven distribution of samples across the age and gender annotations, and also to the peculiarity of each class.

From the confusion matrix in Fig. 5(a), we notice that the 8–13 and 0–2 age groups are estimated with the highest accuracy compared to the other age groups. In the case of the 0–2 age group, this could be attributed to the fact that face images of infants contain distinctive features that enable the classifier to distinguish this age group easily. For the 8–13 group, this might be a result of its size and the distinctive features of images in that category. The 48–53 age group was recorded with the lowest accuracy, which might be a result of its small size. The confusion matrix of the gender classification is presented in Fig. 5(b). From this figure, we see that our approach recognizes males more easily than females, achieving better accuracy for the male class.

In addition to the confusion matrix, we evaluate the best configuration of our method in terms of classification accuracy on the OIU-Adience benchmark dataset and compare our results with state-of-the-art methods. Table 2 compares the accuracy of the best configuration of our method with that of state-of-the-art techniques on the OIU-Adience dataset. For age group classification, our model achieves a classification accuracy of 84.8%, which improves over the best-reported state-of-the-art accuracy in Duan et al. [10] by 18.3%. We also evaluate our method for classifying a person to the correct gender on the same OIU-Adience dataset, where we train the model for two gender classes and report the classification accuracy with pre-training on the IMDb-WIKI dataset and fine-tuning on the original dataset. As presented in Fig. 4(b), we achieve an accuracy of 89.7%, compared to the previous state of the art of 88.2% reported in Duan et al. [10]. Our approach therefore achieves the best results not only on age group estimation but also on gender classification; it outperforms the current state-of-the-art methods. The graphs in Fig. 4(a) and (b) present the results of the two classifications on the OIU-Adience dataset.

As presented in Figs. 6, 7 and 8, our model can correctly predict the age group and gender of faces. However, there are a few cases where face images were incorrectly classified. This could be a result of the different degrees of variability found in unconstrained images, including low resolution, non-frontal poses, lighting conditions, and heavy makeup (see Fig. 9).

Fig. 4. Graphs of accuracy results for age group and gender classification.

Fig. 5. Graphs of confusion matrix results for age group and gender classification.

Fig. 6. Age group classification

Fig. 7. Male: gender classification

Fig. 8. Female: gender classification

Fig. 9. Faces with misclassification

5 Conclusions and Future Work

The proposed CNN-based classification model is designed for age group and gender classification of unconstrained real-life faces. The novel approach relies on the feature extraction ability and classification proficiency of the CNN architecture. The satisfactory performance of the classification model is attributed mainly to our new CNN architecture, which was initially pre-trained on the very large IMDb-WIKI dataset before being fine-tuned on the original dataset. Robust face detection and a good alignment technique also contributed greatly to the classification accuracy of the approach. An extensive evaluation of the newly designed model on the OIU-Adience benchmark for age and gender classification confirms the applicability of our method to unconstrained real-world face images. Exact age and gender classification of human faces will be an interesting research field to study in the future.