
1 Introduction

Video surveillance is widely used in the field of public security. Since face information has the advantages of unique identification and easy accessibility, face detection and tracking in video sequences has become one of the most important means of locating and tracking criminal suspects. The human head is a three-dimensional object, so one dimension of information is lost when it is projected onto a two-dimensional video image. When a 3D face is projected to 2D from different angles, different parts of the face are stretched or compressed, leading to large differences between 2D faces at different angles. Most existing face detection methods require the detected person to face the camera frontally. In many cases, however, due to the movement of the detected person and varying imaging angles, it is difficult to obtain a frontal image. Multi-angle face detection has therefore become a research hotspot.

Many methods have been proposed for multi-angle face detection. In some research, Haar-like features were extended in various ways and classifiers were constructed with the Adaboost algorithm to accomplish multi-angle face detection [1,2,3,4]. Methods based on different skin color models were proposed for coarse detection to improve efficiency [5, 6]. Guo proposed an Adaboost-SVM algorithm in which Haar-like and edge-orientation field features were fused, combined with an improved decision tree cascade structure, to carry out multi-angle face detection [7]. However, because the number of Haar-like features is very large, and grows even larger when adapting to different angles, training with Haar-like features is quite slow. A method based on the Multi-Block Local Binary Pattern (MB-LBP) feature was therefore proposed to detect faces [8]. Reference [9] proposed a method based on the MB-LBP feature and controlled cost-sensitive Adaboost (CCS-Adaboost). Reference [10] proposed an algorithm that cascades two SVM classifiers, trained on HOG and LBP features respectively, to implement face detection.

In this paper, a multi-angle face detection method based on multi-feature fusion is proposed. First, three single-feature Adaboost classifiers are constructed and trained on preprocessed training samples. Second, preprocessed testing samples are sent to the three classifiers for initial detection, yielding candidate face regions and their weights. Finally, the refined face regions are obtained by voting and weighted calculation. The performance of the method is evaluated in terms of accuracy and efficiency.

The rest of the paper is organized as follows. The method of multi-angle face detection is discussed in Sect. 2. Simulation experiment and analysis are presented in Sect. 3. Conclusions are given in Sect. 4.

2 Method of Multi-angle Face Detection

The multi-angle face detection method based on multi-feature fusion is divided into three parts: preprocessing, training, and detection, as shown in Fig. 1.

Fig. 1. Flowchart of multi-angle face detection

2.1 Preprocessing

Because the imaging conditions of the testing samples differ, illumination compensation and histogram equalization are applied to reduce their effects. Since testing samples contain not only face regions but also non-face skin regions that must be screened out, testing samples are also normalized to a size of 300 × 300.
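
For illustration, the preprocessing stage can be sketched in a few lines of Python with OpenCV. The paper does not specify which illumination compensation algorithm it uses, so a simple gray-world white balance stands in for it here; the function name `preprocess` and equalizing only the luminance channel are likewise illustrative choices, not the authors' implementation.

```python
import cv2
import numpy as np

def preprocess(bgr):
    """Illustrative preprocessing: gray-world illumination compensation,
    histogram equalization on luminance, and 300x300 normalization."""
    # Gray-world white balance (one common compensation; the paper does
    # not state which method it applies).
    b, g, r = cv2.split(bgr.astype(np.float32))
    k = (b.mean() + g.mean() + r.mean()) / 3.0
    balanced = cv2.merge([np.clip(c * k / max(c.mean(), 1e-6), 0, 255)
                          for c in (b, g, r)]).astype(np.uint8)
    # Equalize only the luminance channel, leaving chrominance untouched.
    ycrcb = cv2.cvtColor(balanced, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
    out = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
    # Normalize to the 300x300 size used for testing samples.
    return cv2.resize(out, (300, 300))
```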

The skin color feature has good clustering characteristics, and skin color information can be separated from the background. In the \( YC_bC_r \) color space, \( Y \) represents luminance, while \( C_b \) and \( C_r \) represent chrominance. The conversion from RGB to \( YC_bC_r \) is calculated as follows:

$$ \begin{bmatrix} Y \\ C_b \\ C_r \end{bmatrix} = \begin{bmatrix} 0.299 & 0.587 & 0.114 \\ -0.1687 & -0.3313 & 0.5 \\ 0.5 & -0.4187 & -0.0813 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} + \begin{bmatrix} 0 \\ 128 \\ 128 \end{bmatrix} $$
(1)

Skin color in the \( C_b \)-\( C_r \) color space is aggregated into an elliptic model, as in (2) and (3).

$$ \frac{(x - ec_x)^2}{a^2} + \frac{(y - ec_y)^2}{b^2} = 1 $$
(2)
$$ \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} C_b' - c_x \\ C_r' - c_y \end{bmatrix} $$
(3)

where \( c_x = 109.38 \), \( c_y = 152.02 \), \( \theta = 2.53 \), \( ec_x = 1.60 \), \( ec_y = 2.41 \), \( a = 25.39 \), and \( b = 14.03 \). A pixel that satisfies (2) and (3) is considered skin color and recorded as 1; otherwise it is recorded as 0, yielding a binary image. After morphological processing, the skin region is obtained.
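
A minimal sketch of this skin segmentation, assuming OpenCV's Python bindings: `cv2.cvtColor` implements the conversion of (1) (note that OpenCV orders the channels as Y, Cr, Cb), and the rotation and ellipse test follow (2) and (3) with the constants above. The 5 × 5 morphological kernel is an assumed choice.

```python
import cv2
import numpy as np

# Ellipse parameters of the Cb-Cr skin model given above.
CX, CY, THETA = 109.38, 152.02, 2.53
ECX, ECY, A, B = 1.60, 2.41, 25.39, 14.03

def skin_mask(bgr):
    """Binary skin mask from the elliptical Cb-Cr model, Eqs. (1)-(3)."""
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb).astype(np.float32)
    cr, cb = ycrcb[:, :, 1], ycrcb[:, :, 2]
    # Rotate the (Cb, Cr) plane by theta around the cluster center, Eq. (3).
    c, s = np.cos(THETA), np.sin(THETA)
    x = c * (cb - CX) + s * (cr - CY)
    y = -s * (cb - CX) + c * (cr - CY)
    # Inside-ellipse test, Eq. (2).
    inside = (x - ECX) ** 2 / A ** 2 + (y - ECY) ** 2 / B ** 2 <= 1.0
    mask = (inside * 255).astype(np.uint8)
    # Morphological opening and closing to clean up the binary image.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
```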

2.2 Feature Extraction

In this paper, Haar-like, MB-LBP, and HOG features are extracted, and multi-angle face detection classifiers are constructed by training on a large number of face and non-face samples.

Haar-like feature templates consist of simple combinations of rectangles, in which the white and black rectangles are identical in size [11]. The feature value of a template is defined as the difference between the sum of the pixels in the white rectangles and that in the black rectangles, which reflects local changes of gray level in the image [12]. There are three kinds of commonly used Haar-like features: edge features, linear features, and diagonal features, mainly representing horizontal, vertical, and diagonal information. Because the Haar-like feature is sensitive to edges and line segments, it is often used to distinguish face regions from non-face regions. Figure 2(a) and (b) show edge features in the horizontal and vertical directions, Fig. 2(c) and (d) show linear features in the horizontal and vertical directions, and Fig. 2(e) shows the diagonal feature [13].

Fig. 2. Haar-like feature

The traditional way to calculate a feature value is to sum the gray values of each pixel region directly, which is computationally expensive, so the integral image is used to simplify the calculation. For an original image I(x, y), the integral image ii(x, y) is the sum of the gray values in the black area of Fig. 2(f), and it can be defined as:

$$ ii(x,y) = \sum_{x' \le x,\, y' \le y} I(x',y') $$
(4)

where \( I(x', y') \) is the gray value at \( (x', y') \); \( ii(x, y) \) can be computed by (5) and (6),

$$ s(x,y) = s(x,y-1) + I(x,y) $$
(5)
$$ ii(x,y) = ii(x-1,y) + s(x,y) $$
(6)

where \( s(x,y) \) is the cumulative row sum of the original image, as in (7),

$$ s(x,y) = \sum_{y' \le y} I(x,y') $$
(7)

with the boundary conditions \( s(x,-1) = 0 \) and \( ii(-1,y) = 0 \).

As shown in Fig. 2(g), the sum of the gray values of area D can be calculated from \( ii_{1} \), \( ii_{2} \), \( ii_{3} \), and \( ii_{4} \): since \( ii_{1} = A \), \( ii_{2} = A + B \), \( ii_{3} = A + C \), and \( ii_{4} = A + B + C + D \), it follows that \( D = ii_{4} + ii_{1} - (ii_{2} + ii_{3} ) \).
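
The integral image and the four-lookup rectangle sum can be written compactly with NumPy, as in the following illustrative sketch; `cv2.integral` returns the same zero-padded array directly.

```python
import numpy as np

def integral_image(img):
    """Integral image ii(x, y) of Eq. (4): two orthogonal cumulative
    sums implement the one-pass recurrences of Eqs. (5) and (6)."""
    return img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x, y, w, h):
    """Sum of gray values in the w x h rectangle with top-left corner
    (x, y), computed as D = ii4 + ii1 - (ii2 + ii3), Fig. 2(g).
    Zero-padding encodes the boundary conditions s(x,-1) = ii(-1,y) = 0;
    arrays are indexed [row, column], i.e. [y, x]."""
    p = np.pad(ii, ((1, 0), (1, 0)))
    return p[y + h, x + w] + p[y, x] - p[y, x + w] - p[y + h, x]
```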

The LBP (Local Binary Pattern) feature is defined in a 3 × 3 neighborhood. As shown in Fig. 3(a), the gray value of the center pixel of the neighborhood serves as the threshold. Each of the 8 neighboring pixels is compared with this threshold: if its gray value is greater, it is recorded as 1, otherwise as 0, producing an 8-bit binary number (10010110 in the example). This binary number is then converted to a decimal number to obtain the LBP code (150) of the center pixel, which reflects the texture information of the region [14].

Fig. 3. The process of LBP and MB-LBP feature extraction

MB-LBP (Multi-Block Local Binary Pattern) is an improvement of the LBP feature. As shown in Fig. 3(b), the rectangular region is divided into image blocks, each block is divided into small areas, and the average gray value of the small areas is taken as the gray value of the image block. LBP coding over pixel gray values is thus converted into coding over image blocks. If the block grid is 3 × 3 and each small area is a single pixel, the MB-LBP feature is identical to the LBP feature, as shown in Fig. 4. Accordingly, when the image size is small, the gray value of each pixel corresponds to the average gray value of an image block in a larger image, which is equivalent to extracting the MB-LBP feature. For an image of size 24 × 24, with the upper left corner as the origin, 6 × 6-pixel image blocks, and 50% overlap between adjacent blocks, there are 49 blocks in total, giving a 49 × 256 = 12544-dimensional MB-LBP feature (a code sketch is given after Fig. 4).

Fig. 4. LBP and MB-LBP feature image
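
A sketch of MB-LBP extraction over a zero-padded integral image (as produced by `cv2.integral`, or by padding `integral_image` above). The strict-greater comparison matches the LBP example above; the clockwise bit order is an assumed convention, since the text does not fix one.

```python
def mb_lbp_code(ii_pad, x, y, s):
    """MB-LBP code for a 3x3 grid of s x s-pixel blocks whose top-left
    corner is (x, y). Each block sum costs four integral-image lookups;
    comparing block sums equals comparing block means, since the common
    1/s^2 factor cancels. With s = 1 this is exactly the 3x3 LBP code."""
    def block_sum(bx, by):
        return (ii_pad[by + s, bx + s] + ii_pad[by, bx]
                - ii_pad[by, bx + s] - ii_pad[by + s, bx])
    # Mean-proportional value of each of the 9 blocks.
    grid = [[block_sum(x + j * s, y + i * s) for j in range(3)]
            for i in range(3)]
    center = grid[1][1]
    # Neighboring blocks in clockwise order from the top-left; the bit
    # order is a convention that just has to be applied consistently.
    neighbors = [grid[0][0], grid[0][1], grid[0][2], grid[1][2],
                 grid[2][2], grid[2][1], grid[2][0], grid[1][0]]
    code = 0
    for bit, v in enumerate(neighbors):
        code |= int(v > center) << (7 - bit)
    return code
```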

The HOG (Histogram of Oriented Gradients) feature is obtained by calculating histograms of the oriented gradients of local image regions, and describes the edge or gradient information of a local region. To calculate it, the image is first grayed and gamma-normalized. Second, the gradient magnitude and direction of each pixel are computed with the \( [-1, 0, 1] \) and \( [1, 0, -1]^{T} \) gradient operators. Third, the image is divided into small cells and the range 0°–360° is divided into nine bins; the 9-dimensional HOG vector of each cell is obtained by accumulating weighted votes for gradient orientation over the pixels of the cell. Blocks are composed of several cells, and the HOG feature of a block is obtained by concatenating and normalizing the HOG vectors of its cells. Finally, the HOG features of all blocks are concatenated to obtain the HOG feature of the whole image. For an image of size 24 × 24, with the upper left corner as the origin, 3 × 3-pixel cells, 2 × 2-cell blocks, and 50% overlap between adjacent blocks, there are 49 blocks in total, giving a 49 × 4 × 9 = 1764-dimensional HOG feature.
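
The cell and block layout above can be checked against scikit-image's `hog` implementation; this is only a convenient reference, not the classifier used in the paper, and note that `skimage` bins unsigned orientations over 0°–180° by default rather than 0°–360°, which does not affect the dimensionality.

```python
import numpy as np
from skimage.feature import hog

# Stand-in 24x24 grayscale window; a real detector would slide this
# window over the image. 3x3-pixel cells give 8x8 cells; 2x2-cell blocks
# with a one-cell (50%) stride give 7x7 = 49 blocks of 36 values each.
patch = np.random.rand(24, 24)
feat = hog(patch, orientations=9, pixels_per_cell=(3, 3),
           cells_per_block=(2, 2), block_norm='L2-Hys')
print(feat.shape)  # (1764,)
```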

2.3 Classifier Building

Adaboost is an algorithm that linearly combines many classifiers into a much better one. First, the dataset is trained to obtain weak classifiers through voting. Then, a different weight is set for each weak classifier to achieve the global optimum. Finally, strong classifiers are combined according to the cascade structure [15]. As shown in Fig. 5, the Adaboost cascade classifier is a coarse-to-fine structure, where Y denotes a face area and N denotes a non-face area. The three features of the training samples are extracted to construct the Haar-like, MB-LBP, and HOG classifiers, respectively (an illustrative detection sketch is given after Fig. 5).

Fig. 5. Flowchart of Adaboost cascade classifier
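
Assuming each trained cascade has been saved as an XML file (cf. Sect. 3.1), the three classifiers can be run per image as sketched below. The file names are hypothetical, and using the stage confidences returned by `detectMultiScale3` as the rectangle weights \( w_i \) is an assumption, since the paper does not state how the weights are produced; note also that stock OpenCV dropped runtime support for HOG-type cascades after the 2.4.x series used in Sect. 3.

```python
import cv2
import numpy as np

# Hypothetical file names for the three trained cascades of Sect. 3.1.
CASCADES = [cv2.CascadeClassifier(f) for f in
            ('haar_face.xml', 'mblbp_face.xml', 'hog_face.xml')]

def detect_with_weights(gray):
    """Run all three cascades on a grayscale image; with
    outputRejectLevels=True, detectMultiScale3 also returns a confidence
    per rectangle, used here as the rectangle weight w_i."""
    results = []
    for clf in CASCADES:
        rects, _, weights = clf.detectMultiScale3(
            gray, scaleFactor=1.1, minNeighbors=3, outputRejectLevels=True)
        results.append(list(zip(rects, np.ravel(weights))))
    return results  # one list of (rect, weight) pairs per classifier
```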

If the classifier is based on a single feature, then for the same sample there will be situations in which feature A detects the face correctly while feature B does not, or feature A mistakenly regards an area as a face while feature B rules it out. Therefore, as shown in Fig. 6, this paper proposes a method that combines the three features to integrate these classifiers.

Fig. 6. Flowchart of classifier building

Samples are sent to the three classifiers to obtain suspected face areas and their weights. Voting is then performed: when an area is detected by at least two classifiers, weighting is applied, and when the resulting weight is greater than a threshold, the area is regarded as a face. Voting and integration are calculated as follows.

First, classify the rectangles output by the three classifiers. Record the location and weight \( w_i (i \ge 0) \) of each rectangle, and let each rectangle correspond to an index \( m_i \). If the overlap area of two rectangles is more than 0.5 of the smaller rectangle's area, the two rectangles are regarded as one category. Count the number of categories by traversing all rectangles, and let each category correspond to an index \( n_j (0 \le j \le i) \).

Second, calculate the fusion weight \( W_j \) of each category. After traversing all rectangles of a category, set the biggest weight as the initial weight \( W_j \) of the category and record the index of the corresponding rectangle. Then fuse the remaining rectangles \( m_k (k \le i) \) of the category with the rectangle of the biggest weight, and calculate the proportion \( v \) of the fused rectangle relative to the larger rectangle. The fusion weight is calculated as in (8).

$$ W_j = W_j + w_k \times v $$
(8)

Finally, perform voting, weighting, and threshold decision. If the number of rectangles in a category is greater than one, i.e. the area is detected by at least two classifiers, calculate the integrated weight as in (9),

$$ \mathit{Weight} = W_j^2 + \mathit{num}^2 $$
(9)

where \( Weight \) represents the total weight, \( W_j \) the fusion weight of the category, and \( num \) the number of rectangles in the category. The final detection area is output according to the threshold.
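
The grouping, fusion, and thresholding steps can be sketched as follows. The greedy one-pass grouping and the reading of \( v \) as the overlap area divided by the larger rectangle's area are interpretations of the description above; `detections` is the per-classifier output of `detect_with_weights` from the earlier sketch, and the final threshold is left as a parameter since its value is not reported.

```python
def area(r):
    return r[2] * r[3]

def inter_area(r1, r2):
    """Area of the intersection of two (x, y, w, h) rectangles."""
    x1, y1, w1, h1 = r1
    x2, y2, w2, h2 = r2
    iw = max(0, min(x1 + w1, x2 + w2) - max(x1, x2))
    ih = max(0, min(y1 + h1, y2 + h2) - max(y1, y2))
    return iw * ih

def fuse(detections, threshold):
    """Group rectangles from the three classifiers into categories and
    apply Eqs. (8) and (9)."""
    rects = [(tuple(map(int, r)), float(w))
             for per_clf in detections for r, w in per_clf]
    # Same category if the overlap exceeds 0.5 of the smaller rectangle.
    groups = []
    for r, w in rects:
        for g in groups:
            r0 = g[0][0]
            if inter_area(r, r0) > 0.5 * min(area(r), area(r0)):
                g.append((r, w))
                break
        else:
            groups.append([(r, w)])
    faces = []
    for g in groups:
        num = len(g)
        if num < 2:          # detected by only one classifier: discard
            continue
        g.sort(key=lambda rw: rw[1], reverse=True)
        best, W = g[0]       # biggest weight seeds the fusion weight W_j
        for r, w in g[1:]:
            # v: overlap relative to the larger rectangle (one reading
            # of the definition of v above).
            v = inter_area(r, best) / max(area(r), area(best))
            W += w * v       # Eq. (8)
        if W ** 2 + num ** 2 > threshold:   # Eq. (9)
            faces.append(best)
    return faces
```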

The experiment is carried out on the test set using the four kinds of classifiers, and performance is evaluated in terms of accuracy and real-time performance.

3 Simulation Experiment and Result Analysis

The simulation experiment is carried out on the VS2012 platform with OpenCV 2.4.8. Face training samples are taken from the CAS-PEAL database [16] and the MIT database; after histogram equalization and size normalization, 8000 of them constitute the face training set. Non-face training samples are taken from the CMU and MIT databases; after size normalization, 24000 of them constitute the non-face training set. Testing samples are taken from FDDB [17] and comprise 135 images containing 224 faces.

3.1 Simulation Experiment

The process of classifier training is divided into five parts, as shown in Fig. 7: initialization, loading samples, logical judgment, computation, and saving. First, initialize the maximum false rate per stage to 0.5 and the number of strong classifier stages to 30. Second, load face samples from the pos2424.vec file as positive samples and non-face samples listed in neg.txt as negative samples. Third, judge whether the false rate and the number of strong classifier stages have reached their initialized values; if either condition is satisfied, go to the fifth part; otherwise, calculate the feature values and return to the second part. Finally, save the strong classifier information as an XML file. After training, the numbers of strong classifier stages for the three features are 15, 16, and 15, respectively (an illustrative training command is given after Fig. 7).

Fig. 7. Flowchart of training
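
The training flow corresponds closely to OpenCV's stock `opencv_traincascade` tool, whose `LBP` feature type is in fact the MB-LBP variant of Sect. 2.2 and which still offered a `HOG` feature type in the 2.4.x series. The invocation below is an illustrative sketch: `pos2424.vec` and `neg.txt` are the sample files named above, while the sample counts, window size, and output paths are assumptions.

```python
import subprocess

# -maxFalseAlarmRate 0.5 and -numStages 30 mirror the initialization
# above; one run per feature type produces one cascade XML each.
for feat in ('HAAR', 'LBP', 'HOG'):
    subprocess.run(
        ['opencv_traincascade',
         '-data', 'cascade_' + feat.lower(),  # stage XMLs + cascade.xml
         '-vec', 'pos2424.vec', '-bg', 'neg.txt',
         '-numPos', '7000', '-numNeg', '24000',  # illustrative counts
         '-numStages', '30', '-maxFalseAlarmRate', '0.5',
         '-featureType', feat, '-w', '24', '-h', '24'],
        check=True)
```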

Real-time performance is evaluated by detection time, i.e. the time required to process a single image, obtained by processing the 135 images and taking the average. Accuracy is evaluated by the detection rate and the false rate, as shown in Table 1.

Table 1. Algorithm results

3.2 Result Analysis

From Table 1, in terms of detection time: multi-feature fusion < MB-LBP < Haar-like < HOG, so HOG requires the longest time. During classifier training, the average number of weak classifiers per strong-classifier stage is about 455 for HOG, compared with 35 for Haar-like and 15 for MB-LBP, which is why HOG's detection time is the longest. In terms of detection rate: HOG < multi-feature fusion < MB-LBP < Haar-like. Although the detection rate of multi-feature fusion after weighting and integration is lower than that of Haar-like and MB-LBP, its false rate is only 2.11%, filtering out most of the mistakenly detected non-face regions. Figure 8(a), (b), (c), and (d) show the detection results based on MB-LBP, Haar-like, HOG, and multi-feature fusion, respectively.

Fig. 8. Results of detection

From Fig. 8(a), after weighting and integration, the classifier based on multi-feature fusion can filter out non-face regions detected by a single feature, reducing the false rate; from Fig. 8(b), it can detect face regions that a single feature cannot, increasing the detection rate compared with HOG. As shown in Fig. 8(c), only one classifier detects the face region, so no integrated weight is calculated; as shown in Fig. 8(d), at least two single-feature classifiers detect the face but the integrated weight is below the threshold, leading to a decrease in the detection rate.

Face detection must both detect face regions correctly and exclude non-face regions, so a method needs not only to ensure high detection accuracy but also to keep the false rate as low as possible. In conclusion, the multi-feature fusion method is more suitable for multi-angle face detection.

4 Conclusion

Multi-angle face detection has grown into a hot topic in the field of face detection. This paper proposes a multi-angle face detection method based on multi-feature fusion. The method outperforms single-feature methods in terms of real-time performance and accuracy, because it combines the advantages of three features and, through voting and weighted calculation, yields refined face regions with high confidence. The simulation results (a detection rate of 88.39% and a false rate of 2.11%) show that the method is suitable for multi-angle face detection. Future work will focus on finding a more appropriate threshold that improves the detection rate while keeping the false rate low, and on applying the method to video sequences to accomplish multi-angle face detection in video.