Neurocomputing

Volume 230, 22 March 2017, Pages 184-196

Data augmentation for face recognition

https://doi.org/10.1016/j.neucom.2016.12.025

Highlights

  • We present five data augmentation methods specific to face images.

  • The landmark perturbation method automatically generates different kinds of transformed face images.

  • Different hairstyles and glasses can be automatically synthesized for a face image.

  • Face images with different poses and illuminations can be generated from a 3D face model.

Abstract

Recently, Deep Convolutional Neural Networks (DCNNs) have shown outstanding performance in face recognition. However, the supervised training of a DCNN requires a large number of labeled samples, which are expensive and time-consuming to collect. In this paper, we propose five data augmentation methods dedicated to face images, including landmark perturbation and four synthesis methods (hairstyles, glasses, poses, illuminations). The proposed methods effectively enlarge the training dataset, which alleviates the impact of misalignment, pose variation, illumination changes and partial occlusions, as well as overfitting during training. The performance of each data augmentation method is tested on the Multi-PIE database. Furthermore, comparisons among these methods are conducted on the LFW, YTF and IJB-A databases. Experimental results show that the proposed methods can greatly improve face recognition performance.

Introduction

Face recognition in unconstrained environments has become increasingly prevalent in many applications, such as identity verification, intelligent visual surveillance and automated immigration clearance systems. The classical pipeline of a modern face recognition system typically consists of face detection, face alignment, feature representation, and classification. Among them, feature representation is the most fundamental step, and a discriminative feature can substantially improve performance. Many approaches to face representation have been proposed. Hand-crafted features, such as LBP [1] and SIFT [2], were used in early work to extract an image's appearance features. Later, encoding-based features were developed to learn discriminative features from data; for example, the Fisher vector [3] uses unsupervised learning techniques to learn an encoding dictionary from training data. Recently, convolutional neural networks (CNNs) have provided a supervised or unsupervised framework for robust feature learning and have demonstrated state-of-the-art performance [4], [5].

Since LeNet-5 [6] was first proposed by LeCun et al., many CNN variants have been designed and have become prevalent in image classification [7], [8] and object detection [9]. They have also brought a revolution in face recognition, even surpassing human performance [10], [11], [5]. For example, DeepID3 [10], FaceNet [11] and BAIDU [5] have reached over 99% face verification accuracy on the widely used Labeled Faces in the Wild (LFW) database [12].

To achieve better performance, networks have become much deeper and wider [13]. Because such networks contain a large number of parameters, directly training a deep network from scratch requires a large amount of labeled face images, and training with limited data easily leads to overfitting: with a large network and limited training data, the test error keeps increasing after several epochs even though the training error continues to decrease [14]. To address this problem, a number of strategies have been proposed: fine-tuning models trained on other large public databases (e.g., ImageNet [15]), adopting various regularization methods (e.g., Dropout [14], Maxout [16], and DropConnect [17]), and collecting more training data [18], [4], [11]. At present, collecting more training data is the most direct way to improve performance, since a model trained on more data has stronger generalization ability. Many state-of-the-art methods are based on large-scale training datasets. For instance, DeepFace [4] was trained on 4 million photos of 4000 people, and FaceNet [11] was trained on 200 million photos of 8 million people.

By taking advantage of social networks on the Internet, a large number of images, including faces, objects and scenes, can easily be crawled by search engines. Access to large amounts of data meets the needs of deep learning, but annotating the data is tedious, laborious, and time-consuming work that may even require volunteers with specific expert knowledge. As a dataset grows, mistakes such as wrong labels, redundancy and duplication become inevitable. Needless to say, building a large-scale, correctly labeled database is too difficult and expensive for research groups, particularly in academia. Therefore, data augmentation methods have emerged to generate large amounts of training data using label-preserving transformations, such as flipping and cropping [7], [19], color casting [20], blur [21], etc. Experiments in [19] showed that flipping and cropping reduced the top-1 error rate by over 2% in ILSVRC-2013. Color casting, blur and contrast transformations equip the trained model with strong generalization ability to unseen but similar noise patterns in the training data [7], [20], [21].
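To make these generic transformations concrete, the following is a minimal numpy sketch of flipping, random cropping and per-channel color casting; it is an illustration rather than code from the cited works, and the crop size and noise scale are assumed values.

```python
import numpy as np

def flip_crop_color(image, crop_size=224, cast_std=10.0):
    """Common label-preserving transformations: random horizontal flip,
    random crop, and a per-channel color cast (illustrative parameters).
    `image` is an HxWx3 uint8 array larger than `crop_size`."""
    h, w, _ = image.shape
    if np.random.rand() < 0.5:                       # horizontal flip
        image = image[:, ::-1]
    top = np.random.randint(0, h - crop_size + 1)    # random crop offsets
    left = np.random.randint(0, w - crop_size + 1)
    crop = image[top:top + crop_size, left:left + crop_size].astype(np.float32)
    crop += np.random.normal(0.0, cast_std, size=3)  # shift each RGB channel
    return np.clip(crop, 0, 255).astype(np.uint8)
```

Applying such a function several times per image yields distinct but equally labeled training samples at negligible cost.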

However, the above-mentioned methods, although effective for improving neural-network-based image classification systems under different circumstances, are still not sufficient for face images. Face images have their own particularities, and the main challenges for face recognition include pose, illumination, occlusion, etc. Commonly used data augmentation methods, which only apply simple transformations, cannot handle these problems. Hence, face-specific data augmentation methods have been proposed. Jiang et al. [22] proposed an efficient 3D reconstruction method to generate face images with different poses, illuminations and expressions. Mohammadzade and Hatzinakos [23] proposed an expression-subspace projection method to synthesize new expression images for each person. Seyyedsalehi et al. [24] generated face images with different expressions using a nonlinear manifold separator neural network (NMSNN). Most previous methods are only suitable for constrained environments and generate a fixed set of face image types.

Pose, illumination and occlusion variations are common problems in face recognition; they influence not only face image pre-processing, such as face alignment, but also feature extraction. Meanwhile, the training datasets for face recognition are limited, and each person has only a few types of images. Even though DCNNs have a powerful representation ability, they still need different kinds of face images per subject to learn face variations, so the limited training data are far from sufficient for training a robust feature representation model and seriously decrease recognition accuracy in these situations. In this paper, we propose five data augmentation methods dedicated to these factors: landmark perturbation (LP), hairstyle synthesis (HS), glasses synthesis (GS), pose synthesis (PS) and illumination synthesis (IS). These methods aim to alleviate the impact of misalignment, pose variation, illumination changes and partial occlusions, and they can be widely applied in unconstrained environments. The LP method, which randomly perturbs landmark positions before face normalization, makes the feature extraction model robust to misalignment (e.g., translation, rotation, scaling and shear); a sketch of this idea is given below. HS and GS generate different hairstyles and glasses for a given face image, which enlarges the training set and makes the model robust to similar occlusions. Our 3D face reconstruction, in contrast to [22], is able to reconstruct a 3D face model from an image with a large pose; once the 3D face model is reconstructed, we use it to imitate different poses and illuminations, which makes the DCNN model robust to such variations. Each data augmentation method is verified on the Multi-PIE database, and the different methods are compared on the Labeled Faces in the Wild database (LFW) [12], the YouTube Faces database (YTF) [25] and the IARPA Janus Benchmark A database (IJB-A) [26]. Experimental results show that the proposed data augmentation methods can greatly improve face recognition performance.
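As an illustration only, here is a minimal Python sketch of landmark perturbation under stated assumptions: 5-point landmarks, OpenCV for the similarity transform, and template coordinates and a noise scale that are hypothetical rather than taken from the paper.

```python
import numpy as np
import cv2

# Hypothetical 5-point template (eye centers, nose tip, mouth corners)
# for a 96x112 crop; coordinates and sigma are illustrative choices.
TEMPLATE = np.float32([[30.3, 51.7], [65.5, 51.5], [48.0, 71.7],
                       [33.5, 92.4], [62.7, 92.2]])

def landmark_perturbation(image, landmarks, sigma=2.0, out_size=(96, 112)):
    """Jitter the detected landmarks before alignment so that the
    normalized crop exhibits small translation/rotation/scale/shear."""
    noisy = (np.float32(landmarks)
             + np.random.normal(0.0, sigma, (5, 2))).astype(np.float32)
    # Similarity transform mapping the perturbed landmarks to the template.
    M, _ = cv2.estimateAffinePartial2D(noisy, TEMPLATE)
    return cv2.warpAffine(image, M, out_size)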

The rest of this paper is organized as follows. Section 2 reviews related work. Our data augmentation approaches are introduced in Section 3, experimental results are presented in Section 4, and conclusions are drawn in Section 5.

Related work

At present, only a few face datasets are publicly available, e.g., the CASIA-WebFace dataset [27] with 10,575 subjects and 494,414 images, and the CACD dataset [28] with 2000 subjects and 163,446 images. Compared to the datasets used by Internet giants such as Google [11], whose dataset contains 200 million images of 8 million unique identities, the publicly accessible face datasets are relatively small and insufficient for training large DCNN models.

Thus, a number of data augmentation methods have been proposed.

Data augmentation

Because the training dataset is limited and each person has only a few types of images, the data are not sufficient to train a deep and robust DCNN. A reasonable way to enlarge the training dataset is data augmentation. As revealed in previous works [7], [20], [21], data augmentation equips the trained DCNN model with strong generalization ability to unseen but similar noise patterns in the training data. In this section, we introduce five data augmentation methods specific to face images.
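As one concrete example of how an occlusion-style synthesis such as GS can be imitated (the paper's actual pipeline may differ), the following hypothetical Python sketch alpha-blends an RGBA glasses template onto a face crop at a position that would be derived from the eye landmarks.

```python
import numpy as np

def overlay_glasses(face, glasses_rgba, top_left):
    """Paste a glasses template onto a face crop by alpha blending.
    `face` is an HxWx3 uint8 crop, `glasses_rgba` an RGBA template image,
    and `top_left` = (row, col) places its upper-left corner."""
    h, w = glasses_rgba.shape[:2]
    y, x = top_left
    roi = face[y:y + h, x:x + w].astype(np.float32)
    alpha = glasses_rgba[:, :, 3:4].astype(np.float32) / 255.0
    blended = alpha * glasses_rgba[:, :, :3] + (1.0 - alpha) * roi
    face[y:y + h, x:x + w] = blended.astype(np.uint8)
    return face
```

Repeating this with different templates produces several occluded variants of each training face.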

Experiment setup

GoogLeNet [8] is a typical DCNN architecture whose excellent performance was demonstrated in the ILSVRC-2014 contest. It has 27 layers and introduces the inception module to approximate an optimal local sparse structure. Features extracted from a trained GoogLeNet model are not only discriminative but also low-dimensional and sparse. Due to these merits, we adopt GoogLeNet as the default network for training the models for the different data augmentation methods throughout our experiments.
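For readers unfamiliar with the inception module, here is a minimal PyTorch sketch of its structure; this is an illustration under stated assumptions, not the network used in the paper, and the channel counts are constructor arguments.

```python
import torch
import torch.nn as nn

def conv_relu(c_in, c_out, k, pad=0):
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, padding=pad),
                         nn.ReLU(inplace=True))

class Inception(nn.Module):
    """GoogLeNet-style inception module: four parallel branches whose
    outputs are concatenated along the channel dimension."""
    def __init__(self, c_in, c1, c3r, c3, c5r, c5, cp):
        super().__init__()
        self.b1 = conv_relu(c_in, c1, 1)                  # 1x1
        self.b3 = nn.Sequential(conv_relu(c_in, c3r, 1),  # 1x1 -> 3x3
                                conv_relu(c3r, c3, 3, pad=1))
        self.b5 = nn.Sequential(conv_relu(c_in, c5r, 1),  # 1x1 -> 5x5
                                conv_relu(c5r, c5, 5, pad=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, 1, 1),    # pool -> 1x1
                                conv_relu(c_in, cp, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], 1)
```

For example, `Inception(192, 64, 96, 128, 16, 32, 32)` maps a 192-channel feature map to a 64+128+32+32 = 256-channel output at the same spatial resolution.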

The CASIA-WebFace dataset [27] was used as the training set in our experiments.

Conclusion

This paper presents five data augmentation methods for improving face recognition performance, which aim at increasing the effective size of the training set. Compared with previous data augmentation methods, our methods are dedicated to face images and are more efficient in various situations. Experimental results on the Multi-PIE database confirm the effectiveness of each method, and results on the popular LFW, YTF and IJB-A databases show that our methods can significantly improve the performance of face recognition.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (NSFC) (Grant no. 61472386), the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant XDA 06040103), and the Chongqing Research Program of Basic Research and Frontier Technology (No. cstc2016jcyjA0011). The authors would like to thank You-Ji Feng and Cheng Cheng for valuable discussions.

References

  • K. Simonyan, O.M. Parkhi, A. Vedaldi, A. Zisserman, Fisher vector faces in the wild, in: Proceedings of the British...
  • Y. Taigman, M. Yang, M. Ranzato, L. Wolf, Deepface: closing the gap to human-level performance in face verification,...
  • J. Liu, Y. Deng, C. Huang, Targeting Ultimate Accuracy: Face Recognition via Deep Embedding, arXiv preprint...
  • Y. LeCun et al., Backpropagation applied to handwritten zip code recognition, Neural Comput. (1989)
  • A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in:...
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with...
  • S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: towards real-time object detection with region proposal networks, in:...
  • Y. Sun, D. Liang, X. Wang, X. Tang, Deepid3: Face Recognition with Very Deep Neural Networks, arXiv preprint...
  • F. Schroff, D. Kalenichenko, J. Philbin, Facenet: a unified embedding for face recognition and clustering, in:...
  • G.B. Huang, M. Ramesh, T. Berg, E. Learned-Miller, Labeled faces in the wild: a database for studying face recognition...
  • K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference...
  • G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R.R. Salakhutdinov, Improving Neural Networks by Preventing...
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in:...
  • I.J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, Y. Bengio, Maxout networks, in: Proceedings of the...
  • L. Wan, M. Zeiler, S. Zhang, Y.L. Cun, R. Fergus, Regularization of neural networks using dropconnect, in: Proceedings...
  • Y. Sun, X. Wang, X. Tang, Deep learning face representation from predicting 10,000 classes, in: Proceedings of the IEEE...
  • A.G. Howard, Some Improvements on Deep Convolutional Neural Network Based Image Classification, arXiv preprint...
  • R. Wu, S. Yan, Y. Shan, Q. Dang, G. Sun, Deep Image: Scaling up Image Recognition, arXiv preprint...

Jiang-Jing Lv received the B.S. degree in information and computing science from University of Science and Technology of Hunan, Hunan, China, in 2012. He is currently pursuing a Ph.D. degree in pattern recognition at Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing, China. His research interests include face recognition and deep learning.

Xiao-Hu Shao received the B.E. degree in Telecommunication Engineering from China University of Geosciences in 2009 and the M.E. degree in Signal and Information Processing from University of Electronic Science and Technology of China in 2012. He was a research trainee at the Chongqing Institute of Green and Intelligent Technology (CIGIT), Chinese Academy of Sciences, from 2012 to 2015, and is pursuing a Ph.D. degree at CIGIT under the supervision of Professor Xi Zhou. His research interests include object detection, 3D face reconstruction and face recognition.

Jia-Shui Huang received the M.S. and Ph.D. degrees in Computer Science from Zhejiang University, Zhejiang, China, in 2006 and 2010, respectively. He is currently an associate professor at Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences. His research interests include computer vision and machine learning, with a focus on face recognition and deep learning.

Xiang-Dong Zhou is an associate professor at the Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences. He received the B.S. degree in Applied Mathematics and the M.S. degree in Management Science and Engineering from National University of Defense Technology, Changsha, China, and the Ph.D. degree in pattern recognition and artificial intelligence from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 1998, 2003 and 2009, respectively. He was a postdoctoral fellow at Tokyo University of Agriculture and Technology from March 2009 to March 2011. From May 2011 to October 2013, he was a research assistant and later an associate professor at the Institute of Software, Chinese Academy of Sciences. His research interests include machine learning and pattern recognition.

Xi Zhou received the B.S. and M.S. degrees in electronic science and technology from University of Science and Technology of China, Hefei, China, and the Ph.D. degree in electrical and computer engineering from University of Illinois at Urbana-Champaign, Champaign, IL, USA, in 2010. He is a Professor with the Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing, China, and the Founding Lead of its Intelligent Multimedia Research Center. He has authored or co-authored more than 40 technical papers with more than 600 Google Scholar citations. His research interests include pattern recognition, machine learning, computer vision and multimedia. Dr. Zhou received the Best Paper Award from the International Conference on Image Processing in 2007, the Best Student Paper Award from the International Conference on Pattern Recognition in 2008, and the Best Paper Award from ACM Multimedia in 2013.
