Neurocomputing, Volume 218, 19 December 2016, Pages 197-202

Masked face detection via a modified LeNet

https://doi.org/10.1016/j.neucom.2016.08.056

Abstract

Detecting masked faces in the wild is an emerging problem with rich applications ranging from violence video retrieval to video surveillance. Accurate detection remains an open problem, mainly due to low resolution, arbitrary viewing angles, and the difficulty of collecting a sufficient amount of training samples. These difficulties have significantly challenged the design of effective handcrafted features as well as robust detectors. In this paper, we tackle these problems by proposing a learning-based feature design and classifier training paradigm. More particularly, a modified LeNet, termed MLeNet, is presented, which modifies the number of units in the output layer of LeNet to suit the specific classification task. Meanwhile, MLeNet further increases the number of feature maps while using a smaller filter size. To further reduce overfitting and improve performance with a small quantity of training samples, we first double the training dataset by horizontal reflection and then learn MLeNet by combining pre-training and fine-tuning. We evaluate the proposed model on a real-world masked face detection dataset. Quantitative comparisons with several state-of-the-art and alternative solutions demonstrate the accuracy and robustness of the proposed model.

Introduction

Detecting video clips related to potential terrorists remains a fundamental demand in the management of massive-scale video corpora, and is highly beneficial to public security applications. A variety of definitions exist for identifying a person as a terrorist in a given video clip, among which one obvious cue is a masked face. As a specific task of face detection, the detection of masked faces poses significant difficulties and differs from traditional face detection (potentially with partial occlusions), which has been studied intensively for decades. On one hand, it encompasses challenges such as pose variation and lighting that have historically hampered traditional face detection paradigms. On the other hand, its severe occlusion significantly challenges existing face detection algorithms, since most of the face structure is missing.

Looking back at the literature, previous works in face detection mainly rely on handcrafted feature designs, such as the well-known Fisherface [1], Haar-like features with a cascade detector [2], and Gabor-like high-dimensional features with an AdaBoost detector [3]. One essential limitation lies in the need for a sufficient amount of training samples to achieve satisfactory detection accuracy. Recently, exemplar-based face detection [4] has been shown to be effective, because a large exemplar database is leveraged to cover all possible visual variations. However, it requires a large face database for detection and tends to produce false alarms in the presence of highly cluttered backgrounds. To reduce the number of required exemplars, the efficient boosted exemplar-based face detector [5] was proposed to further improve detection accuracy and make the detector faster and more memory efficient by discriminatively training and selectively assembling exemplars as weak detectors in a boosting framework. However, these methods fail when only a small face training dataset is available. Recently, deep learning architectures have been studied as well, which use CNNs with GPU-based computing to achieve breakthroughs in benchmark evaluations, such as Labeled Faces in the Wild (LFW) [6], [7], [8] and the Face Detection Data Set and Benchmark (FDDB) [9], [10].

In particular, a convolutional network can automatically learn effective feature representations of objects from training data [11], [12]. Most notably, AlexNet [13] showed ground-breaking performance on the ImageNet 2012 classification challenge. Since then, CNNs have led performance on ImageNet classification and object detection benchmarks, such as GoogLeNet [14] with about 6.8 M parameters, the ResNet-18 network [15] with about 11.6 M parameters, and VGG-19 [16] with about 144 M parameters. However, models with a large number of parameters overfit when trained on a small quantity of training data, especially our real-world masked face detection dataset with about 1000 training samples. To tackle the challenge induced by limited training data, Hinton and Salakhutdinov [17] introduced pre-training to generate a good initialization for large deep neural networks. In contrast, the LeNet introduced in [18] shows good performance in recognizing hand-written digit characters with relatively few parameters. However, the need for a large amount of training data still hinders its direct application in our scenario of masked face detection. In view of this issue, in this paper we introduce a modified LeNet, termed MLeNet, which modifies the number of units in the output layer of LeNet to suit a specific classification task with a small quantity of training samples. Meanwhile, MLeNet further increases the number of feature maps while using a smaller filter size, as shown in Table 1, which further improves classification performance with network overhead comparable to LeNet. Combining MLeNet with a sliding window, the detection of masked faces is done in a multi-scale fashion.
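The exact Table 1 configuration is not reproduced here, but the effect of filter size on feature-map dimensions can be sketched with the standard convolution and pooling output-size formulas. The 32×32 input and the kernel sizes below are illustrative assumptions, not the paper's actual settings:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution layer."""
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel, stride=None):
    """Spatial output size of a pooling layer (stride defaults to kernel)."""
    stride = stride or kernel
    return (size - kernel) // stride + 1

def trace(input_size, conv_kernels, pool_kernel=2):
    """Walk an input through alternating conv/pool layers, recording each size."""
    sizes = [input_size]
    s = input_size
    for k in conv_kernels:
        s = conv_out(s, k)
        sizes.append(s)
        s = pool_out(s, pool_kernel)
        sizes.append(s)
    return sizes

# LeNet-style 5x5 filters vs. a hypothetical smaller 3x3 choice on a 32x32 input
print(trace(32, [5, 5]))  # [32, 28, 14, 10, 5]
print(trace(32, [3, 3]))  # [32, 30, 15, 13, 6]
```

Smaller filters leave larger intermediate maps, which is one way to add feature maps without a large parameter increase.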

The parameters of MLeNet can be learned via stochastic gradient descent, and we combine pre-training and fine-tuning to prevent MLeNet from overfitting. Notably, pre-training is done by directly borrowing the model weights from LeNet, and fine-tuning adapts the network to a very limited number of training instances. In addition, we double the size of the training set via horizontal reflection. Well-known schemes such as the sliding window and non-maximum suppression [19] are also integrated into the proposed MLeNet-based detector. Quantitatively, experimental comparisons to a set of state-of-the-art (e.g., LeNet [18], RFD [4]) and classic (e.g., Haar-like features with AdaBoost [2]) detectors demonstrate that the proposed model achieves superior performance on detecting masked faces.
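The horizontal-reflection augmentation described above can be sketched in plain Python. The helper names `hflip` and `augment` are hypothetical, and images are represented as nested lists of pixel values for illustration:

```python
def hflip(image):
    """Horizontally reflect an image given as a list of pixel rows."""
    return [list(reversed(row)) for row in image]

def augment(samples):
    """Double a training set by adding the horizontal reflection of each image."""
    return samples + [hflip(img) for img in samples]

img = [[1, 2, 3],
       [4, 5, 6]]
data = augment([img])
print(len(data))   # 2
print(data[1][0])  # [3, 2, 1]
```

Reflection is label-preserving for masked-face windows, so it doubles the training set at no annotation cost.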

The rest of this paper is organized as follows. Section 2 describes the proposed detection model (MLeNet) for masked faces. Section 3 presents detailed quantitative evaluations with comparisons to a set of state-of-the-art methods. We conclude this paper in Section 4 and discuss our future work.


The Proposed Method

In this section, we introduce MLeNet for detecting the faces of possible terrorists. First, we introduce the structure and weight learning of MLeNet, which differs from LeNet. Second, we combine pre-training and fine-tuning with data augmentation to further improve the performance of MLeNet with a very limited number of training samples. Finally, masked faces are detected by combining the sliding window and non-maximum suppression.

Detecting masked faces

Based on the MLeNet described in Section 2.2, we build a detector to classify whether a given fixed-size window contains a masked face. But how are these candidate windows generated? Two frameworks are commonly used to generate such candidates. One follows R-CNN [20], in which selective search region proposals [21] are generated. The other follows DPM [22], in which candidate masked faces are generated by a sliding window. The first framework
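A minimal sketch of sliding-window candidate generation and greedy non-maximum suppression, assuming the second framework; the window size, stride, and IoU threshold below are illustrative values, not the paper's tuned settings:

```python
def sliding_windows(width, height, win=32, stride=16):
    """Enumerate fixed-size candidate windows as (x, y, w, h) boxes."""
    boxes = []
    for y in range(0, height - win + 1, stride):
        for x in range(0, width - win + 1, stride):
            boxes.append((x, y, win, win))
    return boxes

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thresh=0.5):
    """Keep the highest-scoring boxes; drop any box overlapping a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

cands = sliding_windows(64, 64)
print(len(cands))  # 9
print(nms([(0, 0, 32, 32), (8, 0, 32, 32), (100, 100, 32, 32)],
          [0.9, 0.8, 0.7]))  # [0, 2]
```

Multi-scale detection repeats this enumeration on rescaled copies of the image, classifying each window with MLeNet before suppression.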

Experiments

We verify the proposed work on a masked-man dataset cropped from violence videos. The dataset consists of 1140 images, including 240 positive and 900 negative ones. We randomly select 150 positive samples and 750 negative samples as the training set, 50 positive samples and 50 negative samples as the validation set, and the remaining 140 images as the test set. To reduce overfitting and the detector error rate, we double the number of training instances via horizontal reflection.
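The split protocol above can be sketched as follows; `split_dataset` is a hypothetical helper, and the fixed seed is only for reproducibility of the illustration:

```python
import random

def split_dataset(pos, neg, seed=0):
    """Randomly split samples following the stated protocol:
    150 pos / 750 neg train, 50 pos / 50 neg validation, the rest test."""
    rng = random.Random(seed)
    pos, neg = pos[:], neg[:]
    rng.shuffle(pos)
    rng.shuffle(neg)
    train = pos[:150] + neg[:750]
    val = pos[150:200] + neg[750:800]
    test = pos[200:] + neg[800:]
    return train, val, test

pos = [("pos", i) for i in range(240)]
neg = [("neg", i) for i in range(900)]
train, val, test = split_dataset(pos, neg)
print(len(train), len(val), len(test))  # 900 100 140
```

The 140 test images thus contain 40 positives and 100 negatives, matching the counts stated above.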

Conclusion

In this paper, we propose a novel model for localizing masked faces in images. Our model is based on a convolutional neural network (MLeNet) combined with a sliding window, and achieves satisfactory performance on detecting masked faces. In addition, to further reduce overfitting and improve performance with a small quantity of training samples, we first double the training dataset by horizontal reflection, and then learn MLeNet by combining pre-training and fine-tuning.

Acknowledgements

This work is supported by the National Key R&D Program (No. 2016YFB1001503), the Special Fund for Earthquake Research in the Public Interest (No. 201508025), the National Natural Science Foundation of China (No. 61422210, No. 61373076, No. 61402388, and No. 61572410), the CCF-Tencent Open Research Fund, the Open Projects Program of the National Laboratory of Pattern Recognition, and the Xiamen Science and Technology Project (No. 3502Z20153003).

Shaohui Lin received B.S. degree from Sanming University, Fujian, China, in 2011 and M.S. degree from Jimei University, Fujian, China, in 2014. He is currently pursuing the Ph.D. degree in Information and Computing Science at Xiamen University. His research interests include machine learning, and computer vision.

References (24)

  • P.N. Belhumeur et al.

    Eigenfaces vs. Fisherfaces: recognition using class specific linear projection

    IEEE Trans. Pattern Anal. Mach. Intell.

    (1997)
  • P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the IEEE...
  • C. Liu et al.

    Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition

    IEEE Trans. Image Process.

    (2002)
  • X. Shen, Z. Lin, J. Brandt, Y. Wu, Detecting and aligning faces by image retrieval, in: Proceedings of the IEEE...
  • H. Li, Z. Lin, J. Brandt, X. Shen, G. Hua, Efficient boosted exemplar-based face detection, in: Proceedings of the IEEE...
  • Y. Sun, X. Wang, X. Tang, Deep learning face representation from predicting 10,000 classes, in: Proceedings of the IEEE...
  • Y. Sun, X. Wang, X. Tang, Deeply learned face representations are sparse, selective, and robust, arXiv preprint...
  • Y. Sun, X. Wang, X. Tang, Hybrid deep learning for face verification, in: Proceedings of the IEEE Conference on...
  • V. Jain, E.G. Learned-Miller, Fddb: A Benchmark for Face Detection in Unconstrained Settings, UMass Amherst Technical...
  • H. Li, Z. Lin, X. Shen, J. Brandt, G. Hua, A convolutional neural network cascade for face detection, in: Proceedings...
  • Y. Bengio et al.

    Representation learning: a review and new perspectives

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2013)
  • M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: Proceedings of European Conference on...


Ling Cai received the Ph.D. degree in electrical engineering from Shanghai Jiao Tong University, Shanghai, China, in 2011. He is currently an associate professor in the School of Information Science and Engineering at Xiamen University. His current research interests include computer vision, pattern recognition and machine learning.

    Xianming Lin received the Ph.D. degree in Intelligent Multimedia Information Processing from the School of Information Science and Engineering, Xiamen University, Xiamen, China, in 2014. He is an Assistant Professor with Xiamen University. His current research interests include mobile visual retrieval, computer vision and machine learning.

Rongrong Ji is a Professor at Xiamen University, where he directs the Intelligent Multimedia Technology Laboratory (http://imt.xmu.edu.cn) and serves as a Dean Assistant in the School of Information Science and Engineering. He was a Postdoctoral research fellow in the Department of Electrical Engineering, Columbia University from 2010 to 2013, working with Professor Shih-Fu Chang. He obtained his Ph.D. degree in Computer Science from Harbin Institute of Technology, graduating with a Best Thesis Award at HIT. He was a visiting student at the University of Texas at San Antonio working with Professor Qi Tian, a research assistant at Peking University working with Professor Wen Gao in 2010, and a research intern at Microsoft Research Asia working with Dr. Xing Xie from 2007 to 2008.

He is the author of over 40 tier-1 journal and conference papers, including IJCV, TIP, TMM, ICCV, CVPR, IJCAI, AAAI, and ACM Multimedia. His research interests include image and video search, content understanding, mobile visual search, and social multimedia analytics. Dr. Ji is the recipient of the Best Paper Award at ACM Multimedia 2011 and a Microsoft Fellowship in 2007. He is a guest editor for IEEE Multimedia Magazine, Neurocomputing, and ACM Multimedia Systems Journal. He has been a special session chair of MMM 2014, VCIP 2013, MMM 2013 and PCM 2012, a program chair of ICIMCS 2016, and Local Arrangement Chair of MMSP 2015. He serves as a reviewer for IEEE TPAMI, IJCV, TIP, TMM, CSVT, TSMC A/B/C, and IEEE Signal Processing Magazine, etc. He is on the program committees of over 10 top conferences including CVPR 2013, ICCV 2013, ECCV 2012, and ACM Multimedia 2010-2013.
