Neurocomputing, Volume 218, 19 December 2016, Pages 197-202

Masked face detection via a modified LeNet

https://doi.org/10.1016/j.neucom.2016.08.056

Abstract

Detecting masked faces in the wild is an emerging problem with rich applications ranging from violence video retrieval to video surveillance. Accurate detection remains an open problem, mainly due to low resolution, arbitrary viewing angles, and the difficulty of collecting a sufficient amount of training samples. These difficulties have significantly challenged the design of effective handcrafted features as well as robust detectors. In this paper, we tackle these problems by proposing a learning-based feature design and classifier training paradigm. More particularly, a modified LeNet, termed MLeNet, is presented, which modifies the number of units in the output layer of LeNet to suit the specific classification task. Meanwhile, MLeNet further increases the number of feature maps while using a smaller filter size. To further reduce overfitting and improve performance with a small quantity of training samples, we first double the training dataset by horizontal reflection and then learn MLeNet by combining pre-training and fine-tuning. We evaluate the proposed model on a real-world masked face detection dataset. Quantitative comparisons with several state-of-the-art and alternative solutions demonstrate the accuracy and robustness of the proposed model.

Introduction

Detecting video clips related to potential terrorists remains a fundamental demand in the management of massive-scale video corpora, and is highly beneficial to public security applications. A variety of definitions exist for identifying a person as a terrorist in a given video clip, among which one obvious cue is a masked face. As a specific task of face detection, the detection of masked faces poses significant difficulties and differs from traditional face detection (potentially with partial occlusions), which has been studied intensively for decades. On one hand, it encompasses challenges such as pose variation and lighting that have historically hampered traditional face detection paradigms. On the other hand, its severe occlusion significantly challenges existing face detection algorithms, since most of the face structure is missing.

Looking back at the literature, previous works in face detection mainly rely on handcrafted feature designs, such as the well-known Fisherface [1], Haar-like features with a cascade detector [2], and Gabor-like high-dimensional features with an AdaBoost detector [3]. One essential limitation lies in the need for a sufficient amount of training samples to achieve satisfactory detection accuracy. Recently, exemplar-based face detection [4] has been shown to be effective, because a large exemplar database is leveraged to cover all possible visual variations. However, it requires a large face database for detection and tends to produce false alarms in the presence of highly cluttered backgrounds. To reduce the number of required exemplars, the efficient boosted exemplar-based face detector [5] was proposed to further improve detection accuracy and make the detector faster and more memory efficient by discriminatively training and selectively assembling exemplars as weak detectors in a boosting framework. However, these methods fail when only a small face training dataset is available. Recently, deep learning architectures have been studied as well, which use CNNs with GPU-based computing to achieve breakthroughs in benchmark evaluations, such as Labeled Faces in the Wild (LFW) [6], [7], [8] and the Face Detection Data Set and Benchmark (FDDB) [9], [10].

In particular, a convolutional network can automatically learn effective feature representations of objects from training data [11], [12]. Most notably, AlexNet [13] showed ground-breaking performance on the ImageNet 2012 classification challenge. Since then, CNNs have led performance on ImageNet classification and object detection benchmarks, such as GoogLeNet [14] with about 6.8 M parameters, the ResNet-18 network [15] with about 11.6 M parameters, and VGG-19 [16] with about 144 M parameters. However, models with a large number of parameters overfit when trained on a small quantity of training data, especially our real-world masked face detection dataset with about 1000 training samples. To tackle the challenge induced by limited training data, Hinton and Salakhutdinov [17] introduced pre-training to generate a good initialization for large deep neural networks. In contrast, the LeNet introduced in [18] shows good performance in recognizing hand-written digit characters with relatively few parameters. However, the need for a large amount of training data still hinders its direct application in our scenario of masked face detection. In view of this issue, in this paper we introduce a modified LeNet, termed MLeNet, which modifies the number of units in the output layer of LeNet to suit a specific classification task with a small quantity of training samples. Meanwhile, MLeNet further increases the number of feature maps while using a smaller filter size, as shown in Table 1, which further improves classification performance with network overhead comparable to LeNet. Combining MLeNet with a sliding window, the detection of masked faces is done in a multi-scale fashion.
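The exact Table 1 configuration is not reproduced here, but the effect of filter size on feature-map dimensions can be sketched with the standard convolution and pooling output-size formulas. The 32×32 input and the kernel sizes below are illustrative assumptions, not the paper's actual settings:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution layer."""
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel, stride=None):
    """Spatial output size of a pooling layer (stride defaults to kernel)."""
    stride = stride or kernel
    return (size - kernel) // stride + 1

def trace(input_size, conv_kernels, pool_kernel=2):
    """Walk an input through alternating conv/pool layers, recording each size."""
    sizes = [input_size]
    s = input_size
    for k in conv_kernels:
        s = conv_out(s, k)
        sizes.append(s)
        s = pool_out(s, pool_kernel)
        sizes.append(s)
    return sizes

# LeNet-style 5x5 filters vs. a hypothetical smaller 3x3 choice on a 32x32 input
print(trace(32, [5, 5]))  # [32, 28, 14, 10, 5]
print(trace(32, [3, 3]))  # [32, 30, 15, 13, 6]
```

Smaller filters leave larger intermediate maps, which is one way to add feature maps without a large parameter increase.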

The parameters of MLeNet can be learned via stochastic gradient descent, and we combine pre-training and fine-tuning to prevent MLeNet from overfitting. Notably, pre-training is done by directly borrowing the model weights from LeNet, and fine-tuning adapts the network to a very limited number of training instances. In addition, we double the size of the training set via horizontal reflection. Well-known schemes such as the sliding window and non-maximum suppression [19] are also integrated into the proposed MLeNet-based detector. Quantitatively, experimental comparisons to a set of state-of-the-art (e.g., LeNet [18], RFD [4]) and classic (e.g., Haar-like features with AdaBoost [2]) detectors demonstrate that the proposed model achieves superior performance on detecting masked faces.
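The horizontal-reflection augmentation described above can be sketched in plain Python. The helper names `hflip` and `augment` are hypothetical, and images are represented as nested lists of pixel values for illustration:

```python
def hflip(image):
    """Horizontally reflect an image given as a list of pixel rows."""
    return [list(reversed(row)) for row in image]

def augment(samples):
    """Double a training set by adding the horizontal reflection of each image."""
    return samples + [hflip(img) for img in samples]

img = [[1, 2, 3],
       [4, 5, 6]]
data = augment([img])
print(len(data))   # 2
print(data[1][0])  # [3, 2, 1]
```

Reflection is label-preserving for masked-face windows, so it doubles the training set at no annotation cost.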

The rest of this paper is organized as follows. Section 2 describes the proposed detection model (MLeNet) for masked faces. Section 3 presents detailed quantitative evaluations with comparisons to a set of state-of-the-art methods. We conclude this paper in Section 4 and discuss our future work.


The Proposed Method

In this section, we introduce MLeNet for detecting the faces of possible terrorists. First, we introduce the structure and weight learning of MLeNet, which differs from LeNet. Second, we combine pre-training and fine-tuning with data augmentation to further improve the performance of MLeNet with a very limited number of training samples. Finally, masked faces are detected by combining the sliding window and non-maximum suppression.

Detecting masked faces

Based on the MLeNet described in Section 2.2, we build a detector to classify whether a given fixed-size window contains a masked face. But how are these candidate windows generated? Two frameworks are commonly used to generate such candidates. One follows R-CNN [20], in which selective search region proposals [21] are generated. The other follows DPM [22], in which candidate masked faces are generated by a sliding window. The first framework
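A minimal sketch of sliding-window candidate generation and greedy non-maximum suppression, assuming the second framework; the window size, stride, and IoU threshold below are illustrative values, not the paper's tuned settings:

```python
def sliding_windows(width, height, win=32, stride=16):
    """Enumerate fixed-size candidate windows as (x, y, w, h) boxes."""
    boxes = []
    for y in range(0, height - win + 1, stride):
        for x in range(0, width - win + 1, stride):
            boxes.append((x, y, win, win))
    return boxes

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thresh=0.5):
    """Keep the highest-scoring boxes; drop any box overlapping a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

cands = sliding_windows(64, 64)
print(len(cands))  # 9
print(nms([(0, 0, 32, 32), (8, 0, 32, 32), (100, 100, 32, 32)],
          [0.9, 0.8, 0.7]))  # [0, 2]
```

Multi-scale detection repeats this enumeration on rescaled copies of the image, classifying each window with MLeNet before suppression.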

Experiments

We verify the proposed work on a masked-man dataset cropped from violence videos. The dataset consists of 1140 images, including 240 positive and 900 negative ones. We randomly select 150 positive samples and 750 negative samples as the training set, 50 positive samples and 50 negative samples as the validation set, and the remaining 140 images as the test set. To reduce overfitting and the detector error rate, we double the number of training instances via horizontal reflection.
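The split protocol above can be sketched as follows; `split_dataset` is a hypothetical helper, and the fixed seed is only for reproducibility of the illustration:

```python
import random

def split_dataset(pos, neg, seed=0):
    """Randomly split samples following the stated protocol:
    150 pos / 750 neg train, 50 pos / 50 neg validation, the rest test."""
    rng = random.Random(seed)
    pos, neg = pos[:], neg[:]
    rng.shuffle(pos)
    rng.shuffle(neg)
    train = pos[:150] + neg[:750]
    val = pos[150:200] + neg[750:800]
    test = pos[200:] + neg[800:]
    return train, val, test

pos = [("pos", i) for i in range(240)]
neg = [("neg", i) for i in range(900)]
train, val, test = split_dataset(pos, neg)
print(len(train), len(val), len(test))  # 900 100 140
```

The 140 test images thus contain 40 positives and 100 negatives, matching the counts stated above.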

Conclusion

In this paper, we propose a novel model for localizing masked faces in images. Our model is based on a convolutional neural network (MLeNet) combined with a sliding window, and achieves satisfactory performance on detecting masked faces. In addition, to further reduce overfitting and improve performance with a small quantity of training samples, we first double the training dataset by horizontal reflection, and then learn MLeNet by combining pre-training and fine-tuning.

Acknowledgements

This work is supported by the National Key R&D Program (No. 2016YFB1001503), the Special Fund for Earthquake Research in the Public Interest (No. 201508025), the National Natural Science Foundation of China (No. 61422210, No. 61373076, No. 61402388, and No. 61572410), the CCF-Tencent Open Research Fund, the Open Projects Program of the National Laboratory of Pattern Recognition, and the Xiamen Science and Technology Project (No. 3502Z20153003).

Shaohui Lin received B.S. degree from Sanming University, Fujian, China, in 2011 and M.S. degree from Jimei University, Fujian, China, in 2014. He is currently pursuing the Ph.D. degree in Information and Computing Science at Xiamen University. His research interests include machine learning, and computer vision.

References (24)

  • P.N. Belhumeur et al.

    Eigenfaces vs. Fisherfaces: recognition using class specific linear projection

    IEEE Trans. Pattern Anal. Mach. Intell.

    (1997)
  • P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the IEEE...
  • C. Liu et al.

    Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition

    IEEE Trans. Image Process.

    (2002)
  • X. Shen, Z. Lin, J. Brandt, Y. Wu, Detecting and aligning faces by image retrieval, in: Proceedings of the IEEE...
  • H. Li, Z. Lin, J. Brandt, X. Shen, G. Hua, Efficient boosted exemplar-based face detection, in: Proceedings of the IEEE...
  • Y. Sun, X. Wang, X. Tang, Deep learning face representation from predicting 10,000 classes, in: Proceedings of the IEEE...
  • Y. Sun, X. Wang, X. Tang, Deeply learned face representations are sparse, selective, and robust, arXiv preprint...
  • Y. Sun, X. Wang, X. Tang, Hybrid deep learning for face verification, in: Proceedings of the IEEE Conference on...
  • V. Jain, E.G. Learned-Miller, Fddb: A Benchmark for Face Detection in Unconstrained Settings, UMass Amherst Technical...
  • H. Li, Z. Lin, X. Shen, J. Brandt, G. Hua, A convolutional neural network cascade for face detection, in: Proceedings...
  • Y. Bengio et al.

    Representation learning: a review and new perspectives

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2013)
  • M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: Proceedings of European Conference on...


Ling Cai received the Ph.D. degree in electrical engineering from Shanghai Jiao Tong University, Shanghai, China, in 2011. He is currently an associate professor in the School of Information Science and Engineering at Xiamen University. His current research interests include computer vision, pattern recognition and machine learning.

    Xianming Lin received the Ph.D. degree in Intelligent Multimedia Information Processing from the School of Information Science and Engineering, Xiamen University, Xiamen, China, in 2014. He is an Assistant Professor with Xiamen University. His current research interests include mobile visual retrieval, computer vision and machine learning.

Rongrong Ji is a Professor at Xiamen University, where he directs the Intelligent Multimedia Technology Laboratory (http://imt.xmu.edu.cn) and serves as a Dean Assistant in the School of Information Science and Engineering. He was a Postdoctoral research fellow in the Department of Electrical Engineering, Columbia University from 2010 to 2013, working with Professor Shih-Fu Chang. He obtained his Ph.D. degree in Computer Science from Harbin Institute of Technology, graduating with a Best Thesis Award at HIT. He was a visiting student at the University of Texas at San Antonio working with Professor Qi Tian, a research assistant at Peking University working with Professor Wen Gao in 2010, and a research intern at Microsoft Research Asia working with Dr. Xing Xie from 2007 to 2008.

He is the author of over 40 tier-1 journal and conference papers, including IJCV, TIP, TMM, ICCV, CVPR, IJCAI, AAAI, and ACM Multimedia. His research interests include image and video search, content understanding, mobile visual search, and social multimedia analytics. Dr. Ji is the recipient of the Best Paper Award at ACM Multimedia 2011 and a Microsoft Fellowship in 2007. He is a guest editor for IEEE Multimedia Magazine, Neurocomputing, and ACM Multimedia Systems Journal. He has been a special session chair of MMM 2014, VCIP 2013, MMM 2013 and PCM 2012, a program chair of ICIMCS 2016, and Local Arrangement Chair of MMSP 2015. He serves as a reviewer for IEEE TPAMI, IJCV, TIP, TMM, CSVT, TSMC A/B/C, and IEEE Signal Processing Magazine, etc. He is on the program committees of over 10 top conferences including CVPR 2013, ICCV 2013, ECCV 2012, and ACM Multimedia 2010-2013.
