Abstract
The development of object detection systems is normally driven to achieve both high detection and low false positive rates on a given public dataset. However, when deployed in a real scenario, the result is generally an unacceptable rate of false alarms. In this context, we propose to add an additional step that models and filters the typical false alarms of the new scenario while roughly maintaining the ability to detect the objects of interest. The false alarms produced in the new scenario are used to train a deep autoencoder that models them. The autoencoder then acts as a filter that checks whether the output of the detector is one of its typical false positives, based on the reconstruction error measured with the Mean Squared Error (MSE) and the Peak Signal-to-Noise Ratio (PSNR). We test the system on a novel, entirely synthetic dataset generated with Unreal Engine 4, used for both training and testing the autoencoder. Results show a reduction of up to 37.9% in the number of false positives when the PSNR error measure is used, while maintaining the same detection capability.
1 Introduction
In security, detecting potentially dangerous situations as soon as possible is of vital importance. Constant human supervision of the images provided by Closed Circuit Television (CCTV) systems is not feasible. In recent decades, several efforts have been made to create automated video surveillance (AVS) systems that can locate potentially threatening objects or events in a video sequence [11]. With the widespread use of modern deep learning techniques, these systems obtain promising results [10, 19].
The development of these detection systems is normally driven to achieve both high detection and low false positive rates. Ideally, training data would contain representative instances from all possible application scenarios. In practice, obtaining such a huge amount of data is not feasible in terms of time and resources. This forces data scientists to be cautious about overfitting and poor generalization when training new models [20]. Techniques such as dataset partitioning, L1 and L2 regularization or early stopping are applied to alleviate these problems [14]. However, the misclassification of samples in new scenarios must still be addressed. It is conceivable, for instance, that a weapon detector could be trained on a dataset containing instances of all possible weapons and provide accurate detections with few false positives. Yet when that detector is put into a surveillance system in a real scenario, the result is generally an unacceptable rate of false alarms [18]. This means that the system will almost certainly be switched off, especially in cases where the incidence of the event of interest is very low. In this context, we propose to add an additional step that models and filters the typical false alarms particular to the new scenario while roughly maintaining the ability to detect the objects the detector was trained for.
We focus on a handgun detection problem [10]. When such a detector runs in a new scenario over a period of time, all of its detections, most likely false positives, can be stored and leveraged. Here we propose to use those false positives to train a deep autoencoder that models the typical false alarms of the particular scenario (and this can be done down to the level of individual cameras). The one-class classification approach has been widely used in the literature to detect abnormal and extreme data [6]. Autoencoders have shown the ability to perform this task even where other techniques fail [5].
In addition, due to the large number of images required to train deep learning models, we propose to use an entirely synthetic dataset for training and testing the autoencoder. It consists of frames captured from a realistic 3D environment that resembles the new scenario in which the detector would be deployed.
The rest of the paper is organized as follows. Section 2 gives the details of the handgun detector used. Section 3 covers the generation of the synthetic dataset of the new detection scenario. Section 4 describes the procedure followed to filter the false positives of the new scenario. Finally, Sect. 5 shows the performance of the proposed method and Sect. 6 summarizes the main conclusions.
2 Handgun Detector
To address the reduction of the false positive rate of a handgun detector we need one as a starting point. Classical machine learning methods based on keypoint matching as well as on feature extraction and classification have been extensively applied to RGB images taken by CCTV video cameras [10]. Most of these methods use the sliding window approach, with which the detection problem is solved as a classification problem in every examined window. This approach works not only with traditional methods but also with the new deep learning classification architectures. Convolutional neural networks (CNNs) can be used in the same way as support vector machines (SVMs) or cascade classifiers, without having to represent the image as a set of features before performing classification.
The problem with the sliding window approach is the variability of object locations within the image and the large differences in their aspect ratios, which make the number of regions to be examined huge. In [2] this problem is addressed by taking different regions of interest from the image with a selective search that extracts a manageable number of regions called region proposals. R-CNN, Fast R-CNN and Faster R-CNN are the main detection architectures based on this region proposal approach [13]. In addition, there are other detection networks, such as YOLO (You Only Look Once) [12] or SSD (Single Shot Detector) [8], that predict the bounding boxes and the class probabilities for those boxes while examining the whole image only once.
Due to the unavailability of pretrained handgun detection models, we trained a handgun detector with a dataset provided by the University of Seville, Spain. The dataset contains 871 images with a total of 177 annotated handguns. The images come from 2 different CCTV-controlled scenarios. The CNN architecture selected was Faster R-CNN (Fig. 1).
3 Synthetic Dataset
Collecting and annotating data to train deep learning networks is a tedious task. It is even more complex for detection and segmentation problems, where the annotator must provide not only the class but also a rectangle, or even a pixel-level contour, around every object of interest in the image. The easiest solution would be to use an existing dataset for the problem addressed, but this is not always possible.
To test the hypothesis of our work, we used Unreal Engine 4 [17] to generate synthetic data from a school hall (Fig. 2). Other popular alternatives, such as Unity [16] or Lumberyard/CryEngine [9], could also be used.
The synthetic scenario is similar to a high-school corridor. It is rendered with people walking across it, generating a dataset of 3000 images in which some people carry handguns, mobile phones or nothing in their hands. Of these images, 2657 contain handguns and 343 do not, and the total number of handguns is 5437. Since we control the dataset generation, it is possible to automatically annotate where each handgun is. To store the annotations we used XML files with the format defined in the Pascal VOC 2012 Challenge [1]. Although images containing the object of interest are not strictly necessary to train the autoencoder, they are needed to test the improvement of the detector+autoencoder system, since reducing the false positive rate might have a negative effect on the detection rate.
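Since the dataset generator controls every object's position, the VOC-style annotation files can be written automatically. The following is a minimal sketch of such a writer using only the Python standard library; the file name, image size and box coordinates are illustrative, not taken from the actual dataset:

```python
import xml.etree.ElementTree as ET

def voc_annotation(filename, width, height, boxes):
    """Build a Pascal VOC 2012 style XML annotation string.

    `boxes` is a list of (name, xmin, ymin, xmax, ymax) tuples in pixels.
    """
    ann = ET.Element("annotation")
    ET.SubElement(ann, "filename").text = filename
    size = ET.SubElement(ann, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"
    for name, xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(ann, "object")
        ET.SubElement(obj, "name").text = name
        bbox = ET.SubElement(obj, "bndbox")
        ET.SubElement(bbox, "xmin").text = str(xmin)
        ET.SubElement(bbox, "ymin").text = str(ymin)
        ET.SubElement(bbox, "xmax").text = str(xmax)
        ET.SubElement(bbox, "ymax").text = str(ymax)
    return ET.tostring(ann, encoding="unicode")

# Hypothetical frame with one annotated handgun.
xml_str = voc_annotation("frame_0001.png", 1920, 1080,
                         [("handgun", 200, 340, 260, 400)])
```

One such file per rendered frame is enough for standard VOC-compatible training tools to consume the dataset directly.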
4 Autoencoder
Autoencoder networks learn an approximation to an input data distribution. In other words, they learn how to compress input data into a short code and then reconstruct that code into something as close as possible to the original input. The structure of these networks consists of an encoder path that reduces the dimensionality, ignoring signal noise, and a decoder path that performs the reconstruction. Autoencoders are commonly used for anomaly detection or one-class classification [4].
In our case the autoencoder is trained to model one class: the typical false positives of the particular scenario (see Figs. 3 and 4).
Once the autoencoder is trained, it is applied to reconstruct images from a test dataset that also contains instances of the object of interest. The reconstruction error is lower for the class used to train the autoencoder, so a threshold can be established to separate the two classes (in this case, false positives and real positives). Therefore, the trained autoencoder acts as a filter that checks whether the output of the detector is one of its typical false positives (Fig. 5). In this work, two different error measures were applied: the Mean Squared Error (MSE) [15] and the Peak Signal-to-Noise Ratio (PSNR) [7]. Figure 6 illustrates the reconstruction error of an autoencoder in a toy example using the MSE measure.
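The two error measures and the resulting filtering rule can be sketched in plain Python as follows; the flattened pixel lists, the peak value of 255 and the use of the PSNR threshold reported later in Sect. 5 (77.24) are illustrative assumptions. Note the direction of each measure: a typical false positive is reconstructed well, i.e. it has a low MSE and a high PSNR:

```python
import math

def mse(original, reconstruction):
    """Mean squared error between two equal-length flattened pixel sequences."""
    n = len(original)
    return sum((o - r) ** 2 for o, r in zip(original, reconstruction)) / n

def psnr(original, reconstruction, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means a better reconstruction."""
    err = mse(original, reconstruction)
    if err == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / err)

def is_typical_false_positive(original, reconstruction, psnr_threshold=77.24):
    """Flag a detection as a typical FP when the autoencoder reconstructs it well."""
    return psnr(original, reconstruction) >= psnr_threshold

# Toy 3-pixel "images": a near-perfect reconstruction has low MSE, high PSNR.
a = [10.0, 20.0, 30.0]
b = [12.0, 18.0, 33.0]
err = mse(a, b)      # (4 + 4 + 9) / 3
quality = psnr(a, b)
```

A detection flagged by `is_typical_false_positive` would be discarded before raising an alarm; everything else is passed through as a real positive.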
The autoencoder structures used are shown in Fig. 7. In both cases, the input is a \(64 \times 64\) image with 3 channels. Their compressive paths consist of 1 or 6 convolutional and max-pooling layers. Similarly, the reconstruction paths also have 1 or 6 convolutional and up-sampling layers.
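As an illustration, the shallower of the two structures (one convolutional and max-pooling stage per path) could be built in Keras, the framework used in this work, roughly as follows; the number of filters and the sigmoid output activation are assumptions, since the exact layer widths of Fig. 7 are not reproduced here:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_autoencoder():
    # Encoder: 64x64x3 input compressed by one conv + max-pooling stage.
    inp = keras.Input(shape=(64, 64, 3))
    x = layers.Conv2D(16, 3, activation="relu", padding="same")(inp)
    x = layers.MaxPooling2D(2)(x)  # 32x32x16 bottleneck
    # Decoder: one conv + up-sampling stage back to the input resolution.
    x = layers.Conv2D(16, 3, activation="relu", padding="same")(x)
    x = layers.UpSampling2D(2)(x)
    out = layers.Conv2D(3, 3, activation="sigmoid", padding="same")(x)
    model = keras.Model(inp, out)
    # Trained to reproduce its input, so the target equals the input image.
    model.compile(optimizer="adam", loss="mse")
    return model

autoencoder = build_autoencoder()
```

Training then amounts to `autoencoder.fit(fp_crops, fp_crops, ...)` on the stored false-positive crops, resized to \(64 \times 64\).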
The dataset used is composed of 2100 images from the synthetic scenario. Of them, 1200 images were used to generate the false positives to train the autoencoder and the remaining 900 to validate it.
5 Results
The trained Faster R-CNN model obtained a mAP of 79.33%. Its training process took around 2 days to complete 62 epochs on two NVIDIA Quadro M4000 cards, using Keras with the TensorFlow backend and CUDA 8.0 installed on a PC running Ubuntu 14.04 LTS.
We applied both the handgun detector alone and the detector+autoencoder approach to a test dataset composed of 900 images from the synthetic scenario, containing 808 instances of the object. The histograms of the reconstruction error of the autoencoders for each error measure are depicted in Figs. 8 and 9. The best separation of the FPs from the TPs was obtained with the larger autoencoder structure when using the MSE and with the smaller one when using the PSNR. Although the TPs overlap with the FPs in both cases, the first part of the histograms, which contains no TPs, holds 26.5% of all the FPs when the MSE is used and 37.9% when the PSNR is used. This means that those FPs can be filtered by the autoencoder by selecting a threshold on the reconstruction error. Furthermore, if the image reconstructions of the FPs and the TPs are compared, it can be seen that the FPs are reconstructed better than the TPs (see Fig. 10).
In addition, the corresponding precision-recall curves were also obtained. Considering the possible outputs of the detector, where TP and TN represent the number of true positives and true negatives and FP and FN stand for the number of false positives and false negatives, the precision (p) and recall (r) values can be calculated as p = TP / (TP + FP) and r = TP / (TP + FN) (Eq. 1) [3].
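Eq. 1 can be applied directly to the detection counts. The sketch below uses invented counts (100 FPs before filtering dropping to 62 after, mirroring the 37.9% PSNR reduction, with none of the 808 TPs lost) purely to illustrate how filtering raises precision without moving recall:

```python
def precision_recall(tp, fp, fn):
    """Eq. 1: p = TP / (TP + FP), r = TP / (TP + FN)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r

# Hypothetical counts: 808 TPs kept, FPs drop from 100 to 62 after filtering.
p_before, r_before = precision_recall(808, 100, 0)
p_after, r_after = precision_recall(808, 62, 0)
```

Since the filter removes only FPs in this range of thresholds, recall is unchanged while precision increases, which is exactly the behaviour sought in the precision-recall curves of Figs. 11 and 12.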
The experimental results show a reduction in the number of false positives while maintaining the detection capabilities (Figs. 11 and 12). The figures depict the precision-recall curves of both the detector and the detector+autoencoder approaches. These curves were obtained by varying the detector confidence threshold while fixing a certain threshold for the autoencoder. When compared, they show a maximum increase in precision of 0.015 at the same recall values when the autoencoder is used with the MSE (maximum distance between precisions without displacing the recall values to the left) and a maximum increase in precision of 0.020 at the same recall values when the autoencoder is used with the PSNR (threshold = 77.24).
It is also worth noting that, although the results with the shallowest autoencoder architecture and the PSNR error measure are superior to those obtained with the deepest architecture and the MSE error measure, the reconstructed image is noisier (see Fig. 13).
With respect to computational times, training the deep autoencoders took only 45 min to complete 500 epochs on an NVIDIA GTX 1060 Max-Q card, using Keras with the TensorFlow backend and CUDA 9.0 installed on a Windows 10 PC.
6 Conclusions
This work focuses on reducing the number of false alarms produced when a surveillance application is run in a new particular scenario. A synthetic scenario was generated with the game engine Unreal Engine 4, resembling a surveillance camera inside a high-school hall where people are walking by. Images coming from this scenario were used as input to a pretrained handgun detector. Its false positive detections were then used to train a deep autoencoder to model them and act as a filter that removes those false positives.
The autoencoder proved able to reduce the number of FPs by up to 37.9% in combination with the PSNR error while maintaining the same handgun detection capability. Notice that the number of FPs and the detection ability of the detector form a trade-off that should always be considered. The proposed detector+autoencoder approach helps keep a good balance between them, since for the first range of autoencoder thresholds it is possible to reduce the FPs without affecting the detection ability.
Overall, our approach can be used with generic detectors (i.e. a generic handgun detector) and with different particular scenarios (down to the level of individual cameras which, depending on the point of view, lighting, etc., will produce different false positives). Thus, in practice we would only need the generic detector and one trained autoencoder per camera feed.
As future work, it would be useful to consider other detection architectures to check whether they influence the autoencoder. In addition, other error metrics could be used to set the autoencoder threshold.
References
Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2012 (VOC 2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html
Girshick, R.B., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR abs/1311.2524 (2013). http://arxiv.org/abs/1311.2524
Goutte, C., Gaussier, E.: A probabilistic interpretation of precision, recall and F-Score, with implication for evaluation. In: Losada, D.E., Fernández-Luna, J.M. (eds.) ECIR 2005. LNCS, vol. 3408, pp. 345–359. Springer, Heidelberg (2005). https://doi.org/10.1007/978-3-540-31865-1_25
Gutoski, M., Ribeiro, M., Aquino, N.M.R., Lazzaretti, A.E., Lopes, H.S.: A clustering-based deep autoencoder for one-class image classification. In: 2017 IEEE Latin American Conference on Computational Intelligence (LA-CCI), pp. 1–6 (2017)
Hofer-Schmitz, K., Nguyen, P.H., Berwanger, K.: One-class Autoencoder approach to classify Raman spectra outliers. In: European Symposium on Artificial Neural Networks, ESANN 2018, pp. 189–194 (2018)
Khan, S.S., Madden, M.G.: One-class classification: taxonomy of study and review of techniques. Knowl. Eng. Rev. 29(3), 345–374 (2014). https://doi.org/10.1017/S026988891300043X
Kotevski, Z., Mitrevski, P.: Experimental comparison of PSNR and SSIM metrics for video quality estimation. In: Davcev, D., Gómez, J.M. (eds.) ICT Innovations 2009, pp. 357–366. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-10781-8_37
Liu, W., et al.: SSD: single shot multibox detector. CoRR abs/1512.02325 (2015). http://arxiv.org/abs/1512.02325
Lumberyard. https://aws.amazon.com/es/lumberyard. Accessed 09 Apr 2019
Olmos, R., Tabik, S., Herrera, F.: Automatic handgun detection alarm in videos using deep learning. CoRR abs/1702.05147 (2017). http://arxiv.org/abs/1702.05147
Raghunandan, A., Mohana, M., Pakala, R., Aradhya, H.V.R.: Object detection algorithms for video surveillance applications. In: IEEE - 7th International Conference on Communication and Signal Processing, April 2018. https://doi.org/10.1109/ICCSP.2018.8524461
Redmon, J., Divvala, S.K., Girshick, R.B., Farhadi, A.: You only look once: unified, real-time object detection. CoRR abs/1506.02640 (2015). http://arxiv.org/abs/1506.02640
Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. CoRR abs/1506.01497 (2015). http://arxiv.org/abs/1506.01497
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
Tan, C.C., Eswaran, C.: Performance comparison of three types of autoencoder neural networks. In: 2008 Second Asia International Conference on Modelling Simulation (AMS), pp. 213–218, May 2008. https://doi.org/10.1109/AMS.2008.105
Unity. https://unity.com. Accessed 09 Apr 2019
Unreal Engine 4. https://www.unrealengine.com. Accessed 09 Apr 2019
Vállez, N., Bueno, G., Déniz, O.: False positive reduction in detector implantation. In: Peek, N., Marín Morales, R., Peleg, M. (eds.) AIME 2013. LNCS, vol. 7885, pp. 181–185. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38326-7_28
Xu, D., Yan, Y., Ricci, E., Sebe, N.: Detecting anomalous events in videos by learning deep representations of appearance and motion. Comput. Vis. Image Underst. 156, 117–127 (2017). https://doi.org/10.1016/j.cviu.2016.10.010. http://www.sciencedirect.com/science/article/pii/S1077314216301618. Image and Video Understanding in Big Data
Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. arXiv pre-print (2016). http://arxiv.org/abs/1611.03530
Acknowledgments
We thank Professor Dr. J.A. Alvarez for the surveillance images provided for training the handgun detector. This work was partially funded by projects TIN2017-82113-C2-2-R by the Spanish Ministry of Economy and Business and SBPLY/17/180501/000543 by the Autonomous Government of Castilla-La Mancha and the ERDF.
Vallez, N., Velasco-Mata, A., Corroto, J.J., Deniz, O. (2019). Weapon Detection for Particular Scenarios Using Deep Learning. In: Morales, A., Fierrez, J., Sánchez, J., Ribeiro, B. (eds) Pattern Recognition and Image Analysis. IbPRIA 2019. Lecture Notes in Computer Science(), vol 11868. Springer, Cham. https://doi.org/10.1007/978-3-030-31321-0_32