Abstract
A major risk of automated high-security entrance control is that an authorized person takes an unauthorized person into the secured area. This practice is called “piggybacking”. Known systems try to prevent it by using physical barriers combined with sensor- or camera-based algorithms. In this paper we present a multi-sensor solution for verifying the number of persons standing within a defined transit area. We use sensors installed in the floor to detect feet as well as camera shots taken from above. We propose an image-based approach that uses change detection to extract motion from a sequence of images and classifies it using a convolutional neural network. Our sensor-based approach shows how user interactions can be used to facilitate safe separation. Both methods are computationally efficient, so they can be used in embedded systems. In the evaluation, we achieved state-of-the-art results for both approaches individually. Merging both methods reliably prevents piggybacking at a BPCER of 7.1%, i.e., the rate at which bona fide presentations are incorrectly classified as presentation attacks.
1 Introduction
The main goal of an automated access control system is the prevention of unrestricted access by unauthorized persons. Such un-staffed control systems can be found at many different places such as office buildings, prisons or airports. Usually, access is granted by providing a physical item (e.g., a key) or a biometric property. A general problem is that an authorized person can take an unauthorized person into the secured area. This practice is referred to as “tailgating” or “piggybacking”. Many barriers like (drop-arm) turnstiles are therefore equipped with sensors such as infrared break-beams in order to eliminate this vulnerability. However, these available systems are designed to achieve high flow rates and can easily be defeated. In places where higher security is needed, mantrap portals or video surveillance systems are used. These systems restrict access to a single person at a time through a transit space. Permitted subjects enter and close the portal, so that software can verify the number of people present in the transit space. After a successful verification, the system unlocks a second door to give access to the secured area. Previous research in the field of tailgating/piggybacking prevention shows that none of the existing systems is completely safe. Computer vision approaches using video captures can detect intrusion by analyzing movements and distinguishing them from permitted behavior. Often, optical flow is used in such applications to detect motion. In contrast, we present an approach based on change detection that uses an adaptive background model. Compared to optical flow, this method is more computationally efficient and allows real-time computation, as we will show later in the paper. Another advantage over optical flow is that the data can be augmented more easily because the original feature space is preserved. The fundamental problem of all imaging approaches, however, is that in practice they are limited to the field of view of the camera. Persons who want to overcome these systems can do so in the simplest case by hiding behind a permitted person (see Sect. 2). Sensors mounted in the floor can be of great help here, but are limited by the fact that all feet must be on the floor (Fig. 1).
In this paper we combine both approaches and establish a system that cannot be overcome easily. We are the first to show how to couple ground-based position data of the feet with an image-based algorithm from the top view. In this multi-modal approach, we verify the number of feet by using an active user interaction scheme and couple it with an image-based verification from the top-view perspective. The dataset used includes multiple humans in a scene, close to each other, causing occlusion and illumination changes. It consists of 21 2D image frames from a regular camera, taken during 3–4 s of recording [1]. We want to point out three important benefits of our method: (1) Our system is designed to work in real time on low-cost hardware like a Raspberry Pi. This is important, as images from these cameras should not be transmitted for processing anywhere other than the place where they are captured. (2) An interactive capacitive sensing approach ensures a limited amount of conductive material on the floor (see Sect. 3). (3) Our new feature descriptor maximizes movement detection by using foreground segmentation and convolutional neural networks (see Sect. 3.3). Related work is discussed in Sect. 2, and our results are presented in Sect. 4.
2 Related Work
In research, several computer-vision methods have been developed to overcome the weaknesses of non-smart entrance gates. Most methods use a top-view perspective and pattern recognition to distinguish between one and more than one person in an observed area. An imaging approach using thermal sensors was introduced by Siegmund et al. [2] but shows disadvantages, especially in cases where the intruder uses equipment to hide himself. The same authors used RGB-D images to create models of different verification attempts [3] and evaluated them in attempts with and without identity claim. Their method consists of change and blob detection and uses an AdaBoost machine-learning classifier, achieving an EER of 5% for scenarios with identity claim and of 11% without. When analyzing sequences of 2D images, change detection using a Gaussian mixture model has been used to detect and count contours [4]. Rauter introduced a motion-based head-shoulder detector in order to detect intrusion [5]. Optical flow is a strong feature descriptor as it extracts motion and direction in addition to position. A later method using this descriptor achieved an EER of 5.17% by creating histograms of image sequences and classifying them via machine learning [1]. The disadvantage of optical flow is its high computational cost. A drawback of all these methods is the camera angle, which allows people to hide on the floor or between the legs of a permitted person. A recently published study makes use of capacitive sensors in the ground to detect and count feet [6] on the floor. It is built upon a floor-based indoor positioning system in grid layout [7]. Its active capacitive measuring system [8] is efficient for remote sensing and can reliably recognize a foot at heights of up to 10 cm. An obvious limitation is the case in which two people each stand with only one foot on the sensor surface.
3 Methodology
We assume that people have a vested interest in getting through the access control point with as little hassle as possible. Previous studies have shown that users “optimize” their behavior in a way that is convenient to them [9]. Nevertheless, a system that can be used in practice must find a compromise between usability and safety. There are, however, other aspects that need to be considered when developing such a system:
1. The detection method must be flexible with regard to lighting and clothing as well as the different statures of users.
2. The proposed solution should verify a subject without the need to claim their identity.
3. Authorized subjects sometimes need to pass carrying different objects, which can be of any kind and appearance.
4. The proposed method should be reasonably fast, in order to be used in an embedded computer close by.
Camera-based methods alone do not seem sufficient to guarantee reliable piggybacking detection, as they suffer from the limited viewing range of the camera. For this reason, we add a further verification that monitors the floor area and confirms that nobody is hiding on the floor. In the next section, we therefore present an approach that interactively monitors this area by means of a capacitive sensor grid. In Sect. 3.3, we present a new image-based method that combines time-based change detection and convolutional neural networks (CNN) in compliance with the conditions mentioned here.
3.1 Dataset
In the image-based method we use the dataset introduced by Siegmund et al. [1], which includes 60 bona fide verification attempts and 216 piggybacking attacks by 12 different participants (see Fig. 2). The participants cover a wide range of physical characteristics, such as different heights, weights and body shapes. The attack schemes were recorded with two subjects present in the transit area. Six different scenarios were carried out in which the attackers used different approaches to spoof the system and/or hide. Each recording consists of a total of 21 RGB images, recorded over a period of 3–4 s. For the capacitive sensing grid, a test group of 12 people with different shoe sizes (between 37 and 48) was acquired. Each subject was recorded at least once alone and several times together with another subject. For the evaluation of the combined approach, a total of 87 attacks were carried out in different compositions of the test group. The function of the system was explained to the test group, and direct feedback on their success was provided.
3.2 Capacitive Sensing Grid
In an earlier study, an approach using capacitive active feet detection sensors was presented [6]. Capacitive sensors are proximity sensors that detect nearby conductive objects by creating an electric field [10] (Fig. 3).
Since the range of these sensors depends on the size of the electric field, it is possible to detect feet even when they are away from the ground. Therefore, this technology is particularly suitable for the application described. We use the same sensors as the authors of that paper, which provide a continuous signal for analysis at a frequency of 4 Hz. In our prototype, we selected a monitored area of \(800 \times 800\) mm, which acts as the transit area. We mounted \(7 \times 7\) sensors in the floor, arranged in a grid used for alignment. The sensors are mounted in the middle of each cell at a horizontal distance of 100 mm from each other. We use a copper plate as electrode because it shows the best ratio of range and sensitivity compared to other materials. The initial capacitance value of each sensor acts as a baseline value.
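To make the sensing pipeline concrete, the following sketch shows how per-sensor deltas relative to the baseline capacitance could be computed; the `read_grid` callback, array shapes and sample count are illustrative assumptions, not part of the original prototype.

```python
import numpy as np

GRID_ROWS, GRID_COLS = 7, 7  # 7 x 7 sensors covering the 800 x 800 mm transit area

def capture_baseline(read_grid, n_samples=8):
    """Average several idle readings (sampled at 4 Hz) to obtain the per-sensor baseline."""
    samples = np.stack([read_grid() for _ in range(n_samples)])  # each reading: (7, 7) array
    return samples.mean(axis=0)

def sensor_deltas(read_grid, baseline):
    """Change in capacitance relative to the baseline; a nearby foot increases the delta."""
    return read_grid() - baseline
```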
Active Feet Verification. Previous studies have shown that although capacitive sensors are able to reliably detect feet, they cannot always ensure that these feet belong to only a single person. We think that, for access to high-security areas, the following interaction can be expected of the person entitled to access. First, we propose to ask the user to put both feet on the ground. In a next step, the user is asked to lift one foot. So if more than one person is in the area, the intruder would now have to keep all of their feet off the ground. We evaluated this procedure in a first scenario using marked positions on the floor and in a second scenario in which the user is able to freely choose the position of their feet. Since the measured capacitance changes even when merely approaching a sensor, we define a threshold \(\epsilon \) above which a sensor is considered activated. For this we asked our test group to stand on certain sensor areas without touching the surrounding ones. We then calculated the difference between activated and surrounding sensors for all sensors. We determined \(\epsilon \) as the minimum difference between activated and surrounding sensors plus 20% of that delta. We interpret the sensor grid output as an image sg with x rows and y columns. Equation 1 applies fixed-level thresholding to the \(n^{th}\) single-channel matrix \({sg}^n(x,y)\) using \(\epsilon \) as threshold.
$$ {dst}^{n}(x,y) = \begin{cases} {sg}^{n}(x,y) & \text{if } {sg}^{n}(x,y) > \epsilon \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
We obtain the activated sensors in the resulting image dst where \({dst}^n(x,y)\) is not 0. In order to detect the lift of a foot, we evaluate over a period of 8 frames whether the previously defined sensors have been activated. If this is the case, the user is asked to lift one foot. Successful validation is achieved when the number of activated sensors has halved in at least three out of eight dst images. We set the requirement to only three validation images based on experiments that revealed that users need some reaction time. In the second scenario, where users were not given a marked position, successful validation also takes place in two steps. First, the number of activated sensors is counted over 8 frames; this number must not exceed a defined number of sensors. Then it is validated whether the number of activated sensors has halved in the second step.
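The following sketch illustrates the thresholding of Eq. 1 together with the two-step foot-lift validation described above; the helper names and the exact frame handling are our own assumptions.

```python
import numpy as np

def threshold_grid(sg, eps):
    """Eq. 1: keep a sensor value only if it exceeds the activation threshold epsilon."""
    return np.where(sg > eps, sg, 0)

def count_active(dst):
    """Number of activated sensors, i.e. non-zero entries of dst."""
    return int(np.count_nonzero(dst))

def validate_foot_lift(frames_both_feet, frames_after_request, min_halved=3):
    """Two-step check: sensors must be active while both feet are down, and the
    number of active sensors must halve in at least `min_halved` of the 8 frames
    captured after the user is asked to lift one foot."""
    reference = max(count_active(f) for f in frames_both_feet)
    halved = sum(1 for f in frames_after_request if count_active(f) <= reference / 2)
    return reference > 0 and halved >= min_halved
```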
3.3 Image Based Approach
Our method is based on extracting motion features from image sequences using change detection. The reason for this is that a learning algorithm trained directly on the very complex and limited data we use would be difficult to generalize. Therefore, the complexity must be reduced without losing the information necessary for the detection. Since the background model dynamically adapts to any background, the proposed method is applicable to every background. This is done by finding the difference between the current and previous frames (background model subtraction). From the first three background frames of each scene we create a model of the background pixels as a mixture of K Gaussians and then check the weights of the mixture components representing color proportions. After calculating the background model, we take the subsequent frames as foreground and calculate their difference from the background.
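As a minimal sketch, the background model could be initialized with OpenCV's MOG2 implementation of the adaptive Gaussian mixture model [11] and warmed up on the first three frames; parameters not stated in the text (e.g., shadow handling) are assumptions.

```python
import cv2

def train_background_model(frames):
    """Build the adaptive Gaussian mixture background model [11] from the first
    three frames i_0..i_2 of a sequence (OpenCV's MOG2 implementation)."""
    bg_model = cv2.createBackgroundSubtractorMOG2(detectShadows=False)
    for frame in frames[:3]:
        bg_model.apply(frame)  # default (automatic) learning rate during warm-up
    return bg_model
```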
Motion Detection via Background Subtraction. The method presented here aims at learning models for the bona fide and attack cases. We assume that the amount, intensity and location of movements in a room differ according to whether one or two people are in it. In doing so, we interpret pixels in the foreground mask as movements. For this reason, objects that are carried by people are not included in the feature vector if they do not move. In our dataset, each recorded situation is an image sequence consisting of 21 pictures collected over 3–4 s, which we denote as \({i_0,i_1,\dotsc ,i_{20}}\). The first three frames in a sequence are not used for feature extraction as they are needed for training the change detection algorithm. For calculating the movement models, we use the frame instances \({i_3,i_4,\dotsc ,i_{20}}\) over time. Thereby, we are able to determine the changes detected in image CD from the background model to the next time instance, denoted as \({CD_{i_3},CD_{i_4},\dotsc ,CD_{i_{20}}}\). We use a Gaussian mixture model based approach [11] where the decision that a pixel belongs to the background is made if:
$$p(x^{\rightarrow (t)} \mid X, BG) > c_{thr} \qquad (2)$$
where \(c_{thr}\) is a threshold value and the value of a pixel at time t in RGB is denoted by \(x^{\rightarrow (t)}\). We refer to \(p(x^{\rightarrow }|BG)\) as the background model. The background model is estimated from a training set \({i_0,i_1,i_2}\). The estimated model is denoted by \(p(x^{\rightarrow }|X, BG)\) and depends on the training set, as denoted explicitly. We calculate the background model using the first three frames and set the learning rate to 0.001 afterwards. By doing so, the model is updated over roughly 1000 frames, which is slow enough to detect all changes between background and foreground in the following 18 frames and fast enough to capture changes in, e.g., the illumination conditions. We apply change detection frame by frame, calculating individual grayscale foreground masks (see Fig. 4) for each instance. Another property that we want to capture is the amount of movement. We do so by accumulating the individual foreground masks into a single result image dst. Equation 3 scales each foreground mask by dividing each pixel of the single-channel mask \({CD_{i}}(x,y)\), ranging from 0 to 255, by 255 and weighting it with a factor \(\sigma \):
$$dst(x,y) = \sum _{i=3}^{20} \sigma \cdot \frac{CD_{i}(x,y)}{255} \qquad (3)$$
In our experiments we achieved the best results with a \(\sigma \) of 30.
By doing so, pixels that were recognized as foreground multiple times get a higher value than pixels that were foreground only for a short time. Therefore, micro movements become visible in the resulting image dst (see Fig. 4).
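A sketch of this accumulation step (Eq. 3), assuming the background model from the previous sketch; \(\sigma = 30\) follows the text, while the clipping to the 8-bit range is an assumption.

```python
import numpy as np

SIGMA = 30.0  # weighting factor that gave the best results in our experiments

def accumulate_motion(bg_model, frames, learning_rate=0.001):
    """Apply change detection to frames i_3..i_20 and accumulate the scaled
    foreground masks into a single result image dst (Eq. 3)."""
    dst = np.zeros(frames[0].shape[:2], dtype=np.float32)
    for frame in frames[3:]:
        fg_mask = bg_model.apply(frame, learningRate=learning_rate)  # grayscale mask, 0..255
        dst += SIGMA * (fg_mask.astype(np.float32) / 255.0)
    return np.clip(dst, 0, 255).astype(np.uint8)
```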
For the classification task we decided to train a convolutional neural network classifier. As we expect that our relatively small dataset could cause the network to overfit, we decided to augment the data as follows. First we mirror each dst image horizontally and vertically, then we rotate the images clockwise 179 times in steps of \(2^\circ \). To 10% of the data we additionally add Gaussian noise in order to improve the generalization of the network. We balance both classes of our training data by skipping some rotation steps for the attack image class. After data augmentation we obtained 33,124 images of the bona fide class and 41,041 images of the attack class.
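A sketch of this augmentation procedure; the noise parameters and the use of OpenCV for mirroring and rotation are assumptions.

```python
import cv2
import numpy as np

def augment(dst_image, noise_fraction=0.1, noise_sigma=5.0):
    """Mirror horizontally and vertically, rotate clockwise 179 times in 2-degree
    steps, and add Gaussian noise to roughly 10% of the generated images."""
    h, w = dst_image.shape[:2]
    variants = [dst_image, cv2.flip(dst_image, 1), cv2.flip(dst_image, 0)]
    augmented = []
    for img in variants:
        for step in range(1, 180):  # 179 clockwise rotations of 2 degrees each
            rot = cv2.getRotationMatrix2D((w / 2, h / 2), -2.0 * step, 1.0)
            rotated = cv2.warpAffine(img, rot, (w, h))
            if np.random.rand() < noise_fraction:
                noise = np.random.normal(0.0, noise_sigma, rotated.shape)
                rotated = np.clip(rotated + noise, 0, 255).astype(np.uint8)
            augmented.append(rotated)
    return augmented
```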
Learning a Binary Classifier. The following method is architecturally inspired by GoogLeNet [12], which uses an architecture that is quite different from a traditional CNN design such as the LeNet-5 model. We use inception modules, which perform multiple convolution operations and max pooling in parallel. Therefore, it is not necessary to commit to a single convolution kernel size for a given layer. This approach is not only effective in terms of classification results but also computationally efficient. The reason for the computational gain is the \(1 \times 1\) convolution operation applied before every \(3 \times 3\) or \(5 \times 5\) convolution of the inception module, which results in dimensionality reduction.
The proposed architecture (see Fig. 5) takes a grayscale image of \(256 \times 256\) pixels as input. To avoid internal covariate shift, batch normalization is used for all convolutional and fully connected layers [13]. All weights are initialized with a normal distribution, using a standard deviation of 0.1 and zero mean. A batch size of 300 is used with equal representation of both classes. The training process ran for 200 epochs for each fold of the dataset, and each epoch contained 3 batches. As both classes are mutually exclusive, a softmax classifier is used with a cross-entropy loss function. For all fully connected layers a dropout of 0.5 is implemented. For stochastic optimization, an Adam optimizer is used with a learning rate of 0.01. Except for the final layer, all layers including those inside the inception modules use ReLU activation. Sigmoid activation is used for the final layer.
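A minimal Keras sketch of an inception-style module and classification head following the hyper-parameters given above (batch normalization, ReLU, dropout 0.5, Adam with learning rate 0.01, softmax with cross-entropy); the filter counts and the number of modules are assumptions, since the exact configuration is only given in Fig. 5.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def conv_bn(x, filters, kernel_size):
    """Convolution followed by batch normalization and ReLU activation."""
    x = layers.Conv2D(filters, kernel_size, padding="same",
                      kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.1))(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def inception_module(x, f1, f3, f5, fp):
    """Parallel 1x1, 1x1->3x3, 1x1->5x5 and pool->1x1 branches, concatenated."""
    b1 = conv_bn(x, f1, 1)
    b3 = conv_bn(conv_bn(x, f3 // 2, 1), f3, 3)   # 1x1 reduction before the 3x3 convolution
    b5 = conv_bn(conv_bn(x, f5 // 2, 1), f5, 5)   # 1x1 reduction before the 5x5 convolution
    bp = conv_bn(layers.MaxPooling2D(3, strides=1, padding="same")(x), fp, 1)
    return layers.Concatenate()([b1, b3, b5, bp])

inputs = layers.Input(shape=(256, 256, 1))            # grayscale dst image
x = conv_bn(inputs, 32, 7)
x = layers.MaxPooling2D(3, strides=2, padding="same")(x)
x = inception_module(x, 32, 64, 16, 16)
x = inception_module(x, 64, 96, 32, 32)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.5)(layers.Dense(128, activation="relu")(x))
outputs = layers.Dense(2, activation="softmax")(x)    # bona fide vs. attack

model = models.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss="categorical_crossentropy", metrics=["accuracy"])
```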
4 Experiments and Results
We performed individual experiments in order to ensure that our assumptions about both approaches are correct. The experiments are evaluated based on APCER (Footnote 1) and BPCER (Footnote 2). As the imaging dataset was collected under conditions where the users did not follow any constraints regarding their position, we cannot report separate results for marked and unmarked positions. However, it can be assumed that the results for marked positions would rather improve. Due to the small amount of data available, the evaluation is performed using a leave-one-out approach. To make sure that no augmented data from the held-out recording appears in the training set, we omit the complete shot. Using only the image-based approach (#4), we achieved an APCER of 1.93% at a BPCER of 3.80%. We observed misclassifications especially in cases where a second person was hiding on the floor. However, it must be said that the attackers had no knowledge about the algorithm used and therefore could not adapt their behavior specifically.
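One way to implement this leave-one-out protocol is to treat each recording (shot) as a group, so that all augmented images derived from the held-out shot are excluded from training; the scikit-learn helper below is an illustrative assumption, not the original evaluation code.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def leave_one_shot_out_splits(features, labels, shot_ids):
    """Train/test splits in which one complete recording (shot) is held out,
    including every augmented image derived from it."""
    logo = LeaveOneGroupOut()
    return logo.split(features, labels, groups=np.asarray(shot_ids))
```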
Experimenting with the number of sensors that may be activated in the sensing grid, we found that a limit of \(2 \times 4\) sensors represents a good compromise between flexibility towards large feet and security. In the case of marked positions on the ground, we could not detect any successful attacks in these experiments. However, there were cases in which surrounding sensors were unintentionally activated, which led to an increased BPCER. The approach without marked positions on the floor was circumvented in particular when feet were arranged diagonally to the grid. Since, unlike the comparative study, we did not use machine learning, a single threshold for sensor activation proved to be too weak. Due to the good results in detecting attacks, we conducted a decision-level fusion to combine both approaches. Through this procedure, we were able to successfully detect all attacks, but a BPCER of 7.1% must be accepted (Table 1).
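The decision-level fusion can be summarized as a logical conjunction of both subsystem decisions, which explains why all attacks are caught at the cost of a higher BPCER; the following sketch with assumed function names illustrates this.

```python
def fused_decision(grid_accepts: bool, image_accepts: bool) -> bool:
    """Decision-level fusion: accept the presentation only if both subsystems
    accept it; equivalently, an attack is flagged if either subsystem flags one."""
    return grid_accepts and image_accepts

# Example: the image classifier accepts, but the sensor grid detects extra feet.
assert fused_decision(grid_accepts=False, image_accepts=True) is False
```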
5 Conclusion
We presented a novel approach for identifying attacks on an autonomous access control system. We detect piggybacking attacks, in which an attacker tries to pass through the system together with an authorized person, by using both image-based and floor-mounted sensors. Our evaluation validated the layout and performance of the proposed interactive sensor grid, which, combined with the image-based method, recognized all piggybacking attacks. A limitation of the system is the requirement that the user stand on a position marked on the ground, since performance otherwise decreases strongly. Neither of the presented methods requires high computing power, so both can be used on single-board computers.
Notes
1. APCER: Proportion of attack presentations using the same PAI species incorrectly classified as bona fide presentations in a specific scenario.
2. BPCER: Proportion of bona fide presentations incorrectly classified as presentation attacks in a specific scenario.
References
Siegmund, D., Fu, B., Samartzidis, T., Wainakh, A., Kuijper, A., Braun, A.: Attack detection in an autonomous entrance system using optical flow. In: 7th International Conference on Crime Detection and Prevention (ICDP 2016), pp. 1–6. IET (2016)
Siegmund, D., Handtke, D., Kaehm, O.: Verifying isolation in a mantrap portal via thermal imaging. In: 2016 International Conference on Systems, Signals and Image Processing (IWSSIP), pp. 1–4, May 2016
Siegmund, D., Wainakh, A., Braun, A.: Verification of single-person access in a mantrap portal using RGB-D images. In: XII Workshop de Visao Computacional (WVC), November 2016
Chan, T.W., Yap, V.V., Soh, C.S.: Embedded based tailgating/piggybacking detection security system. In: 2012 IEEE Colloquium on Humanities, Science and Engineering (CHUSER), pp. 277–282. IEEE (2012)
Rauter, M.: Reliable human detection and tracking in top-view depth images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 529–534 (2013)
Siegmund, D., Dev, S., Fu, B., Scheller, D., Braun, A.: A look at feet: recognizing tailgating via capacitive sensing. In: Streitz, N., Konomi, S. (eds.) DAPI 2018. LNCS, vol. 10922, pp. 139–151. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-91131-1_11
Braun, A., Heggen, H., Wichert, R.: CapFloor – a flexible capacitive indoor localization system. In: Chessa, S., Knauth, S. (eds.) EvAAL 2011. CCIS, vol. 309, pp. 26–35. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33533-4_3
Braun, A., Wichert, R., Kuijper, A., Fellner, D.W.: Capacitive proximity sensing in smart environments. J. Ambient Intell. Smart Environ. 7(4), 483–510 (2015)
Perš, J., Sulić, V., Kristan, M., Perše, M., Polanec, K., Kovačič, S.: Histograms of optical flow for efficient representation of body motion. Pattern Recogn. Lett. 31(11), 1369–1376 (2010)
Baxter, L.K.: Capacitive Sensors: Design and Applications. Wiley, Hoboken (1996)
Zivkovic, Z., et al.: Improved adaptive Gaussian mixture model for background subtraction. In: ICPR (2), pp. 28–31. Citeseer (2004)
Szegedy, C., et al.: Going deeper with convolutions. In: Computer Vision and Pattern Recognition (CVPR) (2015)
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)