1 Introduction

Since many wearable devices store highly sensitive user information such as health data, a secure and usable authentication mechanism that prevents access by unauthorized users is paramount. A straightforward solution is entry-point authentication relying on personal identification numbers (PINs), passwords or graphical patterns [18]. However, frequent use of entry-point authentication potentially disrupts user activities [2, 4]. Moreover, compared to smartphones, unlocking patterns on wearable devices such as Google Glass are more vulnerable to shoulder-surfing [14, 19], since the Glass touchpad is easily observable from a distance.

An alternative is to use an implicit and continuous authentication system, which runs in the background without disrupting the user and authenticates the user whenever he/she performs a designated action, which in our case is using the touchpad. The system only triggers entry-point authentication if an intrusion is detected. Provided the method is reliable, this approach reduces the number of times a legitimate user needs to undergo entry-point authentication. Many continuous authentication schemes have previously been proposed in the literature for smartphones [7, 9, 10, 19]; however, they may not provide similar accuracy, or may be computationally heavy, on wearables such as Glass. The smaller touchpad of Glass, compared to that of a smartphone, is likely to show less variation in gestures across different users, thereby reducing accuracy. Also, running computationally expensive applications can deplete the battery faster on Glass [11].

These factors motivate a feasibility study of continuous authentication on wearables. Towards this goal, in this paper, we assess the feasibility of continuous authentication on Glass. Our key contributions are as follows. First, to the best of our knowledge, we are the first to study the feasibility of touch gesture based continuous authentication on smart glasses in terms of classification accuracy and computational cost by using Google Glass as a use case. Although Glass itself may or may not be continued as a product, our work is still relevant since our scheme can be extended to other smart glasses with touchpads namely RECON, SiME, GlassUP, ORA-S and Icis, as well as other touchpad devices, e.g., smartphones. Second, we model a touch gesture as one or more forces applied on the touchpad by the user’s finger over the duration of the gesture. A resulting novel feature is the downward force feature which is a product of pressure and size values extracted from the device’s touch event.

Third, to authenticate the user, besides using support vector machine (SVM) with Gaussian radial basis function (RBF) classifier (widely used for continuous authentication on smartphones), we introduce a new classifier based on Chebyshev’s concentration inequality. Previous research on touch gesture based continuous authentication on smartphones has shown that during testing (authentication), instead of using features from a single sample of a gesture, using features from a block of samples of the gesture shows improved classification accuracy [7, 10, 15]. We note that this observation implicitly uses the assumption that the average value of a feature over a block is more likely to be concentrated around the mean. The justification of this comes from concentration inequalities, which give probabilistic bounds on the deviation of the average of identically distributed random variables from their true mean. This led us to propose the Chebyshev classifier. Lastly, by extending our experiments to smartphone touch data, we find that the size of the touchpad has an effect on classification accuracy; smaller touchpads, as in smart glasses, exhibit less variation across users.

2 Related Work

Entry Point Authentication: Zheng et al. [20] collected the tapping behaviour of 80 different users when entering PINs on smartphones and extracted four features (acceleration, pressure, size, and time) from the collected data, achieving a 3.65 % equal error rate. Shahzad et al. [16] created a system named GEAT. However, unlike the scheme proposed by Zheng et al., GEAT differentiates the user on the basis of their sliding behaviour, uses unique features such as finger velocity, device acceleration, and slide time, and achieves a 0.5 % equal error rate. Similarly, Luca et al. [6] exploit user sliding behaviour while unlocking smartphone patterns, achieving an accuracy of around 50 %. In comparison to these works, our study focuses on continuous authentication.

Continuous Authentication: Numerous schemes [3, 7, 10, 19] have been proposed for continuous authentication on smartphones. Hui et al. [19] collected data from 31 volunteers for different touch operations such as keystroke, slide, pinch and handwriting to test their continuous authentication scheme and showed that the slide gesture is the best at classifying users, while handwriting performs the worst. Similarly, Frank et al. [7] proposed a scheme using a set of 30 touch-based features and tested it on 41 users. Their classifier achieved a median equal error rate of 0 % within the same usage session and 2–3 % across different sessions. The reason why these two schemes achieve exceptionally high authentication accuracy might be that users were static and were given specific tasks to perform. In comparison, we did not enforce any such restriction on the users. Li et al. tested a continuous authentication scheme based on sliding and tap gestures [10] and extracted features such as the position and area of first touch, duration and average curvature of slide. SilentSense [3] used finger movements and user motion patterns and achieved 99 % accuracy. In contrast to our study, the temporal effect of user behaviour on accuracy is not studied in the last two schemes. A more recent work from Mondal and Bours [12] uses a trust-based approach for continuous authentication: instead of waiting for a fixed number of gestures from the user before making a decision, the system updates its trust value (that the current user is the target user) with every gesture and locks the user out when the trust value falls below a pre-defined threshold. This approach can be applied to any continuous authentication mechanism, including ours.

A somewhat related topic is the recently introduced sensor-enhanced keystroke dynamics [8], which augments traditional timing-based keystroke dynamics with motion sensors available on smart devices. Not only does this approach increase the accuracy of traditional keystroke dynamics and gesture-based authentication [8], it has also been shown to be more resistant to statistical attacks using general population statistics [17].

Overall our work differs from previous works in three major ways: (1) we assess the feasibility of touch gesture based continuous authentication on smart glasses, which, like Google Glass, present unique challenges such as a smaller form factor and lower computational power compared to smartphones; (2) we propose a new classifier based on concentration inequalities; and (3) we propose new force-based features.

3 Background

The Google Glass: Google Glass (cf. Fig. 1a) has an optical display mounted in front of the lens, which contains a small screen (cf. Fig. 1b). The user can navigate using voice commands or by interacting with the touchpad located on the side of the device through taps or swipes (forward, backward or downward). Swipes can be performed with one, two or three fingers. Note that not all apps (cards) and their menu items can be operated by voice; some require a touch gesture.

Fig. 1. The Google Glass (images courtesy of Wikipedia and Google).

Definitions: For the rest of this paper, a gesture is defined as a tap or a swipe with one finger on the touchpad. For each gesture, the set of data recorded by the Glass touchpad, e.g., the point of contact, is called a sample. A sample contains a time-ordered sequence of one or more readings, which correspond to data recorded at different discrete time intervals during the duration of a gesture. Each reading contains data corresponding to one or more variables called features. The authentication mechanism takes as input a set of gestures and either (implicitly) accepts or rejects the user depending on whether or not the set matches the gestures of the target user. True positive rate (TPR) is defined as the fraction of times the target user is correctly accepted. False positive rate (FPR) is defined as the fraction of times the attacker is (wrongly) accepted as the target user. Equal error rate (EER) is defined as the rate at which both acceptance and rejection errors are equal, i.e., when \(1 - \text {TPR} = \text {FPR}\). EER is widely used as a measure of classification accuracy. A related measure is the average error rate (AER), which is defined as \(\frac{1}{2}(1 - \text {TPR} + \text {FPR})\) and is useful when EER is unknown. Receiver operating characteristic (ROC) curve shows the trend of TPR against FPR. Variability in these rates is introduced by changing different parameter values of the authentication system.
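For concreteness, the sketch below (ours, not part of the original system; the function names are illustrative) shows how AER is computed for a single operating point and how EER can be approximated from sampled (FPR, TPR) points of an ROC curve.

```python
# Illustrative sketch (not the authors' code): AER for one operating point and
# an approximation of EER from sampled (FPR, TPR) points of an ROC curve.

def average_error_rate(tpr, fpr):
    """AER = (1 - TPR + FPR) / 2, useful when the exact EER point is unknown."""
    return 0.5 * (1.0 - tpr + fpr)

def approximate_eer(roc_points):
    """roc_points: iterable of (fpr, tpr) pairs obtained by varying a system
    parameter. Returns the error rate at the point where TPR is closest to
    1 - FPR, i.e., where both error types are (approximately) equal."""
    fpr, tpr = min(roc_points, key=lambda p: abs((1.0 - p[1]) - p[0]))
    return 0.5 * ((1.0 - tpr) + fpr)  # both terms coincide at the true EER point
```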

4 Continuous Authentication for Google Glass

4.1 Architecture

The proposed system architecture, as shown in Fig. 2, has a training and a testing phase. The system listens for gesture events that are triggered whenever the user performs gestures on the touchpad. Once an event is triggered, elementary features such as the start and end point of gestures are extracted. From the start and end points, the gesture type (tap, forward, backward or downward swipe) is identified, after which higher-level features, e.g., force exerted on the touchpad, are derived. Some of the features in our system are derived as a function of time and require further processing for consistent inter-comparison. After going through this post-processor, our system feeds the resulting features to the classifier. During training, the classifier generates different classification models for different gesture combinations. During the testing phase, real-time gesture data from the current user is processed to obtain the feature sets as above, which are then fed to the classifier for prediction.

Fig. 2. System architecture.

Table 1. Total number of samples, average and minimum sample size per user, and average gap (in seconds) for gestures obtained in our user study.

4.2 Data Collection

We collected data for four gestures: tap, forward swipe, backward swipe, and downward swipe from the Glass touchpad (v 18.1, Android) using a background process, which reads the raw touch data values at runtime. Glass is equipped with the Synaptics T1320 touchpad. More technical details, such as the structure of touch packets, are given in the full paper. We selected 30 volunteers, consisting of 8 females and 22 males within the 18–45 age bracket, and asked them to use Google Glass for a few hours. All were colleagues and students with a computer science background. They were free to explore Glass as they liked and to use any application installed on the device. Each user was shown how to operate Glass prior to data collection. Table 1 shows the quantity of gesture data collected from the users. The forward swipe is the most frequently used gesture, followed by the tap, with the downward swipe being the least frequent. Backward swipes can be used in place of forward swipes to navigate in the opposite direction, explaining their relatively lower usage. Moreover, downward swipes are mostly used for quitting an app or cancelling an action, hence their frequency is the lowest.

4.3 Gesture Model and Feature Extraction

We model the touchpad as a rectangle \(\mathcal {R}\) on a two dimensional xy-plane, where the origin is the bottom-left corner. We distinguish between two types of gestures, tap (\(\mathsf {T}\)) and swipe. Swipe is further divided into forward (\(\mathsf {F}\)), backward (\(\mathsf {B}\)) and downward (\(\mathsf {D}\)). We model each gesture as one or more forces (exerted by user’s finger) acting over the course of a gesture. Our main assumption is that the magnitude and source of these forces over the time duration of the gesture are characteristics of a user.

Modelling the Tap Gesture: The tap is characterised by the downward force applied by the finger on the touchpad. This force, denoted \(\mathbf {F}_z\), acts downwards on \(\mathcal {R}\), i.e., along the z-axis. The source is the point on \(\mathcal {R}\) where the user taps. This is shown in Fig. 3a. The magnitude of \(\mathbf {F}_z\) is calculated using pressure P and area (size) A readings from the touch event as \(F_z = PA\). Note that our hypothesis is that it is the correlation between the pressure and area values that is expected to be consistent across samples, instead of treating the two separately, as is done in [6] for instance. As the tap is performed over a time interval, say \(\varDelta t\), we denote the magnitude of \(\mathbf {F}_z\) over time as \(F_z(t)\), which is a time series. Figure 3b visualises the possible shape of \(F_z\) over the duration of tap. \(F_z(t)\) can be calculated over discrete points t in the interval \(\varDelta t\) through corresponding pressure and area values. We also use tap duration (\(\varDelta t\)) as a feature.
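A minimal sketch of this feature extraction (our own illustration; the layout of a raw reading as a (timestamp, pressure, size) tuple is an assumption) is:

```python
# Illustrative sketch: deriving the downward force series F_z(t) = P * A and the
# tap duration from one sample. Each reading is assumed to be a
# (timestamp, pressure, size) tuple; this layout is our assumption.

def tap_features(readings):
    """readings: time-ordered list of (timestamp, pressure, size) tuples
    recorded during a single tap. Returns (force_series, duration)."""
    t0 = readings[0][0]
    force_series = [(t - t0, p * a) for (t, p, a) in readings]  # F_z(t) = P * A
    duration = readings[-1][0] - t0                              # tap duration (delta t)
    return force_series, duration
```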

Fig. 3. Force based gesture models: (a) the tap force, (b) the magnitude of the force curve \(F_z(t)\) over the interval \(\varDelta t\), (c) the two forces active during a swipe, and (d) the source of the force \(\mathbf {F}_{xy}\) estimated through the angle \(\theta \).

Modelling the Swipe Gesture: We model a swipe as two forces acting on \(\mathcal {R}\) simultaneously. The first is \(\mathbf {F}_z\), the force acting downwards on \(\mathcal {R}\), as in the case of the tap. The second, denoted \(\mathbf {F}_{xy}\), is a force acting along the direction of the swipe (in the xy-plane). These two forces are visualized in Fig. 3c. To estimate the source of \(\mathbf {F}_z\), we use the start point \((x_0, y_0)\) and the end point \((x_1, y_1)\) of the swipe. The source of the force \(\mathbf {F}_{xy}\) is estimated as the angle \(\theta \) between the straight line joining these two points and the y-axis, as shown in Fig. 3d. To estimate the duration of the forces, in addition to the swipe duration \(\varDelta t\), we also include the swipe length l. The magnitude of \(\mathbf {F}_z\) is again estimated as the time series \(F_z(t)\) of individual pressure and area (PA) values. The magnitude of \(\mathbf {F}_{xy}\) is also modelled as a time series \(F_{xy}(t)\), with the difference that the individual values are the magnitude of the velocity at discrete time intervals. This is done since, in classical mechanics, force is proportional to acceleration, which can be determined from the change in velocity. Table 2 summarizes the list of features.
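The swipe features can be sketched analogously (again our own illustration, assuming each reading is a (timestamp, x, y, pressure, size) tuple):

```python
import math

# Illustrative sketch: swipe features. Each reading is assumed to be a
# (timestamp, x, y, pressure, size) tuple; this layout is our assumption.

def swipe_features(readings):
    (t0, x0, y0, _, _), (t1, x1, y1, _, _) = readings[0], readings[-1]
    length = math.hypot(x1 - x0, y1 - y0)                    # swipe length l
    angle = math.atan2(x1 - x0, y1 - y0)                     # angle theta to the y-axis
    duration = t1 - t0                                        # swipe duration (delta t)
    fz = [(t - t0, p * a) for (t, _, _, p, a) in readings]    # F_z(t) = P * A
    fxy = []                                                  # |velocity| as proxy for F_xy(t)
    for (ta, xa, ya, _, _), (tb, xb, yb, _, _) in zip(readings, readings[1:]):
        dt = (tb - ta) or 1e-9                                # guard against equal timestamps
        fxy.append((tb - t0, math.hypot(xb - xa, yb - ya) / dt))
    return angle, length, duration, fz, fxy
```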

Table 2. List of features.

Post-processing the Time Series: The time series for the magnitude of force (\(F_{z}(t)\) and \(F_{xy}(t)\)) can be misaligned due to the non-uniform sampling rate of the device and differences in the duration of the gesture. To get a consistent comparison of time series from different readings, we do the following: (a) we align the first sample of the two time series at time \(t = 0\); (b) we resample each time series at intervals of \(t_{\mathsf {int}} = 0.01\) s (slightly lower than the system average of \(\approx 0.012\) s), similar to the approach used in [16]; (c) we use a cut-off point \(t_{\mathsf {off}} = 0.3\) s, after which all values are discarded. Most time series span an interval \(\varDelta t\) that is less than \(t_{\mathsf {off}}\). For such cases, all values at time \(\varDelta t< t < t_{\mathsf {off}}\) are mapped to 0.
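As a minimal sketch, the resampling step could be implemented with linear interpolation as below (the interpolation method itself is our assumption; the paper only specifies the sampling interval and cut-off):

```python
import numpy as np

# Illustrative sketch of the time-series post-processing: align at t = 0,
# resample at t_int = 0.01 s, cut off at t_off = 0.3 s, and map values beyond
# a gesture's duration to 0. Linear interpolation is our own choice.

T_INT, T_OFF = 0.01, 0.3

def resample(series):
    """series: list of (t, value) pairs with t relative to the gesture start."""
    t = np.array([p[0] for p in series])
    v = np.array([p[1] for p in series])
    grid = np.arange(0.0, T_OFF, T_INT)      # 30 sample points: 0.00, 0.01, ..., 0.29
    out = np.interp(grid, t, v)               # linear interpolation on the grid
    out[grid > t[-1]] = 0.0                   # zero-fill past the gesture duration
    return out                                 # fixed-length vector of 30 values
```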

4.4 Chebyshev Classifier

Many researchers have indicated that using a block of samples for testing improves performance over using individual samples [7, 10, 15], where the average reading of the feature over the block is used as a single instance for input to the classifier. We note that if a sample block is to be used, a classifier based on concentration inequalities can be employed. A concentration inequality bounds the probability that a random variable deviates from its expected value. The deviation from the expected value decreases (probabilistically) with an increase in the block size of identically distributed random variables. We thus propose a one-class classifier based on the concentration inequality known as Chebyshev's inequality. The use of this inequality is not unprecedented: it has been employed, in a somewhat different manner, for anomaly and outlier detection [1]. A further advantage of Chebyshev's inequality is that it does not make any assumptions about the probability distribution of the data (which may be unimodal or multimodal).

Let X be a random variable representing a unitary feature, i.e., any feature other than a time-series based feature. Let \(\mathbf {x} = (x_1,\ldots , x_n)\) denote n samples of this unitary feature. The corresponding random variables are denoted \(X_1, \ldots , X_n\). We assume that these random variables are independent and identically distributed (i.i.d.), since they correspond to different samples (of the same gesture type). Let \(\text {E}[X] = \mu _X\) and \(\text {Var}[X] = \sigma _X^2\) denote the expected value (mean) and variance of X, respectively. Then for any \(\tau > 0\), \(\Pr \left[ \left| X - \text {E}[X] \right| \ge \tau \right] \le \frac{\text {Var}[X]}{\tau ^2} \Rightarrow \Pr \left[ \left| X - \mu _X \right| \ge \tau \right] \le \frac{\sigma _X^2}{\tau ^2}\) is known as Chebyshev’s inequality [13, Sect. 8, p. 431]. Consider the random variable \(\overline{S}_n = \frac{1}{n}\sum _{i = 1}^n X_i\). Since the \(X_i\)’s are i.i.d., we have \(\text {E}[\overline{S}_n] = \frac{1}{n}\sum _{i = 1}^n \text {E}[X_i] = \frac{n}{n}\mu _X = \mu _X\), and \(\text {Var}[\overline{S}_n] = \text {Var} \left[ \frac{1}{n}\sum _{i = 1}^n X_i \right] = \frac{1}{n^2}\text {Var} \left[ \sum _{i = 1}^n X_i \right] = \frac{1}{n^2}\sum _{i = 1}^n \text {Var}[X_i] = \frac{n}{n^2}\sigma _X^2 = \frac{\sigma _X^2}{n}\). Using Chebyshev’s inequality on \(\overline{S}_n\) and the above two results, we get

$$\begin{aligned} \Pr \left[ \left| \overline{S}_n - \text {E}[\overline{S}_n] \right| \ge \tau \right] \le \frac{\text {Var}[\overline{S}_n]}{\tau ^2} \Rightarrow \Pr \left[ \left| \frac{1}{n} \sum _{i = 1}^n X_i - \mu _X \right| \ge \tau \right] \le \frac{\sigma _X^2}{n\tau ^2} \end{aligned}$$
(1)

for any \(\tau > 0\). A qualitative explanation of this inequality is that as n increases, the average of a sample is more likely to be concentrated around the mean. Now, let \(\rho = \frac{\sigma _X^2}{n\tau ^2}\). Rearranging, we get \(\tau = \frac{\sigma _X}{\sqrt{n\rho }}\). By specifying a value of \(\rho \) in this equation, i.e., a bound on the probability, we can obtain a corresponding threshold \(\tau \). This gives us a straightforward classification method for features: given a sample \((x'_1, x'_2, \ldots , x'_n)\), purported to be generated from the same distribution as X, we calculate the sample mean and check whether it lies within \(\tau \) of \(\mu _X\), where \(\tau \) is determined by \(\rho \). If yes, then the sample is classified as belonging to the target user; otherwise it is rejected. Similarly, for a time-series based feature we can use this classifier with a slight modification, as detailed in the full version of the paper. Thus, given an n-element sample \(\mathbf {x} = (x_1, x_2, \ldots , x_n)\) and the parameter \(\rho \), we have the Chebyshev feature classifier \(f(\mathbf {x}, \rho )\), which outputs 1 if the sample belongs to the target user and 0 otherwise. To make an overall decision given samples from a set of m features \(\chi = \{\mathbf {x}_1, \ldots , \mathbf {x}_m\}\), we have the following classifier, which we call the Chebyshev classifier:

$$\begin{aligned} g(\chi , \rho , \epsilon ) = {\left\{ \begin{array}{ll} 1, \qquad \text {if } \sum _{i = 1}^m f(\mathbf {x}_i, \rho ) > \epsilon m \\ 0, \qquad \text {otherwise} \end{array}\right. } \end{aligned}$$
(2)

We call \(\epsilon \) the decision threshold and \(\epsilon m\) the decision boundary. Through our experiments we found \(\epsilon = \frac{2}{3}\) to give the best EER.
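A compact sketch of the Chebyshev feature classifier f and the overall classifier g of Eq. 2 (our own illustration; we assume each per-feature model stores the training mean and variance of that feature):

```python
import math

# Illustrative sketch of the Chebyshev classifier. Each per-feature model is
# assumed to hold the mean and variance of that unitary feature estimated
# from the target user's training data.

def chebyshev_feature(samples, mean, var, rho):
    """f(x, rho): accept (1) if the sample mean lies within tau of the training
    mean, where tau = sigma / sqrt(n * rho); otherwise reject (0)."""
    n = len(samples)
    tau = math.sqrt(var) / math.sqrt(n * rho)
    sample_mean = sum(samples) / n
    return 1 if abs(sample_mean - mean) < tau else 0

def chebyshev(feature_samples, models, rho, eps=2.0 / 3.0):
    """g(chi, rho, eps): accept the user if more than eps * m of the m
    individual feature classifiers accept. models[i] = (mean_i, var_i)."""
    m = len(models)
    votes = sum(chebyshev_feature(x, mu, var, rho)
                for x, (mu, var) in zip(feature_samples, models))
    return 1 if votes > eps * m else 0
```

For example, with the m = 13 features of \(\mathsf {T} + \mathsf {F}\) and \(\epsilon = \frac{2}{3}\), more than \(\epsilon m \approx 8.67\) feature classifiers, i.e., at least 9, must accept for the user to be accepted.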

4.5 SVM Classifier

Our second classifier is the binary-class SVM with Gaussian radial basis function (RBF) kernel. We used its implementation available through the LIBSVM library [5]. To construct the feature space for the SVM, we represented the time-series based features as \(\frac{t_{\mathsf {off}}}{t_\mathsf {int}} = 30\) dimensional vectors. The whole feature space of the SVM is then a vector of all unitary features and time-series based features represented in the aforementioned way. Constructed in this way, the SVM classifier is given training data. To obtain the best classification results, we performed a grid search with 10-fold cross validation on the training data to find the optimal values for its parameters, i.e., C and \(\gamma \) [5]. Notice that the training phase needs data both from the legitimate (target) user and from other users (represented as the second class). As this results in unbalanced data (more data from the second class), we used a weighted SVM scheme. After a user model has been created by the SVM, the authentication (testing) phase can be carried out. Let \(\chi \) be a set of samples of features to be tested against the user model, where we assume the sample size of each feature to be \(n \ge 1\). For each feature \(\mathbf {x} \in \chi \) with n samples denoted by \(\mathbf {x} = (x_1, \ldots , x_n)\), the average value \(\frac{1}{n}\sum _{i=1}^n x_i\) is used in the final feature vector.
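For illustration only, an equivalent setup in scikit-learn (the paper itself uses LIBSVM; the parameter grid and class-weighting choice below are our assumptions) might look like:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative sketch only: a weighted RBF-SVM with a grid search over C and
# gamma. X holds feature vectors (unitary features plus the 30-dimensional
# resampled time series); y is 1 for the target user and 0 for other users.

def train_svm(X, y):
    grid = {"C": [2 ** k for k in range(-5, 16, 2)],
            "gamma": [2 ** k for k in range(-15, 4, 2)]}     # assumed search ranges
    svm = SVC(kernel="rbf", class_weight="balanced")         # weighting for unbalanced data
    search = GridSearchCV(svm, grid, cv=10)                  # 10-fold cross validation
    search.fit(X, y)
    return search.best_estimator_

def authenticate(model, block):
    """block: n test samples of the full feature vector; the per-feature
    averages form the single vector given to the trained SVM."""
    return int(model.predict(np.mean(block, axis=0, keepdims=True))[0])
```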

5 Evaluation and Results

5.1 Experimental Setup

To evaluate the performance of the Chebyshev classifier, we consider three sets of users denoted by \(U_1\), \(U_2\), and \(U_3\), containing 10, 20 and 30 users, respectively. For all user sets, our experimental setup is as follows. To obtain the true positive rate (TPR), we randomly select a target user and use a random set of 50 samples from this user as the training set. The test set used for authentication consists of the remaining samples. Given a fixed value of n, a random sample of length n is obtained from the test set. This random test sample is then fed to the classifier, which was trained using the training data, and the decision from the classifier is logged. This process was repeated 500 times, each time with a new random target user. Note that, due to randomness, the training set for the same user differs across trials. Finally, the number of times the target user was accepted out of the 500 tests was used to compute the TPR.

The false positive rate (FPR) is calculated in the same manner as the TPR, except that the classifier is given a test sample of size n drawn from all the samples of a random attacker selected from \(U_1\) (respectively \(U_2\) and \(U_3\)), excluding the target user. The FPR was calculated as the rate at which the attacker was accepted. The size of the training set for tap and forward swipe was 50, whereas backward swipe and downward swipe had training set sizes of 25 and 10, respectively, since fewer samples were available for these gestures (see Table 1).
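The following sketch condenses this evaluation loop (our reconstruction; `users` is assumed to map a user identifier to that user's list of samples, and `train_classifier` to return a trained accept/reject function):

```python
import random

# Illustrative reconstruction of the TPR/FPR estimation: 500 trials, each with
# a fresh random target user, a random training set of fixed size, and a random
# test block of n samples from the target (TPR) or from an attacker (FPR).

def estimate_rates(users, train_classifier, n, train_size=50, trials=500):
    accepted_target = accepted_attacker = 0
    user_ids = list(users)
    for _ in range(trials):
        target = random.choice(user_ids)
        samples = list(users[target])
        random.shuffle(samples)
        train, test = samples[:train_size], samples[train_size:]
        clf = train_classifier(train)                          # returns block -> 0/1
        accepted_target += clf(random.sample(test, n))         # genuine attempt
        attacker = random.choice([u for u in user_ids if u != target])
        accepted_attacker += clf(random.sample(list(users[attacker]), n))  # impostor
    return accepted_target / trials, accepted_attacker / trials  # (TPR, FPR)
```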

For the SVM classifier we divided the pool of 30 users into three disjoint sets. The first set, labelled \(U_1\), consists of 10 target users for whom we had at least 75 samples for all gesture types and is fixed. The remaining 20 users are modelled as attackers and are assigned to two sets labelled \(U_2\) (10 attackers) and \(U_3\) (20 attackers). For each user in \(U_1\), the training data consists of a random sample of a fixed size from the user’s data. This constitutes positive samples for the target user required for binary class SVM training. The negative samples for the target user came from the data of the remaining 9 target users in \(U_1\). That is, the data from the remaining 9 users was used in the training phase to model the mock attacker. The data of the users from \(U_2\) and \(U_3\) is used to compute FPR.

5.2 Chebyshev Classifier Results

First, we empirically determined the decision threshold \(\epsilon \) in Eq. 2. For this, we used the user set \(U_1\), and chose tap and forward swipes as gestures. Since tap and forward swipes have a total of \(m = 13\) features (cf. Table 2), \(\epsilon m\) ranges from 6 (majority decision) to 12 (unanimous decision). We construct a ROC curve for each of these cases. As n increases we observe that majority decision does not produce the best result. Figure 4a shows the ROC curves when \(n = 15\). The different values of FPR and TPR are obtained by varying the probability parameter \(\rho \) in the Chebyshev classifier from 1.00 to 0.1 with steps of 0.05. The dashed line in the figure is the line with \(\text {TPR} = 1 - \text {FPR}\), which meets the ROC curve at the EER value.

Fig. 4. ROC curves - Chebyshev classifier.

We can see no significant improvement beyond \(\epsilon m = 9\). Since \(\epsilon m = 9\) implies \(\epsilon \approx 0.69\), we use the nearest approximation \(\epsilon = \frac{2}{3}\) and the decision boundary \(\lceil \epsilon m \rceil \) for the Chebyshev classifier in Eq. 2. This corresponds to the two-thirds majority rule. Table 3 shows the decision boundaries for the various combinations of gestures used in our evaluation, obtained by choosing \(\epsilon = \frac{2}{3}\).

Table 3. The decision boundaries corresponding to the decision threshold \(\epsilon = \frac{2}{3}\) for different combinations of gestures for the Chebyshev classifier.

Next, we studied the impact of n on the EER. Figure 4b shows the EER for the combination \(\mathsf {T} + \mathsf {F}\) against different values of n with the user set \(U_1\) (notice that there are n taps and n forward swipes in each test sample). The ROC curves show improvement as n increases, starting with an EER of about 30 % for \(n = 1\) and reaching around 3 % for \(n = 25\). This trend of improving EER with increasing n holds for all gesture combinations and all user sets, \(U_1\), \(U_2\) and \(U_3\), as shown in Table 4. Note that for a gesture combination containing multiple gestures, e.g., \(\mathsf {T} + \mathsf {F}\), authentication can trigger as soon as a minimum of n samples has been collected for each gesture. From Table 4, we observe that the tap as a standalone gesture performs worse in terms of EER than the swipes. The EERs of the forward and backward swipes are comparable, with forward swipes narrowly ahead. The downward swipe performs worse than the other two swipe types, potentially due to fewer data points being available for training. The EER deteriorates by 3 to 4 percentage points when using the data sets \(U_2\) (20 users) and \(U_3\) (30 users) as compared to data set \(U_1\) (10 users). However, we do not see a noticeable deterioration in EER when comparing data sets \(U_2\) and \(U_3\), which suggests that adding more users does not substantially degrade the accuracy of the system. Our most important gesture combination is \(\mathsf {T} + \mathsf {F}\), since the bulk of activities on Glass can be performed by a combination of these two gestures. With \(n = 10\) taps and forward swipes each, the EER is less than 10 %.

Table 4. EER for different gesture combinations and n - Chebyshev classifier (Glass).

Finally we also looked at the relationship of EER with \(\rho \), and found that for a given n and gesture combination a fixed value of \(\rho \) can be used which appears independent of the size of the user set. Details are in the full version of the paper.

5.3 SVM Classification Results

The accuracy of the SVM classifier, as measured by the average error rate (AER), is shown in Table 5. The classification accuracy is varied against two parameters: the training size |T| and the testing size n, for each gesture combination listed in the table. The training set size was varied from 25 to 75 at intervals of 25. Note that the AER for all gesture combinations decreases with increasing training size, since a larger training set gives the classification algorithm more information for accurate prediction. However, this may also lead to overfitting, which is indeed the case for the downward swipe with a training set of size 75. The AER of the SVM classifier also improves with an increasing number of test samples, i.e., n. The tap gesture performs the worst among all the individual gestures and the forward swipe outperforms all other gestures, which is consistent with the observation from the Chebyshev classifier. As observed with the Chebyshev classifier earlier, the AER does not significantly deteriorate with more users in the system (\(U_3\) against \(U_2\)).

Table 5. AER for different gesture combinations and n - SVM classifier (Glass).

5.4 Distinguishing Features

To determine if individual features have distinguishing capabilities, we use the Chebyshev feature classifier f on user set \(U_2\) to obtain true positive (TP) and false positive (FP) frequencies for the features of all four gestures as shown in Fig. 5. The x-axis shows 31 features (4 for tap plus 9 each for forward, backward and downward swipes). The TP frequencies are above 400 (out of 500) for all gesture types except the downward swipe (last nine features in the figure), which is most likely due to its small training set size, i.e., 10. Nevertheless, observe that the FP frequencies are lower than the corresponding TP frequencies for all features. We therefore included all features for classification as each can effectively distinguish between users. For more details of the setup and exact frequencies, see the full version of the paper.

Fig. 5. TP & FP frequencies obtained via the Chebyshev feature classifier for all features.

5.5 Comparison of the Two Classifiers

To compare the two classifiers in terms of classification accuracy, we use the EER readings from the Chebyshev classifier based on the set of 20 users, i.e., the set \(U_2\) shown in Table 4, and the AER readings from the SVM based on a training set of size 50 from Table 5. We first consider \(n = 10\) for the purpose of our comparison. Looking at Tables 4 and 5, we can see that compared to the SVM classifier, Chebyshev's error rate is lower for taps, forward swipes and backward swipes. For all other combinations the two classifiers have similar error rates. For other values of n, we observe that the SVM classifier performs slightly better when \(n = 1\), but the Chebyshev classifier's performance rapidly improves with increasing n, outperforming the SVM in the three aforementioned gesture types. For combinations of gestures, the performance of the two is very similar. These findings suggest that in terms of accuracy both classifiers are effective on Glass and hence can be used on similar wearables.

To compare the computational overhead of the two classifiers, we evaluated the time taken by model generation and prediction. Both these components are illustrated in Fig. 2. We first implemented both components of the two classifiers on a desktop computer. The SVM classifier was implemented in Java (via LIBSVM), whereas we used Python to implement the Chebyshev classifier. The results of the model generation and prediction time are shown in Table 6.

Table 6. Model generation and prediction time (ms) for gestures on a PC.

Not surprisingly, for both classifiers model generation takes longer than prediction. For both model generation and prediction, the Chebyshev classifier is many orders of magnitude faster than the SVM. This suggests that using the SVM for training on Glass can be computationally expensive in terms of power and heat generation. However, three important points need to be considered here. First, high model generation time is not inherent to SVMs; in fact, it is due to the use of the RBF kernel, and a linear SVM is likely to yield a much lower model generation time. Second, we do not consider the high model generation time a drawback of the SVM classifier, as (a) model generation is done infrequently, and (b) model generation can be outsourced to the Cloud (depending on connectivity). Lastly, a smaller grid search, i.e., restricting the ranges of the parameters C and \(\gamma \), may result in a faster model generation time, at the possible expense of accuracy. Alternatively, although the optimum ranges of these SVM parameters depend on user data, it may be possible to experimentally determine whether the optimum values lie within narrow ranges for touch-based gestures. Nevertheless, our focus was more on accuracy than speed.

We therefore chose to implement only the predictor component of the SVM on Glass to check the actual performance. The classification models were generated offline on a desktop computer and loaded onto the Glass. For the Chebyshev classifier, on the other hand, we implemented both the model generator and the predictor on Glass. The results from our experiment are shown in Table 7. As can be seen, Chebyshev is faster than SVM in terms of prediction time and needs little time for model generation on Glass. Having said that, the prediction time for SVM is also small enough to be practical. In terms of space requirements, both classifiers require storing gesture data, which is in the order of a few kilobytes. For the model, the Chebyshev classifier needs to store the means, variances and covariances for all features, whereas the SVM classifier needs to store the support vectors. The model size also increases with gesture combinations, typically ranging from 15 KB for a simple tap to 400 KB for all gestures. In any case, Glass has 8 GB of storage capacity, and the total space required by the classifiers is only in the order of a few megabytes. The main advantage of using the Chebyshev classifier, in our opinion, is its ease of implementation, as it requires only standard functions and therefore no external libraries.

Table 7. Model generation and prediction time (ms) for different gestures on Glass.

5.6 Generalization: Results on Smartphone Data

To test the generalizability of our proposed system on smartphones, we used publicly available smartphone gesture data collected by the authors of [19]. The data consists of 120 taps, and 20 forward, backward and downward swipes each, for 31 users. We chose 30 of the 31 users for our study. We fixed a training size of 50 for taps and 10 for all swipe gestures. The rest of the data was used as the testing set. The other details of the experimental setup remain the same as in Sect. 5.1. The results of applying Chebyshev and SVM to the smartphone data are shown in Tables 8 and 9, respectively.

The trends observed for both classifiers on the smartphone data are similar to those on the Glass data. We observe that the accuracy of the system increases with increasing testing size, i.e., n. The system is able to achieve an accuracy of 98 %–99 % for \(n \ge 7\) with all four gestures combined. We also observed two marked differences in classifier accuracy between the smartphone data and the Glass data. First, the accuracy of the system on all the swipe gestures is better on the smartphone than on Glass. However, this might be due to the fact that the total number of swipe gestures was smaller, i.e., 20, in the smartphone data. Second, the accuracy of the system is less impacted by an increasing number of users on the smartphone than on Glass. A plausible explanation for both differences is the difference in touchpad size between the two devices: a bigger touchpad allows for more variation in the gesture patterns. It would be interesting to investigate whether other gesture-based authentication mechanisms proposed for smartphones exhibit a similar trend on smart glasses.

Table 8. EER for different gesture combinations and n - Chebyshev classifier (phone).
Table 9. AER for different gesture combinations and n - SVM classifier (phone).

5.7 Effect of Behavioural Evolution on Classification Accuracy

As the gesture behaviour of users may change over time, we studied its evolution through an extended study on three users, asking them to use Glass for five days over two weeks. The five days were spaced as: day 1, 2, 3, 7 and 14. We used a fixed training size of 20. To test the permanence of a user's gesture model, we experimented with the following three scenarios related to how the training model was generated. (a) Same Day: This scenario serves as the benchmark. The testing data is matched against training data collected on the same day. (b) First Day: In this scenario, each user model is generated using data from day one. The model is then tested against data collected on subsequent days, for instance, day seven against day one. (c) Adaptive: In this scenario, the user model is updated every day by iteratively replacing a fixed number of samples in the training data of previous days with random samples from the data of the current day (see the sketch below). For example, to create the training data for day three, we randomly replaced 8 samples from the training data of day one with 4 samples from day two and 4 samples from day three.
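A minimal sketch of the adaptive update in scenario (c) (our own illustration; the number of replaced samples is a parameter):

```python
import random

# Illustrative sketch of the adaptive scenario: each new day, a fixed number of
# older training samples is replaced by random samples from the new day's data,
# after which the classifier is re-trained on the updated training set.

def adaptive_update(training_set, new_day_samples, replace=4):
    keep = list(training_set)
    for idx in sorted(random.sample(range(len(keep)), replace), reverse=True):
        del keep[idx]                                   # drop `replace` old samples
    keep += random.sample(new_day_samples, replace)     # add `replace` new-day samples
    return keep                                         # same size as the original set
```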

For the Chebyshev classifier, in each simulation run we use one random user as the target user and the remaining two as the attack users. In the case of the SVM classifier, each of the three users is taken as a target user. The training data for the target user consists of a random sample of a fixed size from the target user's data; this constitutes the positive samples required for SVM training. The negative samples for the target user come from the data of the remaining two target users. The attackers' data come from a fixed set of three users who did not participate in the evolution study and whose data was collected for earlier experiments. The results are shown in Fig. 6 for both classifiers. As expected, the same day scenario achieves the highest accuracy among all scenarios for a given day. We can also observe that the accuracy of the first day scenario is the worst, suggesting that touch biometrics are not entirely stable over time; hence, an adaptive approach should be considered to maintain accuracy. Using the adaptive approach in our experiments clearly shows performance improvements over the first day scenario, especially for the Chebyshev classifier. Note that replacing older samples with newer ones means that the classifiers need to be re-trained. For the Chebyshev classifier, this is not an issue, since re-training takes around 1 s at worst (cf. Table 7). For the SVM, training takes longer, but this is not a substantial hurdle for the reasons discussed in Sect. 5.5.

Fig. 6. The evolution of EER (Chebyshev classifier) and AER (SVM classifier). Legend: same day training data; adaptive training data; first day training data.

6 Some Limitations and Discussion

We did not consider the effect of user posture, e.g., walking versus sitting, on touch gestures. Although this difference may not be as pronounced as in the case of smartphones, since Glass is mounted on the user's head and is relatively stable, it needs to be determined experimentally. Since the focus of our research has been touch gesture based continuous authentication, we have not considered voice characteristics (as mentioned before, the user can also perform certain operations on Glass through voice commands) or readings from other sensors such as the accelerometer and gyroscope. Our continuous authentication system can be augmented by including distinguishing features from voice or other sensors. Also, as is the case for any behavioural biometric system, it is important to test our system on a larger population to validate its accuracy, which we were unable to do due to limited resources.

Since the Chebyshev classifier is based on a concentration inequality, it would be interesting to employ other concentration inequalities, such as Hoeffding's or Bernstein's inequality, and compare the results. As a classifier's performance also depends on the features being used, it would likewise be interesting to expand on the feature model introduced in this paper. For instance, one may model the swipe as an interaction between the two forces (downward and planar), instead of taking the two forces separately; a resulting feature could be a three-dimensional magnitude of force over time.

7 Conclusion

Due to the smaller touchpad size and relatively meagre hardware resources (CPU, battery) of current smart glasses compared to modern smartphones, it is not straightforward to assume that gesture based implicit authentication systems proposed for smartphones would yield high classification accuracy and low computational load on smart glasses such as Google Glass. The results of our study indicate that gesture based continuous authentication is feasible on Glass, both computationally and in terms of accuracy. Among the other contributions of our work is a new classifier based on Chebyshev's concentration inequality, which can be added to the set of classifiers used in the field of implicit authentication. Our secondary contributions include modelling touch gestures in a new way, from which we extract new features such as the downward force (as measured by pressure and area readings) and the planar force (as measured by velocity readings) as functions of time, and the finding that classification accuracy depends on the size of the touchpad.